Open Directory Project
I'm an editor
for the Open Directory Project and also
fool around with the "RDF" dumps
from time to time. The ODP dumps have a history of being rather dirty
files containing a wide assortment of errors and non-standard usage of XML
and RDF syntax.
I've recently started working with ODP to improve the quality of the
data dumps. This has resulted in the code on this page as well as a
lengthy Data Dump ToDo
List, on which I try to keep track of the various bugs and
improvement requests related to the ODP dumps.
Another source of information that ODP data dump experimenters may
find useful is Amir Salihefendic's
data dump tag
documentation.
The programs below are intended for developers and probably won't
be very useful to you if your goal is to put ODP data on your
website. If you're looking for a complete solution for using ODP
data on your website, I recommend having a look at the
minimoz
package written by Andreas Steinmetz. Minimoz parses ODP data,
stores it in an SQL database, and generates an ODP-like
template-based website using the data. It's written in C and is
Free Software licensed under the GNU GPL. Another option is
phpODPWorld.
It's written in PHP, is licensed under the GNU GPL, and can display
selected categories or all of ODP using any SQL database supported
by the PHP PEAR module.
In addition to the software on this page, I also try to maintain
a listing of UTF-8 and XML
errors in the currently available ODP data dump. You can often find
news updates related to ODP/dmoz data dump issues at the ODP Weblog.
dumpcheck
dumpcheck is a C program that will scan the UTF-8 encoded XML data
dump files exported by ODP and report the location of invalid UTF-8
sequences, illegal XML characters, illegal Unicode characters, and XML
well-formedness errors. This helps ODP staff to find and correct
problems in the data dump generation scripts. For invalid characters,
it reports the line number, byte offset, approximate ODP category ID,
and hex values of the offending XML and UTF-8 characters. For XML
well-formedness errors, it reports the line number (thanks to
libxml2!). While this program is
intended to assist with debugging of the ODP
data dumps, it may be useful for other tasks as well.
dumpcheck-1.11.tar.gz [17Kb]
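To give a feel for the kind of check described above, here is a minimal Python sketch that walks a byte string and reports invalid UTF-8 sequences and characters that fall outside the XML 1.0 Char range, along with the line number and byte offset. It is not dumpcheck's actual code: dumpcheck is written in C, also tracks ODP category IDs, and relies on libxml2 for well-formedness checks, all of which are left out here. The function names (is_xml_char, seq_len, scan) are made up for this illustration.

```python
def is_xml_char(cp):
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def seq_len(lead):
    """Expected UTF-8 sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, overlong lead (C0/C1), or out-of-range lead


def scan(data):
    """Return (line, byte_offset, kind, hex) tuples for each problem in data (bytes).

    Lines are counted from 1 and byte offsets from 0.
    """
    problems = []
    line, pos = 1, 0
    while pos < len(data):
        length = seq_len(data[pos])
        chunk = data[pos:pos + length]
        try:
            if length == 0:
                raise ValueError("byte cannot start a UTF-8 sequence")
            # Strict decoding also rejects overlong forms, surrogates,
            # and sequences truncated at the end of the input.
            cp = ord(chunk.decode("utf-8"))
        except ValueError:
            problems.append((line, pos, "invalid UTF-8", data[pos:pos + 1].hex()))
            pos += 1  # resynchronise one byte at a time
            continue
        if not is_xml_char(cp):
            problems.append((line, pos, "illegal XML character", chunk.hex()))
        if cp == 0x0A:
            line += 1
        pos += length
    return problems


if __name__ == "__main__":
    sample = b"<a>ok</a>\n<b>\xff</b>\n<c>\x01</c>\n"
    for line, offset, kind, hexval in scan(sample):
        print(f"line {line}, byte {offset}: {kind} (0x{hexval})")
```

A real dump is far too large to hold in memory, so a practical version would read the file in chunks and carry any incomplete trailing sequence over to the next chunk.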
odp2db
odp2db is a collection of Perl programs that can be used
to parse the ODP data dumps and insert the data into an SQL database.
Both the structure.rdf.u8 and content.rdf.u8 files are parsed. A
minimal table structure is included that is suitable for loading the
database but probably not useful for any real work. The
XML::Parser
and DBI
Perl modules are required. I developed this for use with PostgreSQL
but have tried to stick with standard ANSI SQL as much as possible,
so it should work with MySQL and anything else supported by DBI with
only minimal changes.
odp2db-1.2.tar.gz [17Kb]
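The general parse-and-insert approach can be sketched in a few lines. The example below is only an illustration and is not taken from odp2db, which is written in Perl: it uses Python's ElementTree and an in-memory SQLite database as stand-ins. The element names, namespaces, and the tiny embedded sample are assumptions about what a content.rdf.u8 fragment looks like, and the page table and load function are hypothetical.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Assumed namespaces for the ODP dump format; adjust if the real files differ.
ODP = "{http://dmoz.org/rdf/}"
DC = "{http://purl.org/dc/elements/1.0/}"

# A tiny hand-written sample standing in for content.rdf.u8.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Computers">
    <catid>4</catid>
    <link r:resource="http://example.com/"/>
  </Topic>
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <d:Description>An example site.</d:Description>
    <topic>Top/Computers</topic>
  </ExternalPage>
</RDF>
"""


def load(stream, conn):
    """Stream-parse an RDF dump and insert one row per ExternalPage element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)"
    )
    # iterparse yields each element once its end tag is seen, so the whole
    # document never has to be held in memory at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == ODP + "ExternalPage":
            conn.execute(
                "INSERT INTO page VALUES (?, ?, ?, ?)",
                (
                    elem.get("about"),
                    elem.findtext(DC + "Title"),
                    elem.findtext(DC + "Description"),
                    elem.findtext(ODP + "topic"),
                ),
            )
            elem.clear()  # release the children we no longer need
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(io.BytesIO(SAMPLE), conn)
    for row in conn.execute("SELECT url, title, topic FROM page"):
        print(row)
```

Parameterised INSERT statements keep the SQL portable across databases, which is the same reason for sticking to standard ANSI SQL in the table definitions.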
License: All software on this page is
Free Software
licensed under the
GNU GPL.