Open Directory Project
I'm an editor for the Open Directory Project and also fool around with the "RDF" dumps from time to time. The ODP dumps have a history of being rather dirty files containing a wide assortment of errors and non-standard usage of XML and RDF syntax.
I've recently started working with ODP to improve the quality of the data dumps. This has resulted in the code on this page, as well as a lengthy Data Dump ToDo List on which I try to keep track of the various bugs and improvement requests related to the ODP dumps.
Another source of information that ODP data dump experimenters may find useful is Amir Salihefendic's data dump tag documentation.
The programs below are intended for developers and probably won't be very useful to you if your goal is to put ODP data on your website. If you're looking for a complete solution for using ODP data on your website, I recommend having a look at the minimoz package written by Andreas Steinmetz. Minimoz parses ODP data, stores it in an SQL database, and generates an ODP-like template-based website using the data. It's written in C and is Free Software licensed under the GNU GPL. Another option is phpODPWorld. It's written in PHP, is licensed under the GNU GPL, and can display selected categories or all of ODP using any SQL database supported by the PHP PEAR module.
In addition to the software on this page, I also try to maintain a listing of UTF-8 and XML errors in the currently available ODP data dump. You can often find news updates related to ODP/dmoz data dump issues at the ODP Weblog.
dumpcheck is a C program that scans the UTF-8 encoded XML data dump files exported by ODP and reports the location of invalid UTF-8 sequences, illegal XML characters, illegal Unicode characters, and XML well-formedness errors. This helps ODP staff find and correct problems in the data dump generation scripts. For character-level errors it reports the line number, byte offset, approximate ODP category ID, and hex values of the offending XML and UTF-8 characters. For XML well-formedness errors it reports the line number (thanks to libxml2!). While this program is intended to assist with debugging the ODP data dumps, it may be useful for other tasks as well.
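To illustrate the kind of character-level check described above, here is a minimal Python sketch. It is not the dumpcheck source (which is C), and the function names decode_one, is_xml_char, and scan are made up for this example. It assumes the XML 1.0 definition of a legal character, lumps surrogates and overlong encodings in with invalid UTF-8, and leaves out category IDs and well-formedness checking entirely.

```python
def decode_one(data, i):
    """Decode the UTF-8 sequence starting at data[i].

    Returns (code_point, length), or (None, 1) if the bytes at that
    position are not a valid UTF-8 sequence.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, 1
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        # Stray continuation byte or a byte that can never start a sequence.
        return None, 1
    for k in range(1, need + 1):
        if i + k >= len(data) or (data[i + k] & 0xC0) != 0x80:
            return None, 1  # truncated sequence or bad continuation byte
        cp = (cp << 6) | (data[i + k] & 0x3F)
    # Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None, 1
    return cp, need + 1


def is_xml_char(cp):
    """True if cp matches the Char production of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def scan(data):
    """Return a list of (line, byte_offset, kind, hex_value) tuples."""
    problems = []
    line, i = 1, 0
    while i < len(data):
        cp, length = decode_one(data, i)
        if cp is None:
            problems.append((line, i, "invalid UTF-8", f"{data[i]:02X}"))
        elif not is_xml_char(cp):
            problems.append((line, i, "illegal XML character", f"U+{cp:04X}"))
        if cp == 0x0A:
            line += 1
        i += length
    return problems
```

Calling scan() on the bytes of a dump file (for example, the result of open(path, "rb").read()) yields one tuple per offending byte or character. A real tool would read the file in chunks rather than all at once, since the dumps are large.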
odp2db is a collection of Perl programs that can be used to parse the ODP data dumps and insert the data into an SQL database. Both the structure.rdf.u8 and content.rdf.u8 files are parsed. A minimal table structure is included that is suitable for loading the database but probably not useful for any real work. The XML::Parser and DBI Perl modules are required. I developed this for use with PostgreSQL but have tried to stick with standard ANSI SQL as much as possible, so it should work with MySQL and anything else supported by DBI with only very minimal changes.
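The parse-and-load approach can be sketched in a few lines. The example below is not odp2db: it uses Python with the standard xml.etree.ElementTree and sqlite3 modules in place of Perl, XML::Parser, and DBI. The sample record, the element names (ExternalPage, Title, Description, topic), and the single page table are simplified assumptions about the content dump, not a description of the real schema.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# A tiny stand-in for content.rdf.u8; the layout here is an assumption.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
<ExternalPage about="http://example.org/">
  <d:Title>Example</d:Title>
  <d:Description>An example site.</d:Description>
  <topic>Top/Computers</topic>
</ExternalPage>
</RDF>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def load(stream, conn):
    """Stream-parse the dump and insert one row per ExternalPage element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)"
    )
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        # Collect the text of each child element, keyed by local tag name.
        fields = {local(child.tag): (child.text or "") for child in elem}
        conn.execute(
            "INSERT INTO page VALUES (?, ?, ?, ?)",
            (elem.get("about"), fields.get("Title"),
             fields.get("Description"), fields.get("topic")),
        )
        count += 1
        elem.clear()  # drop the children so memory use stays modest
    conn.commit()
    return count
```

For example, load(io.BytesIO(SAMPLE), sqlite3.connect(":memory:")) loads the sample record into an in-memory database. A strict XML parser like this one stops at the first well-formedness error, so a dirty dump would need to be cleaned up before loading.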
License: All software on this page is Free Software licensed under the GNU GPL.