ODP/dmoz Data Dump ToDo List

This list is a compilation of known bugs and requested improvements in the Open Directory Project's data export dump format. Most of the items on this list have been reported or requested by multiple people over the last few years - for the most part, all I've done is collect them into one place in the hopes of speeding up the process of improving the data dump files.

Update [13 June 2005] - ODP has set up a bugzilla for handling ODP-related bugs. I've transferred all the outstanding bugs and feature requests from this list into the ODP bugzilla. Unfortunately, ODP has chosen to keep their bug database secret from their users and the general public, so I will continue to maintain this list as well. If you're an ODP editor, you can monitor the status of all ODP data dump bugs using this link: [All ODP Bugzilla Data Dump Bugs]. Otherwise, scroll down for my list of the major bugs and feature requests.

If you've got a bug or feature request for the data dump files that isn't listed here, email me and I'll add it.

Bug: Illegal UTF-8 encoding

Description: Data dump files are intended to use UTF-8 encoding but frequently contain illegal UTF-8 byte sequences. These UTF-8 errors can cause problems with programs designed to read UTF-8 files. The primary source of UTF-8 errors seems to be invalid input provided by editors.

Status: Actively being worked on. Autumn has been working on UTF-8 validation code for the editor input forms. sfromis has been manually deleting any reported illegal UTF-8 sequences from the ODP database. I've created a C program that will process data dumps and report details about the errors found, which should assist in locating and fixing them. No illegal UTF-8 sequences were present in data dumps between March and July of 2003. After completion of the server hardware upgrade, however, the proliferation of UTF-8 errors returned.

After a lengthy period of work, we believe all input sources to the ODP database are now being properly checked for UTF-8 and XML character-level problems. All existing errors from the UTF-8 migration of the World cats have been fixed and we've had clean data dumps since July 30, 2004.
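
For readers who want to check a dump themselves, the kind of report described above can be approximated in a few lines. This is only a rough Python sketch, not the C checker mentioned in the status; the function name and the sample bytes are made up for illustration. It relies on Python's strict UTF-8 decoder, which rejects stray bytes, truncated sequences, and overlong encodings.

```python
import io


def find_invalid_utf8(stream):
    """Yield (line_number, byte_offset, bad_bytes) for each illegal UTF-8 sequence.

    `stream` is any iterable of byte strings, such as a file opened in "rb" mode.
    Splitting on newlines is safe because byte 0x0A never occurs inside a
    multi-byte UTF-8 sequence.
    """
    for lineno, line in enumerate(stream, start=1):
        pos = 0
        while pos < len(line):
            try:
                line[pos:].decode("utf-8")  # strict decoding raises on bad bytes
                break                       # the rest of the line is clean
            except UnicodeDecodeError as exc:
                # exc.start and exc.end are relative to the slice just decoded
                start, end = pos + exc.start, pos + exc.end
                yield lineno, start, line[start:end]
                pos = end                   # resume just past the bad bytes


# Hypothetical sample data; a real run would pass a dump file opened in binary mode.
sample = io.BytesIO(b"good line\nbad \xff byte\ncaf\xc3\xa9 ok\ntrunc \xc3\n")
for lineno, offset, bad in find_invalid_utf8(sample):
    print(f"line {lineno}, byte {offset}: {bad!r}")
```

Reporting a line number and byte offset for each error is a design choice that makes it easier to track a bad sequence back to the listing that contains it.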

Bug: Illegal XML characters [Fixed]

Description: Data dump files are intended to be legal XML files but frequently contain characters which are not legal in XML according to the W3C XML standards. These illegal XML characters will cause most common XML parsing libraries to abort processing, making it necessary to purge the bad characters from each data dump file prior to use. The primary source of illegal XML characters seems to be invalid input provided by editors.

Status: Actively being worked on. I provided a Perl regular expression to Autumn that can be used to filter illegal XML from the editor input forms. Autumn has modified the code to filter illegal input. sfromis has been manually deleting any reported illegal XML characters from the ODP database. I've created a C program that will process data dumps and report details about the errors found, which should assist in locating and fixing them. No illegal XML characters have been present in data dumps since mid February 2003!
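
To give an idea of what such a filter looks like, here is a minimal Python sketch. It is not the Perl expression given to Autumn; it is simply the complement of the Char production in the W3C XML 1.0 specification, and the function names are made up for the example. It assumes the input has already been decoded from UTF-8.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return a list of (index, code_point) for every character XML 1.0 forbids."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in ILLEGAL_XML_CHARS.finditer(text)
    ]


def strip_illegal_xml_chars(text):
    """Return text with every character XML 1.0 forbids removed."""
    return ILLEGAL_XML_CHARS.sub("", text)


# Hypothetical editor input containing a NUL and a vertical tab
dirty = "Robot\x00ics \x0bGroup"
print(find_illegal_xml_chars(dirty))
print(strip_illegal_xml_chars(dirty))
```

The same pattern can be used either to reject input at the form or to clean a dump after the fact; which is preferable depends on whether silent deletion is acceptable.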

Bug: XML is not well-formed and/or valid [Invalid bug]

Description: Data dump files are intended to be valid, well-formed XML according to the current W3C XML standard. There have been many reports that this is not the case.

Status: I have investigated several specific complaints from the forums, but all of those I've been able to track down turned out to be cases of bad UTF-8 encoding or illegal XML characters rather than XML well-formedness problems. I have modified the code in my ODP/dmoz dump checker to test for XML well-formedness. All recent dumps have passed the well-formedness testing except those containing UTF-8 problems. The dump files do not include a DTD reference so, by definition, they cannot be called "valid". But they are well-formed and comply with the requirements of the W3C XML 1.0 standards. An error report is now run on each new dump shortly after it is released.
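
A well-formedness test of this kind does not need a full RDF toolkit. The following is a minimal Python sketch using the expat bindings from the standard library; it is not the code from my dump checker, and the function name is hypothetical. Note that expat also stops on bad UTF-8 and illegal characters, which is consistent with the observation above that those problems show up as well-formedness failures.

```python
import xml.parsers.expat


def check_well_formed(data):
    """Return None if `data` (bytes) is well-formed XML.

    Otherwise return (line, column, message) for the first error expat reports.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        message = xml.parsers.expat.ErrorString(exc.code)
        return exc.lineno, exc.offset, message
    return None


# Hypothetical fragments: the second one never closes <d:Title>
print(check_well_formed(b"<ExternalPage><d:Title>DPRG</d:Title></ExternalPage>"))
print(check_well_formed(b"<ExternalPage><d:Title>DPRG</ExternalPage>"))
```

For a full dump it would be more practical to open the file in binary mode and call the parser's `ParseFile` method, which reads the input in chunks instead of holding it all in memory.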

Bug: RDF formatting is invalid

Description: Data dump files are intended to be a valid W3C RDF model described in the W3C RDF/XML Syntax Specification. They are not. The current file format does, in some ways, resemble RDF but is by no means close to being actual RDF. Strict RDF parsers cannot parse the ODP dump file format at all. Some RDF parsers include both a standard RDF parser and an ODP/dmoz file parser.

Status: A quick check of the RDF standards and the ODP dump will verify that this is a major, undeniable bug. There are historical reasons behind this problem. The ODP data dump format was designed at the same time that the RDF standards were being defined (in fact, it was designed by one of the architects of the RDF standard). However, the ODP format was completed and put into active use before the RDF standard was completed. When the RDF standard was released, it came as a bit of a surprise to ODP staff that it had changed beyond recognition from the format used in the ODP dump.

There are two possible solutions to this problem. We could simply call the ODP dump an XML format and be done with it. But that would leave a variety of unresolved problems with the format even if it solves the RDF compliance issue. The other option is to develop a valid RDF model that will work for ODP dump data. Such a model could solve a number of other bugs with the dump format and provide an ideal opportunity to greatly reduce the file size of the dumps.

ODP staff has indicated that the single most important requirement when considering changes to the format is that no problems are introduced for existing users of the dump. The best way to work around this would be to maintain both an old and new dump for a transition period. The new dump could take advantage of an optimized XML or RDF format, better compression, and other enhancements. Existing users could adopt the new format at their own pace and once most major users switched to the new format, the legacy dump could be dropped.

Switching to a new, valid RDF dump model would also be an ideal time to replace the current Perl based RDF generation script with a faster C-based generator to reduce the time needed to produce a dump (it's worth noting here that ODP staff has said the primary reason for the slowness of the current RDF generator is disk contention rather than the speed of the generator itself).

No current work is being done on this bug. No RDF validation is done at present.

It would be helpful in limiting the confusion caused by this bug if everyone would refer to "ODP data dumps" rather than "ODP RDF dumps". The use of the acronym RDF is incorrect and leads to false expectations on the part of our downstream users that the data files are actually RDF files.

Related forum threads:
Using the RDF thing

Bug: Markup does not indicate catid of duplicate sites in content dump

Description: The content dump often contains duplicate ExternalPage tags without specifying what category they belong in. (The RDF format provides several types of container mechanisms which would prevent this type of problem with duplicate sites if it were used.) Example:

<ExternalPage about="https://www.dprg.org/">
  <d:Title>DPRG</d:Title>
  <d:Description>Dallas Personal Robotics Group.</d:Description>
</ExternalPage>

and

<ExternalPage about="https://www.dprg.org/">
  <d:Title>DPRG - Dallas Personal Robotics Group Web Page</d:Title>
  <d:Description>The Dallas Personal Robotics Group is one the 
  nation's oldest special interest groups dedicated to the development
  and use of personal robotics and has been around since 1984.
  </d:Description>
</ExternalPage>

Both of these records describe the same site. One of them should be 
associated with cat 590682 and one with cat 4258. Both cats have a 
link tag that looks like this:

<link r:resource="https://www.dprg.org/"/>

Which ExternalPage record goes with which category?

In practice, one can work around this bug by making the assumption that any given ExternalPage goes with the Catid or Topic immediately preceding it. However, this assumption can only be used if one is processing the XML file as a linear stream. Once the data has been stored in a database, any relationships that existed merely because of the file organization are lost and there is no way to positively place duplicates in their original categories.
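
The linear-stream workaround can be sketched roughly as follows. This is a minimal Python illustration and not code from any ODP tool; the function names are made up for the example, and the namespace URIs in the sample are placeholders rather than the ones used in the real dump.

```python
import io
import xml.etree.ElementTree as ET


def local(name):
    """Strip any '{namespace}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def pages_with_category(stream):
    """Yield (topic, url, title), pairing each ExternalPage with the preceding Topic."""
    current_topic = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        tag = local(elem.tag)
        if event == "start" and tag == "Topic":
            # Attributes are already available when the start event fires
            current_topic = next(
                (v for k, v in elem.attrib.items() if local(k) == "id"), None
            )
        elif event == "end" and tag == "ExternalPage":
            title = None
            for child in elem:
                if local(child.tag) == "Title":
                    title = child.text
            yield current_topic, elem.get("about"), title
            elem.clear()  # drop the children we no longer need


# Hypothetical sample; the xmlns URIs are placeholders.
SAMPLE = b"""<RDF xmlns:r="http://example.org/r#" xmlns:d="http://example.org/d#">
<Topic r:id="Top/CategoryOne"><link r:resource="https://www.samplesite.com/"/></Topic>
<ExternalPage about="https://www.samplesite.com/">
  <d:Title>Sample Site instance 1</d:Title>
</ExternalPage>
<Topic r:id="Top/CategoryTwo"><link r:resource="https://www.samplesite.com/"/></Topic>
<ExternalPage about="https://www.samplesite.com/">
  <d:Title>Sample Site instance 2</d:Title>
</ExternalPage>
</RDF>"""

for topic, url, title in pages_with_category(io.BytesIO(SAMPLE)):
    print(topic, url, title)
```

The category is carried along only as parser state, which is exactly why the association disappears once the records are loaded into a database without it.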

Three possible solutions present themselves:

Solution 1: ExternalPage records should be nested within the Topic record they are associated with.

<Topic r:id="Top/CategoryOne">
  <link r:resource="https://www.samplesite.com/"/>
  <ExternalPage about="https://www.samplesite.com/">
    <d:Title>Sample Site instance 1</d:Title>
    <d:Description>Description of instance 1</d:Description>
  </ExternalPage>  
</Topic>

<Topic r:id="Top/CategoryTwo">
  <link r:resource="https://www.samplesite.com/"/>
  <ExternalPage about="https://www.samplesite.com/">
    <d:Title>Sample Site instance 2</d:Title>
    <d:Description>Description of instance 2</d:Description>
  </ExternalPage>  
</Topic>

There is now no question in which category each instance of the duplicate site resides.

Solution 2: Add a specific Topic or Catid tag inside of each ExternalPage record that indicates which category it belongs in.

<ExternalPage about="https://www.samplesite.com/">
  <Topic r:id="Top/CategoryOne"/>
  <d:Title>Sample Site instance 1</d:Title>
  <d:Description>Description of instance 1</d:Description>
</ExternalPage>

<ExternalPage about="https://www.samplesite.com/">
  <Topic r:id="Top/CategoryTwo"/>
  <d:Title>Sample Site instance 2</d:Title>
  <d:Description>Description of instance 2</d:Description>
</ExternalPage>

Solution 3: Convert the dumps to valid RDF, which provides containers that could be used to hold sites belonging to specific categories.

Status: Acknowledged and under discussion. When last reported, Autumn indicated she would look into a solution after the RDF generation problems were solved. [update] Autumn implemented solution #2 and updated the changelog. Most recent dump looks good. Marking as fixed! [update] The most recent data dump has reverted to the old format. [update] The fix mentioned in the changelog is back in place and the problem seems resolved again (thanks due to tschild for noticing!). [update 13 June 2005] sigh... Autumn's fix has vanished again and the bug is back.

Related forum threads:
[RDF Bug] Duplicate site handling
[rdf-content] duplicate sites

Bug: ODP data dump changelog dating is inconsistent [Fixed]

Description: The chronology of the RDF changelog is inconsistent. It begins at the top in 2000, then progresses backwards to 1999, then jumps forward to 2002. Traditionally, a changelog is in reverse chronological order with the most recent changes at the top. The out-of-sequence entries in the current changelog could cause casual readers to miss the most recent changes. (Also, the changelog indicates the ODP data dump is in RDF format, which is not correct.)

Status: Reported. No response from ODP staff yet. This is something that an interested editor could fix. Just grab the HTML, fix it, and email it to Autumn. Update: I emailed a corrected changelog to Autumn 15 May 2003 and it is now online.

Related forum threads:
[bug] broken links on https://rdf.dmoz.org/

Optimization Request: Dumps for subcats

Description: One of the most frequent feature requests for the data dumps has been to create more granular dumps. The full dump is about 1.5GB and is simply too large to be useful to anyone who isn't attempting to mirror the entire ODP database. Most users of the data use only one category (or a very small number of categories).

Status: Active work is being done. Until recently ODP staff have declined to offer this feature on esoteric technical grounds (e.g., symbolic links would reference categories that were not present in the subcat dump file). However, this objection is not relevant here because users face the same problem whether ODP provides subcat dumps or they download the entire dump and then split off the portion they want. I've written a script that automates the splitting of the complete dumps into subcat dumps for the major top-level categories. At present this is being done as a test and user feedback is being gathered to determine how to proceed. The test subcat dumps are available here: https://rodan.ncc.com/rdf/cats/

Update [23 May 2006] : ODP staff requested download stats for my subcat dump test and may be considering officially adopting some type of subcat dump in the future.

Bug: No freely available SQL import sample code [Fixed]

Description: Requests appear from time to time for some sample code that will allow the import of the ODP dump into an SQL database. Various proprietary products are available that claim to be able to do this but little is available for the Free Software community. What little Free Software or Open Source code exists for this purpose is listed below:

Nurey's ODP MySQL project - Nurey's code is several years old, large, and uses weird MySQL syntax which isn't really legal SQL (maybe this is normal for MySQL stuff?). This code was not intended as a sample of how to import the ODP dumps but may be of some use to those who are attempting it.
Use of ODP Data - in the strangely named ODP category, "/Use_of_ODP_Data/Upload Tools", you'll find not upload tools but a variety of tools that claim to allow importing, conversion, slicing, and other processing of the ODP dumps. Most are proprietary, shareware, or don't work. None of the tools provide a way to "upload" anything - probably because ODP supplies dumps for user download and doesn't accept uploads of any kind. :-)

Status: I've written some sample Perl code to import the ODP dump into an SQL database. It is Free Software licensed under the GNU GPL. It's mostly ANSI SQL and uses a minimal table structure. It was developed for use with PostgreSQL but should work with any SQL database supported by the Perl DBD API. The code isn't fast or elegant but it is small and relatively simple. Because the ODP dumps are not currently available in a valid RDF format, my code parses at the XML level to avoid the need to write a complicated ODP-RDF parser.
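
For readers who just want a feel for the approach, here is a minimal sketch in Python using the standard sqlite3 module. It is not the Perl code described above; the single-table layout, the function names, and the namespace URIs in the sample are all assumptions made for illustration. Like the Perl code, it parses at the XML level and stores the category alongside each page.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def local(name):
    """Strip any '{namespace}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def import_content(stream, conn):
    """Load ExternalPage records from a content dump into a 'pages' table.

    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " topic TEXT, url TEXT, title TEXT, description TEXT)"
    )
    topic = None
    rows = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        tag = local(elem.tag)
        if event == "start" and tag == "Topic":
            topic = next((v for k, v in elem.attrib.items() if local(k) == "id"), None)
        elif event == "end" and tag == "ExternalPage":
            fields = {local(c.tag): (c.text or "").strip() for c in elem}
            conn.execute(
                "INSERT INTO pages (topic, url, title, description)"
                " VALUES (?, ?, ?, ?)",
                (topic, elem.get("about"), fields.get("Title"), fields.get("Description")),
            )
            rows += 1
            elem.clear()  # free the parsed children
    conn.commit()
    return rows


# Hypothetical sample; the xmlns URIs are placeholders.
SAMPLE = b"""<RDF xmlns:r="http://example.org/r#" xmlns:d="http://example.org/d#">
<Topic r:id="Top/CategoryOne"><link r:resource="https://www.samplesite.com/"/></Topic>
<ExternalPage about="https://www.samplesite.com/">
  <d:Title>Sample Site instance 1</d:Title>
  <d:Description>Description of instance 1</d:Description>
</ExternalPage>
<Topic r:id="Top/CategoryTwo"><link r:resource="https://www.samplesite.com/"/></Topic>
<ExternalPage about="https://www.samplesite.com/">
  <d:Title>Sample Site instance 2</d:Title>
  <d:Description>Description of instance 2</d:Description>
</ExternalPage>
</RDF>"""

conn = sqlite3.connect(":memory:")
print(import_content(io.BytesIO(SAMPLE), conn), "rows imported")
```

Storing the topic in the same row as the page keeps duplicate sites distinguishable after import, at the cost of repeating the category string in every row.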

Related forum threads:
RDF dump -> SQL

Optimization Request: bzip2 instead of gzip compression

Description: At present ODP dumps are compressed using the GNU gzip compression utility, which was first released in 1992 to replace the Unix compress utility after Unisys claimed patent rights on the LZW compression algorithm, thus rendering it unavailable to the Free Software community. Most modern Unix systems now include a newer compression utility known as bzip2, which offers significantly better compression.

The work required to switch to this compression format is trivial (probably on the order of five minutes of work to edit the dump generation script - maybe 6 minutes if a Sun Solaris binary of bzip2 needs to be downloaded first) and the effort needed by downstream users of the data is also trivial.

Why make the switch?

1. A 25-30% reduction in file size (less disk space used at both ends).

2. Much faster download time for users

3. A measurable reduction in bandwidth and HTTP transfer usage by the dmoz.org server.

4. Fewer user complaints: gzip is used by some web servers to compress documents on the fly. Browser bugs and user error cause many problems for users who attempt to download the current gzip-compressed dump only to find that their browser is trying to decompress the file during the download (trust me, no one wants to load a 1GB data file in a browser window).
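
Anyone who wants to check the size difference on their own copy of the data can do so with a few lines of Python. This is just a sketch using the standard gzip and bz2 modules; the function name and the sample data are made up, and the actual savings will depend on the input, so the figures it prints for a small repetitive sample say nothing about a real dump.

```python
import bz2
import gzip


def compare_compression(data):
    """Return the raw, gzip, and bzip2 sizes in bytes for `data` (bytes)."""
    gz = gzip.compress(data, compresslevel=9)
    bz = bz2.compress(data, compresslevel=9)
    # Sanity check: both formats must round-trip losslessly
    assert gzip.decompress(gz) == data
    assert bz2.decompress(bz) == data
    return {"raw": len(data), "gzip": len(gz), "bzip2": len(bz)}


# Hypothetical stand-in for a slice of a dump; a real test would read
# a few hundred megabytes from an uncompressed dump file instead.
sample = b"<d:Title>x</d:Title>\n" * 1000
for name, size in compare_compression(sample).items():
    print(f"{name:6s} {size:8d} bytes")
```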

Status: Unknown. Requests to update the compression scheme have gone unanswered and unacknowledged for several years.

Optimization Request: Catid should be used in place of Topic

Description: At present the category name is identified in the content dump by use of a Topic tag. Use of the catid tag instead would have several positive effects. First, the file size of the content dump would be markedly smaller (400k cats times the size of a category title string). Second, it would simplify processing of the files. Most users of the dumps must parse them and load them into a database. Part of this process usually involves using the catid as an index. If the content records are identified by the Topic tag in the dumps, each Topic must be converted to its equivalent Catid during the database loading process. Third, the category title is not really part of the content anyway. It simplifies things to keep structure information in the structure dump and content information in the content dump as much as possible.

Additionally, in the structure file, the Topic tag is used multiple times when it would be more appropriate to use it only once, favoring the smaller Catid tag in other instances such as resource links.

Status: Unknown. This topic has been brought up at least once in the forums in the past but has never been acknowledged or addressed by ODP staff. Converting the dumps to a valid RDF format would probably take care of this issue along with several other markup optimization oddities.

Related forum threads:
RDF symbolic, related should use catid instead of Topic

Optimization Request: dump diffs

Description: Rather than downloading the entire ODP dump each week, it would be faster and more efficient for users to download only a diff that could be applied against their existing data. A new user would download the entire dump one time and then download a small diff containing adds, deletes, catmvs, etc.

Status: Unknown. The problem of diffs has been brought up many times over the last several years. ODP staff has alternately indicated it could not be done, could but wouldn't be done, and, at times, that it was going to be done. This is an ideal application for an interested editor or other outside party to write since it could be done using the existing dump. Doing it outside of ODP would allow us to keep the software GPL'd as well.
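
As a starting point for anyone tempted to try this, the core of a diff tool is small once two dumps have been parsed into memory. The following Python sketch is only an illustration under that assumption; the function names and the record layout, a dictionary keyed by (topic, url), are hypothetical. In this simple scheme a category move shows up as a delete plus an add.

```python
def diff_dumps(old, new):
    """Compare two dumps given as {(topic, url): (title, description)} dicts.

    Returns (adds, deletes, changes): new records, keys that disappeared,
    and records whose title or description changed.
    """
    adds = {k: new[k] for k in new.keys() - old.keys()}
    deletes = sorted(old.keys() - new.keys())
    changes = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return adds, deletes, changes


def apply_diff(old, adds, deletes, changes):
    """Return a copy of `old` brought up to date with a diff."""
    result = dict(old)
    for key in deletes:
        result.pop(key, None)
    result.update(adds)
    result.update(changes)
    return result


# Hypothetical records from two consecutive weekly dumps
last_week = {
    ("Top/A", "http://a.example/"): ("A", "desc"),
    ("Top/B", "http://b.example/"): ("B", "desc"),
}
this_week = {
    ("Top/A", "http://a.example/"): ("A", "new desc"),
    ("Top/C", "http://c.example/"): ("C", "desc"),
}
adds, deletes, changes = diff_dumps(last_week, this_week)
print(len(adds), "added,", len(deletes), "deleted,", len(changes), "changed")
```

A real tool would also need a serialized diff format and a way to handle dumps that do not fit in memory, neither of which is attempted here.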

Optimization Request: Live RSS, XML, or RDF feeds for individual cats or pages

Description: Providing an easy, fast, up-to-date method of retrieving data for a single category would help a lot of folks. It would also reduce the ODP server load by reducing the number of screen-scrapers. Using RSS could allow existing syndication software to subscribe to feeds of ODP categories and receive updated listings on a daily or hourly basis just like an RSS newsfeed.

Status: There are now two external options for RSS feeds.

1. xmlhub.com offers RSS feeds of ODP searches and ODP categories generated from the weekly ODP data dump:
https://www.xmlhub.com/odp_feed.php

2. ODP editor rpfuller has created a tool that lets users subscribe to an RSS feed of a selected category and, optionally, any subcategories beneath the selected category:
https://research.dmoz.org/~rpfuller/live/
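
To show how little is involved in producing such a feed from dump data, here is a minimal Python sketch that turns one category's listings into an RSS 2.0 document. It is not the code behind either of the services above; the function name and the category URL are hypothetical, and the listing is borrowed from the duplicate-site example earlier on this page.

```python
import xml.etree.ElementTree as ET


def category_to_rss(topic, category_url, pages):
    """Build an RSS 2.0 document for one category.

    `pages` is an iterable of (url, title, description) tuples.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel
    ET.SubElement(channel, "title").text = topic
    ET.SubElement(channel, "link").text = category_url
    ET.SubElement(channel, "description").text = "Listings in " + topic
    for url, title, description in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "description").text = description
    return ET.tostring(rss, encoding="unicode")


# The category URL below is a placeholder, not a real ODP address.
feed = category_to_rss(
    "Top/CategoryOne",
    "https://example.org/Top/CategoryOne/",
    [("https://www.dprg.org/", "DPRG", "Dallas Personal Robotics Group.")],
)
print(feed)
```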

Improvement Request: ODP Dump should have a free/open content license

Description: The current ODP Dump usage license includes restrictions that prevent it from qualifying as an Open/Free license. Specifically, the license to use the data is not permanent because it requires the user to continuously check ODP for changes and requires them to modify any derived works to incorporate those changes. If the derived work is unchangeable (for example, a CD-ROM or printed work), it's unclear if the license remains valid for more than the one week interval between dumps. The ideal solution would be to set up a dialog with the license experts at the Free Software Foundation to discuss how the license could be improved. Another possibility would be to adopt one of the widely used open content licenses such as the Open Publication License. An additional license problem that comes up frequently is the apparent need to include category-relative HTML links back to ODP in a derived work - this is obviously not possible in non-HTML works such as databases, PDF, plain text, or XML files (such as the dump itself).

Update [23 May 2006] : Unconfirmed posts to the ODP discussion forum indicate that ODP staff and the AOL legal team are creating a new revision of the ODP data use license.

Status: Unknown

Related URLs:
https://dmoz.org/license.html - The current ODP Data Use License
https://www.gnu.org/licenses/license-list.html - FSF GNU Non-Free license list

Task Status Color Codes
Verified Bug	Active Work	Fixed	Unknown