Heritrix

Heritrix is the Internet Archive's Web crawler. It is released under the GPL, but depends on many other packages whose licence terms may not be compatible with it. In many cases the dependency only provides handling for a specific content type and may not be critical for (X)HTML-only retrieval.

Classes in the org.archive.crawler.extractor package depend on packages in com.anotherbigidea.flash.*. These Flash parsing packages are released under a BSD License that is compatible with the GPL, see the JavaSWF2-BSD License.
Classes in the org.archive.crawler.extractor package depend on packages in com.lowagie.text.pdf.*. This iText PDF parsing package is released under both the Library General Public License (see the master version) and the Mozilla Public License.
One class, org.archive.util.GateSync, depends on the class EDU.oswego.cs.dl.util.concurrent.Sync. The package as a whole is released to the public domain, but its CopyOnWriteArrayList and ConcurrentReaderHashMap classes are released under a special licence from Sun Microsystems; see the TECHNOLOGY LICENSE FROM SUN MICROSYSTEMS, INC. TO DOUG LEA (PDF). Whether Heritrix depends on these two classes has not been established.
Various classes depend on Apache packages released under the Apache Software License version 2.0:
- The Command Line Interface (CLI) package, org.apache.commons.cli.
- The Commons Collections package, org.apache.commons.collections.
- The Commons HTTP Client package, org.apache.commons.httpclient.
- The Commons Logging package, org.apache.commons.logging.
- The Commons Net package, org.apache.commons.net.
- The Commons Pool package, org.apache.commons.pool.
- The Jakarta POI package, org.apache.poi.hdf.extractor.
Classes in the org.archive.crawler package depend on classes in the org.mortbay.http and org.mortbay.jetty packages. These packages are released under the Apache Software License version 2.0 with special restrictions.
Classes in various packages depend on the Java DNS package, org.xbill.DNS, released under the BSD License.
JUnit tests depend on the junit.extensions and junit.framework packages.
Classes in org.archive.util and org.archive.datamodel depend on classes in the st.ata.util package, which does not appear to be maintained except by the Heritrix project. The source code contains neither licence information nor a copyright statement.

Heritrix also has dependencies on standard Java extensions that may not be fully implemented by GNU Classpath extensions:

Classes in many packages depend on the javax.management package.
Classes in several packages depend on the javax.net and javax.net.ssl packages.
Classes in the org.archive.crawler package depend on the javax.xml.parsers and javax.xml.transform packages, which should be compatible with GNU JAXP.
Classes in the org.archive.settings package depend on classes in org.xml.sax, which should be compatible with GNU JAXP.
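
One way to check which of the standard extensions listed above a given Java runtime actually supplies is to try loading a representative class from each package. The following is a minimal sketch and not part of Heritrix: the class name ExtensionProbe and the choice of representative classes are assumptions made for illustration, and the code uses current Java syntax. It reports only whether a class can be loaded, which shows that a package is present but not that it is fully implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reports which extension classes the runtime can load. */
public class ExtensionProbe {

    /** Returns true if the named class can be located, without initialising it. */
    public static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ExtensionProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or present but not loadable on this runtime.
            return false;
        }
    }

    /** Probes each class name in order and records whether it was found. */
    public static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String name : classNames) {
            results.put(name, isAvailable(name));
        }
        return results;
    }

    public static void main(String[] args) {
        // One representative class per package named in the list above
        // (the specific classes are an assumption, chosen for illustration).
        Map<String, Boolean> results = probe(
            "javax.management.MBeanServer",
            "javax.net.SocketFactory",
            "javax.net.ssl.SSLSocketFactory",
            "javax.xml.parsers.SAXParserFactory",
            "javax.xml.transform.TransformerFactory",
            "org.xml.sax.XMLReader");
        for (Map.Entry<String, Boolean> entry : results.entrySet()) {
            System.out.println(entry.getKey() + ": "
                + (entry.getValue() ? "present" : "missing"));
        }
    }
}
```

Running the class on the target runtime prints one line per probed class. A "missing" result identifies a package that would need a replacement implementation before Heritrix could run there; a "present" result would still need functional testing.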

Initial review notes

Heritrix is evidently an extremely complex system, capable of handling a great diversity of Web content and, presumably, of doing so robustly. The system is composed of 274 classes and has 20 supporting libraries. It is clearly highly refined for its specific purpose: producing ARC archive files for vast amounts of Web content.

The great complexity of the Heritrix system, and in particular its close dependence on several Apache libraries, would make extricating core components for the MKSearch project a very distracting task. Without significant further analysis it is difficult to tell whether this task is practicable at all, so Heritrix is not recommended unless other alternatives fail. In any case, a system based on Metis is likely to be more tractable than Heritrix.

This document was last modified on 2005-03-24 05:46:40.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html