Heritrix
Heritrix is the Internet Archive's Web crawler. It is released under the GPL, but depends on many other packages whose licence terms may not be compatible. In many cases, the dependency provides handling for specific content types and may not be critical for (X)HTML-only retrieval.
- Classes in the org.archive.crawler.extractor package depend on packages in com.anotherbigidea.flash.*. These Flash parsing packages are released under a BSD License that is compatible with the GPL, see the JavaSWF2-BSD License.
- Classes in the org.archive.crawler.extractor package depend on packages in com.lowagie.text.pdf.*. This iText PDF parsing package is released under the Library General Public License, see the master version, and the Mozilla Public License.
- One class, org.archive.util.GateSync, depends on the class EDU.oswego.cs.dl.util.concurrent.Sync. The package as a whole is released to the public domain, but the CopyOnWriteArrayList and ConcurrentReaderHashMap classes are released under a special licence from Sun Microsystems, see the TECHNOLOGY LICENSE FROM SUN MICROSYSTEMS, INC. TO DOUG LEA (PDF). Dependency on these classes has not been established.
- Various classes depend on Apache packages released under the Apache Software License version 2.0:
  - The Command Line Interface (CLI) package, org.apache.commons.cli.
  - The Commons Collections package, org.apache.commons.collections.
  - The Commons HTTP Client package, org.apache.commons.httpclient.
  - The Commons Logging package, org.apache.commons.logging.
  - The Commons Net package, org.apache.commons.net.
  - The Commons Pool package, org.apache.commons.pool.
  - The Jakarta POI package, org.apache.poi.hdf.extractor.
- Classes in the org.archive.crawler package depend on classes in the org.mortbay.http and org.mortbay.jetty packages. These packages are released under the Apache Software License version 2.0 with special restrictions.
- Classes in various packages depend on the Java DNS package, org.xbill.DNS, released under the BSD License.
- JUnit tests depend on the junit.extensions and junit.framework packages.
- Classes in org.archive.util and org.archive.datamodel depend on classes in the st.ata.util package, which does not appear to be maintained except by the Heritrix project. The source code contains no licence information nor copyright statement.
Heritrix also has dependencies on standard Java extensions that may not be fully implemented by GNU Classpath extensions:
- Classes in many packages depend on the javax.management package.
- Classes in several packages depend on the javax.net and javax.net.ssl packages.
- Classes in the org.archive.crawler package depend on the javax.xml.parsers and javax.xml.transform packages, which should be compatible with GNU JAXP.
- Classes in the org.archive.settings package depend on classes in org.xml.sax, which should be compatible with GNU JAXP.
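One quick way to see which of these standard extension packages a given runtime provides is to try loading a representative class from each. The following is a minimal sketch, not part of Heritrix: the class name ExtensionProbe and the choice of representative classes are assumptions made for illustration. Finding a class shows only that the package is present, not that the implementation is complete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reports which standard extension classes the runtime can load. */
public class ExtensionProbe {

    /** Returns true if the named class can be located, without initialising it. */
    public static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ExtensionProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but could not be linked, so treat it as unusable.
            return false;
        }
    }

    /** Probes each class name in order and records whether it was found. */
    public static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<String, Boolean>();
        for (String name : classNames) {
            result.put(name, Boolean.valueOf(isAvailable(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        // One representative class per package named in the list above (an assumption).
        Map<String, Boolean> report = probe(
            "javax.management.MBeanServer",
            "javax.net.SocketFactory",
            "javax.net.ssl.SSLSocketFactory",
            "javax.xml.parsers.SAXParserFactory",
            "javax.xml.transform.TransformerFactory",
            "org.xml.sax.XMLReader");
        for (Map.Entry<String, Boolean> entry : report.entrySet()) {
            String status = entry.getValue().booleanValue() ? "found  " : "MISSING";
            System.out.println(status + " " + entry.getKey());
        }
    }
}
```

Running the main method on the target runtime prints one line per class. A MISSING entry points at a package that would need a substitute before Heritrix could run there.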
Initial review notes
Heritrix is evidently an extremely complex system that is capable of handling a great diversity of Web content, and presumably does so robustly. The system is composed of 274 classes and has 20 supporting libraries. It is clearly highly refined for its specific purpose: producing ARC archive files for vast amounts of Web content.
The great complexity of the Heritrix system, and its close dependence on several Apache libraries in particular, would make extricating core components for the MKSearch project a very distracting task. It would be difficult to tell whether this task is practicable at all without significant further analysis, so Heritrix is not recommended unless other alternatives fail. In any case, it is likely that a system based on Metis would be more tractable than Heritrix.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html