Heritrix

Heritrix is the Internet Archive's Web crawler. It is released under the GPL, but it depends on many other packages whose licence terms may not be compatible. In many cases, a dependency only provides handling for a specific content type and may not be critical for (X)HTML-only retrieval.

Heritrix also has dependencies on standard Java extensions that may not be fully implemented by GNU Classpath extensions:

Initial review notes

Heritrix is evidently an extremely complex system, capable of handling a great diversity of Web content and presumably of doing so robustly. The system is composed of 274 classes and has 20 supporting libraries. It is clearly highly refined for its specific purpose: producing ARC archive files for vast amounts of Web content.

The great complexity of the Heritrix system, and its close dependence on several Apache libraries in particular, would make extricating core components for the MKSearch project a very distracting task. It would be difficult to tell whether this task is practicable at all without significant further analysis, so Heritrix is not recommended unless other alternatives fail. In any case, it is likely that a system based on Metis would be more tractable than Heritrix.

This document was last modified on 2005-03-24 05:46:40.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html