Grunk

Grunk is a framework for spidering and indexing metadata from semi-structured text formats. Grunk itself is released under the GPL licence, but it depends on various Apache packages.

Grunk also depends on standard Java extensions, which should be compatible with GNU JAXP.

Initial review notes

Grunk turns out to be more of a content parser than a spidering application per se. It is a tool for analysing source data structures and applying appropriate parsing tools to the content.

Grunk uses layered sets of Importer, Scanner and Preprocessor components to identify an appropriate parsing scheme for a source and then apply it. This makes the system quite large in terms of the number of classes, and biased towards plain text formats rather than HTML or XML. Grunk also seems able to handle extremely large input sources.
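To make the layered idea concrete, here is a minimal Java sketch of how such components might fit together. It is not taken from Grunk's source. The interface names echo the component names above, but their methods, the roles assigned to each layer, the prefixScanner helper and the scheme names are all assumptions made for illustration. The sketch stops at identifying a scheme and does not apply it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: these types are NOT Grunk's real classes.
class LayeredParsingSketch {

    /** Assumed role: fetch raw text for a source identifier. */
    interface Importer {
        String load(String source);
    }

    /** Assumed role: normalise raw text before it is scanned. */
    interface Preprocessor {
        String clean(String raw);
    }

    /** Assumed role: decide whether a named parsing scheme fits the text. */
    interface Scanner {
        String scheme();
        boolean matches(String text);
    }

    /** Runs the three layers in order and returns the chosen scheme name. */
    static String identifyScheme(String source, Importer importer,
                                 Preprocessor preprocessor, List<Scanner> scanners) {
        String text = preprocessor.clean(importer.load(source));
        for (Scanner scanner : scanners) {
            if (scanner.matches(text)) {
                return scanner.scheme();
            }
        }
        return "unknown"; // no scanner recognised the content
    }

    /** Hypothetical helper: builds a Scanner that matches on a leading prefix. */
    static Scanner prefixScanner(final String scheme, final String prefix) {
        return new Scanner() {
            public String scheme() { return scheme; }
            public boolean matches(String text) { return text.startsWith(prefix); }
        };
    }

    /** Wires up in-memory stand-ins for each layer and classifies the content. */
    static String demo(String content) {
        Importer inMemory = source -> source;      // the "source" is the content itself
        Preprocessor trimmer = raw -> raw.trim();  // strip surrounding whitespace
        List<Scanner> scanners = Arrays.asList(
                prefixScanner("key-value", "Title:"),
                prefixScanner("comment-block", "#"));
        return identifyScheme(content, inMemory, trimmer, scanners);
    }

    public static void main(String[] args) {
        System.out.println(demo("  Title: Grunk notes"));
        System.out.println(demo("# a comment"));
        System.out.println(demo("<html>"));
    }
}
```

Each layer sits behind its own small interface, so supporting a new format means adding another Scanner rather than changing the pipeline. That may be one reason a design of this kind accumulates a large number of classes.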

This document was last modified on 2004-11-03 06:31:55.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html