Grunk

Grunk is a framework for spidering and indexing metadata from semi-structured text formats. Grunk itself is released under the GPL, but it depends on several Apache packages:

The emerge.grunk.regex.RegEx class depends on the org.apache.oro.text.regex package, released under the Apache Software License.
The nsca.emerge.grunk.xml.XSLTransformer class depends on org.apache.xml.serialize.XMLSerializer, which is part of the Apache Xerces XML parser, released under the Apache Software License.

Grunk also has dependencies on standard Java extensions, which should be compatible with GNU JAXP:

The nsca.emerge.grunk.xml.XSLTransformer class depends on the javax.xml.transform and javax.xml.transform.stream packages.
Various packages depend on the org.w3c.dom package.
Various packages depend on the org.xml.sax package.
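
For orientation, the following is a minimal sketch of how the javax.xml.transform and javax.xml.transform.stream packages are typically used together. It is not taken from Grunk's XSLTransformer class; the class name JaxpTransformSketch, the method name transform, and the sample stylesheet and record are all assumptions made for illustration. Only the standard JAXP calls themselves are real.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Grunk: applies an XSLT stylesheet
// to an XML document using only the standard JAXP transform API.
final class JaxpTransformSketch {

    static String transform(String xml, String xslt) {
        try {
            // The factory locates whichever JAXP implementation is on the classpath.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need no checked exception handling.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample stylesheet (an assumption): emits the title of a record as text.
        String xslt =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/record\">"
            + "<xsl:value-of select=\"title\"/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        String xml = "<record><title>Grunk</title><author>unknown</author></record>";
        System.out.println(transform(xml, xslt));
    }
}
```

Because the code goes through TransformerFactory.newInstance() rather than naming a concrete processor, any conforming JAXP implementation on the classpath should be able to supply the transformer.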

Initial review notes

Grunk turns out to be more of a content parser than a spidering application per se. It is a tool for analysing source data structures and applying appropriate parsing tools to the content.

Grunk uses layered sets of Importer, Scanner and Preprocessor components to identify an appropriate parsing scheme for a source and then apply it. This makes the system quite large in terms of the number of classes, and biased towards plain text formats rather than HTML or XML. Grunk seems to have the capacity to handle extremely large input sources.
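
The general idea of choosing a parsing scheme by trying layered components in turn can be sketched as follows. This is an illustration only, not Grunk's actual API: the interfaces reuse the component names from the notes above, but their methods, the roles assigned to each component, and the delimiter-based scanner are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a layered parsing pipeline; none of these
// signatures are taken from Grunk itself.
final class LayeredParsingSketch {

    // Assumed role: turn a raw source into lines.
    interface Importer { List<String> load(String raw); }

    // Assumed role: normalise lines before they are scanned.
    interface Preprocessor { List<String> clean(List<String> lines); }

    // Assumed role: decide whether a scheme fits the content, then apply it.
    interface Scanner {
        boolean recognises(List<String> lines);
        Map<String, String> extract(List<String> lines);
    }

    static final Importer LINE_IMPORTER = raw -> Arrays.asList(raw.split("\n"));

    static final Preprocessor TRIMMER = lines -> {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) kept.add(trimmed); // drop blank lines
        }
        return kept;
    };

    // A scanner that accepts content only if every line contains the delimiter.
    static Scanner delimitedScanner(final String delimiter) {
        return new Scanner() {
            public boolean recognises(List<String> lines) {
                if (lines.isEmpty()) return false;
                for (String line : lines) {
                    if (!line.contains(delimiter)) return false;
                }
                return true;
            }
            public Map<String, String> extract(List<String> lines) {
                Map<String, String> fields = new LinkedHashMap<>();
                for (String line : lines) {
                    int at = line.indexOf(delimiter);
                    fields.put(line.substring(0, at).trim(),
                               line.substring(at + delimiter.length()).trim());
                }
                return fields;
            }
        };
    }

    // Import, preprocess, then apply the first scanner that recognises the content.
    static Map<String, String> parse(String raw, List<Scanner> scanners) {
        List<String> lines = TRIMMER.clean(LINE_IMPORTER.load(raw));
        for (Scanner scanner : scanners) {
            if (scanner.recognises(lines)) return scanner.extract(lines);
        }
        return new LinkedHashMap<>(); // no scheme matched
    }

    public static void main(String[] args) {
        List<Scanner> scanners =
            Arrays.asList(delimitedScanner("="), delimitedScanner(":"));
        System.out.println(parse("Title: Grunk\n\n  Author: unknown \n", scanners));
    }
}
```

Even in this toy form, each new format needs its own scanner class, which gives some sense of why a design along these lines grows large in class count.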

This document was last modified by Philip Shaw on 2004-11-03 06:31:55
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html
