Grunk is a framework for spidering and indexing metadata from semi-structured text formats. Grunk itself is released under the GPL, but it depends on several Apache packages:
- The emerge.grunk.regex.RegEx class depends on the org.apache.oro.text.regex package, released under the Apache Software License.
- The nsca.emerge.grunk.xml.XSLTransformer class depends on org.apache.xml.serialize.XMLSerializer, which is part of the Apache Xerces XML parser, released under the Apache Software License.
Grunk also has dependencies on standard Java extensions, which should be compatible with GNU JAXP:
- The nsca.emerge.grunk.xml.XSLTransformer class depends on the javax.xml.transform and javax.xml.transform.stream packages.
- Various packages depend on the org.w3c.dom package.
- Various packages depend on the org.xml.sax package.
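For orientation, the kind of work the javax.xml.transform and javax.xml.transform.stream packages do can be sketched as follows. This is a minimal illustration of the standard JAXP API only, not Grunk's XSLTransformer code; the class name JaxpTransformSketch, the sample XML, and the stylesheet are all assumptions made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name; illustrates JAXP usage, not Grunk's own implementation.
class JaxpTransformSketch {

    // Applies an XSLT stylesheet to an XML document, both supplied as strings.
    static String transform(String xml, String xslt) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need not handle a checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input: a record with a single title element.
        String xml = "<record><title>Grunk notes</title></record>";
        // Stylesheet that emits the title as plain text.
        String xslt =
            "<xsl:stylesheet version='1.0' "
          + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/'>"
          + "<xsl:value-of select='record/title'/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

Because the sketch obtains its Transformer through TransformerFactory.newInstance(), it does not name a concrete XSLT engine, which is why any conforming JAXP implementation should be able to stand in here.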
Initial review notes
Grunk turns out to be more of a content parser than a spidering application per se. It is a tool for analysing source data structures and applying appropriate parsing tools to the content.
Grunk uses layered sets of Importer, Scanner, and Preprocessor components to identify an appropriate parsing scheme for a source and then apply it. This makes the system quite large in terms of the number of classes, and biases it towards plain text formats rather than HTML or XML. Grunk appears to be able to handle extremely large input sources.
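The layered approach described above can be pictured with a small sketch. Everything in it is hypothetical: the interface shapes, the Scheme class, and the "Key: value" sample format are assumptions chosen for illustration and are not taken from Grunk's actual Importer, Scanner, or Preprocessor classes. It only shows the general pattern of cleaning the input, detecting which scheme fits, and then applying that scheme.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a detect-then-parse pipeline; not Grunk's real API.
class LayeredParsingSketch {

    // Assumed roles, named after the component types mentioned in the notes.
    interface Preprocessor { String clean(String raw); }
    interface Scanner { boolean recognises(String text); }
    interface Importer { Map<String, String> extract(String text); }

    // One parsing scheme: a scanner that detects a format plus its importer.
    static final class Scheme {
        final String name;
        final Scanner scanner;
        final Importer importer;

        Scheme(String name, Scanner scanner, Importer importer) {
            this.name = name;
            this.scanner = scanner;
            this.importer = importer;
        }
    }

    // Cleans the input, then applies the first scheme whose scanner accepts it.
    static Optional<Map<String, String>> parse(
            String raw, Preprocessor pre, List<Scheme> schemes) {
        String text = pre.clean(raw);
        for (Scheme scheme : schemes) {
            if (scheme.scanner.recognises(text)) {
                return Optional.of(scheme.importer.extract(text));
            }
        }
        return Optional.empty();
    }

    // Example importer: reads "Key: value" lines into an ordered map.
    static Map<String, String> keyValueLines(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    // Example preprocessor: normalises line endings and trims whitespace.
    static final Preprocessor NORMALISE_NEWLINES =
        raw -> raw.replace("\r\n", "\n").trim();

    // A single made-up scheme that accepts text starting with a Title field.
    static final List<Scheme> SCHEMES = List.of(
        new Scheme("key-value",
                   text -> text.startsWith("Title:"),
                   LayeredParsingSketch::keyValueLines));

    public static void main(String[] args) {
        String raw = "Title: Initial review\r\nAuthor: unknown\r\n";
        System.out.println(parse(raw, NORMALISE_NEWLINES, SCHEMES));
    }
}
```

Even in this toy form, each supported format needs its own scanner and importer, which gives some sense of why a system built this way accumulates a large number of classes.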