Grunk is a framework for spidering and indexing metadata from semi-structured text formats. Grunk itself is released under the GPL, but it depends on several Apache packages:
- The emerge.grunk.regex.RegEx class depends on the org.apache.oro.text.regex package, released under the Apache Software License.
- The nsca.emerge.grunk.xml.XSLTransformer class depends on org.apache.xml.serialize.XMLSerializer, which is part of the Apache Xerces XML parser, released under the Apache Software License.
Grunk also has dependencies on standard Java extensions, which should be compatible with GNU JAXP:
- The nsca.emerge.grunk.xml.XSLTransformer class depends on the javax.xml.transform and javax.xml.transform.stream packages.
- Various packages depend on the org.w3c.dom package.
- Various packages depend on the org.xml.sax package.
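To give a sense of what the javax.xml.transform dependency involves, here is a minimal sketch of an XSLT transformation using only the standard JAXP interfaces. It is not taken from Grunk's XSLTransformer; the class name XsltSketch, the sample record and the stylesheet are all invented for illustration. Because it uses only the JAXP interfaces, code of this shape should in principle run against any conforming implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Grunk: applies a stylesheet to a document
// using only javax.xml.transform and javax.xml.transform.stream.
final class XsltSketch {

    /** Applies the XSLT stylesheet text to the XML text and returns the result. */
    static String transform(String xml, String xslt) {
        try {
            // The factory picks whichever JAXP implementation is installed.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data: a one-field record and a stylesheet that
        // emits the title as plain text.
        String xml = "<record><title>Grunk</title></record>";
        String xslt =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:value-of select=\"record/title\"/>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```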
Initial review notes
Grunk turns out to be more of a content parser than a spidering application per se. It is a tool for analysing source data structures and applying appropriate parsing tools to the content.
Grunk uses layered sets of Importer, Scanner and Preprocessor components to identify an appropriate parsing scheme for a source and then apply it. This makes the system quite large in terms of the number of classes, and biased towards plain-text formats rather than HTML or XML. Grunk seems able to handle extremely large input sources.
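The layered idea can be pictured roughly as follows. This is a sketch under stated assumptions, not Grunk's actual design: the interface names echo the component kinds mentioned above, but their method signatures, the Layer class, the division of roles between the components and the two sample schemes are all hypothetical. It assumes a preprocessor normalises the raw text, each scanner tests whether it recognises the format, and the importer paired with the first matching scanner extracts the metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and signatures are assumptions, not Grunk's API.
final class LayeredParseSketch {
    interface Preprocessor { String clean(String raw); }
    interface Scanner { boolean recognises(String text); }
    interface Importer { Map<String, String> extract(String text); }

    /** One layer: a scanner that detects a format, paired with its importer. */
    static final class Layer {
        final String name;
        final Scanner scanner;
        final Importer importer;

        Layer(String name, Scanner scanner, Importer importer) {
            this.name = name;
            this.scanner = scanner;
            this.importer = importer;
        }
    }

    private final Preprocessor preprocessor;
    private final List<Layer> layers = new ArrayList<>();

    LayeredParseSketch(Preprocessor preprocessor) {
        this.preprocessor = preprocessor;
    }

    LayeredParseSketch add(String name, Scanner scanner, Importer importer) {
        layers.add(new Layer(name, scanner, importer));
        return this;
    }

    /** Cleans the source, then applies the first layer whose scanner matches. */
    Map<String, String> parse(String raw) {
        String text = preprocessor.clean(raw);
        for (Layer layer : layers) {
            if (layer.scanner.recognises(text)) {
                Map<String, String> result =
                    new LinkedHashMap<>(layer.importer.extract(text));
                result.put("scheme", layer.name); // record which scheme was chosen
                return result;
            }
        }
        throw new IllegalArgumentException("no parsing scheme matched the source");
    }

    /** Stand-in importer: reads "Key: value" lines into a map. */
    static Map<String, String> keyValueLines(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** Builds a two-layer example: key-value records first, plain text as fallback. */
    static LayeredParseSketch example() {
        return new LayeredParseSketch(raw -> raw.replace("\r\n", "\n").trim())
            .add("key-value", text -> text.contains(":"),
                 LayeredParseSketch::keyValueLines)
            .add("plain", text -> true, text -> {
                Map<String, String> body = new LinkedHashMap<>();
                body.put("body", text);
                return body;
            });
    }

    public static void main(String[] args) {
        System.out.println(example().parse("Title: Grunk\r\nLicence: GPL\r\n"));
    }
}
```

Even in this toy form, each supported format needs its own scanner and importer, which hints at why the number of classes grows quickly in a system built this way.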