Arachnid
Arachnid is released under the GPL and appears to be a tool for mapping a link structure. It is a small package of 8 classes.
- One class, bplatt.spider.PageInfo, depends on the javax.swing.text and javax.swing.text.html packages, which may not be fully implemented by GNU Classpath.
Initial review notes
Arachnid is of a similar nature to Spindle, but its structure is more modular and extensive. The package has a layered scheme for creating spiders by extending the abstract Arachnid base class. Subclasses implement template methods for handling bad links, IO exceptions, unrecognised links and external links.
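The layered, template-method scheme described above can be illustrated with a minimal, self-contained sketch. The names SketchSpider, LoggingSpider, SpiderDemo and the hook names are hypothetical and are not Arachnid's actual API. The "site" is an in-memory map so the example runs without network access. Only the bad-link and external-link hooks are shown; hooks for IO exceptions and unrecognised links would follow the same pattern.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an abstract spider base class (not Arachnid's real API).
abstract class SketchSpider {
    private final String base;                      // prefix that marks a link as internal
    private final Map<String, List<String>> pages;  // in-memory site: URL -> outgoing links

    SketchSpider(String base, Map<String, List<String>> pages) {
        this.base = base;
        this.pages = pages;
    }

    // The traversal is fixed here; subclasses only supply the hooks below.
    final void traverse() {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(base);
        seen.add(base);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            List<String> links = pages.get(url);
            if (links == null) {            // no such page in the simulated site
                handleBadLink(url);
                continue;
            }
            handlePage(url);
            for (String link : links) {
                if (!link.startsWith(base)) {
                    handleExternalLink(link);   // reported, never followed
                } else if (seen.add(link)) {
                    queue.add(link);            // internal and not yet visited
                }
            }
        }
    }

    protected abstract void handlePage(String url);
    protected abstract void handleBadLink(String url);
    protected abstract void handleExternalLink(String url);
}

// A concrete spider that only records what it was told about.
class LoggingSpider extends SketchSpider {
    final List<String> log = new ArrayList<>();

    LoggingSpider(String base, Map<String, List<String>> pages) {
        super(base, pages);
    }

    @Override protected void handlePage(String url) { log.add("page " + url); }
    @Override protected void handleBadLink(String url) { log.add("bad " + url); }
    @Override protected void handleExternalLink(String url) { log.add("external " + url); }
}

class SpiderDemo {
    static List<String> run() {
        Map<String, List<String>> site = Map.of(
            "http://example.test/",
                List.of("http://example.test/a", "http://other.test/"),
            "http://example.test/a",
                List.of("http://example.test/missing", "http://example.test/"));
        LoggingSpider spider = new LoggingSpider("http://example.test/", site);
        spider.traverse();
        return spider.log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Keeping the traversal in a final method of the base class means each subclass decides only how to react to events, which is the main appeal of this design.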
Arachnid is not multi-threaded, but the Arachnid base class can be used to create threaded applications. It uses an HTML tokenizer that appears slightly more sophisticated than the one used with Spindle, but it may not handle all cases of invalid markup. The PageInfo class uses a WebPageXtractor to get document content.
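To show the kind of work the javax.swing.text.html packages can do for a class like PageInfo, here is a small sketch that pulls link targets out of a page with the JDK's Swing HTML parser. It is not Arachnid's code: LinkExtractor and extractLinks are hypothetical names, and how PageInfo and WebPageXtractor actually use these packages is not covered by these notes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; illustrates the JDK parser, not Arachnid's PageInfo.
class LinkExtractor {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/a\">A</a> <a name=\"x\">anchor</a> "
                    + "<a href=\"b.html\">B</a></body></html>";
        extractLinks(html).forEach(System.out::println);
    }
}
```

Because this relies on javax.swing.text.html.parser, it is subject to the same GNU Classpath caveat noted for PageInfo.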
Apart from the javax.swing.text dependency noted above, there are no problematic package dependencies.
Limitations
Arachnid allows a fixed delay between URL requests, but it does not support the robots exclusion protocol.
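As a rough illustration of what closing this gap would involve, the sketch below combines a fixed inter-request delay with a very small robots.txt check. All class and method names (RobotsRules, FixedDelay, PoliteDemo) are hypothetical and not part of Arachnid. The parser is an assumption-laden simplification: it honours only Disallow prefixes in "User-agent: *" groups and ignores Allow lines, wildcards and crawl-delay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt rules: Disallow prefixes for "User-agent: *" only.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;       // does the current group apply to every agent?
        boolean inAgentLines = false;  // are we inside a run of consecutive User-agent lines?
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceAll("#.*", "").trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                applies = (inAgentLines && applies) || value.equals("*");
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}

// Tracks when the next request may be sent, given a fixed delay between requests.
class FixedDelay {
    private final long delayMillis;
    private long nextAllowedAt = 0;

    FixedDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // How long a caller arriving at time `now` still has to wait.
    long millisToWait(long now) {
        return Math.max(0, nextAllowedAt - now);
    }

    void markRequest(long now) {
        nextAllowedAt = now + delayMillis;
    }
}

class PoliteDemo {
    public static void main(String[] args) throws InterruptedException {
        RobotsRules rules = RobotsRules.parse("User-agent: *\nDisallow: /private/\n");
        FixedDelay delay = new FixedDelay(500);
        for (String path : new String[] {"/index.html", "/private/a.html", "/about.html"}) {
            if (!rules.isAllowed(path)) {
                System.out.println("skip " + path);
                continue;
            }
            Thread.sleep(delay.millisToWait(System.currentTimeMillis()));
            System.out.println("fetch " + path);  // a real spider would issue the request here
            delay.markRequest(System.currentTimeMillis());
        }
    }
}
```

Separating the delay bookkeeping from the actual sleep keeps the timing logic easy to check without waiting on a real clock.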
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html