Arachnid

Arachnid is released under the GPL and appears to be a tool for mapping a link structure. It is a small package of eight classes.

  • One class, bplatt.spider.PageInfo, depends on the javax.swing.text and javax.swing.text.html packages, which may not be fully implemented by GNU Classpath.
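To give a sense of what this dependency involves, the sketch below extracts link targets with the JDK's Swing HTML parser. It is not Arachnid's PageInfo code: the class name SwingLinkExtractor and its method are hypothetical, and only the javax.swing.text.html calls are standard JDK API. Code of this kind is what would need to work under GNU Classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Arachnid: collects href values from <a>
// tags using the Swing HTML parser shipped with the JDK.
class SwingLinkExtractor {

    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The final argument tells the parser to ignore any charset
            // declared inside the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.html\">A</a></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

The parser runs without a display, so the Swing dependency is a matter of class library coverage rather than of needing a GUI.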

Initial review notes

Arachnid is similar in nature to Spindle, but its structure is more modular and extensive. The package uses a layered scheme in which spiders are created by extending the abstract Arachnid base class. Subclasses implement template methods for handling bad links, IO exceptions, unrecognised links and external links.
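The layering described above can be sketched as follows. This is a minimal illustration of the template-method idea only: SpiderSketch, RecordingSpider and every method name here are hypothetical and are not taken from the Arachnid source, whose actual signatures may differ. The fetch step is a stand-in so that the sketch runs without network access.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: visit() fixes the classification order and
// delegates each outcome to a hook that subclasses must implement.
abstract class SpiderSketch {
    private final String siteHost;

    SpiderSketch(String siteHost) {
        this.siteHost = siteHost;
    }

    final void visit(String link) {
        URI uri;
        try {
            uri = new URI(link);
        } catch (URISyntaxException e) {
            handleBadLink(link);
            return;
        }
        String scheme = uri.getScheme();
        // Assumption of this sketch: anything that is not absolute http(s),
        // including relative links, counts as unrecognised.
        if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
            handleUnrecognisedLink(link);
            return;
        }
        if (!siteHost.equalsIgnoreCase(uri.getHost())) {
            handleExternalLink(link);
            return;
        }
        try {
            handlePage(link, fetch(uri));
        } catch (IOException e) {
            handleIOException(link, e);
        }
    }

    protected abstract String fetch(URI uri) throws IOException;
    protected abstract void handlePage(String link, String content);
    protected abstract void handleBadLink(String link);
    protected abstract void handleIOException(String link, IOException e);
    protected abstract void handleUnrecognisedLink(String link);
    protected abstract void handleExternalLink(String link);
}

// Hypothetical subclass that records which hook fired for each link.
class RecordingSpider extends SpiderSketch {
    final List<String> events = new ArrayList<>();

    RecordingSpider(String siteHost) {
        super(siteHost);
    }

    @Override protected String fetch(URI uri) throws IOException {
        // Stand-in for a real HTTP request so the sketch runs offline.
        if (uri.getPath().endsWith("/missing")) {
            throw new IOException("simulated failure");
        }
        return "<html></html>";
    }
    @Override protected void handlePage(String link, String content) { events.add("page " + link); }
    @Override protected void handleBadLink(String link) { events.add("bad " + link); }
    @Override protected void handleIOException(String link, IOException e) { events.add("io " + link); }
    @Override protected void handleUnrecognisedLink(String link) { events.add("unrecognised " + link); }
    @Override protected void handleExternalLink(String link) { events.add("external " + link); }
}
```

Because the classification logic lives in the base class, a link mapper, a link checker and a content harvester would differ only in how they fill in the hooks.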

Arachnid itself is not multi-threaded, but the Arachnid base class can be used to create threaded applications. It uses an HTML tokenizer that appears slightly more sophisticated than the one used by Spindle, but it may not handle all cases of invalid markup. The PageInfo class uses a WebPageXtractor to get document content.

Apart from the Swing dependency in PageInfo noted above, there are no problematic package dependencies.

Limitations

Arachnid allows a fixed delay between URL requests, but it does not support the robots exclusion protocol.
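As a rough indication of what closing this gap would take, the sketch below combines a fixed delay with a very small robots.txt check. Everything in it is an assumption for illustration: RobotsSketch and its methods are hypothetical names, not Arachnid code, and the parser deliberately handles only "User-agent: *" groups and Disallow prefixes, ignoring Allow lines, wildcards and agent-specific groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Arachnid.
class RobotsSketch {

    // Returns the Disallow path prefixes that apply to all user agents.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unparseable line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = (lastWasAgent && inWildcardGroup) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // Visits the allowed paths in order, sleeping a fixed period between them.
    static List<String> visitPolitely(List<String> paths, List<String> disallowed, long delayMillis) {
        List<String> visited = new ArrayList<>();
        for (String path : paths) {
            if (!isAllowed(path, disallowed)) {
                continue;
            }
            if (!visited.isEmpty()) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            visited.add(path); // a real spider would fetch the page here
        }
        return visited;
    }
}
```

A spider built on Arachnid could apply a check like isAllowed before requesting each URL, although a production crawler would need a fuller implementation of the protocol.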

This document was last modified by Philip Shaw on 2004-11-04 03:14:48
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html