Spindle

Spindle is a spidering and indexing tool for Apache Lucene. It includes the Apache Software License, but the spider application code is released under the GPL; see the master copy. The Spindle source depends on the following packages:

The com.bitmechanic.spindle.Search class includes the package com.bitmechanic.listlib, which in turn includes the Apache Commons Bean Utilities package, released under the Apache Software License; see the master copy.
The com.bitmechanic.spindle.Spider class includes classes from the cvu.html package, which is released under the GPL.
The com.bitmechanic.spindle.Search and com.bitmechanic.spindle.Spider classes depend on org.apache.lucene.analysis.standard.StandardAnalyzer. Lucene is released under the Apache Software License version 2.0.

Spindle also depends on standard Java extensions, which should be compatible with the GNU Servlet API:

The com.bitmechanic.spindle.Search class depends on javax.servlet.jsp.PageContext.

Initial review notes

One of the main advantages (and drawbacks) of the Spindle spider is its simplicity. The spider is a single class that runs multiple instances of itself as threads. It reads input streams via a Reader.

The number of threads that Spindle spawns is specified on the command line. A thread is created to handle each URL and named for logging purposes. The host calls the join() method on the thread and waits for it to complete. URLs are added to and retrieved from a synchronized queue.
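The threading arrangement described above can be illustrated with a minimal sketch. It is not taken from the Spindle source: the class name CrawlQueueSketch, its methods and the "spider-N" thread names are all hypothetical, and the sketch assumes a fixed pool of workers that drain a shared queue, which may differ from how Spindle actually schedules its threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Spindle.
class CrawlQueueSketch {
    private final LinkedList<String> queue = new LinkedList<>();

    // Both queue operations are synchronized so threads can share the queue safely.
    synchronized void add(String url) {
        queue.addLast(url);
    }

    // Returns the next URL, or null when the queue is empty.
    synchronized String next() {
        return queue.isEmpty() ? null : queue.removeFirst();
    }

    // Starts threadCount named workers, waits for all of them with join(),
    // and returns one "threadName url" entry per URL handled.
    static List<String> crawl(List<String> seeds, int threadCount) {
        CrawlQueueSketch q = new CrawlQueueSketch();
        for (String seed : seeds) {
            q.add(seed);
        }
        List<String> handled = Collections.synchronizedList(new ArrayList<String>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                String url;
                while ((url = q.next()) != null) {
                    // A real spider would fetch and index the page here.
                    handled.add(Thread.currentThread().getName() + " " + url);
                }
            }, "spider-" + i); // named so log lines can be attributed to a thread
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // the host waits for each worker to complete
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return handled;
    }
}
```

Calling CrawlQueueSketch.crawl(seeds, 2) with three seed URLs returns three entries, each tagged with the name of the thread that handled it. The order of the entries depends on thread scheduling.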

The HTML content is tokenized by the supporting HTML tokenizer class, which splits the input on < and > separators. This creates an Enumeration of TextToken and TagToken objects that contain attributes. As it indexes the page, the spider identifies links, extracts their URLs and adds them to the queue.
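As a rough illustration of splitting on < and >, the following sketch separates a string into tag and text tokens. It is not the cvu.html tokenizer: the class AngleTokenizerSketch is hypothetical, tokens are plain strings with a "TAG:" or "TEXT:" prefix rather than TagToken and TextToken objects, and attributes are left unparsed inside the tag string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the tokenizer Spindle uses.
class AngleTokenizerSketch {

    // Splits html into tokens, each prefixed with "TAG:" or "TEXT:".
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            if (open < 0) {
                // No more tags: the remainder is text.
                addText(tokens, html.substring(pos));
                break;
            }
            addText(tokens, html.substring(pos, open));
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                // Unterminated tag: treat the remainder as text.
                addText(tokens, html.substring(open));
                break;
            }
            tokens.add("TAG:" + html.substring(open + 1, close).trim());
            pos = close + 1;
        }
        return tokens;
    }

    // Adds a text token unless the text is only whitespace.
    private static void addText(List<String> tokens, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            tokens.add("TEXT:" + trimmed);
        }
    }
}
```

A splitter this simple also breaks on a > character inside an attribute value or a comment, which is one reason such tokenizers tend to be fragile on real-world pages.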

Spindle recognises standard anchor elements and frame source attributes. It ignores links with protocols other than HTTP and HTTPS, and those that start with "", "#" and "javascript:".
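The link rules above might be expressed as a filter along the following lines. This is a sketch under assumptions, not Spindle's code: the class LinkFilterSketch and the method shouldFollow are hypothetical, the "" case is read as an empty href, and links without a scheme are assumed to be relative and therefore kept.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the Spindle source.
class LinkFilterSketch {
    // A URI scheme: a letter followed by letters, digits, '+', '.' or '-', then ':'.
    private static final Pattern SCHEME = Pattern.compile("^[a-z][a-z0-9+.-]*:");

    static boolean shouldFollow(String href) {
        if (href == null) {
            return false;
        }
        String h = href.trim().toLowerCase(Locale.ROOT);
        // Assumption: an empty href is skipped, as are fragments and script links.
        if (h.isEmpty() || h.startsWith("#") || h.startsWith("javascript:")) {
            return false;
        }
        if (!SCHEME.matcher(h).find()) {
            // No scheme, so treat it as a relative link to resolve against the page URL.
            return true;
        }
        return h.startsWith("http:") || h.startsWith("https:");
    }
}
```

With this filter, "http://example.org/", "https://example.org/a" and "page.html" are followed, while "mailto:someone@example.org", "#top" and "javascript:void(0)" are not.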

The tokenizer seems to assume that tags do not break across lines, and it is not clear whether it would properly handle empty XHTML elements such as <br />.

Limitations

Included and excluded URLs are specified via command line arguments. The HTTP client features are minimal: only status code 200 is recognised.

Before adding a new URL to the queue, the spider checks only that the URL has not already been indexed. Several threads might therefore queue and load the same page before one of them indexes it.
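One common way to avoid this kind of duplicate work is to record a URL as seen at the moment it is queued, rather than when it is indexed. The sketch below shows the idea only; it is not part of Spindle, and the class SeenUrlsSketch and its claim method are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; Spindle does not include this class.
class SeenUrlsSketch {
    // A thread-safe set; add() is atomic, so the check and the insert cannot be interleaved.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true only for the first caller to offer this URL.
    boolean claim(String url) {
        return seen.add(url);
    }
}
```

A thread would call claim(url) before queueing a link and skip the link when the call returns false, so each URL enters the queue at most once however many pages link to it.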

Spindle does not recognise the robots exclusion protocol or the equivalent meta element directives.

This document was last modified on 2004-11-03 06:32:18.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html