Crawler

Component description

The MKSearch crawler component is responsible for identifying Web documents to be indexed and passing them to custom content handlers for indexing. The crawler is largely composed of the JSpider Web spidering engine configured with custom MKSearch plugins that pass documents to the indexing component.

Development plans

Correct JSpider issues

To date, a range of issues have been identified with the JSpider alpha 0.5 release that affect how links are identified in HTML documents and various other aspects of the tool's operation. The issues have been logged as JSpider bug reports and must be corrected or worked around. Access has been granted to the JSpider CVS repository to make the corrections upstream of MKSearch.

Custom schema plugin types

The default metadata schema support includes Dublin Core elements and qualifiers. However, government Web sites may have their own metadata schemas, such as the UK e-Government Metadata Standard, which need to be integrated with these schemes. The crawler plugins need to allow for custom Schema support.

Task progress: Re-factored the plugin class hierarchy to share more common functionality.

XhtmlTripleWriterPlugin completed.

Additional plugin types

The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the text/html content type. The system also needs to process RSS 1.0 documents and additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.

Store management plugins

MKSearch currently uses a single-pass indexing scheme where a new repository is created for each crawl session. The metadata store ultimately needs to be managed dynamically, so that incremental indexing can take place while a repository is servicing public queries on the front end. For instance, indexed documents that are subsequently removed from the origin server must also be removed from the MKSearch repository. This will require plugins that handle HTTP 404, not found, and other error types.

Alternative configuration schemes

JSpider currently uses static factory-based configuration loaders with Java property files, which work fine, but cause some difficulties in unit testing. This is not a critical issue, but an alternative form of configuration may be devised.

This document was last modified by Philip Shaw on 2005-03-02 07:18:10
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html

Links

Sign up

Crawler

Component description

Development plans