Crawler

Component description

The MKSearch crawler component is responsible for identifying Web documents to be indexed and passing them to custom content handlers for indexing. The crawler is largely composed of the JSpider Web spidering engine configured with custom MKSearch plugins that pass documents to the indexing component.

Development plans

Correct JSpider issues

Several issues have been identified in the JSpider alpha 0.5 release. They affect how links are identified in HTML documents, along with other aspects of the tool's operation. The issues have been logged as JSpider bug reports and must be corrected or worked around. Access to the JSpider CVS repository has been granted so that the corrections can be made upstream of MKSearch.

Custom schema plugin types

The default metadata schema support includes Dublin Core elements and qualifiers. However, government Web sites may have their own metadata schemas, such as the UK e-Government Metadata Standard, which need to be integrated with the default schemas. The crawler plugins need to allow for custom schema support.

Task progress

Refactored the plugin class hierarchy to share more common functionality.
XhtmlTripleWriterPlugin completed.
XhtmlStoreWriterPlugin completed.
Custom schema support complete.
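
As an illustration of what custom schema support involves, the following is a minimal sketch of a registry that maps metadata name prefixes to schema namespaces. It is not the MKSearch plugin code: the class name SchemaRegistrySketch, its methods, and the e-GMS namespace URI used in the usage note are assumptions made for the example. Only the Dublin Core element namespace is a known value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of prefix-to-namespace schema registration. */
final class SchemaRegistrySketch {

    // Maps a metadata name prefix (for example "DC") to a namespace URI.
    private final Map<String, String> namespaces = new LinkedHashMap<>();

    /** Returns a registry that knows only the Dublin Core element set. */
    static SchemaRegistrySketch withDefaults() {
        SchemaRegistrySketch registry = new SchemaRegistrySketch();
        registry.register("DC", "http://purl.org/dc/elements/1.1/");
        return registry;
    }

    /** Registers a custom schema under the given prefix. */
    void register(String prefix, String namespaceUri) {
        namespaces.put(prefix, namespaceUri);
    }

    /**
     * Resolves a name such as "DC.title" to a property URI.
     * Returns empty when the prefix is unknown or the name is malformed.
     * Qualified names with a second dot are not handled in this sketch.
     */
    Optional<String> resolve(String metaName) {
        int dot = metaName.indexOf('.');
        if (dot <= 0 || dot == metaName.length() - 1) {
            return Optional.empty();
        }
        String localName = metaName.substring(dot + 1);
        if (localName.indexOf('.') >= 0) {
            return Optional.empty();
        }
        String namespace = namespaces.get(metaName.substring(0, dot));
        return namespace == null ? Optional.empty() : Optional.of(namespace + localName);
    }
}
```

With the defaults, resolve("DC.title") yields the Dublin Core title property. A site-specific schema would be added with a call such as register("eGMS", "http://example.org/egms/"), where the URI is a placeholder and not the real e-GMS namespace.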

Custom rule types

The standard rule set for JSpider includes a "parse only text/html" rule used for standard Web content. MKSearch will require other rules for RDF and RSS content types, and perhaps PDF and other document types.

Task progress

Draft RdfContentTypeOnly rule prepared for testing.
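
A content-type rule of the kind described in this section could look roughly like the sketch below. It deliberately does not use the JSpider rule API: the ContentTypeRule class and Decision enum are stand-ins invented for the example, and the choice of application/rdf+xml as the RDF media type is an assumption about what the draft rule would match.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a content-type rule; not the JSpider Rule API. */
final class ContentTypeRuleSketch {

    /** Outcome of applying a rule to a fetched resource. */
    enum Decision { ACCEPT, IGNORE }

    /** Accepts a resource only when its media type is in the allowed set. */
    static final class ContentTypeRule {
        private final Set<String> allowedTypes;

        ContentTypeRule(Set<String> allowedTypes) {
            this.allowedTypes = allowedTypes;
        }

        Decision apply(String contentTypeHeader) {
            if (contentTypeHeader == null) {
                return Decision.IGNORE;
            }
            // Drop parameters such as "; charset=UTF-8" and normalise case.
            int semicolon = contentTypeHeader.indexOf(';');
            String mediaType = semicolon >= 0
                    ? contentTypeHeader.substring(0, semicolon)
                    : contentTypeHeader;
            mediaType = mediaType.trim().toLowerCase(Locale.ROOT);
            return allowedTypes.contains(mediaType) ? Decision.ACCEPT : Decision.IGNORE;
        }
    }

    // Equivalent of the standard "parse only text/html" rule.
    static final ContentTypeRule HTML_ONLY = new ContentTypeRule(Set.of("text/html"));

    // Assumed media type for an RDF-only rule.
    static final ContentTypeRule RDF_ONLY = new ContentTypeRule(Set.of("application/rdf+xml"));
}
```

Keeping the allowed media types as constructor data means that rules for RSS, PDF or other document types would need no new classes, only new instances.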

Additional plugin types

The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the text/html content type. The system also needs to process RSS 1.0 documents and additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.

Store management plugins

MKSearch currently uses a single-pass indexing scheme where a new repository is created for each crawl session. The metadata store ultimately needs to be managed dynamically, so that incremental indexing can take place while a repository is servicing public queries on the front end. For instance, indexed documents that are subsequently removed from the origin server must also be removed from the MKSearch repository. This will require plugins that handle HTTP 404 (Not Found) and other error responses.

Prospective store management cases, discovered during normal link traversal:

Document was present at last pass, still present: replace all previous statements.
Document was not present at last pass, now accessible (for example, after a robots policy change): add statements.
Document was present at last pass, now missing or otherwise inaccessible: purge statements.
Document was present at last pass, now cannot parse: purge statements.
Document was present at last pass, now has an un-parsed content type: purge statements.
Document was present at last pass, now robots policy blocks: purge statements.

Retrospective store management cases, which cannot be discovered by normal link traversal:

Document was present at last pass, no longer linked from accessible documents: purge statements.
Document was present at last pass, no longer present or linked from accessible documents: purge statements.
Site was present at last pass, no longer linked: purge statements.
Site was present at last pass, no longer configured to index: purge statements.
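
The prospective and retrospective cases listed above can be reduced to a small decision function plus a set difference, as in the following sketch. All names here (StoreManagementSketch, Action, Observation, decide, staleUrls) are hypothetical. The NONE action, for documents that were never indexed and are still not indexable, is an assumption added to make the function total.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the store management cases. */
final class StoreManagementSketch {

    /** What to do with a document's statements in the repository. */
    enum Action { ADD, REPLACE, PURGE, NONE }

    /** What the crawler observed for a document during the current pass. */
    enum Observation { INDEXABLE, MISSING, UNPARSEABLE, UNPARSED_CONTENT_TYPE, ROBOTS_BLOCKED }

    /** Prospective cases: the document was reached by normal link traversal. */
    static Action decide(boolean presentAtLastPass, Observation now) {
        if (now == Observation.INDEXABLE) {
            return presentAtLastPass ? Action.REPLACE : Action.ADD;
        }
        // Missing, unparseable, unparsed content type or blocked by robots policy.
        return presentAtLastPass ? Action.PURGE : Action.NONE;
    }

    /**
     * Retrospective cases: URLs indexed at the last pass that were never
     * reached during this pass, so their statements should be purged.
     */
    static Set<String> staleUrls(Set<String> indexedAtLastPass, Set<String> visitedThisPass) {
        Set<String> stale = new TreeSet<>(indexedAtLastPass);
        stale.removeAll(visitedThisPass);
        return stale;
    }
}
```

The retrospective check only makes sense once a crawl pass has finished, because a document cannot be declared unreachable while traversal is still in progress.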

Alternative configuration schemes

JSpider currently uses static factory-based configuration loaders with Java property files, which work fine, but cause some difficulties in unit testing. This is not a critical issue, but an alternative form of configuration may be devised.
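
One possible alternative, shown here only as a sketch, is to pass configuration into components through a small interface instead of having them call a static factory. The interface CrawlerConfig, the class DepthLimit and the property key crawler.maxDepth are all invented for this example and do not correspond to JSpider's actual configuration classes or keys.

```java
import java.util.Properties;

/** Hypothetical sketch of injected configuration; not JSpider's API. */
final class ConfigInjectionSketch {

    /** Minimal configuration lookup that components depend on. */
    interface CrawlerConfig {
        String get(String key, String defaultValue);
    }

    /** Adapts a Java Properties object, such as one loaded from a property file. */
    static CrawlerConfig fromProperties(Properties props) {
        return (key, defaultValue) -> props.getProperty(key, defaultValue);
    }

    /** Example component that receives its configuration through the constructor. */
    static final class DepthLimit {
        private final int maxDepth;

        DepthLimit(CrawlerConfig config) {
            // "crawler.maxDepth" is an illustrative key, not a real JSpider property.
            this.maxDepth = Integer.parseInt(config.get("crawler.maxDepth", "3"));
        }

        boolean allows(int depth) {
            return depth <= maxDepth;
        }
    }
}
```

A unit test can then supply a stub such as new ConfigInjectionSketch.DepthLimit((key, defaultValue) -> "1") without touching any property files, while production code keeps loading the same files through fromProperties.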

This document was last modified by Philip Shaw on 2005-04-21 02:50:06
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html
