Crawler

Component description

The MKSearch crawler component is responsible for identifying Web documents to be indexed and passing them to custom content handlers for indexing. The crawler is largely composed of the JSpider Web spidering engine configured with custom MKSearch plugins that pass documents to the indexing component.

Completed task information has been moved to the beta 1 crawler plans archive.

Additional plugin types

The current MKSearch plugins extract metadata from HTML documents, identified by the text/html content type. The system also needs to process RSS 1.0 documents, so additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.
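Selecting a plugin by content type can be sketched as a small registry keyed on the media type. This is only an illustration under stated assumptions: the class name ContentTypeRegistry and the handler labels are hypothetical, and nothing here is taken from the JSpider or MKSearch APIs. The sketch strips any parameters such as a charset and ignores case before the lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: not part of JSpider or MKSearch.
final class ContentTypeRegistry {

    // Maps a normalised media type (e.g. "text/html") to a handler label.
    private final Map<String, String> handlers = new HashMap<>();

    void register(String mediaType, String handlerName) {
        handlers.put(normalise(mediaType), handlerName);
    }

    // Returns the handler registered for a Content-Type header value, if any.
    Optional<String> handlerFor(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(handlers.get(normalise(contentTypeHeader)));
    }

    // Drops parameters ("; charset=...") and lower-cases the media type.
    static String normalise(String contentType) {
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ContentTypeRegistry registry = new ContentTypeRegistry();
        registry.register("text/html", "html-metadata-handler");   // hypothetical label
        registry.register("application/rss+xml", "rss-handler");   // hypothetical label
        System.out.println(registry.handlerFor("text/html; charset=UTF-8"));
        System.out.println(registry.handlerFor("application/pdf"));
    }
}
```

A document whose content type has no registered handler yields an empty result, which the crawler could treat as a content type that is not parsed.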

Task progress:

Prepared a first draft RdfStoreWriterPlugin, ready for testing.
Created a set of RDF test case documents on the test site.

Beta 2 development plans

Correct JSpider issues

To date, a range of issues has been identified in the JSpider alpha 0.5 release, affecting how links are identified in HTML documents and various other aspects of the tool's operation. The issues have been logged as JSpider bug reports and must be corrected or worked around. Access has been granted to the JSpider CVS repository so that the corrections can be made upstream of MKSearch.

Store management plugins

MKSearch currently uses a single-pass indexing scheme where a new repository is created for each crawl session. The metadata store ultimately needs to be managed dynamically, so that incremental indexing can take place while a repository is servicing public queries on the front end. For instance, indexed documents that are subsequently removed from the origin server must also be removed from the MKSearch repository. This will require plugins that handle HTTP 404 (Not Found) and other error responses.

Prospective store management cases, discovered during normal link traversal:

Document was present at last pass, still present: replace all previous statements.
Document was not present at last pass, now accessible (for example, after a robots policy change): add statements.
Document was present at last pass, now missing or otherwise inaccessible: purge statements.
Document was present at last pass, now cannot be parsed: purge statements.
Document was present at last pass, now has a content type that is not parsed: purge statements.
Document was present at last pass, now blocked by robots policy: purge statements.
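The prospective cases above reduce to a small decision table, sketched here in Java. All names (ProspectiveStorePolicy, Outcome, Action) are hypothetical and are not MKSearch or JSpider types. The NO_ACTION result, for a document that was never indexed and still cannot be indexed, is an assumption that the list does not cover.

```java
// Hypothetical sketch of the prospective cases; not MKSearch code.
final class ProspectiveStorePolicy {

    // What the crawler observed for one URL during the current pass.
    enum Outcome {
        PARSED,
        MISSING_OR_INACCESSIBLE,
        PARSE_FAILED,
        CONTENT_TYPE_NOT_PARSED,
        BLOCKED_BY_ROBOTS
    }

    // What should happen to the statements held for that URL.
    enum Action { REPLACE_STATEMENTS, ADD_STATEMENTS, PURGE_STATEMENTS, NO_ACTION }

    static Action decide(boolean presentAtLastPass, Outcome outcome) {
        if (outcome == Outcome.PARSED) {
            // Still present: replace; newly accessible: add.
            return presentAtLastPass ? Action.REPLACE_STATEMENTS : Action.ADD_STATEMENTS;
        }
        // Every failure outcome purges a previously indexed document.
        // Assumption: with nothing stored, there is nothing to do.
        return presentAtLastPass ? Action.PURGE_STATEMENTS : Action.NO_ACTION;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, Outcome.PARSED));
        System.out.println(decide(false, Outcome.PARSED));
        System.out.println(decide(true, Outcome.BLOCKED_BY_ROBOTS));
    }
}
```

Keeping the decision separate from the crawl events means each error-handling plugin only has to report an outcome, and the purge rule is stated in one place.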

Retrospective store management cases, which cannot be discovered by normal link traversal:

Document was present at last pass, no longer linked from accessible documents: purge statements.
Document was present at last pass, now neither present nor linked from accessible documents: purge statements.
Site was present at last pass, no longer linked: purge statements.
Site was present at last pass, no longer configured to index: purge statements.
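Because the crawler never reaches these documents, one possible approach is to compare the set of URLs indexed at the last pass with the set visited in the current pass once the crawl has finished. The sketch below assumes that both sets, and the set of configured site host names, are available as plain strings. The class and method names are hypothetical and the approach is an illustration, not the planned MKSearch design.

```java
import java.net.URI;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a post-crawl sweep; not MKSearch code.
final class RetrospectivePurge {

    // Returns the indexed URLs whose statements should be purged:
    // those on sites no longer configured, and those the crawl never reached.
    static Set<String> toPurge(Set<String> indexedAtLastPass,
                               Set<String> visitedThisPass,
                               Set<String> configuredHosts) {
        Set<String> purge = new TreeSet<>();
        for (String url : indexedAtLastPass) {
            String host = URI.create(url).getHost();
            boolean siteConfigured = host != null && configuredHosts.contains(host);
            if (!siteConfigured || !visitedThisPass.contains(url)) {
                purge.add(url);
            }
        }
        return purge;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of(
                "http://a.example/1", "http://a.example/2", "http://b.example/1");
        Set<String> visited = Set.of("http://a.example/1");
        Set<String> hosts = Set.of("a.example");
        System.out.println(toPurge(indexed, visited, hosts));
    }
}
```

A sweep like this is only safe after a complete crawl; running it against a partial pass would purge documents that are simply not yet visited.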

This document was last modified by Philip Shaw on 2005-08-04 08:13:19
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html