Crawler

Component description

The MKSearch crawler component identifies Web documents to be indexed and passes them to custom content handlers for indexing. The crawler consists largely of the JSpider Web spidering engine, configured with custom MKSearch plugins that pass documents to the indexing component.

Completed task information has been moved to the beta 1 crawler plans archive.

Beta 2 development plans

Correct JSpider issues

A range of issues has been identified in the JSpider alpha 0.5 release, affecting how links are identified in HTML documents and various other aspects of the tool's operation. The issues have been logged as JSpider bug reports and must be corrected or worked around. Access has been granted to the JSpider CVS repository so that the corrections can be made upstream of MKSearch.

Store management plugins

MKSearch currently uses a single-pass indexing scheme in which a new repository is created for each crawl session. The metadata store ultimately needs to be managed dynamically, so that incremental indexing can take place while a repository is servicing public queries on the front end. For instance, indexed documents that are subsequently removed from the origin server must also be removed from the MKSearch repository. This will require plugins that handle HTTP 404 (Not Found) responses and other error types.

Prospective store management cases, discovered during normal link traversal:

  • Document was present at last pass, still present: replace all previous statements.
  • Document was not present at last pass, now accessible (for example, after a robots policy change): add statements.
  • Document was present at last pass, now missing or otherwise inaccessible: purge statements.
  • Document was present at last pass, now cannot be parsed: purge statements.
  • Document was present at last pass, now has a content type that is not parsed: purge statements.
  • Document was present at last pass, now blocked by robots policy: purge statements.
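
The prospective cases above reduce to a small decision table. The following Java sketch is one hypothetical way to encode it. The names `FetchOutcome`, `StoreAction` and `ProspectivePolicy` are illustrative assumptions and are not part of MKSearch or JSpider. The `NONE` action, for a document that was neither indexed at the last pass nor indexable now, is also an assumption, since the list does not cover that case.

```java
// Hypothetical types for illustration only; not MKSearch or JSpider API.
enum FetchOutcome {
    PARSED,         // document fetched and parsed successfully
    MISSING,        // HTTP 404 or otherwise inaccessible
    PARSE_FAILED,   // fetched, but the content could not be parsed
    UNPARSED_TYPE,  // served with a content type that is not parsed
    ROBOTS_BLOCKED  // robots policy now excludes the document
}

enum StoreAction { REPLACE, ADD, PURGE, NONE }

final class ProspectivePolicy {
    private ProspectivePolicy() {}

    /** Maps the state at the last pass and the current fetch outcome to a store action. */
    static StoreAction decide(boolean presentAtLastPass, FetchOutcome outcome) {
        if (outcome == FetchOutcome.PARSED) {
            // Still indexable: refresh existing statements or add new ones.
            return presentAtLastPass ? StoreAction.REPLACE : StoreAction.ADD;
        }
        // Not indexable now: purge if statements exist, otherwise nothing to do (assumed).
        return presentAtLastPass ? StoreAction.PURGE : StoreAction.NONE;
    }
}
```

For example, `ProspectivePolicy.decide(true, FetchOutcome.MISSING)` yields `PURGE`, which corresponds to the HTTP 404 case.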

Retrospective store management cases, which cannot be discovered by normal link traversal:

  • Document was present at last pass, no longer linked from accessible documents: purge statements.
  • Document was present at last pass, no longer present or linked from accessible documents: purge statements.
  • Site was present at last pass, no longer linked: purge statements.
  • Site was present at last pass, no longer configured for indexing: purge statements.
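
Because these cases are never encountered while following links, one possible approach is to compare the set of URLs held in the repository from the last pass with the set of URLs actually visited in the current pass, and to purge the difference. The Java sketch below illustrates that set difference only. The class `RetrospectiveSweep` and the method `staleUrls` are hypothetical names, and how either URL set would be obtained from MKSearch or JSpider is not specified here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not MKSearch or JSpider API.
final class RetrospectiveSweep {
    private RetrospectiveSweep() {}

    /**
     * Returns the URLs that were indexed at the last pass but were not
     * reached during the current pass. The input sets are left unmodified.
     */
    static Set<String> staleUrls(Set<String> indexedAtLastPass, Set<String> visitedThisPass) {
        Set<String> stale = new TreeSet<>(indexedAtLastPass); // copy, sorted for stable output
        stale.removeAll(visitedThisPass);                      // set difference
        return stale;
    }
}
```

Under this assumption, the site-level cases need no separate handling: documents on a site that is no longer linked or no longer configured are never visited, so their URLs fall into the stale set.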

Document Links

JSpider
The JSpider project page on SourceForge
http://sourceforge.net/projects/j-spider/
JSpider bug reports
The JSpider bug report listing at SourceForge
http://sourceforge.net/tracker/?group_id=65617&atid=511632
Dublin Core in HTML recommendation
The latest version of the Dublin Core in HTML recommendation with meta and link elements
http://dublincore.org/documents/dcq-html/
beta 1 crawler plans
Summary task and progress notes for the beta 1 release of the MKSearch crawler component
http://mksearch.mkdoc.org.archived.website/plans/beta-1-release-tasks/beta-1-crawler-plans/
This document was last modified on 2005-09-29 09:25:44.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html