Checker

Component description

The MKSearch checker component is essentially an integration layer between the data acquisition system and the repository, and it keeps the data store current. The checker does not exist as a distinct component in the alpha version of MKSearch; instead, a new repository is created for each crawl session.

Development plans

Repository management interfaces

A wrapper layer is required around the repository to maintain its contents dynamically and so permit incremental indexing. The wrapper will need to handle exception messages from the crawler, validator and indexer components in order to purge invalid records from the repository. It will also need to add the RDF statements generated by the indexer and, ultimately, purge stale entries from the result cache.
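
The responsibilities listed above can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the names Triple and RepositoryWrapper, the method signatures, and the use of plain Java collections in place of a real Sesame repository are hypothetical and are not part of the MKSearch code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an RDF statement (subject, predicate, object).
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical wrapper around the repository; not the MKSearch API.
class RepositoryWrapper {
    // Statements grouped by the resource (subject URI) they describe.
    private final Map<String, List<Triple>> bySubject = new HashMap<>();
    // Subjects whose cached search results can no longer be trusted.
    private final Set<String> staleCacheSubjects = new HashSet<>();

    // Adds statements generated by the indexer for one resource.
    void addStatements(String subject, List<Triple> triples) {
        bySubject.computeIfAbsent(subject, k -> new ArrayList<>()).addAll(triples);
        staleCacheSubjects.add(subject);
    }

    // Called when the crawler, validator or indexer reports a failure for a
    // resource: purges its records and returns how many were removed.
    int handleFailure(String subject) {
        List<Triple> removed = bySubject.remove(subject);
        staleCacheSubjects.add(subject);
        return removed == null ? 0 : removed.size();
    }

    // Total number of statements currently held.
    int size() {
        int total = 0;
        for (List<Triple> triples : bySubject.values()) {
            total += triples.size();
        }
        return total;
    }

    // Returns the subjects whose cache entries should be purged, then resets.
    Set<String> drainStaleCacheSubjects() {
        Set<String> drained = new HashSet<>(staleCacheSubjects);
        staleCacheSubjects.clear();
        return drained;
    }
}
```

Grouping statements by subject keeps a purge cheap, since one failure report removes every record for that resource in a single step. A real wrapper would delegate the same operations to the repository rather than to a map.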

Task progress

Completed a set of store management interfaces to insert and purge statements.
Prepared a set of concrete SubjectManager types to stand in place of plugin-based storage control:
- HtmlFileSystemSubjectManager handles HTML output for plugins such as JTidyFileWriterPlugin
- TripleFileSystemSubjectManager handles N-Triple output for plugins such as XhtmlTripleWriterPlugin
- LocalRepositorySubjectManager handles Sesame local repository storage for plugins such as XhtmlStoreWriterPlugin
Refactored all RdfContentHandler types and plugins to use the new store manager system via a StoreManagerFactory; a SubjectManager is allocated according to the PropertySet configured for the plugin.
Adapted all unit tests to the new scheme and completed coverage tests.
Store management interface complete
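
As a rough illustration of how a factory might allocate a SubjectManager from plugin configuration, the sketch below uses java.util.Properties as a stand-in for the PropertySet. The type names are taken from the list above, but the describe() method, the forProperties() signature, the "store.type" key and its values are all assumptions made for this example, not the actual MKSearch interfaces.

```java
import java.util.Properties;

// Assumed minimal interface; the real SubjectManager methods are not shown here.
interface SubjectManager {
    String describe();
}

class HtmlFileSystemSubjectManager implements SubjectManager {
    public String describe() { return "HTML files on the file system"; }
}

class TripleFileSystemSubjectManager implements SubjectManager {
    public String describe() { return "N-Triple files on the file system"; }
}

class LocalRepositorySubjectManager implements SubjectManager {
    public String describe() { return "Sesame local repository"; }
}

final class StoreManagerFactory {
    // Hypothetical configuration key; the real PropertySet keys may differ.
    static final String STORE_TYPE_KEY = "store.type";

    private StoreManagerFactory() { }

    // Picks a manager according to the plugin's configured store type.
    static SubjectManager forProperties(Properties props) {
        String type = props.getProperty(STORE_TYPE_KEY, "");
        switch (type) {
            case "html":
                return new HtmlFileSystemSubjectManager();
            case "ntriples":
                return new TripleFileSystemSubjectManager();
            case "sesame-local":
                return new LocalRepositorySubjectManager();
            default:
                throw new IllegalArgumentException("Unknown store type: " + type);
        }
    }
}
```

Under these assumptions, a plugin configured with store.type=ntriples would receive a TripleFileSystemSubjectManager without knowing which concrete class sits behind the interface.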

Check un-linked documents

At present, the crawler component drives the whole data acquisition process by following published hyperlinks, and a new repository is created for each session. With an incremental indexing scheme, however, previously indexed documents may be removed between sessions and become un-linked. In that case, the crawler will not discover that these resources are obsolete, because it only visits documents it can reach through links. The checker component therefore needs to check periodically whether "old" source documents still exist.
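
One way such a periodic check could work is sketched below. It is only an outline under stated assumptions: the class name StaleDocumentChecker, the recheck interval and the stillExists probe are hypothetical. In practice the probe might issue an HTTP request for each URL; here it is passed in as a predicate so that the logic stays independent of the network.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical checker for previously indexed documents; not MKSearch code.
class StaleDocumentChecker {
    // URL of each indexed document mapped to the time it was last confirmed.
    private final Map<String, Long> lastChecked = new LinkedHashMap<>();
    private final long recheckIntervalMillis;
    // Assumed probe that reports whether a source document still exists.
    private final Predicate<String> stillExists;

    StaleDocumentChecker(long recheckIntervalMillis, Predicate<String> stillExists) {
        this.recheckIntervalMillis = recheckIntervalMillis;
        this.stillExists = stillExists;
    }

    // Records that a document was indexed (or re-indexed) at the given time.
    void recordIndexed(String url, long nowMillis) {
        lastChecked.put(url, nowMillis);
    }

    // Probes every document that is due for a recheck and returns the URLs
    // that no longer exist, so the caller can purge them from the repository.
    List<String> findObsolete(long nowMillis) {
        List<String> obsolete = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastChecked.entrySet()) {
            if (nowMillis - entry.getValue() < recheckIntervalMillis) {
                continue; // confirmed recently enough
            }
            if (stillExists.test(entry.getKey())) {
                entry.setValue(nowMillis); // still there: reset the clock
            } else {
                obsolete.add(entry.getKey());
            }
        }
        // Stop tracking documents that have disappeared.
        lastChecked.keySet().removeAll(obsolete);
        return obsolete;
    }
}
```

Documents that the crawler re-indexes during a session would call recordIndexed again, so only documents the crawler has not reached recently are probed.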

This document was last modified by Philip Shaw on 2005-06-09 10:17:48
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html
