Indexer

Component description

The MKSearch indexer component is responsible for extracting metadata from Web documents. Its current implementation is a set of SAX content handlers and XML filters, which are limited to processing HTML meta elements. The content handlers are used in JSpider plugins and are triggered by document download callback events.
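As an illustration of the approach, a SAX content handler that picks out HTML meta elements might look like the sketch below. It is not the MKSearch code: the class name MetaElementHandler and its extract helper are hypothetical, and it assumes the input is well-formed XHTML with no DOCTYPE declaration, so the parser has no DTD to fetch.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not an MKSearch class): collects name/content
// pairs from meta elements as the parser reports them.
public class MetaElementHandler extends DefaultHandler {
    private final Map<String, String> metadata = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The parser created below is not namespace aware, so the
        // element name arrives in qName rather than localName.
        if (!"meta".equals(qName)) {
            return;
        }
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        // Skip meta elements that lack either attribute (e.g. http-equiv forms).
        if (name != null && content != null) {
            metadata.put(name, content);
        }
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    // Convenience helper: parse an XHTML string and return its meta name/content pairs.
    public static Map<String, String> extract(String xhtml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        MetaElementHandler handler = new MetaElementHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getMetadata();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head>"
                + "<meta name=\"DC.title\" content=\"Indexer\"/>"
                + "<meta name=\"DC.creator\" content=\"Philip Shaw\"/>"
                + "</head><body/></html>";
        System.out.println(extract(xhtml));
    }
}
```

In a crawler plugin, the same handler would be handed the downloaded document stream instead of a string. How JSpider delivers that stream is outside the scope of this sketch.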

Development plans

HTML link element indexing

The current content handlers process only HTML meta elements. The latest Dublin Core in HTML recommendation also allows link elements to be used. Supporting them will require a new type of content handler and an XhtmlLinkFilter, ultimately composed into a general-purpose XHTML metadata processor.

Task progress: Many of the original alpha classes have been refactored and new interfaces introduced to provide more flexible usage and integration.

XhtmlLinkFilter completed.
LinkTripleWriter completed.
LinkRDFStoreWriter completed.
Composite XhtmlTripleWriter completed.
Composite XhtmlRDFStoreWriter completed.
Full Dublin Core in HTML compatibility complete.
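
The link filtering idea described above can be sketched with the standard SAX XMLFilterImpl class. This is an assumption-laden illustration, not the project's XhtmlLinkFilter: the class name LinkOnlyFilter and the links helper are hypothetical, and the sketch simply forwards link elements to a downstream handler that records rel/href pairs.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter (not the MKSearch XhtmlLinkFilter): passes only
// link elements downstream and drops all other elements and text.
public class LinkOnlyFilter extends XMLFilterImpl {

    public LinkOnlyFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("link".equals(localName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("link".equals(localName)) {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data is not needed for link metadata, so it is dropped.
    }

    // Parse an XHTML string and return rel -> href for each link element.
    // A repeated rel value overwrites the earlier one in this simple map.
    public static Map<String, String> links(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final Map<String, String> found = new LinkedHashMap<>();
        LinkOnlyFilter filter = new LinkOnlyFilter(reader);
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Unprefixed attributes are in no namespace, hence the empty URI.
                String rel = atts.getValue("", "rel");
                String href = atts.getValue("", "href");
                if (rel != null && href != null) {
                    found.put(rel, href);
                }
            }
        });
        filter.parse(new InputSource(new StringReader(xhtml)));
        return found;
    }
}
```

Because the filter is itself an XMLReader, it can be chained with other filters, which is one way a composite processor for meta and link elements could be assembled.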

e-GIF compatibility

Currently, the system indexes only Dublin Core namespace metadata. The content handlers must be extended to cover UK e-GIF Metadata Standard markup. This standard includes records management fields specified by the National Archives' Requirements for Electronic Records Management Systems, and is based on the e-GMS Application Profile Version 1. This will require an e-GIF com.mkdoc.schema.Schema and a new RDF store writer implementation.

Task progress:

First draft UKeGMS Schema class prepared, ready to test.

RSS 1.0 processing

The current system only processes (X)HTML document metadata. MKSearch is also required to index RSS 1.0 feed metadata according to the RSS Dublin Core Module and RSS Qualified Dublin Core Module. This will require a different set of content handlers (and an additional JSpider plugin).
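
A content handler for RSS 1.0 feeds differs from the XHTML ones mainly in being namespace driven. The sketch below is a hypothetical illustration (RssDcHandler is not an MKSearch class): it collects simple Dublin Core elements found inside channel and item elements as subject/predicate/object string triples, using the rdf:about attribute as the subject. It does not attempt the nested structures of the Qualified Dublin Core Module.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns dc:* children of RSS 1.0 channel and item
// elements into {subject, predicate, object} string triples.
public class RssDcHandler extends DefaultHandler {
    static final String RSS = "http://purl.org/rss/1.0/";
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    private final List<String[]> triples = new ArrayList<>();
    private String subject;      // rdf:about of the enclosing channel/item, else null
    private StringBuilder text;  // non-null only while inside a dc element

    private static boolean isResource(String uri, String localName) {
        return RSS.equals(uri) && ("item".equals(localName) || "channel".equals(localName));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isResource(uri, localName)) {
            subject = atts.getValue(RDF, "about");
        } else if (subject != null && DC.equals(uri)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && DC.equals(uri)) {
            triples.add(new String[] {subject, DC + localName, text.toString().trim()});
            text = null;
        } else if (isResource(uri, localName)) {
            subject = null;
        }
    }

    // Parse a feed held in a string and return the collected triples.
    public static List<String[]> parse(String feed) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required to match on namespace URIs
        RssDcHandler handler = new RssDcHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(feed)), handler);
        return handler.triples;
    }
}
```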

Incremental indexing

The current indexing process is driven by worker threads in the JSpider Web crawler and results in a new RDF repository for each crawl session. This will be satisfactory for development and testing purposes and may suit some production configurations. However, incremental indexing will ultimately be required, so that "live" repositories can be updated while continuing to provide public query services. This development is co-dependent with more sophisticated repository management interfaces for the Triple Store component.
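
To make the idea of incremental indexing concrete, the sketch below keeps statements grouped by source document, so that re-indexing one document replaces only that document's statements while readers continue to query. It is purely illustrative: IncrementalStore is a hypothetical in-memory class and says nothing about the Triple Store component's actual interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store illustrating per-document replacement.
// Each statement is a {subject, predicate, object} string array.
public class IncrementalStore {
    // Document URL -> unmodifiable snapshot of the statements extracted from it.
    private final Map<String, List<String[]>> byDocument = new ConcurrentHashMap<>();

    // Replace all statements for one document in a single atomic map update,
    // so concurrent queries see either the old set or the new set.
    public void reindex(String documentUrl, List<String[]> statements) {
        byDocument.put(documentUrl,
                Collections.unmodifiableList(new ArrayList<>(statements)));
    }

    // Drop a document that has disappeared from the crawled site.
    public void remove(String documentUrl) {
        byDocument.remove(documentUrl);
    }

    // Return the objects of every statement with the given predicate.
    public List<String> objectsOf(String predicate) {
        List<String> result = new ArrayList<>();
        for (List<String[]> statements : byDocument.values()) {
            for (String[] s : statements) {
                if (predicate.equals(s[1])) {
                    result.add(s[2]);
                }
            }
        }
        return result;
    }
}
```

A crawl session would then call reindex for each downloaded document instead of building a fresh repository. A persistent RDF repository would need its own equivalent of this per-document replacement.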

This document was last modified by Philip Shaw on 2005-02-24 11:45:28
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html
