Indexer
Component description
The MKSearch indexer component is responsible for extracting metadata from Web documents. Its current implementation is a set of SAX content handlers and XML filters. The content handlers are used in JSpider plugins and are triggered by document download callback events.
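As a rough illustration of the content-handler approach, the sketch below collects Dublin Core properties from XHTML meta elements using the standard SAX API. The class name `DcMetaHandler` and the `extract` helper are hypothetical and are not taken from the MKSearch code base. It assumes well-formed XHTML input and the "DC." name prefix convention from the Dublin Core in HTML recommendation, and it leaves out the JSpider plugin wiring and the XML filters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not an MKSearch class): collects Dublin Core
// properties from XHTML <meta name="DC.*" content="..."/> elements.
class DcMetaHandler extends DefaultHandler {
    private final List<String[]> properties = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The parser is namespace-aware, so localName is "meta" whatever the prefix.
        if (!"meta".equals(localName)) {
            return;
        }
        // Unprefixed attributes are in no namespace, hence the empty URI.
        String name = atts.getValue("", "name");
        String content = atts.getValue("", "content");
        if (name != null && content != null && name.startsWith("DC.")) {
            properties.add(new String[] {name, content});
        }
    }

    List<String[]> getProperties() {
        return properties;
    }

    // Parses an XHTML string and returns {name, content} pairs in document order.
    static List<String[]> extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DcMetaHandler handler = new DcMetaHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getProperties();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title>"
                + "<meta name=\"DC.title\" content=\"Test page\"/>"
                + "<meta name=\"keywords\" content=\"ignored\"/>"
                + "</head><body/></html>";
        for (String[] property : extract(page)) {
            System.out.println(property[0] + " = " + property[1]);
        }
    }
}
```

In a crawler plugin, a handler of this kind would be fed the downloaded document stream rather than a string, and the collected pairs would be written to the RDF repository instead of printed.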
Completed task information has been moved to the beta 1 indexer plans archive.
Beta 2 development plans
- Incremental indexing
-
The current indexing process is driven by worker threads in the JSpider Web crawler and results in a new RDF repository for each crawl session. This is satisfactory for development and testing purposes and may suit some production configurations. However, incremental indexing will ultimately be required so that "live" repositories can be updated while continuing to provide public query services. This development is co-dependent with more sophisticated repository management interfaces for the Triple Store component.
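The idea of updating a live repository one document at a time can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: `IncrementalIndex` is a hypothetical class, statements are plain strings rather than RDF triples, and the real work would go through the Triple Store component's repository interfaces, which this sketch does not model.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for a repository that is updated per document.
class IncrementalIndex {
    // Document URL -> immutable snapshot of the statements extracted from it.
    private final Map<String, Set<String>> byDocument = new ConcurrentHashMap<>();

    // Replaces all statements for one document in a single map update, so
    // readers see either the old snapshot or the new one, never a mixture.
    void reindex(String documentUrl, Set<String> statements) {
        byDocument.put(documentUrl, Collections.unmodifiableSet(new HashSet<>(statements)));
    }

    // Drops a document that is no longer available on the crawled site.
    void remove(String documentUrl) {
        byDocument.remove(documentUrl);
    }

    // Toy query: returns every statement containing the given text.
    Set<String> query(String text) {
        return byDocument.values().stream()
                .flatMap(Set::stream)
                .filter(statement -> statement.contains(text))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        IncrementalIndex index = new IncrementalIndex();
        String url = "http://example.org/a";
        index.reindex(url, new HashSet<>(Arrays.asList(url + " dc:title Old title")));
        // A later crawl finds changed metadata; only this document is replaced.
        index.reindex(url, new HashSet<>(Arrays.asList(url + " dc:title New title")));
        System.out.println(index.query("dc:title"));
    }
}
```

The point of the design is that a re-crawled document replaces only its own statements, so queries against the rest of the repository can continue while the update happens.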
Development plans
- RSS 1.0 processing
-
The current system only processes (X)HTML document metadata. MKSearch is also required to index RSS 1.0 feed metadata according to the RSS Dublin Core Module and RSS Qualified Dublin Core Module. This will require a different set of content handlers (and an additional JSpider plugin).
Task progress:
- Draft RdfStoreWriterPlugin class and RdfContentTypeOnly rule prepared.
- Test indexing successful, reviewing plugin configuration.
- Draft
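A separate handler for RSS 1.0 feeds might look like the sketch below. It is not the draft RdfStoreWriterPlugin; `RssDcHandler` and its `extract` helper are hypothetical names. It assumes the namespace URIs published for RSS 1.0, the Dublin Core element set, and Dublin Core terms, and it only handles properties whose value is simple text inside a channel or item element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns dc: and dcterms: children of RSS 1.0
// channel and item elements into {subject, predicate, literal} statements.
class RssDcHandler extends DefaultHandler {
    static final String RSS = "http://purl.org/rss/1.0/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String DCTERMS = "http://purl.org/dc/terms/";

    private final List<String[]> statements = new ArrayList<>();
    private String subject;      // rdf:about of the enclosing channel or item
    private StringBuilder text;  // non-null only while inside a DC element

    private static boolean isResource(String uri, String localName) {
        return RSS.equals(uri) && ("item".equals(localName) || "channel".equals(localName));
    }

    private static boolean isDublinCore(String uri) {
        return DC.equals(uri) || DCTERMS.equals(uri);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isResource(uri, localName)) {
            subject = atts.getValue(RDF, "about");
        } else if (subject != null && isDublinCore(uri)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && isDublinCore(uri)) {
            // The predicate URI is the namespace followed by the element name.
            statements.add(new String[] {subject, uri + localName, text.toString().trim()});
            text = null;
        } else if (isResource(uri, localName)) {
            subject = null;
        }
    }

    // Parses a feed string and returns statements in document order.
    static List<String[]> extract(String feed) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RssDcHandler handler = new RssDcHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(feed)), handler);
            return handler.statements;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns=\"" + RSS + "\" xmlns:dc=\"" + DC + "\">"
                + "<item rdf:about=\"http://example.org/a\"><title>A</title>"
                + "<link>http://example.org/a</link><dc:creator>Jane</dc:creator></item></rdf:RDF>";
        for (String[] s : extract(feed)) {
            System.out.println(s[0] + " " + s[1] + " " + s[2]);
        }
    }
}
```

The plain RSS title and link elements are skipped on purpose, since the requirement above covers only the two Dublin Core modules.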
Document Links
- Dublin Core in HTML recommendation
-
The latest version of the recommendation for expressing Dublin Core in HTML meta and link elements.
http://dublincore.org/documents/dcq-html/
- e-GIF Metadata Standard
-
The document download page for version 3.0 of the e-GIF Metadata Standard.
http://www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=872
- RSS 1.0
-
The RDF Site Summary version 1.0 specification.
http://purl.org/rss/1.0/spec
- RSS Dublin Core Module
-
The RSS 1.0 Dublin Core metadata module specification
http://purl.org/rss/1.0/modules/dc/
- RSS Qualified Dublin Core Module
-
The RSS 1.0 Qualified Dublin Core metadata module specification
http://purl.org/rss/1.0/modules/dcterms/
- Requirements for Electronic Records Management Systems
-
Requirements specification from the National Archives
http://www.nationalarchives.gov.uk/electronicrecords/reqs2002/pdf/metadatafinal.pdf
- e-GMS Application Profile Version 1
-
An explicit mapping for e-GMS metadata elements.
http://www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=805
- Crawler development plans
-
Development plans for the MKSearch Crawler component
http://mksearch.mkdoc.org.archived.website/plans/crawler/
- Test document Web site
-
Static test documents for MKSearch indexer and crawler
http://test.mksearch.mkdoc.org/
- Beta 1 indexer plans
-
Archived task and progress notes for the beta 1 indexer
http://mksearch.mkdoc.org.archived.website/plans/beta-1-release-tasks/beta-1-indexer-plans/
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html