Indexer
Component description
The MKSearch indexer component is responsible for extracting metadata from Web documents. Its current implementation takes the form of a set of SAX content handlers and XML filters. The content handlers are used in JSpider plugins and are triggered by document download callback events.
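To illustrate the general approach, the following is a minimal sketch of a SAX content handler that collects name/content pairs from meta elements. It is not the MKSearch code: the class name MetaHandlerSketch and the extract method are hypothetical, the input is assumed to be well-formed XHTML, and the JSpider plugin wiring that would trigger the handler is omitted.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not an MKSearch class): collects meta name/content pairs. */
class MetaHandlerSketch extends DefaultHandler {
    private final Map<String, String> metadata = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default parser is not namespace-aware, so the element name arrives in qName.
        if ("meta".equals(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                metadata.put(name, content);
            }
        }
    }

    /** Parses a well-formed XHTML string and returns the metadata found in it. */
    static Map<String, String> extract(String xhtml) {
        MetaHandlerSketch handler = new MetaHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        return handler.metadata;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Indexer\"/>"
            + "<meta name=\"DC.creator\" content=\"MKDoc Ltd.\"/>"
            + "</head><body/></html>";
        System.out.println(extract(page));
    }
}
```

In a crawler, a handler of this kind would be given each downloaded document in turn; here it is fed a literal string so that the sketch runs on its own.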
Development plans
- HTML link element indexing
The current content handlers only process HTML meta elements. The latest Dublin Core in HTML recommendation also allows link elements to be used. This will require a new type of content handler and an XhtmlLinkFilter, ultimately composed into a general purpose XHTML metadata processor.
Task progress:
Many of the original alpha classes have been refactored and new interfaces introduced to provide more flexible usage and integration.
  - XhtmlLinkFilter completed.
  - LinkTripleWriter completed.
  - LinkRDFStoreWriter completed.
  - Composite XhtmlTripleWriter completed.
  - Composite XhtmlRDFStoreWriter completed.
  - Full Dublin Core in HTML compatibility complete.
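The idea of filtering link elements out of an XHTML event stream can be sketched with the standard SAX XMLFilterImpl class. This is an illustration only, not the XhtmlLinkFilter source: the class LinkFilterSketch, its links method and the "rel href" output format are assumptions made for the example, and the input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: forwards only link elements that carry both rel and href. */
class LinkFilterSketch extends XMLFilterImpl {
    private boolean forwarding = false;

    LinkFilterSketch(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("link".equals(qName) && atts.getValue("rel") != null && atts.getValue("href") != null) {
            forwarding = true;
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Close only the link elements that were forwarded; drop every other end tag.
        if (forwarding && "link".equals(qName)) {
            super.endElement(uri, localName, qName);
            forwarding = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text content is not needed downstream, so it is dropped.
    }

    /** Returns "rel href" strings for each link element that passes the filter. */
    static List<String> links(String xhtml) {
        List<String> found = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LinkFilterSketch filter = new LinkFilterSketch(reader);
            // The downstream handler only ever sees the filtered link elements.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes atts) {
                    found.add(atts.getValue("rel") + " " + atts.getValue("href"));
                }
            });
            filter.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(links("<html><head>"
            + "<link rel=\"schema.DC\" href=\"http://purl.org/dc/elements/1.1/\"/>"
            + "<link rel=\"DC.relation\" href=\"http://example.org/\"/>"
            + "</head><body/></html>"));
    }
}
```

Because the filter is itself an XMLReader, a downstream handler that writes triples or populates an RDF store could be attached in place of the anonymous handler used here.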
- e-GIF compatibility
Currently the system only indexes Dublin Core namespace metadata. The content handlers must be extended to cover UK e-GIF Metadata Standard markup. This standard includes records management fields specified by the National Archives' Requirements for Electronic Records Management Systems, and is based on the e-GMS Application Profile Version 1. This will require an e-GIF com.mkdoc.schema.Schema and a new RDF store writer implementation.
Task progress:
  - UKeGMS Schema class completed to e-GMS Application Profile Version 1 specification.
  - Custom schema configuration introduced to the XhtmlTripleWriterPlugin class to enable e-GMS indexing. See Crawler development plans.
  - Extended the test document Web site to include all e-GMS elements, refinements and encoding schemes.
  - Custom schema configuration introduced to the XhtmlStoreWriterPlugin class to enable e-GMS indexing. See Crawler development plans.
  - Full e-GIF compatibility complete.
- RSS 1.0 processing
The current system only processes (X)HTML document metadata. MKSearch is also required to index RSS 1.0 feed metadata according to the RSS Dublin Core Module and RSS Qualified Dublin Core Module. This will require a different set of content handlers (and an additional JSpider plugin).
Task progress:
  - Draft RdfStoreWriterPlugin class and RdfContentTypeOnly rule prepared.
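As a rough sketch of what an RSS 1.0 content handler might do, the example below uses a namespace-aware SAX parser to turn Dublin Core Module elements inside channel and item elements into subject/predicate/value triples. The class RssDublinCoreSketch and its triples method are hypothetical and are not part of MKSearch. Only simple literal values from the Dublin Core Module are handled; the structured values permitted by the Qualified Dublin Core Module are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: extracts Dublin Core literals from RSS 1.0 channels and items. */
class RssDublinCoreSketch extends DefaultHandler {
    static final String RSS = "http://purl.org/rss/1.0/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";

    private final List<String[]> triples = new ArrayList<>();
    private String subject;       // rdf:about of the enclosing channel or item
    private StringBuilder text;   // non-null only while inside a Dublin Core element

    private static boolean isResource(String uri, String localName) {
        return RSS.equals(uri) && ("item".equals(localName) || "channel".equals(localName));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isResource(uri, localName)) {
            subject = atts.getValue(RDF, "about");
        } else if (DC.equals(uri) && subject != null) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (DC.equals(uri) && text != null) {
            // Predicate URI is the namespace URI followed by the element's local name.
            triples.add(new String[] {subject, uri + localName, text.toString().trim()});
            text = null;
        } else if (isResource(uri, localName)) {
            subject = null;
        }
    }

    /** Parses an RSS 1.0 feed and returns {subject, predicate, value} arrays. */
    static List<String[]> triples(String feed) {
        RssDublinCoreSketch handler = new RssDublinCoreSketch();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(feed)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
        return handler.triples;
    }

    public static void main(String[] args) {
        String feed = "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns=\"" + RSS + "\" xmlns:dc=\"" + DC + "\">"
            + "<item rdf:about=\"http://example.org/a\"><title>A</title>"
            + "<dc:creator>Jane</dc:creator></item></rdf:RDF>";
        for (String[] t : triples(feed)) {
            System.out.println(String.join(" ", t));
        }
    }
}
```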
- Application profiles
The original indexing system used the Schema interface to expand metadata values to URIs. Extension methods were added to the interface to permit the type of mixed form used in the UK e-GMS schema, which shares elements with the Dublin Core element set and provides its own refinements. An ApplicationProfile interface is required to allow more flexible configuration and enable dynamic configuration of the Query component.
ApplicationProfile implementations will also need methods to iterate through all the predicates they contain, so that search forms can be generated dynamically in the Query component.
Task progress:
  - Refactored the com.mkdoc.schema and com.mkdoc.sax packages to introduce the new ApplicationProfile interface.
  - Introduced DublinCoreProfile and UKeGMSProfile classes in place of the former Schema types.
  - Completed DublinCoreProfile and AbstractApplicationProfile.
  - Completed UKeGMSProfile.
  - Custom application profiles complete (indexing functions).
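The two responsibilities described for an application profile, expanding metadata names to predicate URIs and iterating over the predicates it contains, might look roughly like the sketch below. The actual ApplicationProfile method signatures are not given here, so the Profile interface, the PrefixProfile class and the addElement method are all hypothetical. The Dublin Core namespace URI is the real one; the second namespace is a placeholder standing in for a profile's own elements, not the real e-GMS namespace.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a profile that mixes elements from several namespaces. */
class ProfileSketch {

    /** Assumed shape only: expands names such as "DC.title" and iterates its predicates. */
    interface Profile extends Iterable<String> {
        String expand(String metaName);
    }

    static class PrefixProfile implements Profile {
        private final Map<String, String> namespaces = new LinkedHashMap<>();
        private final Set<String> predicates = new LinkedHashSet<>();

        /** Registers one element under a prefix, e.g. ("DC", dcNamespace, "title"). */
        PrefixProfile addElement(String prefix, String namespace, String element) {
            namespaces.put(prefix, namespace);
            predicates.add(namespace + element);
            return this;
        }

        @Override
        public String expand(String metaName) {
            int dot = metaName.indexOf('.');
            if (dot < 0) {
                return null;
            }
            String namespace = namespaces.get(metaName.substring(0, dot));
            if (namespace == null) {
                return null;
            }
            String uri = namespace + metaName.substring(dot + 1);
            // Only names that the profile actually contains are expanded.
            return predicates.contains(uri) ? uri : null;
        }

        @Override
        public Iterator<String> iterator() {
            return predicates.iterator();
        }
    }

    public static void main(String[] args) {
        PrefixProfile profile = new PrefixProfile()
            .addElement("DC", "http://purl.org/dc/elements/1.1/", "title")
            .addElement("DC", "http://purl.org/dc/elements/1.1/", "creator")
            // Placeholder namespace and element for illustration only.
            .addElement("EX", "http://example.org/profile/", "audience");
        System.out.println(profile.expand("DC.title"));
        for (String predicate : profile) {
            System.out.println(predicate);   // a query form could list one field per predicate
        }
    }
}
```

Refinement names such as those with a second dotted part are not handled by this sketch and simply return null.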
- Incremental indexing
The current indexing process is driven by worker threads in the JSpider Web crawler and results in a new RDF repository for each crawl session. This will be satisfactory for development and testing purposes and may suit some production configurations. However, incremental indexing will ultimately be required, so that "live" repositories can be updated while continuing to provide public query services. This development is co-dependent with more sophisticated repository management interfaces for the Triple Store component.
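One possible way to update a live index while queries continue is to publish immutable snapshots and swap them atomically. The sketch below shows that idea with an in-memory map keyed by document URI. It is an assumption about how incremental indexing could be approached, not a description of the planned MKSearch design or of any Triple Store interface, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: per-document replacement of statements behind an atomic snapshot. */
class IncrementalIndexSketch {
    // Readers always see a complete, immutable snapshot of the index.
    private final AtomicReference<Map<String, List<String>>> snapshot =
        new AtomicReference<>(Map.of());

    /** Replaces all statements for one document; other documents are left untouched. */
    void reindex(String documentUri, List<String> statements) {
        snapshot.updateAndGet(current -> {
            Map<String, List<String>> next = new HashMap<>(current);
            if (statements.isEmpty()) {
                next.remove(documentUri);   // the document no longer carries metadata
            } else {
                next.put(documentUri, List.copyOf(statements));
            }
            return Map.copyOf(next);
        });
    }

    /** Query side: never blocks and never observes a half-applied update. */
    List<String> statementsFor(String documentUri) {
        return snapshot.get().getOrDefault(documentUri, List.of());
    }

    int documentCount() {
        return snapshot.get().size();
    }

    public static void main(String[] args) {
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        index.reindex("http://example.org/a", List.of("DC.title=First version"));
        index.reindex("http://example.org/a", List.of("DC.title=Second version"));
        System.out.println(index.statementsFor("http://example.org/a"));
    }
}
```

Copying the whole map on every update is only reasonable for small indexes; a real repository would need the finer-grained management interfaces mentioned above.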
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html