Indexer
Component description
The MKSearch indexer component is responsible for extracting metadata from Web documents. Its current implementation is a set of SAX content handlers and XML filters, which are limited to processing HTML meta elements. The content handlers are used in JSpider plugins and are triggered by document download callback events.
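As a rough illustration of the approach described above, the following sketch shows a SAX content handler that collects Dublin Core values from meta elements. It is not the MKSearch code: the class name MetaElementHandlerSketch, the extract helper and the "DC." name prefix test are assumptions made for this example, and the input is assumed to be well-formed XHTML without a DOCTYPE. The JSpider callback wiring is omitted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the MKSearch implementation.
public class MetaElementHandlerSketch extends DefaultHandler {

    // Collected "name=content" pairs, in document order.
    private final List<String> metadata = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The parser below is not namespace-aware, so the element name arrives in qName.
        if (!"meta".equals(qName)) {
            return;
        }
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        // Assumption: Dublin Core properties are written with a "DC." prefix.
        if (name != null && content != null && name.startsWith("DC.")) {
            metadata.add(name + "=" + content);
        }
    }

    public List<String> getMetadata() {
        return metadata;
    }

    // Parses a well-formed XHTML string and returns the Dublin Core pairs found.
    public static List<String> extract(String xhtml) {
        try {
            MetaElementHandlerSketch handler = new MetaElementHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getMetadata();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Indexer\"/>"
            + "<meta name=\"generator\" content=\"example\"/>"
            + "</head><body/></html>";
        System.out.println(extract(page));
    }
}
```

In a crawler, a handler of this kind would be invoked from the document download callback rather than from main, with the downloaded content passed to the parser in place of the inline string.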
Development plans
- HTML link element indexing - The current content handlers only process HTML meta elements. The latest Dublin Core in HTML recommendation also allows link elements to be used. This will require a new type of content handler and an XhtmlLinkFilter, ultimately composed into a general purpose XHTML metadata processor.
- e-GIF compatibility - Currently the system only indexes Dublin Core namespace metadata. The content handlers must be extended to cover UK e-GIF Metadata Standard markup. This will require an e-GIF com.mkdoc.schema.Schema and a new RDF store writer implementation.
- RSS 1.0 processing - The current system only processes (X)HTML document metadata. MKSearch is also required to index RSS feed metadata, which will require a different set of content handlers (and an additional JSpider plugin).
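The planned link element indexing could take roughly the following shape. This is a minimal sketch only, assuming the filter is built on the standard SAX XMLFilterImpl class. The class name LinkFilterSketch, the extract helper and the "DC." prefix test on the rel attribute are hypothetical, and the eventual XhtmlLinkFilter may be designed quite differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of a link element filter, not the planned XhtmlLinkFilter itself.
public class LinkFilterSketch extends XMLFilterImpl {

    // Collected "rel=href" pairs, in document order.
    private final List<String> links = new ArrayList<>();

    public LinkFilterSketch(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("link".equals(qName)) {
            String rel = atts.getValue("rel");
            String href = atts.getValue("href");
            // Assumption: Dublin Core relations use a "DC." prefix in rel;
            // schema declarations such as rel="schema.DC" are skipped.
            if (rel != null && href != null && rel.startsWith("DC.")) {
                links.add(rel + "=" + href);
            }
        }
        // Forward the event so that any downstream handler still sees the whole document.
        super.startElement(uri, localName, qName, atts);
    }

    public List<String> getLinks() {
        return links;
    }

    // Parses a well-formed XHTML string through the filter and returns the pairs found.
    public static List<String> extract(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LinkFilterSketch filter = new LinkFilterSketch(reader);
            filter.parse(new InputSource(new StringReader(xhtml)));
            return filter.getLinks();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }
}
```

Because the filter forwards every event to its own content handler, it could be chained in front of a meta element handler, which is one way the two might be composed into a single XHTML metadata processor.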
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html