Beta 1 crawler plans
This page lists summary task and progress notes for the beta 1 release of the MKSearch crawler component. This is an archive page.
Development plans
- Custom schema plugin types
The default metadata schema support includes Dublin Core elements and qualifiers. However, government Web sites may have their own metadata schemas, such as the UK e-Government Metadata Standard, which need to be integrated with these schemes. The crawler plugins need to allow for custom Schema support.
Task progress:
  - Re-factored the plugin class hierarchy to share more common functionality.
  - XhtmlTripleWriterPlugin completed.
  - XhtmlStoreWriterPlugin completed.
  - Custom schema support complete.
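As an illustration of what custom schema support could involve, the following is a minimal sketch and not the MKSearch code. The names MetaSchema, PrefixSchema and SchemaRegistry are hypothetical, the "DC." meta name prefix follows the common Dublin Core in HTML convention, and the http://example.org/egms/ namespace is a placeholder rather than the real e-GMS namespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical schema type (not the MKSearch Schema class): maps meta names to property URIs. */
interface MetaSchema {
    /** Returns the property URI for a meta name, or empty if this schema does not cover it. */
    Optional<String> propertyFor(String metaName);
}

/** Resolves names of the form "PREFIX.element" against a single namespace. */
final class PrefixSchema implements MetaSchema {
    private final String prefix;     // for example "DC."
    private final String namespace;  // for example "http://purl.org/dc/elements/1.1/"

    PrefixSchema(String prefix, String namespace) {
        this.prefix = prefix;
        this.namespace = namespace;
    }

    @Override
    public Optional<String> propertyFor(String metaName) {
        // The prefix is compared case-insensitively; the local name is kept as written.
        if (metaName == null
                || !metaName.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return Optional.empty();
        }
        String localName = metaName.substring(prefix.length());
        return localName.isEmpty() ? Optional.empty() : Optional.of(namespace + localName);
    }
}

/** Tries each registered schema in order, so custom schemas sit beside the default one. */
final class SchemaRegistry implements MetaSchema {
    private final List<MetaSchema> schemas = new ArrayList<>();

    SchemaRegistry add(MetaSchema schema) {
        schemas.add(schema);
        return this;
    }

    @Override
    public Optional<String> propertyFor(String metaName) {
        for (MetaSchema schema : schemas) {
            Optional<String> uri = schema.propertyFor(metaName);
            if (uri.isPresent()) {
                return uri;
            }
        }
        return Optional.empty();
    }
}
```

Under these assumptions, a registry holding a "DC." schema and an "eGMS." schema would resolve DC.title to http://purl.org/dc/elements/1.1/title and leave an unprefixed name such as keywords unresolved.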
- Custom application profile support
The dynamic interface of the Query component has brought forward the need for an ApplicationProfile interface to adapt various Schema types. The initial implementation of these types is compatible with the custom Schema configuration mechanism in the JSpider plugins, but should ultimately change to custom ApplicationProfile.
Task progress:
  - Refactored AbstractRdfContentHandler, concrete types and plugins to the new interface.
  - Custom application profile support complete.
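The idea of a profile that adapts several schemas into one view can be sketched as follows. This is an assumption about the general shape only: NamedSchema, Profile and CombinedProfile are hypothetical names, and the real ApplicationProfile and Schema interfaces in MKSearch may look quite different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical schema: a namespace plus the element names it defines. */
final class NamedSchema {
    private final String namespace;
    private final List<String> elements;

    NamedSchema(String namespace, List<String> elements) {
        this.namespace = namespace;
        this.elements = new ArrayList<>(elements);
    }

    /** Expands each element name into a full property URI. */
    List<String> propertyUris() {
        List<String> uris = new ArrayList<>();
        for (String element : elements) {
            uris.add(namespace + element);
        }
        return uris;
    }
}

/** Hypothetical profile: the single view that a query interface would consume. */
interface Profile {
    List<String> propertyUris();
    boolean supports(String propertyUri);
}

/** Adapts several schemas into one profile, keeping first-seen order without duplicates. */
final class CombinedProfile implements Profile {
    private final List<String> uris = new ArrayList<>();

    CombinedProfile(List<NamedSchema> schemas) {
        for (NamedSchema schema : schemas) {
            for (String uri : schema.propertyUris()) {
                if (!uris.contains(uri)) {
                    uris.add(uri);
                }
            }
        }
    }

    @Override
    public List<String> propertyUris() {
        return Collections.unmodifiableList(uris);
    }

    @Override
    public boolean supports(String propertyUri) {
        return uris.contains(propertyUri);
    }
}
```

With this arrangement a caller asks the profile which properties exist and never needs to know how many schemas sit behind it, which is one plausible reason for moving configuration from individual schemas to a profile.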
- Custom Rule types
The standard rule set for JSpider includes a "parse only text/html" rule used for standard Web content. MKSearch will require other rules for RDF and RSS content types, and perhaps PDF and other document types.
Task progress:
  - Draft RdfContentTypeOnly rule prepared for testing.
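A "parse only this content type" rule can be sketched in a few lines. The example below deliberately does not use the JSpider rule API: Decision, ContentTypeRule and ContentTypeOnlyRule are hypothetical stand-ins, and the only outside fact relied on is that application/rdf+xml is the registered media type for RDF/XML.

```java
import java.util.Locale;

/** Hypothetical outcome of a rule; not the JSpider decision type. */
enum Decision { ACCEPT, IGNORE }

/** Hypothetical rule contract: decide from the Content-Type header value alone. */
interface ContentTypeRule {
    Decision apply(String contentTypeHeader);
}

/** Accepts a resource only when its media type equals the configured one. */
final class ContentTypeOnlyRule implements ContentTypeRule {
    private final String acceptedType;

    ContentTypeOnlyRule(String acceptedType) {
        this.acceptedType = acceptedType.toLowerCase(Locale.ROOT);
    }

    @Override
    public Decision apply(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return Decision.IGNORE; // no header, nothing to match
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals(acceptedType) ? Decision.ACCEPT : Decision.IGNORE;
    }
}
```

For example, new ContentTypeOnlyRule("application/rdf+xml") would accept a header of "application/rdf+xml; charset=UTF-8" and ignore "text/html".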
- Additional plugin types
The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the text/html content type. The system also needs to process RSS 1.0 documents and additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.
Task progress:
  - Prepared a first draft RdfStoreWriterPlugin, ready for testing.
  - Created a set of RDF test case documents on the test site. Indexing successful.
  - Created a set of RSS test case documents on the test site. Indexing successful.
  - Additional plugin types completed.
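One way to picture several plugins working side by side is a lookup from content type to plugin. The sketch below is an assumption about how such routing might work, not a description of the MKSearch or JSpider plugin mechanism; ContentPlugin and PluginDispatcher are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plugin contract, reduced to a label for the sake of the example. */
interface ContentPlugin {
    String describe();
}

/** Routes a fetched resource to the plugin registered for its media type. */
final class PluginDispatcher {
    private final Map<String, ContentPlugin> pluginsByType = new HashMap<>();

    void register(String mediaType, ContentPlugin plugin) {
        pluginsByType.put(normalise(mediaType), plugin);
    }

    /** Returns the matching plugin, or null when no plugin handles the type. */
    ContentPlugin pluginFor(String contentTypeHeader) {
        return contentTypeHeader == null ? null : pluginsByType.get(normalise(contentTypeHeader));
    }

    /** Strips header parameters and lower-cases the media type. */
    private static String normalise(String value) {
        int semicolon = value.indexOf(';');
        String mediaType = semicolon >= 0 ? value.substring(0, semicolon) : value;
        return mediaType.trim().toLowerCase(Locale.ROOT);
    }
}
```

Registering one plugin for text/html and another for application/rss+xml would then leave unsupported types, such as application/pdf, without a handler until a plugin is added for them.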
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html