Beta 1 crawler plans

This page lists summary task and progress notes for the beta 1 release of the MKSearch crawler component. It is an archive page.

Development plans

Custom schema plugin types

The default metadata schema support includes Dublin Core elements and qualifiers. However, government Web sites may have their own metadata schemas, such as the UK e-Government Metadata Standard, which need to be integrated with the Dublin Core schemas. The crawler plugins therefore need to allow for custom Schema support.

Task progress: Refactored the plugin class hierarchy to share more common functionality.

  • XhtmlTripleWriterPlugin completed.
  • XhtmlStoreWriterPlugin completed.
  • Custom schema support complete.
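
As a rough illustration of what custom schema support involves, the sketch below resolves prefixed metadata names such as "DC.title" (the common way of embedding Dublin Core in HTML meta elements) to property URIs, and lets a further schema be registered alongside Dublin Core. The class name SchemaRegistry, its methods and the example.org namespace used for the e-Government schema are hypothetical placeholders, not part of the MKSearch or JSpider code. Qualified names such as "DC.date.modified" are not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: maps schema prefixes to namespace URIs. */
final class SchemaRegistry {
    private final Map<String, String> namespaces = new HashMap<>();

    SchemaRegistry() {
        // The Dublin Core element set namespace is registered by default.
        register("DC", "http://purl.org/dc/elements/1.1/");
    }

    /** Registers a custom schema; prefixes are compared case-insensitively. */
    void register(String prefix, String namespaceUri) {
        namespaces.put(prefix.toLowerCase(Locale.ROOT), namespaceUri);
    }

    /**
     * Resolves a name such as "DC.title" to a property URI.
     * Returns empty if there is no prefix or the prefix is unknown.
     */
    Optional<String> resolve(String metaName) {
        int dot = metaName.indexOf('.');
        if (dot <= 0 || dot == metaName.length() - 1) {
            return Optional.empty();
        }
        String prefix = metaName.substring(0, dot).toLowerCase(Locale.ROOT);
        String namespaceUri = namespaces.get(prefix);
        if (namespaceUri == null) {
            return Optional.empty();
        }
        return Optional.of(namespaceUri + metaName.substring(dot + 1));
    }
}
```

With this sketch, a call to register("eGMS", "http://example.org/egms/") would let resolve("eGMS.audience") return a URI in the placeholder namespace, while an unprefixed name such as "keywords" resolves to nothing.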

Custom application profile support

The dynamic interface of the Query component has brought forward the need for an ApplicationProfile interface to adapt various Schema types. The initial implementation of these types is compatible with the custom Schema configuration mechanism in the JSpider plugins, but the plugins should ultimately be configured with a custom ApplicationProfile instead.

Task progress: Refactored AbstractRdfContentHandler, concrete types and plugins to new interface.

  • Custom application profile support complete.

Custom Rule types

The standard rule set for JSpider includes a "parse only text/html" rule used for standard Web content. MKSearch will require other rules for RDF and RSS content types, and perhaps PDF and other document types.

Task progress:

  • Draft RdfContentTypeOnly rule prepared for testing.
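
The idea behind a content type rule can be sketched as follows. This is a minimal, self-contained illustration and does not use the JSpider rule API: the class ContentTypeRule and its methods are hypothetical names, and the choice of application/rdf+xml as the only accepted type is an assumption about what an RDF-only rule would check.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: accepts a resource only if its media type is listed. */
final class ContentTypeRule {
    // Accepted media types, expected in lower case.
    private final Set<String> accepted;

    ContentTypeRule(Set<String> accepted) {
        this.accepted = accepted;
    }

    /** Assumed equivalent of an RDF-only rule: accepts RDF/XML and nothing else. */
    static ContentTypeRule rdfOnly() {
        return new ContentTypeRule(Set.of("application/rdf+xml"));
    }

    /**
     * Checks a Content-Type header value, ignoring parameters such as
     * "; charset=UTF-8" and differences in letter case.
     */
    boolean accepts(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return accepted.contains(mediaType);
    }
}
```

A rule for RSS content could be built the same way by passing a different set of media types to the constructor.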

Additional plugin types

The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the text/html content type. The system also needs to process RSS 1.0 documents, and additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.

Task progress:

  • Prepared a first draft RdfStoreWriterPlugin, ready for testing.
  • Created a set of RDF test case documents on the test site.

This document was last modified by Philip Shaw on 2005-08-04 07:51:44
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html