Beta 1 crawler plans
This page summarises the tasks and progress notes for the beta 1 release of the MKSearch crawler component. This is an archive page.
Development plans
- Custom schema plugin types
  The default metadata schema support includes Dublin Core elements and qualifiers. However, government Web sites may have their own metadata schemas, such as the UK e-Government Metadata Standard, which need to be integrated with the default schema. The crawler plugins need to allow for custom Schema support.
  Task progress: Refactored the plugin class hierarchy to share more common functionality.
  - XhtmlTripleWriterPlugin completed.
  - XhtmlStoreWriterPlugin completed.
  - Custom schema support complete.
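The idea of a pluggable schema can be illustrated with a minimal sketch. It is not taken from the MKSearch source: the Schema interface shape, the SimpleSchema class, the resolve helper, the "DC." and "eGMS." meta name prefixes, and the e-GMS element names are all assumptions made for illustration, and the e-GMS namespace is an example.org placeholder. Only the Dublin Core elements namespace is a real, published URI.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch; none of these types are taken from the MKSearch source. */
class SchemaSketch {

    /** Maps a metadata element name to an RDF property URI. */
    interface Schema {
        String prefix();
        Optional<String> propertyUri(String elementName);
    }

    /** A schema backed by a namespace URI and a fixed set of element names. */
    static final class SimpleSchema implements Schema {
        private final String prefix;
        private final String namespace;
        private final Set<String> elements;

        SimpleSchema(String prefix, String namespace, String... elements) {
            this.prefix = prefix;
            this.namespace = namespace;
            this.elements = new HashSet<>(Arrays.asList(elements));
        }

        @Override
        public String prefix() {
            return prefix;
        }

        @Override
        public Optional<String> propertyUri(String elementName) {
            if (!elements.contains(elementName)) {
                return Optional.empty();
            }
            return Optional.of(namespace + elementName);
        }
    }

    /** Default schema: a few Dublin Core elements in the published DC namespace. */
    static final Schema DUBLIN_CORE = new SimpleSchema(
            "DC", "http://purl.org/dc/elements/1.1/", "title", "creator", "subject");

    /** Custom schema: placeholder namespace and illustrative element names. */
    static final Schema EGMS = new SimpleSchema(
            "eGMS", "http://example.org/egms/", "mandate", "disposal");

    /** Resolves a meta name such as "DC.title" against a single schema. */
    static Optional<String> resolve(Schema schema, String metaName) {
        String marker = schema.prefix() + ".";
        if (!metaName.startsWith(marker)) {
            return Optional.empty();
        }
        return schema.propertyUri(metaName.substring(marker.length()));
    }

    public static void main(String[] args) {
        System.out.println(resolve(DUBLIN_CORE, "DC.title").orElse("(unmapped)"));
        System.out.println(resolve(EGMS, "eGMS.mandate").orElse("(unmapped)"));
        System.out.println(resolve(EGMS, "DC.title").orElse("(unmapped)"));
    }
}
```

In this sketch a custom schema is just another Schema instance, so a plugin written against the interface would not need to change when a site-specific vocabulary is added.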
- Custom application profile support
  The dynamic interface of the Query component has brought forward the need for an ApplicationProfile interface to adapt various Schema types. The initial implementation of these types is compatible with the custom Schema configuration mechanism in the JSpider plugins, but should ultimately change to custom ApplicationProfile.
  Task progress: Refactored AbstractRdfContentHandler, concrete types and plugins to the new interface.
  - Custom application profile support complete.
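One way to picture an application profile adapting several schemas is the sketch below. It is an assumption-laden illustration, not the MKSearch ApplicationProfile interface: the method names, the prefixSchema and profileOf helpers, and the example.org namespace are hypothetical. The profile simply asks each schema in order and returns the first property URI found.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; the real MKSearch interfaces may look quite different. */
class ApplicationProfileSketch {

    /** A single vocabulary that may or may not recognise a meta name. */
    interface Schema {
        Optional<String> propertyUri(String metaName);
    }

    /** A combination of schemas presented to callers as one lookup. */
    interface ApplicationProfile {
        Optional<String> propertyUri(String metaName);
    }

    /** Builds a schema that accepts names of the form "prefix.element". */
    static Schema prefixSchema(String prefix, String namespace) {
        String marker = prefix + ".";
        return metaName -> {
            if (!metaName.startsWith(marker)) {
                return Optional.empty();
            }
            return Optional.of(namespace + metaName.substring(marker.length()));
        };
    }

    /** Builds a profile that consults the given schemas in order. */
    static ApplicationProfile profileOf(Schema... schemas) {
        List<Schema> ordered = Arrays.asList(schemas);
        return metaName -> ordered.stream()
                .map(schema -> schema.propertyUri(metaName))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .findFirst();
    }

    public static void main(String[] args) {
        ApplicationProfile profile = profileOf(
                prefixSchema("DC", "http://purl.org/dc/elements/1.1/"),
                prefixSchema("eGMS", "http://example.org/egms/")); // placeholder namespace
        System.out.println(profile.propertyUri("DC.title").orElse("(unmapped)"));
        System.out.println(profile.propertyUri("eGMS.mandate").orElse("(unmapped)"));
        System.out.println(profile.propertyUri("keywords").orElse("(unmapped)"));
    }
}
```

A content handler written against the profile sees one lookup, which is the kind of decoupling the refactoring described above appears to aim for.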
- Custom Rule types
  The standard rule set for JSpider includes a "parse only text/html" rule used for standard Web content. MKSearch will require other rules for RDF and RSS content types, and perhaps PDF and other document types.
  Task progress:
  - Draft RdfContentTypeOnly rule prepared for testing.
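A "parse only this content type" rule can be sketched as follows. This deliberately does not use the JSpider rule API, whose signatures are not reproduced here. The Decision enum and ContentTypeOnlyRule class are hypothetical stand-ins that show only the matching logic, assuming the rule receives the raw Content-Type header value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a content-type rule; not the JSpider rule API. */
class ContentTypeRuleSketch {

    enum Decision { ACCEPT, IGNORE }

    /** Accepts a resource only if its media type is in the configured set. */
    static final class ContentTypeOnlyRule {
        private final Set<String> accepted;

        ContentTypeOnlyRule(String... mediaTypes) {
            this.accepted = new HashSet<>(Arrays.asList(mediaTypes));
        }

        Decision apply(String contentTypeHeader) {
            if (contentTypeHeader == null) {
                return Decision.IGNORE;
            }
            // Drop parameters such as "; charset=UTF-8" and normalise case.
            String mediaType = contentTypeHeader.split(";", 2)[0]
                    .trim()
                    .toLowerCase(Locale.ROOT);
            return accepted.contains(mediaType) ? Decision.ACCEPT : Decision.IGNORE;
        }
    }

    public static void main(String[] args) {
        ContentTypeOnlyRule rdfOnly = new ContentTypeOnlyRule("application/rdf+xml");
        System.out.println(rdfOnly.apply("application/rdf+xml; charset=UTF-8"));
        System.out.println(rdfOnly.apply("text/html"));
    }
}
```

Stripping header parameters before comparing matters in practice, because servers commonly append a charset to the media type.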
- Additional plugin types
  The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the text/html content type. The system also needs to process RSS 1.0 documents, and additional plugins will be required to handle the application/rss+xml content type. Plugins for other content types may be feasible if time allows.
  Task progress:
  - Prepared a first draft RdfStoreWriterPlugin, ready for testing.
  - Created a set of RDF test case documents on the test site. Indexing successful.
  - Created a set of RSS test case documents on the test site. Indexing successful.
  - Additional plugin types completed.
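Selecting a plugin by content type could look roughly like the sketch below. The ContentPlugin interface, the PluginRegistry class and the particular mapping of media types to plugin names are assumptions for illustration. The notes above name the plugins but do not say how they are wired to content types, and routing both RDF and RSS 1.0 to RdfStoreWriterPlugin is a guess based on RSS 1.0 being an RDF format.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of content-type dispatch; not the MKSearch wiring. */
class PluginDispatchSketch {

    /** Stand-in for a crawler plugin; only its name is modelled here. */
    interface ContentPlugin {
        String name();
    }

    /** Looks up the plugin registered for a document's media type. */
    static final class PluginRegistry {
        private final Map<String, ContentPlugin> byMediaType = new LinkedHashMap<>();

        void register(String mediaType, ContentPlugin plugin) {
            byMediaType.put(mediaType.toLowerCase(Locale.ROOT), plugin);
        }

        Optional<ContentPlugin> pluginFor(String contentTypeHeader) {
            // Ignore header parameters such as "; charset=ISO-8859-1".
            String mediaType = contentTypeHeader.split(";", 2)[0]
                    .trim()
                    .toLowerCase(Locale.ROOT);
            return Optional.ofNullable(byMediaType.get(mediaType));
        }
    }

    /** Assumed mapping; the real configuration is not described in these notes. */
    static PluginRegistry defaultRegistry() {
        PluginRegistry registry = new PluginRegistry();
        registry.register("text/html", () -> "XhtmlStoreWriterPlugin");
        registry.register("application/rdf+xml", () -> "RdfStoreWriterPlugin");
        registry.register("application/rss+xml", () -> "RdfStoreWriterPlugin");
        return registry;
    }

    public static void main(String[] args) {
        PluginRegistry registry = defaultRegistry();
        System.out.println(registry.pluginFor("application/rss+xml")
                .map(ContentPlugin::name).orElse("(no plugin)"));
        System.out.println(registry.pluginFor("application/pdf")
                .map(ContentPlugin::name).orElse("(no plugin)"));
    }
}
```

With a registry like this, supporting a further content type would mean registering one more entry, with no change to the existing plugins.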
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html