Crawler
Component description
The MKSearch crawler component is responsible for identifying Web documents to be indexed and passing them to custom content handlers for indexing. The crawler is largely composed of the JSpider Web spidering engine configured with custom MKSearch plugins that pass documents to the indexing component.
Development plans
- Correct JSpider issues
- To date, a range of issues have been identified with the JSpider alpha 0.5 release that affect how links are identified in HTML documents and various other aspects of the tool's operation. The issues have been logged as JSpider bug reports and must be corrected or worked around. Access has been granted to the JSpider CVS repository to make the corrections upstream of MKSearch.
- Additional plugin types
-
The current MKSearch plugins are concerned with extracting metadata from HTML documents identified by the
text/html
content type. The system also needs to process RSS 1.0 documents and additional plugins will be required to handle theapplication/rss+xml
content type. Plugins for other content types may be feasible if time allows. - Store management plugins
- MKSearch currently uses a single-pass indexing scheme where a new repository is created for each crawl session. The metadata store ultimately needs to be managed dynamically, so that incremental indexing can take place while a repository is servicing public queries on the front end. For instance, indexed documents that are subsequently removed from the origin server must also be removed from the MKSearch repository. This will require plugins that handle HTTP 404, not found, and other error types.
- Alternative configuration schemes
- JSpider currently uses static factory-based configuration loaders with Java property files, which work fine, but cause some difficulties in unit testing. This is not a critical issue, but an alternative form of configuration may be devised.
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html