Validator
Component description
The MKSearch validator component is responsible for ensuring that source documents are well-formed, valid XML documents; it therefore converts HTML documents to XHTML on the fly. The validator is largely composed of JTidy, which checks and corrects common HTML markup errors, and a validating XML parser.
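The second stage described above, a validating XML parse, can be sketched with the standard JAXP API. This is a minimal illustration and not the MKSearch code: the class name ValidatingParseSketch and the method isValid are hypothetical, and the JTidy clean-up stage is left out so that the example depends only on the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of MKSearch: checks one document string.
class ValidatingParseSketch {

    /** Returns true only if the XML is well-formed and valid against its DTD. */
    static boolean isValid(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);      // turn on DTD validation
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // By default validity errors are only reported, not thrown,
        // so escalate them to exceptions here.
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) throws SAXException { throw e; }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;                 // not well-formed, or not valid
        }
    }
}
```

Called with a document that carries its own internal DTD subset, isValid returns true when the content matches the declarations and false otherwise. Real XHTML output refers to external DTDs, so a production validator would presumably also need some way to resolve those locally.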
Beta 2 development plans
- Integrated exception handling
- The alpha version of MKSearch has only basic handling for HTML documents that JTidy cannot parse: the output stream is empty, parsing it raises a SAXException, and the document is not indexed. The beta validator will have to report such problems to the checker component so that the repository can be purged of any existing records for the problem document. The next release of JTidy is expected to implement a MessageListener interface that can be used to monitor the parse; see the JTidy upgrade item below.
- Upgrade to release r8 of JTidy
- A significant number of bugs have been reported against the current r7 release of JTidy, many of which are expected to be corrected in the next release. None of them is known to affect MKSearch, but system testing has been limited to date and it would be better to work with a cleaner version.
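One way the exception handling item above might look in practice is sketched here, using only the JDK. Everything named in it is an assumption made for illustration: the ProblemListener interface stands in for whatever the checker component actually exposes, and parseTidyOutput is a hypothetical method that receives the bytes produced by the HTML clean-up stage. The sketch checks for empty output before parsing and reports both that case and any SAXException to the listener rather than silently skipping the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the MKSearch implementation.
class EmptyOutputSketch {

    /** Assumed callback; a stand-in for the checker component's real interface. */
    interface ProblemListener {
        void documentFailed(String uri, String reason);
    }

    /**
     * Parses the cleaned-up output for one document.
     * Returns true if it parsed, false if a problem was reported to the listener.
     */
    static boolean parseTidyOutput(String uri, byte[] tidyOutput, ProblemListener listener)
            throws ParserConfigurationException, IOException {
        // Empty output means the clean-up stage could not handle the source;
        // report it directly instead of letting the XML parser fail on it.
        if (tidyOutput.length == 0) {
            listener.documentFailed(uri, "empty output from HTML clean-up");
            return false;
        }
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(tidyOutput), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Pass the failure on so that stale records can be purged.
            listener.documentFailed(uri, "XML parse failed: " + e.getMessage());
            return false;
        }
    }
}
```

A listener in this shape could collect the URIs of failed documents so that the repository purge runs once per crawl; whether the checker would prefer a callback or a returned status is a design choice the plan leaves open.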
Copyright MKDoc Ltd. and others.
The Free Documentation License http://www.gnu.org/copyleft/fdl.html