Initial review notes
JoBo looks like a strong candidate because the core code is only loosely coupled to the Apache classes. One issue would be removing the logging dependencies from the <code>WebRobot</code>, <code>FormFiller</code>, <code>HtmlDocument</code> and <code>HttpTool</code> classes; logging could instead be handled dynamically through an interface adapter. A second issue is that the Apache regular expression handling in the <code>RegExpRule</code> and <code>RegExpURLCheck</code> classes would have to be switched to the GNU RegExp package.
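As a rough illustration of the interface adapter idea, the sketch below shows how a crawler class could log through a small interface rather than a concrete logging library. All names here (<code>LogAdapter</code>, <code>NullLogAdapter</code>, <code>MemoryLogAdapter</code>, <code>CrawlerComponent</code>) are hypothetical and not taken from the JoBo source; <code>CrawlerComponent</code> merely stands in for a class such as <code>WebRobot</code>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical adapter interface: crawler classes would log through this
// instead of calling a concrete logging library directly.
interface LogAdapter {
    void debug(String message);
    void error(String message, Throwable cause);
}

// No-op default, so the core code carries no logging dependency at all.
final class NullLogAdapter implements LogAdapter {
    public void debug(String message) { }
    public void error(String message, Throwable cause) { }
}

// Example adapter that collects messages in memory; a real deployment
// would forward to whichever logging library the host application uses.
final class MemoryLogAdapter implements LogAdapter {
    private final List<String> lines = new ArrayList<>();

    public void debug(String message) {
        lines.add("DEBUG " + message);
    }

    public void error(String message, Throwable cause) {
        String detail = (cause == null) ? "" : ": " + cause.getMessage();
        lines.add("ERROR " + message + detail);
    }

    List<String> lines() {
        return lines;
    }
}

// Stand-in for a class such as WebRobot (assumption, not JoBo code).
final class CrawlerComponent {
    private LogAdapter log = new NullLogAdapter();

    // The adapter is supplied at run time; null falls back to the no-op.
    void setLogAdapter(LogAdapter adapter) {
        this.log = (adapter == null) ? new NullLogAdapter() : adapter;
    }

    void fetch(String url) {
        log.debug("fetching " + url);
    }
}
```

With this arrangement the choice of logging back end is made by whoever constructs the crawler, and the four classes listed above would depend only on the interface.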
One of the strengths of JoBo is that it is already partly integrated with JTidy and has an <code>HttpDocManager</code> interface for post-processing documents. The default implementation saves all content to individual files.
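To make the post-processing idea concrete, here is a minimal sketch of a document sink that writes each fetched document to its own file. It does not reproduce the real <code>HttpDocManager</code> signatures, which were not checked for these notes; <code>DocumentSink</code>, <code>FileDocumentSink</code> and the file-naming scheme are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical post-processing hook; not the actual HttpDocManager API.
interface DocumentSink {
    void store(URI source, byte[] content) throws IOException;
}

// Saves every document to an individual file under a base directory.
final class FileDocumentSink implements DocumentSink {
    private final Path baseDir;

    FileDocumentSink(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Derives a flat file name from host and path (assumed naming scheme).
    static String fileNameFor(URI source) {
        String path = source.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + "index.html";
        }
        // Replace anything outside a conservative character set.
        return (source.getHost() + path).replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public void store(URI source, byte[] content) throws IOException {
        Files.createDirectories(baseDir);
        Files.write(baseDir.resolve(fileNameFor(source)), content);
    }
}
```

An alternative sink (for example one that indexes the content instead of saving it) would implement the same interface, which is the kind of substitution the document manager interface is meant to allow.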
JoBo appears to have advanced HTTP support, including cookies and form handling, and respects the robots exclusion protocol. The rate of spidering can also be throttled to moderate the load on the origin servers.
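The throttling idea can be sketched as a per-host minimum delay between requests. This is not JoBo's own mechanism, which was not inspected here; <code>HostThrottle</code> and its <code>reserve</code> method are hypothetical names, and the clock value is passed in explicitly so the behaviour is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical throttle enforcing a minimum gap between requests per host.
final class HostThrottle {
    private final long minDelayMillis;
    // Earliest time (in millis) at which each host may next be contacted.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Reserves the next request slot for the host and returns how many
    // milliseconds the caller should wait before sending the request.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }
}
```

A fetch loop would call <code>reserve(host, System.currentTimeMillis())</code> and sleep for the returned duration before issuing the request, so that requests to the same server are spaced out while different hosts do not block each other.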