The MKSearch system has been designed to work with the GNU Compiler for Java (GCJ). These notes explain how to index Web content with two of the default configuration sets provided with the project.
Environment settings
The MKSearch build and execution scripts use variable substitution to run from an arbitrary installation directory. It is assumed that the MKSearch source is installed in a single base directory and reflects the original structure of the project in the Subversion repository.
Before running the scripts, the environment variables listed below must be set.
GNU/Linux environment settings
You can set these variables in your .bash_profile script, for instance:
export mk_build=/home/mksearch/build
export mk_home=/home/mksearch
export CLASSPATH=/usr/share/java/libgcj-3.4.1.jar
Substitute the actual path to your MKSearch installation for the mk_home variable.
The path for the temporary build directory may be outside the MKSearch home path.
Include the actual path of your core Java class repository in the CLASSPATH variable.
Exit your current session and log in again to apply the changes. To check the settings have been applied, use the env command piped through less:
$ env | less
Use the down arrow key to scroll. You should see three lines that look like this:
mk_build=/home/mksearch/build
mk_home=/home/mksearch
CLASSPATH=/usr/share/java/libgcj-3.4.1.jar
Press Q to exit less.
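As an alternative to paging through the full env output, the check can be sketched as a small shell function. This is only a convenience sketch and not part of the MKSearch scripts; the function name check_mk_env is hypothetical, and it tests only the three variables shown above.

```shell
# Hypothetical helper, not shipped with MKSearch.
# Prints each expected variable if it is set and reports any that are missing.
check_mk_env() {
    missing=0
    for name in mk_build mk_home CLASSPATH; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is not set\n' "$name" >&2
            missing=1
        fi
    done
    # Exit status is 0 when all three are set, 1 otherwise.
    return "$missing"
}
```

Running check_mk_env after logging in again should print the same three lines as above; a non-zero exit status means at least one variable is missing.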
Java compatibility
The example commands below assume you have installed the JPackage compatibility package for GCJ and can call the java command as if the Sun JVM were installed. If not, an equivalent set of scripts is available in the $mk_home/bin directory with a gij- prefix.
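The choice between the two script sets could be automated along the following lines. This is a rough sketch under two assumptions: that the presence of a java command on the PATH is a good enough sign that the compatibility package is installed, and that the gij- equivalent of java-jspider.sh is named gij-jspider.sh. The second name is inferred from the prefix rule above and should be checked against the contents of $mk_home/bin. The function name pick_jspider_script is hypothetical.

```shell
# Hypothetical helper: print the path of the indexer script to use.
pick_jspider_script() {
    if command -v java >/dev/null 2>&1; then
        # A java command is available, so use the java- script.
        echo "$mk_home/bin/java-jspider.sh"
    else
        # ASSUMPTION: the gij- variant follows the same naming pattern.
        echo "$mk_home/bin/gij-jspider.sh"
    fi
}
```

The printed path can then be used in place of the script name in the example commands.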
N-Triple index example
The MKSearch project includes a static test site that is used to check the correct operation of the indexer. For simplicity, the "triple" configuration indexes a set of Web pages and generates an N-Triple output file for each on the local file system. The example below runs the MKSearch indexer on the test site using the triple configuration.
$mk_home/bin/java-jspider.sh http://test.mksearch.mkdoc.org/ triple
This run will generate a new directory structure at $mk_home/output/org.mkdoc.mksearch.test.
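A quick way to confirm that the run produced output is to count the files under that directory. The sketch below is illustrative only; the function name count_triple_output is hypothetical, and it counts every regular file because the naming of the generated N-Triple files is not described here.

```shell
# Hypothetical helper: count regular files under the N-Triple output directory.
# Defaults to the directory named above; pass a different path as the first argument.
count_triple_output() {
    dir=${1:-$mk_home/output/org.mkdoc.mksearch.test}
    if [ ! -d "$dir" ]; then
        printf 'no output directory at %s\n' "$dir" >&2
        return 1
    fi
    # tr strips the padding that some wc implementations add.
    find "$dir" -type f | wc -l | tr -d ' '
}
```

Since the triple configuration writes one output file per indexed page, the count should be close to the number of pages on the test site.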
Sesame RDF repository example
After basic operation of the indexer has been confirmed using the N-Triple configuration, you can run the "rdfstore" configuration to build a file-based Sesame repository of the test site metadata. The Sesame repository is stored as a single XML-serialised RDF file on the file system.
$mk_home/bin/java-jspider.sh http://test.mksearch.mkdoc.org/ rdfstore
This run will generate a single RDF/XML file at $mk_home/output/com.mkdoc.jspider.XhtmlStoreWriterPlugin.rdf.
Indexing performance
At the time of writing, the MKSearch static test site had 210 test pages and these index configurations were set to use a single thread with a throttle of 500 milliseconds between requests. A complete run takes between 5 and 20 minutes, depending on other applications that may be running and general network traffic.
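The throttle setting puts a floor under the run time, which can be worked out with simple shell arithmetic. The sketch below assumes one request per page and one throttle delay per request; the function name min_throttle_seconds is hypothetical.

```shell
# Hypothetical helper: seconds spent waiting on the throttle alone.
# $1 = number of pages, $2 = throttle in milliseconds.
min_throttle_seconds() {
    pages=$1
    throttle_ms=$2
    echo $(( pages * throttle_ms / 1000 ))
}

min_throttle_seconds 210 500   # prints 105
```

Under these assumptions the throttle accounts for less than two minutes of a run, so most of the 5 to 20 minutes is spent on fetching, parsing and writing output.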