Arctic Region Supercomputing Center

Software Components of Multisearch

Introduction
Various software components were used to create Multisearch. As illustrated below, Tomcat is the base of all the software, holding both the backends and the front-end.

Two information retrieval packages were used: Lemur and Lucene. These were used to generate the indexes to be searched. Axis services and OGSA-DAI were both used to generate backend services. Technically, OGSA-DAI uses Axis to launch services, but the user doesn't deal directly with Axis. These services all interact with Hadoop.

Software Components

Tomcat
Apache Tomcat is a servlet container that serves Java web applications to a browser. The Tomcat installation is located through an environment variable:

$CATALINA_HOME
which holds the path of the directory in which Tomcat is installed.

Tomcat hosts the web services, both the front-ends (where the client interacts with the software) and the backends (where the services run the query and return the results).

Axis Services & OGSA-DAI
Apache Axis and OGSA-DAI are both tools that enable users to launch web services on Tomcat or other servlet containers. Both tools also let the user create a client-side application that can contact the backend and use it.

OGSA-DAI was originally incorporated into Multisearch in 2007, when the version in use was OGSA-DAI WSI 2.2. It is middleware that provides a structure for grid computing services. OGSA-DAI suits people or groups that want to launch multiple backends quickly in a structured environment. OGSA-DAI uses Axis to do this.

However, users can also create Axis services directly. Multisearch 2006 used Axis services for its backends. Multisearch 2008 can use both the structured OGSA-DAI backends and independently written Axis services. Axis services suit users or groups that want more flexibility in programming, especially when there are fewer backend services.

Lemur
Lemur is an information retrieval package written in C++ that can index and search data. Lemur also has a Java wrapper, so it can be used in Java applications as well. It is especially useful because it includes the Indri engine along with many different ranking algorithms to search with, such as TF-IDF and Okapi.
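
To give a feel for what a ranking algorithm such as TF-IDF computes, here is a minimal sketch in plain Java. It uses one common textbook form (raw term count times the natural log of N/df) and is not Lemur's implementation or API; the class and method names are hypothetical, and Lemur's actual weighting may differ.

```java
import java.util.Arrays;
import java.util.List;

// Textbook TF-IDF for illustration only; not Lemur's code or API.
class TfIdfSketch {

    // Score one term for one document against a small in-memory corpus.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        // tf: how many times the term occurs in this document.
        long tf = doc.stream().filter(term::equals).count();
        // df: how many documents in the corpus contain the term at all.
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        // Terms that appear in every document get idf = ln(1) = 0.
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("arctic", "search"),
                Arrays.asList("search", "index"),
                Arrays.asList("search", "search", "grid"));
        List<String> doc = corpus.get(0);
        System.out.println("arctic: " + tfIdf("arctic", doc, corpus));
        System.out.println("search: " + tfIdf("search", doc, corpus));
    }
}
```

In this toy corpus "search" appears in every document, so it scores zero, while the rarer "arctic" scores higher; that is the intuition behind weighting terms by how rare they are.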

Lucene
Apache Lucene is another information retrieval package, and Multisearch has used it for a long time. It is written in Java, so it can be used directly in Java applications.

Hadoop
Apache Hadoop was inspired by Map/Reduce, a programming abstraction first used by Google for processing large data sets stored on large, distributed file systems. It scales reliably up to petabytes of data.

Hadoop uses a map() function to take a set of <key, value> pairs and process them in parallel. The map() function produces an intermediate set of <key, value> pairs (not necessarily of the same type as the input), which are then passed to the reduce() function. The reduce() function receives all the values that share the same key and combines them into a result.
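
The map/group/reduce flow described above can be sketched in plain Java. This is an illustration only and does not use the Hadoop API: the class name, the word-count task, and the run() driver that stands in for the framework are all assumptions made for the example. Note that the input pairs are <Integer, String> while the intermediate pairs are <String, Integer>.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map/reduce flow; it does not use the Hadoop API.
class MapReduceSketch {

    // map(): turn one <lineNumber, text> input pair into intermediate <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(int lineNumber, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce(): combine every value that shares a key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver standing in for the framework: run map, group by key, then run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines.get(i))) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {index=1, merge=1, results=1, search=1, the=2}
        System.out.println(run(Arrays.asList("search the index", "merge the results")));
    }
}
```

In a real Hadoop job the framework handles the grouping and distributes the map() and reduce() calls across machines; here the loop in run() does that work sequentially.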

Multisearch uses Hadoop to run its distributed searching. In the map() phase, it creates clients to connect to the backend services, using <query, ServiceWritable> pairs as input. The clients return ResultSets, which the map() function then emits as <query, ResultSetWritable> pairs. The reduce() function merges all the ResultSets together by rescoring and reordering the documents in the ResultSets passed to it by the mapper.
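
As a rough sketch of the merge step in reduce(), the plain-Java example below rescores each backend's results and reorders the pooled documents. The page does not say how Multisearch rescores, so min-max normalization onto [0, 1] is an assumption chosen for illustration, and ScoredDoc, rescore() and merge() are hypothetical names rather than Multisearch classes such as ResultSetWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the reduce() step; these are not Multisearch's classes.
class MergeSketch {

    static class ScoredDoc {
        final String docId;
        final double score;

        ScoredDoc(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Rescore one backend's results onto [0, 1] (min-max normalization, an assumed scheme).
    static List<ScoredDoc> rescore(List<ScoredDoc> resultSet) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (ScoredDoc d : resultSet) {
            min = Math.min(min, d.score);
            max = Math.max(max, d.score);
        }
        List<ScoredDoc> out = new ArrayList<>();
        for (ScoredDoc d : resultSet) {
            // If every score is equal, give each document the top score.
            double s = (max > min) ? (d.score - min) / (max - min) : 1.0;
            out.add(new ScoredDoc(d.docId, s));
        }
        return out;
    }

    // Merge: rescore each result set, pool the documents, reorder by descending score.
    static List<ScoredDoc> merge(List<List<ScoredDoc>> resultSets) {
        List<ScoredDoc> merged = new ArrayList<>();
        for (List<ScoredDoc> rs : resultSets) {
            merged.addAll(rescore(rs));
        }
        merged.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Two backends that score on different scales.
        List<ScoredDoc> backendA = Arrays.asList(
                new ScoredDoc("a1", 12.0), new ScoredDoc("a2", 3.0));
        List<ScoredDoc> backendB = Arrays.asList(
                new ScoredDoc("b1", 1.0), new ScoredDoc("b2", 0.5), new ScoredDoc("b3", 0.0));
        for (ScoredDoc d : merge(Arrays.asList(backendA, backendB))) {
            System.out.println(d.docId + " " + d.score);
        }
    }
}
```

Some form of rescoring is needed because raw scores from different backends (for example, a Lemur index and a Lucene index) are generally not on the same scale, so sorting the raw values directly would favor whichever backend happens to produce larger numbers.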

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775

© Arctic Region Supercomputing Center 2006-2008. This page was last updated on 12 August 2008.
These files are part of a portfolio for Kylie McCormick's online resume. See the disclaimer for more information.