
Arctic Region Supercomputing Center

Multisearch Log

These are the log files from 30 June - 3 July, 2008. For additional log files, please return to the main log page.

Thursday, 3 July
As of 12:30pm, I have managed to get NaiveMergeSet functioning with the backend tester for Multisearch. It's having some difficulties with Indri's ranking methods, which use negative numbers. I tried taking the document score and multiplying it by -1 to ensure that the result of the calculations would not be NaN, but now all the Lemur results are being scored insanely high!

I think there might be a better way to do NaiveMerge, a way that compensates for negative numbers instead of just flipping their sign. I'll drop Chris a line about this one.
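Roughly, the idea would be something like this: min-max normalize each backend's scores into [0, 1] before merging, so Indri's negative numbers and Lucene's positive scores land on the same scale. This is only a sketch; ScoredDoc and NormalizedMerge are made-up names here, not the real Multisearch beans.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical document/score holder; the real Multisearch beans may differ.
class ScoredDoc {
    String docId;
    double score;
    ScoredDoc(String docId, double score) { this.docId = docId; this.score = score; }
}

public class NormalizedMerge {
    // Rescale one backend's scores into [0, 1] so that Indri's negative
    // log-style scores and Lucene's positive scores become comparable.
    static void normalize(List<ScoredDoc> results) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (ScoredDoc d : results) {
            min = Math.min(min, d.score);
            max = Math.max(max, d.score);
        }
        double range = (max == min) ? 1.0 : (max - min);
        for (ScoredDoc d : results) {
            d.score = (d.score - min) / range;
        }
    }

    // Naive merge: normalize each backend's list, concatenate, and sort by score.
    static List<ScoredDoc> merge(List<List<ScoredDoc>> perBackend) {
        List<ScoredDoc> merged = new ArrayList<ScoredDoc>();
        for (List<ScoredDoc> results : perBackend) {
            normalize(results);
            merged.addAll(results);
        }
        Collections.sort(merged, new Comparator<ScoredDoc>() {
            public int compare(ScoredDoc a, ScoredDoc b) {
                return Double.compare(b.score, a.score); // descending by score
            }
        });
        return merged;
    }
}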

Wednesday, 2 July
Right now I am going over some of the "grittier" parts of the Hadoop code, including the errors I might expect from it. Today I'm first going to build a simple version of Multisearch without Hadoop (just a quick client side to merge Lemur and Lucene). That way, I'll know that Axis, Tomcat, Lemur, Lucene, and the Merge algorithms are all working, so if the Hadoop-based version fails later, I'll be certain it's a Hadoop issue.

I am planning on making a Lemur Web Service today as well, which I am very excited about. I've already built one index with Indri, and I'm working on another index with the key variety in Lemur.

As of right now, I have the following indexes:

Index            Tool    Type
ap88             Lemur   key
GX200            Lemur   key - stopped
GX238            Lemur   key
GX030            Lemur   Indri
gov2.dsub.1165   Lucene  normal
gov2.dsub.1318   Lucene  normal

I pulled out a large stoplist file to use with Lemur, since it would be cool to have different types of indexes. Now I have Indri, key, and Lucene indexes!

I also made a small transform.DataModify Java function. To make a Lemur Key index, you need a list of all the files containing data. The GX directories have many files that need to be indexed separately. So, there needed to be a fast way to generate the list of files that Lemur could use.

transform.DataModify - a quick how-to
 
1. ls directory > filename (which should be directory.files)
2. run transform.DataModify filename
3. open directory.file_list.dat (generated) and delete the first line
 
'directory' is the name of the directory you want to index. The filename should reflect the index name so the correct path to the files can be generated.

This can be useful later if I need to make more Lemur Key indexes.
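For reference, here is a minimal sketch of what transform.DataModify might look like, reconstructed from the how-to above. The real class apparently writes an extra first line (hence step 3), which this sketch omits, and its path handling may differ.

package transform;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Reads "directory.files" (one filename per line, as produced by
// `ls directory > directory.files`) and writes "directory.file_list.dat"
// with one full path per line for the Lemur key indexer.
public class DataModify {
    public static void main(String[] args) throws IOException {
        String listing = args[0];                          // e.g. GX238.files
        String directory = listing.replace(".files", "");  // e.g. GX238
        BufferedReader in = new BufferedReader(new FileReader(listing));
        PrintWriter out = new PrintWriter(new FileWriter(directory + ".file_list.dat"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().length() > 0) {
                // Prefix each filename with its directory to form the full path.
                out.println(directory + "/" + line.trim());
            }
        }
        in.close();
        out.close();
    }
}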

I've set up the Pseudo-Distributed Operation on Hadoop using the QuickStart instructions. I'm playing with the examples, too, but first I wanted to ensure that all my paths were working. There is now a $HADOOP variable which points to /home/mccormic/hadoop/, with Hadoop 0.17 installed.

I also remembered that Lemur comes with different types of search/ranking methods, all of which can be found in RetMethodManager: tfidf, okapi, kl, inquery, cori_cs, cos, inq_struct, and indri.

There seems to be a minor setback with Lemur key indexes. Right now, it's not generating a .key file for the index, which makes it very hard to program a searcher for, since the .key file is what opens the index. I'm re-running the indexing to see if I can find the problem.

Update: Both a LuceneSearcher and a LemurIndriSearcher are now up and running, collecting and returning beans! Hopefully by tomorrow I can get a merge function going on them and get the results merged. I've put in a query about the mysterious indexing issues with Lemur, but I doubt I'll get a response before the end of the holiday weekend.

Tuesday, 1 July
Yay! Now the LemurSearcher code works, so we can merge Lucene and Lemur results. I'm running an Indri search right now, but I can add the key search later as well. That means we can have three different ranking algorithms to merge via NaiveMerge and such.

I'm working on fully documenting the design part of Multisearch with Hadoop, although I'll probably have to rework the Map half of the design once I work directly with the code. Hopefully I'll be done with the website this week.

The architecture page has new graphics, although they'll need some work. I want to be able to compare the different architectures for Multisearch on this page as well, although it is very bare-bones at the moment. I've also added information to the Contacts page.

Monday, 30 June
It looks like we're going to need a serializer of some kind for passing the objects over the wire. However, what if we're looking at something like an OGSA-DAI service, which uses XML and doesn't need beans? I suppose this means we're going to have to take a modular approach with another object: Client.

The good news is that most services provide a client with their work, because they know how they want their services to be accessed, so it should be okay to assume the maker of a service can provide a client side.

The bad news is that it'll have to be dropped in, like the different types of algorithms and such last year. It will also have to match the type of service. (E.g., an Axis service would need to provide a client that can contact it; the same is true for an OGSA-DAI service.)

I can do what I did before: use Reflection, so it's easier to add clients on the fly. However, I think I'll also need to look into other information on Axis. I've found a few resources, like this one from JavaBoutique and another OnJava reference.

Client object (Reflection, extendable, cloneable, etc.)
 
Client will provide the basic ins-and-outs for Multisearch's contact with another service. We'll need to set certain information and then contact the service.
 
Assumed Needs
String endpoint :: the URI to connect to
String query :: the query string to be sent
 
Client() :: create the client object (will have to use a .getInstance() factory method instead)
call() :: call the service - should return objects
getResults() :: get the result set - optional
 
To integrate properly with Hadoop, we'll need this to be the Map part of the program. We'll declare a base Client, and then everyone will have to provide a Client class to help us communicate with their Service. I can set one up for Axis services.

Given the reflection I plan on using with the Clients, it might also be possible to have each group provide the object being sent over the wire and let Multisearch serialize it. This would mean that groups that have designed a special object would still be able to use Multisearch.
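Under those assumptions, the base Client could be sketched roughly like this. The method names follow the Assumed Needs list above; this is only a sketch, not final code.

// Base Client for contacting a backend service; concrete subclasses
// (e.g. an Axis-based client) would be dropped in per service.
public abstract class Client {
    protected String endpoint; // the URI to connect to
    protected String query;    // the query string to be sent

    // Load a concrete Client implementation by class name at runtime,
    // so new backends can be added without recompiling Multisearch.
    public static Client getInstance(String className) throws Exception {
        return (Client) Class.forName(className).newInstance();
    }

    public void setEndpoint(String endpoint) { this.endpoint = endpoint; }
    public void setQuery(String query) { this.query = query; }

    // Contact the service and return whatever objects it produces.
    public abstract Object call() throws Exception;

    // Optional convenience: fetch the last result set, if the subclass keeps one.
    public Object getResults() { return null; }
}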

I've also been having an error with Lemur:

Exception in thread "main" java.lang.UnsatisfiedLinkError: /usr/local/lib/liblemur_jni.so: /usr/local/lib/liblemur_jni.so: wrong ELF class: ELFCLASS64 (Possible cause: architecture word width mismatch)

I've tried fixing it by updating my library path variable to include the linking file, but I get the same error. I've tried adding -d64 to the command, which should make it run a 64-bit JVM, but then it tells me:

Running a 64-bit JVM is not supported on this platform.

Of course, this isn't true: Lemur is built 64-bit, so is Java, and Snowy can handle 64-bit Java. I've dropped Greg a line about it. I've also signed up on the Lemur Toolkit Discussion page to get a username and password, but haven't gotten it yet. Hopefully I can post about this soon. Update: Post posted! Here's to hoping for fast responses!

Another Update: While technically LuceneSearch and LemurSearch need an index, I'm going to remove this option from being sent over the wire. Chances are, each service will have its own index that we need not specify. Also, it's harder to do anything on-the-fly when we need to know that much information about them. Current code: LuceneSearcherImpl.java and LuceneIndexer.java

I've managed to launch a LuceneSearcher, restricted to gov2.dsub.1165, as an Axis service.
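For reference, contacting an Axis service like that from a Client could look roughly like this, using Axis's dynamic invocation interface. The namespace and operation name here are placeholders, not the actual LuceneSearcher WSDL.

import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;

public class AxisSearchClient {
    // endpoint: the service URI; query: the query string to send.
    public static Object search(String endpoint, String query) throws Exception {
        Service service = new Service();
        Call call = (Call) service.createCall();
        call.setTargetEndpointAddress(new java.net.URL(endpoint));
        // Placeholder namespace/operation; the real values come from the service's WSDL.
        call.setOperationName(new QName("urn:multisearch", "search"));
        // Send the query and hand back whatever bean(s) the service returns.
        return call.invoke(new Object[] { query });
    }
}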

Design for Multisearch Client/Service connection with Hadoop
 
For the design of the Client/Service connection, I'm using Hadoop's outlined code and my understanding of Axis and OGSA-DAI services. Essentially, the goal is to have a Client.getInstance() call that can load whatever kind of client class we need, send the query over the wire, and produce a result set.
 
JobConf conf = new JobConf(Multisearch.class);
conf.setJobName("multisearch");

conf.setOutputKeyClass(ResultSetKey.class);
conf.setOutputValueClass(ResultSet.class);
conf.setMapperClass(MultisearchMap.class);
conf.setCombinerClass(MultisearchReduce.class); // weeds out repetition only
conf.setReducerClass(MultisearchMerge.class);   // whichever merge class we settle on
// Input and output formats are still undecided:
// conf.setInputFormat(?.class);
// conf.setOutputFormat(?.class);
 
The <key, value> pairings will be <client, query> during the Map phase. From there, we move into the Reduce phase, where the pairing will be <query, list<documents>>; the Reduce function will merge these into a single ResultSet.
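Sketched against the 0.17 mapred API, the Map side might then look something like the following. MultisearchMap, ResultSetKey, and ResultSet are the classes named in the JobConf above (they would need to be Writable types), and the Text key/value choice here is only an assumption.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One map call = one backend search. The input key names the Client class
// for a backend, and the input value carries the query string.
public class MultisearchMap extends MapReduceBase
        implements Mapper<Text, Text, ResultSetKey, ResultSet> {

    public void map(Text clientClass, Text query,
                    OutputCollector<ResultSetKey, ResultSet> output,
                    Reporter reporter) throws IOException {
        try {
            // Load whichever Client implementation this backend provides (see the Client sketch above).
            // The endpoint URI would also have to reach the client, e.g. via the JobConf.
            Client client = Client.getInstance(clientClass.toString());
            client.setQuery(query.toString());
            ResultSet results = (ResultSet) client.call();
            // Key the results by query so Reduce can merge everything for the same query.
            output.collect(new ResultSetKey(query.toString()), results);
        } catch (Exception e) {
            throw new IOException("Backend search failed: " + e.getMessage());
        }
    }
}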

I'm not entirely certain, however, that this design makes logical sense to Hadoop. In English, these are the goals (the design will be modified around them):

  • Map will run a search on one given backend/service. This requires a URI, ClientClass, and Query.
  • CombinerClass might not even be needed; it would simply combine duplicate documents.
  • OutputReader will collect edu.arsc.multisearch.backend.ResultSet objects from each given search and pipe them into Reduce.
  • Reduce takes all the OutputReader inputs and merges them with a given Merge algorithm.

I feel fairly good about the OutputReader/Reduce sections, since I am certain of how those will work. I am not certain, however, of the best way to approach the Map/OutputReader section. I am fairly certain that Multisearch will run faster on Hadoop.
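Under the same assumptions, a rough shape for the Reduce side, with the merge algorithm left as a drop-in (MergeAlgorithm and NaiveMerge are placeholder names here, not settled code):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reduce receives every backend's ResultSet for one query and merges them
// into a single ranked ResultSet with whichever merge algorithm is configured.
public class MultisearchMerge extends MapReduceBase
        implements Reducer<ResultSetKey, ResultSet, ResultSetKey, ResultSet> {

    private final MergeAlgorithm merger = new NaiveMerge(); // drop-in merge strategy

    public void reduce(ResultSetKey query, Iterator<ResultSet> values,
                       OutputCollector<ResultSetKey, ResultSet> output,
                       Reporter reporter) throws IOException {
        ResultSet merged = new ResultSet();
        while (values.hasNext()) {
            // Fold each backend's results into the running merged set.
            merged = merger.merge(merged, values.next());
        }
        output.collect(query, merged);
    }
}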


© Arctic Region Supercomputing Center 2006-2008. This page was last updated on 7 July 2008.
These files are part of a portfolio for Kylie McCormick's online resume. See the disclaimer for more information.