
Arctic Region Supercomputing Center

Multisearch Log

These are the log files from 7 - 11 July 2008. For additional log files, please return to the main log page.

Friday, 11 July
Today I've been working on the Job Input for Hadoop, which apparently needs to be a RecordReader so that there is a record-oriented method of splitting jobs. I am highly tempted to use XML again to define the records...

<service>
    <url> http:// </url>
    <clientclass> client.class.name </clientclass>
    <trecname> TrecName (such as gov2.dsub.#) </trecname>
</service>

I am fairly certain that all we will need is the URL and the Client Class, but it would be easier to expand Multisearch, should other needs come up, if it is in this format. The 'trecname' section is for our server-selection algorithms; a simple 'identifier' would also be sufficient here, since it won't always be a TREC name. In any case, I sent a detailed question to the Hadoop list, because the RecordReader I found that comes closest to what I need is the StreamXmlRecordReader, which is not exactly what I want. I'm hoping there is an entire tutorial on input for Hadoop somewhere, but I can't find it!
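
Just to sanity-check the format, pulling the fields back out of a single <service> record is easy enough with the standard DOM classes. This is only a rough sketch to convince myself the format is workable: the helper class name is made up, and it isn't wired into a RecordReader yet.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

//Sketch only: parse one <service>...</service> record into its three fields.
public class ServiceRecordParser {

    public static String[] parse(String record) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(record)));
        return new String[] {
            text(doc, "url"), text(doc, "clientclass"), text(doc, "trecname")
        };
    }

    //Trimmed text content of the first element with the given tag name.
    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}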

Alternative Option for Input: The TextInputFormat object can use plain text files, which are broken up by line. Instead of using a complex (and therefore human-readable) XML system, there could be a simple (but ugly) line-based system:

http://snowy.arsc.alaska.edu:8080/axis/service/name/:client.class.name:special-name

Essentially, the file would be broken up by lines, and then the input-handling code would have to split the resulting String at the colons and "know" that the format is url:clientclass:name. It's messy, but should Multisearch need to be extended, we could always add additional colon sections, with the knowledge that the files would still be backwards compatible, since the StringTokenizer in the older version would only iterate through the first three sections.
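
One wrinkle with this: the URL itself contains colons (http://host:8080/...), so a naive left-to-right StringTokenizer split would chop the URL apart. Peeling the last two fields off the right-hand end avoids that. A quick sketch, nothing more (the class name is made up):

//Sketch only: split one "url:clientclass:name" line into its three sections.
//The URL contains colons itself, so the last two fields are peeled off the
//right-hand end rather than tokenizing left-to-right.
public class ServiceLineParser {

    public static String[] parse(String line) {
        int last = line.lastIndexOf(':');
        int secondLast = line.lastIndexOf(':', last - 1);
        String url = line.substring(0, secondLast);
        String clientClass = line.substring(secondLast + 1, last);
        String name = line.substring(last + 1);
        return new String[] { url, clientClass, name };
    }
}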

Not wanting to use a string separated by colons, I've decided to continue making my own RecordReader, based on one of the ones provided in the source code, which can be viewed as a class file here. I have the basic read-in functions done, which produce a ServiceWritable object (yay!) so far.
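
The heart of any custom record type is Hadoop's Writable interface: a write() and a readFields() that serialize the fields in a fixed order. Here's a bare-bones sketch of what ServiceWritable could look like; the getters are my assumption about what the map step will need, not finished code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

//Sketch: one record = one service the map step should call.
public class ServiceWritable implements Writable {

    private String url = "";
    private String clientClass = "";
    private String trecName = "";

    //Serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeUTF(clientClass);
        out.writeUTF(trecName);
    }

    //...and read them back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        clientClass = in.readUTF();
        trecName = in.readUTF();
    }

    public String getUrl() { return url; }
    public String getClientClass() { return clientClass; }
    public String getTrecName() { return trecName; }
}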

My major concern is debugging, as I cannot test each entity alone to see if it functions properly. I'm mimicking the code from Hadoop as much as possible, trying to make sure it will function as Hadoop needs it to. I'll be doing some work on this over the weekend and later tonight, although I am concerned that the debugging will be quite painful.

Thursday, 10 July
The coding is going much more smoothly since I reviewed a lot of the text last night, and my code notes are being updated to reflect the change (IdentityReducer). However, I've joined the Hadoop Core Mailing Lists to see if I can get more information about using a MergeReducer that would produce 1 and only 1 output each time, regardless of how many mappers are brought together.

My name is Kylie McCormick, and I'm currently working on creating a distributed information retrieval package with Hadoop based on my previous work with other middlewares like OGSA-DAI. I've been developing a design that works with the structures of the other systems I have put together for distributed IR.
 
Essentially, each service (search) returns a ResultSet, which is then merged into a single FinalSet object as soon as it is returned to the main program. Merging a ResultSet generally entails rescoring the documents and putting them in the same OrderedList as documents from other services that have also been rescored.
 
I have re-designed this so at the Map phase a service is invoked and the ResultSet is collected by the OutputCollector. In the Reduce phase, I hoped to merge all the results together. Is it possible to have reduce produce one (and only one) object output?

I'm hoping this will be enough for a strong answer. What I realized I could also do is create my own Reduce function that rescores the documents and outputs ResultSets with these new scores, and then merge the documents into a FinalSet myself. But if Hadoop can handle all of it, I want to go for that.

I did get a reply to the above e-mail: "If you tell Hadoop to use a single reducer, it should produce a single file of output." I've asked whether that means one output object or just one output file, which makes a huge difference! But I will continue working with the code with the IdentityReducer for now, and see if I can get something running at least.
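
In the meantime, forcing a single reduce is only one setting on the JobConf, so it's cheap to experiment with. A minimal sketch against the old org.apache.hadoop.mapred API; the job class passed in is just a stand-in for whatever Multisearch's driver class ends up being.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

//Sketch: funnel all map output through a single reduce task.
public class SingleReduceConfig {

    public static JobConf configure(Class jobClass) {
        JobConf conf = new JobConf(jobClass);
        conf.setJobName("multisearch");

        //One reduce task means one output file; whether that also means one
        //merged output object is the question I sent to the list.
        conf.setNumReduceTasks(1);
        conf.setReducerClass(IdentityReducer.class);
        return conf;
    }
}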

Update: It seems to me that the current code I am using for AxisClient is not a great option. Below is the example:

//Generated Axis stubs: the locator hands back the port type, and the port
//returns that service's own ResultSet class.
edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcherService service = new edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcherServiceLocator();
edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcher_PortType lemur = service.getLemurIndriSearcher();
edu.arsc.multisearch.backend.lemur.indriws.ResultSet rs = lemur.search(query);

The obvious flaw is that we need to know a great deal about the system we're contacting, including package names. This hurts Multisearch's flexibility, although it is much nicer than the code I have worked with before:

//(assumes the usual org.apache.axis.client, org.apache.axis.encoding, and
//javax.xml.namespace imports)
Service service = new Service();
Call call = (Call) service.createCall();

//Let's make sure that the call knows how to deserialize the
//given WebService items, in this case the LuceneBackend ResultSet
QName qNameRS = new QName("urn:LuceneSearcher", "ResultSet");
BeanSerializerFactory bsfRS = new BeanSerializerFactory(edu.arsc.multisearch.backend.lucene.webservice.ResultSet.class, qNameRS);
BeanDeserializerFactory bdsfRS = new BeanDeserializerFactory(edu.arsc.multisearch.backend.lucene.webservice.ResultSet.class, qNameRS);
call.registerTypeMapping(edu.arsc.multisearch.backend.lucene.webservice.ResultSet.class, qNameRS, bsfRS, bdsfRS);

//Making the call work!
call.setTargetEndpointAddress(new java.net.URL(endpt));
call.setOperationName(new QName("urn:LuceneSearcher", "search"));
call.setReturnClass(edu.arsc.multisearch.backend.lucene.webservice.ResultSet.class);
call.setReturnType(qNameRS);
call.setUseSOAPAction(true);
call.setEncodingStyle(org.apache.axis.Constants.URI_SOAP11_ENC);
call.setSOAPVersion(org.apache.axis.soap.SOAPConstants.SOAP11_CONSTANTS);
call.addParameter("in0", XMLType.XSD_STRING, javax.xml.rpc.ParameterMode.IN);
call.addParameter("rss", qNameRS, edu.arsc.multisearch.backend.lucene.webservice.ResultSet.class, javax.xml.rpc.ParameterMode.OUT);

String[] params = new String[1];
params[0] = query;

edu.arsc.multisearch.backend.lucene.webservice.ResultSet wsResultSet =
    (edu.arsc.multisearch.backend.lucene.webservice.ResultSet) call.invoke(params);

I've seen a simplified variation of this code, and if that works we're set. If not, new users of Multisearch might have to load in Clients to be used with the services they add. Side note: the above code works with both Lemur and Lucene objects... odd!

Right now I think providing-your-own-client would be the easiest thing to do, as the code is simpler and easier to read. Assuming that each service would build its own client is logical, but they would have to extend Client.java.
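
To make that concrete, the contract I have in mind for provide-your-own-client is tiny. This is only a sketch of what Client.java might require of a subclass, not its actual contents; the ResultSet here stands for Multisearch's own result class.

//Sketch of the contract a service-specific client would have to meet.
//Each backend (Lucene, Lemur, ...) would ship a subclass that knows how to
//reach its own web service and hand back a common ResultSet.
public abstract class Client {

    //The endpoint this client talks to, e.g. an Axis service URL.
    protected String endpoint;

    public void setEndpoint(String endpoint) {
        this.endpoint = endpoint;
    }

    //Run the query against the remote service and translate whatever comes
    //back into Multisearch's ResultSet type.
    public abstract ResultSet search(String query) throws Exception;
}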

Wednesday, 9 July
The good news is that I feel off to a great start in the work with Hadoop right now. The bad news is that I'm still not really ready to tinker with the code, so for the moment I'm again left with pen and paper, which I want to turn into code today.

Things Hadoop Does that we Don't Necessarily Want
 
"Typically both the input and the output of the job are stored in a file-system." We don't need to store the output into a file-system.
 
"Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes." (Part of Job Configuration) We need to be more specific than some applications.

I'm stepping through the requirements of the key/value pairs right now, compiling them as I go (with no functioning code, just making sure I have all the right components.)

I have a very basic skeleton set up that doesn't actually function at all. It uses all of the Hadoop architecture and structure, but the functions don't do anything yet. I might work a little more on the classes, but right now I need to develop two "type" classes: Client and FinalSet. It may be that we only use the Map() functionality of Hadoop and then have a synchronized merge() function in the FinalSet.
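
As a sketch of the FinalSet idea: if the merge ends up happening outside of Reduce, the class mostly just needs a synchronized merge() so results coming back from different services can be folded in safely. The getDocuments() call and the raw list are placeholders for whatever the real ResultSet carries.

import java.util.ArrayList;
import java.util.List;

//Sketch only: accumulates rescored documents from every service's ResultSet.
public class FinalSet {

    //Raw list of whatever document type ResultSet ends up carrying.
    private final List documents = new ArrayList();

    //Fold one service's results into the final ranked set. Synchronized so
    //that ResultSets returned by different services can be merged safely.
    public synchronized void merge(ResultSet results) {
        //The real version would rescore here with the selected merge
        //algorithm (NaiveMerge, etc.) before inserting into the ranked list.
        documents.addAll(results.getDocuments());
    }
}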

I'm presently looking through org.apache.hadoop.mapred.lib, trying to figure out which type of output collector I should use. There is some functionality, like IdentityReducer<K,V> that can be used in place of MergeReduce<K,V> should it be found that the merge() function should not be a part of the reduce.

Right now, I am thinking that the Reduce() function would do the following:

  1. Take in the output collected from Map()
  2. Merge it into the final set
  3. Output a final set

This means that if I use a reduce function, it would require an additional function to reduce everything to the same object. I think I might use the IdentityReducer instead, which makes more sense, and merge the output from the output collector, if possible.

As of 5:30pm, I have some bare-bones code for the Map, the main function, and so on. I realized I should also be working on the input/output files that will be handled in this process (especially the file that handles the Service information). I'll be doing that mostly tomorrow.

Tuesday, 8 July
Chris and I met today to talk about the goals of NaiveMerge in general, and how it does what it does. Since it assumes scores will range between 0 and 1, all sets will need to be normalized to this (via the standard below) in order to have the algorithm working at its best. There may be another method of doing this, where not all the top scores will be 1 and not all the bottom scores will be 0, but this is a cheap-and-dirty way of pushing the performance of Naive Merge.

I'm also re-doing some of the mathematical code for Naive Merge, since there are few comments and it is insanely hard to debug. I'll be looking into quick methods of calculating results today, such as special math libraries.

I'm also dealing with a major Lemur issue involving key indexing, which I posted on the Lemur forum. The BuildIndex function isn't generating a .key file, which prevents the index from being used programmatically. Once I fix this, we can have various searches applied to the same indexes, including TFIDF and Okapi!

I am not sure if I can find a math library for NaiveMerge's least-squares needs, although I think I can figure out how to implement the math with comments.

I'm not making much progress with Naive Merge, so I'll turn over to Hadoop for the rest of the day. I think I am going to start playing with the code by late afternoon.

After working a bit more with the code, I am finding that the outline I did needs more solid work before I can properly implement the code. I'm working from the details provided in the MapReduce tutorial, but a lot of the underpinnings are subtle.

I've given up for the day on the code--I'll be starting bright and early tomorrow.

Monday, 7 July
Chris got back to me on how to work with negative numbers in Naive Merge. The trouble is, multiplying the scores by -1 flips them around, so the worst-scoring documents become the best-scoring documents, and that's definitely not what we want. Also, the merge algorithm is happier with scores between 0 and 1.

Solution from Chris
First translate the score s in a ranked list with max score max_I and min score min_I to all positive numbers
s <- s - min_I
 
Then scale the scores
s <- s / (max_I - min_I)
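
In code this is only a couple of lines. A quick sketch, assuming the scores for one ranked list arrive as a double[] (the zero-range check just keeps a constant list from dividing by zero):

//Shift a list's scores to be non-negative, then scale them into [0, 1].
public class ScoreNormalizer {

    public static void normalize(double[] scores, double min, double max) {
        double range = max - min;
        for (int i = 0; i < scores.length; i++) {
            scores[i] = (range == 0.0) ? 0.0 : (scores[i] - min) / range;
        }
    }
}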

Today I've been working almost exclusively with paper and pen, working through the finer points of Hadoop's architecture, for example the Writable object code needed for the Map/Reduce. Right now, I'm looking at some simple step-by-step outlines as well as the "pickier" areas that need to be defined.

Map - ServerMap
Inputs Needed: URI of Server, Type of Service offered, Query

  1. Create instance (or cast-up) Client to Call Service
  2. Call the Service, return the appropriate ResultSet
  3. If necessary, translate the ResultSet into something readable
  4. Output <query, ResultSet> pair
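
Translating that outline into the Hadoop Mapper interface comes out roughly like this. It's a sketch against the old org.apache.hadoop.mapred API, and ServiceWritable, ResultSetWritable, and Client are classes I still have to write, so every non-Hadoop name (and the query property) is a placeholder.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

//Sketch: one map call = one remote search service invoked.
public class ServerMap extends MapReduceBase
        implements Mapper<Text, ServiceWritable, Text, ResultSetWritable> {

    private String query;

    //Assumption: the query rides in on the JobConf under a property name.
    public void configure(JobConf conf) {
        query = conf.get("multisearch.query");
    }

    public void map(Text key, ServiceWritable service,
                    OutputCollector<Text, ResultSetWritable> output,
                    Reporter reporter) throws IOException {
        try {
            //1. create (or cast up) the Client named by this service record
            Client client = (Client) Class.forName(service.getClientClass()).newInstance();
            client.setEndpoint(service.getUrl());

            //2 + 3. call the service and wrap its ResultSet in a Writable
            ResultSetWritable results = new ResultSetWritable(client.search(query));

            //4. emit the <query, ResultSet> pair for the merge step
            output.collect(new Text(query), results);
        } catch (Exception e) {
            throw new IOException(e.toString());
        }
    }
}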

Reduce - MergeReduce
Inputs Needed: Multiple ResultSets given the same Query

  1. OutputCollector gives <query, ResultSet> pair to MergeReduce
  2. Each ResultSet's document scores are recalculated and set
  3. The new scores are used to place the documents at their new ranked positions in the FinalSet
  4. Output <query, FinalSet> pair
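
And the reduce half of the outline, again only a sketch with placeholder types (ResultSetWritable, FinalSetWritable, FinalSet); the actual rescoring lives in whichever merge algorithm gets selected.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

//Sketch: every ResultSet for one query arrives here and is folded into one FinalSet.
public class MergeReduce extends MapReduceBase
        implements Reducer<Text, ResultSetWritable, Text, FinalSetWritable> {

    public void reduce(Text query, Iterator<ResultSetWritable> values,
                       OutputCollector<Text, FinalSetWritable> output,
                       Reporter reporter) throws IOException {
        FinalSet finalSet = new FinalSet();
        while (values.hasNext()) {
            //rescore this service's documents and slot them into the ranked set
            finalSet.merge(values.next().get());
        }
        //one <query, FinalSet> pair out, assuming a single reduce task
        output.collect(query, new FinalSetWritable(finalSet));
    }
}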

I'm almost ready to start coding the Hadoop software components, except I would like to work on Chris's suggestion of correcting the LemurIndri scores for Multisearch before I do that, since I know it will be a problem in the future.

This week's mighty to-do list is as follows:

  1. JobConf - MultisearchJob.class
  2. OutputKeyClass - ResultSetKey.class
  3. OutputValueClass - ResultSetValue.class
  4. MapperClass - ServiceMap.class
  5. CombinerClass - for weeding out repetition, we might not need this
  6. ReducerClass - multisearch-merge-class-selected.class (different per selection)
  7. InputFormat - ?
  8. OutputFormat - ?

I hope to get most of this done before the end of the week, and have a basic variation running. I'm still not sure what to do with the Input/Output Formats. There is a lot of work to be done with splitting and such in this area, which I am not familiar with. Getting the InputFormat to work might be a bit trickier since there are logical splits, but I can write my own.
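
For reference, here is roughly how those list items would wire into a JobConf driver. Everything named here other than the Hadoop classes is still a placeholder from the list above, and items 7 and 8 stay at Hadoop's defaults until I sort out the splitting.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

//Sketch of the driver; every Multisearch class here is still to be written.
public class MultisearchJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultisearchJob.class);
        conf.setJobName("multisearch");

        conf.setOutputKeyClass(Text.class);        //stand-in for ResultSetKey
        conf.setOutputValueClass(ResultSetValue.class);
        conf.setMapperClass(ServiceMap.class);
        conf.setReducerClass(MergeReduce.class);   //or IdentityReducer for now
        conf.setNumReduceTasks(1);

        //Items 7 and 8 (InputFormat/OutputFormat) are still open questions;
        //until then the text defaults apply. Older releases use
        //conf.setInputPath()/setOutputPath() instead of these helpers.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}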

NaiveMergeSet: The good news is that the mechanism used to re-score negative scores to positive ones worked, but when it gets put through the merge algorithm, it comes out to NaN. I'm still looking into this...

Update: I've added some error checking to the code, and found out that it would divide by zero (no checking) and then y = Infinity. I fixed it.
