
Arctic Region Supercomputing Center

Multisearch Log

These are the log files from 21-25 July. For additional log files, please return to the main log page.

Friday, 25 July
I've made a lot of progress with Multisearch now that Hadoop is fully functioning with it. Lemur works as well--it turned out there were ^M (carriage return) and end-of-line characters in Lemur's results. I've trimmed them out and now it parses correctly!
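For my own reference, here's a minimal sketch of that trimming step (the class and method names are just illustrative, not the actual Multisearch code):

public class LemurTitleCleaner {
    /** Strip the carriage returns (^M) and newlines that Lemur leaves on result fields. */
    public static String clean(String raw) {
        return raw.replace("\r", "").replace("\n", "").trim();
    }
}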

Next on the agenda is cleaning up the code so the comments are clear and all of that...

Output Text from 25 July: Query "War Tactics during World War"

I've added a TRECOutputFormat option, so anyone who wants TREC-formatted results only has to change the output format declared in main().
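Roughly, that wiring looks like this in the old mapred API (a sketch only; TRECOutputFormat is my custom class and the driver details here are simplified):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OutputFormatChoice {
    public static void configure(JobConf conf, boolean trec) {
        // Swap the output format in one place; the rest of the job stays the same.
        if (trec) {
            conf.setOutputFormat(TRECOutputFormat.class);   // custom TREC-style format
        } else {
            conf.setOutputFormat(TextOutputFormat.class);
        }
    }
}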

To Do for 'clean up'

  1. Revert FinalSet/NaiveMergeSet to their original methods and functionality
  2. Modify ServiceWritable so that readFields() and write() work properly
  3. Modify DocumentWritable so that readFields() and write() work properly
  4. Merge ResultSet() and ResultSetWritable(), since they don't need to be separate objects
  5. Go through the various objects and make sure each is still needed as-is

To Do for added functionality

  1. Add TREC output format printing
  2. Add a variable to the configuration files to select the merge function (or declare it in the command-line, or both)
  3. Add the other merge options (LeapOfFaith, Rank Shuffle)
  4. Test the file-input Multisearch interface to see how effective it is
  5. Add more Clients to deal with additional backends, namely OGSA-DAI backends
  6. Solve the Lemur Indexing issue so that different .key searches can be used

Thursday, 24 July
There is a possibility that the ServiceWritable objects being generated are empty instances too, which would explain why the ResultSetWritable objects are producing null documents. I'll be looking into this, but I suspect reflection is at work again... Update: No, the ServiceWritable objects are fine.

This means the issue occurs after the ResultSetWritable has been created and sent to the OutputCollector at the end of the map() function... because in the reduce() function, new ResultSetWritable objects are being generated, apparently by reflection.

Small Note: Adding MergeReduce as a combiner as well as a reducer made Hadoop MUCH faster.
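The change itself is just one extra line in the JobConf setup (a sketch, assuming the usual driver in main()):

import org.apache.hadoop.mapred.JobConf;

public class CombinerWiring {
    public static void wire(JobConf conf) {
        conf.setReducerClass(edu.arsc.multisearch.MergeReduce.class);
        // Running the same class as a combiner merges result sets locally during the
        // map-side spill, so far less data has to be sorted and shuffled.
        conf.setCombinerClass(edu.arsc.multisearch.MergeReduce.class);
    }
}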

From the Hadoop Map/Reduce tutorial: "The intermediate, sorted outputs are always stored in files of SequenceFile format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the JobConf."

This could be generating errors with the ResultSetWritable objects; I'll look into it.

Update: You can run Hadoop with the number of reduce tasks set to 0, which makes the output get written directly after the map() function. This works perfectly: all the output is printed to the file. So it is clearly a problem between the map() output and the iterator...
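For later reference, the zero-reduce trick is just this (assuming the usual JobConf driver):

import org.apache.hadoop.mapred.JobConf;

public class MapOnlyDebug {
    public static void mapOnly(JobConf conf) {
        // With zero reduce tasks, map output goes straight to the output files,
        // which makes it easy to check what map() actually emitted.
        conf.setNumReduceTasks(0);
    }
}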

The query government equity is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- job_local_1
java.lang.NullPointerException
at edu.arsc.multisearch.ResultSetWritable.write(ResultSetWritable.java:47)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
at org.apache.hadoop.mapred.MapTask$CombineOutputCollector.collect(MapTask.java:1079)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:74)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:872)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:691)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
@Multisearch.run(): java.io.IOException: Job failed!
complete.

According to Hadoop's Map/Reduce tutorial, all intermediate map outputs are stored in SequenceFile format, which confused me... but it seems that for a Writable object to be serialized to the DataOutput, it must override the write() function to do this itself... and I'm assuming readFields() reads that output back into an object.

UPDATE: Indeed, this was the problem! I've managed to write a (rather messy) out/in variation where the fields are separated by equal signs (=). This means equal signs cannot appear in the titles/filenames of documents, nor in the info or names of the services...
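A minimal sketch of what that out/in pair looks like (the field names here are made up; the real ResultSetWritable carries a service plus its documents):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class DelimitedWritable implements Writable {
    private String serviceName = "";
    private String docTitle = "";

    public DelimitedWritable() { }   // no-arg constructor, needed for deserialization

    public void write(DataOutput out) throws IOException {
        // Flatten the fields into a single "="-delimited string, as described above.
        out.writeUTF(serviceName + "=" + docTitle);
    }

    public void readFields(DataInput in) throws IOException {
        // Split back into fields; this breaks if any field contains an equal sign.
        String[] parts = in.readUTF().split("=", -1);
        serviceName = parts[0];
        docTitle = parts[1];
    }
}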

Lemur results are not parsing correctly through the write() and readFields() functions, but Lucene results are... this is probably due to something Lemur adds to titles (an end-of-line character or similar). But otherwise it works, huzzah!

Wednesday, 23 July
I'm currently reviewing some of the issues with Multisearch, step by step. I'll probably be posting some architecture notes... Also, I can't seem to throw Exceptions from the Map() function.

Good News: I understand where the problem is!
Bad News: I don't understand why it's an error.

Output of Multisearch when run with a Single <service> input
 
The query economic state of europe is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=
- No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+192
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- reduce > reduce
- Task 'reduce_jupman' done.
- Saved output of task 'reduce_jupman' to file:/home/mccormic/merge/programs/hadoopOutput
- Job complete: job_local_1
- Counters: 11
- File Systems
- Local bytes read=26085
- Local bytes written=52258
- Map-Reduce Framework
- Map input records=0
- Map output records=0
- Map input bytes=0
- Map output bytes=0
- Combine input records=0
- Combine output records=0
- Reduce input groups=0
- Reduce input records=0
- Reduce output records=0

Output of Multisearch when run with an additional <service> input
 
The query economic state of europe is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=
- No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+384
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- job_local_1
java.lang.NullPointerException: Hits is null before fromNeg is run.
at edu.arsc.multisearch.merge.NaiveMergeSet.merge(NaiveMergeSet.java:75)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:43)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
@Multisearch.run(): java.io.IOException: Job failed!
complete.

So, apparently I wasn't generating any Map() functions at all! Which explains why I wasn't getting any of my errors thrown! However, now I'm back to the original Reduce() issue, which is that somewhere between the Output Collector and the Reduce() function, my objects are turning null...

Turns out (as per below) that there is some reflection going on with the Iterator!

java.lang.RuntimeException: java.lang.NoSuchMethodException: edu.arsc.multisearch.ResultSetWritable.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.readNextValue(ReduceTask.java:291)
at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:232)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:311)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:38)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
Caused by: java.lang.NoSuchMethodException: edu.arsc.multisearch.ResultSetWritable.<init>()
at java.lang.Class.getConstructor0(Class.java:2706)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
... 9 more

So, what this is telling me is that the Iterator, which I thought was iterating through ResultSetWritables from the OutputCollector that all had the same Key, is using reflection to generate new instances....?
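As far as I can tell, the values iterator does roughly this under the hood (a simplified sketch, not the actual ReduceTask code):

import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class IteratorSketch {
    public static ResultSetWritable nextValue(Configuration conf, DataInput in)
            throws IOException {
        // The iterator builds a fresh instance by reflection (hence the need for a
        // public no-argument constructor) and refills it from the serialized bytes,
        // rather than handing back the objects that map() gave the OutputCollector.
        ResultSetWritable value = ReflectionUtils.newInstance(ResultSetWritable.class, conf);
        value.readFields(in);
        return value;
    }
}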

The query economic state of europe is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+3184
- map 100% reduce 0%
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+3184
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+3184
- file:/home/mccormic/merge/programs/inputfiles/xmlinput.txt:0+3184
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- map 100% reduce 0%
- reduce > reduce
- map 100% reduce 68%
- reduce > reduce
- reduce > reduce
- reduce > reduce
- Task 'reduce_4mnhn3' done.
- Saved output of task 'reduce_4mnhn3' to file:/home/mccormic/merge/programs/hadoopOutput
- Job complete: job_local_1
- Counters: 11
- File Systems
- Local bytes read=1494313668
- Local bytes written=3011899723
- Map-Reduce Framework
- Map input records=15
- Map output records=15
- Map input bytes=-2991
- Map output bytes=375
- Combine input records=0
- Combine output records=0
- Reduce input groups=1
- Reduce input records=15
- Reduce output records=1
complete.

So now Hadoop will run, but the Document[] objects are still all null. I sent a detailed question to the mailing list about this issue, and while I wait I've re-launched LemurIndriSearch so it reflects the same objects that we want.

To-Do: Re-code the NaiveMergeSearch/Reducer phase so that the NMS object works as it did in the old Multisearch. Then return the Document[] objects to be set into a ResultSetWritable for output collection. Also, create a mechanism that allows alternative merge algorithms to be used.
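A rough sketch of how that merge-selection mechanism could look, using a configuration key and reflection (the key name is made up, and the merge classes would need a common interface):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class MergeSelector {
    public static final String MERGE_CLASS_KEY = "multisearch.merge.class";   // assumed key

    public static Object newMergeSet(Configuration conf) throws ClassNotFoundException {
        // Default to NaiveMergeSet; LeapOfFaith or RankShuffle could be named in the
        // configuration file or passed on the command line instead.
        Class<?> cls = conf.getClassByName(
                conf.get(MERGE_CLASS_KEY, "edu.arsc.multisearch.merge.NaiveMergeSet"));
        return ReflectionUtils.newInstance(cls, conf);
    }
}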

Tuesday, 22 July
I've spent a lot of time walking through the Hadoop architecture. Here's the thing: at the Map() phase, I am getting back a full, functioning result set from over the wire. At the Reduce() phase, all the objects are null.

At first I thought this was an output collector issue, but it could also be a simple record-reader issue. I'm going to look through the Documentation a bit more to see what the Iterator() does specifically and what the get() and set() functions of a writable object are supposed to contain before I modify any code. It could be that the RecordReader is going to the next option, which of course is null right now...

I've thoroughly reviewed the RecordReader options. I am fairly certain that ServiceRecordReader is fully functioning, since there are no more NullPointerExceptions! However, no output is being written to the file. Now I'll be looking into issues with the Writer object.

Some useful classes from Hadoop: FileOutputFormat, MapOutputFormat, SequenceFileOutputFormat, TextOutputFormat, OutputFormatBase.

Overall, I am not getting any results or error codes back from Hadoop, and it looks like it's not reporting anything. I think this might be because the Map/Reduce classes are not static as they are in the WordCount example. This hasn't gotten me anywhere yet...
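For comparison, this is the WordCount-style layout I mean, with the mapper as a public static nested class so Hadoop can instantiate it by reflection (the types are placeholders, not the real ServiceMap signature):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Driver {
    public static class StaticMap extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            output.collect(key, value);   // placeholder body
        }
    }
}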

Update: I am not getting any of the exceptions thrown from my code in Hadoop as I was before, which leads me to believe I've modified something I shouldn't have. If I use the code outside of Hadoop, the errors are all thrown. I'm not sure whether the objects are even being used, but I can tell that the ServiceRecordReader does throw its errors in Hadoop; it's the last step that does. I'll walk through ServiceMap and the other Hadoop objects tomorrow.

Monday, 21 July
I'm still getting NullPointerExceptions from the web services I created. However, the ResultSet object is fully functional other than returning no Document objects. I'm currently looking at the Axis 1.2 User Guide to see if I can figure out why this is. It refers to code similar to the AxisClient.java code I've used before.

I'm looking in $CATALINA_HOME/logs, and I did find the following:

java.lang.NullPointerException
at org.apache.catalina.loader.WebappClassLoader.findResourceInternal(WebappClassLoader.java:1774)
at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:1575)
at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:860)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1307)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1189)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at org.apache.axis.AxisFault.addHostnameIfNeeded(AxisFault.java:877)
at org.apache.axis.AxisFault.initFromException(AxisFault.java:280)
at org.apache.axis.AxisFault.<init>(AxisFault.java:181)

I think this means I need to serialize the Document object... Update: I've added a BeanFactory and all that jazz for Document, but it still isn't working in Hadoop. I have found a package in Hadoop, though, specifically for IPC/RPC clients/services, so it can be done!
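For the record, the "BeanFactory and all that jazz" step on the Axis 1.x client side looks roughly like this (the namespace is made up, and Document is assumed to be a bean with a default constructor and getters/setters):

import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;
import org.apache.axis.encoding.ser.BeanDeserializerFactory;
import org.apache.axis.encoding.ser.BeanSerializerFactory;

public class DocumentMapping {
    public static Call register(Service service) throws Exception {
        Call call = (Call) service.createCall();
        QName qn = new QName("urn:multisearch", "Document");   // assumed namespace
        // Tell Axis how to (de)serialize the Document bean on the wire.
        call.registerTypeMapping(Document.class, qn,
                new BeanSerializerFactory(Document.class, qn),
                new BeanDeserializerFactory(Document.class, qn));
        return call;
    }
}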

Well, so far no luck on that front. I've made the Document object a Writable, but that doesn't seem to have worked for sending anything across the wire.

Update: I have now found that there is an issue on the WebService side when Hadoop calls it. The try section is either being skipped or has another issue--because it's returning the null object at the bottom... Good news: I am getting results over the wire, but now the NullPointerException comes up during the Reduce() phase.

java edu.arsc.multisearch.Multisearch -q government spending patterns
The query government spending patterns is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=id34390
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/xmlinput.txt:0+192
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- map 100% reduce 0%
- job_local_1
java.lang.NullPointerException: Document is null for
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:46)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
@Multisearch.run(): java.io.IOException: Job failed!

I think this might be an output-collection error of some kind, since it seems like a new, blank ResultSet object is being used, not an established one...

Good news: I am getting full result sets over the wire from Lucene.
Good news: After a bit of swapping, I have an appropriate Map/Reduce combination.
Bad news: I still am having some issues with NullPointers in the Reduce section.
Good news: The reduce section seems to be working.
Bad news: Nothing is being written out.
