
Arctic Region Supercomputing Center

Multisearch Log

These are log files from 14-18 July. For additional log files, please return to the main log page.

Friday, 18 July
I'm still trying to figure out the NullPointerException. I've been running searches on a tester backend, which have been producing viable results some of the time. It could be that the Document object is not being properly serialized by the Axis stubs, or something along those lines.

I have tried a few options while attempting to figure out why I am getting a null Document[] set. I first tried the messier BeanSerializerFactory version of the code, which did not work. Then I attempted to change the ResultSet object (starting with Lucene) so that it carried only one kind of custom object, since before it was a ResultSet holding a Document[]. That also failed.

The thing is, I am getting values for info and name, both of which the backend returns with the ResultSet object. I'll be looking more into this...

I've reverted to the initial Document[] within the ResultSet object, since it is neater. I'm still not certain why the searches are returning null documents, though.
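For reference, the "messier" BeanSerializerFactory approach mentioned above amounts to registering the type mappings on the Axis Call by hand. The sketch below is just that, a sketch: the endpoint, namespace, and operation name are placeholders, and Document/ResultSet stand in for our own bean classes rather than being the real Multisearch client code.

import java.net.URL;
import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;
import org.apache.axis.encoding.ser.BeanDeserializerFactory;
import org.apache.axis.encoding.ser.BeanSerializerFactory;

public class ManualMappingClient {
    // Placeholder endpoint/namespace; Document and ResultSet are our own beans.
    public ResultSet search(String query) throws Exception {
        Call call = (Call) new Service().createCall();
        call.setTargetEndpointAddress(new URL("http://localhost:8080/axis/services/LuceneSearch"));
        call.setOperationName(new QName("urn:Multisearch", "search"));

        // Tell Axis how to (de)serialize the custom beans inside the response.
        QName docQ = new QName("urn:Multisearch", "Document");
        call.registerTypeMapping(Document.class, docQ,
                new BeanSerializerFactory(Document.class, docQ),
                new BeanDeserializerFactory(Document.class, docQ));
        QName rsQ = new QName("urn:Multisearch", "ResultSet");
        call.registerTypeMapping(ResultSet.class, rsQ,
                new BeanSerializerFactory(ResultSet.class, rsQ),
                new BeanDeserializerFactory(ResultSet.class, rsQ));

        // If the Document mapping is missing or wrong, the Document[] inside the
        // returned ResultSet may come back null even though fields like info
        // and name on the ResultSet itself survive.
        return (ResultSet) call.invoke(new Object[] { query });
    }
}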

Thursday, 17 July
I figured out why the Map/Reduce was freezing on Reduce: the Map() was assigned but never resolving, because a while-loop inside it wasn't finishing.

The query government spending is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=id34390
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
@ServiceMap: Cannot instantiate given client class.java.lang.InstantiationException: edu.arsc.multisearch.LemurIndriClient
- Cannot instantiate given client class : java.lang.InstantiationException: edu.arsc.multisearch.LemurIndriClient

- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- reduce > reduce
- Task 'reduce_fso9tg' done.
- Saved output of task 'reduce_fso9tg' to file:/home/mccormic/merge/programs/hadoopOutput
- Job complete: job_local_1
- Counters: 11
- File Systems
- Local bytes read=72505
- Local bytes written=98948
- Map-Reduce Framework
- Map input records=1
- Map output records=0
- Map input bytes=-386
- Map output bytes=0
- Combine input records=0
- Combine output records=0
- Reduce input groups=0
- Reduce input records=0
- Reduce output records=0
complete.

The java.lang.InstantiationException errors are odd, since the classes should be functional, but at least Hadoop is working!
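For the record, java.lang.InstantiationException is what Class.newInstance() throws when a class loaded reflectively has no accessible no-argument constructor, which is presumably what ServiceMap runs into when it builds a client from its class name. A stripped-down illustration (not the project code; the classes below are stand-ins):

public class InstantiationDemo {
    static class Client { }                              // stand-in base class
    static class LemurIndriClient extends Client {
        private final java.net.URL serviceUrl;
        LemurIndriClient(java.net.URL serviceUrl) {      // only constructor takes a URL
            this.serviceUrl = serviceUrl;
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("InstantiationDemo$LemurIndriClient");
        try {
            Client client = (Client) c.newInstance();
        } catch (InstantiationException e) {
            // Fails because there is no no-arg constructor to call reflectively.
            System.err.println("Cannot instantiate given client class: " + e);
        }
    }
}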

I've fixed the Client instances: the problem was that the Client() base class could be constructed with no parameters, while the extended classes all required a URL. I updated it, and now I am getting another error:

The query government spending is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=id34390
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/xmlinput.txt:0+386
- map 100% reduce 0%
@ServiceMap: Client has responded with error. java.lang.NullPointerException - Error from Client: java.lang.NullPointerException
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- reduce > reduce
- Task 'reduce_iqqfg6' done.
- Saved output of task 'reduce_iqqfg6' to file:/home/mccormic/merge/programs/hadoopOutput
- Job complete: job_local_1
- Counters: 11
- File Systems
- Local bytes read=72505
- Local bytes written=98948
- Map-Reduce Framework
- Map input records=1
- Map output records=0
- Map input bytes=-386
- Map output bytes=0
- Combine input records=0
- Combine output records=0
- Reduce input groups=0
- Reduce input records=0
- Reduce output records=0
complete.

I've managed to trace the NullPointerExceptions to a null Document object, which I am fairly certain has to do with the connection to the Axis services. If not, it might be due to the output collector and/or the Iterator.
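The "Documents are null" messages in the run below come from explicit checks rather than bare NPEs. A simplified sketch of that kind of guard on the map side follows; the key/value types, the client constructor, and the accessor names are stand-ins, not the real Multisearch signatures.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified stand-in for ServiceMap; types and names are illustrative only.
public class ServiceMapSketch extends MapReduceBase
        implements Mapper<Text, Text, Text, ResultSetWritable> {

    public void map(Text serviceUrl, Text query,
                    OutputCollector<Text, ResultSetWritable> output,
                    Reporter reporter) throws IOException {
        // Remote Axis call to one search backend.
        ResultSet results = new LemurIndriClient(serviceUrl.toString()).call(query.toString());

        // Complain with context instead of letting a bare NullPointerException escape.
        if (results == null || results.getDocuments() == null) {
            throw new IOException("Backend " + serviceUrl + " returned null documents");
        }
        output.collect(serviceUrl, new ResultSetWritable(results));
    }
}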

The query government spending is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=id34390
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
java.lang.NullPointerException: Documents are null. Curses.
at edu.arsc.multisearch.LemurIndriClient.call(LemurIndriClient.java:53)
at edu.arsc.multisearch.ServiceMap.map(ServiceMap.java:42)
at edu.arsc.multisearch.ServiceMap.map(ServiceMap.java:13)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/xmlinput.txt:0+386
- Task 'job_local_1_map_0000' done.
- Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/merge/programs/hadoopOutput
- map 100% reduce 0%
- job_local_1
java.lang.NullPointerException: MergeReduce has a result set with null documents.
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:53)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
@Multisearch.run(): java.io.IOException: Job failed!
complete.

Wednesday, 16 July
Apparently, according to people on the list, the CompressionCodec is only for compressed files, not all files, so I'll have to modify the code around that to see if I can get rid of the NullPointerException. Update: I've fixed the ServiceRecordReader so it no longer uses the codec code, and now the Map function is working!
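For reference, the usual pattern (Hadoop's own LineRecordReader does something similar) is to treat a null codec as "plain, uncompressed file" rather than dereferencing it unconditionally. A minimal sketch of that idea, not my actual ServiceRecordReader code:

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecAwareOpen {
    // getCodec() returns null for files it does not recognize as compressed
    // (like our plain xmlinput.txt), so the reader branches on null instead
    // of calling createInputStream() on a null codec.
    public static InputStream open(Path file, Configuration conf) throws IOException {
        FSDataInputStream fileIn = file.getFileSystem(conf).open(file);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        return (codec != null) ? codec.createInputStream(fileIn) : fileIn;
    }
}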

The query government spending issues in 1987 is being run.
- Initializing JVM Metrics with processName=JobTracker, sessionId=
- No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
- Total input paths to process : 1
- Running job: job_local_1
- numReduceTasks: 1
- map 0% reduce 0%
- file:/home/mccormic/merge/programs/xmlinput.txt:0+386
- map 100% reduce 0%

Admittedly, now Hadoop has been running Reduce for quite some time, and I suspect that it's not presently working. I'll be looking into that. However, it seems that the Map() is now working!

I checked and found some substantial XML errors, which I have since fixed, but I am still not moving forward into the Reduce() functionality. I've been trying to turn on logging using the Cluster Setup page.

After trying for a few hours to figure out how to set up logging, I have some basic logging (although not the job history, which would be the most helpful), and I am still getting Map 100% Reduce 0% ... and then nothing. I don't think it has anything to do with the Reduce() function itself, because I had IdentityReducer (and now MergeReduce) set as my reduce class, and both produce this same result.

I'm going to creep through the documentation one last time to see if there is anything I'm missing at all. Maybe the list will get back to me soon, too.

Tuesday, 15 July
I have received further clarification on the RecordReader/Writer vs. readFields() and write() functions.

Not quite. Your RecordReader may produce MyWritable records, but readFields may not be involved. For your MyWritable records to get to your reduce, they should implement the Writable interface so the framework may regard them as streams of bytes. Your OutputFormat- which may use your MyWriter- may take the MyWritable objects you emit from your reduce and make them conform to whatever format your spec requires.
 
* Your InputFormat takes XML and provides MyWritable objects to your mapper
* The framework calls MyWritable::write(byte_stream) and MyWritable::readFields(byte_stream) to push records you emit from your mapper across the network, between abstractions, etc.
* Your OutputFormat takes MyWritable objects you emit from your reducer and stores them according to the format you specify
 
With many exceptions, most RecordReaders calling readFields are reading from structured, generic formats (like SequenceFile). -Chris Douglas

I'll be working on getting this last functionality up and running, then seeing what kind of new errors I produce!

Due to the error I got yesterday, java.lang.NoClassDefFoundError: org/apache/commons/httpclient/HttpMethod, I looked into where the class is stored. Commons HttpClient 3.x had to be downloaded and added to $CATALINA_HOME/common/lib.

I've narrowed down the NullPointerException to the following line of code.

final CompressionCodec codec = compressionCodecs.getCodec(file);

All the other objects surrounding this one are valid, I've checked them. I've mailed the list to see if anyone can tell me why it's returning the null object.

Monday, 14 July
Over the weekend, especially through the Hadoop mailing list, I have gathered a list of resources.

I have also gotten a possible solution to the Lemur issue, which seems to be that I must construct the .key file out of the generated files by running BuildIndex again. I'm putting this on the backburner for now, since I want to get rolling on the Hadoop reducing, but I wanted to make note of it here.

User can view the history logs summary in specified directory using the following command
$ bin/hadoop job -history output-dir
This command will print job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed using the following command
$ bin/hadoop job -history all output-dir

I have completed the ServiceRecordReader and am now working on a ResultSetWriter, based on the TextOutputFormat.LineRecordWriter code. I think the smartest move will be to print each document's information, its new score and old score, and its old rank. Then, when the results are read from the OutputStream into some other object, they will be put in order and given a new rank. This will save us the issue of merging into the same FinalSet object.
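A rough sketch of what I mean, using the old mapred RecordWriter interface; ResultSetWritable, DocumentResult, and their accessors are placeholder names rather than the real classes:

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

// One tab-separated line per document: enough information for the reader to
// re-sort and re-rank later. ResultSetWritable/DocumentResult are placeholders.
public class ResultSetWriterSketch implements RecordWriter<Text, ResultSetWritable> {
    private final DataOutputStream out;

    public ResultSetWriterSketch(DataOutputStream out) {
        this.out = out;
    }

    public void write(Text service, ResultSetWritable results) throws IOException {
        for (DocumentResult doc : results.getDocuments()) {
            out.writeBytes(service + "\t" + doc.getName() + "\t"
                    + doc.getNewScore() + "\t" + doc.getOldScore() + "\t"
                    + doc.getOldRank() + "\n");
        }
    }

    public void close(Reporter reporter) throws IOException {
        out.close();
    }
}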

Update: I have gotten Hadoop Map/Reduce started! Of course, that simply means I have an error, but I still have code that can run and produce an error! That's very exciting!

java.lang.NullPointerException
at edu.arsc.multisearch.ServiceRecordReader.close(ServiceRecordReader.java:212)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:166)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
Exception in thread "Thread-7" java.lang.NoClassDefFoundError: org/apache/commons/httpclient/HttpMethod
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:236)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.HttpMethod
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
... 1 more
@Multisearch.run(): java.io.IOException: Job failed!

I am fairly certain that all my errors center on the readFields() and write() functions of the serialized Writable implementation, along with their relationship to the RecordReader I've created. I wrote to the list, and, luckily, I have an e-mail to work with now:

It's easiest to consider write as a function that converts your record to bytes and readFields as a function restoring your record from bytes. So it should be the case that:
 
MyWritable i = new MyWritable();
i.initWithData(some_data);
i.write(byte_stream);
...
MyWritable j = new MyWritable();
j.initWithData(some_other_data); // (1)
j.readFields(byte_stream);
assert i.equals(j);
 
Note that the assert should be true whether or not (1) is present, i.e. a call to readFields should be deterministic and without hysteresis (it should make no difference whether the Writable is newly created or if it formally held some other state). readFields must also consume the entire record, so for example, if write outputs three integers, readFields must consume three integers. Variable-sized Writables are common, but any optional/variably sized fields must be encoded to satisfy the preceding.
 
So if your MyBigWritable record held two ints (integerA, integerB) and a MyWritable (my_writable), its write method might look like:
 
out.writeInt(integerA);
out.writeInt(integerB);
my_writable.write(out);
 
and readFields would restore:
 
integerA = in.readInt();
integerB = in.readInt();
my_writable.readFields(in);
 
There are many examples in the source of simple, compound, and variably-sized Writables.
 
Your RecordReader is responsible for providing a key and value to your map. Most generic formats rely on Writables or another mode of serialization to write and restore objects to/from structured byte sequences, but less generic InputFormats will create Writables from byte streams. TextInputFormat, for example, will create Text objects from CR-delimited files, though Text objects are not, themselves, encoded in the file. In contrast, a SequenceFile storing the same data will encode the Text object (using its write method) and will restore that object as encoded.
 
The critical difference is that the framework needs to convert your record to a byte stream at various points- hence the Writable interface- while you may be more particular about the format from which you consume and the format to which you need your output to conform. Note that you can elect to use a different serialization framework if you prefer.
 
If your data structure will be used as a key (implementing WritableComparable), it's strongly recommended that you implement a RawComparator, which can compare the serialized bytes directly without deserializing both arguments. -Chris Douglas

I am going to clarify what this means structurally to make sure I don't tinker with the wrong parts. Essentially, Hadoop is byte-oriented, whereas Multisearch is record-oriented, so I need to convert between the two in these functions.
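As a reminder to myself, a minimal compound Writable in the style Chris describes looks something like the following (the field names are illustrative, not Multisearch's actual ones): write() turns the record into bytes, and readFields() restores it, consuming fields in exactly the order they were written.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Illustrative field names only; the point is that write and readFields mirror
// each other exactly, field for field, in the same order.
public class DocumentWritable implements Writable {
    private Text name = new Text();
    private double score;
    private int oldRank;

    public void write(DataOutput out) throws IOException {
        name.write(out);              // record -> bytes
        out.writeDouble(score);
        out.writeInt(oldRank);
    }

    public void readFields(DataInput in) throws IOException {
        name.readFields(in);          // bytes -> record, in the same order
        score = in.readDouble();
        oldRank = in.readInt();
    }
}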
