
Arctic Region Supercomputing Center

Multisearch Log

These are the log entries from 4-8 August. To see more entries, please go to the main log page.

Friday, 8 August
Huzzah! The Servlet can now run a Hadoop search! Of course, I still need to take the output and drop it back into the servlet, but the job executes cleanly. The problem was (apparently) rooted in the .jar files: I had to copy them to both $CATALINA_HOME/common/lib and $CATALINA_HOME/webapps/axis/WEB-INF/lib in order to get it to run properly. These errors didn't show up at first, but they came up today.
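
To sketch the general shape of the servlet side: an HttpServlet that builds a JobConf and hands it to JobClient.runJob() (the old mapred API). The class name, paths, and the "multisearch.query" parameter below are just placeholders, and the real Multisearch mapper/input-format setup is omitted.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Placeholder sketch: a servlet that launches a Hadoop search job and waits for it.
// The actual Multisearch job configuration (mapper, input format, output handling)
// is not shown here.
public class MultisearchServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String query = req.getParameter("q");                 // query terms from the request
        JobConf conf = new JobConf(MultisearchServlet.class);
        conf.setJobName("multisearch");
        conf.set("multisearch.query", query);                 // pass the query to the job
        FileInputFormat.setInputPaths(conf, new Path("input"));   // <service> records
        FileOutputFormat.setOutputPath(conf, new Path("output/" + System.currentTimeMillis()));
        JobClient.runJob(conf);                               // blocks until the job finishes
        PrintWriter out = resp.getWriter();
        out.println("Search complete; results are in the job's output directory.");
    }
}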

Update: Now the servlet is fully functional, including cleanup! The only things that need to be added are the following:

  • Query parsing (full)
  • Restriction algorithms

Thursday, 7 August
I'm working on the Servlet on Snowy, since Nimbus runs the newer version of Tomcat. When I do get it working, I'll try dropping it in on Nimbus. The Servlet is now available on Snowy, although it is not entirely functional.

I've gotten the servlet to operate, but right now I'm not able to search. Temporary output files are being generated, though, which means some part of Hadoop is being started and stopped during the search the servlet runs. I'm not sure where the error output would be generated, but I'm going to try a few things.

I'm looking into Nutch's servlet, which is called Cached. I still don't have many hints, but I am working on some of the writing parts while I wait to hear back from the mailing list. I know Hadoop is used in webapps, so it is definitely possible.

Wednesday, 6 August
I've reformatted the input values so they can be given in any order:

Multisearch Command-Line Input
 
-q query terms separated by spaces
-m merge.class.name (full class name, including package)
-r select.class.name (full class name)
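
To sketch how these order-independent flags could be consumed (the class and variable names below are only illustrative, not the real Multisearch parser):

// Illustrative flag parsing only; the real parser also has to normalize the
// query and load the merge/select classes by name.
public class ArgsSketch {
    public static void main(String[] args) {
        String query = null, mergeClass = null, selectClass = null;
        for (int i = 0; i < args.length; i++) {
            if ("-q".equals(args[i])) {
                // gather every token up to the next flag as a query term
                StringBuilder q = new StringBuilder();
                while (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                    if (q.length() > 0) q.append(' ');
                    q.append(args[++i]);
                }
                query = q.toString();
            } else if ("-m".equals(args[i]) && i + 1 < args.length) {
                mergeClass = args[++i];      // full class name, including package
            } else if ("-r".equals(args[i]) && i + 1 < args.length) {
                selectClass = args[++i];     // full class name
            }
        }
        System.out.println(query + " | " + mergeClass + " | " + selectClass);
    }
}

The merge and select classes can then be instantiated by name, e.g. with Class.forName(mergeClass).newInstance().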

I still need to do some "basics" on the query, like normalization, but right now I am running my own queries, so that shouldn't be a problem. The major to-dos I am now working on include:

  • Getting Snowy's OGSA-DAI backends to run with Multisearch (Pileus's already work fine)
  • Enabling batch Multisearch query generation
  • Solving the Lemur indexing issue (.key file)
  • Adding a servlet component to Multisearch

Since the vast majority of the backends I loaded were on Snowy, I want to get those working. I'm currently getting the following error:

uk.org.ogsadai.client.toolkit.exception.ServiceCommsException: A problem arose during communication with service http://snowy.arsc.alaska.edu:8080/axis/services/ogsadai/Gov2259?WSDL.
at uk.org.ogsadai.client.toolkit.GenericServiceFetcher.findDataService(GenericServiceFetcher.java:212)
at uk.org.ogsadai.client.toolkit.GenericServiceFetcher.getDataService(GenericServiceFetcher.java:71)
at edu.arsc.multisearch.client.LuceneDaiClient.call(LuceneDaiClient.java:81)
at edu.arsc.multisearch.ServiceMap.map(ServiceMap.java:56)
at edu.arsc.multisearch.ServiceMap.map(ServiceMap.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://snowy.arsc.alaska.edu:8080/axis/services/ogsadai/Gov2259?WSDL

After tons of moving files around, I have managed to get log4j.properties working again with OGSA-DAI, which I need in order to figure out the HTTP error. Here are some select errors from the log:

java.lang.ClassNotFoundException: lucene.webservice1164.LuceneGov1165SoapBindingImpl
java.lang.ClassNotFoundException: lucene.webservice1164.LuceneGov1165SoapBindingImpl

This error is being reproduced in all of the url/Gov#?WSDL files as well, so maybe Axis is not happy with this. I think I just forgot to undeploy it. I'm remaking the files and re-deploying, so I can then undeploy it cleanly. There is a jar file, LL2s.jar, in $CATALINA_HOME/webapps/axis/WEB-INF/lib that should be removable now, since I undeployed the service associated with it.

Now that the WSDL error is gone, I'm getting an error that is from my code:

Caused by: uk.org.ogsadai.client.toolkit.exception.RequestException: There was an error during the execution of activity LuceneSearchToolkit-ogsadai-11b9ab0d25d that was caused by an incorrect value for parameter index.
Caused by: uk.org.ogsadai.client.toolkit.exception.RequestException: Directory not found: index

It turns out there were some overlapping class files (in this case, between the newer LuceneService.jar and ARSC.jar). I have deleted ARSC.jar. Huzzah! Now Snowy's backends work!

For Batch Multisearch Queries
 
Each query will have its own output directory. In this case, it would be basedir/queryNumber. After that, some kind of iterative process could be used to combine all the results (stored in basedir/queryNumber/part-0000 files) and put them into a single file.
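
A rough sketch of that combine step with the Hadoop FileSystem API might look like the following (the base-directory argument and the single "part-00000" file per query directory are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Walks basedir/queryNumber/ directories and appends each query's result file
// to one combined file. Path names here are illustrative.
public class CombineBatchResults {
    public static void main(String[] args) throws IOException {
        Path baseDir = new Path(args[0]);                        // e.g. basedir/
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream merged = fs.create(new Path(baseDir, "all-results"));
        for (FileStatus entry : fs.listStatus(baseDir)) {        // one subdirectory per query
            if (!entry.isDir()) continue;
            Path part = new Path(entry.getPath(), "part-00000"); // per-query output file
            if (!fs.exists(part)) continue;
            FSDataInputStream in = fs.open(part);
            IOUtils.copyBytes(in, merged, 4096, false);          // append to the combined file
            in.close();
        }
        merged.close();
    }
}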

Tuesday, 5 August
The RecordReader is still not happy: it will read in input but not use all of it. I'm working from the example file and the old RecordReader file. I've borrowed a lot of code from the LineRecordReader, since every <service> XML record is five lines long.

[Fatal Error] :-1:-1: Premature end of file.
org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at edu.arsc.multisearch.ServiceRecordReader$ServiceReader.readService(ServiceRecordReader.java:175)
at edu.arsc.multisearch.ServiceRecordReader.next(ServiceRecordReader.java:353)
at edu.arsc.multisearch.ServiceRecordReader.next(ServiceRecordReader.java:33)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)

Well, having multiple Service objects in the same file still doesn't work (although I can tell they are being read in). However, putting one service per file works out just fine. I'm not sure why...

To interact with the C++ stuff Darren is working on, this is my solution:

Generate file-per-entry
 
Create input subfolder query_terms_like_this/
Each file will be numbered 0, 1, etc. with one record entry in it

I'll still be looking for reasons why this isn't working, of course. Moving records in/out of the input is easy with this method, but there might be a lot of files involved in it.
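
For reference, generating that layout is just a loop over the records; the class and method names below are hypothetical, not actual Multisearch code:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// Writes one <service> record per numbered file inside a per-query input folder,
// matching the file-per-entry layout described above. Names are illustrative.
public class FilePerEntryWriter {
    public static void write(String queryTerms, List<String> serviceRecords)
            throws IOException {
        File dir = new File(queryTerms.replace(' ', '_'));   // e.g. query_terms_like_this/
        dir.mkdirs();
        int n = 0;
        for (String record : serviceRecords) {
            FileWriter out = new FileWriter(new File(dir, String.valueOf(n++)));
            out.write(record);                               // exactly one record entry per file
            out.close();
        }
    }
}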

Monday, 4 August
Ever since I tried to make Hadoop iterative, I have been getting this error from LemurIndriSearcher:

@AxisClient: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
AxisFault
faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
faultSubcode:
faultString: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
faultActor:
faultNode:
faultDetail:
{http://xml.apache.org/axis/}stackTrace:org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.

Or...

@AxisClient: org.xml.sax.SAXParseException: The element type "title" must be terminated by the matching end-tag "</title>".
AxisFault
faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
faultSubcode:
faultString: org.xml.sax.SAXParseException: The element type "title" must be terminated by the matching end-tag "</title>".
faultActor:
faultNode:
faultDetail:
{http://xml.apache.org/axis/}stackTrace:org.xml.sax.SAXParseException: The element type "title" must be terminated by the matching end-tag "</title>".

When running some tester code, I get the following:

Exception in thread "main" java.lang.NoSuchMethodError: edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcher_PortType.search(Ljava/lang/String;)Ledu/arsc/multisearch/backend/lemur/indriws/ResultSet;
at edu.arsc.multisearch.backend.LemurIndriClient.call(LemurIndriClient.java:39)
at edu.arsc.multisearch.backend.OtherTester.main(OtherTester.java:46)

All right, when run in another tester (HappyLemur), I get full results back. I'm not sure what's causing this error: it could be that the query is not appropriate or something... I tried re-deploying a newly built LemurIndriSearcher, but that clearly is not the problem. Outside of Hadoop there doesn't seem to be a problem at all, actually, so it must be something else...

Update: The problem was simple. I was returning 1000 results from Lemur, which was not being serialized correctly. I cut it down to 100 and it was fixed.

Update: The LuceneDaiClient is now also working! Huzzah!

The last major error I am experiencing is that the input is not read properly unless it is printed twice. If there is only one <service> object, it doesn't function properly. I'm not sure why...

RecordReader
 
According to Hadoop's RecordReader, there are a few methods related to position in the stream:
 
getPos() - Returns the current position (long) in the input
getProgress() - Returns how much of the input has been consumed, as a float between 0 and 1
getStatus() - Returns (int) how close to the end of the file we are

I've found the problem! Currently, readService() will run even if the next line is null! I've fixed this, but I am still only reading in one service per file, which is annoying. I am assuming it has to do with the fact that Hadoop is byte-oriented.
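
To sketch the guard: this is not the real ServiceRecordReader, just the shape of a reader for five-line <service> records under the old mapred API, with construction from the FileSplit left out.

import java.io.BufferedReader;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Illustrative reader: next() returns false as soon as there are no more lines,
// so record parsing never runs on an empty record.
public class ServiceReaderSketch implements RecordReader<LongWritable, Text> {
    private final BufferedReader in;   // opened over the split elsewhere
    private final long length;         // total bytes in the split
    private long pos = 0;

    public ServiceReaderSketch(BufferedReader in, long length) {
        this.in = in;
        this.length = length;
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        String line = in.readLine();
        if (line == null) return false;            // end of input: do not parse an empty record
        StringBuilder record = new StringBuilder(line);
        for (int i = 0; i < 4; i++) {              // a <service> record is five lines long
            String more = in.readLine();
            if (more == null) break;               // truncated record at the end of the file
            record.append('\n').append(more);
        }
        key.set(pos);
        value.set(record.toString());
        pos += record.length() + 1;                // rough byte position for getPos()/getProgress()
        return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return pos; }
    public float getProgress() { return length == 0 ? 1.0f : Math.min(1.0f, pos / (float) length); }
    public void close() throws IOException { in.close(); }
}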

I'm going back to the basic LineReader to see if I can fix this issue... I've looked through it, and it seems as if there is an issue with how the position and consumption of the different records are tracked as they are read in. I am starting from scratch, and I'll start on it tomorrow.

