Multisearch Log

These are the log files from 7 - 11 July 2008. For additional log files, please return to the main log page.

Friday, 11 July
<service> I am fairly certain that all we will need is URL and Client Class, but it would be easier to expand Multisearch, should other needs come up, if it is in this format. The 'trecname' section is for our server-selection algorithms. A simple 'identifier' would also be sufficient here, since it won't always be "trecname". In any case, I sent a detailed question to the Hadoop list, because the RecordReader I found that most closely suits my needs is the StreamXMLRecordReader, which is not exactly what I want. I'm hoping there is an entire tutorial on input for Hadoop somewhere, but I can't find it!

Alternative Option for Input: The TextInputFormat object can use plain text files, which are broken up by line. Instead of using a complex (and therefore human-readable) XML system, there could be a simple (but ugly) line system:

http://snowy.arsc.alaska.edu:8080/axis/service/name/:client.class.name:special-name

Essentially, the file would be broken up by lines, and then the input format would have to split the resulting String at the colons and "know" the format is URL:client class:name. It's messy, but should Multisearch need to be extended, we could always add additional colon-separated sections, knowing that the files would still be backwards compatible, since the StringTokenizer in the older version would only iterate through the first three sections.

Not wanting to use a string separated by colons, I've decided to continue making my own RecordReader, based on one of the source-code-provided ones, which can be viewed as a class file here. I have the basic read-in functions done, which produce a ServiceWritable object (yay!) so far. My major concern is debugging, as I cannot test each entity alone to see if it functions properly. I'm mimicking the code from Hadoop as much as possible, trying to make sure it will function as Hadoop needs it to.
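A minimal sketch of the line-splitting idea described above (class and method names are hypothetical, not Multisearch code). One wrinkle: the URL itself contains colons, so a plain StringTokenizer split on ':' would break the URL apart; this sketch instead takes the last two fields from the right and treats everything before them as the URL.

```java
// Hypothetical parser for the colon-separated service lines sketched above.
// Assumed format: URL:client.class.name:special-name
public class ServiceLineParser {

    public static String[] parse(String line) {
        int last = line.lastIndexOf(':');          // colon before special-name
        int mid = line.lastIndexOf(':', last - 1); // colon before client class
        String url = line.substring(0, mid);       // URL keeps its own colons
        String clientClass = line.substring(mid + 1, last);
        String name = line.substring(last + 1);
        return new String[] { url, clientClass, name };
    }
}
```

Parsing from the right rather than the left does give up the "append more colon sections later" compatibility argument, so it is only one possible trade-off, not the final format.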
I'll be doing some work over the weekend and later tonight with this, although I am concerned that the debugging will be quite painful.

Thursday, 10 July

My name is Kylie McCormick, and I'm currently working on creating a distributed information retrieval package with Hadoop, based on my previous work with other middleware like OGSA-DAI. I've been developing a design that works with the structures of the other systems I have put together for distributed IR. I'm hoping this will be enough for a strong answer. What I realized I could also do is create my own Reduce function that re-scores the documents and outputs ResultSets with these new scores, and then merge the documents into a final set. But if Hadoop can handle all of it, I want to go for that.

I did get a reply to the above e-mail: "If you tell Hadoop to use a single reducer, it should produce a single file of output." I've asked whether this means one object or just one file, which can make a huge difference! But I will continue working with the code with the IdentityReducer for now, and see if I can get something running at least.

Update: It seems to me that the current code I am using for AxisClient is not a great option. Below is the example:

edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcherService service = new edu.arsc.multisearch.backend.lemur.indriws.LemurIndriSearcherServiceLocator();

The obvious flaw is that we need to know a great deal about the system we're contacting, including package names. This hurts Multisearch's flexibility, although it is much nicer than the code I have worked with before:

Service service = new Service();

I've seen a simplified variation of this code, and if that works we're set. If not, new users of Multisearch might have to load in Clients to be used with the services they add.

Side Note: The above code works with both Lemur and Lucene objects...odd!
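If it does come to users loading in their own Clients, a minimal sketch of the base class might look like this (the Client name appears elsewhere in this log; the constructor, fields, and search method here are assumptions, not the actual Multisearch API):

```java
import java.util.List;

// Hypothetical base class for user-supplied service clients. Each service
// added to Multisearch would ship a subclass that knows how to contact its
// own backend (Axis stubs, raw HTTP, etc.), keeping package-specific code
// like LemurIndriSearcherServiceLocator out of the Multisearch core.
public abstract class Client {

    private final String serviceUrl;

    protected Client(String serviceUrl) {
        this.serviceUrl = serviceUrl;
    }

    public String getServiceUrl() {
        return serviceUrl;
    }

    // Subclasses implement the actual call to the remote search service,
    // returning the raw result lines for this query.
    public abstract List<String> search(String query);
}
```

A Lemur- or Lucene-backed service would then extend Client, and Multisearch would only ever see the abstract type.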
Right now I think providing-your-own-client would be the easiest thing to do, as the code is simpler and easier to read. Assuming that each service would build its own client is logical, but they would have to extend Client.java.

Wednesday, 9 July

Things Hadoop Does that we Don't Necessarily Want

I'm stepping through the requirements of the key/value pairs right now, compiling them as I go (with no functioning code, just making sure I have all the right components). I have a very basic skeleton set up that doesn't actually function at all. It uses all of the Hadoop architecture and structure, but the functions don't do anything yet. I might work a little more on the classes, but right now I need to develop two "type" classes: Client and FinalSet. It may be that we only use the Map() functionality of Hadoop and then have a synchronized merge() function in the FinalSet.

I'm presently looking through org.apache.hadoop.mapred.lib, trying to figure out which type of output collector I should use. There is some functionality, like IdentityReducer<K,V>, that can be used in place of MergeReduce<K,V> should it be found that the merge() function should not be a part of the reduce. Right now, I am still thinking through what the Reduce() function would do.
This means if I use a reduce function, it would require an additional function to reduce them all to the same object. I think I might use the IdentityReducer instead, which makes more sense, and merge the output from the output collector, if possible.

As of 5:30pm, I have some bare-bones code for the Map, the main function, and so on. I realized I should also be working on the input/output files that will be handled in this process (especially the file that handles the Service information). I'll be doing that tomorrow, mostly.

Tuesday, 8 July

I'm also re-doing some of the mathematical code for Naive Merge, since there are few comments and it is insanely hard to debug. I'll be looking into quick methods of calculating results today, such as special math libraries. I'm also dealing with a major Lemur issue involving key indexing, which I posted on the Lemur Forum. The BuildIndex function isn't generating a .key file, which is preventing it from being usable in programming. After I fix this, we can have various searches applied to the same indexes, including TFIDF and Okapi!

I am not sure if I can find a math library for NaiveMerge's least-squares needs, although I think I can figure out how to implement the math with comments. I'm not making much progress with Naive Merge, so I'll turn to Hadoop for the rest of the day. I think I am going to start playing with the code by late afternoon. After working a bit more with the code, I am finding that the outline I did needs more solid work before I can properly implement the code. I'm working with the details provided in the MapReduce tutorial, but a lot of the underpinnings are subtle. I've given up for the day on the code--I'll be starting bright and early tomorrow.

Monday, 7 July
Solution from Chris

Today I've been working almost exclusively with paper and pen, working through the finer points of Hadoop's architecture: for example, the Writable object code needed for the Map/Reduce. Right now, I'm looking at some simple step-by-step outlines as well as the "pickier" areas that need to be defined.

Map - ServerMap
Reduce - MergeReduce
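A rough sketch of how these two components might look against the Hadoop 0.17-era mapred API. The ServerMap and MergeReduce names come from this log; the key/value types, method bodies, and use of plain Text values are assumptions (the custom ServiceWritable would replace the input value type once the RecordReader is done), so this is an outline, not the actual Multisearch code.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: one service description in, one scored result set out per service.
public class ServerMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text serviceLine,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Contact the remote search service via its Client and emit its
        // results under a shared key, so every service's results arrive
        // at the same reducer.
        // output.collect(queryKey, resultSet);
    }
}

// Reduce: merge every service's results into one final set. Swapping in
// org.apache.hadoop.mapred.lib.IdentityReducer here (with a merge step
// after the job, and setNumReduceTasks(1) for a single output file) is
// the alternative discussed above.
class MergeReduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Re-score and merge the incoming result sets into a FinalSet,
        // then collect the merged set once.
    }
}
```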
I'm almost ready to start coding the Hadoop software components, except I would like to work on Chris's suggestion of correcting the LemurIndri scores for Multisearch before I do that, since I know it will be a problem in the future. This week's mighty to-do list is as follows:
I hope to get most of this done before the end of the week, and have a basic variation running. I'm still not sure what to do with the Input/Output formats. There is a lot of work to be done with splitting and such in this area, which I am not familiar with. Getting the InputFormat to work might be a bit trickier since there are logical splits, but I can write my own.

NaiveMergeSet: The good news is that the mechanism used to re-score negative scores to positive ones worked, but when it gets put through the merge algorithm, it comes out as NaN. I'm still looking into this...

Update: I've added some error checking to the code, and found that it would divide by zero (no checking), so y = Infinity. I fixed it.
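The class of bug described above can be shown with a small sketch (the class, method, and variable names here are hypothetical, not the actual NaiveMergeSet code): dividing by a zero score range yields Infinity or NaN, and either value then poisons every later merge computation.

```java
// Hypothetical min-max re-scoring, the kind of step that maps negative
// scores (e.g. log-probabilities) into [0, 1].
public class ScoreNormalizer {

    public static double normalize(double score, double min, double max) {
        double range = max - min;
        if (range == 0.0) {
            // Guard: without this check, (score - min) / 0.0 is Infinity
            // (or 0.0 / 0.0 == NaN), and NaN propagates through the merge.
            return 0.0;
        }
        return (score - min) / range;
    }
}
```

With the guard in place, a degenerate result list where every score is identical normalizes to 0.0 instead of producing NaN downstream.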
© Arctic Region Supercomputing Center 2006-2008. This page was last updated on 11 July 2008.