
Arctic Region Supercomputing Center

Multisearch Log

These are the log files from 27 - 30 May, 2008. For additional log files, please return to the main log page.

Friday, 30 May
I spoke with Chris about running the search input/output for his TREC track this summer, which is due 15 June. With some updates, Multisearch should be able to handle it, no problem.

The trouble I've been having is that all of the Lucene indexes were created with a 2.2.* release (I'm waiting to hear back from Chris about the exact version), and I need to update the toolkits to reflect that. I may have to tar and back up the older backends, which are still running Lucene 1.4.3, and that is guaranteed to be messy.
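
Once I know the exact version, a tiny probe like the one below should confirm whether the updated toolkit can actually open one of the 2.2-built indexes. The index path is hypothetical and it assumes the Lucene 2.2 jar is on the classpath; it's just a compatibility-check sketch, not part of Multisearch.

import org.apache.lucene.index.IndexReader;

public class IndexProbe {
    public static void main(String[] args) throws Exception {
        // hypothetical index location; pass the real path as the first argument
        String path = args.length > 0 ? args[0] : "/home/mccormic/indexes/test";
        // open() fails if the jar on the classpath can't read this index format
        IndexReader reader = IndexReader.open(path);
        System.out.println(path + ": " + reader.numDocs() + " docs (maxDoc=" + reader.maxDoc() + ")");
        reader.close();
    }
}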

The other thing Greg pointed out to me is that GIR is about distributed searching, not necessarily distributed indexing. So Lemur and Egothor, if I add them, don't have to index over Hadoop; they only have to search over it. I'm looking into that now; a rough sketch of the idea is below.
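
To make that concrete for myself, here is a rough sketch of what search-only distribution might look like with the 0.17-era org.apache.hadoop.mapred API and a Lucene 2.2-style searcher: each map input names a pre-built index directory, the map searches that one index locally, and the reduce side would merge the per-index hit lists into a single ranking. Every name in it is made up (the "contents" and "id" fields, the multisearch.query property), it assumes the indexes are visible on a local or shared filesystem, and it is a sketch of the pattern rather than working Multisearch code.

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Each input value names one locally readable Lucene index directory; the map
// searches that index and emits (document id, score) pairs for the reducer to merge.
public class SearchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, FloatWritable> {

    private String queryTerm;

    public void configure(JobConf job) {
        // hypothetical job property carrying the user's query term
        queryTerm = job.get("multisearch.query", "arctic");
    }

    public void map(LongWritable key, Text indexPath,
                    OutputCollector<Text, FloatWritable> output, Reporter reporter)
            throws IOException {
        IndexSearcher searcher = new IndexSearcher(indexPath.toString()); // Lucene 2.2-style constructor
        try {
            Hits hits = searcher.search(new TermQuery(new Term("contents", queryTerm)));
            int top = Math.min(10, hits.length());
            for (int i = 0; i < top; i++) {
                String id = hits.doc(i).get("id");   // "id" is an assumed stored field
                if (id == null) {
                    continue;
                }
                output.collect(new Text(id), new FloatWritable(hits.score(i)));
            }
        } finally {
            searcher.close();
        }
    }
}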

Thursday, 29 May
Installing Hadoop for Standalone operation wasn't that bad; I managed to get through it this morning. Right now it's set up on Nimbus, because the new Snowy doesn't have Java installed on it. As per the Quickstart instructions, I ran the following command as a test to ensure that 0.17 had been installed correctly.

nimbus(mccormic) ~/hadoop [226] > bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
08/05/29 09:25:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/05/29 09:25:56 INFO mapred.FileInputFormat: Total input paths to process : 2
08/05/29 09:25:56 INFO mapred.JobClient: Running job: job_local_1
08/05/29 09:25:56 INFO mapred.MapTask: numReduceTasks: 1
08/05/29 09:25:57 INFO mapred.LocalJobRunner: file:/home/mccormic/hadoop/input/hadoop-site.xml:0+178
08/05/29 09:25:57 INFO mapred.TaskRunner: Task 'job_local_1_map_0000' done.
08/05/29 09:25:57 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0000' to file:/home/mccormic/hadoop/grep-temp-1378420935
08/05/29 09:25:57 INFO mapred.MapTask: numReduceTasks: 1
08/05/29 09:25:57 INFO mapred.LocalJobRunner: file:/home/mccormic/hadoop/input/hadoop-default.xml:0+37978
08/05/29 09:25:57 INFO mapred.TaskRunner: Task 'job_local_1_map_0001' done.
08/05/29 09:25:57 INFO mapred.TaskRunner: Saved output of task 'job_local_1_map_0001' to file:/home/mccormic/hadoop/grep-temp-1378420935
08/05/29 09:25:57 INFO mapred.LocalJobRunner: reduce > reduce
08/05/29 09:25:57 INFO mapred.TaskRunner: Task 'reduce_73bg6c' done.
08/05/29 09:25:57 INFO mapred.TaskRunner: Saved output of task 'reduce_73bg6c' to file:/home/mccormic/hadoop/grep-temp-1378420935
08/05/29 09:25:57 INFO mapred.JobClient: Job complete: job_local_1
08/05/29 09:25:57 INFO mapred.JobClient: Counters: 11
08/05/29 09:25:57 INFO mapred.JobClient: File Systems
08/05/29 09:25:57 INFO mapred.JobClient: Local bytes read=390675
08/05/29 09:25:57 INFO mapred.JobClient: Local bytes written=360503
08/05/29 09:25:57 INFO mapred.JobClient: Map-Reduce Framework
08/05/29 09:25:57 INFO mapred.JobClient: Map input records=1239
08/05/29 09:25:57 INFO mapred.JobClient: Map output records=41
08/05/29 09:25:57 INFO mapred.JobClient: Map input bytes=38156
08/05/29 09:25:57 INFO mapred.JobClient: Map output bytes=1161
08/05/29 09:25:57 INFO mapred.JobClient: Combine input records=41
08/05/29 09:25:57 INFO mapred.JobClient: Combine output records=39
08/05/29 09:25:57 INFO mapred.JobClient: Reduce input groups=39
08/05/29 09:25:57 INFO mapred.JobClient: Reduce input records=39
08/05/29 09:25:57 INFO mapred.JobClient: Reduce output records=39
08/05/29 09:25:57 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
08/05/29 09:25:57 INFO mapred.FileInputFormat: Total input paths to process : 1
08/05/29 09:25:57 INFO mapred.JobClient: Running job: job_local_2
08/05/29 09:25:57 INFO mapred.MapTask: numReduceTasks: 1
08/05/29 09:25:58 INFO mapred.LocalJobRunner: file:/home/mccormic/hadoop/grep-temp-1378420935/part-00000:0+1533
08/05/29 09:25:58 INFO mapred.TaskRunner: Task 'job_local_2_map_0000' done.
08/05/29 09:25:58 INFO mapred.TaskRunner: Saved output of task 'job_local_2_map_0000' to file:/home/mccormic/hadoop/output
08/05/29 09:25:58 INFO mapred.LocalJobRunner: reduce > reduce
08/05/29 09:25:58 INFO mapred.TaskRunner: Task 'reduce_oppti5' done.
08/05/29 09:25:58 INFO mapred.TaskRunner: Saved output of task 'reduce_oppti5' to file:/home/mccormic/hadoop/output
08/05/29 09:25:58 INFO mapred.JobClient: Job complete: job_local_2
08/05/29 09:25:58 INFO mapred.JobClient: Counters: 11
08/05/29 09:25:58 INFO mapred.JobClient: File Systems
08/05/29 09:25:58 INFO mapred.JobClient: Local bytes read=500227
08/05/29 09:25:58 INFO mapred.JobClient: Local bytes written=482447
08/05/29 09:25:58 INFO mapred.JobClient: Map-Reduce Framework
08/05/29 09:25:58 INFO mapred.JobClient: Map input records=39
08/05/29 09:25:58 INFO mapred.JobClient: Map output records=39
08/05/29 09:25:58 INFO mapred.JobClient: Map input bytes=1447
08/05/29 09:25:58 INFO mapred.JobClient: Map output bytes=1135
08/05/29 09:25:58 INFO mapred.JobClient: Combine input records=0
08/05/29 09:25:58 INFO mapred.JobClient: Combine output records=0
08/05/29 09:25:58 INFO mapred.JobClient: Reduce input groups=2
08/05/29 09:25:58 INFO mapred.JobClient: Reduce input records=39
08/05/29 09:25:58 INFO mapred.JobClient: Reduce output records=39
 
nimbus(mccormic) ~/hadoop [227] > cat output/*
3 dfs.
1 dfs.impl
1 dfs.max.objects
1 dfs.name.dir
1 dfs.namenode.decommission.interval
1 dfs.namenode.handler.count
1 dfs.namenode.logging.level
1 dfs.permissions
1 dfs.permissions.supergroup
1 dfs.replication.consider
1 dfs.replication
1 dfs.replication.interval
1 dfs.replication.max
1 dfs.replication.min
1 dfs.replication.min.
1 dfs.safemode.extension
1 dfs.safemode.threshold.pct
1 dfs.secondary.http.address
1 dfs.web.ugi
1 dfs.http.address
1 dfs.datanode.dns.nameserver
1 dfs.balance.bandwidth
1 dfs.block.size
1 dfs.blockreport.interval
1 dfs.client.block.write.retries
1 dfs.client.buffer.dir
1 dfs.data.dir
1 dfs.datanode.address
1 dfs.datanode.dns.interface
1 dfs.https.address
1 dfs.datanode.du.pct
1 dfs.datanode.du.reserved
1 dfs.datanode.http.address
1 dfs.datanode.https.address
1 dfs.default.chunk.view.size
1 dfs.df.interval
1 dfs.heartbeat.interval
1 dfs.hosts
1 dfs.hosts.exclude

As far as combining Lucene and Hadoop goes, it's already been done by Apache under Nutch. We may work with Lemur and/or Egothor as well as Lucene to see how Hadoop shapes up with multiple types of indexing. I'm looking at Nutch as a viable option for GIR, which might make the job a lot easier. The project is a bit older and has had a lot of updates, and at first I thought it required Windows! But it seems to run fine on Linux. I'm working through the beginners' startup guide.

Update: Good news! Nutch can process files the way we need it to. There are some errors (below) that I'm looking into (a note on the likely cause follows the log), but the index files look OK. Below is the log generated by the crawl command.

nimbus(mccormic) ~/nutch [246] > bin/nutch crawl urls -dir testcraw -depth 4 >& crawl.log
crawl started in: testcraw
rootUrlDir = urls
threads = 10
depth = 4
Injector: starting
Injector: crawlDb: testcraw/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: testcraw/segments/20080529105106
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: testcraw/segments/20080529105106
Fetcher: threads: 10
fetching http://pileus.arsc.alaska.edu:8080/domain2/data/
fetch of http://pileus.arsc.alaska.edu:8080/domain2/data/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: testcraw/crawldb
CrawlDb update: segments: [testcraw/segments/20080529105106]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: testcraw/segments/20080529105113
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: testcraw/segments/20080529105113
Fetcher: threads: 10
fetching http://pileus.arsc.alaska.edu:8080/domain2/data/
fetch of http://pileus.arsc.alaska.edu:8080/domain2/data/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: testcraw/crawldb
CrawlDb update: segments: [testcraw/segments/20080529105113]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: testcraw/segments/20080529105120
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: testcraw/segments/20080529105120
Fetcher: threads: 10
fetching http://pileus.arsc.alaska.edu:8080/domain2/data/
fetch of http://pileus.arsc.alaska.edu:8080/domain2/data/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: testcraw/crawldb
CrawlDb update: segments: [testcraw/segments/20080529105120]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: testcraw/segments/20080529105127
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: testcraw/segments/20080529105127
Fetcher: threads: 10
fetching http://pileus.arsc.alaska.edu:8080/domain2/data/
fetch of http://pileus.arsc.alaska.edu:8080/domain2/data/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: testcraw/crawldb
CrawlDb update: segments: [testcraw/segments/20080529105127]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: testcraw/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: testcraw/segments/20080529105113
LinkDb: adding segment: testcraw/segments/20080529105127
LinkDb: adding segment: testcraw/segments/20080529105120
LinkDb: adding segment: testcraw/segments/20080529105106
LinkDb: done
Indexer: starting
Indexer: linkdb: testcraw/linkdb
Indexer: adding segment: testcraw/segments/20080529105113
Indexer: adding segment: testcraw/segments/20080529105127
Indexer: adding segment: testcraw/segments/20080529105120
Indexer: adding segment: testcraw/segments/20080529105106
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: testcraw/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
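
The repeated fetch failures above are the crawler refusing to run because no HTTP agent name is configured; the usual fix is to set the http.agent.name property in conf/nutch-site.xml and re-run the crawl. As a quick way to confirm which property is empty, here's a minimal check, assuming a Nutch 0.9-era install with its conf/ directory on the classpath; it's a sketch for my own debugging, not project code.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class AgentNameCheck {
    public static void main(String[] args) {
        // NutchConfiguration.create() loads nutch-default.xml and nutch-site.xml
        Configuration conf = NutchConfiguration.create();
        String agent = conf.get("http.agent.name", "");
        if (agent.length() == 0) {
            System.out.println("http.agent.name is empty -- the fetcher will refuse every URL");
        } else {
            System.out.println("Fetching as agent: " + agent);
        }
    }
}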

Update: After reading up on Nutch, I'm fairly certain it's what we're looking for. It uses Hadoop's MapReduce to run Lucene. I'm confident I could use the same method to run Egothor or Lemur, if that's what Greg wants. I'm going to figure out some things with my PolarExpress card so I can check things out of the library... Hopefully I'll be back at work soon.

Wednesday, 28 May
My computer, Dulcet, is having problems opening certain software, namely Firefox. I spoke to Don, who was on call, and he told me it was probably just my quota. He doubled it and told me to make sure I deleted things. However, Firefox still didn't load! It turns out Dulcet has some problems with its /usr/local directory, which apparently either I don't have access to or doesn't exist.

Don offered me another computer, but another student already had it (we're slowly getting the other interns in, namely the Cadets and the GWU research assistants). So I'm browsing the web with Links (a text-only browser). It's been slowing me down, especially in my work with Hadoop, but hopefully the problem will be fixed soon and I can get cracking.

By 11:30am I managed to get the old Servlet working on Nimbus. There are only three backends (because Balto, Pileus, and Snowy are all down), but it is up and running. I've also found that the Restriction Algorithms aren't working. They're just causing zero results to be returned.

After exploring the Multisearch Servlet for a while, I realized that the ratio limit was dropping the number of servers to 0. I modified the code so that the number of servers can never fall below one: if the ratio would drop it below one, it gets set back to one. Now it's working again! A quick sketch of the clamp is below.
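
For reference, a minimal sketch of the clamp; the names and numbers here are hypothetical, not the servlet's actual variables.

public class RestrictionClamp {
    // Apply the ratio limit, but never let the backend count fall below one.
    static int restrict(int serverCount, double ratioLimit) {
        int restricted = (int) Math.floor(serverCount * ratioLimit);
        return Math.max(1, restricted);
    }

    public static void main(String[] args) {
        // e.g. with three backends and a 0.2 ratio, the floor gives 0; the clamp keeps it at 1
        System.out.println(restrict(3, 0.2));
    }
}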

I read MapReduce: Simplified Data Processing on Large Clusters, which Hadoop is based on. Essentially, MapReduce hides the parallelization and fault handling because its library already deals with those issues. As long as the user can express the job as a map function from input key/value pairs to intermediate key/value pairs and a reduce function that merges the values for each intermediate key, MapReduce can help.
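
For my own reference, the paper's canonical word-count example looks roughly like this in the 0.17-era org.apache.hadoop.mapred API. This is the stock example, not GIR code, and the JobConf driver is omitted.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {

    // map: (byte offset, line of text) -> (word, 1)
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, count); the framework handles the
    // shuffle/sort that groups values by key between the two phases.
    public static class ReduceClass extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}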

Hadoop, however, is only modeled on MapReduce: it's an open-source implementation written in Java, which gives it more flexibility. I'm now reading up more specifically on Hadoop and how it operates.

I've downloaded the newest version of Hadoop and untarred it on Nimbus, but I'm not sure whether I should install it yet. I'll have to make SSHing into localhost passwordless, and I'm not sure I want to do that right now without knowing its full impact.

I'm also looking into Lucene with Hadoop, since they're both Apache tools that work. There seems to be a lot of existing work combining the two; maybe we can build on it, or use something that has already been put together? I've asked Greg via e-mail whether he thinks using Egothor (another free Java search engine) and/or Lemur might be a good idea, since that would let us use different indexing styles in GIR.

I've also read a bit of the Hadoop Quickstart, about loading Hadoop from the jars and so on. More information can be found at the Quickstart page, so it can be referenced later. I think I'll leave a big task like installation until tomorrow, now that I've gotten my feet wet with the basic architecture.

Tuesday, 27 May
Greg and I met and discussed some options for my work this week. He doesn't have a plan yet, but he wants me to look directly into Hadoop, a distributed file-storage and processing framework that's more robust than OGSA-DAI. We also spoke about resurrecting the old Multisearch from last year, which shouldn't be too hard since I've been updating it throughout the fiscal year.

I spent most of my time reading up on Hadoop, learning about its Map/Reduce structure and how it's used. I've read a bunch of presentations and hope they'll help me figure out how to make GIR run over it. I'll add more notes as I go tomorrow.

I looked into getting Nimbus and Snowy back up and running with Multisearch, but as of right now that won't be a possibility. First, Snowy is down and is being replaced by Tintin tomorrow (which will be renamed Snowy). Second, Nimbus' Java is no longer working. I dropped Greg a line about adding Java back to these two before trying to get things rolling again.

Tintin is a Linux x86_64 machine, and I couldn't find the right JDK for it. However, Nimbus' Java was updated by the end of the day.
