Difference between revisions of "ArcFiles:"

Revision as of 04:50, 19 May 2011

The ArcTool package parses and processes ARC files. I uploaded the JAR with sources to /cs/phd/ouaknine/process_arc.jar.
Job run example: bin/hadoop jar ~/process_arc.jar ArcProcessing -libjars=/cs/phd/ouaknine/PhD/lawa/arcTools/heritrix-1.14.4.jar,/cs/phd/ouaknine/PhD/lawa/arcTools/fastutil-6.1.0.jar keren1GB/ /user/ouaknine/output/sum4

The ARC file reader is based on the Heritrix parser.

Done:

- Implemented a new InputFormat (ArcInputFormat, ArcRecordReader, ArcRecord)
- Implemented the processing class
- Implemented a layer between HDFS and Heritrix's parser
- Tested on 1GB; waiting for the main cluster to process more.
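
To give a rough idea of what a record class such as ArcRecord has to carry, here is a minimal sketch of parsing one ARC version-1 URL-record header line (URL, IP address, archive date, content type, length, separated by single spaces). The class name ArcHeaderSketch and its fields are hypothetical; this is not the code in process_arc.jar, and the real parsing is done by Heritrix, which also copes with malformed headers that this sketch simply rejects.

```java
/** Hypothetical illustration only; not the ArcRecord class from process_arc.jar. */
final class ArcHeaderSketch {
    final String url;
    final String ip;
    final String archiveDate;  // timestamp string, typically YYYYMMDDhhmmss
    final String contentType;
    final long length;         // size of the record body in bytes

    private ArcHeaderSketch(String url, String ip, String archiveDate,
                            String contentType, long length) {
        this.url = url;
        this.ip = ip;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    /** Parses a header line of the form "url ip date content-type length". */
    static ArcHeaderSketch parse(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length != 5) {
            // Assumption of this sketch: anything but 5 fields is treated as malformed.
            throw new IllegalArgumentException(
                "expected 5 fields, got " + fields.length);
        }
        long length;
        try {
            length = Long.parseLong(fields[4]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("bad length field: " + fields[4], e);
        }
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        return new ArcHeaderSketch(fields[0], fields[1], fields[2], fields[3], length);
    }
}
```

A record reader would read such a line, then hand the following `length` bytes to the processing class as the record body.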

To be checked:

- Corrupted files exist; I removed some manually and am checking the issue with Internet Memory.
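
Instead of removing corrupted files by hand, a pre-check could flag them before the job runs. The sketch below assumes the inputs are gzip-compressed ARC files (.arc.gz), which this page does not state, and it only detects damage at the gzip level (bad header, truncated stream, bad trailer), not malformed ARC records inside a valid stream. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper; not part of process_arc.jar. */
final class GzipCheckSketch {

    /** Returns true if the whole stream decompresses without an I/O error. */
    static boolean decompressesCleanly(InputStream raw) {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            // Drain the stream; the data itself is discarded.
            while (in.read(buffer) != -1) {
                // nothing to do
            }
            return true;
        } catch (IOException e) {
            // Covers non-gzip input, truncated data and a corrupt trailer.
            return false;
        }
    }
}
```

Files for which this returns false could be moved aside and reported, so the list sent to Internet Memory is produced automatically.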