Difference between revisions of "ArcFiles:"
From Lawa
Revision as of 03:50, 19 May 2011
The ArcTool package parses and processes ARC files. I uploaded the JAR, with sources, to /cs/phd/ouaknine/process_arc.jar
Job run example: bin/hadoop jar ~/process_arc.jar ArcProcessing -libjars=/cs/phd/ouaknine/PhD/lawa/arcTools/heritrix-1.14.4.jar,/cs/phd/ouaknine/PhD/lawa/arcTools/fastutil-6.1.0.jar keren1GB/ /user/ouaknine/output/sum4
The ARC file reader is based on the Heritrix parser.
Done:
======
- Implemented new InputFormat (ArcInputFormat, ArcRecordReader, ArcRecord)
- Implemented processing class
- Implemented layer between HDFS and Heritrix's parser
- Tested on 1GB of data; waiting for the main cluster to process more.
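To make concrete what a single record handed out by the record reader might carry, here is a minimal, stdlib-only sketch that splits an ARC v1 URL-record header line (URL, IP address, archive date, content type, length) into its five fields. It is not the ArcRecord class from process_arc.jar and it does not use Heritrix; the names ArcHeaderSketch, Header and parseHeader are hypothetical, and the sample line is made up. Real-world headers can be messier than this strict parser accepts, which is one reason to lean on an existing parser instead.

```java
/** Hypothetical illustration only; not the ArcRecord class shipped in process_arc.jar. */
final class ArcHeaderSketch {

    /** The five space-separated fields of an ARC v1 URL-record header line. */
    static final class Header {
        final String url;
        final String ip;
        final String archiveDate;  // timestamp string, e.g. YYYYMMDDhhmmss
        final String contentType;
        final long length;         // number of payload bytes that follow the header line

        Header(String url, String ip, String archiveDate, String contentType, long length) {
            this.url = url;
            this.ip = ip;
            this.archiveDate = archiveDate;
            this.contentType = contentType;
            this.length = length;
        }
    }

    /** Parses one header line strictly; rejects anything that is not exactly five fields. */
    static Header parseHeader(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + fields.length);
        }
        long length;
        try {
            length = Long.parseLong(fields[4]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("length is not a number: " + fields[4], e);
        }
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        return new Header(fields[0], fields[1], fields[2], fields[3], length);
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        Header h = parseHeader("http://example.org/ 192.0.2.1 20110519035000 text/html 1234");
        System.out.println(h.url + " " + h.contentType + " " + h.length);
    }
}
```

A record reader would typically read such a header, then consume exactly `length` bytes as the record body before moving on to the next header.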
To be checked:
==============
- Corrupted files exist; I removed some manually. I am checking this with Internet Memory.
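Instead of removing corrupted files by hand, one possible pre-check is to try decompressing each file end to end and flag those that fail. The sketch below assumes the ARC files are gzip-compressed (.arc.gz) and that the corruption shows up as a truncated or invalid gzip stream; neither is stated above, so treat both as assumptions. The names GzipCheckSketch and isReadableGzip are hypothetical, and only java.util.zip from the standard library is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper; assumes gzip-compressed input, which the notes above do not confirm. */
final class GzipCheckSketch {

    /** Returns true if the whole stream decompresses without an I/O error. */
    static boolean isReadableGzip(InputStream raw) {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            while (in.read(buffer) != -1) {
                // Discard the data; only the ability to read to the end matters here.
            }
            return true;
        } catch (IOException e) {
            // Covers a bad gzip header, a truncated stream, and checksum mismatches.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small valid gzip stream in memory, then a truncated copy of it.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write("sample payload for the check".getBytes(StandardCharsets.UTF_8));
        }
        byte[] good = bytes.toByteArray();
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        System.out.println("intact:    " + isReadableGzip(new ByteArrayInputStream(good)));
        System.out.println("truncated: " + isReadableGzip(new ByteArrayInputStream(truncated)));
    }
}
```

A check like this only detects damage at the compression layer; a file that decompresses cleanly can still contain malformed records.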