ArcFiles:
From Lawa
Latest revision as of 04:10, 19 May 2011
The ArcTool package parses and processes ARC files. I uploaded the jar with its sources to /cs/phd/ouaknine/process_arc.jar
Job run example: bin/hadoop jar ~/process_arc.jar ArcProcessing -libjars=/cs/phd/ouaknine/PhD/lawa/arcTools/heritrix-1.14.4.jar,/cs/phd/ouaknine/PhD/lawa/arcTools/fastutil-6.1.0.jar keren1GB/ /user/ouaknine/output/sum4
The ARC file reader is based on the Heritrix parser.
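A record reader over ARC data ultimately has to locate record boundaries in the raw stream, which is what the Heritrix parser handles here. As a self-contained illustration only (this is not the actual ArcRecord class; the class, field names, and sample line are invented), a minimal sketch of parsing one ARC v1 URL-record header, whose trailing length field tells a reader how many payload bytes to consume before the next record:

```java
// Sketch: parse one ARC v1 URL-record header line.
// ARC v1 header fields (space-separated): URL, IP address, archive date,
// content type, payload length. Names below are illustrative.
public class ArcHeaderSketch {
    static final class Header {
        final String url, ip, date, mimeType;
        final long length;
        Header(String url, String ip, String date, String mimeType, long length) {
            this.url = url; this.ip = ip; this.date = date;
            this.mimeType = mimeType; this.length = length;
        }
    }

    static Header parseHeader(String line) {
        String[] f = line.trim().split(" ");
        if (f.length != 5) {
            throw new IllegalArgumentException("not an ARC v1 URL record header: " + line);
        }
        // The length field is what lets a reader skip straight to the next
        // record boundary without scanning the payload itself.
        return new Header(f[0], f[1], f[2], f[3], Long.parseLong(f[4]));
    }

    public static void main(String[] args) {
        Header h = parseHeader(
            "http://example.com/ 192.0.2.1 20110519041000 text/html 1234");
        System.out.println(h.url + " " + h.length);
    }
}
```

In the real package this boundary logic sits behind Heritrix's own reader; the sketch only shows why a length-prefixed header makes splitting the stream into records cheap.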
Done:
- Implemented a new InputFormat (ArcInputFormat, ArcRecordReader, ArcRecord)
- Implemented the processing class
- Implemented a layer between HDFS and Heritrix's parser
- Ran a URL-count job over 1GB of data on the test cluster
- Solved the memory issues
To be checked:
- Corrupted files exist in our data sets. Although rare (less than 2%), these files prevent the job from completing, so they were removed. I am checking the issue with Internet Memory.
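A common way to keep rare corrupt records from failing an entire job is to catch the parser's exception per record, count it, and continue instead of aborting. The sketch below is a hypothetical illustration of that pattern only: `parse` stands in for the real Heritrix-backed parsing and is not part of ArcTool.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: tolerate corrupted records instead of failing the whole job.
// parse() is a stand-in for real record parsing (hypothetical).
public class SkipCorruptSketch {
    static long parse(String record) {
        if (record.startsWith("CORRUPT")) {
            throw new IllegalStateException("bad record");
        }
        return record.length(); // stand-in "result" for a good record
    }

    static List<Long> process(List<String> records) {
        List<Long> out = new ArrayList<>();
        int skipped = 0;
        for (String r : records) {
            try {
                out.add(parse(r));
            } catch (RuntimeException e) {
                // Count and skip rather than abort: corrupt records are rare (<2%).
                skipped++;
            }
        }
        System.out.println("skipped " + skipped + " corrupt records");
        return out;
    }

    public static void main(String[] args) {
        process(List.of("a", "CORRUPTxyz", "bc"));
    }
}
```

The trade-off is that silently dropped records should still be counted (e.g. via a Hadoop job counter) so the loss stays visible, rather than removing whole files up front.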