ArcFiles

From Lawa
Latest revision as of 04:10, 19 May 2011

The ArcTool package parses and processes ARC files. I uploaded the JAR with sources to /cs/phd/ouaknine/process_arc.jar
Job run example: bin/hadoop jar ~/process_arc.jar ArcProcessing -libjars=/cs/phd/ouaknine/PhD/lawa/arcTools/heritrix-1.14.4.jar,/cs/phd/ouaknine/PhD/lawa/arcTools/fastutil-6.1.0.jar keren1GB/ /user/ouaknine/output/sum4

The ARC file reader is based on the Heritrix parser.
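
The Heritrix parser handles the low-level record framing. For orientation, the shape of a version-1 ARC record header can be sketched as a small standalone parser; the space-separated field order (URL, IP, archive date, content type, length) follows the Internet Archive's ARC v1 format, and the class below is illustrative only, not ArcTool's or Heritrix's actual record type:

```java
// Hedged sketch: parsing one version-1 ARC record header line.
// Field order follows the Internet Archive ARC v1 format; this class is an
// illustration, not part of ArcTool or Heritrix.
class ArcHeaderLine {
    final String url, ip, archiveDate, contentType;
    final long length;

    ArcHeaderLine(String url, String ip, String archiveDate,
                  String contentType, long length) {
        this.url = url;
        this.ip = ip;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    // A v1 header line has exactly five space-separated fields.
    static ArcHeaderLine parse(String line) {
        String[] f = line.trim().split(" ");
        if (f.length != 5) {
            throw new IllegalArgumentException("bad ARC v1 header: " + line);
        }
        return new ArcHeaderLine(f[0], f[1], f[2], f[3], Long.parseLong(f[4]));
    }
}
```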

Done:

  • Implemented a new InputFormat (ArcInputFormat, ArcRecordReader, ArcRecord)
  • Implemented the processing class
  • Implemented a layer between HDFS and the Heritrix parser
  • Processed a 1 GB URL count on the test cluster.
  • Memory issues: solved.
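
How the three classes in the list above fit together can be sketched without any Hadoop dependency. The shapes below only loosely mirror Hadoop's InputFormat/RecordReader contract (the real classes extend Hadoop's base classes, and a split here is simplified to a list of URLs); all names with the `Sketch` suffix are illustrative stand-ins:

```java
import java.util.Iterator;
import java.util.List;

// Dependency-free sketch of the InputFormat / RecordReader / Record trio
// listed above. Illustrative only; the real classes extend Hadoop's
// InputFormat and RecordReader, which are not reproduced here.
class ArcRecordSketch {
    final String url;
    ArcRecordSketch(String url) { this.url = url; }
}

class ArcRecordReaderSketch {
    private final Iterator<String> urls;
    private ArcRecordSketch current;

    ArcRecordReaderSketch(Iterator<String> urls) { this.urls = urls; }

    // Advance to the next record, as a Hadoop RecordReader does per split.
    boolean next() {
        if (!urls.hasNext()) return false;
        current = new ArcRecordSketch(urls.next());
        return true;
    }

    ArcRecordSketch current() { return current; }
}

class ArcInputFormatSketch {
    // One reader per input split; a split is simplified to a list of URLs.
    ArcRecordReaderSketch createReader(List<String> split) {
        return new ArcRecordReaderSketch(split.iterator());
    }
}
```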

To be checked:

  • Corrupted files exist in our data sets. Although rare (less than 2%), these files prevent the job from completing, so they are removed. I am checking the issue with Internet Memory.
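
Rather than removing whole files, one alternative worth checking is skipping only the corrupt records and counting them, so a rare bad record no longer fails the job. A minimal, dependency-free sketch: the reader and its failure mode below are illustrative stand-ins (a `null` entry stands in for a record the parser rejects; the real reader wraps Heritrix and would catch its parse exceptions instead):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: skip corrupt records instead of deleting whole files.
// A null entry stands in for a record the parser failed on; in the real
// reader this would be a caught parse exception from Heritrix.
class SkippingReaderSketch {
    int skipped = 0;

    List<String> readAll(Iterator<String> rawRecords) {
        List<String> good = new ArrayList<>();
        while (rawRecords.hasNext()) {
            String rec = rawRecords.next();
            if (rec == null) {
                skipped++;   // count the corrupt record and move on
            } else {
                good.add(rec);
            }
        }
        return good;
    }
}
```

A skipped-record counter (e.g. a Hadoop job counter) would also quantify the corruption rate per data set, which is useful for the follow-up with Internet Memory.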