System | A cluster composed of 70 dual 3GHz Pentium-IV Xeon nodes (140 CPUs) running Linux, internally partitioned into two disjoint sub-clusters of 28 and 42 nodes. Part of the EGEE grid project. |
Duration | August 2004 through May 2005 (ten months) |
Jobs | 244,821 (all jobs are serial; of these, 230,448 actually started running and 14,373 were canceled before starting). |
Analysis | Found in the paper [medernach05], which introduces and models the LPC log. |
What's LPC? | LPC stands for "Laboratoire de Physique Corpusculaire" (Laboratory of Corpuscular Physics) of Université Blaise-Pascal, Clermont-Ferrand, France. |
Context |
LPC is a cluster that is part of the EGEE project (Enabling Grids for E-science in Europe). This grid employs the LCG middleware as its infrastructure (LCG is the LHC Computing Grid project). One of the project's goals is to develop an infrastructure capable of handling and analyzing the roughly 15 petabytes of data expected to be generated per year by the Large Hadron Collider (LHC), developed at CERN and scheduled to begin operation in 2007. The LPC cluster is one site within the LCG infrastructure. It is used mostly for biomedical and high-energy physics research. The status of all the sites composing the LCG can be monitored at http://goc.grid.sinica.edu.tw/gstat/, and the status of LPC in particular at http://goc.grid.sinica.edu.tw/gstat/IN2P3-LPC. The LPC site is http://clrwww.in2p3.fr/ (in French only). |
Workload |
All the jobs in the log are serial (a verification sketch over the converted SWF appears below).
|
Graciously provided by | Emmanuel Medernach (medernac AT clermont.in2p3.fr), the author of [medernach05], who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment. |
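As a quick sanity check of the claim that every job is serial, the following sketch scans the converted log and tallies started versus canceled jobs. It is only an illustration: it assumes the standard 18-field SWF record layout (comment lines start with ';') and a hypothetical output file name, l_lpc.swf, guessed from the --output=l_lpc flag of the conversion command further below.

# Minimal sketch (not part of the conversion): verify that all jobs are serial
# and count how many actually started. Assumes the standard 18-field SWF
# layout; the file name is a guess based on the --output=l_lpc flag.

def check_serial(path="l_lpc.swf"):
    started = canceled = 0
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                          # skip SWF header/comment lines
        fields = line.split()
        procs_alloc = int(fields[4])          # field 5: allocated processors
        procs_req   = int(fields[7])          # field 8: requested processors
        assert procs_req == 1, "all LPC jobs request exactly one CPU"
        if procs_alloc == 1:
            started += 1                      # job actually ran
        else:
            canceled += 1                     # canceled before start (proc_used=0)
    print(f"started: {started}, canceled before start: {canceled}")

if __name__ == "__main__":
    check_serial()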
Nodes | 70 x dual 3GHz Pentium-IV Xeon = 140 CPUs |
Node OS | RedHat or Scientific Linux |
Node memory | 1GB RAM and 20GB of local storage |
Cluster partitions | The LPC cluster is divided into two disjoint sub-clusters of 28 and 42 nodes (there is no connection, and therefore no load balancing, between the two). |
Cluster batch system | OpenPBS |
Cluster scheduling scheme | |
Fair share policy | |
The Grid | |
Downtime | |
zcat LPC-EGEE-2004-0old.pbs.gz LPC-EGEE-2004-0ce1.pbs.gz LPC-EGEE-2004-0ce2.pbs.gz | pbs2swf.pl \
    --output=l_lpc \
    \
    --proc_used=1,started \
    --proc_req=1,all \
    --executable=-1,all,overwrite \
    \
    --mem_req.type=physical \
    \
    --anonymize.partition=clrglop195.in2p3.fr:1 \
    --anonymize.partition=clrce01.in2p3.fr:2 \
    --anonymize.partition=clrce02.in2p3.fr:3 \
    \
    --anonymize.queue=test:1 \
    --anonymize.queue=short:2 \
    --anonymize.queue=long:3 \
    --anonymize.queue=day:4 \
    --anonymize.queue=infinite:5 \
    --anonymize.queue=batch:6 \
    \
    --anonymize.gid=dteam:1 \
    --anonymize.gid=dteam005:1 \
    --anonymize.gid=biomed:2 \
    --anonymize.gid=biomgrid:2 \
    \
    --Computer="3GHz Pentium-IV Xeon Linux Cluster" \
    --Installation="LPC (Laboratoire de Physique Corpusculaire)" \
    --Installation="Part of the LCG (Large hadron collider Computing Grid project)" \
    --Information="http://www.cs.huji.ac.il/labs/parallel/workload/l_lpc.html" \
    --Information="JSSPP'05 - Workload Analysis of a Cluster in a Grid Environment" \
    --Acknowledge="Emmanuel Medernach - medernac AT clermont.in2p3.fr" \
    --Conversion="Dan Tsafrir - dants AT cs.huji.ac.il" \
    --MaxNodes="70 (dual)" \
    --MaxProcs=140 \
    --TimeZoneString="Europe/Paris" \
    --MaxRuntime=259200 \
    --AllowOveruse=False \
    --Queues="Queues enforce a runtime limit on the jobs that populate them." \
    --Queues="See URL in 'Information' for details." \
    --Partitions="One small partition, later replaced by two disjoint partitions." \
    --Partitions="See URL in 'Information' for details." \
    --Note="Jobs are always serial."
option flag | meaning | details |
--proc_used=1,started --proc_req=1,all |
The number of requested processors (all jobs) and of used processors (started jobs) is set to 1. | The size of all the jobs in the LPC log is 1, but some PBS records are missing this data. We therefore set the number of requested processors (proc_req) of all jobs to 1. The same applies to used processors (proc_used), but only for jobs that actually started to run; jobs that were canceled before that point are always assigned proc_used=0 by the pbs2swf.pl script. |
--executable=-1,all,overwrite |
Set the executable of all jobs in the SWF version to be undefined (-1). | This data is actually available for all started jobs (hence we overwrite it), but it is meaningless, because it specifies the names of the PBS submittal scripts rather than the names of actual applications. Almost 88% of the jobs specify "STDIN" as their executable name, another 6% specify "test.job", and another 6% are jobs canceled before starting, so their executable name is missing from the PBS log altogether. This leaves only a few dozen jobs, which usually have names like "test1.job", "job.sh", etc. |
--mem_req.type=physical |
SWF data regarding requested memory is associated with physical (rather than virtual) memory. | By default, pbs2swf.pl prefers to extract the PBS data associated with virtual memory. However, no such data is available in the LPC log, whereas some data specifying requested physical memory is available, but only for 480 jobs (a counting sketch appears further below, after the defaults table). |
--anonymize.partition=* |
Explicitly associate PBS partitions with SWF codes that reflect the chronological order in which they were defined. | For example, the earliest partition is the 'old' one (clrglop195.in2p3.fr), so it is set to be partition number 1. If SWF codes were not assigned explicitly, they would have been assigned arbitrarily by pbs2swf.pl. A tally sketch using these codes appears after this table. |
--anonymize.queue=* |
Explicitly associating PBS queues with SWF codes such that the bigger the code, the longer the jobs that may populate it. | For example, the 'test' queue has the smallest limit on the requested runtime of the jobs that may populate it, and so it is set to be queue number 1. |
--anonymize.gid=* |
Unite PBS groups that appear different but are actually the same. | For example, PBS jobs associated with groups 'dteam' and 'dteam005' actually originate from the same group (which is indeed collectively referred to as 'dteam' in [medernach05]). And so, they are both explicitly assigned to the same SWF group code 1. |
Others | Some predefined SWF header fields. | Including only those that pbs2swf.pl cannot compute by itself (those that can be computed may not be given as command-line options). |
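To illustrate how the codes assigned by the --anonymize.* flags show up in the converted log, here is a small tally sketch. It is only an illustration: the code-to-name maps restate the flag values from the command above, while the standard 18-field SWF layout and the file name l_lpc.swf are assumptions.

# Sketch: count jobs per queue, partition, and group in the converted SWF,
# using the code assignments chosen by the --anonymize.* flags above.
# Assumptions: standard 18-field SWF layout; file name guessed from --output.
from collections import Counter

QUEUES     = {1: "test", 2: "short", 3: "long", 4: "day", 5: "infinite", 6: "batch"}
PARTITIONS = {1: "clrglop195 (old)", 2: "clrce01", 3: "clrce02"}
GROUPS     = {1: "dteam (incl. dteam005)", 2: "biomed (incl. biomgrid)"}

def tally(path="l_lpc.swf"):
    per_queue, per_partition, per_group = Counter(), Counter(), Counter()
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                                              # skip SWF comments
        f = line.split()
        per_group[GROUPS.get(int(f[12]), int(f[12]))] += 1        # field 13: group
        per_queue[QUEUES.get(int(f[14]), int(f[14]))] += 1        # field 15: queue
        per_partition[PARTITIONS.get(int(f[15]), int(f[15]))] += 1  # field 16: partition
    print("jobs per queue:    ", dict(per_queue))
    print("jobs per partition:", dict(per_partition))
    print("jobs per group:    ", dict(per_group))

if __name__ == "__main__":
    tally()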
The defaults assumed by pbs2swf.pl when extracting memory and CPU data are:
field | default quantity | default type |
requested memory | per-job (aggregated) | virtual |
used memory | per-job (aggregated) | virtual |
used CPU | per-process | - |
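As noted for --mem_req.type above, requested-memory information exists for only 480 jobs. The sketch below counts how many converted records actually carry a value in the requested-memory field (field 10, which is -1 when unknown); again, the file name and the standard SWF layout are assumptions.

# Sketch: count jobs that carry a requested (physical) memory value.
# Field 10 of an SWF record holds requested memory, -1 when unknown.
# The file name is an assumption based on the --output=l_lpc flag.

def count_mem_requests(path="l_lpc.swf"):
    with_mem = total = 0
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                          # skip SWF header/comment lines
        total += 1
        if float(line.split()[9]) != -1:      # field 10: requested memory
            with_mem += 1
    print(f"{with_mem} of {total} jobs specify requested memory")

if __name__ == "__main__":
    count_mem_requests()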
Jobs with job.SWF_ID <= 9932 were deleted.
A search for flurries hasn't been conducted yet.
Further information on flurries and the justification for removing
them can be found in:
D. Tsafrir and D. G. Feitelson,
Workload flurries.
Technical Report 2003-85, School of Computer Science and Engineering,
The Hebrew University of Jerusalem, Nov 2003.
Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.
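Because job numbering in the filtered log is not consecutive, code that indexes jobs by line position will be off. A safer pattern is to key records by the job ID in field 1, as in this sketch (the cleaned-log file name is a placeholder, and the standard 18-field SWF layout is assumed).

# Sketch: load a filtered SWF log keyed by job ID (field 1), which is robust
# to the gaps left by the deleted jobs. The file name is a placeholder.

def load_by_id(path="l_lpc_cln.swf"):
    jobs = {}
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                           # skip SWF header/comment lines
        fields = line.split()
        jobs[int(fields[0])] = fields          # field 1: SWF job number
    return jobs

if __name__ == "__main__":
    jobs = load_by_id()
    print(min(jobs), max(jobs), len(jobs))     # len(jobs) < max-min+1 implies gaps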