System | A cluster composed of 70 dual 3GHz Pentium-IV Xeon nodes (140 CPUs) running Linux, internally partitioned into two disjoint sub-clusters of 28 and 42 nodes. Part of the EGEE grid project. |
Duration | August 2004 through May 2005 (ten months) |
Jobs | 244,821 (all jobs are serial; of these, 230,448 actually started running and 14,373 were canceled before starting). |
Analysis | Found in the paper [medernach05], which introduces and models the LPC log. |
What's LPC? | LPC stands for "Laboratoire de Physique Corpusculaire" (Laboratory of Corpuscular Physics) of Université Blaise-Pascal, Clermont-Ferrand, France. |
Context |
LPC is a cluster that is part of the EGEE project (Enabling Grids for E-science in Europe). This grid employs the LCG middleware as its infrastructure (LCG is the LHC Computing Grid project). One of the project's goals is to develop an infrastructure capable of handling and analyzing the roughly 15 petabytes of data expected to be generated per year by the Large Hadron Collider (LHC), developed at CERN and scheduled to begin operation in 2007. The LPC cluster is one site within the LCG infrastructure. It is used mostly for biomedical and high-energy physics research. The status of all the sites composing the LCG can be monitored at http://goc.grid.sinica.edu.tw/gstat/, and the status of LPC in particular at http://goc.grid.sinica.edu.tw/gstat/IN2P3-LPC. The LPC site is http://clrwww.in2p3.fr/ (in French only). |
Workload |
All the jobs in the log are serial (a verification sketch over the converted SWF appears below).
|
Graciously provided by | Emmanuel Medernach (medernac AT clermont.in2p3.fr), the author of [medernach05], who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment. |
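As a quick sanity check of the claim that every job is serial, the following sketch scans the converted log and tallies started versus canceled jobs. It is only an illustration: it assumes the standard 18-field SWF record layout (comment lines start with ';') and a hypothetical output file name, l_lpc.swf, guessed from the --output=l_lpc flag of the conversion command further below.

# Minimal sketch (not part of the conversion): verify that all jobs are serial
# and count how many actually started. Assumes the standard 18-field SWF
# layout; the file name is a guess based on the --output=l_lpc flag.

def check_serial(path="l_lpc.swf"):
    started = canceled = 0
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                          # skip SWF header/comment lines
        fields = line.split()
        procs_alloc = int(fields[4])          # field 5: allocated processors
        procs_req   = int(fields[7])          # field 8: requested processors
        assert procs_req == 1, "all LPC jobs request exactly one CPU"
        if procs_alloc == 1:
            started += 1                      # job actually ran
        else:
            canceled += 1                     # canceled before start (proc_used=0)
    print(f"started: {started}, canceled before start: {canceled}")

if __name__ == "__main__":
    check_serial()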
Nodes | 70 x dual 3GHz Pentium-IV Xeon = 140 CPUs |
Node OS | RedHat or Scientific Linux |
Node memory | 1GB RAM and 20GB of local storage |
Cluster partitions | The LPC cluster is divided into two disjoint sub-clusters of 28 and 42 nodes (there is no connection, and therefore no load balancing, between the two). |
Cluster batch system | OpenPBS |
Cluster scheduling scheme | |
Fair share policy | |
The Grid | |
Downtime | |
zcat LPC-EGEE-2004-0old.pbs.gz LPC-EGEE-2004-0ce1.pbs.gz LPC-EGEE-2004-0ce2.pbs.gz | pbs2swf.pl \
    --output=l_lpc \
    \
    --proc_used=1,started \
    --proc_req=1,all \
    --executable=-1,all,overwrite \
    \
    --mem_req.type=physical \
    \
    --anonymize.partition=clrglop195.in2p3.fr:1 \
    --anonymize.partition=clrce01.in2p3.fr:2 \
    --anonymize.partition=clrce02.in2p3.fr:3 \
    \
    --anonymize.queue=test:1 \
    --anonymize.queue=short:2 \
    --anonymize.queue=long:3 \
    --anonymize.queue=day:4 \
    --anonymize.queue=infinite:5 \
    --anonymize.queue=batch:6 \
    \
    --anonymize.gid=dteam:1 \
    --anonymize.gid=dteam005:1 \
    --anonymize.gid=biomed:2 \
    --anonymize.gid=biomgrid:2 \
    \
    --Computer="3GHz Pentium-IV Xeon Linux Cluster" \
    --Installation="LPC (Laboratoire de Physique Corpusculaire)" \
    --Installation="Part of the LCG (Large hadron collider Computing Grid project)" \
    --Information="http://www.cs.huji.ac.il/labs/parallel/workload/l_lpc.html" \
    --Information="JSSPP'05 - Workload Analysis of a Cluster in a Grid Environment" \
    --Acknowledge="Emmanuel Medernach - medernac AT clermont.in2p3.fr" \
    --Conversion="Dan Tsafrir - dants AT cs.huji.ac.il" \
    --MaxNodes="70 (dual)" \
    --MaxProcs=140 \
    --TimeZoneString="Europe/Paris" \
    --MaxRuntime=259200 \
    --AllowOveruse=False \
    --Queues="Queues enforce a runtime limit on the jobs that populate them." \
    --Queues="See URL in 'Information' for details." \
    --Partitions="One small partition, later replaced by two disjoint partitions." \
    --Partitions="See URL in 'Information' for details." \
    --Note="Jobs are always serial."
option flag | meaning | details |
--proc_used=1,started --proc_req=1,all |
The number of requested processors (all jobs) and of used processors (started jobs) is set to 1. | The size of all the jobs in the LPC log is 1, but some PBS records are missing this data. We therefore set the number of requested processors (proc_req) of all jobs to 1. The same applies to used processors (proc_used), but only for jobs that actually started to run; jobs that were canceled before that point are always assigned proc_used=0 by the pbs2swf.pl script. |
--executable=-1,all,overwrite |
Set the executable of all jobs in the SWF version to be undefined (-1). | This data is actually available for all started jobs (hence we overwrite it), but it is meaningless, because it specifies the names of the PBS submittal scripts rather than the names of actual applications. Almost 88% of the jobs specify "STDIN" as their executable name, another 6% specify "test.job", and another 6% are jobs canceled before starting, so their executable name is missing from the PBS log altogether. This leaves only a few dozen jobs, which usually have names like "test1.job", "job.sh", etc. |
--mem_req.type=physical |
SWF data regarding requested memory is associated with physical (rather than virtual) memory. | By default, pbs2swf.pl prefers to extract the PBS data associated with virtual memory. However, no such data is available in the LPC log, whereas some data specifying requested physical memory is available, but only for 480 jobs (a counting sketch appears further below, after the defaults table). |
--anonymize.partition=* |
Explicitly associate PBS partitions with SWF codes that reflect the chronological order in which they were defined. | For example, the earliest partition is the 'old' one (clrglop195.in2p3.fr), so it is set to be partition number 1. If SWF codes were not assigned explicitly, they would have been assigned arbitrarily by pbs2swf.pl. A tally sketch using these codes appears after this table. |
--anonymize.queue=* |
Explicitly associating PBS queues with SWF codes such that the bigger the code, the longer the jobs that may populate it. | For example, the 'test' queue has the smallest limit on the requested runtime of the jobs that may populate it, and so it is set to be queue number 1. |
--anonymize.gid=* |
Unite PBS groups that appear different but are actually the same. | For example, PBS jobs associated with groups 'dteam' and 'dteam005' actually originate from the same group (which is indeed collectively referred to as 'dteam' in [medernach05]). And so, they are both explicitly assigned to the same SWF group code 1. |
Others | Some predefined SWF header fields. | Including only those that pbs2swf.pl cannot compute by itself (those that can be computed may not be given as command-line options). |
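To illustrate how the codes assigned by the --anonymize.* flags show up in the converted log, here is a small tally sketch. It is only an illustration: the code-to-name maps restate the flag values from the command above, while the standard 18-field SWF layout and the file name l_lpc.swf are assumptions.

# Sketch: count jobs per queue, partition, and group in the converted SWF,
# using the code assignments chosen by the --anonymize.* flags above.
# Assumptions: standard 18-field SWF layout; file name guessed from --output.
from collections import Counter

QUEUES     = {1: "test", 2: "short", 3: "long", 4: "day", 5: "infinite", 6: "batch"}
PARTITIONS = {1: "clrglop195 (old)", 2: "clrce01", 3: "clrce02"}
GROUPS     = {1: "dteam (incl. dteam005)", 2: "biomed (incl. biomgrid)"}

def tally(path="l_lpc.swf"):
    per_queue, per_partition, per_group = Counter(), Counter(), Counter()
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                                              # skip SWF comments
        f = line.split()
        per_group[GROUPS.get(int(f[12]), int(f[12]))] += 1        # field 13: group
        per_queue[QUEUES.get(int(f[14]), int(f[14]))] += 1        # field 15: queue
        per_partition[PARTITIONS.get(int(f[15]), int(f[15]))] += 1  # field 16: partition
    print("jobs per queue:    ", dict(per_queue))
    print("jobs per partition:", dict(per_partition))
    print("jobs per group:    ", dict(per_group))

if __name__ == "__main__":
    tally()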
The defaults assumed by pbs2swf.pl when extracting memory and CPU data are:
field | default quantity | default type |
requested memory | per-job (aggregated) | virtual |
used memory | per-job (aggregated) | virtual |
used CPU | per-process | - |
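As noted for --mem_req.type above, requested-memory information exists for only 480 jobs. The sketch below counts how many converted records actually carry a value in the requested-memory field (field 10, which is -1 when unknown); again, the file name and the standard SWF layout are assumptions.

# Sketch: count jobs that carry a requested (physical) memory value.
# Field 10 of an SWF record holds requested memory, -1 when unknown.
# The file name is an assumption based on the --output=l_lpc flag.

def count_mem_requests(path="l_lpc.swf"):
    with_mem = total = 0
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                          # skip SWF header/comment lines
        total += 1
        if float(line.split()[9]) != -1:      # field 10: requested memory
            with_mem += 1
    print(f"{with_mem} of {total} jobs specify requested memory")

if __name__ == "__main__":
    count_mem_requests()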
Jobs with job.SWF_ID <= 9932 were deleted.
A search for flurries hasn't been conducted yet.
Further information on flurries and the justification for removing
them can be found in:
D. Tsafrir and D. G. Feitelson,
Workload flurries.
Technical Report 2003-85, School of Computer Science and Engineering,
The Hebrew University of Jerusalem, Nov 2003.
Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.
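Because job numbering in the filtered log is not consecutive, code that indexes jobs by line position will be off. A safer pattern is to key records by the job ID in field 1, as in this sketch (the cleaned-log file name is a placeholder, and the standard 18-field SWF layout is assumed).

# Sketch: load a filtered SWF log keyed by job ID (field 1), which is robust
# to the gaps left by the deleted jobs. The file name is a placeholder.

def load_by_id(path="l_lpc_cln.swf"):
    jobs = {}
    for line in open(path):
        if line.startswith(";") or not line.strip():
            continue                           # skip SWF header/comment lines
        fields = line.split()
        jobs[int(fields[0])] = fields          # field 1: SWF job number
    return jobs

if __name__ == "__main__":
    jobs = load_by_id()
    print(min(jobs), max(jobs), len(jobs))     # len(jobs) < max-min+1 implies gaps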