Parallel Workloads Archive: LPC

The LPC Log






Introduction:

System A cluster composed of 70 dual 3GHz Pentium-IV Xeon nodes (140 CPUs) running Linux, internally partitioned into two disjoint sub-clusters of 28 and 42 nodes. Part of the EGEE grid project.
Duration August 2004 through May 2005 (ten months)
Jobs 244,821
(all are serial jobs; of these, 230,448 actually started running and 14,373 were canceled before starting).
Analysis Found in the paper [medernach05], which introduces and models the LPC log.
What's LPC? LPC stands for "Laboratoire de Physique Corpusculaire" (Laboratory of Corpuscular Physics) of Université Blaise-Pascal, Clermont-Ferrand, France.
Context LPC is a cluster that is part of the EGEE project (Enabling Grids for E-science in Europe). This grid employs the LCG middleware as its infrastructure (LCG is the LHC Computing Grid project).
One of this project's goals is to develop an infrastructure to handle and analyze the roughly 15 petabytes of data expected to be generated per year by the LHC (the Large Hadron Collider mentioned earlier, developed at CERN), scheduled to begin its operation in 2007.
The LPC cluster is one site within the LCG infrastructure. It is used mostly for biomedical and high-energy physics research. You can monitor the status of all the sites composing the LCG at http://goc.grid.sinica.edu.tw/gstat/, and in particular, the status of LPC at http://goc.grid.sinica.edu.tw/gstat/IN2P3-LPC. The LPC site is http://clrwww.in2p3.fr/ (but it's only in French).
Workload All the jobs in the logs are serial:
  • The LPC cluster is currently unique within the Parallel Workloads Archive, because its workload is composed solely of work-pile tasks called bags. Bags are submitted at remote sites (outside of LPC) and are assigned to the LPC cluster through the EGEE infrastructure.
  • A bag is a collection of independent serial jobs that perform no communication and therefore are not required to execute simultaneously, or even within the same site. Instead of communicating, jobs can write output files to grid storage elements or to the user's machine; other jobs may then read and work on the generated data, forming "pipelines" of jobs. Once all the jobs are done, the bag is complete (a minimal sketch is given below).
  • A higher-level grid scheduler fragments each bag into individual jobs and places them on (possibly) different machines, one of which is the LPC cluster. Here we only see one machine's view (LPC's), so we get a set of independent processes. Unfortunately, we have no knowledge of which serial jobs belong to the same bag, nor whether additional jobs belonging to the same bag ran on other, remote, machines.
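  • For concreteness, a minimal Python sketch of the bag structure described above, under the assumption that a bag is simply a set of independent serial jobs (the class and field names are illustrative, not part of the EGEE middleware or the log):

        # Illustrative only: a bag of independent serial jobs, complete once all
        # of its jobs have finished (on whichever sites happened to run them).
        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class SerialJob:
            job_id: int
            finished: bool = False   # no communication with other jobs in the bag

        @dataclass
        class Bag:
            jobs: List[SerialJob] = field(default_factory=list)

            def is_complete(self) -> bool:
                return all(job.finished for job in self.jobs)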
Graciously provided by Emmanuel Medernach (medernac AT clermont.in2p3.fr), the author of [medernach05], who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.


Downloads:



Papers Using this Log:

This log was used in the following papers: [medernach05] [amar08a] [amar08b] [verma08] [thebe09] [agmonby11] [lindsay12] [deng13] [emeras13] [skowron13] [rajbhandary13]


System Environment:

Nodes 70 X 3GHz dual Pentium-IV Xeon = 140 CPUs
Node OS RedHat or Scientific Linux
Node memory 1GB RAM and 20GB of local storage
Cluster partitions The LPC cluster is divided into two disjoint sub-clusters (there is no connection, and therefore no load balancing, between the two). The following describes the partitions:
4 Aug 2004 The LPC cluster is in its configuration/testing phase. The "old" partition is defined (jobs serviced by clrglop195.in2p3.fr in the original PBS log), with only 2 CPUs. This is partition #1 in the SWF log.
15 Sep 2004 Arrival of 70 dual nodes (140 CPUs).
The 'old' partition is enlarged to 28 nodes (56 CPUs).
A new partition called "CE1" is defined (jobs serviced by clrce01.in2p3.fr in the original PBS log) which contains the remaining 42 nodes (84 CPUs). This is partition #2 in the SWF log. "CE" stands for Computing Element.
1 Dec 2004 Changing some configuration parameters of the 'old' partition which is now called "CE2" (jobs serviced by clrce02.in2p3.fr in the original log). This is partition #3 in the SWF log.
Cluster batch system OpenPBS
Cluster scheduling scheme
  • The scheduler used is Maui, which periodically polls the PBS wait queues and decides which jobs to run next. Each job is allocated a CPU for its exclusive use and runs to completion (or until it exceeds its runtime limit, after which it is killed by the system).
  • Roughly speaking, Maui is used with its default EASY configuration, that is, FCFS with backfilling (plus a bonus for shorter jobs, as explained shortly). However, since all the jobs are serial, there is actually no backfilling activity: whenever a CPU is available, it suffices for the job at the head of the wait queue, so no other job is allowed to skip ahead of it.
  • Jobs arriving at the LPC are always associated with a runtime (wall-clock) estimate. This is either given explicitly by users, or taken from their (per-user) defaults file.
  • Jobs are always submitted from remote sites (there are no local jobs). When a job arrives at LPC, it has already been designated (by the grid infrastructure) to a certain partition and queue, which may be one of the following:

    Partition 'old' (partition #1 in the SWF log)
    Lifetime / size: defined with 2 CPUs on 4 Aug 2004; enlarged to 56 CPUs on 15 Sep 2004; replaced by CE2 on 1 Dec 2004

      queue     in SWF  max CPU [1]  max wall clock  max run jobs [2]  bonus [3]
      short     2       00:15        02:00           -                 -
      long      3       12:00        24:00           -                 -
      infinite  5       48:00        72:00           -                 -

    Partition 'CE1' (partition #2 in the SWF log); "CE" stands for Computing Element
    Lifetime / size: defined with 84 CPUs on 15 Sep 2004 (jobs serviced by clrce01.in2p3.fr in the original PBS log)

      queue     in SWF  max CPU [1]  max wall clock  max run jobs [2]  bonus [3]
      test      1       00:05        00:15           10                1,000,000
      short     2       00:15        02:00           80                270,000
      long      3       12:00        24:00           80                10,000
      day       4       24:00        36:00           80                0
      infinite  5       48:00        72:00           80                -360,000

    Partition 'CE2' (partition #3 in the SWF log)
    Lifetime / size: defined with 56 CPUs on 1 Dec 2004 to replace 'old' (jobs serviced by clrce02.in2p3.fr in the original PBS log)

      queue     in SWF  max CPU [1]  max wall clock  max run jobs [2]  bonus [3]
      test      1       00:05        00:15           10                1,000,000
      short     2       00:20        01:30           50                270,000
      long      3       08:00        24:00           50                10,000
      day       4       24:00        36:00           50                0
      infinite  5       48:00        72:00           50                -360,000
      batch     6       -            01:00           -                 0

    All time limits are given as [HH:MM].
    [1] Max CPU is not used by Maui (which only uses the wall-clock runtime to determine when jobs should be killed). Rather, it is used by the grid infrastructure as described below.
    [2] Effective since 7 Mar 2005.
    [3] Effective since 1 Jun 2005 (upgrade from LCG-2.3.0 to LCG-2.4.0).

  • Whenever a CPU becomes available (recall that all jobs are serial), Maui chooses for execution the job J with the highest priority in the partition, provided that the number of currently running jobs associated with J.queue does not already exceed the max-running-jobs limit specified in the table above. The priority of J is
        J.priority = 10 * J.wait_time  +  bonus
    
    where 'bonus' is determined according to J's partition/queue as specified in the table above. The 'bonus' is effective since 1 Jun 2005 (before that, the 'bonus' values can be considered 0). The 'max-running-jobs' limits are effective since 7 Mar 2005 (before that, these limits can be considered equal to the partition size). This means that until 1 Jun 2005 the division of jobs between queues was meaningless, and the scheduling scheme was actually plain FCFS. A minimal sketch of this selection rule is given below.
  • In addition to the above, there are also fairness considerations involved. These are described below.
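  • A minimal Python sketch of the selection rule described above (an illustration only, not actual Maui code; the job fields and the queue_limits structure are assumptions):

        # Illustration of the described rule, not actual Maui code.
        # queue_limits maps a queue name to its 'bonus' and 'max_running' values
        # from the table above (bonus is effectively 0 before 1 Jun 2005, and
        # max_running is effectively unlimited before 7 Mar 2005).

        def pick_next_job(waiting_jobs, running_count, queue_limits, now):
            """Return the waiting job to start on a freed CPU, or None."""
            def priority(job):
                wait_time = now - job["submit_time"]
                return 10 * wait_time + queue_limits[job["queue"]]["bonus"]

            eligible = [
                j for j in waiting_jobs
                if running_count.get(j["queue"], 0) < queue_limits[j["queue"]]["max_running"]
            ]
            return max(eligible, key=priority) if eligible else None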
Fair share policy
  • There are 8 groups appearing in the log, most of which are referred to in the accompanying paper [medernach05]:

    PBS group                          in SWF
    dteam (also includes dteam005)     1
    biomed (also includes biomgrid)    2
    lhcb                               3
    cms                                4
    atlas                              5
    alice                              6
    dzero                              7
    sixt                               8

  • The fair-share policy for the above groups is part of the Maui configuration, and the syntax in which it is expressed (as seen below) is explained here:
    http://www.clusterresources.com/products/maui/docs/6.3fairshare.shtml

  • A fair-share policy was utilized only after 1 Jun 2005, at which point it was set to (a hedged sketch of how these parameters combine is given after the configuration listings below):
        FSPOLICY         DEDICATEDPES
        FSINTERVAL       00:15:00
        FSDEPTH          24
        FSDECAY          0.85
        FSWEIGHT         500
        FSUSERWEIGHT     10
        FSGROUPWEIGHT    30
    
        GROUPCFG[alice]  FSTARGET=15
        GROUPCFG[atlas]  FSTARGET=15
        GROUPCFG[biomed] FSTARGET=15
        GROUPCFG[cms]    FSTARGET=15
        GROUPCFG[lhcb]   FSTARGET=15
        GROUPCFG[dteam]  FSTARGET=15
    
  • Then, on 24 Mar 2005, the policy was changed to reflect the different groups' needs (some groups pay for using the LPC cluster):
        CE1:  GROUPCFG[alice]  FSTARGET=15 MAXPROC=20,67
              GROUPCFG[atlas]  FSTARGET=15 MAXPROC=20,67
              GROUPCFG[biomed] FSTARGET=45 MAXPROC=20,81
              GROUPCFG[cms]    FSTARGET=15 MAXPROC=20,67
              GROUPCFG[lhcb]   FSTARGET=15 MAXPROC=20,67
              GROUPCFG[dteam]  FSTARGET=55 MAXPROC=6,67
              GROUPCFG[dzero]  FSTARGET=8  MAXPROC=20,67

        CE2:  GROUPCFG[alice]  FSTARGET=18 MAXPROC=20,45
              GROUPCFG[atlas]  FSTARGET=18 MAXPROC=20,45
              GROUPCFG[biomed] FSTARGET=18 MAXPROC=20,45
              GROUPCFG[lhcb]   FSTARGET=18 MAXPROC=20,45
              GROUPCFG[dteam]  FSTARGET=15 MAXPROC=6,6
              GROUPCFG[cms]    FSTARGET=5  MAXPROC=20,25
              GROUPCFG[dzero]  FSTARGET=8  MAXPROC=20,45
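
  • As a rough illustration of how the fair-share parameters above combine, the following Python sketch computes a Maui-style decayed fair-share usage per group. It is based on the Maui documentation cited above, not the actual Maui code, and the way the result is folded into job priority (via FSWEIGHT/FSGROUPWEIGHT) is omitted:

        # Hedged sketch: usage is accumulated per FSINTERVAL (15-minute) window,
        # the FSDEPTH (24) most recent windows are kept, and each is decayed by
        # FSDECAY (0.85) according to its age.

        FSDEPTH = 24
        FSDECAY = 0.85

        def effective_usage(windows):
            """windows[0] is the most recent 15-minute window of dedicated
            processor-seconds (per FSPOLICY DEDICATEDPES)."""
            return sum(u * FSDECAY ** i for i, u in enumerate(windows[:FSDEPTH]))

        def fairshare_percent(group, usage_by_group):
            """A group's decayed usage as a percentage of the total decayed
            usage; Maui compares this against the group's FSTARGET."""
            total = sum(effective_usage(w) for w in usage_by_group.values())
            return 100.0 * effective_usage(usage_by_group[group]) / total if total else 0.0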
    
The Grid
  • The LCG middleware serves as the infrastructure of the LCG/EGEE grid. It includes components that have been developed by a number of projects, such as the Globus Toolkit and the Condor system. The middleware acts as a layer of software that provides homogeneous access to different grid resource centers.
  • The grid is organized as a collection of Computing Elements (CEs) and Storage Elements (SEs). A Computing Element is a collection of batch queues (like those specified above) on top of a homogeneous cluster.
  • LCG users are associated with Virtual Organizations (VOs). These are collections of individuals and/or organizations that share resources. Each VO corresponds to a "group" like those specified above.
  • LCG grid users specify their job needs using a file written in JDL (Job Description Language). Users can specify the executable, arguments, standard I/O, needed environment (OS, distribution, dependencies), required input files from their local file system, and more. Users should also specify either the maximal wall-clock time or the maximal CPU time required by their job. If neither is specified, a local default for the maximal wall-clock time is used (taken from a per-user configuration file of defaults).
  • The JDL file is then submitted through a user interface to a Resource Broker (RB), which identifies all the queues (within all CEs) that can be used by the user's VO and meet the job's requirements. If more than one such queue is available, the RB tries to choose the "best" one. Users can influence this decision by optionally specifying their preferred strategy. This can be a ranked combination of considerations like "choose the site with the biggest number of free CPUs", "least mean waiting time", "least number of waiting jobs", etc. More details can be found in this presentation (which provides some additional references in its final slide).
  • Apparently, based on the data in the LPC log, the RB doesn't always do a good job in assigning jobs to the "best" queue:

    queue     max runtime (wall clock)  runtime estimates used by jobs in this queue
    long      24:00                     8:00, 24:00
    day       36:00                     8:00, 16:00, 36:00
    infinite  72:00                     8:00, 16:00, 24:00, 32:00, 60:00, 72:00

    This means, for example, that an 8h job that was submitted to the 'infinite' queue was also eligible for the 'long' queue, in which it would have enjoyed a far superior priority (as specified in the partition/queue table above).
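    As an illustration, the following Python sketch counts such cases directly in the SWF file. It assumes the standard SWF field order (field 9 = requested time in seconds, field 15 = queue number), uses the CE1 wall-clock limits from the table above, and the file name is a placeholder:

        # Hedged sketch: count jobs whose runtime estimate would also have fit a
        # queue with a smaller wall-clock limit than the queue they were sent to.
        # Queue numbers follow the SWF mapping above; limits are CE1's, in seconds.
        QUEUE_LIMIT = {1: 15*60, 2: 2*3600, 3: 24*3600, 4: 36*3600, 5: 72*3600}

        def count_misplaced(swf_path="LPC-EGEE-2004-1.2-cln.swf"):
            count = 0
            with open(swf_path) as f:
                for line in f:
                    if line.startswith(";") or not line.strip():
                        continue                  # skip SWF header comments
                    fields = line.split()
                    estimate = int(fields[8])     # requested (wall-clock) time
                    queue = int(fields[14])       # queue number
                    if queue not in QUEUE_LIMIT or estimate <= 0:
                        continue
                    if any(estimate <= lim < QUEUE_LIMIT[queue]
                           for lim in QUEUE_LIMIT.values()):
                        count += 1
            return count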
Downtime
start                 end                   description
18:00:00 10 Nov 2004  12:00:00 15 Nov 2004  Electrical problems
02:03:30 25 Nov 2004  18:00:00 26 Nov 2004  Replacement of an old CE
12:00:00 01 Dec 2004  08:00:00 03 Dec 2004  Network reconfiguration
12:00:00 24 Jan 2005  18:00:00 25 Jan 2005  General electrical power cut
08:00:00 25 Feb 2005  18:00:00 01 Mar 2005  Air cooling failure
11:00:00 10 Apr 2005  14:00:00 10 Apr 2005  Air cooling failure
11:00:00 27 Apr 2005  08:00:00 03 May 2005  SE disk server problem
06:00:00 17 May 2005  09:00:00 19 May 2005  Moving the cluster from one place to another


Conversion Notes:



Usage Notes:

Only from 15 Sep 2004 was the machine at its full size (140 CPUs). Until then (4 Aug through 14 Sep) the machine was in a testing phase and was composed of only 2 CPUs. Therefore, in the cleaned version of the log, the testing period is removed, which means that all the jobs with
     job.SWF_ID <= 9932
were deleted.
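
For illustration, the following Python sketch performs this filtering on an SWF file. The file names are placeholders, and it assumes that the job ID is the first whitespace-separated field of each data line, with header comments starting with ';':

    # Illustrative sketch: drop the 2-CPU testing period (jobs with SWF ID <= 9932).
    def remove_testing_period(src="LPC-EGEE-original.swf", dst="LPC-EGEE-cleaned.swf"):
        with open(src) as fin, open(dst, "w") as fout:
            for line in fin:
                if line.startswith(";") or not line.strip():
                    fout.write(line)      # keep SWF header comments as-is
                    continue
                if int(line.split()[0]) > 9932:
                    fout.write(line)      # keep only jobs after the testing phase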

A search for flurries hasn't been conducted yet.
Further information on flurries and the justification for removing them can be found in:
D. Tsafrir and D. G. Feitelson, Workload flurries. Technical Report 2003-85, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Nov 2003.

Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.

The Log in Graphics

File LPC-EGEE-2004-1.2-cln.swf

Available plots: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot; utilization; offered load; performance.

