This log contains more than 20 months' worth of data from the Curie supercomputer operated by CEA (a French government-funded technological research organization). The data comes from three partitions with a total of 11,808 Intel processors (93,312 cores) and an additional 288 Nvidia GPUs. However, in the first year only one and then two partitions were available, with a capacity of only around 1/6 of the total, and the full capacity was in effect for only the last 10 months. This implies that the load in different parts of the log was quite different, making it unsuitable for bulk usage in simulations. However, the cleaned version is perfectly usable; see the usage notes below. The workload log from the CEA Curie system was graciously provided by Joseph Emeras (Joseph.Emeras@imag.fr). If you use this log in your work, please use a similar acknowledgment.
Initially the system comprised 360 "fat" nodes, model S6010 bullx. Each node has four 8-core Intel Nehalem-EX X7560 2.26 GHz processors. The total is therefore 1,440 processors and 11,520 cores. Each node also has 128 GB of memory and a 2 TB local disk. The scheduler used three partition names to access this hardware: test, parallel, and batch.
In the summer of 2011 (late August) another partition was added, called hybrid, because its nodes combine Intel processors and Nvidia GPUs. In the hybrid partition there are 16 bullx B chassis, each with 9 hybrid B505 blades. Each such blade has 2 Intel Westmere 2.66 GHz processors and 2 Nvidia M2090 T20A GPUs, for a total of 288 Intel + 288 Nvidia processors. The Intel processors have 4 cores each. The Nvidia GPUs have 512 cores and 6 GB of on-board memory.
At about this time the original partition became known as large. Later, each group of four fat nodes was combined into a single superfat node (without changing the number of cores or the amount of memory), and the partition name was changed to xlarge.
Later yet another partition was added, composed of "thin" nodes. These are 5,040 Bullx model B510 nodes. Each node has 2 Intel Sandy Bridge EP (E5-2680) 2.7 GHz processors, 64 GB of memory, and an SSD disk. Each processor has 8 cores, so the total number of cores in the whole partition is 80,640. The name of this partition is standard, and it appears in the log starting from February 2012.
Note that in the first year of the trace the system capacity was only the 360 fat nodes, later joined by the hybrid nodes. The full capacity was only in effect starting from February 2012.
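As a sanity check on the numbers quoted above, the following short Python sketch (not part of the log distribution) simply redoes the per-partition arithmetic and reproduces the totals of 11,808 processors and 93,312 cores:

    # Per-partition counts as described above:
    # (nodes, Intel processors per node, cores per processor)
    partitions = {
        "large/xlarge (fat nodes)": (360, 4, 8),    # Nehalem-EX X7560
        "hybrid (Intel side only)": (144, 2, 4),    # 16 chassis x 9 B505 blades
        "standard (thin nodes)":    (5040, 2, 8),   # Sandy Bridge EP E5-2680
    }

    procs = sum(n * p for n, p, c in partitions.values())
    cores = sum(n * p * c for n, p, c in partitions.values())
    print(procs, cores)  # -> 11808 93312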
The system's nodes are connected by an InfiniBand QDR full fat-tree network. There is also a global file system based on 5 PB of disks (100 GB/s bandwidth), 10 PB of tapes, and 1 PB of disk cache.
For additional information see http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm.
As noted above, the first part of the log reflects a much smaller system configuration than the full machine. In addition, a number of large flurries (bursts of unusually high activity by single users) exist in the log. The cleaned version of the log removes both these problems, and it is recommended that the clean version be used.
The clean version is available as CEA-Curie-2011-2.1-cln.swf. The filters used to remove the initial section and the flurries that were identified are:
submitted before 03 Feb 2012 (272,392 jobs)
user=204 and job>274117 and job<303565 (28,878 jobs)
user=288 and job>372152 and job<593821 (118,014 jobs)
user=553 and job>542319 and job<587601 (37,905 jobs)
user=4 and job>518257 and job<600525 (3,123 jobs)
Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered log job numbering is not consecutive. Moreover, because the whole initial part of the log is discarded, the start time indication in the header comments is also wrong.
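For those who wish to work from the original log rather than the cleaned file, the following Python sketch illustrates roughly how such filters can be applied to an SWF file. It assumes the standard SWF field layout (field 1 = job number, field 2 = submit time in seconds relative to the log start, field 12 = user ID), and the cutoff for 03 Feb 2012 is a placeholder that must be derived from the UnixStartTime header comment; this is an illustration, not the script that produced the cleaned log.

    # Illustrative cleaning filter for an SWF file (not the official script).
    # Assumed Standard Workload Format layout:
    #   field 1 = job number, field 2 = submit time (seconds since the
    #   log's UnixStartTime), field 12 = user ID.

    CUTOFF = None  # placeholder: seconds corresponding to 03 Feb 2012,
                   # to be derived from the log's UnixStartTime header

    FLURRIES = [   # (user, lower job bound, upper job bound), exclusive bounds
        (204, 274117, 303565),
        (288, 372152, 593821),
        (553, 542319, 587601),
        (4,   518257, 600525),
    ]

    def keep(fields):
        job, submit, user = int(fields[0]), float(fields[1]), int(fields[11])
        if CUTOFF is not None and submit < CUTOFF:
            return False  # initial low-capacity period
        return not any(u == user and lo < job < hi for u, lo, hi in FLURRIES)

    def filter_swf(src, dst):
        with open(src) as fin, open(dst, "w") as fout:
            for line in fin:
                # keep header comments and blank lines verbatim; filter data lines
                if line.startswith(";") or not line.split():
                    fout.write(line)
                elif keep(line.split()):
                    fout.write(line)

    # filter_swf("path/to/original.swf", "filtered.swf")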
Also, jobs using the hybrid partition should probably not be used when conventional parallel machines are of interest. The recorded number of cores used refers only to the Intel cores, not to the GPUs. The allocation is always in full nodes (and thus multiples of 8 Intel cores). But these jobs remain in the cleaned log.
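If hybrid jobs need to be excluded as well, the SWF partition number field can in principle be used, assuming the log populates it. A minimal sketch, where HYBRID_ID is a hypothetical value that has to be looked up in the log's header comments:

    # Hypothetical: drop jobs that ran in the hybrid partition.
    # HYBRID_ID is a placeholder; the actual numeric code (if any) must be
    # taken from the partition-related header comments of the log.
    HYBRID_ID = None

    def keep_non_hybrid(fields):
        # field 16 of the SWF format is the partition number
        return HYBRID_ID is None or int(fields[15]) != HYBRID_ID

This predicate can be combined with the keep() function in the sketch above.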