This log contains more than 20 months' worth of data from the Curie supercomputer operated by CEA (a French government-funded technological research organization). The data comes from three partitions with a total of 11,808 Intel processors (93,312 cores) and an additional 288 Nvidia GPUs. However, in the first year only one and then two partitions were available, with a capacity of only around 1/6 of the total, and the full capacity was in effect for only the last 10 months. This implies that the load in different parts of the log was quite different, making it unsuitable for bulk usage in simulations. However, the cleaned version is perfectly usable; see the usage notes below. The workload log from the CEA Curie system was graciously provided by Joseph Emeras (Joseph.Emeras@imag.fr). If you use this log in your work, please use a similar acknowledgment.
Initially the system comprised 360 "fat" nodes, model S6010 bullx. Each node has four 8-core Intel Nehalem-EX X7560 2.26 GHz processors. The total is therefore 1,440 processors and 11,520 cores. Each node also has 128 GB of memory and a 2 TB local disk. The scheduler used three partition names to access this hardware: test, parallel, and batch.
In the summer of 2011 (late August) another partition was added, called hybrid, because its nodes combine Intel processors and Nvidia GPUs. In the hybrid partition there are 16 bullx B chassis, each with 9 hybrid B505 blades. Each such blade has 2 Intel Westmere 2.66 GHz processors and 2 Nvidia M2090 T20A GPUs, for a total of 288 Intel + 288 Nvidia processors. The Intel processors have 4 cores each. The Nvidia GPUs have 512 cores and 6 GB of on-board memory.
At about this time the original partition became known as large. Later, each group of four fat nodes was combined into a single superfat node (without changing the number of cores or the amount of memory), and the partition name was changed to xlarge.
Later yet another partition was added, composed of "thin" nodes. These are 5,040 Bullx model B510 nodes. Each node has 2 Intel Sandy Bridge EP (E5-2680) 2.7 GHz processors, 64 GB of memory, and an SSD disk. Each processor has 8 cores, so the total number of cores in the whole partition is 80,640. The name of this partition is standard, and it appears in the log starting from February 2012.
Note that in the first year of the trace the system capacity was only the 360 fat nodes, later joined by the hybrid nodes. The full capacity was only in effect starting from February 2012.
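As a sanity check on the numbers quoted above, the following short Python sketch (not part of the log distribution) simply redoes the per-partition arithmetic and reproduces the totals of 11,808 processors and 93,312 cores:

    # Per-partition counts as described above:
    # (nodes, Intel processors per node, cores per processor)
    partitions = {
        "large/xlarge (fat nodes)": (360, 4, 8),    # Nehalem-EX X7560
        "hybrid (Intel side only)": (144, 2, 4),    # 16 chassis x 9 B505 blades
        "standard (thin nodes)":    (5040, 2, 8),   # Sandy Bridge EP E5-2680
    }

    procs = sum(n * p for n, p, c in partitions.values())
    cores = sum(n * p * c for n, p, c in partitions.values())
    print(procs, cores)  # -> 11808 93312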
The system's nodes are connected by an InfiniBand QDR full fat-tree network. There is also a global file system based on 5 PB of disks (100 GB/s bandwidth), 10 PB of tapes, and 1 PB of disk cache.
For additional information see http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm.
As noted above, the first part of the log reflects a much smaller system configuration than the full machine. In addition, a number of large flurries (bursts of unusually high activity by single users) exist in the log. The cleaned version of the log removes both these problems, and it is recommended that the clean version be used.
The clean version is available as CEA-Curie-2011-2.1-cln.swf. The filters used to remove the initial section and the flurries that were identified are:
submitted before 03 Feb 2012 (272,392 jobs)
user=204 and job>274117 and job<303565 (28,878 jobs)
user=288 and job>372152 and job<593821 (118,014 jobs)
user=553 and job>542319 and job<587601 (37,905 jobs)
user=4 and job>518257 and job<600525 (3,123 jobs)
Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered log job numbering is not consecutive. Moreover, because the whole initial part of the log is discarded, the start time indication in the header comments is also wrong.
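For those who wish to work from the original log rather than the cleaned file, the following Python sketch illustrates roughly how such filters can be applied to an SWF file. It assumes the standard SWF field layout (field 1 = job number, field 2 = submit time in seconds relative to the log start, field 12 = user ID), and the cutoff for 03 Feb 2012 is a placeholder that must be derived from the UnixStartTime header comment; this is an illustration, not the script that produced the cleaned log.

    # Illustrative cleaning filter for an SWF file (not the official script).
    # Assumed Standard Workload Format layout:
    #   field 1 = job number, field 2 = submit time (seconds since the
    #   log's UnixStartTime), field 12 = user ID.

    CUTOFF = None  # placeholder: seconds corresponding to 03 Feb 2012,
                   # to be derived from the log's UnixStartTime header

    FLURRIES = [   # (user, lower job bound, upper job bound), exclusive bounds
        (204, 274117, 303565),
        (288, 372152, 593821),
        (553, 542319, 587601),
        (4,   518257, 600525),
    ]

    def keep(fields):
        job, submit, user = int(fields[0]), float(fields[1]), int(fields[11])
        if CUTOFF is not None and submit < CUTOFF:
            return False  # initial low-capacity period
        return not any(u == user and lo < job < hi for u, lo, hi in FLURRIES)

    def filter_swf(src, dst):
        with open(src) as fin, open(dst, "w") as fout:
            for line in fin:
                # keep header comments and blank lines verbatim; filter data lines
                if line.startswith(";") or not line.split():
                    fout.write(line)
                elif keep(line.split()):
                    fout.write(line)

    # filter_swf("path/to/original.swf", "filtered.swf")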
Also, jobs using the hybrid partition should probably not be used when conventional parallel machines are of interest. The recorded number of cores used refers only to the Intel cores, not to the GPUs. The allocation is always in full nodes (and thus multiples of 8 Intel cores). But these jobs remain in the cleaned log.
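If hybrid jobs need to be excluded as well, the SWF partition number field can in principle be used, assuming the log populates it. A minimal sketch, where HYBRID_ID is a hypothetical value that has to be looked up in the log's header comments:

    # Hypothetical: drop jobs that ran in the hybrid partition.
    # HYBRID_ID is a placeholder; the actual numeric code (if any) must be
    # taken from the partition-related header comments of the log.
    HYBRID_ID = None

    def keep_non_hybrid(fields):
        # field 16 of the SWF format is the partition number
        return HYBRID_ID is None or int(fields[15]) != HYBRID_ID

This predicate can be combined with the keep() function in the sketch above.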