This log contains several months' worth of accounting records from a large Linux cluster called Thunder, installed at Lawrence Livermore National Lab. For more information about Linux clusters at LLNL, see https://computing.llnl.gov/tutorials/linux_clusters/. This specific cluster has 1024 nodes, each with 4 processors, for a total of 4096 processors. At the time this log was recorded, Thunder was considered a "capacity" computing resource, meaning that it was intended for running large numbers of small to medium-sized jobs. This is in contrast with the newer Atlas cluster, which is a "capability" machine, used for running large parallel jobs that cannot execute on lesser machines. Note that the log does not include arrival information, only start times. The LLNL Thunder workload log was graciously provided by Moe Jette, who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.
The nodes are divided into the following partitions:
login | 4 nodes |
debug | 16 nodes |
batch | 986 nodes |
file servers | 16 nodes |
metadata servers | 2 nodes |
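As a quick consistency check, the partition sizes listed above can be summed and compared with the stated totals of 1024 nodes and 4096 processors. This is only a minimal sketch: the dictionary and variable names are illustrative, and the processor count for the batch partition is derived here from the figure of 4 processors per node rather than stated in the log description.

```python
# Partition sizes as listed in the table above (nodes per partition).
PARTITIONS = {
    "login": 4,
    "debug": 16,
    "batch": 986,
    "file servers": 16,
    "metadata servers": 2,
}
PROCESSORS_PER_NODE = 4  # from the cluster description

total_nodes = sum(PARTITIONS.values())
total_processors = total_nodes * PROCESSORS_PER_NODE
# Derived value (not stated in the description): processors in the batch partition.
batch_processors = PARTITIONS["batch"] * PROCESSORS_PER_NODE

print(total_nodes, total_processors, batch_processors)  # prints: 1024 4096 3944
```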
This file contains one line per completed job in the Slurm format. The fields are
COMPLETED | 1 |
FAILED | 0 |
TIMEOUT | 0 |
NODE_FAIL | 0 |
CANCELLED | 5 |
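The table above appears to pair each Slurm job state with a numeric status code. The values match the Standard Workload Format convention of 1 for completed, 0 for failed, and 5 for cancelled, but that reading is an assumption here. A minimal sketch of such a lookup follows; the function name `slurm_state_to_status` and the fallback value of -1 for unrecognized states are illustrative choices, not something taken from the log's conversion scripts.

```python
# State-to-status pairs exactly as they appear in the table above.
STATE_TO_STATUS = {
    "COMPLETED": 1,
    "FAILED": 0,
    "TIMEOUT": 0,
    "NODE_FAIL": 0,
    "CANCELLED": 5,
}


def slurm_state_to_status(state: str, default: int = -1) -> int:
    """Return the numeric status for a Slurm job state.

    Hypothetical helper: unknown states fall back to `default`
    (-1 is an assumed placeholder for "unknown").
    """
    return STATE_TO_STATUS.get(state.strip().upper(), default)


codes = [slurm_state_to_status(s) for s in ("COMPLETED", "TIMEOUT", "CANCELLED")]
print(codes)  # prints: [1, 0, 5]
```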
A flurry is a burst of very high activity by a single user. The filters used to remove the three flurries that were identified are
user=160 and job>19279 and job<19453 (173 jobs)
user=79 and job>47409 and job<58080 (6539 jobs)
user=40 and job>109910 and job<110858 (911 jobs)
Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.
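The three filters can be read as predicates on a job's user ID and job number, with both bounds exclusive. The sketch below applies them to a handful of made-up records. The `Job` tuple, the function names, and the sample data are hypothetical, and this is not the script actually used to produce the cleaned log.

```python
from typing import Iterable, List, NamedTuple


class Job(NamedTuple):
    job: int   # job number in the original log
    user: int  # user ID


# (user, exclusive lower bound, exclusive upper bound), copied from the filters above.
FLURRIES = [
    (160, 19279, 19453),
    (79, 47409, 58080),
    (40, 109910, 110858),
]


def is_flurry(j: Job) -> bool:
    """True if the job matches any of the three flurry filters."""
    return any(j.user == user and lo < j.job < hi for user, lo, hi in FLURRIES)


def remove_flurries(jobs: Iterable[Job]) -> List[Job]:
    """Drop flurry jobs; the remaining jobs keep their original numbers."""
    return [j for j in jobs if not is_flurry(j)]


# Made-up sample records, for illustration only.
sample = [Job(19279, 160), Job(19280, 160), Job(19300, 7), Job(50000, 79)]
cleaned = remove_flurries(sample)
print([j.job for j in cleaned])  # prints: [19279, 19300]
```

Because surviving jobs are not renumbered, gaps appear in the job numbers of the cleaned output.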
Further information on flurries and the justification for removing them can be found in:
File LLNL-Thunder-2007-1.1-cln.swf