Parallel Workloads Archive: LLNL Thunder

The LLNL Thunder log

System:	Linux Cluster (Thunder) at LLNL
Duration:	Feb 2007 to Jun 2007
Jobs:	128,662

This log contains several months' worth of accounting records from a large Linux cluster called Thunder, installed at Lawrence Livermore National Lab. For more information about Linux clusters at LLNL, see https://computing.llnl.gov/tutorials/linux_clusters/. This specific cluster has 1024 nodes, each with 4 processors, for a total of 4096 processors.

At the time this log was recorded, Thunder was considered a "capacity" computing resource, meaning that it was intended for running large numbers of small to medium jobs. This is in contrast with the newer Atlas cluster, which is a "capability" machine, used for running large parallel jobs that cannot execute on lesser machines.

Note that the log does not include arrival information, only start times.

The LLNL Thunder workload log was graciously provided by Moe Jette, who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.

Downloads:

LLNL-Thunder-2007-0	2.3 MB gz	original log
LLNL-Thunder-2007-1.swf	1.4 MB gz	converted log
LLNL-Thunder-2007-1.1-cln.swf	1.3 MB gz	cleaned log -- RECOMMENDED, see usage notes


Papers Using this Log:

This log was used in the following papers: [thebe09] [pascual09] [minh11] [kleineweber11] [yuan11] [garg11] [lindsay12] [etinski12] [gomezm13] [liang13] [ming13] [rajbhandary13] [tian14] [feitelson14] [lic14] [lucarelli17]

System Environment

Thunder is a 1024-node Linux cluster. Each node has 4 Intel IA-64 Itanium processors clocked at 1.4 GHz and 8 GB of memory. The nodes are connected by a Quadrics network. When it was installed in 2004, this was the #2 machine on the Top500 list.

The nodes are divided into the following partitions:

login	4 nodes
debug	16 nodes
batch	986 nodes
file servers	16 nodes
metadata servers	2 nodes

The data in the log pertains to jobs that ran on the debug and batch partitions. Scheduling is performed with the LCRM and Slurm resource management systems. For more information about Slurm, see URL https://computing.llnl.gov/LCdocs/slurm/.

Log Format

The original log is available as LLNL-Thunder-2007-0.

This file contains one line per completed job in the Slurm format. The fields are

JobId=<number>
UserId=xxxxx(<number>) - the x's hide the actual user name to preserve privacy
Name=<string> - name of executable (script), could be empty
JobState=<status>
Partition=<string>
TimeLimit=<number> - in minutes
StartTime=<date and time>
EndTime=<date and time>
NodeList=<string> - comma-separated list of single nodes and ranges
NodeCnt=<number>
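
As an illustration of the field list above, the following is a minimal Python sketch of reading one record into a dictionary. It assumes that the fields appear on a single line as space-separated Key=value tokens; the exact layout, the date format, and the sample record are assumptions made for illustration and are not taken from the log. The function name parse_record is hypothetical.

```python
import re

# Field names as listed in the format description.
FIELDS = ["JobId", "UserId", "Name", "JobState", "Partition",
          "TimeLimit", "StartTime", "EndTime", "NodeList", "NodeCnt"]

# Assumption: each record is one line of space-separated Key=value tokens.
# Splitting on the known keys (rather than on whitespace) tolerates an
# empty Name or a Name that contains spaces.
_KEY_RE = re.compile(r"(?:^|\s)(" + "|".join(FIELDS) + r")=")

def parse_record(line):
    """Return a dict mapping each field name to its raw string value."""
    matches = list(_KEY_RE.finditer(line))
    record = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        record[m.group(1)] = line[m.end():end].strip()
    return record

# Hypothetical record, invented for illustration (not taken from the log).
sample = ("JobId=1234 UserId=xxxxx(160) Name= JobState=COMPLETED "
          "Partition=batch TimeLimit=30 StartTime=02/01-10:00:00 "
          "EndTime=02/01-10:12:30 NodeList=thunder[5-8] NodeCnt=4")
rec = parse_record(sample)

# The numeric user ID is the part of UserId inside the parentheses.
user_number = int(re.search(r"\((\d+)\)", rec["UserId"]).group(1))
```

Values are kept as raw strings because some fields may hold non-numeric values, for example a TimeLimit of UNLIMITED or a NodeList of (null).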

Conversion Notes

The converted log is available as LLNL-Thunder-2007-1.swf. The conversion from the original format to SWF was done subject to the following.

The log does not indicate each job's submittal time. Therefore the jobs' submit times were set to their start times, and the wait times were set to -1.
The number of processors used is taken from the NodeList field, by parsing the list. The number requested is taken from the NodeCnt field.
Requested time is the wallclock limit, not a precise estimate.
The status field mapping used was

COMPLETED	1
FAILED	0
TIMEOUT	0
NODE_FAIL	0
CANCELLED	5

1208 jobs had start and end times specified as 16:00 on 31 December (with no indication of a year). All of them had NodeList=(null), but also JobState=COMPLETED. This is most probably the result of some problem with the logging. The start and end times of these jobs were converted to -1 (and, by implication, the submit times too). Nevertheless, these jobs are left in the same place in the job sequence in which they originally appeared.
The conversion loses the following data, which cannot be represented in SWF:
- Distinction between failure modes (timeout and node failure).
- The precise list of nodes used by each job, as given in the NodeList field.
- The actual command that was executed, as given in the Name field.
- The fact that a requested runtime was explicitly specified as UNLIMITED.
The following anomalies were identified in the conversion:
- 155 jobs got more processors than they requested.
- 1101 jobs got more runtime than they requested, but the difference was never larger than a minute.
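
Two of the conversion steps above, the status mapping and counting nodes from the NodeList field, can be sketched in Python as follows. This is not the archive's converter. The NodeList syntax is assumed to be an optional name prefix followed by a bracketed, comma-separated list of single numbers and ranges (for example thunder[5-8,12]); the node names and the function name count_nodes are hypothetical. Whether the actual conversion scales node counts to processor counts is not stated here, so the sketch counts nodes only.

```python
import re

# Status mapping as given in the conversion notes.
STATUS_TO_SWF = {"COMPLETED": 1, "FAILED": 0, "TIMEOUT": 0,
                 "NODE_FAIL": 0, "CANCELLED": 5}

def count_nodes(node_list):
    """Count the nodes named in a NodeList string.

    Assumed syntax: 'thunder[5-8,12]' or 'thunder7', possibly several such
    parts separated by commas. '(null)' and '' count as zero nodes.
    """
    if node_list in ("", "(null)"):
        return 0
    total = 0
    # Split only on commas that lie outside square brackets.
    for part in re.split(r",(?![^\[]*\])", node_list):
        m = re.search(r"\[(.*)\]", part)
        if not m:
            total += 1  # a single node such as 'thunder7'
            continue
        for item in m.group(1).split(","):
            lo, _, hi = item.partition("-")
            total += int(hi) - int(lo) + 1 if hi else 1
    return total

# Hypothetical job: compare nodes actually used with the NodeCnt request.
used = count_nodes("thunder[5-8,12]")
requested = 4
got_more_than_requested = used > requested
swf_status = STATUS_TO_SWF["TIMEOUT"]
```

A comparison like got_more_than_requested is one way the first anomaly listed above could be detected, under the stated assumptions about the NodeList syntax.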

The conversion was done by a log-specific parser in conjunction with a more general converter module (version 3).

Usage Notes

The original log contains several flurries of very high activity by individual users, which may not be representative of normal usage. These were removed in the cleaned version. It is recommended that the cleaned version be used.
The cleaned log is available as LLNL-Thunder-2007-1.1-cln.swf.

A flurry is a burst of very high activity by a single user. The filters used to remove the three flurries that were identified are

user=160 and job>19279 and job<19453 (173 jobs)
user=79 and job>47409 and job<58080 (6539 jobs)
user=40 and job>109910 and job<110858 (911 jobs)

Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.
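
For readers who want to apply the same filters themselves, here is a minimal Python sketch that drops the three flurries from the lines of an SWF file. It relies on the SWF layout, in which header comment lines start with ';', field 1 is the job number, and field 12 is the user ID. The function names and the sample lines are hypothetical, and the sketch is not the tool used to produce the cleaned log.

```python
# (user, low, high): remove jobs of this user with low < job number < high.
FLURRIES = [(160, 19279, 19453), (79, 47409, 58080), (40, 109910, 110858)]

def is_flurry_job(job, user):
    """True if the job falls inside one of the three flurry filters."""
    return any(user == u and lo < job < hi for u, lo, hi in FLURRIES)

def remove_flurries(lines):
    """Yield the SWF lines that are not part of a flurry.

    Comment lines (starting with ';') and blank lines pass through unchanged.
    Remaining jobs keep their original numbers, so numbering has gaps.
    """
    for line in lines:
        fields = line.split()
        if not fields or line.lstrip().startswith(";"):
            yield line
            continue
        job, user = int(fields[0]), int(fields[11])
        if not is_flurry_job(job, user):
            yield line

def make_line(job, user):
    """Build a hypothetical 18-field SWF job line for the demonstration."""
    fields = [job, 0, -1, 10, 4, -1, -1, 4, 60, -1, 1, user] + [-1] * 6
    return " ".join(str(f) for f in fields)

# Invented sample lines: one header comment and four jobs.
sample = ["; Version: 2.2", make_line(19279, 160), make_line(19300, 160),
          make_line(19300, 79), make_line(50000, 79)]
kept = list(remove_flurries(sample))
```

To filter a real file, the same generator can be fed the lines of a decompressed copy of LLNL-Thunder-2007-1.swf.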

Further information on flurries and the justification for removing them can be found in:

D. G. Feitelson and D. Tsafrir, "Workload sanitation for performance evaluation". In IEEE Intl. Symp. Performance Analysis of Systems and Software, pp. 221-230, Mar 2006.
D. Tsafrir and D. G. Feitelson, "Instability in parallel job scheduling simulation: the role of workload flurries". In 20th Intl. Parallel and Distributed Processing Symp., Apr 2006.

The Log in Graphics

File LLNL-Thunder-2007-1.swf

Graphs: weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization

File LLNL-Thunder-2007-1.1-cln.swf

Graphs: weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization
