Parallel Workloads Archive: LANL CM-5

The Los Alamos National Lab (LANL) CM-5 log

System: 1024-node Connection Machine CM-5 from Thinking Machines
Duration: October 1994 through September 1996
Jobs: 201,387

This log contains two years' worth of accounting records produced by the DJM software running on the 1024-node CM-5 at Los Alamos National Lab (LANL). For more information about LANL, see http://www.lanl.gov/.

The log contains detailed information about resource requests and use, including memory. It also contains data on the user, executable, project, and submit, start, and end times. Jobs on the CM-5 run on power-of-two numbers of nodes according to a fixed partitioning. Gang scheduling is used, especially on smaller partitions, but jobs can also run in dedicated mode. Because of gang scheduling, runtime information may be inaccurate; see the usage notes.

The log is available in two formats. One is the original daily log files created by DJM (the job management software on the CM-5), which also include details on the operation of DJM itself and various special cases (such as re-running an application after a failure, or forcing it to run immediately). The other is a condensed form with one line per job, with only the conventional timing and resource usage information.

The workload log from the LANL CM-5 was graciously provided by Curt Canada, who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.

Downloads:

LANL-CM5-1994-0a.tar 14 MB gz original daily logs
LANL-CM5-1994-0b 3.6 MB gz original condensed log
LANL-CM5-1994-4.swf 2.8 MB gz converted log
LANL-CM5-1994-4.1-cln.swf 2.1 MB gz cleaned log -- RECOMMENDED, see usage notes
LANL-CM5-1994-2.swf 2.8 MB gz OLD VERSION of converted log (replaced 1 Aug 2006)
LANL-CM5-1994-2.2-cln.swf 2.1 MB gz OLD VERSION of cleaned log (replaced 1 Aug 2006)
LANL-CM5-1994-3.swf 2.8 MB gz OLD VERSION of converted log (replaced 15 Nov 2011)
LANL-CM5-1994-3.1-cln.swf 2.1 MB gz OLD VERSION of cleaned log (replaced 15 Nov 2011)

Papers Using this Log:

This log was used in the following papers: [feitelson97b] [downey99] [talby99b] [batat00] [kavas01] [mualem01] [feitelson01] [srinivasan02] [xiao02] [ernemann03] [lublin03] [song04] [streit04] [england04] [feitelson04b] [feitelson05c] [feitelson05d] [zilber05] [feitelson06a] [tsafrir06a] [franke06] [ranjan06] [feitelson07a] [talby07] [ranjan08] [iosup08] [goh08] [minh09] [thebe09] [zeng09] [sodan09] [sodan10] [sodan11] [lindsay12] [liux12] [kumar12] [ababneh12] [zakay13] [rajbhandary13] [kumar14] [zakay14] [zakay14b] [feitelson14] [liu15]

System Environment

This is a 1024-node Connection Machine CM-5 system. Scheduling was performed by the DJM software. Processors are allocated only in powers of 2, with the minimal partition size being 32 processors.
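
The allocation rule above can be turned into a simple consistency check on processor counts found in the log. The following is a minimal sketch, not part of DJM or of the archive's tools; the function names are hypothetical, and the only facts taken from this page are the 1024-node machine size and the 32-processor minimum.

```python
MIN_PARTITION = 32    # minimal partition size stated above
MACHINE_NODES = 1024  # machine size stated above


def valid_partition_sizes(minimum=MIN_PARTITION, maximum=MACHINE_NODES):
    """Return the power-of-two partition sizes from minimum up to maximum."""
    sizes = []
    size = minimum
    while size <= maximum:
        sizes.append(size)
        size *= 2
    return sizes


def is_valid_partition_size(n):
    """True if n is a size the allocation rule above would permit."""
    return n in valid_partition_sizes()


print(valid_partition_sizes())
```

A job record whose processor count fails this check would be worth inspecting by hand before it is used in an analysis.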

Raw Log Format

The raw log files are available as LANL-CM5-1994-0a.tar. This archive opens to a directory with a separate log file per day.

These files contain multi-line entries for each event that took place. Examples of events are job submittal, job start, job termination, etc. The format is largely self-explanatory.

These files were parsed by a dedicated Perl script to produce the condensed format.

Condensed Log Format

The condensed log is available as LANL-CM5-1994-0b. The data contains one line per job with the following white-space separated fields: If a field value is missing, it appears as "unknown". For example, this happens for many fields in foreign jobs.

Conversion Notes

The converted log is available as LANL-CM5-1994-4.swf. The conversion from the original format to SWF was done subject to the following. The conversion was done by a log-specific parser in conjunction with a more general converter module.

The differences between conversion 4 (reflected in LANL-CM5-1994-4.swf) and conversion 3 (LANL-CM5-1994-3.swf) are

The differences between conversion 3 (reflected in LANL-CM5-1994-3.swf) and conversion 2 (LANL-CM5-1994-2.swf) are

Usage Notes

The Connection Machine CM-5 was one of the few commercial parallel supercomputers to support gang scheduling, meaning that it could context switch from one parallel job to another. As a result, the runtimes in the log may be inaccurate, because jobs did not necessarily run for the full duration from their initiation to their termination. Indeed, if one calculates the apparent maximal utilization of the machine, it is found to be between 100% and 200% on most days. Using the CPU time may therefore be more accurate. However, this is not available for all jobs.
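
One way to see this effect is to sum the node-seconds implied by the logged runtimes and divide by the machine's capacity per day. The sketch below is an illustration under stated assumptions, not the method used to produce the figures quoted above. It assumes the Standard Workload Format field order (field 2 submit time, field 3 wait time, field 4 run time, field 5 allocated processors, with -1 marking missing values and ";" starting comment lines), and it counts days from the start of the log rather than from midnight. The function names are hypothetical.

```python
from collections import defaultdict

NODES = 1024  # machine size stated in this document
DAY = 86400   # seconds per day


def swf_jobs(lines):
    """Yield (start_time, run_time, processors) for each usable SWF job line.

    Assumes the SWF field order: 2 = submit, 3 = wait, 4 = run time,
    5 = allocated processors. Jobs with missing (-1) values are skipped.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(";"):
            continue  # blank line or header comment
        fields = stripped.split()
        submit, wait, run, procs = (float(x) for x in fields[1:5])
        if min(submit, wait, run, procs) < 0:
            continue
        yield submit + wait, run, procs


def apparent_daily_utilization(jobs, nodes=NODES):
    """Return {day_index: utilization} from logged start times and runtimes.

    Each job's node-seconds are split across the days that its
    [start, start + run_time) interval covers.
    """
    node_seconds = defaultdict(float)
    for start, run_time, procs in jobs:
        end = start + run_time
        t = start
        while t < end:
            day = int(t // DAY)
            chunk_end = min(end, (day + 1) * DAY)
            node_seconds[day] += (chunk_end - t) * procs
            t = chunk_end
    return {day: ns / (nodes * DAY) for day, ns in node_seconds.items()}
```

For example, `apparent_daily_utilization(swf_jobs(open("LANL-CM5-1994-4.swf")))` would give one value per day, assuming the file has been decompressed first. Values above 1.0 indicate days on which gang-scheduled jobs overlapped on the same nodes, so their logged runtimes overstate the resources each job actually received.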

The original log contains several flurries of very high activity by individual users, which may not be representative of normal usage. These were removed in the cleaned version, and it is recommended that this version be used. The cleaned log is available as LANL-CM5-1994-4.1-cln.swf.

A flurry is a burst of very high activity by a single user. The filters used to remove the three flurries that were identified are

user=50 and job>24438 and job<64543 (33452 jobs)
user=31 and job>64586 and job<115041 (34307 jobs)
user=38 and job>178584 and job<192711 (11568 jobs)
In total, 79327 jobs were removed. Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered log job numbering is not consecutive.
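
The three filters listed above can be written as a small predicate and applied line by line. This is a minimal sketch rather than the archive's own cleaning tool. It assumes that the job and user numbers in the filters correspond to fields 1 and 12 of the converted SWF log, and that lines starting with ";" are header comments, as in the Standard Workload Format. The function names are hypothetical.

```python
# Flurry filters as listed above: (user, job_low, job_high).
# Both job bounds are exclusive, matching "job>low and job<high".
FLURRIES = [
    (50, 24438, 64543),
    (31, 64586, 115041),
    (38, 178584, 192711),
]


def is_flurry_job(job, user):
    """True if the job is matched by one of the flurry filters."""
    return any(user == u and low < job < high for u, low, high in FLURRIES)


def clean_swf_lines(lines):
    """Yield lines unchanged, skipping jobs matched by a flurry filter."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(";"):
            yield line  # keep blank lines and header comments
            continue
        fields = stripped.split()
        job = int(fields[0])    # assumed SWF field 1: job number
        user = int(fields[11])  # assumed SWF field 12: user ID
        if not is_flurry_job(job, user):
            yield line
```

Because surviving lines are passed through untouched, the output keeps the original job numbers, which is why numbering in a filtered log is not consecutive.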

Further information on flurries and the justification for removing them can be found in:

The Log in Graphics

File LANL-CM5-1994-4.swf

weekly cycle
daily cycle
burstiness and active users
job size and runtime histograms
job size vs. runtime scatterplot
utilization
offered load

File LANL-CM5-1994-4.1-cln.swf

weekly cycle
daily cycle
burstiness and active users
job size and runtime histograms
job size vs. runtime scatterplot
utilization
offered load
