Parallel Workloads Archive: DAS2

The DAS2 5-Cluster Grid Logs

Introduction

System: Research grid composed of five Pentium/Linux clusters (one cluster with 144 CPUs and the rest with 64)
Duration: January 2003 through December 2003
Jobs: From 33,795 to 225,711 per cluster

These one-year logs, presented and thoroughly analyzed in [li04], are fundamentally different from the rest of the logs in the archive, in that they record workloads produced by parallel and distributed computing research communities (located at five different universities in the Netherlands), rather than the workload of "regular" production machine users. This fact is reflected in the very low utilization exhibited in the logs. DAS2 stands for "Distributed ASCI Supercomputer-2". ASCI stands for "Advanced School for Computing and Imaging in the Netherlands".

DAS2 is essentially a grid composed of five clusters, and therefore co-allocation (running a single job on two or more remote clusters) is possible and actually used. Unfortunately, there is no record of co-allocation, that is, of whether two (or more) seemingly distinct jobs running in two (or more) clusters actually compose a single co-allocated job. For this reason, we have chosen not to merge the five traces into one (as this would be misleading). The original logs contain most of the data as specified in the SWF, and therefore the converted SWF version loses no data (see details below). Further information about DAS2 is available at http://www.cs.vu.nl/das2.

The origin of the five DAS2 traces is:
#  Cluster Name  Location                   CPUs  Jobs
1  fs0           Vrije Univ. Amsterdam      144   225,711
2  fs1           Leiden Univ.               64    40,315
3  fs2           Univ. of Amsterdam         64    66,429
4  fs3           Delft Univ. of Technology  64    66,737
5  fs4           Utrecht Univ.              64    33,795
[ Remark: the above job numbers are greater than those reported in [li04]. See details in conversion notes below. ]

The workload logs from DAS2 were graciously provided by the authors of [li04]:
Author       From                                                                     Email
Hui Li       Leiden Univ.                                                             hli AT liacs.nl
David Groep  National Institute for Nuclear High Energy Physics, The Netherlands      davidg AT nikhef.nl
Lex Wolters  Leiden Univ.                                                             llexx AT liacs.nl
who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.


Downloads:



Papers Using the Logs:

This log was used in the following papers: [li04] [feitelson06a] [tsafrir06a] [amar08a] [amar08b] [thebe09] [minh11] [yuan11] [lindsay12] [kumar12] [deng13] [shih13] [rajbhandary13] [cao14] [jackson14] [meng15]


System Environment


Conversion Notes

The content of the original gzipped tar file DAS2-2003-0.tgz
README A brief explanation of the content of tar file and the format of the original files.
orig/ Directory containing the original files as supplied by Hui Li. For cluster fsN there exist:
  • orig/fsN-2003/fsN.trace
    The actual trace files, including cancelled jobs that were actually started (but not including jobs that were cancelled before they were started). In contrast, the SWF files produced by the conversion contain all cancelled jobs. [ Sole exception to the naming convention: fs0.dque.trace ]
  • orig/fsN-2003/fsN.cancelled
    All cancelled jobs (those that were started and those that were not).
  • orig/fsN-2003/fsN.stats
    Some misleading meta-data about the trace file (contrary to what is claimed there, the *.trace files are NOT in SWF: even though some fields have the same names as in the SWF, some do not have the same meaning).
swf/ Location of the conversion result.
orig2swf.pl Conversion script generating swf/l_das2_fsN.{swf,err} (the *.err files describe bad/inconsistent data found in the original files).
mails-from-hui/ Mails with Hui's answers to various questions regarding the format and meaning of the original files.
swf2stats.pl Generates statistics for each SWF column into swf/l_das2_fsN.stat
SWF.pm Used by swf2stats.pl to parse SWF files.
Stats.pm Used by swf2stats.pl to compute SWF fields distribution.
Record fields structure of *.trace files
JobNumber simple serial, NOT the SWF job-id
JobID this is an OpenPBS job-id; also not SWF's; may be used as a cross-reference to the associated *.cancelled file
SubmitTime as in SWF
WaitTime as in SWF
RunTime as in SWF
NumCPUs as in SWF
UsedCPUTime as in SWF [however makes little sense considering RunTime (too small)]
UsedMem as in SWF
ReqNumCPUs as in SWF
ReqTime as in SWF
ReqMem as in SWF
Status OpenPBS exit status
  • 0 = job successfully completed
  • 1 = job was cancelled
  • 2 = job didn't complete successfully due to some error [ however, some records that appear in *.cancelled might sometimes have Status different than 1; see swf/*.err ]
UID as in SWF
GID as in SWF
AppNum as in SWF
QueNum as in SWF
Partition as in SWF
PreJobNum as in SWF
ThinkTime as in SWF
submitHour hour of submit-time
submitMday month-day of submit-time
submitMon month of submit-time (0=January)
submitYear year of submit-time
submitWday week-day of submit-time (0=Sunday)
submitYday year-day of submit-time
Record fields structure of *.cancelled files
Serial simple serial, NOT the SWF job-id
JobID this is an OpenPBS job-id; also not SWF's; may be used as a cross-reference to the associated *.trace file; JobIDs that are not found in the latter belong to jobs that were cancelled before they were started
AfterStart 0 = job cancelled BEFORE started executing
1 = job cancelled AFTER started executing
SubmitTime as in SWF
CancellationLag time between submission and cancellation; this should supposedly be equal to RunTime+WaitTime of the associated *.trace file (for jobs with AfterStart=1) but is sometimes significantly different (see appropriate *.err file).
submitHour hour of submit-time
submitMday month-day of submit-time
submitMon month of submit-time (0=January)
submitYear year of submit-time
submitWday week-day of submit-time (0=Sunday)
submitYday year-day of submit-time
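The two field lists above can be turned into a small record parser. The following is only a sketch: the field names and their order are taken from the lists, but the assumption that each record is a single line of whitespace-separated values is ours and should be checked against the README in the tar file. The function name parse_record is hypothetical.

```python
from typing import Dict, List

# Field order as listed above. The whitespace-separated, one-record-per-line
# layout is an ASSUMPTION, not something stated in the description.
TRACE_FIELDS: List[str] = [
    "JobNumber", "JobID", "SubmitTime", "WaitTime", "RunTime", "NumCPUs",
    "UsedCPUTime", "UsedMem", "ReqNumCPUs", "ReqTime", "ReqMem", "Status",
    "UID", "GID", "AppNum", "QueNum", "Partition", "PreJobNum", "ThinkTime",
    "submitHour", "submitMday", "submitMon", "submitYear", "submitWday",
    "submitYday",
]

CANCELLED_FIELDS: List[str] = [
    "Serial", "JobID", "AfterStart", "SubmitTime", "CancellationLag",
    "submitHour", "submitMday", "submitMon", "submitYear", "submitWday",
    "submitYday",
]


def parse_record(line: str, fields: List[str]) -> Dict[str, str]:
    """Split one record into a field-name -> raw-string mapping.

    Values are kept as strings because the OpenPBS JobID is not
    necessarily a plain integer.
    """
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))
```

Keeping the raw strings and converting only the fields that are needed avoids guessing at the exact format of each column.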
The issue of cancelled jobs

The *.trace files contain only some of the cancelled jobs: those that were cancelled by users after they were started. The *.cancelled files contain these jobs as well as jobs that were cancelled by users before they were started.

Consequently, the JobID field (of OpenPBS), as described above, may be used as a cross reference value between a fsN.trace file and the associated fsN.cancelled file.

The number of jobs reported in [li04] is smaller than the number of jobs found in the *.swf files, since the data of fsN.trace and fsN.cancelled has been merged. Unfortunately, the only information available about jobs that were cancelled before being started is their submission time and the "lag" time (which is assigned to the "wait" SWF field); all other fields of such jobs are set to -1.

There exist inconsistencies between fsN.trace and the associated fsN.cancelled; for example, jobs with OpenPBS Status=0 (successful completion) in the former sometimes also appear in the latter. We have chosen to use fsN.cancelled as the definitive criterion for determining whether a job was cancelled, and therefore any job that appears in fsN.cancelled has an SWF status of 5 (cancelled), even if the data in fsN.trace indicates otherwise.
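The merge rule described in this section can be sketched as follows. This is not the code of orig2swf.pl, only an illustration under stated assumptions: records are already parsed into dictionaries keyed by the field names listed earlier, the function name merge_jobs is hypothetical, and the mapping of OpenPBS statuses 0 and 2 to SWF statuses 1 (completed) and 0 (failed) follows the SWF definition rather than anything stated here. Sorting the result by submission time is likewise our assumption.

```python
from typing import Any, Dict, Iterable, List

SWF_CANCELLED = 5  # stated above: any job found in fsN.cancelled gets status 5

# ASSUMED mapping for jobs that do not appear in fsN.cancelled:
# OpenPBS 0 (success) -> SWF 1, OpenPBS 2 (error) -> SWF 0,
# OpenPBS 1 (cancelled) -> SWF 5.
PBS_TO_SWF = {0: 1, 1: SWF_CANCELLED, 2: 0}


def merge_jobs(trace: Iterable[Dict[str, Any]],
               cancelled: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge parsed *.trace and *.cancelled records, keyed on JobID."""
    cancelled_by_id = {c["JobID"]: c for c in cancelled}
    seen_in_trace = set()
    merged = []

    for t in trace:
        seen_in_trace.add(t["JobID"])
        if t["JobID"] in cancelled_by_id:
            # fsN.cancelled is the definitive criterion, even if Status=0.
            status = SWF_CANCELLED
        else:
            status = PBS_TO_SWF.get(t["Status"], -1)
        merged.append({"submit": t["SubmitTime"], "wait": t["WaitTime"],
                       "run": t["RunTime"], "status": status})

    # Jobs cancelled before they started exist only in fsN.cancelled:
    # the lag goes into the wait field, everything else is unknown (-1).
    for job_id, c in cancelled_by_id.items():
        if job_id not in seen_in_trace:
            merged.append({"submit": c["SubmitTime"],
                           "wait": c["CancellationLag"],
                           "run": -1, "status": SWF_CANCELLED})

    merged.sort(key=lambda job: job["submit"])
    return merged
```

Only four output fields are shown to keep the sketch short; a full conversion would carry over the remaining SWF fields from the trace record in the same way.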

Errors in the original files
(as reported in the swf/*.err files).
1 There are 1,548 jobs that appear in fsN.trace with OpenPBS Status=0 (successful completion) but also appear in fsN.cancelled (cancelled by users). Such jobs were assigned an SWF status=5 (cancelled).
2 If a job J appears in both fsN.trace and fsN.cancelled, then the sum of its wait time and runtime as they appear in fsN.trace should be equal to the cancellation lag (time between submission and cancellation) as it appears in fsN.cancelled. There are 62 jobs for which the difference between the two values is bigger than one minute (it goes up to a number of days!). These differences were ignored and the data used is from fsN.trace (that is, the lag time of such jobs is not recorded in the SWF version).
3 There are 6 jobs with huge negative wait times. Their wait time was set to be -1 (unknown).
4 There is one job with runtime of more than 30 years (fs2). Its runtime was set to be -1 (unknown).
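The checks behind items 2 to 4 can be sketched as below. The one-minute tolerance comes from item 2; everything else is an assumption made for illustration: times are taken to be in seconds (as in SWF), any negative wait time is treated as unknown, the runtime cut-off is set at exactly 30 years, and the function names are hypothetical.

```python
from typing import Tuple

LAG_TOLERANCE = 60                    # one minute, as in item 2
MAX_RUNTIME = 30 * 365 * 24 * 3600    # ASSUMED cut-off: 30 years in seconds
UNKNOWN = -1                          # SWF convention for missing values


def lag_mismatch(wait: int, run: int, lag: int,
                 tolerance: int = LAG_TOLERANCE) -> bool:
    """True if wait+run (from *.trace) disagrees with the cancellation lag."""
    return abs((wait + run) - lag) > tolerance


def sanitize(wait: int, run: int) -> Tuple[int, int]:
    """Replace implausible wait or run times with -1 (unknown)."""
    if wait < 0:
        wait = UNKNOWN   # item 3: negative wait times
    if run > MAX_RUNTIME:
        run = UNKNOWN    # item 4: runtime of more than 30 years
    return wait, run
```

A mismatch reported by lag_mismatch would only be logged, since the conversion keeps the wait time and runtime from fsN.trace in that case.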


Usage Notes

All these logs seem to have large flurries, as indicated in the graphs below. However, given the very low level of activity exhibited in these logs, they cannot be considered representative of production use anyway, so the flurries have not been cleaned.
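Since the flurries have not been cleaned, users who need to locate them may start from per-user weekly job counts. The sketch below is only one possible heuristic, not a method used by the archive: it relies on the standard SWF layout (whitespace-separated fields, comment lines starting with ";", submit time in field 2 and user ID in field 12), and the threshold of 1000 jobs per user per week is an arbitrary assumption. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

WEEK = 7 * 24 * 3600  # seconds


def weekly_user_counts(swf_lines: Iterable[str]) -> Counter:
    """Count jobs per (week index, user ID) in an SWF file."""
    counts: Counter = Counter()
    for line in swf_lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and SWF header comments
        fields = line.split()
        submit = int(fields[1])   # SWF field 2: submit time in seconds
        user = int(fields[11])    # SWF field 12: user ID
        counts[(submit // WEEK, user)] += 1
    return counts


def flurry_candidates(counts: Counter,
                      threshold: int = 1000) -> List[Tuple[int, int]]:
    """Return (week, user) pairs with at least `threshold` jobs (ASSUMED cut-off)."""
    return sorted(key for key, n in counts.items() if n >= threshold)
```

A typical use would be to pass an open file object for one of the SWF files to weekly_user_counts and then inspect the pairs returned by flurry_candidates by hand.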

The Log in Graphics

File DAS2-fs0-2003-1.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot]

File DAS2-fs1-2003-1.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot]

File DAS2-fs2-2003-1.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot]

File DAS2-fs3-2003-1.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot]

File DAS2-fs4-2003-1.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot]

