The Swedish Royal Institute of Technology (KTH) IBM SP2 log
System: | 100-node IBM SP2 |
Duration: | October 1996 thru August 1997 |
Jobs: | 28,490 |
This log contains eleven months' worth of accounting records from the
100-node IBM SP2 at the Swedish Royal Institute of Technology (KTH) in
Stockholm.
For more information about this installation, see URL
http://www.pdc.kth.se
Note that the first couple of weeks of the log
exhibit a somewhat reduced utilization.
This could indicate that the system's configuration was different
during this period.
However, the effect is modest and its duration relatively short.
The cleaned version of the log disposes of much of the problem.
The workload log from the KTH SP2 was graciously provided
by Lars Malinowsky (lama@pdc.kth.se),
who also helped with background information and interpretation.
If you use this log in your work, please use a similar acknowledgment.
Downloads:
There is no cleaned version of this log as no serious anomalies have
been found so far.
System Environment
The 100 nodes in the batch pool are divided into different types:
number | code | type |
88 | T | thin2 |
10 | W | wide |
2 | Z | wide with more memory |
64 | U | another remote machine (experimental) |
Over the period of time covered the actual number of nodes available
has fluctuated due to PEs being set aside for
reserved/interactive/course use, as PIOFS-servers, upgrades, service,
etc.
The system imposes limits on job run times, and these were changed a
couple of times during the period that the log was recorded.
The limits in effect were as follows.
Prior to June 6, 1997
Weekdays between 0700 and 1600: limit of 4h.
Weekday nights: limit of 15h.
Weekends: limit of 60h.
However, no job could run across 0700 or 1600 on weekdays
(synchronization points).
June 6 to July 15, 1997
Same as the above, but the restriction due to the synchronization
points was removed.
For example, a job could start at 0600 on a weekday, and it would have
to terminate by 1100, so as not to violate the 4-hour rule that came
into effect at 0700.
Starting July 15, 1997
Same as the above, but the 4-hour restriction during weekdays only
applies to 64T and 2W nodes.
The rest of the nodes allow a 15-hour limit even during weekdays.
In addition, a `fallback' mechanism was activated;
for example, a job requesting a T node might be given a node of any of
the types T, W, or Z.
Log Format
The original log is available as KTH-SP2-1996-0.
This file contains one line per completed job with the
following white-space separated fields:
- usr: username.
- cac: accounting group (enforced only towards the end of the log period).
- jid: job ID with embedded submit date and time
- req: requested nodes, possibly designating desired types, e.g. 72T8W.
- tstart: date and time when all nodes were available.
- tstop: date and time when the last node was returned
(jobs may deallocate individual nodes).
- npe: total number of CPUs (should match the sum of different types
from req).
- treq: wall time requested (used by EASY for backfilling).
- uwall: used wall time (first node deallocation minus last node allocation).
- reqcpu: requested CPU-time (npe x treq).
- ucpu: used CPU-time (the sum over all PEs of the time each was allocated).
- twait: difference between job start and when job entered the FIFO queue.
- status: jobs that were released automatically by the system,
e.g. because they exceeded their requested time, are marked by
"autorel".
Jobs that terminated normally do not have anything in this field.
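The req field can be split into per-type node counts and summed for
comparison with npe. The following is a minimal sketch, not the archive's
parser: the only syntax confirmed by this description is the example
72T8W, so the handling of a bare count with no type letter (stored under
an empty code) and the function names parse_req and requested_total are
assumptions made for illustration.

```python
import re

# Assumed request syntax: one or more <count><type> pairs, e.g. "72T8W".
# A bare count such as "8" (no type letter) is stored under the key "".
_REQ_PART = re.compile(r"(\d+)([A-Za-z]?)")


def parse_req(req):
    """Return a dict mapping node-type code to the requested count."""
    counts = {}
    pos = 0
    while pos < len(req):
        m = _REQ_PART.match(req, pos)
        if m is None:
            raise ValueError(f"unrecognized request: {req!r}")
        code = m.group(2).upper()
        counts[code] = counts.get(code, 0) + int(m.group(1))
        pos = m.end()
    return counts


def requested_total(req):
    """Total number of requested nodes, to compare against npe."""
    return sum(parse_req(req).values())
```

For the example above, parse_req("72T8W") gives 72 nodes of type T and 8
of type W, for a total of 80.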
Elapsed and aggregate times are reported in a unique format, with the
hours and minutes separated by the letter `h'.
For example, 4h is 4 hours, 0h02 is 2 minutes, and 84h25 is 84 hours
and 25 minutes (about 3.5 days).
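Converting this format to a plain number of minutes is straightforward.
The sketch below assumes that the minutes part, when present, is always
written with two digits, as in the examples above; the function name
parse_duration_minutes is hypothetical.

```python
import re

# Hours, the letter 'h', then an optional two-digit minutes part.
# The two-digit minutes width is an assumption based on "0h02" and "84h25".
_DURATION = re.compile(r"(\d+)h(\d{2})?")


def parse_duration_minutes(text):
    """Convert a value such as '4h', '0h02' or '84h25' to minutes."""
    m = _DURATION.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(m.group(1))
    minutes = int(m.group(2)) if m.group(2) else 0
    return hours * 60 + minutes
```

With this, 4h yields 240 minutes, 0h02 yields 2, and 84h25 yields 5065.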
Note that uwall is not the same as the run time usually reported in
other logs.
A better match to common practice is to calculate tstop - tstart, the
time from when all nodes became available and the job started running
until the last node was returned.
Finally, the system administrators report that
they sometimes pushed jobs through the FIFO by
giving them artificially low `enter-fifo' times.
For such jobs the value of the twait field is bogus.
Conversion Notes
The converted log is available as KTH-SP2-1996-2.swf.
The conversion from the original format to SWF was done subject to the following.
- This log does not include the job submittal time.
However, this can be calculated as tstart - twait, and this was indeed
done in the conversion.
- The above trick does not work for jobs that got artificial `enter
FIFO' times to increase their priority.
Luckily, a version of the submit time is also encoded in the job-id.
Typically, a job cannot have waited for longer than what its job-id
indicates.
So the submit time is actually calculated as
submit = max{ tstart - twait, jobID.submit }
This correction was applied 46 times.
- The option to request U nodes (from another machine) was only used in
15 jobs, of which 11 requested one such node.
In 3 cases the total number of nodes was more than 100.
In any case, U nodes were deleted from
the job's size, leaving only the nodes used on this machine.
- The conversion loses the following data, which cannot be represented in
SWF:
  - Node type requested
  - "Overlap time" during which all the job's nodes were allocated
- The following anomalies were identified in the conversion:
  - 219 jobs got more processors than they requested.
  - 475 jobs got more runtime than they requested.
    In 64 cases the extra runtime was larger than 1 minute.
  - One job (job 27313) was recorded as having requested and used 0 processors,
    but terminated successfully.
    This is a job that used only one U node.
    It was removed from the log.
The conversion was done by a log-specific parser in conjunction with a
more general converter module.
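The submit-time rule given in the conversion notes above,
submit = max{ tstart - twait, jobID.submit }, can be sketched as follows.
This is an illustration only and not the actual parser: how the submit
time is encoded in the job ID is not specified here, so the hypothetical
function submit_time takes values that have already been decoded.

```python
from datetime import datetime, timedelta


def submit_time(tstart, twait, jobid_submit):
    """Apply submit = max(tstart - twait, jobID.submit).

    tstart and jobid_submit are datetime objects and twait is a
    timedelta. An artificially early enter-FIFO time inflates twait, in
    which case the submit time decoded from the job ID is used instead.
    """
    return max(tstart - twait, jobid_submit)


# Made-up values: the job ID says the job was submitted at 11:00,
# but twait claims a two-hour wait before a 12:00 start.
start = datetime(1997, 1, 10, 12, 0)
corrected = submit_time(start, timedelta(hours=2), datetime(1997, 1, 10, 11, 0))
```

In this made-up case tstart - twait would be 10:00, earlier than the
11:00 recorded in the job ID, so the job ID value is taken.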
The differences between conversion 2 (reflected in KTH-SP2-1996-2.swf)
and conversion 1 (KTH-SP2-1996-1.swf) are:
- In conversion 1, the U nodes were left in,
so in 3 jobs the size was bigger than the machine size.
- In conversion 1, all timestamps were off
by one or two hours, partly due to mishandling daylight saving time.
This was corrected in conversion 2.
- In conversion 1, jobs that requested nodes
without specifying any node type were erroneously recorded as having
requested 0 nodes (however, the number of used nodes was correct).
This was corrected in conversion 2.
Usage Notes
The log has a cleaned version available as KTH-SP2-1996-2.1-cln.swf.
It is recommended that this version be used.
The cleaning consisted of removing the first 14 jobs, as they seem to
represent activity from long before the actual logging started.
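The cleaning step can be approximated on the converted file as shown
below. This is a sketch under assumptions, not the procedure actually
used to produce the cleaned file: it treats lines beginning with ';' as
SWF header comments, treats every other non-empty line as one job record,
and drops the first 14 records in file order. The function name
drop_first_jobs is hypothetical, and the real cleaned file may differ in
other details such as its header.

```python
def drop_first_jobs(lines, n=14):
    """Yield SWF lines with the first n job records removed.

    Header comment lines (starting with ';') and blank lines are passed
    through unchanged; only job records count towards n.
    """
    skipped = 0
    for line in lines:
        stripped = line.strip()
        is_job = bool(stripped) and not stripped.startswith(";")
        if is_job and skipped < n:
            skipped += 1
            continue
        yield line
```

Applied to an open file, for example with
open("KTH-SP2-1996-2.swf") as f, the generator can be written straight to
a new file with writelines.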
The Log in Graphics
File KTH-SP2-1996-2.swf