The MetaCentrum log
System: MetaCentrum Czech National Grid
Duration: Jan 2009 to May 2009
Jobs: 103,656
This log contains several months' worth of accounting records from the
national grid of the Czech Republic, called MetaCentrum.
This grid is composed of 14 clusters (called nodes), each with several
multiprocessor machines, for a total of 806 processors.
For more information about the system, see URL
http://www.metacentrum.cz/en/.
The MetaCentrum workload log was graciously provided by
Czech National Grid Infrastructure MetaCentrum.
If you use this log in your work, please use a similar
acknowledgment.
It was made available via the web page of Dalibor Klusacek.
Data about failures and maintenance is also available.
System Environment
MetaCentrum is composed of 14 Linux clusters, with different
configurations, as follows:
Cluster | Processor | Nodes | Total CPUs |
0 | Itanium2 1.5GHz | 8 | 8 |
1 | Opteron 2.2GHz | 16 | 16 |
2 | Xeon 3.2GHz | 10 | 10 |
3 | Opteron 2.6GHz | 5 | 80 |
4 | AthlonMP 1.6GHz | 16 | 32 |
5 | Xeon 2.4GHz | 32 | 64 |
6 | Xeon 2.7GHz | 36 | 148 |
7 | Xeon 3.1GHz | 35 | 70 |
8 | Opteron 1.6GHz | 10 | 20 |
9 | Opteron 2.4GHz | 3 | 6 |
10 | Opteron 2.0GHz | 23 | 92 |
11 | Xeon 3.0GHz | 19 | 152 |
12 | Xeon 2.7GHz | 8 | 64 |
13 | Xeon 2.3GHz | 11 | 44 |
Jobs could run on processors from more than one cluster.
While relatively rare, this did happen for 586 jobs in the log.
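As a quick consistency check, the per-cluster figures in the table can be summed and compared with the stated total of 806 processors. The sketch below simply transcribes the table; the name CLUSTERS and the tuple layout are illustrative choices, not part of the log.

```python
# (nodes, total CPUs) per cluster, transcribed from the table above.
CLUSTERS = {
    0: (8, 8), 1: (16, 16), 2: (10, 10), 3: (5, 80), 4: (16, 32),
    5: (32, 64), 6: (36, 148), 7: (35, 70), 8: (10, 20), 9: (3, 6),
    10: (23, 92), 11: (19, 152), 12: (8, 64), 13: (11, 44),
}

# Sum the "Total CPUs" column over all clusters.
total_cpus = sum(cpus for _nodes, cpus in CLUSTERS.values())

# Share of jobs that spanned more than one cluster (586 of 103,656).
multi_cluster_share = 586 / 103656

print(f"clusters: {len(CLUSTERS)}, total CPUs: {total_cpus}")
print(f"multi-cluster jobs: {multi_cluster_share:.2%}")
```

The column adds up to 806, matching the total given in the introduction, and the multi-cluster jobs account for well under 1% of the log.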
Scheduling is done with PBSpro, employing a system of 11 queues as
follows:
Queue | Priority | Time limit (hr) |
q1 | 62 | 720 |
q2 | 70 | 720 |
q3 | 50 | 24 |
q4 | 60 | 2 |
q5 | 80 | 24 |
q6 | 65 | 720 |
q7 | 70 | 720 |
q8 | 70 | 4 |
q9 | 70 | 720 |
q10 | 99 | 720 |
q11 | 65 | 720 |
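For simulations it can be convenient to have the queue table in machine-readable form. The following is a minimal sketch that transcribes the table and converts the time limits to seconds. The names QUEUES, time_limit_seconds, and exceeds_limit are hypothetical, and the assumption that the limit applies to a job's wall-clock duration is ours, not something stated in the log description.

```python
# (priority, time limit in hours) per queue, transcribed from the table above.
QUEUES = {
    "q1": (62, 720), "q2": (70, 720), "q3": (50, 24), "q4": (60, 2),
    "q5": (80, 24), "q6": (65, 720), "q7": (70, 720), "q8": (70, 4),
    "q9": (70, 720), "q10": (99, 720), "q11": (65, 720),
}


def time_limit_seconds(queue: str) -> int:
    """Return the queue's time limit converted from hours to seconds."""
    _priority, limit_hours = QUEUES[queue]
    return limit_hours * 3600


def exceeds_limit(queue: str, duration_s: int) -> bool:
    """Assumption: the limit is compared against wall-clock duration."""
    return duration_s > time_limit_seconds(queue)


# The longest limit is 720 hours, which is 30 days.
longest_limit_days = max(hours for _p, hours in QUEUES.values()) / 24
```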
Importantly, data about failures and other special circumstances is
provided together with the log.
This is considered important for reliable evaluations, and in fact is
the main point of the paper that introduced this log:
D. Klusacek and H. Rudova,
"The Importance of Complete Data Sets for Job Scheduling Simulations".
In Job Scheduling Strategies for Parallel Processing,
Springer Verlag LNCS vol. 6253, pp. 132-153, 2010.
Log Format
The original log is available as METACENTRUM-2009-0.
This file contains one line per completed job, with the following
tab-separated fields:
- Job ID
- User
- Queue
- Number of processors used
- Number of grid clusters used (originally called nodes)
- Properties required by the application (given as a list of
property numbers)
- Memory used (KB)
- Arrival time (UTC timestamp)
- Start time (UTC timestamp)
- End time (UTC timestamp)
- Duration (seconds)
- Exit status
- List of assigned processors (space separated)
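The field list above maps naturally onto a small record parser. The sketch below is illustrative only: the class and attribute names are hypothetical, and it assumes that the numeric fields are plain integers and that timestamps are integer seconds. The properties field is kept as a raw string because its exact list syntax is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OriginalJob:
    job_id: str
    user: str
    queue: str
    num_procs: int
    num_clusters: int
    properties: str        # kept raw; list syntax is not documented here
    memory_kb: int
    arrival: int           # assumed to be integer seconds
    start: int
    end: int
    duration: int          # seconds
    exit_status: int
    processors: List[str]  # space-separated in the file


def parse_line(line: str) -> OriginalJob:
    """Parse one tab-separated line of the original log format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, got {len(fields)}")
    return OriginalJob(
        job_id=fields[0],
        user=fields[1],
        queue=fields[2],
        num_procs=int(fields[3]),
        num_clusters=int(fields[4]),
        properties=fields[5],
        memory_kb=int(fields[6]),
        arrival=int(fields[7]),
        start=int(fields[8]),
        end=int(fields[9]),
        duration=int(fields[10]),
        exit_status=int(fields[11]),
        processors=fields[12].split(),
    )
```

If the real file deviates from these assumptions (for example, non-integer values in a numeric field), the int() calls will raise ValueError, which makes such lines easy to spot.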
Conversion Notes
The converted log is available as METACENTRUM-2009-2.swf.
The conversion from the original format to SWF was done subject to the
following.
- The status 0 was taken to mean success, and was converted to 1.
All other status values were converted to 0.
- 1118 jobs were recorded as using 0 memory; this was changed to -1.
- The conversion loses the following data, which cannot be
represented in SWF:
  - The number of clusters used by each job, as given
  in the used clusters field.
  - The precise list of processors allocated to the job, and
  which clusters they belong to.
  - The properties required by the application.
  Note that the meaning of the properties is unknown; they are
  simply listed as p1, p2, p3, etc.
- The following anomalies were identified in the conversion:
  - Of the 1118 jobs recorded as using 0 memory, 8 had "success" status.
- All the jobs in the log passed the following two sanity checks:
the duration was equal to the difference between the start and
end times, and the length of the list of assigned processors was
equal to the number of assigned processors.
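The status and memory rules and the two sanity checks described above can be sketched as small functions. This is an illustration of the stated rules, not the actual converter; the function names are hypothetical, and unit handling for the memory field is not addressed.

```python
from typing import Sequence


def swf_status(exit_status: int) -> int:
    """Exit status 0 means success (SWF status 1); anything else becomes 0."""
    return 1 if exit_status == 0 else 0


def swf_memory(memory_kb: int) -> int:
    """A recorded memory of 0 is treated as unknown and written as -1."""
    return -1 if memory_kb == 0 else memory_kb


def passes_sanity_checks(start: int, end: int, duration: int,
                         num_procs: int, processors: Sequence[str]) -> bool:
    """Check the two consistency conditions stated in the notes above."""
    duration_ok = duration == end - start
    procs_ok = len(processors) == num_procs
    return duration_ok and procs_ok
```

For example, a job with start 100, end 160, duration 60, and two listed processors for two requested processors passes both checks.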
The difference between the first conversion (reflected in
METACENTRUM-2009-1.swf) and the second conversion (reflected in
METACENTRUM-2009-2.swf) is as follows:
- In the first conversion, cluster data was not recovered.
  In the second conversion it was extracted using the CPU IDs specified
  for each job.
  For jobs that use more than one cluster, only the first one is noted.
The conversion was done by a log-specific parser in conjunction with a
more general converter module.
Flurries seem to exist but have not been cleaned yet.
The log contains all the jobs that terminated in the logging period.
Some of these jobs are extremely long, as the maximal runtime allowed
on this system is 30 days.
Thus some of the logged jobs may have started up to 30 days before the
start of the logging period.
As a result the initial portion of the log is extremely sparse.
This effect also occurs (to a lesser degree) towards the end of the
log: extremely long jobs that were running in this period are not logged,
because they did not terminate by the end of the logging period.
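One possible way to limit the influence of these boundary effects is to restrict an analysis to jobs that arrived at least one maximal runtime away from both ends of the log. The sketch below illustrates this idea under our own assumptions; it is not a procedure prescribed by the archive, the function names are hypothetical, and the 30-day margin at both ends is a deliberately conservative choice.

```python
from typing import Tuple

# Maximal runtime allowed on the system: 30 days, in seconds.
MAX_RUNTIME_S = 30 * 24 * 3600


def steady_window(first_arrival: int, last_arrival: int,
                  margin: int = MAX_RUNTIME_S) -> Tuple[int, int]:
    """Return the arrival-time window left after trimming both ends."""
    lo, hi = first_arrival + margin, last_arrival - margin
    if lo >= hi:
        raise ValueError("log is too short for the requested margin")
    return lo, hi


def in_steady_window(arrival: int, window: Tuple[int, int]) -> bool:
    """True if the arrival time falls inside the trimmed window."""
    lo, hi = window
    return lo <= arrival < hi
```

Here first_arrival and last_arrival would be the earliest and latest arrival timestamps observed in the log. A smaller margin may be adequate at the end of the log, where the effect is weaker.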
The Log in Graphics
File METACENTRUM-2009-2.swf