Parallel Workloads Archive: MetaCentrum

The MetaCentrum log

System:	MetaCentrum Czech National Grid
Duration:	Jan 2009 to May 2009
Jobs:	103,656

This log contains several months' worth of accounting records from MetaCentrum, the national grid of the Czech Republic. The grid is composed of 14 clusters (called nodes in the original log), each with several multiprocessor machines, for a total of 806 processors.

For more information about the system, see URL http://www.metacentrum.cz/en/.

The MetaCentrum workload log was graciously provided by Czech National Grid Infrastructure MetaCentrum. If you use this log in your work, please use a similar acknowledgment. It was made available via the web page of Dalibor Klusacek. Data about failures and maintenance is also available.

Downloads:

METACENTRUM-2009-0	2.0 MB gz	original log
METACENTRUM-2009-2.swf	1.5 MB gz	converted log
METACENTRUM-2009-1.swf	1.5 MB gz	OLD VERSION of converted log (replaced 13 Dec 2011)


Papers Using this Log:

This log was used in the following papers:
[klusacek10] [di12] [klusacek12] [feitelson14] [lic14] [lucarelli17]

System Environment

MetaCentrum is composed of 14 Linux clusters, with different configurations, as follows:

Cluster	Processor	Nodes	Total CPUs
0	Itanium2 1.5GHz	8	8
1	Opteron 2.2GHz	16	16
2	Xeon 3.2GHz	10	10
3	Opteron 2.6GHz	5	80
4	AthlonMP 1.6GHz	16	32
5	Xeon 2.4GHz	32	64
6	Xeon 2.7GHz	36	148
7	Xeon 3.1GHz	35	70
8	Opteron 1.6GHz	10	20
9	Opteron 2.4GHz	3	6
10	Opteron 2.0GHz	23	92
11	Xeon 3.0GHz	19	152
12	Xeon 2.7GHz	8	64
13	Xeon 2.3GHz	11	44
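
As a quick consistency check, the per-cluster CPU counts in the table above can be summed and compared with the total of 806 processors quoted in the introduction. The sketch below simply transcribes the table; the names CLUSTERS and total_cpus are illustrative and not part of the archive.

```python
# (cluster id, processor, nodes, total CPUs), transcribed from the table above.
CLUSTERS = [
    (0, "Itanium2 1.5GHz", 8, 8),
    (1, "Opteron 2.2GHz", 16, 16),
    (2, "Xeon 3.2GHz", 10, 10),
    (3, "Opteron 2.6GHz", 5, 80),
    (4, "AthlonMP 1.6GHz", 16, 32),
    (5, "Xeon 2.4GHz", 32, 64),
    (6, "Xeon 2.7GHz", 36, 148),
    (7, "Xeon 3.1GHz", 35, 70),
    (8, "Opteron 1.6GHz", 10, 20),
    (9, "Opteron 2.4GHz", 3, 6),
    (10, "Opteron 2.0GHz", 23, 92),
    (11, "Xeon 3.0GHz", 19, 152),
    (12, "Xeon 2.7GHz", 8, 64),
    (13, "Xeon 2.3GHz", 11, 44),
]

# Sum the last column to get the size of the whole grid.
total_cpus = sum(cpus for _, _, _, cpus in CLUSTERS)
print(len(CLUSTERS), "clusters,", total_cpus, "CPUs")  # 14 clusters, 806 CPUs
```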

Jobs could run on processors from more than one cluster. While relatively rare, this did happen for 586 jobs in the log.

Scheduling is done with PBSpro, employing a system of 11 queues as follows:

Queue	Priority	Time limit (hr)
q1	62	720
q2	70	720
q3	50	24
q4	60	2
q5	80	24
q6	65	720
q7	70	720
q8	70	4
q9	70	720
q10	99	720
q11	65	720

Data about failures and other special circumstances is provided together with the log. Such data is considered important for reliable evaluations, and this is in fact the main point of the paper that introduced this log:

D. Klusacek and H. Rudova, "The Importance of Complete Data Sets for Job Scheduling Simulations". In Job Scheduling Strategies for Parallel Processing, Springer-Verlag LNCS vol. 6253, pp. 132-153, 2010.

Log Format

The original log is available as METACENTRUM-2009-0.

This file contains one line per completed job, with the following tab-separated fields:

Job ID
User
Queue
Number of processors used
Number of grid clusters used (originally called nodes)
Properties required by the application (given as a list of property numbers)
Memory used (KB)
Arrival time (UTC timestamp)
Start time (UTC timestamp)
End time (UTC timestamp)
Duration (seconds)
Exit status
List of assigned processors (space separated)
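
The field list above can be read with a small parser along the following lines. This is only a sketch under stated assumptions: the Job class, its field names, and parse_line are hypothetical; the numeric fields are assumed to be integers; the property list is kept as a raw string because its internal separator is not documented here; and the sample line is invented for illustration, not taken from the log.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: str
    user: str
    queue: str
    processors: int      # number of processors used
    clusters: int        # number of grid clusters used
    properties: str      # raw property list; inner separator is an assumption
    memory_kb: int
    arrival: int         # UTC timestamp
    start: int           # UTC timestamp
    end: int             # UTC timestamp
    duration: int        # seconds
    exit_status: int
    assigned: List[str]  # assigned processors (space separated in the log)


def parse_line(line: str) -> Job:
    """Parse one tab-separated record in the field order listed above."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 13:
        raise ValueError(f"expected 13 fields, got {len(f)}")
    return Job(f[0], f[1], f[2], int(f[3]), int(f[4]), f[5], int(f[6]),
               int(f[7]), int(f[8]), int(f[9]), int(f[10]), int(f[11]),
               f[12].split())


# Invented sample record, for illustration only.
sample = "1\tu1\tq3\t2\t1\tp1 p2\t1024\t1230768000\t1230768060\t1230771660\t3600\t0\tcpu1 cpu2"
job = parse_line(sample)
print(job.queue, job.processors, job.assigned)
```

Keeping the job ID and the processor identifiers as strings avoids assuming anything about their format.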

Conversion Notes

The converted log is available as METACENTRUM-2009-2.swf. The conversion from the original format to SWF was done subject to the following.

The status 0 was taken to mean success, and was converted to 1. All other status values were converted to 0.
1118 jobs were recorded as using 0 memory; this was changed to -1.
The conversion loses the following data, which cannot be represented in SWF:
- The number of clusters used by each job, as given in the used clusters field.
- The precise list of processors allocated to the job, and which clusters they belong to.
- The properties required by the application. Note that the meaning of the properties is unknown; they are simply listed as p1, p2, p3, etc.
The following anomalies were identified in the conversion:
- 1118 jobs were recorded as using 0 memory; this was changed to -1. Eight of them had "success" status.
- All the jobs in the log passed the following two sanity checks: the duration was equal to the difference between the start and end times, and the length of the list of assigned processors was equal to the number of assigned processors.
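
The status and memory rules and the two sanity checks described above can be expressed in a few lines. This is a minimal sketch of those rules as stated in the notes, not the archive's actual converter; the function names are hypothetical.

```python
from typing import Sequence


def swf_status(exit_status: int) -> int:
    """Exit status 0 means success (SWF status 1); anything else maps to 0."""
    return 1 if exit_status == 0 else 0


def swf_memory(memory_kb: int) -> int:
    """A recorded memory use of 0 is treated as unknown, which SWF writes as -1."""
    return -1 if memory_kb == 0 else memory_kb


def passes_sanity_checks(start: int, end: int, duration: int,
                         processors: int, assigned: Sequence[str]) -> bool:
    """Duration must equal end - start, and the assigned-processor list
    must have as many entries as the processor count."""
    return duration == end - start and len(assigned) == processors


print(swf_status(0), swf_status(271), swf_memory(0), swf_memory(2048))
```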

The difference between the first conversion (reflected in METACENTRUM-2009-1.swf) and the second conversion (reflected in METACENTRUM-2009-2.swf) is as follows:

In the first conversion, cluster data was not recovered. In the second conversion, it was extracted using the CPU IDs specified for each job. For jobs that use more than one cluster, only the first one is noted.

The conversion was done by a log-specific parser in conjunction with a more general converter module.

Usage Notes

Flurries seem to exist but have not been cleaned yet.

The log contains all the jobs that terminated in the logging period. Some of these jobs are extremely long, as the maximal runtime allowed on this system is 30 days. Thus some of the logged jobs may have started up to 30 days before the start of the logging period. As a result, the initial portion of the log is extremely sparse. A similar effect occurs (to a lesser degree) towards the end of the log: extremely long jobs that ran in this period are not logged, because they did not terminate by the end of the logging period.
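
One possible way to account for these edge effects in an analysis is to restrict attention to jobs that arrived inside a conservative window. The sketch below is an assumption, not a procedure recommended by the archive: it approximates the logging period by the earliest and latest end times in the data, and shortens the window at the end by the 30-day (720 hour) runtime limit. Waiting time in the queue is ignored, so the window is only approximate. The names MAX_RUNTIME_S, complete_window, and in_window are hypothetical.

```python
from typing import Iterable, Tuple

# 720 hours (the longest queue time limit) = 30 days, expressed in seconds.
MAX_RUNTIME_S = 720 * 3600


def complete_window(jobs: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """jobs: (arrival, end) UTC timestamps in seconds.

    Returns (lo, hi): lo is the earliest end time (taken as the start of
    the logging period); hi is the latest end time minus the maximal runtime.
    """
    ends = [end for _, end in jobs]
    return min(ends), max(ends) - MAX_RUNTIME_S


def in_window(arrival: int, window: Tuple[int, int]) -> bool:
    lo, hi = window
    return lo <= arrival <= hi


# Toy timestamps, for illustration only.
toy_jobs = [(0, 1000), (2000, 3000), (5000, 1000 + 2 * MAX_RUNTIME_S)]
window = complete_window(toy_jobs)
kept = [job for job in toy_jobs if in_window(job[0], window)]
print(window, len(kept))
```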

The Log in Graphics

File METACENTRUM-2009-2.swf

[Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot; clusters; utilization; offered load; performance]
