Quick how-to
A C++ program and API to artificially generate a distribution of user
runtime estimates, and optionally append the generated estimates to an
existing
Standard Workload Format (SWF) file (which in turn can be either
a real log or generated by some other model).
This model is the result of the paper
Modeling User Runtime Estimates
published in JSSPP'05
[tsafrir05b].
# | step | command
1 | download this tar file | est.tgz
2 | open the tar and compile the code | tar -xvzf est.tgz ; cd m_tsafrir05 ; make
3 | run the program and follow the instructions; you may find the examples below helpful | ./est_driver
To cite this model:
1 | Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson, "Modeling User Runtime Estimates". In 11th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), pp. 1-35, Jun 2005. Lecture Notes in Computer Science Vol. 3834. (Cited in the text as [tsafrir05b].) |
2 | Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson, "A Model/Utility for Generating User Runtime Estimates and Appending Them to a Standard-Workload-Format File". URL http://www.cs.huji.ac.il/labs/parallel/workload/m_tsafrir05 |
Introduction:
Runtime estimates provided by users to schedulers of parallel machines
have the unfortunate property of worsening performance.
Therefore, in order to get realistic performance evaluation results,
one must use realistic estimates, as provided by the model suggested
here.
All the details explaining why user estimates are so bad for
performance, along with the justification of the model suggested here,
can be found in the paper that accompanies this model,
"Modeling User Runtime Estimates"
[tsafrir05b].
This document provides a brief overview of the available files, how
to use the model, and how to choose its parameters.
The file est.tgz contains all the files that compose the estimate-model utility. These are:
1 | est_model.hh, est_model.cc | The API of the model (only 2 functions; described below) and its implementation. |
2 | est_swf_job.hh | Defines the Job_t structure that encapsulates a single SWF-file record, and knows how to read/write it. |
3 | est_driver.cc | A simple shell utility to generate an estimate distribution and optionally combine it with an existing SWF-file. Contains the main function of the utility and can be viewed as an example of how to use the model's API. |
4 | est_parse.hh, est_parse.cc | Used by the shell utility to parse its command line arguments and optionally an SWF-file. |
5 | Makefile | Compiles the above and generates the est_driver utility. |
6 | Est05JSSPP.pdf | The paper that explains this model and its rationale. |
7 | index.html | This file. |
SYNOPSIS
est_driver <maxest> <njobs|swf> [-b<userbins>] [-p<precision>] [-s<seed>]
PARAMETERS
maxest [mandatory] |
The maximal allowed estimate in seconds. If appending estimates
to an SWF-file, this is also the maximal allowed runtime (as jobs are
killed when they exceed their estimate), and therefore all runtimes
that are bigger than maxest are truncated to exactly maxest.
For guidance on how to choose this value, see below. |
njobs | swf [mandatory] |
The second parameter is either njobs or swf:
njobs: "number of jobs", a positive integer indicating the number of estimates the model must produce. If njobs is used, the resulting output is the estimate distribution (in PDF and CDF form).
swf: the name of an SWF-file. If swf is given, then (1) njobs is taken to be the number of jobs within the swf, (2) the swf is parsed, (3) artificial estimates are added to it, and (4) the swf is reprinted with the new estimates. |
userbins [optional] |
A user-specified list of estimate bins composed of
comma-separated pairs of the form seconds=percent, e.g.
3600=10.5,300=2.2 means 10.5% of the jobs will be set to
use a 1-hour estimate (3600 seconds) and 2.2% of the jobs will
be set to use a 5-minute estimate (300 seconds).
Though not mandatory, this option is especially important
because it allows the user to specify the percentage of jobs
that use maxest as their estimate; as explained in the BACKGROUND
section below, this percentage varies considerably between logs
and has a decisive effect on performance. |
precision [optional] | In case an SWF-file is given (which means it is reprinted with artificial estimates), this parameter (a nonnegative integer) determines the number of precision digits in the output, that is, the number of digits to the right of the decimal point in all the SWF fields that have a floating-point nature (e.g. runtime, used-memory, etc.). The default is 0. |
seed [optional] | The seed for the random number generator (the drand48 family). The default is 0. |
EXAMPLES
1 | est_driver 64800 l_ctc_sp2.swf | Reads the CTC-SP2 log, replaces the original user estimates with artificial estimates generated by the model (using its default settings), and prints the resulting SWF file. The maximal value of generated estimates is set to 18 hours (64800 seconds); indeed, this was the administrative maximal value of runtimes and estimates at the CTC-SP2 site. The maximal allowed value applies to both runtimes and estimates, as users are not allowed to provide estimates that are bigger than the maximal allowed runtime. However, it is almost always the case that a few jobs have runtimes that are bigger than the maximal allowed value. Usually, the difference between the actual runtime and the maximal value is less than one minute, and probably reflects the time it takes to kill a job that exceeded the limit. But on very rare occasions, the difference is much bigger (probably because the system administrator explicitly allowed these jobs to continue running beyond their runtime estimates). Regardless of the reason, if a job's runtime is bigger than the maximal allowed value (=maxest), the current model truncates the runtime of the job to be exactly maxest. |
2 | est_driver 64800 l_ctc_sp2.swf -b 64800=23.8 -s 2 | Same as the previous example, but now 23.8% of the jobs are assigned an estimate of 18 hours (64800 seconds); indeed, in the original CTC-SP2 trace file, 23.8% of the jobs are associated with a user estimate of 18h. The default settings of this model (used in the previous example) associate around 22% of the jobs with the maximal estimate. The seed for the random number generator used by the model is set to 2; a different seed will result in a different artificial distribution of user runtime estimates. |
3 | est_driver 129600 l_sdsc_blue.swf -b 129600=1.1,7200=27.3 | This is similar to the above example: here we use a maximal estimate value of 36h (129,600 seconds) for the SDSC-BLUE log and instruct the model to associate only 1.1% of the jobs with this value, while associating 27.3% of the jobs with a smaller estimate of 2h (7200 seconds). Indeed, the SDSC-BLUE log is somewhat different than usual: the number of jobs associated with the maximal estimate is very small. The reason for this abnormality is that the "real" (or "de-facto") maximal estimate of this log is 2h, as this is the runtime upper bound of jobs submitted to the "express" and "interactive" queues of the SDSC BLUE Horizon machine, which are used by most jobs. Note that in order to change the "effective" maximal estimate of a log from maxest to a smaller value (let this value be denoted E), it is not enough to instruct the model to associate a lot of jobs with E (e.g. in the above example "-b 7200=27.3" is insufficient) because, unless instructed otherwise, the model will always associate very many jobs with maxest. Therefore, we must explicitly indicate that we want the effective maximal value to be changed, by associating maxest with a small number of jobs. Indeed, in the SDSC-BLUE log, only 1.1% of the jobs are associated with the maximal value, whereas 27.3% of the jobs are associated with the maximal value of the two high-priority queues, which is E=2h. |
API:
The above describes how to generate an estimate distribution using the
shell utility est_driver. However, it is also very easy (and
provides more control over the model's parameters) to use this model as a
C++ module and invoke its functions directly.
The model's interface is composed of only two functions that use three
structure types; all are defined in est_model.hh, except the first
structure, which is defined in est_swf_job.hh.
Here is the specification of the interface:
1 | struct Job_t {
        int id;
        int submit;
        int wait;
        int runtime;
        // ... all the rest of the SWF fields
    };
    typedef std::vector<Job_t> SWF_t;
The Job_t structure encapsulates all 18 data fields of
one job (=line =record) within an
SWF-file.
This structure is defined in
est_swf_job.hh.
When assigning user estimates to an existing SWF-file, the latter is represented using a standard vector of Job_t-s. (We denote this vector type as SWF_t for the purpose of explaining the interface here; the type std::vector<Job_t> is used directly within est_model.hh.) Though not part of the interface, you can use the very short function parse_swf() in est_parse.cc, which, given the name of an SWF-file, easily and efficiently parses the file and generates the associated SWF_t job vector. |
2 | struct EstBin_t {
        int time;   // estimate in seconds
        int njobs;  // # of jobs using it
    };
    typedef std::vector<EstBin_t> Dist_t;
This structure represents one "bin" in the estimate histogram:
the time field holds the estimate value (in seconds) and
the njobs field specifies the number of jobs that have
time as their user estimate.
To represent the entire estimate distribution we use a standard vector of EstBin_t-s. (We denote this vector type as Dist_t, for the purpose of explaining the interface here; but the type std::vector<EstBin_t> is directly used within est_model.hh.) |
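For illustration only (this snippet is not part of the package), here is how such a distribution could be populated by hand, assuming EstBin_t is the plain struct shown above. Note that, unlike the -b command-line option, which takes percentages, njobs here is an absolute job count:

    std::vector<EstBin_t> dist;          // the Dist_t of the text
    EstBin_t hour  = { 3600, 1200 };     // 1200 jobs use a 1-hour estimate
    EstBin_t fivem = {  300,  220 };     //  220 jobs use a 5-minute estimate
    dist.push_back(hour);
    dist.push_back(fivem);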
3 | struct EstParams_t {
        // model parameters
        ...
        // constructor
        EstParams_t( int njobs, int maxest, Dist_t userbins=Dist_t() );
    };
There are more than a dozen parameters for the estimate model
(encapsulated within the EstParams_t structure).
However, the constructor of EstParams_t accepts only
the most important ones (two mandatory, one optional):
njobs and maxest, which are mandatory, and userbins, which is optional;
their meaning is the same as that of the corresponding command-line
parameters described above. |
4 | void est_gen_dist(EstParams_t params, Dist_t* est_dist); |
Given params (which specifies the model parameters), assign the generated estimate distribution into the place pointed to by est_dist (a pointer to some preallocated vector). |
5 | void est_assign(const Dist_t& est_dist, SWF_t* jobs); |
Given est_dist (an estimate distribution generated by the function est_gen_dist()) and jobs (a pointer to a vector that holds the content of some SWF-file with njobs records), randomly assign the artificial estimates as embodied by est_dist into the jobs, such that the estimate of each job is equal to or bigger than the runtime of that job. |
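To see how these pieces fit together, here is a minimal end-to-end sketch. It is not taken from the package: the call form of the parse_swf() helper is assumed, so check est_parse.hh for its real signature before using it. The sketch parses an SWF-file, generates a distribution with the default settings, and assigns the estimates to the jobs:

    #include <vector>
    #include "est_swf_job.hh"   // Job_t
    #include "est_model.hh"     // EstBin_t, EstParams_t, est_gen_dist(), est_assign()
    #include "est_parse.hh"     // parse_swf() -- helper, not part of the interface

    int main()
    {
        // 1. Parse an existing SWF-file into a job vector (the SWF_t of the text).
        std::vector<Job_t> jobs;
        parse_swf("l_ctc_sp2.swf", &jobs);      // assumed call form

        // 2. Build the model parameters: number of jobs and maxest (18h here).
        //    A userbins vector (as sketched above) may be passed as a third argument.
        EstParams_t params((int)jobs.size(), 64800);

        // 3. Generate the artificial estimate distribution.
        std::vector<EstBin_t> est_dist;
        est_gen_dist(params, &est_dist);

        // 4. Randomly assign the generated estimates to the jobs,
        //    such that each job's estimate is >= its runtime.
        est_assign(est_dist, &jobs);

        // ... print or otherwise process 'jobs' ...
        return 0;
    }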
BACKGROUND
In order to correctly choose the main parameters of the model, you must be aware of two simple facts: (1) user estimates are highly modal, that is, users tend to pick their estimates from a small set of "round" values, and (2) the most popular of these values is usually maxest, the maximal allowed estimate.
This is exemplified in the following figure, which shows the cumulative distribution function (CDF) of runtimes and estimates in 4 different production systems.
The Y-axis shows the percent of jobs that are associated with a value that is equal to or smaller than the corresponding X-value. For example, in KTH, ~30% of the jobs have runtime <= 1m (one minute) and ~6% of the jobs have a 1m estimate. The estimate curves look like a staircase because of the modal nature of human approximations. The larger the stair, the more popular the associated estimate, e.g., in SDSC-SP2 ~10% of the jobs use estimate=5m; in SDSC-BLUE ~27% of the jobs use estimate=2h. Note that the fact that the estimate curves are lower than the runtime curves indicates that runtimes are shorter than estimates (which is only natural, as jobs are killed once their estimate is reached).
Recall maxest is usually the most popular estimate. This is also true for KTH and BLUE, despite appearances. In KTH, maxest is actually 4h during weekdays (and 15h during weeknights, and 60h during weekends). In BLUE, as mentioned earlier, the maxest of the "express" and "interactive" queues (used by most jobs) is 2h:
property | SDSC-SP2 | CTC-SP2 | KTH-SP2 | SDSC-BLUE
maxest | 18h | 18h | 4h | 2h
% of jobs using maxest | 9.7% | 23.8% | 10.1% | 27.3%
popularity rank of maxest | 3 | 1 | 1 | 1
Note that even though maxest is the upper bound of both estimates and runtimes, there's usually a very small fraction of the jobs with runtimes that exceed this limit. Mostly, the limit is exceeded by less than a minute, and probably reflects the time it takes to kill the job. However, on rare occasions, the runtime is much longer (e.g. in SDSC-SP2 there's a job with runtime > 100h). The reason for this phenomenon is unknown, but we speculate that such jobs were explicitly given special treatment by the system administrators. Regardless of the reason, this observation is important when determining the value of maxest that is supplied to the model (see below).
Modality in general and the vast popularity of maxest in particular are very unfortunate. Modality is bad because all jobs look the same in the eyes of the scheduler, which means it has fewer opportunities to exploit existing "holes" in the schedule for backfilling. The popularity of maxest is even worse, as jobs that use this estimate are never backfilled (except at the expense of the "extra nodes" [mualem01]); they simply appear to be too long.
Finally, note that the percent of jobs associated with maxest exhibits very high variance. The good news is that this is where most of the variance is concentrated, and setting this difference aside, the estimate distributions of different logs are remarkably similar [tsafrir05b]. Indeed, if the model is supplied with this information (% of jobs associated with maxest), the estimate distribution it produces is very similar to that of the original.
GUIDANCE IN CHOOSING PARAMETERS
Due to the reasons mentioned above, the most significant parameters of the model are (1) the mandatory maxest, and (2) the optional percent of jobs that are associated with it (can be specified using the -b option). Let this percent be denoted as P.
How to choose P: The bigger P is, the worse the performance gets. At the extreme, P=100%, in which case backfilling activity completely stops (with the exception of "extra nodes" backfilling [mualem01]); the scheduling then becomes relatively similar to FCFS. The default settings of this model set P to ~22%. This is a compromise between the P values listed in the above table and is inclined towards the bigger values, so that the model will be closer to the worst case. However, as can be seen, the situation in CTC-SP2 and SDSC-BLUE is worse. You can allow the model to use its compromise value, or you can make it better (smaller) or worse (bigger), depending on the focus of your interest.
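For instance, building on the CTC-SP2 examples above, a deliberately pessimistic run could raise P to, say, 30% (the percentage and output filename here are purely illustrative): est_driver 64800 l_ctc_sp2.swf -b 64800=30 > l_ctc_sp2.p30.swf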
How to choose maxest:
When appending estimates to a log that lacks this data and was
generated by a production system, the best alternative is to find
out what the site's administrative upper bound on runtimes was, and use
this value as maxest.
If this information is unavailable, one should examine the runtime
CDF of the log, as plotted in the above figure, and approximate the
upper bound.
Note that, as explained above, simply using the maximal runtime value
that is found within the log is usually inadequate, as often there's
a small number of jobs that were given special permission to run
beyond the real administrative upper bound.
Indeed, in the above figures, it is usually quite evident what value
should be used.
When estimates are appended to a workload that was generated by a
model, the best thing to do is to simply use the maximal possible runtime
(if such a model parameter exists), or otherwise the biggest runtime value
found in the workload.
How to choose other user bins: This is only applicable if there's information available about the system that generated the workload. As was shown for SDSC-BLUE, if the administrative runtime upper bounds of the various queues (if any) are known, it is certainly a good idea to explicitly specify these values using the -b option and assign to them a significant portion of the jobs. Obviously, this should be done while examining the runtime CDF, to check if your choice is reasonable (e.g. if most jobs run more than an hour, it is unreasonable to decide that 1h is a very popular estimate value, as jobs are killed once their estimate is reached).
Example: adding estimates to the
Lublin model
The first step is to generate the (estimate-less) workload. We do this by using the original C program provided by Lublin. By default, Lublin's model generates an SWF-file composed of 1000 jobs only. We change this to be (say) 50,000 by assigning the new value to the SIZE constant in Lublin's program and recompiling it:
gcc m_lublin99.c -o m_lublin99 -lm
(m_lublin99.c here is the modified file that generates 50,000 jobs).
We now generate the estimate-less SWF file:
./m_lublin99 > lublin.est-less.swf
Lublin's model does not define a runtime upper bound, but rather, a mean runtime (the more jobs there are, the bigger the maximal runtime gets). We therefore have no choice but to examine the runtime CDF, which indicates that while 45h is the maximum, a more natural value for maxest is 10h (e.g. compare with SDSC-SP2's CDF above):
There is no additional information available, and so all that's left is to append the artificial estimates with the chosen maxest:
./est_driver 36000 lublin.est-less.swf > lublin.swf
The resulting estimate distribution looks like so (though obviously supplying different seeds using the -s option will result in somewhat different distributions):
Example: adding estimates to the
OSC log
We have very little knowledge about the system that generated the OSC log, so here too all we have to do is determine maxest by ourselves. When no information is available, it can be very helpful to inspect the actual runtime distribution:
[Figure: runtime distribution of the OSC log; panels ordered top to bottom, left to right]
As can be seen, the maximal runtime is ~320h, but there are only 6 jobs with runtimes that are bigger than 160h by more than a few minutes and these are scattered across a huge domain. In contrast, starting from 160h (and downwards), the runtime distribution becomes continuous, which means choosing maxest to be 160h (= 576,000 seconds) is a reasonable speculation.
Thus, we append estimates to the OSC log as follows:
./est_driver 576000 -i1 l_osc_cln.swf > l_osc_cln.with-est.swf
which yields the following distribution:
Bugs
The est_driver utility filters "insane" jobs (with size<1 or
runtime<0) if it operates on an existing SWF file.
Therefore, the number of jobs it prints might be smaller than the number
of jobs in the original file.
However, the associated header comments (e.g. MaxRecords) remain
unchanged, that is, the new SWF file has header comments identical
to those of the old one.
since Feb 27, 2006 / dants@cs.huji.ac.il / last update: Sep 13, 2012