Topics in Performance Evaluation
Exercise 11 – Simulating Scheduling Parallel Jobs with Backfilling
In this exercise we will use simulation to compare parallel job
scheduling based on backfilling with a simple FCFS scheduler, using real
workload logs.
Background
A basic problem with FCFS is fragmentation: processors may be left idle if the
first queued job needs more than are currently available.
This problem is reduced by the EASY scheduler, using a technique called
backfilling.
This essentially means that jobs from the back of the queue are used to fill in
holes in the schedule.
The problem with backfilling is that big jobs may be starved.
To solve this, EASY makes a reservation for the first queued job.
To do so, it needs to know how long each job is expected to run.
Therefore EASY required users to provide runtime estimates.
In our traces, this is called "requested time".
The EASY scheduling algorithm is therefore the following.
- If there are enough free processors for the first queued job, start this
job immediately.
- Repeat the above step as many times as possible.
- If there are no more queued jobs, end.
- In this step we have found a queued job that needs more processors than
are available.
Calculate when this job is expected to start as follows:
- Start with the number of processors that are currently still unused.
- Loop on the currently running jobs, in the order of their expected
termination.
This can be determined from the difference between their original user
runtime estimates and the time they have already run in the simulation.
- for each job, assume it has terminated and add its processors to the
pool of idle processors.
- Stop when the idle processors are sufficient for the first queued job.
The (future simulation) time at which this happens is the time at which
this job is expected to start.
We call this the "shadow time".
- Also calculate the processors that will be left over when the first queued
job starts;
this is the difference between the number of processors that will be idle and
the number it actually needs.
We call these the "extra processors".
- Now start to backfill.
This is done by looping on the queue of waiting jobs.
For each one, check whether it satisfies either of the following two
conditions:
- It uses no more than the currently available processors, and is expected
to terminate by the shadow time.
- It uses no more than the currently available processors, and also no
more than the extra processors.
If so, backfill and start it immediately.
Don't forget to keep track correctly of the remaining idle and extra
processors.
Assignment
- Write an event-driven simulator of parallel job scheduling.
In the simulation, we ignore what the jobs actually do.
All we are interested in are the allocation of processors and the waiting times
of the jobs.
Thus we focus on simulating two types of events that occur repeatedly in the
system: job arrival and job termination.
- Data about the jobs will be based on real traces from parallel
systems, available from the
Parallel Workloads Archive.
You need to write a parser for the Standard Workload Format, and
create arrival events based on the trace data.
The format is simply a line with 18 space-separated fields for each job.
The fields that are important to us are
1 | job ID (serial number)
|
2 | job arrival time, in seconds since the
beginning of the trace |
4 | the job's real run time once it is allocated
processors |
5 | number of processors required by the
job |
9 | the job's requested run time |
The requested runtime is what the scheduler knows in advance, but
the real runtime may be different.
Note that the trace file also includes header comments.
Only one is important for us: the MaxProcs comment, which specifies
the number of processors in the machine.
You may assume that the input trace is well-formed.
- Write two alternative schedulers, one implementing FCFS and the
other implementing backfilling as in the EASY system.
The rest of the components stay essentially the same.
Selecting which scheduler to use (the original FCFS or the new EASY) should
be controlled by a command-line argument.
- Naturally you also need to monitor performance results and system
parameters during the simulation.
In particular, collect information about the following:
- The response time of each job, used to calculate the average
response time for all of them.
- The current waiting queue length each time the scheduler is
called, used to calculate the average queue length during the simulation.
- The average processor utilization for the whole simulation.
- The fraction of jobs that were backfilled, i.e. ran before a previously
submitted job that was left waiting in the queue.
- The output should be the four metrics listed above.
Obtain this for two different logs: the SDSC SP2 log and the SDSC Blue log,
for both FCFS and EASY.
Use the "cleaned" versions of the logs.
To help you with testing, three short sample input files are provided:
- l_simple.spwf is a simple sequence
of 15 jobs
that utilize the whole machine for 20 time units each, and arrive
20 time
units apart.
Thus they can be easily packed to achieve full utilization.
- l_full.spwf is a more complicated
sequence of 27
jobs, but again they are arranged so that each one has exactly
what it
needs when it arrives.
Thus they can also be packed to achieve full utilization.
- l_full_bf.spwf is the same set of
27 jobs,
but some arrive in the wrong order, so a FCFS scheduler will not
be able to
pack them so as to achieve full utilization.
Submit
Use
the Moodle system to submit
a report on your work, in pdf format, with the following data.
- Your names, logins, and IDs
- The results of your simulations.
For each log, list the metrics measured for FCFS and EASY.
- State your conclusions from the results.
Is EASY better or worse than FCFS? Or maybe they actually perform the same?
Note how this is reflected by the different metrics you tracked.
Are there any problems with any of the results?
- Answer the question: do we need to worry about an initial
transient period in this simulation?
- The source code
Submission deadline is Monday, 16 June 2014,
so I can give feedback in class on Tuesday.
Please do the exercise in pairs.
To the course home page