Topics in Performance Evaluation – Exercise 11

Topics in Performance Evaluation

Exercise 11 – Simulating Scheduling Parallel Jobs with Backfilling

In this exercise we will use simulation to compare parallel job scheduling based on backfilling with a simple FCFS scheduler, using real workload logs.

Background

A basic problem with FCFS is fragmentation: processors may be left idle if the first queued job needs more than are currently available. This problem is reduced by the EASY scheduler, using a technique called backfilling. This essentially means that jobs from the back of the queue are used to fill in holes in the schedule.

The problem with backfilling is that big jobs may be starved. To solve this, EASY makes a reservation for the first queued job. To do so, it needs to know how long each job is expected to run. Therefore EASY required users to provide runtime estimates. In our traces, this is called "requested time".

The EASY scheduling algorithm is therefore the following.

  1. If there are enough free processors for the first queued job, start this job immediately.
  2. Repeat the above step as many times as possible.
  3. If there are no more queued jobs, end.
  4. In this step we have found a queued job that needs more processors than are available. Calculate when this job is expected to start as follows:
    1. Start with the number of processors that are currently still unused.
    2. Loop on the currently running jobs, in the order of their expected termination. This can be determined from the difference between their original user runtime estimates and the time they have already run in the simulation.
    3. for each job, assume it has terminated and add its processors to the pool of idle processors.
    4. Stop when the idle processors are sufficient for the first queued job. The (future simulation) time at which this happens is the time at which this job is expected to start. We call this the "shadow time".
  5. Also calculate the processors that will be left over when the first queued job starts; this is the difference between the number of processors that will be idle and the number it actually needs. We call these the "extra processors".
  6. Now start to backfill. This is done by looping on the queue of waiting jobs. For each one, check whether it satisfies either of the following two conditions:
    1. It uses no more than the currently available processors, and is expected to terminate by the shadow time.
    2. It uses no more than the currently available processors, and also no more than the extra processors.
    If so, backfill and start it immediately. Don't forget to keep track correctly of the remaining idle and extra processors.

Assignment

  1. Write an event-driven simulator of parallel job scheduling. In the simulation, we ignore what the jobs actually do. All we are interested in are the allocation of processors and the waiting times of the jobs. Thus we focus on simulating two types of events that occur repeatedly in the system: job arrival and job termination.
  2. Data about the jobs will be based on real traces from parallel systems, available from the Parallel Workloads Archive. You need to write a parser for the Standard Workload Format, and create arrival events based on the trace data. The format is simply a line with 18 space-separated fields for each job. The fields that are important to us are
    1 job ID (serial number)
    2 job arrival time, in seconds since the beginning of the trace
    4 the job's real run time once it is allocated processors
    5 number of processors required by the job
    9 the job's requested run time
    The requested runtime is what the scheduler knows in advance, but the real runtime may be different.
    Note that the trace file also includes header comments. Only one is important for us: the MaxProcs comment, which specifies the number of processors in the machine.
    You may assume that the input trace is well-formed.
  3. Write two alternative schedulers, one implementing FCFS and the other implementing backfilling as in the EASY system. The rest of the components stay essentially the same. Selecting which scheduler to use (the original FCFS or the new EASY) should be controlled by a command-line argument.
  4. Naturally you also need to monitor performance results and system parameters during the simulation. In particular, collect information about the following:
    1. The response time of each job, used to calculate the average response time for all of them.
    2. The current waiting queue length each time the scheduler is called, used to calculate the average queue length during the simulation.
    3. The average processor utilization for the whole simulation.
    4. The fraction of jobs that were backfilled, i.e. ran before a previously submitted job that was left waiting in the queue.
  5. The output should be the four metrics listed above. Obtain this for two different logs: the SDSC SP2 log and the SDSC Blue log, for both FCFS and EASY. Use the "cleaned" versions of the logs.

To help you with testing, three short sample input files are provided:

Submit

Use the Moodle system to submit a report on your work, in pdf format, with the following data.

  1. Your names, logins, and IDs
  2. The results of your simulations. For each log, list the metrics measured for FCFS and EASY.
  3. State your conclusions from the results. Is EASY better or worse than FCFS? Or maybe they actually perform the same? Note how this is reflected by the different metrics you tracked. Are there any problems with any of the results?
  4. Answer the question: do we need to worry about an initial transient period in this simulation?
  5. The source code
Submission deadline is Monday, 16 June 2014, so I can give feedback in class on Tuesday.

Please do the exercise in pairs.

To the course home page