A commonly used algorithm is EASY backfilling. If the first queued job is too big for the currently available processors, a reservation is made for the earliest future time at which enough processors are expected to be available. The rest of the queue is then scanned, and smaller jobs are allowed to run, provided they will not violate this reservation.
To know when processors will become available, and whether jobs are short enough to terminate in time, the system needs to know job runtimes. For this, estimates of the runtimes are requested from users: when a job is submitted, the user gives an estimate of how long it will run. If the estimate is short, the job has a chance to backfill. But if the job runs longer than its estimate, the system may kill it so as not to violate a previous reservation.
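To make these two rules concrete, here is a minimal sketch of the EASY decision in Python. It is only an illustration under my own assumptions about the job representation (each job has a size in processors, a user estimate, and, once running, an estimated end time); it is not the simulator's actual code.

    def schedule(now, free, queue, running):
        """One EASY scheduling pass at time now.

        free    -- number of currently idle processors
        queue   -- FIFO list of waiting jobs (each with .size and .estimate)
        running -- currently running jobs (each with .size and .estimated_end)
        Returns the jobs started; the caller allocates their nodes and
        creates their termination events.
        """
        started = []

        # Plain FCFS: start jobs from the head of the queue as long as they fit.
        while queue and queue[0].size <= free:
            job = queue.pop(0)
            free -= job.size
            started.append(job)

        if not queue:
            return started

        # The head job does not fit, so reserve processors for it at the
        # earliest time enough are expected to be free (the "shadow time").
        head = queue[0]
        ends = sorted([(j.estimated_end, j.size) for j in running] +
                      [(now + j.estimate, j.size) for j in started])
        avail, shadow_time, shadow_free = free, now, free
        for end, size in ends:
            avail += size
            if avail >= head.size:
                shadow_time = end                 # when the reservation starts
                shadow_free = avail - head.size   # nodes left over at that time
                break

        # Backfill later jobs if they do not violate the reservation: either
        # they are estimated to finish before the shadow time, or they only
        # use nodes the reservation does not need.
        for job in list(queue[1:]):
            ends_in_time = now + job.estimate <= shadow_time
            if job.size <= free and (ends_in_time or job.size <= shadow_free):
                queue.remove(job)
                free -= job.size
                if not ends_in_time:
                    shadow_free -= job.size       # stays busy past the shadow time
                started.append(job)

        return started

Note that the decision uses only the user estimates; the real runtimes enter the simulation only through the actual termination events.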
The interesting thing is that inaccurate estimates, which overestimate the runtime by a factor of 3-10, turn out to lead to better performance than accurate estimates. The goal of this exercise is to try to understand how this happens. This is an open research question.
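For example, one simple way to model the two kinds of estimates in an experiment (my own assumption about how "inaccurate" might be generated, not something prescribed by the exercise) is to keep the real runtime as the accurate estimate and to inflate it by a random factor otherwise:

    import random

    def make_estimate(runtime, factor=1.0):
        # factor = 1.0 gives perfectly accurate estimates; factor = 3 or 10
        # gives estimates that overestimate the runtime by up to that factor.
        if factor <= 1.0:
            return runtime
        return runtime * random.uniform(1.0, factor)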
If the event is a job arrival, the scheduler is called. The scheduler adds the new job to the set of waiting jobs it knows about (which might be empty), and tries to schedule it or another job. Scheduling means that nodes are allocated, and a termination event is created for the termination time of the job. Note that this is based on the real termination time, not the estimated one.
If the event is a termination, the job's statistics are recorded. Then the scheduler is called, to see if it can make use of the freed nodes.
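Put together, the event loop might look roughly like the following sketch. This is an assumed structure for illustration only; the names simulate, scheduler.schedule and stats.record are hypothetical, not the simulator's actual interface.

    import heapq
    import itertools

    def simulate(jobs, scheduler, stats):
        seq = itertools.count()          # tie-breaker so the heap never compares jobs
        events = [(job.submit_time, next(seq), 'arrival', job) for job in jobs]
        heapq.heapify(events)
        waiting = []

        while events:
            now, _, kind, job = heapq.heappop(events)
            if kind == 'arrival':
                waiting.append(job)      # add the new job to the set of waiting jobs
            else:
                stats.record(job, now)   # termination: record the job's statistics

            # In both cases the scheduler is called; for each job it starts, a
            # termination event is created at the real end time, not the estimated one.
            for started in scheduler.schedule(now, waiting):
                heapq.heappush(events,
                               (now + started.runtime, next(seq), 'termination', started))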
The simulator can run the input workload as is, or it can change the load factor. I recommend you start with the code as is (i.e. run the simulator without the -l flag), but you may also experiment with loads in the range of 0.7 to 0.9. To increase the load, the simulator simply multiplies the interarrival times by a factor, so that jobs arrive faster and hence cause a higher average load. Again, this code exists; you just need to set the desired load level(s) for the load loop in main, and run with -l.
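For intuition, the arithmetic behind such a rescaling is roughly as follows. This is a back-of-the-envelope sketch of the usual approach, not a description of the simulator's internals, and the load definition used here is my assumption.

    def rescale_arrivals(jobs, machine_size, target_load):
        # Offered load = total work / (machine capacity * span of the log).
        span = jobs[-1].submit_time - jobs[0].submit_time
        work = sum(j.runtime * j.size for j in jobs)
        original_load = work / (machine_size * span)

        # Shrinking the interarrival times by original/target raises the load
        # to the target (a factor below 1 means jobs arrive faster).
        factor = original_load / target_load
        prev_old = prev_new = jobs[0].submit_time
        for j in jobs[1:]:
            interarrival = j.submit_time - prev_old
            prev_old = j.submit_time
            prev_new += interarrival * factor
            j.submit_time = prev_new
        return jobs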
The software is available (locally) at ~perf/sim.
You should use workload files from real production machines that run the EASY scheduler and therefore contain the required data, specifically user runtime estimates. You can find such workloads in the Parallel Workloads Archive. Use the logs from the SP2 machines installed at CTC, KTH, and SDSC.
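The archive's logs come in the Standard Workload Format (SWF): one whitespace-separated line per job, with comment lines starting with a semicolon. A small reader such as the sketch below is enough to pull out the fields this exercise needs; the field positions follow the published SWF definition, but do verify them against the archive's documentation.

    def read_swf(path):
        jobs = []
        with open(path) as f:
            for line in f:
                if line.startswith(';') or not line.strip():
                    continue                       # skip header comments and blanks
                fields = line.split()
                jobs.append({
                    'submit':   float(fields[1]),  # submit time (seconds from log start)
                    'runtime':  float(fields[3]),  # actual run time
                    'size':     int(fields[4]),    # number of allocated processors
                    'estimate': float(fields[8]),  # requested time = user's estimate
                })
        return jobs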
Hint: in at least some of these cases, you should see unexpected results: performance is better when the estimates are farther from the real runtimes. Try to find at least one such case (i.e. a combination of workload and type of estimates).
What you need to do is monitor exactly what happens during the run. Here are some possible hints.
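For instance, one possible way to instrument a run (an illustration of the kind of data you might collect, not a prescribed method) is to keep per-job wait times and bounded slowdowns, plus a count of how many jobs got started by backfilling, so that runs with different estimate types can be compared side by side:

    class RunStats:
        def __init__(self, tau=10.0):
            self.tau = tau               # runtime threshold for bounded slowdown (seconds)
            self.waits = []
            self.slowdowns = []
            self.backfilled = 0

        def record(self, job, end_time, was_backfilled=False):
            # Wait time is recoverable from the termination time and the real runtime.
            wait = end_time - job.runtime - job.submit_time
            self.waits.append(wait)
            self.slowdowns.append((wait + job.runtime) / max(job.runtime, self.tau))
            self.backfilled += was_backfilled

        def summary(self):
            n = len(self.waits)
            return {'avg_wait': sum(self.waits) / n,
                    'avg_bounded_slowdown': sum(self.slowdowns) / n,
                    'backfilled_jobs': self.backfilled}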
The really hard part, of course, is to put the pieces together and come up with an explanation of how the different types of estimates lead to different levels of performance. Better yet, can you test whether your explanation is correct?
In your email, outline the experiments you ran, the performance results you got, and what you learnt from them. In part 1, you should just make a few runs and reproduce the results described above. In part 2, you should design additional experiments, or collect additional data, that will help explain these results. If you indeed come up with a good explanation, that's great. If you don't, at least explain what you tried to do and what you did find.
Due date for part 1: Thursday, May 8. I'll send you a reply email with feedback.
Due date for part 2: Friday, May 23.