A Batch Scheduler for the Intel Paragon with a Non-contiguous Node Allocation Algorithm

Michael Wan, Reagan Moore, George Kremenek and Ken Steube
San Diego Supercomputer Center
P.O. Box 85608, San Diego, Ca 92186-9784
E-mail: mwan, moore, kremenek, steube@sdsc.edu

Abstract

As the system usage model for scalable parallel processors evolves from the single-user, dedicated access model to a multi-user production environment, a versatile batch scheduler is needed to match user requirements to system resources. This paper describes the design and performance of a batch scheduler for the Intel Paragon system that addresses the issues associated with a multi-user production environment, including scheduling for heterogeneous nodes, scheduling for long-running jobs, scheduling for large jobs, prime/non-prime time modes, and node allocation schemes. The Modified 2-D Buddy system (M2DB) for non-contiguous node allocation is introduced and studied in this paper.

1.0 Introduction
1. 1.1 Design Requirements
2.0 The Paragon Parallel batch scheduler
3.0 Workload characteristics and Performance
1. 3.2 Workload characteristics
4.0 Performance of the Batch Scheduler
5.0 Summary and Conclusions
6.0 Acknowledgments
References

1.0 Introduction

As large parallel processor systems become heavily used, there is a need for the system usage model to evolve from single-user, dedicated access to a much more versatile multi-user production environment. A key component for satisfying this need is a job scheduler capable of handling the scheduling of both interactive and batch jobs submitted by multiple users. This paper describes the design and performance of a batch scheduler for a 416-node Intel Paragon system in production use at the San Diego Supercomputer Center (SDSC).

The usage requirements for interactive and batch jobs can be quite different. Typically, interactive usage is for debugging and short production jobs that use a small number of nodes, while batch usage is mostly for long production jobs using a large number of nodes. To handle the disparate node size and execution time scales, space sharing is used. The nodes of the Paragon may be divided into groups called partitions. In our usage model, two non-overlapping partitions -- the Interactive and Batch partitions, -- are used. Interactive jobs are processed in the Interactive partition, and long-running production jobs are executed in the Batch partition. A concern with this type of space sharing is that the Interactive partition may be idle a large fraction of the time, and typically is idle at night. This concern is mitigated when the number of nodes devoted to the Interactive partition is relatively small when compared to the total number of nodes in the computer.

The scheduling of jobs in the Interactive partition is supported by a sequential node assignment mechanism provided by the Paragon operating system. Jobs are scheduled on a first-come-first-serve basis. An interactive job is rejected if there are not enough free nodes to satisfy the request.

The scheduling of jobs in the Batch partition is handled by the SDSC batch scheduler which is the subject of this paper. The batch scheduler is based on the Network Queuing System (NQS) [1]. NQS is a client/server queuing system originally designed for vector supercomputers running the traditional UNIX operating system. The SDSC batch scheduler uses the job queuing framework provided by NQS to which we have added a scheduling layer for assigning node partitions on the Paragon parallel computer.

1.1 Design Requirements

The design requirements for the parallel batch scheduler are based on the need at SDSC to handle multiple classes of batch jobs, a heterogeneous hardware configuration, and time varying usage patterns. The related issues are:

* Node allocation - Allocating nodes to an application involves the selection of a set of nodes from a pool of free nodes in the 2-D mesh. The Paragon architecture allows non-contiguous node allocation; the nodes of a given parallel job do not have to be physically adjacent. This flexibility has the potential of greatly improving the node utilization of the system. This is particularly true after the system has been up for some time at which point the free nodes in the system tend to be fragmented. Non-contiguous allocation , however, may have the problem of message contention because of the potential for longer message paths (and more hops) and interference with messages from other jobs. Fortunately, with the wormhole routing [2] technology used in the Paragon communication backplane, the number of hops between nodes is no longer the dominant factor affecting message latency as shown in a communication contention study by Lo, et. al. [3]. Hence, the primary performance objective of the SDSC scheduling scheme is maximizing node utilization through non-contiguous allocation. The scheduling scheme does the following:

Allocates a job if there are enough free nodes available even if the set of nodes to be allocated is fragmented.
Allocates exactly the number of nodes requested. Jobs are scheduled on a space sharing basis. No time sharing is assumed.
Tries to minimize fragmentation to reduce communication contention.

Prime/Non-prime times - The batch scheduler needs to support the notion of Prime (daytime) and Non-prime (night/weekend) times. The scheduler provides the flexibility of configuring the batch and interactive partitions differently during day and night. For example, during prime time, we may want to allocate a large number of nodes to the interactive partition while during non-prime time, most of the nodes should be allocated to the batch partition.
Heterogeneous nodes - Paragon systems can contain nodes with various physical attributes. For example, nodes may have different memory sizes (16-MBytes or 32-MBytes), or may have a different number of CPUs (GP - single CPU or MP - two CPUs). The batch scheduler must be capable of scheduling nodes by hardware request type. In addition, an option is needed for jobs that can use multiple types of nodes. For example, a given job may be able to use either the 16 MB or the 32 MB GP nodes. The ability to schedule jobs that can utilize multiple node types as compared to restricting jobs to only a single node type can increase node utilization.
Short running jobs - Since node allocation is based on space sharing only, long running batch jobs can block the execution of short jobs waiting in a batch queue. During prime time, we may need to reserve some nodes for the exclusive use of short running batch jobs.
Large batch jobs - For a job mix consisting of both small and large jobs, the smaller jobs tend to get scheduled first because it is easier to fit the smaller jobs into the available nodes. Once we have several smaller jobs running in the system, it will become even more difficult for a large job to be scheduled because a large job may need the nodes used by the smaller jobs. Therefore, a mechanism is needed to enhance the chance of large jobs getting scheduled.

Section 2.0 gives a summary of the Paragon parallel batch scheduler design. Section 3.0 gives a summary of the workload characteristics, and Section 4.0 gives the performance analysis of the scheduler for the 416-node Paragon at SDSC.

2.0 The Paragon Parallel batch scheduler

The Paragon parallel batch scheduler is based on version 2 of the NQS queuing system. A scheduling layer was added to NQS to schedule jobs for the Paragon system. The following gives a summary of the enhancements:

2.1 Enhancement for NQS commands

A new option "-lP" has been added to the "qsub" command which is used for submitting batch jobs. This new option allows users to request a specific number of nodes on which to run the batch job.

In addition, a new subcommand has been added to the "qmgr" command. The

"qmgr" command is an interactive program used by the NQS administrator to

control and configure the local NQS system. The new subcommand

set per_request ncpus = N QUEUE

allows an NQS administrator to specify the default number of nodes (N) for

a specific batch queue (QUEUE). That is, if a job is submitted to a given batch queue without specifying the number of nodes for the job, it will default to the ncpus value of the specified queue.

2.2 Grouping of nodes into "node sets" and "node groups"

In order to provide scheduling support for heterogeneous nodes and node

reservation for short running jobs during prime time, it is necessary to provide a framework for partitioning the nodes in the system. The framework also provides a way to associate scheduling policies with physical node sets.

Nodes are first partitioned into non-overlapping "node sets". The physical attributes of nodes in a given "node set" should be identical because they are treated as such by the scheduler.

"Node groups" are then constructed by combining one or more "node sets".

Overlapping "node groups" are allowed. i.e., a given "node set" can be assigned to more than one "node group". Each NQS queue is associated with a single "node group". Scheduling policies are enforced by linking each NQS queue to a particular usage requirement. Queues may be set up for short-running jobs or for jobs needing large memory nodes, or nodes with multiple CPUs. The following simple example illustrates the use of this framework for the scheduling of different types of nodes.

Consider a system with two different types of nodes - 8 nodes with 16 MB of memory and 16 nodes with 32 MB of memory. They are partitioned into two "node sets":

        node_set 1: 16 nodes with 32 MB of memory
        node_set 2:  8 nodes with 16 MB of memory

We create three "node groups" with the following "node set" assignment:

"node group" "node sets" assigned to the "node group"
1 1

2 2

3 1,2

"node group"	"node sets" assigned to the "node group"
1	1
2	2
3	1,2

"Node groups" 1 and 2 are for jobs that only want to run on the 32 MB and 16 MB memory nodes, respectively. "Node group" 3 is for jobs that can use nodes with either memory size.

The "node sets" associated with a given "node group" can be switched as a function of the time period (prime, non-prime or both) during which the nodes will be used. For example:

"node group"	"node sets" to use during prime time	"node sets" to use during non-prime time
4	1	1
5	-	1

"Node group" 4 uses "node set" 1 during both prime and non-prime times, but "node group" 5 can only use "node set" 1 during non-prime time. If we assign "node group" 4 to short running jobs and "node group" 5 to long running jobs, we in effect allow only short running jobs to use "node set" 1 during prime time. This configuration will give good throughput for short running jobs during the day.

Once a job has been selected for allocation, the scheduler selects free nodes by scanning the "node sets" associated with the "node group". It may take up to two passes to accomplish the task. During the first pass, the scheduler tries to find the "node set" with the minimum number of free nodes that can satisfy the entire request. If such a "node set" cannot be found, the scheduler will allocate free nodes from "the node sets" in the order specified in the definition of the "node group" that is associated with the NQS queue to which the job was originally submitted.

Within a "node set", nodes are allocated using the modified 2-D buddy strategy described in section 2.4.

2.3 Job priority and scheduling of large jobs

The original NQS system scheduled jobs according to the "queue priority" assigned to each NQS queue. Jobs were queued up at each NQS queue in the order of submission time. When NQS scheduled jobs to run, it looked through each queue, one queue at a time, starting with the queue with the highest "queue_priority". This simple scheme does not allow a job queued in a low priority queue to be scheduled ahead of jobs in the high priority queue, no matter how long it has been waiting in the queue.

To enhance scheduling flexibility, we introduced the concept of aging. With this scheme, the job priority is calculated as follows:

	job_priority = queue_priority + age_factor * time

where

	queue_priority = priority associated with the queue to which the job 
			 is submitted.

	age_factor = A constant factor to be defined by the NQS administrator.

	time = time in hours the job has been waiting in the queue.

For scheduling purposes, all jobs are sorted in the order of job_priority and the scheduler picks jobs to run starting with the highest priority job. If there are not sufficient free nodes to schedule the highest priority job, the next highest priority job will be considered and so on.

As noted earlier, large jobs have a more difficult time getting scheduled to run then smaller jobs. To ensure that large jobs do get scheduled, we introduced the concept of blocking controlled by a constant "block_priority" that is set by the NQS administrator. Under blocking, nodes are deliberately kept idle until enough free nodes are available to schedule the large-node job. If blocking is enabled (by setting the "block_priority" to greater than zero), the scheduler will check whether the "job_priority" of the highest priority job is greater than the "block_priority". If it is and if there are not enough free nodes to schedule this job, scheduling of new jobs will be blocked until there are sufficient free nodes for this job. The idea is that if a large job is an important job (high queue_priority) or if it has been waiting in queue for a while (aging), idle nodes will be accrued through the blocking mechanism until the job is able to run.

2.4 The modified 2-D buddy system (M2BD)

The two-dimensional buddy system proposed by Li and Cheng [4] is an allocation strategy for a square mesh-connected system. The strategy is only applicable to situations where all incoming jobs are square submeshes of dimension (n x n) and the mesh is a square mesh of size (N x N), where both n and N are power of 2 integers. This geometric limitation could have an adverse

effect on the node utilization of the system if user jobs use other node sizes. The allocation and deallocation overheads of the strategy are both O(log n).

The Paragon architecture is typically a non-square mesh. The Paragon OS can allocate non-square and non-contiguous nodes to a job. The nodes of a given parallel job do not have to be physically adjacent. We introduce an allocation strategy - Modified 2-D Buddy System (M2BD), which is an extension of the 2-D buddy strategy. It uses the basic tree data structure of the 2-D Buddy System, but can take advantage of the geometric flexibility provided by the Paragon architecture. Basically, this strategy attempts to schedule a job by allocating one or more rectangular blocks. The blocks allocated to a job can be non-contiguous.

As described in section 2.2, the Paragon mesh is configured into disjoint "node sets"; "node groups" are then constructed as groups of these "node sets". Each node set is a rectangular submesh of nodes with the same physical attributes. The M2BD system is used to allocate nodes within each "node set". This strategy is similar to the Multiple Buddy Strategy (MBS) proposed by Lo et. al. [3]. The primary difference between the two strategies is in the Buddy generation scheme described in 2.4.2..

2.4.1 The Modified 2-D Buddy System (M2DB) allocation algorithm

Similar to the 2-D buddy system, M2DB attempts to satisfy a request by repeatedly breaking down the rectangular submesh of a "node set" into smaller blocks. The search algorithm is governed by the following steps:

The first step involves allocating an "anchor block" which is the first block to be allocated. The "anchor block" allocation involves searching for the smallest free block that is greater than or equal to the requested size.
If the chosen block is the same size as the requested size, we are done.
If the chosen block is greater than the requested size, we break the block into small blocks (buddies) and go to step 1). The algorithm for breaking down a block is described 2.4.2.
If all free blocks are smaller than the requested size, the largest free block is chosen as the "anchor block" and the requested size is decremented by the size of the "anchor block".
After an "anchor block" has been selected, subsequent blocks are chosen that are as close as possible to the "anchor block", where closeness is defined as the sum of the distance between the four corners of the anchor block and the new candidate block. The distance between two points (x, y) and (X,Y) is defined as:
distance = (x - X)² + (y - Y)²
where x, y define the number of nodes in the horizontal and vertical directions from the upper left corner of the mesh.
If the size of the closest free block is greater then the remaining requested size, the block will be broken down into small blocks (buddies). Then step 5) is repeated.
Steps 5) and 6) are repeated until the remaining requested size is zero.

2.4.2 Buddy generation algorithm

In the M2DB system, each block is a rectangular submesh and is represented by <x, y, h, w> where <x, y> is the location of the upper left most node, and h and w represent the height and width, respectively of the rectangle.

In the normal 2D buddy system, a square block is split into four square buddies with identical size by splitting h and w in half. In the case of M2DB, since a block can be a rectangle of arbitrary w and h, we can split w and h in an arbitrary way and create buddies of arbitrary size. However, although it is no longer required that a job request size must be of power of 2 integer as in the case of the hypercube architecture, it is our experience that an overwhelming majority of the job request sizes are powers of 2. Therefore, a goal of the buddy generation algorithm should be to make the scheduling of jobs requesting a number of nodes that is a power of two as easy as possible.

Our M2DB algorithm breaks down a block into two or four smaller blocks (buddies), depending on the dimension (h and w) of the block. The dimensions of the buddies do not necessarily have to be identical. The M2DB Buddy generation algorithm is described in the following:

Determine the largest power of 2 integers, one for the width (W2) and one for the height (H2) that are less than or equal to the width (w) and the height (h), respectively, of the block to be split.

                H2 <= h
                W2 <= w

Depending on the values of h and w, the following situations arise :

h == w == H2 == W2, a power of 2 square mesh. The block will be broken down into four identical square buddies in the same manner as the normal 2D Buddy system. If the original block is represented by <x, y, w, h>, the four buddies are :
```
	<x, 		y, 		W2/2, 	H2/2>
	<x + W2/2, 	y, 		W2/2, 	H2/2>
	<x + W2/2, 	y + H2/2, 	W2/2, 	H2/2>
	<x ,		y + H2/2, 	W2/2, 	H2/2>
```
h == H2, w == W2, h > w, a power of 2 rectangular mesh. The block will be broken down into two identical buddies:
```
	<x, 	y, 		2, 	H2/2>
	<x, 	y + H2/2, 	W2, 	H2/2>
```
Two identical buddies are created by splitting h (the longer side) in half.
h == H2, w == W2, h < w, a power of 2 rectangular mesh. The block will be broken down into two identical buddies:
```
	<x,		y, 	W2/2, 	H2>
	<x + W2/2, 	y, 	W2/2, 	H2>
```
Two identical buddies are created by splitting w (the longer side) in half.
h == H2, w != W2, a rectangular mesh with a power of 2 only on one side. The block will be broken down into two unequal buddies:
```
	<x, 		y, 	W2, 		H2>
	<x + W2, 	y, 	w - W2, 	H2>
```
The non-power-of-2 side w is split into W2 and (w - W2) creating one large power-of-2 rectangular/square mesh and one smaller rectangular mesh.
h != H2, w == W2, a rectangular mesh with a power of 2 only on one side. The block will be broken down into two unequal buddies:
```
	<x, 	y, 		W2, 	H2>
	<x, 	y + H2, 	W2, 	h - H2>
```
The non-power-of-2 side h is split into H2 and (h - H2) creating one large power-of-2 rectangular/square mesh and one smaller rectangular mesh.
h != w, h != H2, w != W2, a non-power-of-2 rectangular mesh. The block will be broken down into four unequal buddies:
```
	<x, 		y, 		W2, 		H2>
	<x + W2, 	y, 		w - W2, 	H2>
	<x + W2, 	y + H2, 	w - W2, 	h - H2>
	<x, 		y + H2, 	W2, 		h - H2>
```
The non-power-of-2 side h is split into H2 and (h - H2), and w is split into W2 and (w - W2) creating one large power-of-2 rectangular/square mesh and three smaller rectangular meshes.

The deallocation procedure returns all blocks owned by a job to the free pool and combines the buddies to restore the parent block.

Similar to the Multiple Buddy Strategy (MBS), the overhead associated with M2DB for both allocation and deallocation is O(n) for a system with n nodes since a maximum of n blocks will be allocated or deallocated.

3.0 Workload characteristics and Performance

At SDSC, the 416 nodes are partitioned as follows:

For normal UNIX interactive login and uses.	6 nodes in the service partition
Nodes where the I/O servers reside. Unreachable by users.	10 I/O nodes
For use by parallel jobs.	400 nodes in the compute partition

The compute partition is further divided into a 32-node Interactive partition and a 368-node Batch partition. Of the 368 nodes in the Batch partition, 256 nodes have 32-MB of memory and 112 have 16-MB of memory. In the batch scheduler, these nodes are partitioned into three "node sets" :

node set 1	112 nodes with 16 MB of memory
node set 2	224 nodes with 32 MB of memory
node set 3	32 nodes with 32 MB of memory for short jobs

During prime time (which runs from 9:00 AM to 5:00 PM), "node set" 3 is reserved for the exclusive use of short jobs (1 hr. time limit) but is added back into the 32 MB node pool during non-prime time.

Multiple "node groups" were constructed with various combinations of these "node sets" to support NQS queues with different size and time limits. The allowed node sizes ranged from 1 to 256 nodes and the default time limit varied from 1 to 12 hours.

3.2 Workload characteristics

The Workload characteristics and performance analyses were carried out over a six month period from 04/01/95 to 09/30/95.

Table 1 gives a summary of the overall system performance and node utilization during the six month period. As can be seen in the table, most of the usage (97.8%) occurred in the Batch partition even though more Interactive jobs were processed. This is because most Interactive jobs are small, short running jobs. At any given time, there were on average 7.7 jobs running in the system.

Table 1 - Overall system performance and node utilization (04/01/95-09/30/95)
System down time as percent of wall clock time (*)	3.8 %
System up time as percent of wall clock time	96.2 %
Total number of Interactive jobs	22536
Total number of Batch jobs	13224
Average number of jobs running at any given time	7.7
Interactive partition Usage as percent of total usage (**)	2.2 %
Batch partition Usage as percent of total usage	97.8 %
Interactive partition Utilization while the system was up (***)	20.5 %
Batch partition Utilization while the system was up	78.2 %
Overall Utilization (Interactive and Batch)	73.6 %

* - System down time includes both scheduled hardware and software maintenance, and unscheduled interrupts.

** - Usage is computed in terms of node hours. i.e., the product of node_used and execution_time summed over jobs.

*** - Utilization is the ratio of the node hours actually in use as compared with the available node hours (total number of nodes times the wall clock time) expressed as a percentage.

As expected, the node utilization in the Interactive partition is low (20.5%). This is particularly true during non-prime time. However, the utilization of the Batch partition is reasonably high (78.2 %), especially considering the fact that 32 nodes have been reserved for the exclusive use of short jobs (1 hr. limit) during prime time. These nodes tended to be under-utilized during this time period. The overall utilization (Interactive and Batch) is 73.6 %.

Table 2 gives the overall utilization on a monthly basis from April through September, 1995. June and August usage is low because of summer vacation during the school year.

Table 2 - Monthly node utilization
Month	Utilization
4	76.3
5	77.8
6	65.7
7	77.9
8	68.2
9	74.9
AVE	73.7

Table 3 gives the usage distribution as a function of request size. The request size with the highest usage is 64 nodes, followed by 128 nodes and 32 nodes. Also given in the table are the number of jobs processed as a function of request size. In the Batch partition, excluding system backup jobs that ran on a single node, the request size with the largest number of jobs processed is 64, followed by 32 and 16. For the Interactive partition, a request size of 4 nodes is most popular, followed by 16 and 8.

Table 3 - Usage distribution by request size
Request size	Usage Distribution (% of total)	Number of batch jobs	Number of interactive jobs
1	0.15	4987	3364
2	0.05	65	3391
4	0.27	785	6683
8	4.88	1267	3543
16	7.74	1642	3676
32	16.22	2095	1845
64	39.69	3477	12
128	25.09	629	18
256	5.86	122	4

Table 4 gives the number of jobs processed as a function of execution time. An execution time of one hour or less was most popular. For longer running jobs, the 12 hour execution time was quite common reflecting the maximum run time limit of 12 hours for batch jobs. Interactive jobs have a soft time limit of 30 minutes.

Table 4 - Execution time distribution
Execution Time (hr)	Number of Batch jobs	Number of Interactive jobs
1.0	8555	22308
2.0	560	88
3.0	352	29
4.0	511	22
5.0	514	19
6.0	293	6
7.0	282	3
8.0	394	2
9.0	730	5
10.0	151	4
11.0	144	3
12.0	738	47

4.0 Performance of the Batch Scheduler

The Batch scheduler performed reasonably well in terms of utilizing the Batch partition. As shown in Table 1, the average utilization in the Batch partition over the six month period was 78.2 %. The 21.8 % idle time was caused by the following factors :

Not enough jobs to keep all nodes busy.
Partition fragmentation - There may be free nodes in the system, but too few to hold the smallest job in the batch queue.
Scheduling policy - For example, reserving 32 nodes for the exclusive use of short running jobs, blocking for large job scheduling, etc.

Table 5 gives a summary of scheduling throughput measured in terms of expansion factor and the average wait time for each NQS queue. The average wait time is the average time the jobs have been waiting in a queue before they are scheduled for execution. The Expansion factor is calculated from the following equation :


	Expansion factor = (execution_time + wait_time) / execution_time

It is a ratio of the total wall clock time (from job submission time to the end of execution time) to the actual execution time. It is a measure of how quickly a job gets done in competition with the submitted job load compared to immediate execution of the job on a dedicated system.

The characteristics of each queue are encoded in the queue names as follows:

Each queue name starts with the character "q", followed by an optional character "f", followed by an integer, and then followed by one of the characters - "s", "m" or "l". "f" means fat nodes with 32 MB of memory. If "f" is not specified, the 16 MB node is assumed. The integer describes the maximum number of nodes that can be requested for each job.

	"s" means short jobs with a time limit of 1 hour.
	"m" means medium jobs with a time limit of 4 hours.
	"l" means long jobs with a time limit of 12 hours.

In addition, there are two low priority queues called fstandby (for 32 MB nodes) and standby (for 16 MB nodes), with a time limit of 12 hours.

For example, the queue name "qf128m" means it is a 32 MB node queue with a maximum of 128 nodes and a time limit of 12 hours for each job.

Table 5 - Scheduling throughput and Expansion factor
Queue	Number of jobs	Usage % of total	av. wait time (hours)*	av. execution time (hours)	expansion factor **
q4t	1635	0.0	0.1	0.0	71.9
q4s	684	0.0	6.3	0.1	51.0
q8s	107	0.0	0.0	0.2	1.2
q16s	366	0.0	7.0	0.1	93.6
q32s	267	1.2	1.1	1.0	2.1
q64s	226	0.3	1.0	0.3	4.9
q32m	299	1.2	2.6	1.6	2.7
q64m	110	0.5	2.0	1.1	2.8
q128m	165	0.9	7.3	0.5	15.4
q256m	177	0.4	9.1	1.6	6.8
q1l	3080	0.1	2.3	0.3	8.6
q32l	1178	8.4	2.6	5.2	1.5
q64l	724	19.5	5.7	5.0	2.1
q128l	173	8.8	16.8	4.6	4.6
qf8s	155	0.0	0.7	0.1	7.2
qf16s	254	0.0	1.4	0.1	15.7
qf32s	348	0.6	1.3	0.5	3.4
qf32m	659	1.5	1.0	1.2	1.8
qf64m	182	0.5	2.7	0.7	4.6
qf128m	75	0.1	5.3	0.2	24.7
qf256m	14	0.1	32.5	0.6	59.3
qf32l	500	8.9	4.5	6.9	1.7
qf64l	259	9.1	8.6	6.7	2.3
qf128l	109	6.6	14.4	5.6	3.6
qf256l	74	7.4	23.7	5.6	5.2
standby	1132	10.3	6.5	6.5	2.0
fstandby	271	13.4	72.3	7.2	11.1

* Ave wait time - Average time the jobs have been waiting in queue before they were scheduled for execution.

** Expansion factor = (execution_time + wait_time) / execution_time

As can be seen, the expansion factor varies from 1.2 to 93.6. The queue with the largest expansion factor had poor turnaround primarily because the execution time was very short compared to the wait time. In this situation, the absolute value of the wait time is more meaningful.

In general, the following conclusions can be made :

The wait time increases with increasing number of nodes requested.
The wait time is larger for jobs requesting 32-MB memory nodes than for jobs requesting 16-MB nodes.
The wait time increases for jobs requesting larger amounts of time. Large long-running jobs block other large long-running jobs from executing, limiting the overall throughput of the queue.

Overall, the throughput of all job classes was reasonably good. For jobs requesting 64 nodes or less, the average wait time for most of the job classes was significantly less than 10 hours.

5.0 Summary and Conclusions

This paper describes the design and performance of a batch scheduler for the Intel Paragon system that addresses the various issues associated with a multi-user production environment. Issues addressed include scheduling for heterogeneous nodes, scheduling for long jobs, scheduling for large jobs, prime/non-prime time modes and node allocation schemes. A Modified 2-D Buddy system (M2DB) for non-contiguous node allocation wan implemented and its performance studied in this paper.

The use of non-contiguous allocation scheme resulted in reasonably good node utilization in the Batch partition. The utilization for the six months studied was 78.2 %. This is quite similar to the 77 % utilization achieved with the Multiple Buddy Strategy (MBS) described in Ref. [3] which is also a non-contiguous allocation strategy.

Overall, the scheduler performed as expected. The throughput of all job classes was reasonably good. The average wait time for short-running jobs was reasonably short. Large jobs had to wait longer to be scheduled with the average wait time generally less than 24 hours.

The 78.2 % utilization achieved is probably near optimum after taking into consideration the effects of scheduling policies (e.g., reserving nodes for short-running jobs and blocking for large jobs) and fragmentation. The system utilization is expected to improve if time sharing (more than one job can be scheduled to run in a node at a time) becomes practical on the Paragon system. The current scheduling scheme assumes only space sharing. i.e., a node can be used by only one job at a time.

With time sharing, blocking for large jobs is no longer necessary. Large jobs can be scheduled immediately to time share with other jobs if not enough free nodes are available. The need to reserve nodes for short running jobs also diminishes because these jobs can also be scheduled to run immediately to time share with other jobs. The effect of fragmentation may diminish if the scheduler over-subscribes the nodes through time sharing.

Our current batch scheduler is capable of scheduling jobs on a time share basis through tiling of jobs and gang-scheduling of overlapping jobs. However, time sharing of jobs places a heavy burden on the paging system and is not practical at this time due to performance and space limitations.

6.0 Acknowledgments

This project was funded by the SSD (Supercomputer Systems Division) of Intel Corporation. We would like to thank Joe Carter, Jerry Kearns, Roy Larson, and Don Ochoa of Intel Corporation for their support and guidance.

References

[1] B. Kingsbury, H. Walter, M. Bridge and T. Carver. "NQS, Network Queuing System, Version 2.0", Cosmic Program # ARC-13179.

[2] L. M. Ni and P. K. McKinley. "A survey of wormhole routing techniques in direct networks". IEEE Trans. Computers, 1993.

[3] V. Lo, K. Windisch, W. Liu and B. Nitzberg. "Non-contiguous Processor Allocation Algorithm for Mesh-connected Multicomputers", paper to be published.

[4] K. Li and K. Cheng. "A two-dimensional buddy system for dynamic resource allocation in a partitionable mesh connected system". Journal of Parallel and Distributed Computing, 12:79-83, 1991.