Abstract
As the system usage model for scalable parallel processors evolves from the single-user, dedicated access model to a multi-user production environment, a versatile batch scheduler is needed to match user requirements to system resources. This paper describes the design and performance of a batch scheduler for the Intel Paragon system that addresses the issues associated with a multi-user production environment, including scheduling for heterogeneous nodes, scheduling for long-running jobs, scheduling for large jobs, prime/non-prime time modes, and node allocation schemes. The Modified 2-D Buddy system (M2DB) for non-contiguous node allocation is introduced and studied in this paper.
As large parallel processor systems become heavily used, there is a need for the system usage model to evolve from single-user, dedicated access to a much more versatile multi-user production environment. A key component for satisfying this need is a job scheduler capable of handling the scheduling of both interactive and batch jobs submitted by multiple users. This paper describes the design and performance of a batch scheduler for a 416-node Intel Paragon system in production use at the San Diego Supercomputer Center (SDSC).
The usage requirements for interactive and batch jobs can be quite different. Typically, interactive usage is for debugging and short production jobs that use a small number of nodes, while batch usage is mostly for long production jobs using a large number of nodes. To handle the disparate node size and execution time scales, space sharing is used. The nodes of the Paragon may be divided into groups called partitions. In our usage model, two non-overlapping partitions -- the Interactive and Batch partitions, -- are used. Interactive jobs are processed in the Interactive partition, and long-running production jobs are executed in the Batch partition. A concern with this type of space sharing is that the Interactive partition may be idle a large fraction of the time, and typically is idle at night. This concern is mitigated when the number of nodes devoted to the Interactive partition is relatively small when compared to the total number of nodes in the computer.
The scheduling of jobs in the Interactive partition is supported by a sequential node assignment mechanism provided by the Paragon operating system. Jobs are scheduled on a first-come-first-serve basis. An interactive job is rejected if there are not enough free nodes to satisfy the request.
The scheduling of jobs in the Batch partition is handled by the SDSC batch scheduler which is the subject of this paper. The batch scheduler is based on the Network Queuing System (NQS) [1]. NQS is a client/server queuing system originally designed for vector supercomputers running the traditional UNIX operating system. The SDSC batch scheduler uses the job queuing framework provided by NQS to which we have added a scheduling layer for assigning node partitions on the Paragon parallel computer.
The design requirements for the parallel batch scheduler are based on the need at SDSC to handle multiple classes of batch jobs, a heterogeneous hardware configuration, and time varying usage patterns. The related issues are:
* Node allocation - Allocating nodes to an application involves the selection of a set of nodes from a pool of free nodes in the 2-D mesh. The Paragon architecture allows non-contiguous node allocation; the nodes of a given parallel job do not have to be physically adjacent. This flexibility has the potential of greatly improving the node utilization of the system. This is particularly true after the system has been up for some time at which point the free nodes in the system tend to be fragmented. Non-contiguous allocation , however, may have the problem of message contention because of the potential for longer message paths (and more hops) and interference with messages from other jobs. Fortunately, with the wormhole routing [2] technology used in the Paragon communication backplane, the number of hops between nodes is no longer the dominant factor affecting message latency as shown in a communication contention study by Lo, et. al. [3]. Hence, the primary performance objective of the SDSC scheduling scheme is maximizing node utilization through non-contiguous allocation. The scheduling scheme does the following:
The Paragon parallel batch scheduler is based on version 2 of the NQS queuing system. A scheduling layer was added to NQS to schedule jobs for the Paragon system. The following gives a summary of the enhancements:
A new option "-lP" has been added to the "qsub" command which is used for submitting batch jobs. This new option allows users to request a specific number of nodes on which to run the batch job.
In addition, a new subcommand has been added to the "qmgr" command. The
"qmgr" command is an interactive program used by the NQS administrator to
control and configure the local NQS system. The new subcommand
        set per_request ncpus = N   QUEUE 
allows an NQS administrator to specify the default number of nodes (N) fora specific batch queue (QUEUE). That is, if a job is submitted to a given batch queue without specifying the number of nodes for the job, it will default to the ncpus value of the specified queue.
In order to provide scheduling support for heterogeneous nodes and node
reservation for short running jobs during prime time, it is necessary to provide a framework for partitioning the nodes in the system. The framework also provides a way to associate scheduling policies with physical node sets.
Nodes are first partitioned into non-overlapping "node sets". The physical attributes of nodes in a given "node set" should be identical because they are treated as such by the scheduler.
"Node groups" are then constructed by combining one or more "node sets".
Overlapping "node groups" are allowed. i.e., a given "node set" can be assigned to more than one "node group". Each NQS queue is associated with a single "node group". Scheduling policies are enforced by linking each NQS queue to a particular usage requirement. Queues may be set up for short-running jobs or for jobs needing large memory nodes, or nodes with multiple CPUs. The following simple example illustrates the use of this framework for the scheduling of different types of nodes.
Consider a system with two different types of nodes - 8 nodes with 16 MB of memory and 16 nodes with 32 MB of memory. They are partitioned into two "node sets":
        node_set 1: 16 nodes with 32 MB of memory
        node_set 2:  8 nodes with 16 MB of memory
We create three "node groups" with the following "node set" assignment:
                    	     
| "node group" | "node sets" assigned to the "node group" | 
|---|---|
| 1 | 1 | 
| 2 | 2 | 
| 3 | 1,2 | 
"Node groups" 1 and 2 are for jobs that only want to run on the 32 MB and 16 MB memory nodes, respectively. "Node group" 3 is for jobs that can use nodes with either memory size.
The "node sets" associated with a given "node group" can be switched as a function of the time period (prime, non-prime or both) during which the nodes will be used. For example:
| "node group" | "node sets" to use during prime time | "node sets" to use during non-prime time | 
|---|---|---|
| 4 | 1 | 1 | 
| 5 | - | 1 | 
"Node group" 4 uses "node set" 1 during both prime and non-prime times, but "node group" 5 can only use "node set" 1 during non-prime time. If we assign "node group" 4 to short running jobs and "node group" 5 to long running jobs, we in effect allow only short running jobs to use "node set" 1 during prime time. This configuration will give good throughput for short running jobs during the day.
Once a job has been selected for allocation, the scheduler selects free nodes by scanning the "node sets" associated with the "node group". It may take up to two passes to accomplish the task. During the first pass, the scheduler tries to find the "node set" with the minimum number of free nodes that can satisfy the entire request. If such a "node set" cannot be found, the scheduler will allocate free nodes from "the node sets" in the order specified in the definition of the "node group" that is associated with the NQS queue to which the job was originally submitted.
Within a "node set", nodes are allocated using the modified 2-D buddy strategy described in section 2.4.
The original NQS system scheduled jobs according to the "queue priority" assigned to each NQS queue. Jobs were queued up at each NQS queue in the order of submission time. When NQS scheduled jobs to run, it looked through each queue, one queue at a time, starting with the queue with the highest "queue_priority". This simple scheme does not allow a job queued in a low priority queue to be scheduled ahead of jobs in the high priority queue, no matter how long it has been waiting in the queue.
To enhance scheduling flexibility, we introduced the concept of aging. With this scheme, the job priority is calculated as follows:
job_priority = queue_priority + age_factor * timewhere
queue_priority = priority associated with the queue to which the job is submitted. age_factor = A constant factor to be defined by the NQS administrator. time = time in hours the job has been waiting in the queue.
For scheduling purposes, all jobs are sorted in the order of job_priority and the scheduler picks jobs to run starting with the highest priority job. If there are not sufficient free nodes to schedule the highest priority job, the next highest priority job will be considered and so on.
As noted earlier, large jobs have a more difficult time getting scheduled to run then smaller jobs. To ensure that large jobs do get scheduled, we introduced the concept of blocking controlled by a constant "block_priority" that is set by the NQS administrator. Under blocking, nodes are deliberately kept idle until enough free nodes are available to schedule the large-node job. If blocking is enabled (by setting the "block_priority" to greater than zero), the scheduler will check whether the "job_priority" of the highest priority job is greater than the "block_priority". If it is and if there are not enough free nodes to schedule this job, scheduling of new jobs will be blocked until there are sufficient free nodes for this job. The idea is that if a large job is an important job (high queue_priority) or if it has been waiting in queue for a while (aging), idle nodes will be accrued through the blocking mechanism until the job is able to run.
The two-dimensional buddy system proposed by Li and Cheng [4] is an allocation strategy for a square mesh-connected system. The strategy is only applicable to situations where all incoming jobs are square submeshes of dimension (n x n) and the mesh is a square mesh of size (N x N), where both n and N are power of 2 integers. This geometric limitation could have an adverse
effect on the node utilization of the system if user jobs use other node sizes. The allocation and deallocation overheads of the strategy are both O(log n).
The Paragon architecture is typically a non-square mesh. The Paragon OS can allocate non-square and non-contiguous nodes to a job. The nodes of a given parallel job do not have to be physically adjacent. We introduce an allocation strategy - Modified 2-D Buddy System (M2BD), which is an extension of the 2-D buddy strategy. It uses the basic tree data structure of the 2-D Buddy System, but can take advantage of the geometric flexibility provided by the Paragon architecture. Basically, this strategy attempts to schedule a job by allocating one or more rectangular blocks. The blocks allocated to a job can be non-contiguous.
As described in section 2.2, the Paragon mesh is configured into disjoint "node sets"; "node groups" are then constructed as groups of these "node sets". Each node set is a rectangular submesh of nodes with the same physical attributes. The M2BD system is used to allocate nodes within each "node set". This strategy is similar to the Multiple Buddy Strategy (MBS) proposed by Lo et. al. [3]. The primary difference between the two strategies is in the Buddy generation scheme described in 2.4.2..
Similar to the 2-D buddy system, M2DB attempts to satisfy a request by repeatedly breaking down the rectangular submesh of a "node set" into smaller blocks. The search algorithm is governed by the following steps:
distance = (x - X)2 + (y - Y)2
where x, y define the number of nodes in the horizontal and vertical directions from the upper left corner of the mesh.
In the M2DB system, each block is a rectangular submesh and is represented by <x, y, h, w> where <x, y> is the location of the upper left most node, and h and w represent the height and width, respectively of the rectangle.
In the normal 2D buddy system, a square block is split into four square buddies with identical size by splitting h and w in half. In the case of M2DB, since a block can be a rectangle of arbitrary w and h, we can split w and h in an arbitrary way and create buddies of arbitrary size. However, although it is no longer required that a job request size must be of power of 2 integer as in the case of the hypercube architecture, it is our experience that an overwhelming majority of the job request sizes are powers of 2. Therefore, a goal of the buddy generation algorithm should be to make the scheduling of jobs requesting a number of nodes that is a power of two as easy as possible.
Our M2DB algorithm breaks down a block into two or four smaller blocks (buddies), depending on the dimension (h and w) of the block. The dimensions of the buddies do not necessarily have to be identical. The M2DB Buddy generation algorithm is described in the following:
Determine the largest power of 2 integers, one for the width (W2) and one for the height (H2) that are less than or equal to the width (w) and the height (h), respectively, of the block to be split.
                H2 <= h
                W2 <= w
Depending on the values of h and w, the following situations arise :
<x, y, W2/2, H2/2> <x + W2/2, y, W2/2, H2/2> <x + W2/2, y + H2/2, W2/2, H2/2> <x , y + H2/2, W2/2, H2/2>
<x, y, 2, H2/2> <x, y + H2/2, W2, H2/2>Two identical buddies are created by splitting h (the longer side) in half.
<x, y, W2/2, H2> <x + W2/2, y, W2/2, H2>Two identical buddies are created by splitting w (the longer side) in half.
<x, y, W2, H2> <x + W2, y, w - W2, H2>The non-power-of-2 side w is split into W2 and (w - W2) creating one large power-of-2 rectangular/square mesh and one smaller rectangular mesh.
<x, y, W2, H2> <x, y + H2, W2, h - H2>The non-power-of-2 side h is split into H2 and (h - H2) creating one large power-of-2 rectangular/square mesh and one smaller rectangular mesh.
<x, y, W2, H2> <x + W2, y, w - W2, H2> <x + W2, y + H2, w - W2, h - H2> <x, y + H2, W2, h - H2>The non-power-of-2 side h is split into H2 and (h - H2), and w is split into W2 and (w - W2) creating one large power-of-2 rectangular/square mesh and three smaller rectangular meshes.
The deallocation procedure returns all blocks owned by a job to the free pool and combines the buddies to restore the parent block.
Similar to the Multiple Buddy Strategy (MBS), the overhead associated with M2DB for both allocation and deallocation is O(n) for a system with n nodes since a maximum of n blocks will be allocated or deallocated.
| For normal UNIX interactive login and uses. | 6 nodes in the service partition | 
| Nodes where the I/O servers reside. Unreachable by users. | 10 I/O nodes | 
| For use by parallel jobs. | 400 nodes in the compute partition | 
The compute partition is further divided into a 32-node Interactive partition and a 368-node Batch partition. Of the 368 nodes in the Batch partition, 256 nodes have 32-MB of memory and 112 have 16-MB of memory. In the batch scheduler, these nodes are partitioned into three "node sets" :
| node set 1 | 112 nodes with 16 MB of memory | 
| node set 2 | 224 nodes with 32 MB of memory | 
| node set 3 | 32 nodes with 32 MB of memory for short jobs | 
During prime time (which runs from 9:00 AM to 5:00 PM), "node set" 3 is reserved for the exclusive use of short jobs (1 hr. time limit) but is added back into the 32 MB node pool during non-prime time.
Multiple "node groups" were constructed with various combinations of these "node sets" to support NQS queues with different size and time limits. The allowed node sizes ranged from 1 to 256 nodes and the default time limit varied from 1 to 12 hours.
The Workload characteristics and performance analyses were carried out over a six month period from 04/01/95 to 09/30/95.
Table 1 gives a summary of the overall system performance and node utilization during the six month period. As can be seen in the table, most of the usage (97.8%) occurred in the Batch partition even though more Interactive jobs were processed. This is because most Interactive jobs are small, short running jobs. At any given time, there were on average 7.7 jobs running in the system.
| Table 1 - Overall system performance and node utilization (04/01/95-09/30/95) | |
|---|---|
| System down time as percent of wall clock time (*) | 3.8 % | System up time as percent of wall clock time | 96.2 % | 
| Total number of Interactive jobs | 22536 | 
| Total number of Batch jobs | 13224 | 
| Average number of jobs running at any given time | 7.7 | 
| Interactive partition Usage as percent of total usage (**) | 2.2 % | 
| Batch partition Usage as percent of total usage | 97.8 % | 
| Interactive partition Utilization while the system was up (***) | 20.5 % | 
| Batch partition Utilization while the system was up | 78.2 % | 
| Overall Utilization (Interactive and Batch) | 73.6 % | 
* - System down time includes both scheduled hardware and software maintenance, and unscheduled interrupts.
** - Usage is computed in terms of node hours. i.e., the product of node_used and execution_time summed over jobs.
*** - Utilization is the ratio of the node hours actually in use as compared with the available node hours (total number of nodes times the wall clock time) expressed as a percentage.
As expected, the node utilization in the Interactive partition is low (20.5%). This is particularly true during non-prime time. However, the utilization of the Batch partition is reasonably high (78.2 %), especially considering the fact that 32 nodes have been reserved for the exclusive use of short jobs (1 hr. limit) during prime time. These nodes tended to be under-utilized during this time period. The overall utilization (Interactive and Batch) is 73.6 %.
Table 2 gives the overall utilization on a monthly basis from April through September, 1995. June and August usage is low because of summer vacation during the school year.
| Table 2 - Monthly node utilization | |
|---|---|
| Month | Utilization | 
| 4 | 76.3 | 
| 5 | 77.8 | 
| 6 | 65.7 | 
| 7 | 77.9 | 
| 8 | 68.2 | 
| 9 | 74.9 | 
| AVE | 73.7 | 
Table 3 gives the usage distribution as a function of request size. The request size with the highest usage is 64 nodes, followed by 128 nodes and 32 nodes. Also given in the table are the number of jobs processed as a function of request size. In the Batch partition, excluding system backup jobs that ran on a single node, the request size with the largest number of jobs processed is 64, followed by 32 and 16. For the Interactive partition, a request size of 4 nodes is most popular, followed by 16 and 8.
| Table 3 - Usage distribution by request size | |||
|---|---|---|---|
| Request size | Usage Distribution (% of total) | Number
of batch jobs | Number of interactive jobs | 
| 1 | 0.15 | 4987 | 3364 | 2 | 0.05 | 65 | 3391 | 4 | 0.27 | 785 | 6683 | 8 | 4.88 | 1267 | 3543 | 16 | 7.74 | 1642 | 3676 | 32 | 16.22 | 2095 | 1845 | 64 | 39.69 | 3477 | 12 | 128 | 25.09 | 629 | 18 | 256 | 5.86 | 122 | 4 | 
Table 4 gives the number of jobs processed as a function of execution time. An execution time of one hour or less was most popular. For longer running jobs, the 12 hour execution time was quite common reflecting the maximum run time limit of 12 hours for batch jobs. Interactive jobs have a soft time limit of 30 minutes.
| Table 4 - Execution time distribution | ||
|---|---|---|
| Execution Time (hr) | Number of Batch jobs | Number of Interactive jobs | 
| 1.0 | 8555 | 22308 | 
| 2.0 | 560 | 88 | 
| 3.0 | 352 | 29 | 
| 4.0 | 511 | 22 | 
| 5.0 | 514 | 19 | 
| 6.0 | 293 | 6 | 
| 7.0 | 282 | 3 | 
| 8.0 | 394 | 2 | 
| 9.0 | 730 | 5 | 
| 10.0 | 151 | 4 | 
| 11.0 | 144 | 3 | 
| 12.0 | 738 | 47 | 
The Batch scheduler performed reasonably well in terms of utilizing the Batch partition. As shown in Table 1, the average utilization in the Batch partition over the six month period was 78.2 %. The 21.8 % idle time was caused by the following factors :
Table 5 gives a summary of scheduling throughput measured in terms of expansion factor and the average wait time for each NQS queue. The average wait time is the average time the jobs have been waiting in a queue before they are scheduled for execution. The Expansion factor is calculated from the following equation :
Expansion factor = (execution_time + wait_time) / execution_time
It is a ratio of the total wall clock time (from job submission time to the end of execution time) to the actual execution time. It is a measure of how quickly a job gets done in competition with the submitted job load compared to immediate execution of the job on a dedicated system.
The characteristics of each queue are encoded in the queue names as follows:
Each queue name starts with the character "q", followed by an optional character "f", followed by an integer, and then followed by one of the characters - "s", "m" or "l". "f" means fat nodes with 32 MB of memory. If "f" is not specified, the 16 MB node is assumed. The integer describes the maximum number of nodes that can be requested for each job.
"s" means short jobs with a time limit of 1 hour. "m" means medium jobs with a time limit of 4 hours. "l" means long jobs with a time limit of 12 hours.
In addition, there are two low priority queues called fstandby (for 32 MB nodes) and standby (for 16 MB nodes), with a time limit of 12 hours.
For example, the queue name "qf128m" means it is a 32 MB node queue with a maximum of 128 nodes and a time limit of 12 hours for each job.
| Table 5 - Scheduling throughput and Expansion factor | |||||
|---|---|---|---|---|---|
| Queue | Number of jobs | Usage % of total | av. wait time (hours)* | av. execution time (hours) | expansion factor ** | 
| q4t | 1635 | 0.0 | 0.1 | 0.0 | 71.9 | 
| q4s | 684 | 0.0 | 6.3 | 0.1 | 51.0 | 
| q8s | 107 | 0.0 | 0.0 | 0.2 | 1.2 | 
| q16s | 366 | 0.0 | 7.0 | 0.1 | 93.6 | 
| q32s | 267 | 1.2 | 1.1 | 1.0 | 2.1 | 
| q64s | 226 | 0.3 | 1.0 | 0.3 | 4.9 | 
| q32m | 299 | 1.2 | 2.6 | 1.6 | 2.7 | 
| q64m | 110 | 0.5 | 2.0 | 1.1 | 2.8 | 
| q128m | 165 | 0.9 | 7.3 | 0.5 | 15.4 | 
| q256m | 177 | 0.4 | 9.1 | 1.6 | 6.8 | 
| q1l | 3080 | 0.1 | 2.3 | 0.3 | 8.6 | 
| q32l | 1178 | 8.4 | 2.6 | 5.2 | 1.5 | 
| q64l | 724 | 19.5 | 5.7 | 5.0 | 2.1 | 
| q128l | 173 | 8.8 | 16.8 | 4.6 | 4.6 | 
| qf8s | 155 | 0.0 | 0.7 | 0.1 | 7.2 | 
| qf16s | 254 | 0.0 | 1.4 | 0.1 | 15.7 | 
| qf32s | 348 | 0.6 | 1.3 | 0.5 | 3.4 | 
| qf32m | 659 | 1.5 | 1.0 | 1.2 | 1.8 | 
| qf64m | 182 | 0.5 | 2.7 | 0.7 | 4.6 | 
| qf128m | 75 | 0.1 | 5.3 | 0.2 | 24.7 | 
| qf256m | 14 | 0.1 | 32.5 | 0.6 | 59.3 | 
| qf32l | 500 | 8.9 | 4.5 | 6.9 | 1.7 | 
| qf64l | 259 | 9.1 | 8.6 | 6.7 | 2.3 | 
| qf128l | 109 | 6.6 | 14.4 | 5.6 | 3.6 | 
| qf256l | 74 | 7.4 | 23.7 | 5.6 | 5.2 | 
| standby | 1132 | 10.3 | 6.5 | 6.5 | 2.0 | 
| fstandby | 271 | 13.4 | 72.3 | 7.2 | 11.1 | 
* Ave wait time - Average time the jobs have been waiting in queue before they were scheduled for execution.
** Expansion factor = (execution_time + wait_time) / execution_time
As can be seen, the expansion factor varies from 1.2 to 93.6. The queue with the largest expansion factor had poor turnaround primarily because the execution time was very short compared to the wait time. In this situation, the absolute value of the wait time is more meaningful.
In general, the following conclusions can be made :
Overall, the throughput of all job classes was reasonably good. For jobs requesting 64 nodes or less, the average wait time for most of the job classes was significantly less than 10 hours.
This paper describes the design and performance of a batch scheduler for the Intel Paragon system that addresses the various issues associated with a multi-user production environment. Issues addressed include scheduling for heterogeneous nodes, scheduling for long jobs, scheduling for large jobs, prime/non-prime time modes and node allocation schemes. A Modified 2-D Buddy system (M2DB) for non-contiguous node allocation wan implemented and its performance studied in this paper.
The use of non-contiguous allocation scheme resulted in reasonably good node utilization in the Batch partition. The utilization for the six months studied was 78.2 %. This is quite similar to the 77 % utilization achieved with the Multiple Buddy Strategy (MBS) described in Ref. [3] which is also a non-contiguous allocation strategy.
Overall, the scheduler performed as expected. The throughput of all job classes was reasonably good. The average wait time for short-running jobs was reasonably short. Large jobs had to wait longer to be scheduled with the average wait time generally less than 24 hours.
The 78.2 % utilization achieved is probably near optimum after taking into consideration the effects of scheduling policies (e.g., reserving nodes for short-running jobs and blocking for large jobs) and fragmentation. The system utilization is expected to improve if time sharing (more than one job can be scheduled to run in a node at a time) becomes practical on the Paragon system. The current scheduling scheme assumes only space sharing. i.e., a node can be used by only one job at a time.
With time sharing, blocking for large jobs is no longer necessary. Large jobs can be scheduled immediately to time share with other jobs if not enough free nodes are available. The need to reserve nodes for short running jobs also diminishes because these jobs can also be scheduled to run immediately to time share with other jobs. The effect of fragmentation may diminish if the scheduler over-subscribes the nodes through time sharing.
Our current batch scheduler is capable of scheduling jobs on a time share basis through tiling of jobs and gang-scheduling of overlapping jobs. However, time sharing of jobs places a heavy burden on the paging system and is not practical at this time due to performance and space limitations.
This project was funded by the SSD (Supercomputer Systems Division) of Intel Corporation. We would like to thank Joe Carter, Jerry Kearns, Roy Larson, and Don Ochoa of Intel Corporation for their support and guidance.