Difference between revisions of "Mosix"
From MosixWiki
(12 intermediate revisions by one user not shown) | |||
Line 1: | Line 1: | ||
+ | |||
+ | |||
MOSIX(M7) MOSIX Description MOSIX(M7) | MOSIX(M7) MOSIX Description MOSIX(M7) | ||
− | + | ||
'''NAME''' | '''NAME''' | ||
− | MOSIX - sharing the power of clusters and multi- | + | MOSIX - sharing the power of clusters and multi-clusters |
'''INTRODUCTION''' | '''INTRODUCTION''' | ||
Line 9: | Line 11: | ||
draw the most out of all the connected computers, including utilization | draw the most out of all the connected computers, including utilization | ||
of idle computers. | of idle computers. | ||
− | + | ||
At the core of MOSIX are adaptive resource sharing algorithms, applying | At the core of MOSIX are adaptive resource sharing algorithms, applying | ||
preemptive process migration based on processor loads, memory and I/O | preemptive process migration based on processor loads, memory and I/O | ||
demands of the processes, thus causing the cluster or the multi-cluster | demands of the processes, thus causing the cluster or the multi-cluster | ||
− | + | to work cooperatively similar to a single computer with many processors. | |
− | + | ||
Unlike earlier versions of MOSIX, only programs that are started by the | Unlike earlier versions of MOSIX, only programs that are started by the | ||
Line 20: | Line 21: | ||
programs are considered as "standard Linux programs" and are not affected | programs are considered as "standard Linux programs" and are not affected | ||
by MOSIX. | by MOSIX. | ||
− | + | ||
MOSIX maintains a high level of compatibility with standard Linux, so that | MOSIX maintains a high level of compatibility with standard Linux, so that | ||
binaries of almost every application that runs under Linux can run com- | binaries of almost every application that runs under Linux can run com- | ||
Line 30: | Line 31: | ||
kill'' option is selected, an error is returned to the program: such pro- | kill'' option is selected, an error is returned to the program: such pro- | ||
grams should probably run as standard Linux programs. | grams should probably run as standard Linux programs. | ||
− | + | ||
In order to improve the overall resource usage, processes of "migratable" | In order to improve the overall resource usage, processes of "migratable" | ||
programs may be moved automatically and transparently to other nodes | programs may be moved automatically and transparently to other nodes | ||
− | within the cluster or even the multi-cluster | + | within the cluster or even the multi-cluster grid. As the demands for |
resources change, processes may move again, as many times as necessary, | resources change, processes may move again, as many times as necessary, | ||
to continue optimizing the overall resource utilization, subject to the | to continue optimizing the overall resource utilization, subject to the | ||
− | inter- | + | inter-cluster priorities and policies. Manual-control over process |
− | + | migration is also supported. | |
MOSIX is particularly suitable for running CPU-intensive computational | MOSIX is particularly suitable for running CPU-intensive computational | ||
Line 43: | Line 44: | ||
with moderate amounts of I/O. Programs that perform large amounts of I/O | with moderate amounts of I/O. Programs that perform large amounts of I/O | ||
are better run as standard Linux programs. | are better run as standard Linux programs. | ||
− | + | ||
Apart from process-migration, MOSIX can provide both "migratable" and | Apart from process-migration, MOSIX can provide both "migratable" and | ||
"standard Linux" programs with the benefits of optimal initial assignment | "standard Linux" programs with the benefits of optimal initial assignment | ||
Line 49: | Line 50: | ||
a job is queued to run later, when resources are available, once it | a job is queued to run later, when resources are available, once it | ||
starts, it remains attached to its original Unix/Linux environment (stan- | starts, it remains attached to its original Unix/Linux environment (stan- | ||
− | dard-input/output/error, signals, etc.). | + | dard-input/output/error, signals, etc.). |
− | + | ||
− | '''REQUIREMENTS''' | + | '''REQUIREMENTS''' |
− | 1. All | + | 1. All nodes must run Linux (any distribution - mixing allowed). |
− | + | ||
− | + | 2. All participating nodes must be connected to a network that supports | |
− | + | TCP/IP and UDP/IP, where each node has a unique IP address in the | |
− | + | range 0.1.0.0 to 255.254.254.255 that is accessible to all the other | |
− | + | nodes. | |
+ | |||
+ | 3. TCP/IP ports 249-254 and UDP/IP ports 249-250 must be available for | ||
+ | MOSIX (not used by other applications or blocked by a firewall). | ||
+ | |||
+ | 4. The architecture of all nodes can be either i386 (32-bit) or x86_64 | ||
(64-bit). Processes that are started on a 32-bit node can migrate | (64-bit). Processes that are started on a 32-bit node can migrate | ||
to a 64-bit node, but not the opposite. | to a 64-bit node, but not the opposite. | ||
− | + | ||
− | + | 5. In multiprocessor nodes (SMP), all the processors must be of the | |
same speed. | same speed. | ||
− | + | ||
− | + | 6. The system-administrators of all the connected nodes must be able to | |
trust each other (see more on SECURITY below). | trust each other (see more on SECURITY below). | ||
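As a rough illustration of requirement 2, the sketch below checks whether an address lies inside the documented range 0.1.0.0 to 255.254.254.255. The function name is_usable_mosix_ip is hypothetical and not part of MOSIX; the check covers only the numeric range, not reachability or uniqueness.

```python
import ipaddress

# Bounds quoted from requirement 2 above.
LOWEST = ipaddress.IPv4Address("0.1.0.0")
HIGHEST = ipaddress.IPv4Address("255.254.254.255")


def is_usable_mosix_ip(text):
    """Return True if text is an IPv4 address inside the documented range.

    Hypothetical helper for illustration; MOSIX itself provides no such call.
    """
    try:
        address = ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False  # not a dotted-quad IPv4 address at all
    return LOWEST <= address <= HIGHEST
```

A check like this could be run over a planned node list before writing the map file.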
+ | |||
+ | '''CLUSTER, MULTI-CLUSTER, PARTITION''' | ||
+ | The MOSIX concept of a "cluster" is a collection of computers that are | ||
+ | owned and managed by the same entity (a person, a group of people or a | ||
+ | project) - this can at times be quite different than a hardware cluster, | ||
+ | as each MOSIX cluster may range from a single workstation to a large com- | ||
+ | bination of computers - workstations, servers, blades, multi-core comput- | ||
+ | ers, etc. possibly of different speeds and number of processors and pos- | ||
+ | sibly in different locations. | ||
+ | |||
+ | A MOSIX multi-cluster is a collection of clusters that belong to differ- | ||
+ | ent entities (owners) who wish to share their resources subject to cer- | ||
+ | tain administrative conditions. In particular, when an owner needs its | ||
+ | computers - these computers must be returned immediately to the exclusive | ||
+ | use of their owner. An owner can also assign priorities to guest pro- | ||
+ | cesses of other owners, defining who can use their computers and when. | ||
+ | Typically, an owner is an individual user, a group of users or a depart- | ||
+ | ment that own the computers. The multi-cluster is usually restricted, | ||
+ | due to trust and security reasons, to a single organization, possibly in | ||
+ | various sites/branches, even across the world. | ||
+ | |||
+ | MOSIX supports dynamic multi-cluster configurations, where clusters can | ||
+ | join and leave at any time. When there are plenty of resources in the | ||
+ | multi-cluster, the MOSIX queuing system allows more processes to start. | ||
+ | When resources become scarce (because other clusters leave or claim their | ||
+ | resources and processes must migrate back to their home-clusters), MOSIX | ||
+ | has a freezing feature that can automatically freeze excess processes to | ||
+ | prevent memory-overload on the home-nodes. | ||
+ | Clusters may also be sub-divided into "partitions". Nodes that are | ||
+ | assigned to different cluster-partitions are halfway between being part | ||
+ | of the cluster and belonging to a different cluster. | ||
+ | |||
+ | Just as within the cluster: | ||
+ | 1. All cluster-partitions appear to other clusters as one cluster | ||
+ | (eliminating the need to inform and update system-administrators of other | ||
+ | clusters about internal changes to one's cluster). | ||
+ | 2. Processes that migrate to another partition share the same top-prior- | ||
+ | ity over processes from other clusters. | ||
+ | 3. Processes that migrate to another partition share the "Cluster" cate- | ||
+ | gory disk-space allocation rather than the "Grid" category for Private | ||
+ | Temporary Files (see below). | ||
+ | |||
+ | However, just as other clusters: | ||
+ | 1. Only processes that were allowed to migrate to other clusters are | ||
+ | allowed to migrate to other partitions. | ||
+ | 2. Batch jobs cannot be assigned to nodes in other partitions. | ||
+ | 3. Each partition maintains its own job-queue. | ||
+ | |||
+ | When you have both 32-bit and 64-bit computers in the same cluster, it is | ||
+ | highly recommended (but not mandatory) to set them up as different clus- | ||
+ | ter partitions. | ||
+ | |||
'''CONFIGURATION''' | '''CONFIGURATION''' | ||
− | + | To configure MOSIX interactively, simply run mosconf: it will lead you | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
step-by-step through the various configuration items. | step-by-step through the various configuration items. | ||
− | + | ||
− | + | Mosconf can be used in two ways: | |
− | configuration. | + | |
− | the | + | 1. To configure the local node (press <Enter> at the first question). |
− | + | ||
+ | 2. To configure MOSIX for other nodes: this is typically done on a | ||
+ | server that stores an image of the root-partition for some or all of | ||
+ | the cluster-nodes. This image can, for example, be NFS-mounted by | ||
+ | the cluster-nodes, or otherwise copied or reflected to them by any | ||
+ | other method: at the first question, enter the path to the stored | ||
+ | root-image. | ||
+ | |||
+ | There is no need to stop MOSIX in order to modify the configuration - | ||
+ | most changes will take effect within a minute. However, after modifying | ||
+ | the list of nodes in the cluster (/etc/mosix/mosix.map) or | ||
+ | /etc/mosix/mosip or /etc/mosix/myfeatures, you should run the command | ||
+ | "setpe" (but when you are using mosconf to configure your local node, | ||
+ | this is not necessary). | ||
+ | |||
+ | Below is a detailed description of the MOSIX configuration files (if you | ||
+ | prefer to edit them manually). | ||
+ | |||
+ | The directory /etc/mosix should include at least the subdirectories | ||
+ | /etc/mosix/partners, /etc/mosix/var, /etc/mosix/var/grid and the follow- | ||
+ | ing files: | ||
+ | |||
/etc/mosix/mosix.map | /etc/mosix/mosix.map | ||
This file defines which computers participate in your MOSIX clus- | This file defines which computers participate in your MOSIX clus- | ||
Line 84: | Line 159: | ||
that can be in any order. It may also include any number of com- | that can be in any order. It may also include any number of com- | ||
ment lines beginning with a '#', as well as empty lines. | ment lines beginning with a '#', as well as empty lines. | ||
− | + | ||
Data lines have 2 or 3 fields: | Data lines have 2 or 3 fields: | ||
− | + | ||
1. The IP ("a.b.c.d" or host-name) of the first node in a range | 1. The IP ("a.b.c.d" or host-name) of the first node in a range | ||
of nodes with consecutive IPs. | of nodes with consecutive IPs. | ||
− | + | ||
2. The number of nodes in that range. | 2. The number of nodes in that range. | ||
− | + | ||
− | 3. Optional combination of letter-flags: | + | 3. Optional combination of letter-flags and/or an integer: |
p[roximate] do not use compression on migration, e.g., over | p[roximate] do not use compression on migration, e.g., over | ||
fast networks or slow CPUs. | fast networks or slow CPUs. | ||
o[utsider] inaccessible to local-class processes. | o[utsider] inaccessible to local-class processes. | ||
− | + | {partition} a positive integer indicating the partition num- | |
+ | ber for that range. | ||
+ | |||
Alias lines are of the form: | Alias lines are of the form: | ||
a.b.c.d=e.f.g.h | a.b.c.d=e.f.g.h | ||
or | or | ||
a.b.c.d=host-name | a.b.c.d=host-name | ||
− | + | ||
− | They | + | They indicate that the IP address on the left-hand-side refers to |
− | same node as the right-hand-side. | + | the same node as the right-hand-side. |
− | + | ||
NOTES: | NOTES: | ||
− | + | ||
1. It is an error to attempt to declare the local node an "out- | 1. It is an error to attempt to declare the local node an "out- | ||
sider". | sider". | ||
− | + | ||
2. When using host names, the first result of gethostbyname(3) | 2. When using host names, the first result of gethostbyname(3) | ||
must return their IP address that is to be used by MOSIX: if | must return their IP address that is to be used by MOSIX: if | ||
in doubt - specify the IP address. | in doubt - specify the IP address. | ||
− | + | ||
3. The right-hand-side in alias lines must appear within the | 3. The right-hand-side in alias lines must appear within the | ||
data lines. | data lines. | ||
− | + | ||
4. IP addresses 0.0.x.x and 255.255.255.x are not allowed in | 4. IP addresses 0.0.x.x and 255.255.255.x are not allowed in | ||
MOSIX. | MOSIX. | ||
− | + | ||
5. If you change /etc/mosix/mosix.map while MOSIX is running, | 5. If you change /etc/mosix/mosix.map while MOSIX is running, | ||
you need to run setpe to notify MOSIX of the changes. | you need to run setpe to notify MOSIX of the changes. | ||
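The format described above can be sketched as a small parser. This is only an illustration under stated assumptions: the function name parse_mosix_map is hypothetical, only numeric IP addresses are handled (host-names would need a resolver), and the optional third field is kept as raw text because its exact syntax is not spelled out here.

```python
import ipaddress


def parse_mosix_map(text):
    """Split map text into expanded data ranges and alias lines.

    Hypothetical helper; numeric IPs only, third field kept uninterpreted.
    """
    nodes = {}    # IP string -> raw third field ("" when absent)
    aliases = {}  # left-hand-side IP -> right-hand-side
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are allowed
        fields = line.split()
        if len(fields) == 1 and "=" in line:
            left, right = line.split("=", 1)
            aliases[left] = right
            continue
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields: {line!r}")
        first = ipaddress.IPv4Address(fields[0])
        count = int(fields[1])
        extra = fields[2] if len(fields) == 3 else ""
        # A data line covers `count` nodes with consecutive IPs.
        for offset in range(count):
            nodes[str(first + offset)] = extra
    # Note 3: the right-hand-side of an alias must appear in the data lines.
    for left, right in aliases.items():
        if right not in nodes:
            raise ValueError(f"alias {left}={right} has no matching data line")
    return nodes, aliases
```

Checking aliases only after all lines are read lets data and alias lines appear in any order, as the text allows.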
− | + | ||
/etc/mosix/secret | /etc/mosix/secret | ||
This is a security file that is used to prevent ordinary users | This is a security file that is used to prevent ordinary users | ||
Line 128: | Line 205: | ||
internal MOSIX TCP ports. The file should contain just a single | internal MOSIX TCP ports. The file should contain just a single | ||
line with a password that must be identical on all the nodes of | line with a password that must be identical on all the nodes of | ||
− | the cluster/ | + | the cluster/multi-cluster. This file must be accessible to ROOT |
− | (chmod 600!) | + | only (chmod 600!) |
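A minimal sketch of how the "chmod 600" requirement could be verified from a script. The helper name is_owner_only is hypothetical; it only checks that no group or other permission bits are set and does not verify that the owner is root.

```python
import os
import stat


def is_owner_only(path):
    """Return True if no group/other permission bits are set on path.

    Hypothetical helper; ownership (root) is not checked here.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0
```

For example, is_owner_only("/etc/mosix/secret") would be expected to return True on a correctly configured node.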
− | + | ||
/etc/mosix/ecsecret | /etc/mosix/ecsecret | ||
Like /etc/mosix/secret, but used for running batch jobs as a | Like /etc/mosix/secret, but used for running batch jobs as a | ||
client (see mosrun(1)). If you do not wish to allow this node to | client (see mosrun(1)). If you do not wish to allow this node to | ||
send batch-jobs, do not create this file. | send batch-jobs, do not create this file. | ||
− | + | ||
/etc/mosix/essecret | /etc/mosix/essecret | ||
Like /etc/mosix/secret, but used for running batch jobs as a | Like /etc/mosix/secret, but used for running batch jobs as a | ||
Line 141: | Line 218: | ||
/etc/mosix/ecsecret. If you do not wish to allow this node to be | /etc/mosix/ecsecret. If you do not wish to allow this node to be | ||
a batch-server, do not create this file. | a batch-server, do not create this file. | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
The following files are optional: | The following files are optional: | ||
− | + | ||
/etc/mosix/mosip | /etc/mosix/mosip | ||
− | This file | + | This file contains this node's IP address, to be used for MOSIX purposes, |
− | + | in the regular format - a.b.c.d. This file is only necessary when | |
− | + | the node's IP address is ambiguous: it can be safely omitted if | |
− | + | the output of ifconfig(8) ("inet addr:") matches exactly one of | |
− | + | the IP addresses listed in the data lines of /etc/mosix/mosix.map. | |
− | + | ||
/etc/mosix/myfeatures | /etc/mosix/myfeatures | ||
This file contains one line of comma-separated topological fea- | This file contains one line of comma-separated topological fea- | ||
tures for this node (if any). For example: yellow,wood,chicken. | tures for this node (if any). For example: yellow,wood,chicken. | ||
− | + | ||
The list of all 32 features (one line per feature) can be found in | The list of all 32 features (one line per feature) can be found in | ||
/etc/mosix/features. | /etc/mosix/features. | ||
− | + | ||
If this file is missing, this node is assumed to have no topologi- | If this file is missing, this node is assumed to have no topologi- | ||
cal features. (see topology(7)) | cal features. (see topology(7)) | ||
− | + | ||
/etc/mosix/freeze.conf | /etc/mosix/freeze.conf | ||
This file sets the automatic freezing policies on a per-class | This file sets the automatic freezing policies on a per-class | ||
Line 175: | Line 244: | ||
in any order and classes that are not mentioned are not touched by | in any order and classes that are not mentioned are not touched by | ||
the automatic freezing mechanisms. | the automatic freezing mechanisms. | ||
− | + | ||
The space-separated constants in each line are as follows: | The space-separated constants in each line are as follows: | ||
1. class-number | 1. class-number | ||
Line 187: | Line 256: | ||
5. minautofreeze (floating point) | 5. minautofreeze (floating point) | ||
Freeze processes that are evacuated back home on arrival if | Freeze processes that are evacuated back home on arrival if | ||
− | load | + | load is equal to or above this |
6. minclustfreeze (floating point) | 6. minclustfreeze (floating point) | ||
Freeze processes that are evacuated back to this cluster on | Freeze processes that are evacuated back to this cluster on | ||
Line 202: | Line 271: | ||
this class. After this period, the running process will be | this class. After this period, the running process will be | ||
frozen and a frozen process will start to run. | frozen and a frozen process will start to run. | ||
− | + | ||
NOTES: | NOTES: | ||
− | + | ||
1. The load-units in fields #3-#6 depend on field #2. If 0, | 1. The load-units in fields #3-#6 depend on field #2. If 0, | ||
each unit represents the load created by a CPU-bound process | each unit represents the load created by a CPU-bound process | ||
Line 212: | Line 281: | ||
the computer and the more processors it has, the load created | the computer and the more processors it has, the load created | ||
by each CPU process decreases proportionally. | by each CPU process decreases proportionally. | ||
− | + | ||
2. Fields #3,#4,#5,#6 are floating-point, the rest are integers. | 2. Fields #3,#4,#5,#6 are floating-point, the rest are integers. | ||
− | + | ||
3. A value of "-1" in fields #3,#5,#6,#8 means ignoring that | 3. A value of "-1" in fields #3,#5,#6,#8 means ignoring that | ||
feature. | feature. | ||
− | + | ||
4. The first 4 fields are mandatory: omitted fields beyond them | 4. The first 4 fields are mandatory: omitted fields beyond them | ||
have the following values: minautofreeze=-1,mincluster- | have the following values: minautofreeze=-1,mincluster- | ||
freeze=-1,min-keep=0, max-procs=-1,slice=20. | freeze=-1,min-keep=0, max-procs=-1,slice=20. | ||
− | + | ||
5. The RED-MARK must be significantly higher than BLUE-MARK: | 5. The RED-MARK must be significantly higher than BLUE-MARK: | ||
otherwise a perpetual cycle of freezing and unfreezing could | otherwise a perpetual cycle of freezing and unfreezing could | ||
occur. You should allow at least 1.1 processes difference | occur. You should allow at least 1.1 processes difference | ||
between them. | between them. | ||
− | + | ||
6. Frozen processes do not respond to anything, except an | 6. Frozen processes do not respond to anything, except an | ||
unfreeze request or a signal that kills them. | unfreeze request or a signal that kills them. | ||
− | + | ||
7. Processes that were frozen manually are not unfrozen automat- | 7. Processes that were frozen manually are not unfrozen automat- | ||
ically. | ically. | ||
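The notes above can be sketched as a parser for one class line. Several points are assumptions made for illustration only: the function name is hypothetical, the line is assumed to have nine fields in total with the five optional ones in the order note 4 lists them, and field #3 is assumed to be the RED-MARK and field #4 the BLUE-MARK.

```python
FLOAT_FIELDS = {3, 4, 5, 6}                          # note 2
DEFAULTS = {5: -1.0, 6: -1.0, 7: 0, 8: -1, 9: 20}    # note 4


def parse_freeze_class_line(line):
    """Return a dict keyed by 1-based field number for one class line.

    Hypothetical helper; the field layout beyond what the notes state
    is an assumption.
    """
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError("the first 4 fields are mandatory")
    if len(tokens) > 9:
        raise ValueError("too many fields")
    fields = dict(DEFAULTS)
    for number, token in enumerate(tokens, start=1):
        fields[number] = float(token) if number in FLOAT_FIELDS else int(token)
    # ASSUMPTION: field 3 is RED-MARK and field 4 is BLUE-MARK.
    red, blue = fields[3], fields[4]
    # Note 5: keep at least 1.1 between the marks (skipped when #3 is -1).
    if red != -1 and red - blue < 1.1:
        raise ValueError("RED-MARK should exceed BLUE-MARK by at least 1.1")
    return fields
```

Rejecting marks that are too close guards against the freeze/unfreeze cycle that note 5 warns about.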
− | + | ||
This file may also contain lines starting with '/' to indicate | This file may also contain lines starting with '/' to indicate | ||
freezing-directory names. A "Freezing directory" is an existing | freezing-directory names. A "Freezing directory" is an existing | ||
Line 239: | Line 308: | ||
tion of freezing-directories should have sufficient free disk- | tion of freezing-directories should have sufficient free disk- | ||
space to contain the memory image of all the frozen processes. | space to contain the memory image of all the frozen processes. | ||
− | + | ||
If more than one freezing directory is listed, the freezing direc- | If more than one freezing directory is listed, the freezing direc- | ||
tory is chosen at random by each freezing process. It is also | tory is chosen at random by each freezing process. It is also | ||
possible to assign selection probabilities by adding a numeric | possible to assign selection probabilities by adding a numeric | ||
weight after the directory-name, for example: | weight after the directory-name, for example: | ||
− | + | ||
/tmp 2 | /tmp 2 | ||
/var/tmp 0.5 | /var/tmp 0.5 | ||
/mnt/tmp 2.5 | /mnt/tmp 2.5 | ||
− | + | ||
In this example, the total weight is 2+0.5+2.5=5, so out of | In this example, the total weight is 2+0.5+2.5=5, so out of | ||
every 10 frozen processes, an average of 4 (10*2/5) will be | every 10 frozen processes, an average of 4 (10*2/5) will be | ||
frozen to /tmp, an average of 1 (10*0.5/5) to /var/tmp and an | frozen to /tmp, an average of 1 (10*0.5/5) to /var/tmp and an | ||
average of 5 (10*2.5/5) to /mnt/tmp. | average of 5 (10*2.5/5) to /mnt/tmp. | ||
− | + | ||
When the weight is missing, it defaults to 1. A weight of 0 means | When the weight is missing, it defaults to 1. A weight of 0 means | ||
that this directory should be used only if all others cannot be | that this directory should be used only if all others cannot be | ||
accessed. | accessed. | ||
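The weight arithmetic above can be reproduced with a few lines of code. This is an illustrative sketch, not MOSIX code: the function name directory_shares is hypothetical, and the special fallback role of a weight of 0 is not modelled beyond giving that directory a zero share.

```python
def directory_shares(lines):
    """Map each freezing directory to its expected share of frozen processes.

    Hypothetical helper; each line is "directory [weight]".
    """
    weights = {}
    for line in lines:
        parts = line.split()
        # A missing weight defaults to 1.
        weights[parts[0]] = float(parts[1]) if len(parts) > 1 else 1.0
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = directory_shares(["/tmp 2", "/var/tmp 0.5", "/mnt/tmp 2.5"])
for name, share in shares.items():
    # Expected number of processes frozen to this directory out of every 10.
    print(name, 10 * share)
```

With the three example lines, the shares come out as 0.4, 0.1 and 0.5, matching the averages of 4, 1 and 5 out of every 10 frozen processes.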
− | + | ||
If no freezing directories are specified, all freezing will be to | If no freezing directories are specified, all freezing will be to | ||
the /freeze directory (or symbolic-link). | the /freeze directory (or symbolic-link). | ||
− | + | ||
Freezing files are usually created with "root" (Super-User) per- | Freezing files are usually created with "root" (Super-User) per- | ||
missions, but if /etc/mosix/freeze.conf contains a line of the | missions, but if /etc/mosix/freeze.conf contains a line of the | ||
Line 268: | Line 337: | ||
(this is sometimes needed when freezing to NFS directories that do | (this is sometimes needed when freezing to NFS directories that do | ||
not allow "root" access). | not allow "root" access). | ||
− | + | ||
/etc/mosix/partners/* | /etc/mosix/partners/* | ||
− | If your cluster is part of a multi-cluster | + | If your cluster is part of a multi-cluster, then each file in |
− | + | /etc/mosix/partners describes another cluster that you want this | |
− | + | cluster to cooperate with. | |
− | + | ||
The file-names should indicate the corresponding cluster-names | The file-names should indicate the corresponding cluster-names | ||
(maximum 128 characters), for example: "geography", "chemistry", | (maximum 128 characters), for example: "geography", "chemistry", | ||
"management", "development", "sales", "students-lab-A", etc. The | "management", "development", "sales", "students-lab-A", etc. The | ||
format of each file is as follows: | format of each file is as follows: | ||
− | + | ||
Line #1: | Line #1: | ||
A verbal human-readable description of the cluster. | A verbal human-readable description of the cluster. | ||
Line #2: | Line #2: | ||
Four space-separated integers as follows: | Four space-separated integers as follows: | ||
− | + | ||
1. Priority: | 1. Priority: | ||
0-65535, the lower the better. | 0-65535, the lower the better. | ||
Line 315: | Line 384: | ||
that are believed to be part of the other cluster, contain- | that are believed to be part of the other cluster, contain- | ||
ing 5 space-separated items as follows: | ing 5 space-separated items as follows: | ||
− | + | ||
1. IP1 (or host-name): | 1. IP1 (or host-name): | ||
First node in range. | First node in range. | ||
Line 336: | Line 405: | ||
slow). | slow). | ||
NOTES: | NOTES: | ||
− | + | ||
1. From time-to-time, MOSIX will consult one or more of the | 1. From time-to-time, MOSIX will consult one or more of the | ||
"core" nodes to find the actual map of their cluster. It is | "core" nodes to find the actual map of their cluster. It is | ||
Line 346: | Line 415: | ||
as part of their cluster by the core-nodes (but they could | as part of their cluster by the core-nodes (but they could | ||
possibly still be used as "core-nodes" to list other nodes) | possibly still be used as "core-nodes" to list other nodes) | ||
− | + | ||
3. All core-nodes must have the same value for "proximate", | 3. All core-nodes must have the same value for "proximate", | ||
because the "proximate" field of unlisted nodes is copied | because the "proximate" field of unlisted nodes is copied | ||
from that of the core-node from which we happened to find | from that of the core-node from which we happened to find | ||
about them and this cannot be ambiguous. | about them and this cannot be ambiguous. | ||
− | + | ||
4. When using host names rather than IP addresses, the first | 4. When using host names rather than IP addresses, the first | ||
result of gethostbyname(3) must return their IP address that | result of gethostbyname(3) must return their IP address that | ||
is used by MOSIX: if in doubt - specify the IP address | is used by MOSIX: if in doubt - specify the IP address | ||
instead. | instead. | ||
− | + | ||
5. IP addresses 0.0.x.x and 255.255.255.x cannot be used in | 5. IP addresses 0.0.x.x and 255.255.255.x cannot be used in | ||
MOSIX. | MOSIX. | ||
− | + | ||
/etc/mosix/userview.map | /etc/mosix/userview.map | ||
Although it is possible to use only IP numbers and/or host-names | Although it is possible to use only IP numbers and/or host-names | ||
− | to specify nodes | + | to specify nodes in your cluster (and multi-cluster), it is more |
− | + | convenient to use small integers as node numbers: this file allows | |
− | + | you to map integers to IP addresses. Each line in this file con- | |
− | + | tains 3 elements: | |
− | + | ||
1. A node number (1-65535) | 1. A node number (1-65535) | ||
2. IP1 (or host-name, clearly identifiable by gethostbyname(3)) | 2. IP1 (or host-name, clearly identifiable by gethostbyname(3)) | ||
3. Number of nodes in range (the number of the last one must not | 3. Number of nodes in range (the number of the last one must not | ||
exceed 65535) | exceed 65535) | ||
− | + | ||
It is up to the cluster administrator to map as few or as many | It is up to the cluster administrator to map as few or as many | ||
− | nodes as they wish out of their cluster and multi-cluster | + | nodes as they wish out of their cluster and multi-cluster - the |
− | + | most common practice is to map all the nodes in one's cluster, but | |
− | + | not in other clusters. | |
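The three-element lines described above can be expanded into a lookup table as sketched below. The function name expand_userview is hypothetical, only numeric IP addresses are handled, and it is assumed that consecutive node numbers in a range correspond to consecutive IP addresses.

```python
import ipaddress

MAX_NODE_NUMBER = 65535


def expand_userview(text):
    """Return {node_number: ip_string} for the lines of a userview map.

    Hypothetical helper; assumes node numbers and IPs advance together.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 3:
            raise ValueError(f"expected 3 elements: {line!r}")
        first_number, count = int(parts[0]), int(parts[2])
        first_ip = ipaddress.IPv4Address(parts[1])
        last_number = first_number + count - 1
        # Node numbers run from 1 to 65535, including the last in a range.
        if first_number < 1 or last_number > MAX_NODE_NUMBER:
            raise ValueError(f"node number out of range: {line!r}")
        for offset in range(count):
            mapping[first_number + offset] = str(first_ip + offset)
    return mapping
```

For instance, the line "1 10.0.0.1 3" would map node numbers 1, 2 and 3 to 10.0.0.1, 10.0.0.2 and 10.0.0.3.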
− | + | ||
/etc/mosix/queue.conf | /etc/mosix/queue.conf | ||
This file configures the queueing system (see mosrun(1), mosq(1)). | This file configures the queueing system (see mosrun(1), mosq(1)). | ||
All lines in this file are optional and may appear in any order. | All lines in this file are optional and may appear in any order. | ||
+ | |||
Usually, one node in each cluster is elected by the system-admin- | Usually, one node in each cluster is elected by the system-admin- | ||
istrator to manage the queue, while the remaining nodes point to | istrator to manage the queue, while the remaining nodes point to | ||
− | that manager. As an exception, in a mixed cluster that has both | + | that manager. As an exception, in a mixed cluster that has both |
− | 32-bit and 64-bit computers, a separate 32-bit node should be | + | 32-bit and 64-bit computers, a separate 32-bit node should be |
− | to exclusively manage the queue for all 32-bit nodes and a 64-bit | + | elected to exclusively manage the queue for all 32-bit nodes and a |
− | + | 64-bit node elected to exclusively manage the queue for all 64-bit | |
+ | nodes. | ||
Defining the queue manager: | Defining the queue manager: | ||
Line 392: | Line 463: | ||
C {hostname} | C {hostname} | ||
assigns a specific node from the cluster (hostname) to manage the | assigns a specific node from the cluster (hostname) to manage the | ||
− | job queue. In the absence of this line, each node manages its own | + | job queue. In the absence of this line, each node manages its own |
− | queue (which is usually undesirable). | + | queue (which is usually undesirable). It is possible to have sev- |
− | + | eral 'C' lines - one for each cluster-partition. | |
+ | |||
Defining the default priority: | Defining the default priority: | ||
− | + | ||
The line: | The line: | ||
P {priority} | P {priority} | ||
Line 402: | Line 474: | ||
The lower this value - the higher the priority. In the absence of | The lower this value - the higher the priority. In the absence of | ||
this line, the default priority is 50. | this line, the default priority is 50. | ||
− | + | ||
Commonly, user-ID's are identical on all the nodes in the cluster. | Commonly, user-ID's are identical on all the nodes in the cluster. | ||
The line (with a single letter): | The line (with a single letter): | ||
Line 409: | Line 481: | ||
(except the Super-User) will be prevented from sending requests to | (except the Super-User) will be prevented from sending requests to | ||
modify the status of queued jobs from this node. | modify the status of queued jobs from this node. | ||
− | + | ||
Configuring the queue manager: | Configuring the queue manager: | ||
− | + | ||
The following lines are relevant only in the queue manager node | The following lines are relevant only in the queue manager node | ||
and are ignored on all other nodes: | and are ignored on all other nodes: | ||
− | + | ||
The MOSIX queueing system determines dynamically how many pro- | The MOSIX queueing system determines dynamically how many pro- | ||
cesses to run. The line: | cesses to run. The line: | ||
Line 424: | Line 496: | ||
sets the upper limit to 20 processes, even when more resources are | sets the upper limit to 20 processes, even when more resources are | ||
available. | available. | ||
− | + | ||
The line: | The line: | ||
X {1 <= x <= 8} | X {1 <= x <= 8} | ||
defines the maximal number of queued processes that may run simul- | defines the maximal number of queued processes that may run simul- | ||
taneously per CPU. This option applies only to processors within | taneously per CPU. This option applies only to processors within | ||
− | the cluster and is not available for other clusters in | + | the cluster and is not available for other clusters in a multi- |
− | (where the queueing system assigns at most one process per CPU). | + | cluster (where the queueing system assigns at most one process per |
− | + | CPU). In the absence of this line the default is | |
X 1 | X 1 | ||
− | + | ||
The line: | The line: | ||
Z {n} | Z {n} | ||
causes the first n jobs of priority 0 to start immediately (out of | causes the first n jobs of priority 0 to start immediately (out of | ||
− | order), without checking | + | order), without checking whether resources are available, leaving that |
responsibility to the system administrator. | responsibility to the system administrator. | ||
− | + | ||
Example: the cluster has 10 dual-CPU nodes, so the queueing system | Example: the cluster has 10 dual-CPU nodes, so the queueing system | ||
normally allows 20 jobs to run. In order to allow urgent jobs to | normally allows 20 jobs to run. In order to allow urgent jobs to | ||
Line 445: | Line 517: | ||
the system administrator configures a line: Z 10, thus allowing | the system administrator configures a line: Z 10, thus allowing | ||
each node to run a maximum of 3 jobs. | each node to run a maximum of 3 jobs. | ||
− | + | ||
+ | The line: | ||
+ | N {n} [{mb}] | ||
+ | causes the first n jobs of each user to start immediately | ||
+ | (out of order), without checking whether resources are available. | ||
+ | Only jobs above that number, per user, will be queued and whenever | ||
+ | the number of a user's running jobs drops below this number, a new | ||
+ | job of that user (if there is any waiting) will start to run. | ||
+ | |||
+ | When the mb parameter is given, only jobs that do not exceed this | ||
+ | amount of memory in MegaBytes will be started this way. | ||
+ | |||
+ | The system-administrator should weigh carefully, based on knowledge | ||
+ | about the patterns of jobs that users typically run, the benefits of | ||
+ | this option against its risks, such as having at times more jobs in | ||
+ | their cluster(s) than available memory to run them efficiently. If | ||
+ | this option is selected with a memory-limitation (mb), then the | ||
+ | system-administrator should request that users always specify the | ||
+ | maximum memory-requirements for all their queued jobs (using "mosrun -m"). | ||
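The N rule described above can be read as a simple predicate. The following is a minimal sketch of that reading, not MOSIX's own code: the function name and parameters are hypothetical, job_mb stands for the memory a job declares (for example with mosrun -m), and a negative mb_limit models an N line without the optional mb field. How MOSIX treats jobs that declare no memory at all is not stated in the text, so the sketch does not cover that case.

```c
/* Sketch of the "N {n} [{mb}]" rule as described in the text.
 * Returns 1 if a newly submitted job would start immediately,
 * 0 if it would be queued.
 * All names are hypothetical; mb_limit < 0 means "no mb field given". */
int starts_immediately(int user_running_jobs, int n, long job_mb, long mb_limit)
{
    if (user_running_jobs >= n)
        return 0;   /* this user already has n jobs running */
    if (mb_limit >= 0 && job_mb > mb_limit)
        return 0;   /* job declares more memory than the N line allows */
    return 1;
}
```

With "N 2 1024", a user with one running job who submits a 512 MB job would bypass the queue, while a 2048 MB job from the same user would be queued.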
+ | |||
Fair-share policy: | Fair-share policy: | ||
The fairness policy determines the order in which jobs are | The fairness policy determines the order in which jobs are | ||
Line 454: | Line 545: | ||
the initial placement in the queue of jobs with the same pri- | the initial placement in the queue of jobs with the same pri- | ||
ority. | ority. | ||
− | + | ||
The default queueing policy is "first-come-first-served". | The default queueing policy is "first-come-first-served". | ||
Alternatively, jobs of different users can be placed in the | Alternatively, jobs of different users can be placed in the | ||
queue in an interleaved manner. | queue in an interleaved manner. | ||
− | + | ||
The line (with a single letter): | The line (with a single letter): | ||
F | F | ||
switches the queueing policy to the interleaved policy. | switches the queueing policy to the interleaved policy. | ||
− | + | ||
The advantage of the interleaved approach is that a user | The advantage of the interleaved approach is that a user | ||
wishing to run a relatively small number of processes, does | wishing to run a relatively small number of processes, does | ||
Line 468: | Line 559: | ||
the queue. The disadvantage is that older jobs need to wait | the queue. The disadvantage is that older jobs need to wait | ||
longer. | longer. | ||
− | + | ||
Normally, the interleaving ratio is equal among all users. | Normally, the interleaving ratio is equal among all users. | ||
For example, with two users (A and B) the queue may look like | For example, with two users (A and B) the queue may look like | ||
A-B-A-B-A-B-A-B. | A-B-A-B-A-B-A-B. | ||
− | + | ||
Each user is assigned an interleave ratio which determines | Each user is assigned an interleave ratio which determines | ||
(proportionally) how well their jobs will be placed in the | (proportionally) how well their jobs will be placed in the | ||
Line 483: | Line 574: | ||
UID can be either numeric or symbolic and there is no limit | UID can be either numeric or symbolic and there is no limit | ||
on the number of these 'U' lines. Examples: | on the number of these 'U' lines. Examples: | ||
− | 1. Two users (A & B): | + | 1. Two users (A & B): U userA 5 |
− | + | ||
(userB is not listed, hence it gets the default of 10) | (userB is not listed, hence it gets the default of 10) | ||
The queue looks like: A-A-B-A-A-B-A-A-B... | The queue looks like: A-A-B-A-A-B-A-A-B... | ||
Line 496: | Line 586: | ||
(userC is not listed, hence it gets the default of 10) | (userC is not listed, hence it gets the default of 10) | ||
The queue looks like: B-C-B-C-B-A-B-C-B-C-B-A-B-C-B-C... | The queue looks like: B-C-B-C-B-A-B-C-B-C-B-A-B-C-B-C... | ||
− | + | ||
Note that since the interleave ratio is determined per pro- | Note that since the interleave ratio is determined per pro- | ||
cess (and not per job), different (more complex) results will | cess (and not per job), different (more complex) results will | ||
occur when multi-process jobs are submitted to the queue. | occur when multi-process jobs are submitted to the queue. | ||
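To see how a smaller interleave ratio leads to better placement, the first 'U' example above (userA with ratio 5, userB with the default of 10) can be reproduced with a stride-scheduling sketch. This is an illustration only: the text does not say which algorithm MOSIX uses, the struct and function names are hypothetical, and every job is assumed to be a single process.

```c
#include <stddef.h>

/* One queue participant in the sketch (hypothetical type). */
struct user {
    char label;    /* one-letter name used in the output */
    int  ratio;    /* interleave ratio: smaller means better placement */
    long pass;     /* scheduling bookkeeping, initialised by interleave() */
    int  pending;  /* number of waiting single-process jobs */
};

/* Write the order in which jobs would be taken into out[] (at most
 * outlen entries) and return the number of entries written.
 * Stride scheduling: always take the user with the smallest pass value,
 * then advance that user's pass by its ratio. Ties go to the user that
 * appears first in the array. */
size_t interleave(struct user *users, size_t nusers, char *out, size_t outlen)
{
    size_t n = 0;

    for (size_t i = 0; i < nusers; i++)
        users[i].pass = users[i].ratio;

    while (n < outlen) {
        struct user *best = NULL;
        for (size_t i = 0; i < nusers; i++) {
            if (users[i].pending > 0 &&
                (best == NULL || users[i].pass < best->pass))
                best = &users[i];
        }
        if (best == NULL)
            break;              /* nothing left to schedule */
        out[n++] = best->label;
        best->pending--;
        best->pass += best->ratio;
    }
    return n;
}
```

Called with users A (ratio 5) and B (ratio 10), the sketch yields A, A, B, A, A, B, which matches the queue shown in example 1; with equal ratios it alternates A, B, A, B.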
− | + | ||
/etc/mosix/private.conf | /etc/mosix/private.conf | ||
This file specifies where Private Temporary Files (PTFs) are | This file specifies where Private Temporary Files (PTFs) are | ||
Line 510: | Line 600: | ||
space for their PTFs, so we must make sure that they do not dis- | space for their PTFs, so we must make sure that they do not dis- | ||
turb local operations. | turb local operations. | ||
− | + | ||
− | + | Up to 3 different directories can be specified: for local pro- | |
− | cesses; guest-processes from the local cluster; and guest- | + | cesses; guest-processes from the local cluster (including other |
− | + | partitions); and guest-processes from other clusters in the multi- | |
− | + | cluster grid. Accordingly, each line in this file has 3 fields: | |
− | + | ||
1. A combination of the letters: 'O' (own node), 'C' (own clus- | 1. A combination of the letters: 'O' (own node), 'C' (own clus- | ||
ter) and 'G' (other clusters in the grid). For example, OC, | ter) and 'G' (other clusters in the grid). For example, OC, | ||
Line 523: | Line 613: | ||
3. An optional numeric limit, in Megabytes, of the total size of | 3. An optional numeric limit, in Megabytes, of the total size of | ||
PTFs per-process. | PTFs per-process. | ||
− | + | ||
If /etc/mosix/private.conf does not exist, then all PTFs will be | If /etc/mosix/private.conf does not exist, then all PTFs will be | ||
stored in "/private". If the directory "/private" also does not | stored in "/private". If the directory "/private" also does not | ||
Line 532: | Line 622: | ||
not be able to run on this node. Such guest processes that start | not be able to run on this node. Such guest processes that start | ||
using PTFs will migrate back to their home-nodes. | using PTFs will migrate back to their home-nodes. | ||
− | + | ||
When the third field is missing, it defaults to: | When the third field is missing, it defaults to: | ||
5 Gigabytes for local processes. | 5 Gigabytes for local processes. | ||
2 Gigabytes for processes from the same cluster. | 2 Gigabytes for processes from the same cluster. | ||
− | 1 Gigabyte for processes from other clusters | + | 1 Gigabyte for processes from other clusters. |
In any case, guest processes cannot exceed the size limit of their | In any case, guest processes cannot exceed the size limit of their | ||
home-node even on nodes that allow them more space. | home-node even on nodes that allow them more space. | ||
− | + | ||
+ | /etc/mosix/target.conf | ||
+ | This file contains the MRC (MOSIX Reach the Clouds) configuration, | ||
+ | which determines who can launch MRC jobs that run on this node and | ||
+ | what privileges and restrictions those launched jobs may have. | ||
+ | Each line begins with a colon-terminated keyword, followed by spe- | ||
+ | cific parameters for that keyword. Most keywords can be repeated | ||
+ | (except uids:, gids:, defuid:, defgid:). The keywords are: | ||
+ | |||
+ | accept: | ||
+ | An IP address, or a range of consecutive IP addresses from | ||
+ | where this node is willing to accept MRC jobs. An example of | ||
+ | a single IP address is: | ||
+ | |||
+ | accept: 101.102.103.104 | ||
+ | |||
+ | An example of a range of IP addresses is: | ||
+ | |||
+ | accept: 101.102.103.1 - 101.102.104.254 | ||
+ | |||
+ | The address(es) may be followed by an alternative file-name | ||
+ | (starting with '/'): in that case, the privileges and restrictions | ||
+ | for jobs from the given address(es) are contained in | ||
+ | the given file INSTEAD of /etc/mosix/target.conf. For exam- | ||
+ | ple: | ||
+ | |||
+ | accept: 1.2.3.1 - 1.2.3.254 /etc/mosix/special_case_1.2.3 | ||
+ | |||
+ | Alternative files have the same format as | ||
+ | /etc/mosix/target.conf, except that they do not contain the | ||
+ | keywords accept: and reject:. | ||
+ | |||
+ | reject: | ||
+ | IP addresses are specified as in accept: all MRC jobs will be | ||
+ | rejected from those address(es). This option is useful for | ||
+ | excluding particular addresses in the middle of a larger | ||
+ | range that is defined by accept:, for example: | ||
+ | |||
+ | accept: 10.20.30.1 - 10.20.31.254 | ||
+ | reject: 10.20.30.255 - 10.20.31.0 | ||
+ | |||
+ | |||
+ | nodir: | ||
+ | Prevent callers from overriding a given directory with a | ||
+ | directory from their calling computer. Note that overriding | ||
+ | all ancestor-directories is also prevented (since overriding | ||
+ | them would override everything inside them as well, including | ||
+ | the given directory). For example: | ||
+ | |||
+ | nodir: /usr/share/X11 | ||
+ | |||
+ | prevents callers from overriding the directories | ||
+ | "/usr/share/X11", "/usr/share" and "/usr" (overriding the | ||
+ | root-directory is prohibited in any case). | ||
+ | |||
+ | nodir_under: | ||
+ | As nodir: but all subdirectories are also prevented from | ||
+ | being overridden. | ||
+ | |||
+ | allow-subdirs: | ||
+ | If a caller asks to export a directory under a directory-name | ||
+ | where: | ||
+ | 1. No file or directory exists under that name. | ||
+ | 2. The caller has no permission to create this directory. | ||
+ | 3. Overriding that directory-name is not forbidden (e.g. by | ||
+ | nodir: or nodir_under:) | ||
+ | and the named-directory or any of its ancestor-directories | ||
+ | appears with the allow-subdirs: keyword, then the given | ||
+ | directory will be specially created for the caller (it will | ||
+ | be empty and with "root" ownership). For example: | ||
+ | |||
+ | allow-subdirs: /tmp | ||
+ | allow-subdirs: /var/tmp | ||
+ | |||
+ | |||
+ | uids: | ||
+ | A list of user-names and/or user-IDs that may be used by MRC | ||
+ | callers. A '*' denotes all users. A '-' preceding a user- | ||
+ | name or user-ID explicitly excludes that user. The following | ||
+ | example allows all user-IDs except "root": | ||
+ | |||
+ | uids: * -root | ||
+ | |||
+ | |||
+ | gids: | ||
+ | A list of group-names and/or group-IDs that may be used by | ||
+ | MRC callers. A '*' denotes all groups. A '-' preceding a | ||
+ | group-name or group-ID explicitly excludes that group. | ||
+ | |||
+ | defuid: | ||
+ | The default user-ID under which jobs from users that are not | ||
+ | listed with the uids: keyword should run. When this keyword | ||
+ | is absent, the default is user-ID 65534 ("nobody"). | ||
+ | |||
+ | defgid: | ||
+ | The default group-ID under which jobs from user-groups that | ||
+ | are not listed with the gids: keyword should run. When this | ||
+ | keyword is absent, the default is group-ID 65534 ("nobody"). | ||
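The accept:/reject: example above excludes 10.20.30.255 and 10.20.31.0 because, viewed as 32-bit numbers, these two addresses sit in the middle of the accepted range. The sketch below shows that comparison. It is an illustration under assumptions: the function names are hypothetical, and the rule that a reject: match wins over an accept: match is inferred from the description rather than stated outright.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Convert dotted-quad text to a host-order 32-bit value.
 * Returns 0 on success, -1 if the text is not a valid IPv4 address. */
int ip_to_u32(const char *text, uint32_t *out)
{
    struct in_addr a;

    if (inet_pton(AF_INET, text, &a) != 1)
        return -1;
    *out = ntohl(a.s_addr);
    return 0;
}

/* 1 if lo <= ip <= hi, 0 if outside the range, -1 on malformed input. */
int ip_in_range(const char *ip, const char *lo, const char *hi)
{
    uint32_t v, l, h;

    if (ip_to_u32(ip, &v) || ip_to_u32(lo, &l) || ip_to_u32(hi, &h))
        return -1;
    return v >= l && v <= h;
}

/* Decision for the documented example only (hypothetical helper):
 *   accept: 10.20.30.1 - 10.20.31.254
 *   reject: 10.20.30.255 - 10.20.31.0
 * Assumes that a reject: match takes precedence over accept:. */
int example_accepts(const char *ip)
{
    if (ip_in_range(ip, "10.20.30.255", "10.20.31.0") == 1)
        return 0;
    return ip_in_range(ip, "10.20.30.1", "10.20.31.254") == 1;
}
```

Under these assumptions 10.20.30.7 and 10.20.31.1 are accepted, while 10.20.30.255, 10.20.31.0 and anything outside the accept: range are not.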
+ | |||
/etc/mosix/retainpri | /etc/mosix/retainpri | ||
This file contains an integer, specifying a delay in seconds: how | This file contains an integer, specifying a delay in seconds: how | ||
Line 547: | Line 735: | ||
there is no delay and processes with lower priority may arrive as | there is no delay and processes with lower priority may arrive as | ||
soon as there are no processes with a higher priority. | soon as there are no processes with a higher priority. | ||
− | + | ||
/etc/mosix/speed | /etc/mosix/speed | ||
If this file exists, it should contain a positive integer | If this file exists, it should contain a positive integer | ||
Line 555: | Line 743: | ||
rule of thumb, 1.5 times faster than Intel processors of the same | rule of thumb, 1.5 times faster than Intel processors of the same | ||
frequency. | frequency. | ||
− | + | ||
Normally this file is not necessary because the speed of the pro- | Normally this file is not necessary because the speed of the pro- | ||
cessor is automatically detected by the kernel when it boots. | cessor is automatically detected by the kernel when it boots. | ||
Line 567: | Line 755: | ||
and can vary significantly depending on the load of the | and can vary significantly depending on the load of the | ||
underlying operating-systems when it boots. | underlying operating-systems when it boots. | ||
− | + | ||
/etc/mosix/maxguests | /etc/mosix/maxguests | ||
If this file exists, it should contain an integer limit on the | If this file exists, it should contain an integer limit on the | ||
− | number of simultaneous guest-processes from other clusters | + | number of simultaneous guest-processes from other clusters. Oth- |
− | + | erwise, the maximum number of guest-processes from other clusters | |
− | + | is set to the default of 8 times the number of processors. | |
− | + | ||
/etc/mosix/.log_mosrun | /etc/mosix/.log_mosrun | ||
When this file is present, information about invocations of | When this file is present, information about invocations of | ||
mosrun(1) and process migrations will be recorded in the system- | mosrun(1) and process migrations will be recorded in the system- | ||
log (by default "/var/log/messages" on most Linux distributions). | log (by default "/var/log/messages" on most Linux distributions). | ||
− | + | ||
− | /etc/mosix/ | + | /etc/mosix/newtune |
Tuning constants optimize MOSIX performance by telling it | Tuning constants optimize MOSIX performance by telling it | ||
about the costs of networked operations. MOSIX has built-in tun- | about the costs of networked operations. MOSIX has built-in tun- | ||
ing default constants. This file is used to override them to suit | ing default constants. This file is used to override them to suit | ||
your particular hardware and networks. | your particular hardware and networks. | ||
− | + | ||
For most users, this file is difficult to set up manually. Thus, | For most users, this file is difficult to set up manually. Thus, | ||
MOSIX comes with a program to assemble it. For more information, | MOSIX comes with a program to assemble it. For more information, | ||
see topology(7). | see topology(7). | ||
− | + | ||
+ | '''KERNEL''' | ||
+ | Sometimes a MOSIX release provides patches for more than one Linux kernel | ||
+ | version. Also, special kernel-patches are released from time to time to | ||
+ | support particular Linux distributions (such as openSUSE): it is fine to | ||
+ | mix different such kernels within the same cluster. It is even OK to mix | ||
+ | older or newer kernels from other MOSIX releases, so long as the first | ||
+ | two numbers in their MOSIX version (run cat /proc/mosix/version to view | ||
+ | the version) are identical to the first two numbers of your MOSIX | ||
+ | release. | ||
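The compatibility rule above (the first two numbers of the MOSIX version must be identical) can be checked mechanically. The sketch below assumes the version is a dot-separated string that starts with two decimal numbers; the exact format printed by /proc/mosix/version is not given in the text, and the function name is hypothetical.

```c
#include <stdio.h>

/* Returns 1 if the first two numeric components of both version strings
 * are identical, 0 if they differ, -1 if either string does not start
 * with "<number>.<number>". The dot-separated format is an assumption. */
int mosix_versions_compatible(const char *a, const char *b)
{
    int a1, a2, b1, b2;

    if (sscanf(a, "%d.%d", &a1, &a2) != 2 ||
        sscanf(b, "%d.%d", &b1, &b2) != 2)
        return -1;
    return a1 == b1 && a2 == b2;
}
```

For example, two made-up versions "2.25.1.0" and "2.25.0.3" would count as compatible, while "2.25.1" and "2.24.9" would not.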
+ | |||
+ | The MOSIX kernel patch is required for fully operational MOSIX systems | ||
+ | with process-migration. A limited number of functions, such as batch | ||
+ | jobs, queuing and viewing the loads, still work over any Linux kernel, | ||
+ | even without the MOSIX kernel patch (or when the kernel is incompatible | ||
+ | with the current MOSIX version). | ||
+ | |||
+ | It is not recommended to have mixed clusters where some nodes have the | ||
+ | MOSIX kernel-patch and others do not, but if you do so anyway, you should | ||
+ | observe the following rules regarding job-queuing: | ||
+ | |||
+ | On each "mixed" cluster, you may queue either migratable jobs or batch | ||
+ | jobs, but not both. If you choose to queue migratable jobs, then you | ||
+ | should select a node with the MOSIX kernel-patch as the queue-manager. If | ||
+ | you choose to queue batch jobs, then you should select a node without the | ||
+ | MOSIX kernel-patch as the queue-manager (see the section above about | ||
+ | configuring /etc/mosix/queue.conf). | ||
+ | |||
'''INTERFACE FOR PROGRAMS''' | '''INTERFACE FOR PROGRAMS''' | ||
The following interface is provided for programs running under mosrun(1) | The following interface is provided for programs running under mosrun(1) | ||
that wish to interface with their MOSIX run-time environment: | that wish to interface with their MOSIX run-time environment: | ||
− | + | ||
All access to MOSIX is performed via the "open" system call, but the use | All access to MOSIX is performed via the "open" system call, but the use | ||
of "open" is incidental and does not involve actual opening of files. If | of "open" is incidental and does not involve actual opening of files. If | ||
Line 598: | Line 813: | ||
would fail, returning -1, since the quoted files never exist, and | would fail, returning -1, since the quoted files never exist, and | ||
errno(3) would be set to ENOENT. | errno(3) would be set to ENOENT. | ||
− | + | ||
open("/proc/self/{special}", 0) | open("/proc/self/{special}", 0) | ||
reads a value from the MOSIX run-time environment. | reads a value from the MOSIX run-time environment. | ||
− | + | ||
open("/proc/self/{special}", 1|O_CREAT, newval) | open("/proc/self/{special}", 1|O_CREAT, newval) | ||
writes a value to the MOSIX run-time environment. | writes a value to the MOSIX run-time environment. | ||
− | + | ||
open("/proc/self/{special}", 2|O_CREAT, newval) | open("/proc/self/{special}", 2|O_CREAT, newval) | ||
both writes a new value and returns the previous value. | both writes a new value and returns the previous value. | ||
− | + | ||
(the O_CREAT flag is only required when your program is compiled with the | (the O_CREAT flag is only required when your program is compiled with the | ||
64-bit file-size option, but is harmless otherwise). | 64-bit file-size option, but is harmless otherwise). | ||
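The three call forms above can be wrapped in small helpers. This is a minimal sketch with hypothetical function names. It relies only on what the text states: the value comes back as the return value of open, and outside the MOSIX run-time environment the call fails with -1 and ENOENT. Since no file is actually opened, the sketch treats the result as a plain integer and never passes it to close.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Read form: open("/proc/self/{special}", 0).
 * The return value is the MOSIX value itself, not a file descriptor. */
int mosix_read(const char *special)
{
    char path[128];

    snprintf(path, sizeof path, "/proc/self/%s", special);
    return open(path, 0);
}

/* Write form: open("/proc/self/{special}", 1|O_CREAT, newval). */
int mosix_write(const char *special, int newval)
{
    char path[128];

    snprintf(path, sizeof path, "/proc/self/%s", special);
    return open(path, 1 | O_CREAT, newval);
}

/* Read-and-write form: returns the previous value. */
int mosix_swap(const char *special, int newval)
{
    char path[128];

    snprintf(path, sizeof path, "/proc/self/%s", special);
    return open(path, 2 | O_CREAT, newval);
}

/* 1 if the special "files" are absent, which most likely means the
 * program was not started by mosrun; 0 otherwise. */
int outside_mosix(void)
{
    errno = 0;
    return mosix_read("whereami") == -1 && errno == ENOENT;
}
```

A program could call outside_mosix() once at start-up and skip calls such as mosix_write("lock", 1) when it returns 1.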
− | + | ||
Some "files" are read-only, some are write-only and some can do both | Some "files" are read-only, some are write-only and some can do both | ||
(rw). The "files" are as follows: | (rw). The "files" are as follows: | ||
− | + | ||
/proc/self/migrate | /proc/self/migrate | ||
writing a 0 migrates back home; writing -1 causes a migration con- | writing a 0 migrates back home; writing -1 causes a migration con- | ||
Line 619: | Line 834: | ||
cal node number, attempts to migrate there. Successful migration | cal node number, attempts to migrate there. Successful migration | ||
returns 0, failure returns -1 (write only) | returns 0, failure returns -1 (write only) | ||
− | + | ||
/proc/self/lock | /proc/self/lock | ||
When locked (1), no automatic migration may occur (except when | When locked (1), no automatic migration may occur (except when | ||
running on the current node is no longer allowed); when unlocked | running on the current node is no longer allowed); when unlocked | ||
(0), automatic migration can occur. (rw) | (0), automatic migration can occur. (rw) | ||
− | + | ||
/proc/self/whereami | /proc/self/whereami | ||
reads where the program is running: 0 if at home, otherwise usu- | reads where the program is running: 0 if at home, otherwise usu- | ||
ally an unsigned IP address, but if possible, its corresponding | ally an unsigned IP address, but if possible, its corresponding | ||
logical node number. (read only) | logical node number. (read only) | ||
− | + | ||
/proc/self/nmigs | /proc/self/nmigs | ||
reads the total number of migrations performed by this process and | reads the total number of migrations performed by this process and | ||
its MOSRUN ancestors before it was born. (read only) | its MOSRUN ancestors before it was born. (read only) | ||
− | + | ||
/proc/self/sigmig | /proc/self/sigmig | ||
Reads/sets a signal number (1-64 or 0 to cancel) to be received | Reads/sets a signal number (1-64 or 0 to cancel) to be received | ||
after each migration. (rw) | after each migration. (rw) | ||
− | + | ||
/proc/self/glob | /proc/self/glob | ||
Reads/modifies the process class. Processes of class 0 are not | Reads/modifies the process class. Processes of class 0 are not | ||
− | allowed to migrate outside the local cluster. Classes can also | + | allowed to migrate outside the local cluster or even outside the |
− | + | local partition. Classes can also affect the automatic-freezing | |
− | + | policy. (rw) | |
+ | |||
/proc/self/needmem | /proc/self/needmem | ||
Reads/modifies the process's memory requirement in Megabytes, so | Reads/modifies the process's memory requirement in Megabytes, so | ||
it does not automatically migrate to nodes with less free memory. | it does not automatically migrate to nodes with less free memory. | ||
Acceptable values are 0-262143. (rw) | Acceptable values are 0-262143. (rw) | ||
− | + | ||
/proc/self/unsupportok | /proc/self/unsupportok | ||
when 0, unsupported system-calls cause the process to be killed; | when 0, unsupported system-calls cause the process to be killed; | ||
Line 653: | Line 869: | ||
ENOSYS; when 2, an appropriate error-message will also be written | ENOSYS; when 2, an appropriate error-message will also be written | ||
to stderr. (rw) | to stderr. (rw) | ||
− | + | ||
/proc/self/clear | /proc/self/clear | ||
clears process statistics. (write only) | clears process statistics. (write only) | ||
− | + | ||
/proc/self/cpujob | /proc/self/cpujob | ||
Normally when 0, system-calls and I/O are taken into account for | Normally when 0, system-calls and I/O are taken into account for | ||
migration considerations. When set to 1, they are ignored. (rw) | migration considerations. When set to 1, they are ignored. (rw) | ||
− | + | ||
/proc/self/localtime | /proc/self/localtime | ||
When 0, gettimeofday(2) is always performed on the home node. | When 0, gettimeofday(2) is always performed on the home node. | ||
When 1, the date/time is taken from where the process is running. | When 1, the date/time is taken from where the process is running. | ||
(rw) | (rw) | ||
− | + | ||
/proc/self/decayrate | /proc/self/decayrate | ||
Reads/modifies the decay-rate per second (0-10000): programs can | Reads/modifies the decay-rate per second (0-10000): programs can | ||
Line 677: | Line 893: | ||
is provided for users who know well the cyclic behavior of their | is provided for users who know well the cyclic behavior of their | ||
program. (rw) | program. (rw) | ||
− | + | ||
/proc/self/checkpoint | /proc/self/checkpoint | ||
When writing (any value) - perform a checkpoint. When only read- | When writing (any value) - perform a checkpoint. When only read- | ||
Line 684: | Line 900: | ||
version. Returns -1 if the checkpoint fails, 0 if writing only | version. Returns -1 if the checkpoint fails, 0 if writing only | ||
and checkpoint is successful. (rw) | and checkpoint is successful. (rw) | ||
− | + | ||
/proc/self/checkpointfile | /proc/self/checkpointfile | ||
The third argument (newval) is a pointer to a file-name to be used | The third argument (newval) is a pointer to a file-name to be used | ||
as the basis for future checkpoints (see mosrun(1)). (write only) | as the basis for future checkpoints (see mosrun(1)). (write only) | ||
− | + | ||
/proc/self/checkpointlimit | /proc/self/checkpointlimit | ||
Reads/modifies the maximal number of checkpoint files to create | Reads/modifies the maximal number of checkpoint files to create | ||
Line 694: | Line 910: | ||
removes the limit on the number of checkpoint files. The maximal value | removes the limit on the number of checkpoint files. The maximal value | ||
allowed is 10000000. | allowed is 10000000. | ||
− | + | ||
/proc/self/checkpointinterval | /proc/self/checkpointinterval | ||
When writing, sets the interval in minutes for automatic check- | When writing, sets the interval in minutes for automatic check- | ||
Line 701: | Line 917: | ||
has a side effect of resetting the time left to the next checkpoint. | has a side effect of resetting the time left to the next checkpoint. | ||
Thus, writing too frequently is not recommended. (rw) | Thus, writing too frequently is not recommended. (rw) | ||
− | + | ||
+ | open("/proc/self/in_cluster", O_CREAT, node); and | ||
+ | open("/proc/self/in_partition", O_CREAT, node); | ||
+ | return 1 if the given node is in the same cluster/partition, 0 | ||
+ | otherwise. The node can be either an unsigned, host-order IP | ||
+ | address, or a node-number (listed in /etc/mosix/userview.map). | ||
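Since the node argument is an unsigned, host-order IP address, a dotted-quad string has to be converted before the call. The sketch below shows one way to do that with standard socket functions; the helper names are hypothetical, and the open call is the one quoted above. Outside the MOSIX run-time environment same_cluster() simply fails with -1.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/socket.h>

/* Convert "a.b.c.d" to an unsigned host-order value.
 * Returns 0 on success, -1 if the text is not a valid IPv4 address. */
int host_order_ip(const char *text, uint32_t *out)
{
    struct in_addr a;

    if (inet_pton(AF_INET, text, &a) != 1)
        return -1;
    *out = ntohl(a.s_addr);   /* network order to host order */
    return 0;
}

/* Ask MOSIX whether the node with the given IP is in the same cluster.
 * Returns 1 or 0 as documented, or -1 on bad input or outside MOSIX. */
int same_cluster(const char *ip_text)
{
    uint32_t node;

    if (host_order_ip(ip_text, &node) != 0)
        return -1;
    return open("/proc/self/in_cluster", O_CREAT, node);
}
```

The same pattern applies to /proc/self/in_partition by changing the path.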
+ | |||
More functions are available through the direct_communication(7) feature. | More functions are available through the direct_communication(7) feature. | ||
− | + | ||
The following information is available via the /proc file system for | The following information is available via the /proc file system for | ||
everyone to read (not just within the MOSIX run-time environment): | everyone to read (not just within the MOSIX run-time environment): | ||
− | + | ||
/proc/{pid}/from | /proc/{pid}/from | ||
The IP address (a.b.c.d) of the process' home-node ("0" if a local | The IP address (a.b.c.d) of the process' home-node ("0" if a local | ||
process). | process). | ||
− | + | ||
/proc/{pid}/where | /proc/{pid}/where | ||
The IP address (a.b.c.d) where the process is running ("0" if | The IP address (a.b.c.d) where the process is running ("0" if | ||
running here). | running here). | ||
− | + | ||
/proc/{pid}/class | /proc/{pid}/class | ||
The class of the process. | The class of the process. | ||
− | + | ||
/proc/{pid}/origipid | /proc/{pid}/origipid | ||
The original PID of the process on its home-node ("0" if a local | The original PID of the process on its home-node ("0" if a local | ||
process). | process). | ||
− | + | ||
/proc/{pid}/freezer | /proc/{pid}/freezer | ||
Whether and why the process was frozen: | Whether and why the process was frozen: | ||
− | + | ||
0 Not frozen | 0 Not frozen | ||
− | + | ||
1 Frozen automatically due to high load. | 1 Frozen automatically due to high load. | ||
− | + | ||
2 Frozen by the evacuation policy, to prevent flooding by | 2 Frozen by the evacuation policy, to prevent flooding by | ||
arriving processes when clusters are disconnected. | arriving processes when clusters are disconnected. | ||
− | + | ||
3 Frozen due to manual request. | 3 Frozen due to manual request. | ||
− | + | ||
-66 This is a guest process from another home-node (freezing is | -66 This is a guest process from another home-node (freezing is | ||
always on the home-node, hence not applicable here). | always on the home-node, hence not applicable here). | ||
− | + | ||
− | Attempting to read the above for non-MOSIX processes returns the string "-3". | + | Attempting to read the above for non-MOSIX processes returns the string |
− | + | "-3". | |
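The freezer codes listed above, together with the "-3" answer for non-MOSIX processes, can be decoded as follows. This is a sketch with hypothetical function names; it assumes /proc/{pid}/freezer holds a single decimal integer as text, which the quoted "-3" string suggests but the text does not spell out.

```c
#include <stdio.h>

/* Map a freezer code to the meaning given in the list above. */
const char *freezer_reason(int code)
{
    switch (code) {
    case 0:   return "not frozen";
    case 1:   return "frozen automatically due to high load";
    case 2:   return "frozen by the evacuation policy";
    case 3:   return "frozen due to manual request";
    case -66: return "guest process (freezing applies on its home-node)";
    case -3:  return "not a MOSIX process";
    default:  return "unknown code";
    }
}

/* Read /proc/<pid>/freezer into *code.
 * Returns 0 on success, -1 if the file is missing or not numeric. */
int read_freezer(long pid, int *code)
{
    char path[64];
    FILE *f;
    int ok;

    snprintf(path, sizeof path, "/proc/%ld/freezer", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fscanf(f, "%d", code) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A monitoring tool could call read_freezer() for each PID of interest and print freezer_reason() for the result.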
+ | |||
'''STARTING MOSIX''' | '''STARTING MOSIX''' | ||
− | To start MOSIX, run /etc/init.d/mosix start. Alternately, run mosd. | + | To start MOSIX, run /etc/init.d/mosix start. Alternately, run mosd. |
'''SECURITY''' | '''SECURITY''' | ||
− | All nodes within a MOSIX cluster and multi-cluster | + | All nodes within a MOSIX cluster and multi-cluster must trust each |
− | other's super-user(s) | + | other's super-user(s) - otherwise the security of the whole cluster or |
− | + | multi-cluster is compromised. | ||
− | + | ||
Hostile computers must not be allowed physical access to the internal | Hostile computers must not be allowed physical access to the internal | ||
MOSIX network where they could masquerade as having IP addresses of | MOSIX network where they could masquerade as having IP addresses of | ||
trusted nodes. | trusted nodes. | ||
− | + | ||
'''SEE ALSO''' | '''SEE ALSO''' | ||
− | mosrun(1), mosctl(1), migrate(1), setpe(1), mon(1), mosps(1), | + | mosrun(1), mosctl(1), migrate(1), setpe(1), mon(1), mosps(1), | ||
− | moskillall(1), mosq(1), bestnode(1), mospipe(1), | + | timeof(1), moskillall(1), mosq(1), bestnode(1), mospipe(1), mrc(1), |
− | topology(7). | + | direct_communication(7), topology(7). |
'''HISTORY''' | '''HISTORY''' | ||
− | This is the 10-th version of MOSIX. The MOSIX | + | This is the 10-th version of MOSIX. The MOSIX wiki has more information |
− | + | about the previous releases. | |
− | MOSIX | + | MOSIX February 2009 MOSIX |
Latest revision as of 12:31, 22 February 2009
MOSIX(M7) MOSIX Description MOSIX(M7) NAME MOSIX - sharing the power of clusters and multi-clusters INTRODUCTION MOSIX is a generic solution for dynamic management of resources in a cluster or in a multi-cluster organizational grid. MOSIX allows users to draw the most out of all the connected computers, including utilization of idle computers. At the core of MOSIX are adaptive resource sharing algorithms, applying preemptive process migration based on processor loads, memory and I/O demands of the processes, thus causing the cluster or the multi-cluster to work cooperatively similar to a single computer with many processors. Unlike earlier versions of MOSIX, only programs that are started by the mosrun(1) utility are affected and can be considered "migratable" - other programs are considered as "standard Linux programs" and are not affected by MOSIX. MOSIX maintains a high level of compatiblity with standard Linux, so that binaries of almost every application that runs under Linux can run com- pletely unmodified under the MOSIX "migratable" category. The exceptions are usually system-administration or graphic utilities that would not benefit from process-migration anyway. If a "migratable" program that was started by mosrun(1) attempts to use unsupported features, it will either be killed with an appropriate error message, or if a ``do not kill option is selected, an error is returned to the program: such pro- grams should probably run as standard Linux programs. In order to improve the overall resource usage, processes of "migratable" programs may be moved automatically and transparently to other nodes within the cluster or even the multi-cluster grid. As the demands for resources change, processes may move again, as many times as necessary, to continue optimizing the overall resource utilization, subject to the inter-cluster priorities and policies. Manual-control over process migration is also supported. 
MOSIX is particularly suitable for running CPU-intensive computational programs with unpredictable resource usage and run times, and programs with moderate amounts of I/O. Programs that perform large amounts of I/O should better be run as standard Linux programs. Apart from process-migration, MOSIX can provide both "migratable" and "standard Linux" programs with the benefits of optimal initial assignment and live-queuing. The unique feature of live-queuing means that although a job is queued to run later, when resources are available, once it starts, it remains attached to its original Unix/Linux environment (stan- dard-input/output/error, signals, etc.). REQUIREMENTS 1. All nodes must run Linux (any distribution - mixing allowed). 2. All participating nodes must be connected to a network that supports TCP/IP and UDP/IP, where each node has a unique IP address in the range 0.1.0.0 to 255.254.254.255 that is accessible to all the other nodes. 3. TCP/IP ports 249-254 and UDP/IP ports 249-250 must be available for MOSIX (not used by other applications or blocked by a firewall). 4. The architecture of all nodes can be either i386 (32-bit) or x86_64 (64-bit). Processes that are started on a 32-bit node can migrate to a 64-bit node, but not the opposite. 5. In multiprocessor nodes (SMP), all the processors must be of the same speed. 6. The system-administrators of all the connected nodes must be able to trust each other (see more on SECURITY below). CLUSTER, MULTI-CLUSTER, PARTITION The MOSIX concept of a "cluster" is a collection of computers that are owned and managed by the same entity (a person, a group of people or a project) - this can at times be quite different than a hardware cluster, as each MOSIX cluster may range from a single workstation to a large com- bination of computers - workstations, servers, blades, multi-core comput- ers, etc. possibly of different speeds and number of processors and pos- sibly in different locations. 

A MOSIX multi-cluster is a collection of clusters that belong to different entities (owners) who wish to share their resources subject to certain administrative conditions. In particular, when an owner needs its computers - these computers must be returned immediately to the exclusive use of their owner. An owner can also assign priorities to guest processes of other owners, defining who can use their computers and when. Typically, an owner is an individual user, a group of users or a department that owns the computers. The multi-cluster is usually restricted, due to trust and security reasons, to a single organization, possibly in various sites/branches, even across the world.

MOSIX supports dynamic multi-cluster configurations, where clusters can join and leave at any time. When there are plenty of resources in the multi-cluster, the MOSIX queuing system allows more processes to start. When resources become scarce (because other clusters leave or claim their resources and processes must migrate back to their home-clusters), MOSIX has a freezing feature that can automatically freeze excess processes to prevent memory-overload on the home-nodes.

Clusters may also be sub-divided into "partitions". Nodes that are assigned to different cluster-partitions are halfway between being part of the cluster and belonging to a different cluster.

Just as within the cluster:
1. All cluster-partitions appear to other clusters as one cluster (eliminating the need to inform and update system-administrators of other clusters about internal changes to one's cluster).
2. Processes that migrate to another partition share the same top-priority over processes from other clusters.
3. Processes that migrate to another partition share the "Cluster" category disk-space allocation rather than the "Grid" category for Private Temporary Files (see below).

However, just as other clusters:
1. Only processes that were allowed to migrate to other clusters are allowed to migrate to other partitions.
2. Batch jobs cannot be assigned to nodes in other partitions.
3. Each partition maintains its own job-queue.

When you have both 32-bit and 64-bit computers in the same cluster, it is highly recommended (but not mandatory) to set them up as different cluster partitions.

CONFIGURATION

To configure MOSIX interactively, simply run mosconf: it will lead you step-by-step through the various configuration items.

Mosconf can be used in two ways:
1. To configure the local node (press <Enter> at the first question).
2. To configure MOSIX for other nodes: this is typically done on a server that stores an image of the root-partition for some or all of the cluster-nodes. This image can, for example, be NFS-mounted by the cluster-nodes, or otherwise copied or reflected to them by any other method: at the first question, enter the path to the stored root-image.

There is no need to stop MOSIX in order to modify the configuration - most changes will take effect within a minute. However, after modifying the list of nodes in the cluster (/etc/mosix/mosix.map) or /etc/mosix/mosip or /etc/mosix/myfeatures, you should run the command "setpe" (but when you are using mosconf to configure your local node, this is not necessary).

Below is a detailed description of the MOSIX configuration files (if you prefer to edit them manually).

The directory /etc/mosix should include at least the subdirectories /etc/mosix/partners, /etc/mosix/var, /etc/mosix/var/grid and the following files:

/etc/mosix/mosix.map
This file defines which computers participate in your MOSIX cluster. The file contains up to 256 data lines and/or alias lines that can be in any order. It may also include any number of comment lines beginning with a '#', as well as empty lines.

Data lines have 2 or 3 fields:
1. The IP ("a.b.c.d" or host-name) of the first node in a range of nodes with consecutive IPs.
2. The number of nodes in that range.
3. Optional combination of letter-flags and/or an integer:
p[roximate] do not use compression on migration, e.g., over fast networks or slow CPUs.
o[utsider] inaccessible to local-class processes.
{partition} a positive integer indicating the partition number for that range.

Alias lines are of the form:
a.b.c.d=e.f.g.h
or
a.b.c.d=host-name
They indicate that the IP address on the left-hand-side refers to the same node as the right-hand-side.

NOTES:
1. It is an error to attempt to declare the local node an "outsider".
2. When using host names, the first result of gethostbyname(3) must return the IP address that is to be used by MOSIX: if in doubt - specify the IP address.
3. The right-hand-side in alias lines must appear within the data lines.
4. IP addresses 0.0.x.x and 255.255.255.x are not allowed in MOSIX.
5. If you change /etc/mosix/mosix.map while MOSIX is running, you need to run setpe to notify MOSIX of the changes.

/etc/mosix/secret
This is a security file that is used to prevent ordinary users from interfering and/or compromising security by connecting to the internal MOSIX TCP ports. The file should contain just a single line with a password that must be identical on all the nodes of the cluster/multi-cluster. This file must be accessible to ROOT only (chmod 600!)

/etc/mosix/ecsecret
Like /etc/mosix/secret, but used for running batch jobs as a client (see mosrun(1)). If you do not wish to allow this node to send batch-jobs, do not create this file.

/etc/mosix/essecret
Like /etc/mosix/secret, but used for running batch jobs as a server (see mosrun(1)). The password must match the client's /etc/mosix/ecsecret. If you do not wish to allow this node to be a batch-server, do not create this file.

The following files are optional:

/etc/mosix/mosip
This file contains our IP address, to be used for MOSIX purposes, in the regular format - a.b.c.d.
This file is only necessary when the node's IP address is ambiguous: it can be safely omitted if the output of ifconfig(8) ("inet addr:") matches exactly one of the IP addresses listed in the data lines of /etc/mosix/mosix.map.

/etc/mosix/myfeatures
This file contains one line of comma-separated topological features for this node (if any). For example: yellow,wood,chicken. The list of all 32 features (one line per feature) can be found in /etc/mosix/features. If this file is missing, this node is assumed to have no topological features (see topology(7)).

/etc/mosix/freeze.conf
This file sets the automatic freezing policies on a per-class basis for MOSIX processes originating in this node. Each line describes the policy for one class of processes. The lines can be in any order and classes that are not mentioned are not touched by the automatic freezing mechanisms.

The space-separated constants in each line are as follows:
1. class-number: A positive integer identifying a class of processes.
2. load-units: Used in fields #3-#6 below: 0=processes; 1=standard-load.
3. RED-MARK (floating point): Freeze when load is higher.
4. BLUE-MARK (floating point): Unfreeze when load is lower.
5. minautofreeze (floating point): Freeze processes that are evacuated back home on arrival if load is equal to or above this.
6. minclustfreeze (floating point): Freeze processes that are evacuated back to this cluster on arrival if load is equal to or above this.
7. min-keep: Keep running at least this number of processes - even if load is above RED-MARK.
8. max-procs: Freeze excess processes above this number - even if load is below BLUE-MARK.
9. slice: Time (in minutes) that a process of this class is allowed to run while there are automatically-frozen process(es) of this class. After this period, the running process will be frozen and a frozen process will start to run.

NOTES:
1. The load-units in fields #3-#6 depend on field #2. If 0, each unit represents the load created by a CPU-bound process on this computer. If 1, each unit represents the load created by a CPU-bound process on a "standard" MOSIX computer (e.g. a 3GHz Pentium-IV). The difference is that the faster the computer and the more processors it has, the smaller (proportionally) the load created by each CPU-bound process.
2. Fields #3,#4,#5,#6 are floating-point, the rest are integers.
3. A value of "-1" in fields #3,#5,#6,#8 means ignoring that feature.
4. The first 4 fields are mandatory: omitted fields beyond them have the following values: minautofreeze=-1, minclustfreeze=-1, min-keep=0, max-procs=-1, slice=20.
5. The RED-MARK must be significantly higher than BLUE-MARK: otherwise a perpetual cycle of freezing and unfreezing could occur. You should allow at least 1.1 processes difference between them.
6. Frozen processes do not respond to anything, except an unfreeze request or a signal that kills them.
7. Processes that were frozen manually are not unfrozen automatically.

This file may also contain lines starting with '/' to indicate freezing-directory names. A "Freezing directory" is an existing directory (often a mount-point) where the memory contents of frozen processes are saved. For successful freezing, the disk-partition of freezing-directories should have sufficient free disk-space to contain the memory image of all the frozen processes.

If more than one freezing directory is listed, the freezing directory is chosen at random by each freezing process. It is also possible to assign selection probabilities by adding a numeric weight after the directory-name, for example:
/tmp 2
/var/tmp 0.5
/mnt/tmp 2.5
In this example, the total weight is 2+0.5+2.5=5, so out of every 10 frozen processes, an average of 4 (10*2/5) will be frozen to /tmp, an average of 1 (10*0.5/5) to /var/tmp and an average of 5 (10*2.5/5) to /mnt/tmp. When the weight is missing, it defaults to 1.
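
The weight arithmetic above can be reproduced with a short Python sketch. It is only an illustration of the stated proportions, not MOSIX's own selection code: the function name freezing_shares and its frozen parameter are hypothetical, and the sketch simply assumes that every line beginning with '/' is a freezing-directory line with an optional weight.

```python
def freezing_shares(lines, frozen=10):
    """Expected number of frozen processes per freezing directory.

    Hypothetical helper: only lines that begin with '/' are treated as
    freezing-directory lines; a missing weight defaults to 1.
    """
    weights = {}
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].startswith("/"):
            continue  # policy lines and anything else are skipped in this sketch
        weights[fields[0]] = float(fields[1]) if len(fields) > 1 else 1.0
    total = sum(weights.values())
    if total == 0:
        return {}
    # Each directory receives a share proportional to weight / total weight.
    return {name: frozen * w / total for name, w in weights.items()}


print(freezing_shares(["/tmp 2", "/var/tmp 0.5", "/mnt/tmp 2.5"]))
# {'/tmp': 4.0, '/var/tmp': 1.0, '/mnt/tmp': 5.0}
```

These are averages only: each freezing process picks its directory at random, so the actual counts in any group of 10 will vary.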
A weight of 0 means that this directory should be used only if all others cannot be accessed.

If no freezing directories are specified, all freezing will be to the /freeze directory (or symbolic-link).

Freezing files are usually created with "root" (Super-User) permissions, but if /etc/mosix/freeze.conf contains a line of the form:
U {UID}
then they are created with permissions of the given numeric UID (this is sometimes needed when freezing to NFS directories that do not allow "root" access).

/etc/mosix/partners/*
If your cluster is part of a multi-cluster, then each file in /etc/mosix/partners describes another cluster that you want this cluster to cooperate with.

The file-names should indicate the corresponding cluster-names (maximum 128 characters), for example: "geography", "chemistry", "management", "development", "sales", "students-lab-A", etc. The format of each file is as follows:

Line #1: A verbal human-readable description of the cluster.

Line #2: Four space-separated integers as follows:
1. Priority: 0-65535, the lower the better. The priority of the local cluster is always 0. MOSIX gives precedence to processes with higher priority - if they arrive, guests with lower priority will be expelled.
2. Cango:
0=never send local processes to that cluster.
1=local processes may go to that cluster.
3. Cantake:
0=do not accept guest-processes from that cluster.
1=accept guest-processes from that cluster.
4. Canexpand:
0=no: Only nodes listed in the lines below may be recognized as part of that cluster: if a core node from that cluster tells us about other nodes in their cluster - ignore those unlisted nodes.
1=yes: Core-nodes of that cluster may specify other nodes that are in that cluster, and this node should believe them even if they are not listed in the lines below.
-1=do not ask the other cluster: do not consult the other cluster to find out which nodes are in that cluster: instead just rely on and use the lines below.

Following lines: Each line describes a range of consecutive IP addresses that are believed to be part of the other cluster, containing 5 space-separated items as follows:
1. IP1 (or host-name): First node in range.
2. n: Number of nodes in this range.
3. Core:
0=no: This range of nodes may not inform us about who else is in that cluster.
1=yes: This range of nodes could inform us of who else is in that cluster.
4. Participate:
0=no: This range is (as far as this node is concerned) not part of that cluster.
1=yes: This range is probably a part of that cluster.
5. Proximate:
0=no: Use compression on migration to/from that cluster.
1=yes: Do not use compression when migrating to/from that cluster (network is very fast and CPU is slow).

NOTES:
1. From time-to-time, MOSIX will consult one or more of the "core" nodes to find the actual map of their cluster. It is recommended to list such core nodes. The alternative is to set canexpand to -1, causing the map of that cluster to be determined solely by this file.
2. Nodes that do not "participate" are excluded even if listed as part of their cluster by the core-nodes (but they could possibly still be used as "core-nodes" to list other nodes).
3. All core-nodes must have the same value for "proximate", because the "proximate" field of unlisted nodes is copied from that of the core-node from which we happened to find out about them and this cannot be ambiguous.
4. When using host names rather than IP addresses, the first result of gethostbyname(3) must return the IP address that is used by MOSIX: if in doubt - specify the IP address instead.
5. IP addresses 0.0.x.x and 255.255.255.x cannot be used in MOSIX.

/etc/mosix/userview.map
Although it is possible to use only IP numbers and/or host-names to specify nodes in your cluster (and multi-cluster), it is more convenient to use small integers as node numbers: this file allows you to map integers to IP addresses. Each line in this file contains 3 elements:
1. A node number (1-65535)
2. IP1 (or host-name, clearly identifiable by gethostbyname(3))
3. Number of nodes in range (the number of the last one must not exceed 65535)

It is up to the cluster administrator to map as few or as many nodes as they wish out of their cluster and multi-cluster - the most common practice is to map all the nodes in one's cluster, but not in other clusters.

/etc/mosix/queue.conf
This file configures the queueing system (see mosrun(1), mosq(1)). All lines in this file are optional and may appear in any order.

Usually, one node in each cluster is elected by the system-administrator to manage the queue, while the remaining nodes point to that manager. As an exception, in a mixed cluster that has both 32-bit and 64-bit computers, a separate 32-bit node should be elected to exclusively manage the queue for all 32-bit nodes and a 64-bit node elected to exclusively manage the queue for all 64-bit nodes.

Defining the queue manager:

The line:
C {hostname}
assigns a specific node from the cluster (hostname) to manage the job queue. In the absence of this line, each node manages its own queue (which is usually undesirable). It is possible to have several 'C' lines - one for each cluster-partition.

Defining the default priority:

The line:
P {priority}
assigns a default job-priority to all the jobs from this node. The lower this value - the higher the priority. In the absence of this line, the default priority is 50.

Commonly, user-ID's are identical on all the nodes in the cluster. The line (with a single letter):
S
indicates that this is not the case, so users on other nodes (except the Super-User) will be prevented from sending requests to modify the status of queued jobs from this node.

Configuring the queue manager:

The following lines are relevant only in the queue manager node and are ignored on all other nodes:

The MOSIX queueing system determines dynamically how many processes to run.
The line:
M {maxproc}
if present, imposes a maximal number of processes that are allowed to run from the queue simultaneously on top of the regular queueing policy. For example,
M 20
sets the upper limit to 20 processes, even when more resources are available.

The line:
X {1 <= x <= 8}
defines the maximal number of queued processes that may run simultaneously per CPU. This option applies only to processors within the cluster and is not available for other clusters in a multi-cluster (where the queueing system assigns at most one process per CPU). In the absence of this line the default is
X 1

The line:
Z {n}
causes the first n jobs of priority 0 to start immediately (out of order), without checking whether resources are available, leaving that responsibility to the system administrator.

Example: the cluster has 10 dual-CPU nodes, so the queueing system normally allows 20 jobs to run. In order to allow urgent jobs to run immediately (without waiting for regular jobs to complete), the system administrator configures a line: Z 10, thus allowing up to 30 jobs in total, a maximum of 3 jobs per node.

The line:
N {n} [{mb}]
causes the first n jobs of each user to start immediately (out of order), without checking whether resources are available. Only jobs above that number, per user, will be queued and whenever the number of a user's running jobs drops below this number, a new job of that user (if there is any waiting) will start to run. When the mb parameter is given, only jobs that do not exceed this amount of memory in MegaBytes will be started this way.

The system-administrator should weigh carefully, based on knowledge about the patterns of jobs that users typically run, the benefits of this option against its risks, such as having at times more jobs in their cluster(s) than available memory to run them efficiently.
If this option is selected with a memory-limitation (mb), then the system-administrator should request that users always specify the maximum memory-requirements for all their queued jobs (using "mosrun -m").

Fair-share policy:

The fairness policy determines the order in which jobs are initially placed in the queue. Note that fairness should not be confused with priority (as defined by the P {priority} line or by mosrun -q{pri} and possibly modified by mosq(1)): priorities always take precedence, so here we only discuss the initial placement in the queue of jobs with the same priority.

The default queueing policy is "first-come-first-served". Alternatively, jobs of different users can be placed in the queue in an interleaved manner.

The line (with a single letter):
F
switches the queueing policy to the interleaved policy.

The advantage of the interleaved approach is that a user wishing to run a relatively small number of processes does not need to wait for all the jobs that were already placed in the queue. The disadvantage is that older jobs need to wait longer.

Normally, the interleaving ratio is equal among all users. For example, with two users (A and B) the queue may look like A-B-A-B-A-B-A-B.

Each user is assigned an interleave ratio which determines (proportionally) how well their jobs will be placed in the queue relative to other users: the smaller that ratio - the better placement they will get (and vice versa). Normally all users receive the same default interleave-ratio of 10 per process. However, lines of the form:
U {UID} {1 <= interleave <= 100}
can set a different interleave ratio for different users. UID can be either numeric or symbolic and there is no limit on the number of these 'U' lines.

Examples:
1. Two users (A & B):
U userA 5
(userB is not listed, hence it gets the default of 10)
The queue looks like: A-A-B-A-A-B-A-A-B...
2. Two users (A & B):
U userA 20
U userB 15
The queue looks like: B-A-B-A-B-A-B-B-A-B-A-B-A-B-B-A...
3. Three users (A, B & C):
U userA 25
U userB 7
(userC is not listed, hence it gets the default of 10)
The queue looks like: B-C-B-C-B-A-B-C-B-C-B-A-B-C-B-C...

Note that since the interleave ratio is determined per process (and not per job), different (more complex) results will occur when multi-process jobs are submitted to the queue.

/etc/mosix/private.conf
This file specifies where Private Temporary Files (PTFs) are stored: PTFs are an important feature of mosrun(1) and may consume a significant amount of disk-space. It is important to ensure that sufficient disk-space is reserved for PTFs, but without allowing them to disturb other jobs by filling up disk-partitions. Guest processes can also demand unpredictable amounts of disk-space for their PTFs, so we must make sure that they do not disturb local operations.

Up to 3 different directories can be specified: for local processes; guest-processes from the local cluster (including other partitions); and guest-processes from other clusters in the multi-cluster grid. Accordingly, each line in this file has 3 fields:
1. A combination of the letters: 'O' (own node), 'C' (own cluster) and 'G' (other clusters in the grid). For example, OC, C, CG or OCG.
2. A directory name (usually a mount-point) starting with '/', where PTFs for the above processes are to be stored.
3. An optional numeric limit, in Megabytes, of the total size of PTFs per-process.

If /etc/mosix/private.conf does not exist, then all PTFs will be stored in "/private". If the directory "/private" also does not exist, or if /etc/mosix/private.conf exists but does not contain a line with an appropriate letter in the first field ('O', 'C' or 'G'), then no disk-space is allocated for PTFs of the affected processes, which usually means that processes requiring PTFs will not be able to run on this node. Such guest processes that start using PTFs will migrate back to their home-nodes.
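
As a rough illustration of the three-field line format, the following Python sketch parses the text of a private.conf file into a table keyed by category letter. It is not part of MOSIX: the function name parse_private_conf and the sample paths are hypothetical, and the sketch assumes that the file contains nothing but lines in the format described above. A missing third field is returned as None, leaving the choice of default to MOSIX.

```python
def parse_private_conf(text):
    """Map each category letter ('O', 'C', 'G') to (directory, limit_mb).

    Hypothetical helper: limit_mb is None when the optional third field
    is absent.
    """
    table = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) not in (2, 3):
            raise ValueError("expected 2 or 3 fields: %r" % raw)
        letters, directory = fields[0], fields[1]
        if set(letters) - set("OCG"):
            raise ValueError("unknown category letter in %r" % letters)
        if not directory.startswith("/"):
            raise ValueError("directory must start with '/': %r" % directory)
        limit_mb = float(fields[2]) if len(fields) == 3 else None
        for letter in letters:
            table[letter] = (directory, limit_mb)
    return table


sample = "O /private\nC /scratch/cluster 1024\nG /scratch/grid 512\n"
table = parse_private_conf(sample)
# A category with no entry in the table has no disk-space allocated for PTFs.
for letter in "OCG":
    print(letter, table.get(letter))
```

A table built this way makes it easy to spot a category ('O', 'C' or 'G') that has no line at all, which is the case where processes requiring PTFs cannot run on the node.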

When the third field is missing, it defaults to:
5 Gigabytes for local processes.
2 Gigabytes for processes from the same cluster.
1 Gigabyte for processes from other clusters.
In any case, guest processes cannot exceed the size limit of their home-node even on nodes that allow them more space.

/etc/mosix/target.conf
This file contains the MRC (MOSIX Reach the Clouds) configuration, which determines who can launch MRC jobs that run on this node and what privileges and restrictions those launched jobs may have.

Each line begins with a colon-terminated keyword, followed by specific parameters for that keyword. Most keywords can be repeated (except uids:, gids:, defuid:, defgid:). The keywords are:

accept:
An IP address, or a range of consecutive IP addresses from where this node is willing to accept MRC jobs. An example of a single IP address is:
accept: 101.102.103.104
An example of a range of IP addresses is:
accept: 101.102.103.1 - 101.102.104.254
The address(es) may be followed by an alternative file-name (starting with '/'): in that case, the privileges and restrictions for jobs from the given address(es) are contained in the given file INSTEAD of /etc/mosix/target.conf. For example:
accept: 1.2.3.1 - 1.2.3.254 /etc/mosix/special_case_1.2.3
Alternative files have the same format as /etc/mosix/target.conf, except that they do not contain the keywords accept: and reject:.

reject:
IP addresses are specified as in accept: all MRC jobs will be rejected from those address(es). This option is useful for excluding particular addresses in the middle of a larger range that is defined by accept:, for example:
accept: 10.20.30.1 - 10.20.31.254
reject: 10.20.30.255 - 10.20.31.0

nodir:
Prevent callers from overriding a given directory with a directory from their calling computer. Note that overriding all ancestor-directories is also prevented (since overriding them would override everything inside them as well, including the given directory).
For example:
nodir: /usr/share/X11
prevents callers from overriding the directories "/usr/share/X11", "/usr/share" and "/usr" (overriding the root-directory is prohibited anyway).

nodir_under:
As nodir: but all subdirectories are also prevented from being overridden.

allow-subdirs:
If a caller asks to export a directory under a directory-name where:
1. No file or directory exists under that name.
2. The caller has no permission to create this directory.
3. Overriding that directory-name is not forbidden (e.g. by nodir: or nodir_under:)
and the named-directory or any of its ancestor-directories appears with the allow-subdirs: keyword, then the given directory will be specially created for the caller (it will be empty and with "root" ownership). For example:
allow-subdirs: /tmp
allow-subdirs: /var/tmp

uids:
A list of user-names and/or user-IDs that may be used by MRC callers. A '*' denotes all users. A '-' preceding a user-name or user-ID explicitly excludes that user. The following example allows all user-ID's except "root":
uids: * -root

gids:
A list of group-names and/or group-IDs that may be used by MRC callers. A '*' denotes all groups. A '-' preceding a group-name or group-ID explicitly excludes that group.

defuid:
The default user-ID under which jobs from users that are not listed with the uids: keyword should run. When this keyword is absent, the default is user-ID 65534 ("nobody").

defgid:
The default group-ID under which jobs from user-groups that are not listed with the gids: keyword should run. When this keyword is absent, the default is group-ID 65534 ("nobody").

/etc/mosix/retainpri
This file contains an integer, specifying a delay in seconds: how long after all MOSIX processes of a certain priority (see above, /etc/mosix/priority) finish (or leave) to allow processes of lower priority (higher numbers) to start.
When this file is absent, there is no delay and processes with lower priority may arrive as soon as there are no processes with a higher priority.

/etc/mosix/speed
If this file exists, it should contain a positive integer (1-10,000,000), providing the relative speed of the processor: the bigger the faster, where 10,000 units of speed are equivalent to a 3GHz Pentium-IV, and AMD (Athlon or Opteron) processors are, as a rule of thumb, 1.5 times faster than Intel processors of the same frequency.

Normally this file is not necessary because the speed of the processor is automatically detected by the kernel when it boots. There are however two cases when you should consider using this option:
1. When you have a heterogeneous cluster and always use MOSIX to run a specific program (or programs) that perform better on certain processor-types than on others.
2. On Virtual-Machines that run over a hosting operating-system: in this case, the speed that the kernel detects is unreliable and can vary significantly depending on the load of the underlying operating-system when it boots.

/etc/mosix/maxguests
If this file exists, it should contain an integer limit on the number of simultaneous guest-processes from other clusters. Otherwise, the maximum number of guest-processes from other clusters is set to the default of 8 times the number of processors.

/etc/mosix/.log_mosrun
When this file is present, information about invocations of mosrun(1) and process migrations will be recorded in the system-log (by default "/var/log/messages" on most Linux distributions).

/etc/mosix/newtune
Tuning constants optimize MOSIX performance by telling it about the costs of networked operations. MOSIX has built-in default tuning constants. This file is used to override them to suit your particular hardware and networks. For most users, this file is difficult to set up manually. Thus, MOSIX comes with a program to assemble it. For more information, see topology(7).

KERNEL

Sometimes a MOSIX release provides patches for more than one Linux kernel version. Also, special kernel-patches are released from time to time to support particular Linux distributions (such as openSUSE): it is fine to mix different such kernels within the same cluster. It is even OK to mix older or newer kernels from other MOSIX releases, so long as the first two numbers in their MOSIX version (run cat /proc/mosix/version to view the version) are identical to the first two numbers of your MOSIX release.

The MOSIX kernel patch is required for fully operational MOSIX systems with process-migration. A limited number of functions, such as batch jobs, queuing and viewing the loads, still work over any Linux kernel, even without the MOSIX kernel patch (or when the kernel is incompatible with the current MOSIX version).

It is not recommended to have mixed clusters where some nodes have the MOSIX kernel-patch and others do not, but if you do so anyway, you should observe the following rules regarding job-queuing:

On each "mixed" cluster, you may queue either migratable jobs or batch jobs, but not both. If you choose to queue migratable jobs, then you should select a node with the MOSIX kernel-patch as the queue-manager. If you choose to queue batch jobs, then you should select a node without the MOSIX kernel-patch as the queue-manager (see above the section about configuring /etc/mosix/queue.conf).

INTERFACE FOR PROGRAMS

The following interface is provided for programs running under mosrun(1) that wish to interface with their MOSIX run-time environment:

All access to MOSIX is performed via the "open" system call, but the use of "open" is incidental and does not involve actual opening of files. If the program were to run as a regular Linux program, those "open" calls would fail, returning -1, since the quoted files never exist, and errno(3) would be set to ENOENT.

open("/proc/self/{special}", 0) reads a value from the MOSIX run-time environment.
open("/proc/self/{special}", 1|O_CREAT, newval) writes a value to the MOSIX run-time environment.
open("/proc/self/{special}", 2|O_CREAT, newval) both writes a new value and returns the previous value.
(the O_CREAT flag is only required when your program is compiled with the 64-bit file-size option, but is harmless otherwise).

Some "files" are read-only, some are write-only and some can do both (rw). The "files" are as follows:

/proc/self/migrate
Writing a 0 migrates back home; writing -1 causes a migration consideration; writing the unsigned value of an IP address or a logical node number attempts to migrate there. Successful migration returns 0, failure returns -1. (write only)

/proc/self/lock
When locked (1), no automatic migration may occur (except when running on the current node is no longer allowed); when unlocked (0), automatic migration can occur. (rw)

/proc/self/whereami
Reads where the program is running: 0 if at home, otherwise usually an unsigned IP address, but if possible, its corresponding logical node number. (read only)

/proc/self/nmigs
Reads the total number of migrations performed by this process and its MOSRUN ancestors before it was born. (read only)

/proc/self/sigmig
Reads/sets a signal number (1-64 or 0 to cancel) to be received after each migration. (rw)

/proc/self/glob
Reads/modifies the process class. Processes of class 0 are not allowed to migrate outside the local cluster or even outside the local partition. Classes can also affect the automatic-freezing policy. (rw)

/proc/self/needmem
Reads/modifies the process's memory requirement in Megabytes, so it does not automatically migrate to nodes with less free memory. Acceptable values are 0-262143. (rw)

/proc/self/unsupportok
When 0, unsupported system-calls cause the process to be killed; when 1 or 2, unsupported system-calls return -1 with errno set to ENOSYS; when 2, an appropriate error-message will also be written to stderr. (rw)

/proc/self/clear
Clears process statistics. (write only)

/proc/self/cpujob
Normally 0, in which case system-calls and I/O are taken into account for migration considerations. When set to 1, they are ignored. (rw)

/proc/self/localtime
When 0, gettimeofday(2) is always performed on the home node. When 1, the date/time is taken from where the process is running. (rw)

/proc/self/decayrate
Reads/modifies the decay-rate per second (0-10000): programs can alternate between periods of intensive CPU and periods of demanding I/O. Decisions to migrate should be based neither on momentary program behaviour nor on extremely long term behaviour, so a balance must be struck, where old process statistics gradually decay in favour of newer statistics. The lesser the decay rate, the more weight is given to new information. The higher the decay rate, the more weight is given to older information. This option is provided for users who know well the cyclic behaviour of their program. (rw)

/proc/self/checkpoint
When writing (any value) - perform a checkpoint. When only reading - return the version number of the next checkpoint to be made. When reading and writing - perform a checkpoint and return its version. Returns -1 if the checkpoint fails, 0 if writing only and the checkpoint is successful. (rw)

/proc/self/checkpointfile
The third argument (newval) is a pointer to a file-name to be used as the basis for future checkpoints (see mosrun(1)). (write only)

/proc/self/checkpointlimit
Reads/modifies the maximal number of checkpoint files to create before recycling the checkpoint version number. A value of 0 removes the limit on the number of checkpoint files. The maximal value allowed is 10000000.

/proc/self/checkpointinterval
When writing, sets the interval in minutes for automatic checkpoints (see mosrun(1)). A value of 0 cancels automatic checkpoints. The maximal value allowed is 10000000. Note that writing has a side effect of resetting the time left to the next checkpoint. Thus, writing too frequently is not recommended. (rw)

open("/proc/self/in_cluster", O_CREAT, node); and
open("/proc/self/in_partition", O_CREAT, node);
return 1 if the given node is in the same cluster/partition, 0 otherwise. The node can be either an unsigned, host-order IP address, or a node-number (listed in /etc/mosix/userview.map).

More functions are available through the direct_communication(7) feature.

The following information is available via the /proc file system for everyone to read (not just within the MOSIX run-time environment):

/proc/{pid}/from
The IP address (a.b.c.d) of the process' home-node ("0" if a local process).

/proc/{pid}/where
The IP address (a.b.c.d) where the process is running ("0" if running here).

/proc/{pid}/class
The class of the process.

/proc/{pid}/origipid
The original PID of the process on its home-node ("0" if a local process).

/proc/{pid}/freezer
Whether and why the process was frozen:
0 Not frozen.
1 Frozen automatically due to high load.
2 Frozen by the evacuation policy, to prevent flooding by arriving processes when clusters are disconnected.
3 Frozen due to manual request.
-66 This is a guest process from another home-node (freezing is always on the home-node, hence not applicable here).

Attempting to read the above for non-MOSIX processes returns the string "-3".

STARTING MOSIX

To start MOSIX, run /etc/init.d/mosix start. Alternatively, run mosd.

SECURITY

All nodes within a MOSIX cluster and multi-cluster must trust each other's super-user(s) - otherwise the security of the whole cluster or multi-cluster is compromised.

Hostile computers must not be allowed physical access to the internal MOSIX network where they could masquerade as having IP addresses of trusted nodes.

SEE ALSO

mosrun(1), mosctl(1), migrate(1), setpe(1), mon(1), mosps(1), timeof(1), moskillall(1), mosq(1), bestnode(1), mospipe(1), mrc(1), direct_communication(7), topology(7).

HISTORY

This is the 10-th version of MOSIX. The MOSIX wiki has more information about the previous releases.
MOSIX February 2009 MOSIX