Difference between revisions of "Mosrun"

From MosixWiki
 
(8 intermediate revisions by one user not shown)
Line 1: Line 1:
 
  MOSRUN(M1)                      MOSIX Commands                      MOSRUN(M1)
 
 
'''NAME'''
 
    MOSRUN - Running MOSIX programs
 
 
'''SYNOPSIS'''
 
    mosrun [location_options] [program_options] program [args] ...
 
    mosrun -S{maxjobs} [location_options] [program_options] {commands-file}
 
            [,{failed-file}]
 
    mosrun -R{filename} [-O{fd=filename}[,{fd2=fn2}]]...  [location_options]
 
    mosrun -I{filename}
 
    mosenv { same-arguments-as-mosrun }
 
    native program [args]...
 
 
            Location options:
 
 
              [-r{hostname} | -{a.b.c.d} | -{n} | -h | -b |
 
              -jID1-ID2[,ID3-ID4]...] [-G[{class}]] [-F] [-L] [-l]
 
              [-D{DD:HH:MM}] [-A{minutes}] [-N{max}] [-{q|Q}[{pri}]]
 
              [-P{parallel_processes}] [-J{JobID}]
 
 
            Program Options:
 
 
              [-m{mb}] [-d {0-10000}] [-c] [-n] [-z] [-e] [-u] [-w] [-t] [-T]
 
              [-E[/{cwd}]] [-M[/{cwd}]] [-C{filename}] [-X{/directory}]...
 
 
 
'''DESCRIPTION'''
 
    Mosrun runs a program under the MOSIX discipline: this means that pro-
 
    grams activated by mosrun can potentially migrate to other nodes within
 
    the cluster or grid (see mosix(7)): programs that are not started by
 
    mosrun, run in "native" Linux mode and cannot migrate.
 
 
    Once running under MOSIX, the program and all its child-processes remain
 
    under the MOSIX discipline, with the exception of the native utility,
 
    which allows programs (mainly shells) that already run under mosrun to
 
    spawn children that run in native Linux mode.
 
 
    The following arguments may be used to specify the program's initial
 
    assignment:
 
 
    -r{hostname}            on the given host
 
    -{a.b.c.d}              on the given IP address
 
    -{n}                    on the given node-number
 
    -h                      on the home-node
 
    -b                      the program attempts to select the best node
 
    -jID1-ID2[,ID3-ID4]...  select at random from the given list of hosts,
 
                            IP's and/or node numbers.
 
 
    When none of the above arguments is used, the program will start wherever
 
    its parent process is running.
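As an illustration of the initial-assignment arguments listed above, the following Python sketch assembles a mosrun argument vector without executing it. The helper name mosrun_argv and its keyword parameters are hypothetical; only the option spellings are taken from this page, and the "at most one" check follows the alternation shown in the SYNOPSIS.

```python
def mosrun_argv(program, args=(), *, host=None, ip=None, node=None,
                home=False, best=False, choices=None):
    """Return an argv list for mosrun with at most one location option.

    Hypothetical helper: it only builds the list, it does not run mosrun.
    """
    options = []
    if host is not None:
        options.append(f"-r{host}")                 # -r{hostname}
    if ip is not None:
        options.append(f"-{ip}")                    # -{a.b.c.d}
    if node is not None:
        options.append(f"-{int(node)}")             # -{n}
    if home:
        options.append("-h")                        # home-node
    if best:
        options.append("-b")                        # best node
    if choices:
        options.append("-j" + ",".join(choices))    # -jID1-ID2[,ID3-ID4]...
    if len(options) > 1:
        raise ValueError("choose at most one initial-assignment option")
    # With no option, the program starts wherever its parent is running.
    return ["mosrun", *options, program, *args]
```

For example, mosrun_argv("my_program", ["-a1"], best=True) yields the list for "mosrun -b my_program -a1".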
 
 
    The -F flag states that mosrun should still start the program, somewhere

    else, when the requested node (above) is not available.
 
 
    The -L flag states that the program should not be allowed to migrate
 
    automatically.  It may still be migrated manually or when situations
 
    arise that do not allow it to continue running where it is.
 
 
    The -l flag negates the -L flag and allows the program to migrate auto-
 
    matically: this is useful when -L was already applied to the program
 
    (usually a shell) that calls mosrun.
 
 
    The -G argument states that the program is to be allowed to migrate to
 
    grid-wide nodes rather than only within the local cluster.  This argument
 
    may be followed by a positive integer, -G[{class}], that specifies the

    program's class: when that number is omitted, the class of the program is

    assumed to be 1.  It is also possible to specify -G0, meaning that the

    program may not migrate outside the local cluster (this is useful when -G

    was already applied to the calling program).
 
 
    The -D{timespec} argument lets users provide an estimate of how long their
 
    job should run.  MOSIX does not use this information - it is provided in
 
    order to help mosps(1) keep track of processes.  timespec can be speci-
 
    fied in any of the following formats (DD/HH/MM are numeric for days,
 
    hours and minutes respectively): DD:HH:MM; HH:MM; DDd; HHh; MMm;
 
    DDdHHhMMm; DDdHHh; DDdMMm; HHhMMm.  Periods when the process is frozen
 
    are automatically added to that estimate.
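The timespec formats listed above can be read as follows. This is a minimal Python sketch, not mosrun's own parser: the function name timespec_to_minutes is hypothetical, and rejecting anything outside the nine listed formats is an assumption.

```python
import re

# Suffix forms: DDd, HHh, MMm and their ordered combinations.
_SUFFIX_FORM = re.compile(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?")


def timespec_to_minutes(spec):
    """Convert a -D style timespec into a total number of minutes."""
    if ":" in spec:
        fields = spec.split(":")
        if len(fields) not in (2, 3) or not all(f.isdigit() for f in fields):
            raise ValueError(f"bad timespec: {spec!r}")
        numbers = [int(f) for f in fields]
        if len(numbers) == 2:            # HH:MM has no days field
            numbers.insert(0, 0)
        days, hours, minutes = numbers
    else:
        match = _SUFFIX_FORM.fullmatch(spec)
        if not spec or match is None:
            raise ValueError(f"bad timespec: {spec!r}")
        days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return (days * 24 + hours) * 60 + minutes
```

Under this reading, "1d2h3m" and "01:02:03" describe the same estimate.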
 
 
    The -m{mb} argument states that the program requires a certain amount of
 
    memory (in Megabytes) and should not run with less.  This has the effect
 
    of:
 
    1. Combined with the -b flag, the program will only consider starting on

        nodes with available memory of at least {mb} Megabytes: the
 
        program will not even start until at least one such node is found.
 
    2. The program will not automatically migrate to nodes with less than
 
        {mb} Megabytes free memory (with the exception of the home node, when
 
        the program must move back home).
 
    3. The queuing system (see below) will take the program's memory require-
 
        ments into account when deciding which and how many jobs to allow to
 
        run at any point in time.
 
 
    Most system-calls are supported by MOSIX, but a few are not (such as map-
 
    ping shared memory or cloning - see the "LIMITATIONS" section below).  By
 
    default, when a program under mosrun encounters an unsupported system-
 
    call, it is killed.  The -e flag, however, allows the program to continue
 
    and behave as follows:
 
 
    1. mmap(2) with (flags & MAP_SHARED) - but !(prot & PROT_WRITE), replaces
 
        the MAP_SHARED with MAP_PRIVATE (this combination seems unusual or
 
        even faulty, but is unnecessarily used within some Linux libraries).
 
 
    2. all other unsupported system-calls return -1 and "errno" is set to
 
        ENOSYS.
 
 
    The -w flag is the same as -e, but it also causes mosrun to print an
 
    error message to the standard-error when an unsupported system-call is
 
    encountered.  The -u flag returns to the default of killing the process.
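The two -e rules above can be modelled in a few lines of Python. This is only a sketch of the documented decision, assuming a Linux host where the mmap module exposes MAP_SHARED, MAP_PRIVATE and PROT_WRITE; the function name under_e_flag and its return convention are hypothetical, and the function assumes it is called only for a system-call that MOSIX does not support.

```python
import errno
import mmap


def under_e_flag(syscall, flags=0, prot=0):
    """Model what -e does with an unsupported system-call.

    Returns ("retry", new_flags) when a read-only shared mapping is
    turned into a private one, otherwise ("fail", errno.ENOSYS).
    """
    shared = bool(flags & mmap.MAP_SHARED)
    writable = bool(prot & mmap.PROT_WRITE)
    if syscall == "mmap" and shared and not writable:
        # Rule 1: replace MAP_SHARED with MAP_PRIVATE.
        new_flags = (flags & ~mmap.MAP_SHARED) | mmap.MAP_PRIVATE
        return ("retry", new_flags)
    # Rule 2: the call returns -1 and errno is set to ENOSYS.
    return ("fail", errno.ENOSYS)
```

With -w the outcome is the same, plus a message on the standard-error; with -u (the default) the process is killed instead.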
 
 
    System calls and I/O operations are monitored and taken into account in
 
    automatic migration considerations, tending to pull processes towards
 
    their home-nodes.  The -c flag tells mosrun not to take system calls and

    I/O operations into account in migration considerations.  The -n flag

    reverts to taking them into account.
 
 
    Even when running elsewhere, programs running under MOSIX obtain the
 
    results of the gettimeofday(2) system-call from their home-nodes.  The -t
 
    flag tells mosrun to take the time from the local node (where the process
 
    currently runs), thus reducing the communication overhead with the home-
 
    node. Note that this can be a problem when the clocks are not synchro-
 
    nized.  The -T flag reverses the effect of -t.
 
 
    The -d{decay} argument, where decay is an integer between 0 and 10000,
 
    sets the rate of decay of process-statistics as a fraction of 10000 per
 
    second (see mosix(7)).
 
 
    The -z flag states that the program's arguments begin at argument #0 -
 
    otherwise, the arguments (if any) are assumed to begin at argument #1 and
 
    argument #0 is assumed to be identical to the program-name.
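The effect of -z on the argument vector that the program receives can be sketched as below. The function name program_argv is hypothetical, and the sketch only mirrors the paragraph above; it does not invoke mosrun.

```python
def program_argv(program, args, z_flag=False):
    """Return the argv the started program would see.

    With -z, the listed arguments start at argument #0; without it,
    argument #0 is the program-name and the arguments start at #1.
    """
    if z_flag:
        return list(args)
    return [program, *args]
```

For instance, with -z a shell could be started with "-sh" as its argument #0, which is not possible when argument #0 is forced to equal the program-name.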
 
 
    mosrun can send batch jobs to other nodes of the local cluster.  There
 
    are two types of batch jobs: those produced by the -E argument are native
 
    Linux jobs, while those produced by the -M argument are MOSIX jobs - but
 
    possibly with a different home-node.
 
 
    Batch jobs are executed from binaries in another node and preserve only
 
    some of the caller's environment: they receive the environment variables;
 
    they can read from their standard-input and write to their standard out-
 
    put and error, but not from/to other open files; they receive signals,
 
    but after forking, signals are delivered to the whole process-group
 
    rather than just the parent; they cannot communicate with other processes
 
    on the local node using pipes and sockets (other than standard input/out-
 
    put/error), semaphores, messages, etc.  and can only receive signals, but
 
    not send them.  The main advantage of batch jobs is that they save time
 
    by not needing to refer to the home-node to perform system-calls, so tem-
 
    porary files for example, can be created on the node where they start,
 
    preventing the calling node from becoming a bottleneck.  This approach is
 
    recommended for programs that perform a significant amount of I/O.
 
 
    Batch jobs use the path of the current directory as their current-direc-
 
    tory on the other node.  It is possible to override that path by specify-
 
    ing a different directory in the -E{/cwd} or -M{/cwd} arguments.
 
 
    MOSIX-specific arguments (-G, -F, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T,
 
    -A, -N, -C), do not apply to native Linux batch jobs that are started
 
    with the -E argument, but they do apply to jobs started with the -M argu-
 
    ment.
 
 
    Permission is required from the other node to send batch jobs there (see
 
    mosix(7) for more information).
 
 
    The following arguments: -G, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T are
 
    inherited by child processes: see however in mosix(7) how those can be
 
    changed at run time from within the program.
 
 
    The variant mosenv is used to circumvent the loss of certain environment
 
    variables by the GLIBC library due to the fact that mosrun is a "setuid"
 
    program: if your program relies on the settings of dynamic-linking envi-
 
    ronment variables (such as LD_LIBRARY_PATH) or malloc(3) debugging (MAL-
 
    LOC_CHECK_), use mosenv instead of mosrun.
 
 
'''CHECKPOINTS'''
 
    Most CPU-intensive processes running under mosrun can be checkpointed:
 
    this means that an image of those processes is saved to a file, and when
 
    necessary, the process can later recover itself from that file and con-
 
    tinue to run from that point.
 
 
    For successful checkpoint and recovery, the process must not depend heav-
 
    ily on its Linux environment.  Specifically, the following processes can-
 
    not be checkpointed at all:
 
 
    1. Processes with setuid/setgid privileges (for security reasons).
 
    2. Processes with open pipes or sockets.
 
 
    The following processes can be checkpointed, but may not run correctly
 
    after being recovered:
 
 
    1. Processes that rely on process-ID's of themselves or other processes
 
        (parent, sons, etc.).
 
    2. Processes that rely on parent-child relations (e.g. use wait(2), Ter-
 
        minal job-control, etc.).
 
    3. Processes that coordinate their input/output with other running pro-
 
        cesses.
 
    4. Processes that rely on timers and alarms.
 
    5. Processes that cannot afford to lose signals.
 
    6. Processes that use system-V IPC (semaphores and messages).
 
 
    The -C{filename} argument specifies where to save checkpoints: when a new
 
    checkpoint is saved, that file-name is given a consecutive numeric exten-
 
    sion (unless it already has one). For example, if the argument -Cmysave
 
    is given, then the first checkpoint will be saved to mysave.1, the second
 
    to mysave.2, etc., and if the argument -Csave.4 is given, then the first
 
    checkpoint will be saved to save.4, the second to save.5, etc.  If the -C
 
    argument is not provided, then the checkpoints will be saved to the
 
    default: ckpt.{pid}.1, ckpt.{pid}.2  ...  The -C argument is NOT inher-
 
    ited by child processes.
 
 
    The -N{max} argument specifies the maximum number of checkpoints to pro-
 
    duce before recycling the checkpoint versions.  This is mainly needed in
 
    order to save disk space.  For example, when running with the arguments:
 
    -Csave.4 -N3, checkpoints will be saved in save.4, save.5, save.6,
 
    save.4, save.5, save.6, save.4 ...
 
    The -N0 argument returns to the default of unlimited checkpoints; an
 
    argument of -N1 is risky, because if there is a crash just at the time
 
    when a backup is taken, there could be no remaining valid checkpoint
 
    file.  Similarly, if the process can possibly have open pipe(s) or
 
    socket(s) at the time a checkpoint is taken, a checkpoint file will be
 
    created and counted - but containing just an error message, hence this
 
    argument should have a large-enough value to accommodate this possibil-
 
    ity.  The -N argument is NOT inherited by child processes.
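The naming and recycling rules for -C and -N can be reproduced with a small generator. This is a sketch of the rules as described above, not mosrun's implementation; the function name checkpoint_names and the pid parameter (used only for the default ckpt.{pid} name) are hypothetical.

```python
import itertools
import re


def checkpoint_names(base=None, max_versions=0, pid=0):
    """Yield checkpoint file names in the order they would be written.

    base         -- the -C{filename} value, or None for the default name
    max_versions -- the -N{max} value; 0 means unlimited (no recycling)
    """
    if base is None:
        stem, first = f"ckpt.{pid}", 1
    else:
        match = re.fullmatch(r"(.*)\.(\d+)", base)
        if match:                      # name already has a numeric extension
            stem, first = match.group(1), int(match.group(2))
        else:
            stem, first = base, 1
    for index in itertools.count():
        slot = index % max_versions if max_versions > 0 else index
        yield f"{stem}.{first + slot}"
```

Taking the first seven names of checkpoint_names("save.4", max_versions=3) reproduces the save.4, save.5, save.6, save.4 ... sequence given above.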
 
 
    Checkpoints can be triggered by the program itself, by a manual request
 
    (see migrate(1)) and/or at regular time intervals.  The -A{minutes} argu-
 
    ment requests that checkpoints be automatically taken every given number
 
    of minutes.  Note that if the process is within a blocking system-call
 
    (such as reading from a terminal) when the time for a checkpoint comes,
 
    the checkpoint will be delayed until after the completion of that system
 
    call.  Also, when the process is frozen, it will not produce a checkpoint
 
    until unfrozen.  The -A argument is NOT inherited by child processes.
 
 
    With the -R{filename} argument, mosrun recovers and continues to run the
 
    process from its saved checkpoint file.  Program options are not permit-
 
    ted with -R, since their values are recovered from the checkpoint file.
 
 
    It is not always possible (or desirable) for a recovered program to con-
 
    tinue to use the same files that were open at the time of checkpoint:
 
    mosrun -I{filename} inspects a checkpoint file and lists the open files,
 
    along with their modes, flags and offsets, then the -O argument allows
 
    the recovered program to continue using different files.  Files specified
 
    using this option, will be opened (or created) with the previous modes,
 
    flags and offsets.  The argument is usually a comma-separated list of

    pairs, each a file-descriptor integer followed by an '=' sign and a

    file-name.  For example: -O1=oldstdout,2=oldstderr,5=tmpfile.  If

    one or more file-names contain a comma, the argument may instead begin

    with a different separator, for example:
 
    -O@1=file,with,commas@2=oldstderr@5=tmpfile.
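The -O value format can be parsed as sketched below. The function name parse_open_files is hypothetical, and treating any non-digit first character as the alternative separator is an assumption drawn from the two examples above rather than a documented rule.

```python
def parse_open_files(value):
    """Parse the text following -O into a {file_descriptor: file_name} dict."""
    separator = ","
    if value and not value[0].isdigit():
        # Assumption: a leading non-digit character names the separator.
        separator, value = value[0], value[1:]
    mapping = {}
    for item in value.split(separator):
        fd, equals, name = item.partition("=")
        if not equals or not fd.isdigit() or not name:
            raise ValueError(f"bad fd=filename pair: {item!r}")
        mapping[int(fd)] = name
    return mapping
```

With the '@' separator, a file-name such as "file,with,commas" is kept intact instead of being split at its commas.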
 
 
    In the absence of the -O argument, regular files and directories are re-
 
    opened with the previous modes, flags and offsets.
 
 
    Files that were already unlinked at the time of checkpoint, are assumed
 
    to be temporary files belonging to the process, and are also saved and
 
    recovered along with the process (an exception is if an unlinked file was
 
    opened for write-only).  Unpredictable results may occur if such files
 
    are used to communicate with other processes.
 
 
    As for special files (most commonly the user's terminal, used as standard
 
    input, output or error) that were open at the time of checkpoint - if
 
    mosrun is called with their file-descriptors open, then the existing open
 
    files are used (and their modes, flags and offsets are not modified).
 
    Special files that are neither specified in the -O argument, nor open
 
    when calling mosrun, are replaced with /dev/null.
 
 
    While a checkpoint is being taken, the partially-written checkpoint file
 
    has no permissions (chmod 0).  When the checkpoint is complete, its mode
 
    is changed to 0400 (read-only).
 
 
'''QUEUING'''
 
    MOSIX incorporates a queuing system that allows users to submit a number
 
    of jobs that will be scheduled to run when resources are available.
 
    Although the number of queued jobs can be large, it is limited by the
 
    number of Linux processes (about 30000 for all users): to queue more
 
    jobs, see the "RUNNING MULTIPLE JOBS" section below.
 
 
    The queuing system is common to a whole cluster and using it is optional.
 
    It is recommended to adopt a single policy per cluster: either all of its

    users use the queuing system, or none do.  Queued jobs can also be controlled
 
    using mosq(1).
 
 
    The -q argument causes the whole mosrun command to be queued and

    postponed until the queuing system launches it.
 
 
    The letter q may optionally be followed by a non-negative integer, speci-
 
    fying the job's priority - the lower the number, the higher the priority
 
    (in the absence of this number, a pre-configured, per-node default of 50
 
    is used, unless configured otherwise by the system-administrator).
 
 
    Queued programs are not visible in mosps(1), while ps(1) shows them as
 
    "mosqueue".
 
 
    The -Q argument is similar to -q, except that if MOSIX is stopped (or
 
    restarted) while the program is queued, or if the queuing system attempts
 
    to abort the job (see mosq(1)), with -q the program will be killed, while
 
    with -Q it will bypass the queuing system and begin running.
 
 
    The -P{parallel_processes} argument informs the queuing system that the
 
    job may split into a given number of parallel processes (hence more
 
    resources must be reserved for it).
 
 
    The -J{JobID} argument allows bundling of several instances of mosrun
 
    with a single "job" ID for easy identification and manipulation (the con-
 
    cept of what a "job" means is left for each user to define).  "Jobs" can
 
    then be handled collectively by mosq(1), migrate(1), mosps(1) and
 
    moskillall(1).
 
 
    Job-ID's can be either a non-negative integer or a token from the file
 
    $HOME/.jobids: if this file exists, each line in it contains a number
 
    (JobID) followed by a token that can be used as a synonym to that JobID.
 
    The default JobID is 0.
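Resolving a -J value against the contents of $HOME/.jobids could look like the following Python sketch. The function name resolve_job_id is hypothetical, and the assumption that the number and the token on each line are separated by whitespace is not stated on this page.

```python
def resolve_job_id(value, jobids_text=""):
    """Map a -J{JobID} value to its numeric job ID.

    value       -- a non-negative integer string, or a token
    jobids_text -- the contents of $HOME/.jobids (may be empty)
    """
    if value.isdigit():
        return int(value)
    for line in jobids_text.splitlines():
        fields = line.split()          # assumed layout: "<number> <token>"
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == value:
            return int(fields[0])
    raise ValueError(f"unknown job token: {value!r}")
```

A file containing the line "12 nightly" would then let "-Jnightly" stand for job ID 12.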
 
 
    Job ID's are inherited by child processes.
 
 
    This argument is ignored for batch jobs originating from other nodes.
 
 
'''RUNNING MULTIPLE JOBS'''
 
    The -S{maxjobs} option runs multiple command-lines under mosrun, read from

    the file specified by commands-file, each with the given mosrun arguments.
 
 
    This option is commonly used to run the same program with many different
 
    sets of arguments.  For example, the contents of commands-file could be:
 
 
                my_program -a1 < ifile1 > output1
 
                my_program -a2 < ifile2 > output2
 
                my_program -a3 < ifile3 > output3
 
 
    Command-lines are started in the order they appear in commands-file.
 
    While the number of command-lines is unlimited, mosrun will run concur-
 
    rently up to maxjobs (1-30000) command-lines at any given time: when any
 
    command-line terminates, a new command-line is started.
 
 
    Command lines are interpreted by the standard shell (bash(1)).  Please
 
    note that bash has the property that when redirection is used, it spawns
 
    a son-process to run the command: if the number of processes is an issue,
 
    it is recommended to prepend the keyword exec before each command line
 
    that uses redirection.  For example:
 
 
                exec my_program -a1 < ifile1 > output1
 
                exec my_program -a2 < ifile2 > output2
 
                exec my_program -a3 < ifile3 > output3
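A commands-file like the one above is easy to generate. The Python sketch below produces the same lines; the function name command_lines and the -a{i} / ifile{i} / output{i} naming scheme simply follow the example and are otherwise hypothetical.

```python
def command_lines(program, count, use_exec=True):
    """Return command-lines for a commands-file, one per parameter set.

    use_exec prepends "exec" so that bash does not keep an extra
    process around for each redirected command.
    """
    prefix = "exec " if use_exec else ""
    return [
        f"{prefix}{program} -a{i} < ifile{i} > output{i}"
        for i in range(1, count + 1)
    ]
```

Joining the result with newlines and writing it to a file gives a commands-file suitable for mosrun -S{maxjobs}.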
 
 
    The exit status of mosrun -S{maxjobs} is the number of command-lines that
 
    failed (255 if more than 255 command-lines failed).
 
 
    As a further option, the commands-file argument can be followed by a
 
    comma and another file-name: commands-file,failed-commands.  Mosrun will
 
    create the second file and write to it the list of all the commands (if
 
    any) that failed (this provides an easy way to re-run only those commands
 
    that failed).
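The -S invocation and its exit status can be summarised in a short Python sketch. The helper names multi_job_argv and exit_status_for are hypothetical; the argument layout follows the SYNOPSIS and the 1-30000 range and 255 cap follow the text above. Nothing here executes mosrun.

```python
def multi_job_argv(maxjobs, commands_file, failed_file=None, options=()):
    """Build the argv for mosrun -S{maxjobs} {commands-file}[,{failed-file}]."""
    if not 1 <= maxjobs <= 30000:
        raise ValueError("maxjobs must be between 1 and 30000")
    target = commands_file
    if failed_file is not None:
        target = f"{commands_file},{failed_file}"
    return ["mosrun", f"-S{maxjobs}", *options, target]


def exit_status_for(failed_count):
    """Exit status of mosrun -S: the number of failures, capped at 255."""
    return min(failed_count, 255)
```

To re-run only the failures, the failed-file written by one run can be passed as the commands-file of the next run.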
 
 
    The -S{maxjobs} option combines well with the queuing system (the -q
 
    argument), setting an absolute upper limit on the number of simultaneous
 
    jobs whereas the number of jobs allowed to run by the queuing system
 
    depends on the available grid-resources.  With this combination, to pre-
 
    vent an unnecessary and excessive number of waiting processes, no more
 
    than 10 jobs will be queued at any given moment.
 
 
'''PRIVATE TEMPORARY FILES'''
 
    Normally, all files are created on the home-node and all file-operations
 
    are performed there.  This is important because programs often share
 
    files, but can be costly: many programs use temporary files which they
 
    never share - they create those files as secondary-memory and discard
 
    them when they terminate.  It is best to migrate such files with the pro-
 
    cess rather than keep them in the home-node.
 
 
    The -X {/directory} argument tells Mosrun that a given directory is only
 
    used for private temporary files: all files that the program creates in
 
    this directory are kept with the process that created them and migrate
 
    with it.
 
 
    The -X argument may be repeated, specifying up to 10 private temporary
 
    directories.  The directories must start with '/'; can be up to 256 char-
 
    acters long; cannot include ".."; and for security reasons cannot be
 
    within "/etc", "/proc", "/sys" or "/dev".
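The restrictions on -X directories can be checked before calling mosrun. The Python sketch below encodes the rules from the paragraph above; the function name private_dir_options is hypothetical, and reading "cannot include '..'" as "no path component equals '..'" is an assumption.

```python
FORBIDDEN_TOPS = ("/etc", "/proc", "/sys", "/dev")


def private_dir_options(directories):
    """Validate private temporary directories and return -X arguments."""
    if len(directories) > 10:
        raise ValueError("at most 10 private temporary directories")
    for directory in directories:
        if not directory.startswith("/"):
            raise ValueError(f"{directory!r} must start with '/'")
        if len(directory) > 256:
            raise ValueError(f"{directory!r} is longer than 256 characters")
        if ".." in directory.split("/"):
            raise ValueError(f"{directory!r} must not include '..'")
        clean = directory.rstrip("/")
        for top in FORBIDDEN_TOPS:
            if clean == top or clean.startswith(top + "/"):
                raise ValueError(f"{directory!r} is within {top}")
    return [f"-X{directory}" for directory in directories]
```

For example, private_dir_options(["/tmp/scratch"]) returns ["-X/tmp/scratch"], while "/etc/x" or "/tmp/../etc" is rejected.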
 
 
    Only regular files are permitted within private temporary directories: no
 
    sub-directories, links, symbolic-links or special files are allowed
 
    (except that sub-directories can be specified by an extra -X argument).
 
 
    Private temporary file names must begin with '/' (no relative pathnames)
 
    and contain no ".." components.  The only file operations currently sup-
 
    ported for private temporary files are: open, creat, lseek, read, write,
 
    close, chmod, fchmod, unlink, truncate, ftruncate, access, stat.
 
 
    File-access permissions on private temporary files are provided for com-
 
    patibility, but are not enforced: the stat(2) system-call returns 0 in
 
    st_uid and st_gid.  stat(2) also returns the file-modification times
 
    according to the node where the process was running when making the last
 
    change to the file.
 
 
    The per-process maximum total size of all private temporary files is set
 
    by the system-administrator.  Different maximum values can be imposed
 
    when running on the home-node, in the local cluster and on the grid -
 
    exceeding this maximum will cause a process to migrate back to its home-
 
    node.
 
 
'''ALTERNATIVE FREEZING SPACE'''
 
    MOSIX processes can sometimes be frozen (you can freeze your processes
 
    manually and the system-administrator usually sets an automatic-freezing
 
    policy - See mosix(7)).
 
 
    The memory-image of frozen processes is saved to disk.  Normally the sys-
 
    tem-administrator determines where on disk to store your frozen pro-
 
    cesses, but you can override this default and set your own freezing-
 
    space.  One possible reason to do so is to ensure that your processes (or
 
    some of them) have sufficient freezing space regardless of what other
 
    users do.  Another possible reason is to protect other users if you
 
    believe that your processes (or some of them) may require so much memory
 
    that they could disturb other users.
 
 
    Setting your own freezing space can be done either by setting the envi-
 
    ronment-variable FREEZE_DIR to an alternative directory (starting with
 
    '/'); or if you wish to specify more than one freeze-directory, by creat-
 
    ing a file: $HOME/.freeze_dirs where each line contains a directory-name
 
    starting with '/'.  For more details, read about "lines starting with
 
    '/'" within the section about configuring /etc/mosix/freeze.conf in the
 
    mosix(7) manual.
 
 
    You must have write-access to your alternative freeze-directory(s).
 
    The space available in alternative freeze-directories is subject to pos-
 
    sible disk quotas.
 
 
'''RECURSIVE MOSRUN'''
 
    It is possible to run mosrun within an already-running mosrun: this can
 
    happen, for example, when a shell-script that contains calls to mosrun is
 
    itself run by mosrun, or when running mosrun make with a Makefile that
 
    contains calls to mosrun.
 
 
    The following arguments (and only those) of the outer mosrun will be pre-
 
    served by the inner mosrun (unless the inner mosrun explicitly requests
 
    otherwise): -c, -d, -e, -J, -G, -L, -l, -m, -n, -T, -t, -u, -w.
 
 
'''LIMITATIONS'''
 
    32-bit processes must have a 32-bit home-node (but they can be assigned
 
    or migrated to 64-bit nodes).  Attempts to execute a 32-bit binary under
 
    a 64-bit home-node will turn the process into a native Linux process (and
 
    if that process has open private-temporary-files or uses direct communi-
 
    cation, it will be killed).  Obviously, 64-bit processes cannot run on
 
    32-bit nodes.
 
 
    Some system-calls are not supported by mosrun, including system-calls
 
    that are tightly connected to resources of the local node or intended for
 
    system-administration.  These are:
 
 
    acct, add_key, adjtimex, afs_syscall(x86_64), alloc_hugepages(i386),
 
    bdflush, capget, capset, chroot, clock_getres, clock_nanosleep,
 
    clock_settime, create_module(x86_64), delete_module, epoll_create,
 
    epoll_ctl, epoll_pwait, epoll_wait, eventfd, fadvise,
 
    free_hugepages(i386), futex, get_kernel_syms(x86_64), get_mempolicy,
 
    get_robust_list, getcpu, getpmsg(x86_64), init_module, inotify_add_watch,
 
    inotify_init, inotify_rm_watch, io_cancel, io_destroy, io_getevents,
 
    io_setup, io_submit, ioperm, iopl, ioprio_get, ioprio_set,
 
    kexec_load(x86_64), keyctl, lookup_dcookie, madvise, mbind,
 
    migrate_pages, mlock, mlockall, move_pages, mq_getsetattr, mq_notify,
 
    mq_open, mq_timedreceive, mq_timedsend, mq_unlink, munlock, munlockall,
 
    nfsservctl, personality, pivot_root, ptrace, quotactl, reboot,
 
    remap_file_pages, request_key, rt_sigqueueinfo, rt_sigtimedwait,
 
    sched_get_priority_max, sched_get_priority_min, sched_getaffinity,
 
    sched_getparam, sched_getscheduler, sched_rr_get_interval,
 
    sched_setaffinity, sched_setparam, sched_setscheduler, security(x86_64),
 
    set_mempolicy, setdomainname, sethostname, set_robust_list, settimeofday,
 
    shmat, signalfd, swapoff, swapon, syslog, timer_create, timer_delete,
 
    timer_getoverrun, timer_gettime, timer_settime, timerfd, tuxcall(x86_64),
 
    unshare, uselib, vm86(i386), vmsplice, waitid.
 
 
    In addition, mosrun supports only limited options for the following sys-
 
    tem-calls:
 
 
    clone  The only permitted flags are CLONE_CHILD_SETTID, CLONE_PARENT_SET-
 
            TID, CLONE_CHILD_CLEARTID, and the combination
 
            CLONE_VFORK|CLONE_VM; the child-termination signal must be SIGCLD
 
            and the stack-pointer (child_stack) must be NULL.
 
    getpriority
 
            may refer only to the calling process.
 
    ioctl  The following requests are not supported: TIOCSERGSTRUCT, TIOCSER-
 
            GETMULTI, TIOCSERSETMULTI, SIOCSIFFLAGS, SIOCSIFMETRIC, SIOC-
 
            SIFMTU, SIOCSIFMAP, SIOCSIFHWADDR, SIOCSIFSLAVE, SIOCADDMULTI,
 
            SIOCDELMULTI, SIOCSIFHWBROADCAST, SIOCSIFTXQLEN, SIOCSMIIREG,
 
            SIOCBONDENSLAVE, SIOCBONDRELEASE, SIOCBONDSETHWADDR, SIOCBOND-
 
            SLAVEINFOQUERY, SIOCBONDINFOQUERY, SIOCBONDCHANGEACTIVE, SIOCBRAD-
 
            DIF, SIOCBRDELIF.  Non-standard requests that are defined in
 
            drivers that are not part of the standard Linux kernel are also
 
            likely to not be supported.
 
    ipc    the following SYSV-IPC calls are not supported: shmat, semtimedop,
 
            new-version calls (bit 16 set in call-number).
 
    mmap  MAP_SHARED and mapping of special-character devices are not per-
 
            mitted.
 
    prctl  only the PR_SET_DEATHSIG and PR_GET_DEATHSIG options are sup-
 
            ported.
 
    setpriority
 
            may refer only to the calling process.
 
    setrlimit
 
            it is not permitted to modify the maximum number of open files
 
            (RLIMIT_NOFILE): mosrun fixes this limit at 1024.
 
 
    Programs that fail to run because they call an unsupported system-call
 
    can still run in batch mode ('mosrun -E').
 
 
    Users are not permitted to send the SIGSTOP signal to MOSIX processes:
 
    SIGTSTP should be used instead (and moskillall(1) changes SIGSTOP to
 
    SIGTSTP).
 
 
'''SEE ALSO'''
 
    migrate(1), mosq(1), moskillall(1), mosps(1), direct_communication(7),
 
    mosix(7).
 
 
MOSIX                              May 2006                              MOSIX
 
 
 
 
 
 
 
 
MOSRUN(M1)                      MOSIX Commands                    MOSRUN(M1)
 
 
    
 
    
 
  '''NAME'''
 
  '''NAME'''
Line 536: Line 23:
 
    
 
    
 
               [-m{mb}] [-d {0-10000}] [-c] [-n] [-z] [-e] [-u] [-w] [-t] [-T]
 
               [-m{mb}] [-d {0-10000}] [-c] [-n] [-z] [-e] [-u] [-w] [-t] [-T]
               [-E[/{cwd}]] [-M[/{cwd}]] [-C{filename}] [-X{/directory}]...
+
               [-E[/{cwd}]] [-M[/{cwd}]] [-i] [-C{filename}] [-X{/directory}]...
   
+
 
 +
 
 
  '''DESCRIPTION'''
 
  '''DESCRIPTION'''
 
     Mosrun runs a program under the MOSIX discipline: this means that pro-
 
     Mosrun runs a program under the MOSIX discipline: this means that pro-
Line 546: Line 34:
 
     Once running under MOSIX, the program and all its child-processes remain
 
     Once running under MOSIX, the program and all its child-processes remain
 
     under the MOSIX discipline, with the exception of the native utility,
 
     under the MOSIX discipline, with the exception of the native utility,
     which allows programs (mainly shells) that already run under mosrun to
+
     that allows programs (mainly shells) that already run under mosrun to
 
     spawn children that run in native Linux mode.
 
     spawn children that run in native Linux mode.
 
    
 
    
Line 574: Line 62:
 
     (usually a shell) that calls mosrun.
 
     (usually a shell) that calls mosrun.
 
    
 
    
     The -G argument states that the program is to be allowed to migrate to
+
     The -G argument states that the program should be allowed to migrate to
     grid-wide nodes rather than only within the local cluster.  This argument
+
     nodes in other partitions and clusters within the grid, rather than only
    may be followed by a positive integer, -G[{class}] that specify the pro-
+
    within the local partition.  This argument may be followed by a positive
    gram's class: when that number is omitted, the class of the program is
+
    integer, -G[{class}] that specifies the program's class: when that number
    assumed to be 1.  It is also possible to specify -G0, meaning that the
+
    is omitted, the class of the program is assumed to be 1.  It is also pos-
    program may not migrate outside the local cluster (this is useful when -G
+
    sible to specify -G0, meaning that the program may not migrate outside
    was already applied the calling program).
+
    the local partition (this is useful when -G was already applied to the
 +
    calling program).
 
    
 
    
 
     The -D{timespec} allows the user to provide an estimate on how long their
 
     The -D{timespec} allows the user to provide an estimate on how long their
Line 640: Line 129:
 
     otherwise, the arguments (if any) are assumed to begin at argument #1 and
 
     otherwise, the arguments (if any) are assumed to begin at argument #1 and
 
     argument #0 is assumed to be identical to the program-name.
 
     argument #0 is assumed to be identical to the program-name.
 
+
 
     mosrun can send batch jobs to other nodes of the local cluster. There
+
     mosrun can send batch jobs to other nodes of the local cluster-partition.
     are two types of batch jobs: those produced by the -E argument are native
+
     There are two types of batch jobs: those produced by the -E argument are
     Linux jobs, while those produced by the -M argument are MOSIX jobs - but
+
     native Linux jobs, while those produced by the -M argument are MOSIX jobs
     possibly with a different home-node.
+
     - but possibly with a different home-node.
 
    
 
    
 
     Batch jobs are executed from binaries in another node and preserve only
 
     Batch jobs are executed from binaries in another node and preserve only
Line 663: Line 152:
 
     tory on the other node.  It is possible to override that path by specify-
 
     tory on the other node.  It is possible to override that path by specify-
 
     ing a different directory in the -E{/cwd} or -M{/cwd} arguments.
 
     ing a different directory in the -E{/cwd} or -M{/cwd} arguments.
 
+
 
 +
    The -i flag states that all the standard-input of a batch job is for its
 +
    exclusive use: it is especially recommended when the input of a batch job
 +
    is redirected from a file.  Programs that use poll(2) or select(2) to
 +
    check for input before reading from their standard-input can only work in
 +
    batch mode with the -i flag.  This flag can also improve the performance.
 +
    An example when the -i flag cannot be used, is when an interactive shell
 +
    places a batch job in the background (because typed input that is
 +
    intended for the shell may go to the batch job instead).
 +
   
 
     MOSIX-specific arguments (-G, -F, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T,
 
     MOSIX-specific arguments (-G, -F, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T,
 
     -A, -N, -C), do not apply to native Linux batch jobs that are started
 
     -A, -N, -C), do not apply to native Linux batch jobs that are started
Line 744: Line 242:
 
     process from its saved checkpoint file.  Program options are not permit-
 
     process from its saved checkpoint file.  Program options are not permit-
 
     ted with -R, since their values are recovered from the checkpoint file.
 
     ted with -R, since their values are recovered from the checkpoint file.
 
+
 
 
     It is not always possible (or desirable) for a recovered program to con-
 
     It is not always possible (or desirable) for a recovered program to con-
 
     tinue to use the same files that were open at the time of checkpoint:
 
     tinue to use the same files that were open at the time of checkpoint:
Line 757: Line 255:
 
     ment with a different separator, for example:
 
     ment with a different separator, for example:
 
     -O@1=file,with,commas@2=oldstderr@5=tmpfile.
 
     -O@1=file,with,commas@2=oldstderr@5=tmpfile.
 
+
 
 
     In the absence of the -O argument, regular files and directories are re-
 
     In the absence of the -O argument, regular files and directories are re-
 
     opened with the previous modes, flags and offsets.
 
     opened with the previous modes, flags and offsets.
Line 784: Line 282:
 
     number of Linux processes (about 30000 for all users): to queue more
 
     number of Linux processes (about 30000 for all users): to queue more
 
     jobs, see the "RUNNING MULTIPLE JOBS" section below.
 
     jobs, see the "RUNNING MULTIPLE JOBS" section below.
 
+
   
     The queuing system is common to a whole cluster and using it is optional.
+
     The queuing system is common to each cluster-partition and using it is
     It is recommended that a policy is decided where either all the users of
+
     optional. It is recommended that a policy is decided where either all the
     a cluster use it, or all do not.  Queued jobs can also be controlled
+
     users of a cluster use it, or all do not.  Queued jobs can also be con-
     using mosq(1).
+
     trolled using mosq(1).
 
+
   
 
     The -q argument causes the whole mosrun command to be queued and post-
 
     The -q argument causes the whole mosrun command to be queued and post-
 
     poned until the queuing system launch it.
 
     poned until the queuing system launch it.
 
+
 
 
     The letter q may optionally be followed by a non-negative integer, speci-
 
     The letter q may optionally be followed by a non-negative integer, speci-
 
     fying the job's priority - the lower the number, the higher the priority
 
     fying the job's priority - the lower the number, the higher the priority
 
     (in the absence of this number, a pre-configured, per-node default of 50
 
     (in the absence of this number, a pre-configured, per-node default of 50
 
     is used, unless configured otherwise by the system-administrator).
 
     is used, unless configured otherwise by the system-administrator).
 
+
     
     Queued programs are not visible in mosps(1), while ps(1) shows them as
+
     Queued programs are shown by mosps(1) and ps(1) as "mosqueue".
    "mosqueue".
+
     
 
+
 
     The -Q argument is similar to -q, except that if MOSIX is stopped (or
 
     The -Q argument is similar to -q, except that if MOSIX is stopped (or
 
     restarted) while the program is queued, or if the queuing system attempts
 
     restarted) while the program is queued, or if the queuing system attempts
 
     to abort the job (see mosq(1)), with -q the program will be killed, while
 
     to abort the job (see mosq(1)), with -q the program will be killed, while
 
     with -Q it will bypass the queuing system and begin running.
 
     with -Q it will bypass the queuing system and begin running.
 
+
 
 
     The -P{parallel_processes} argument informs the queuing system that the
 
     The -P{parallel_processes} argument informs the queuing system that the
 
     job may split into a given number of parallel processes (hence more
 
     job may split into a given number of parallel processes (hence more
 
     resources must be reserved for it).
 
     resources must be reserved for it).
 
+
 
 
     The -J{JobID} argument allows bundling of several instances of mosrun
 
     The -J{JobID} argument allows bundling of several instances of mosrun
 
     with a single "job" ID for easy identification and manipulation (the con-
 
     with a single "job" ID for easy identification and manipulation (the con-
Line 815: Line 312:
 
     then be handled collectively by mosq(1), migrate(1), mosps(1) and
 
     then be handled collectively by mosq(1), migrate(1), mosps(1) and
 
     moskillall(1).
 
     moskillall(1).
 
+
 
 
     Job-ID's can be either a non-negative integer or a token from the file
 
     Job-ID's can be either a non-negative integer or a token from the file
 
     $HOME/.jobids: if this file exists, each line in it contains a number
 
     $HOME/.jobids: if this file exists, each line in it contains a number
 
     (JobID) followed by a token that can be used as a synonym to that JobID.
 
     (JobID) followed by a token that can be used as a synonym to that JobID.
 
     The default JobID is 0.
 
     The default JobID is 0.
 
+
 
 
     Job ID's are inherited by child processes.
 
     Job ID's are inherited by child processes.
 
+
 
 
     This argument is ignored for batch jobs originating from other nodes.
 
     This argument is ignored for batch jobs originating from other nodes.
 
    
 
    
Line 899: Line 396:
 
     according to the node where the process was running when making the last
 
     according to the node where the process was running when making the last
 
     change to the file.
 
     change to the file.
 
+
 
 
     The per-process maximum total size of all private temporary files is set
 
     The per-process maximum total size of all private temporary files is set
 
     by the system-administrator.  Different maximum values can be imposed
 
     by the system-administrator.  Different maximum values can be imposed
Line 905: Line 402:
 
     exceeding this maximum will cause a process to migrate back to its home-
 
     exceeding this maximum will cause a process to migrate back to its home-
 
     node.
 
     node.
 
+
 
 
  '''ALTERNATIVE FREEZING SPACE'''
 
  '''ALTERNATIVE FREEZING SPACE'''
 
     MOSIX processes can sometimes be frozen (you can freeze your processes
 
     MOSIX processes can sometimes be frozen (you can freeze your processes
 
     manually and the system-administrator usually sets an automatic-freezing
 
     manually and the system-administrator usually sets an automatic-freezing
 
     policy - See mosix(7)).
 
     policy - See mosix(7)).
 
+
 
 
     The memory-image of frozen processes is saved to disk.  Normally the sys-
 
     The memory-image of frozen processes is saved to disk.  Normally the sys-
 
     tem-administrator determines where on disk to store your frozen pro-
 
     tem-administrator determines where on disk to store your frozen pro-
Line 919: Line 416:
 
     believe that your processes (or some of them) may require so much memory
 
     believe that your processes (or some of them) may require so much memory
 
     that they could disturb other users.
 
     that they could disturb other users.
 
+
 
 
     Setting your own freezing space can be done either by setting the envi-
 
     Setting your own freezing space can be done either by setting the envi-
 
     ronment-variable FREEZE_DIR to an alternative directory (starting with
 
     ronment-variable FREEZE_DIR to an alternative directory (starting with
Line 927: Line 424:
 
     '/'" within the section about configuring /etc/mosix/freeze.conf in the
 
     '/'" within the section about configuring /etc/mosix/freeze.conf in the
 
     mosix(7) manual.
 
     mosix(7) manual.
 
+
 
 
     You must have write-access to your alternative freeze-directory(s).
 
     You must have write-access to your alternative freeze-directory(s).
 
     The space available in alternative freeze-directories is subject to pos-
 
     The space available in alternative freeze-directories is subject to pos-
 
     sible disk quotas.
 
     sible disk quotas.
 
+
   
 
  '''RECURSIVE MOSRUN'''
 
  '''RECURSIVE MOSRUN'''
 
     It is possible to run mosrun within an already-running mosrun: this can
 
     It is possible to run mosrun within an already-running mosrun: this can
Line 937: Line 434:
 
     itself run by mosrun, or when running mosrun make with a Makefile that
 
     itself run by mosrun, or when running mosrun make with a Makefile that
 
     contains calls to mosrun.
 
     contains calls to mosrun.
 
+
 
 
     The following arguments (and only those) of the outer mosrun will be pre-
 
     The following arguments (and only those) of the outer mosrun will be pre-
 
     served by the inner mosrun (unless the inner mosrun explicitly requests
 
     served by the inner mosrun (unless the inner mosrun explicitly requests
 
     otherwise): -c, -d, -e, -J, -G, -L, -l, -m, -n, -T, -t, -u, -w.
 
     otherwise): -c, -d, -e, -J, -G, -L, -l, -m, -n, -T, -t, -u, -w.
 
+
   
 +
'''FOR THE SYSTEM ADMINISTRATOR'''
 +
    Some installations want to restrict access to mosrun, or control its
 +
    allowed parameters according to local policies (for example, enforce
 +
    queuing).  If you want to do this:
 +
   
 +
    1.  Allocate a special (preferably new) user-group for mosrun (we shall
 +
        call it "mos" for the instructions below).
 +
    2.  chgrp mos /bin/mosrun
 +
    3.  chmod 4750 /bin/mosrun
 +
    4.  Write a wrapper program which receives the same parameters as
 +
        "mosrun", then checks and/or modifies its parameters according to the
 +
        desired local policies, then executes:
 +
        /bin/mosrun -g {mosrun-parameters}
 +
    5.  chgrp mos /bin/wrapper
 +
    6.  chmod 2755 /bin/wrapper
 +
    7.  Tell your users to use "wrapper" (or any other name you choose)
 +
        instead of "mosrun".
 +
     
 +
   
 
  '''LIMITATIONS'''
 
  '''LIMITATIONS'''
 +
    32-bit processes must have a 32-bit home-node (but they can be assigned
 +
    or migrated to 64-bit nodes).  Attempts to execute a 32-bit binary under
 +
    a 64-bit home-node will turn the process into a native Linux process (and
 +
    if that process has open private-temporary-files or uses direct communi-
 +
    cation, it will be killed).  Obviously, 64-bit processes cannot run on
 +
    32-bit nodes.
 +
 
 +
    Batch jobs from 64-bit nodes are currently not permitted to run on 32-bit
 +
    nodes.
 +
   
 
     Some system-calls are not supported by mosrun, including system-calls
 
     Some system-calls are not supported by mosrun, including system-calls
 
     that are tightly connected to resources of the local node or intended for
 
     that are tightly connected to resources of the local node or intended for
 
     system-administration.  These are:
 
     system-administration.  These are:
 
+
 
 
     acct, add_key, adjtimex, afs_syscall(x86_64), alloc_hugepages(i386),
 
     acct, add_key, adjtimex, afs_syscall(x86_64), alloc_hugepages(i386),
 
     bdflush, capget, capset, chroot, clock_getres, clock_nanosleep,
 
     bdflush, capget, capset, chroot, clock_getres, clock_nanosleep,
 
     clock_settime, create_module(x86_64), delete_module, epoll_create,
 
     clock_settime, create_module(x86_64), delete_module, epoll_create,
     epoll_ctl, epoll_pwait, epoll_wait, eventfd, fadvise,
+
     epoll_ctl, epoll_pwait, epoll_wait, eventfd, free_hugepages(i386), futex,
    free_hugepages(i386), futex, get_kernel_syms(x86_64), get_mempolicy,
+
    get_kernel_syms(x86_64), get_mempolicy, get_robust_list, getcpu,
    get_robust_list, getcpu, getpmsg(x86_64), init_module, inotify_add_watch,
+
    getpmsg(x86_64), init_module, inotify_add_watch, inotify_init, ino-
     inotify_init, inotify_rm_watch, io_cancel, io_destroy, io_getevents,
+
     tify_rm_watch, io_cancel, io_destroy, io_getevents, io_setup, io_submit,
    io_setup, io_submit, ioperm, iopl, ioprio_get, ioprio_set,
+
    ioperm, iopl, ioprio_get, ioprio_set, kexec_load(x86_64), keyctl,
    kexec_load(x86_64), keyctl, lookup_dcookie, madvise, mbind,
+
    lookup_dcookie, madvise, mbind, migrate_pages, mlock, mlockall,
    migrate_pages, mlock, mlockall, modify_ldt, move_pages, mq_getsetattr,
+
    move_pages, mq_getsetattr, mq_notify, mq_open, mq_timedreceive, mq_timed-
    mq_notify, mq_open, mq_timedreceive, mq_timedsend, mq_unlink, munlock,
+
    send, mq_unlink, munlock, munlockall, nfsservctl, personality,
    munlockall, nfsservctl, personality, pivot_root, ptrace, quotactl,
+
    pivot_root, ptrace, quotactl, reboot, remap_file_pages, request_key,
    reboot, remap_file_pages, request_key, rt_sigqueueinfo, rt_sigtimedwait,
+
    rt_sigqueueinfo, rt_sigtimedwait, sched_get_priority_max, sched_get_pri-
     sched_get_priority_max, sched_get_priority_min, sched_getaffinity,
+
     ority_min, sched_getaffinity, sched_getparam, sched_getscheduler,
    sched_getparam, sched_getscheduler, sched_rr_get_interval,
+
     sched_rr_get_interval, sched_setaffinity, sched_setparam, sched_setsched-
     sched_setaffinity, sched_setparam, sched_setscheduler, security(x86_64),
+
    uler, security(x86_64), set_mempolicy, setdomainname, sethostname,
    set_mempolicy, setdomainname, sethostname, set_robust_list, settimeofday,
+
    set_robust_list, settimeofday, shmat, signalfd, swapoff, swapon, syslog,
    shmat, signalfd, swapoff, swapon, syslog, timer_create, timer_delete,
+
    timer_create, timer_delete, timer_getoverrun, timer_gettime, timer_set-
    timer_getoverrun, timer_gettime, timer_settime, timerfd, tuxcall(x86_64),
+
    time, timerfd,  timerfd_gettime, timerfd_settime, tuxcall(x86_64),
 
     unshare, uselib, vm86(i386), vmsplice, waitid.
 
     unshare, uselib, vm86(i386), vmsplice, waitid.
 
+
   
 
     In addition, mosrun supports only limited options for the following sys-
 
     In addition, mosrun supports only limited options for the following sys-
 
     tem-calls:
 
     tem-calls:
Line 1,006: Line 532:
 
    
 
    
 
  '''SEE ALSO'''
 
  '''SEE ALSO'''
     migrate(1), mosq(1), moskillall(1), mosps(1), mosix(7).
+
     migrate(1), mosq(1), moskillall(1), mosps(1), direct_communication(7),
 +
    mosix(7).
 
    
 
    
  MOSIX                              May 2006                            MOSIX
+
  MOSIX                              February 2009                              MOSIX

Latest revision as of 10:37, 22 February 2009

MOSRUN(M1)                      MOSIX Commands                      MOSRUN(M1)
 
NAME
    MOSRUN - Running MOSIX programs
 
SYNOPSIS
    mosrun [location_options] [program_options] program [args] ...
    mosrun -S{maxjobs} [location_options] [program_options] {commands-file}
           [,{failed-file}]
    mosrun -R{filename} [-O{fd=filename}[,{fd2=fn2}]]...  [location_options]
    mosrun -I{filename}
    mosenv { same-arguments-as-mosrun }
    native program [args]...
 
           Location options:
 
              [-r{hostname} | -{a.b.c.d} | -{n} | -h | -b |
              -jID1-ID2[,ID3-ID4]... }] [-G[{class}]] [-F] [-L] [-l]
              [-D{DD:HH:MM}] [-A{minutes}] [-N{max}] [-{q|Q}[{pri}]]
              [-P{parallel_processes}] [-J{JobID}]
 
           Program Options:
 
              [-m{mb}] [-d {0-10000}] [-c] [-n] [-z] [-e] [-u] [-w] [-t] [-T]
              [-E[/{cwd}]] [-M[/{cwd}]] [-i] [-C{filename}] [-X{/directory}]...
 
 
DESCRIPTION
    Mosrun runs a program under the MOSIX discipline: this means that pro-
    grams activated by mosrun can potentially migrate to other nodes within
    the cluster or grid (see mosix(7)): programs that are not started by
    mosrun, run in "native" Linux mode and cannot migrate.
 
    Once running under MOSIX, the program and all its child-processes remain
    under the MOSIX discipline, with the exception of the native utility,
    that allows programs (mainly shells) that already run under mosrun to
    spawn children that run in native Linux mode.
 
    The following arguments may be used to specify the program's initial
    assignment:
 
    -r{hostname}            on the given host
    -{a.b.c.d}              on the given IP address
    -{n}                    on the given node-number
    -h                      on the home-node
    -b                      the program attempts to select the best node
    -jID1-ID2[,ID3-ID4]...  select at random from the given list of hosts,
                            IP's and/or node numbers.
 
    When none of the above arguments is used, the program will start wherever
    its parent process is running.
 
    The -F flag states that mosrun should start the program somewhere else,
    even if the requested node (above) is not available.
 
    The -L flag states that the program should not be allowed to migrate
    automatically.  It may still be migrated manually or when situations
    arise that do not allow it to continue running where it is.
 
    The -l flag negates the -L flag and allows the program to migrate auto-
    matically: this is useful when -L was already applied to the program
    (usually a shell) that calls mosrun.
 
    The -G argument states that the program should be allowed to migrate to
    nodes in other partitions and clusters within the grid, rather than only
    within the local partition.  This argument may be followed by a positive
    integer, -G[{class}], that specifies the program's class: when that
    number is omitted, the class of the program is assumed to be 1.  It is
    also possible to specify -G0, meaning that the program may not migrate
    outside the local partition (this is useful when -G was already applied
    to the calling program).
 
    The -D{timespec} argument allows the user to provide an estimate of how
    long their job should run.  MOSIX does not use this information - it is
    provided in order to help mosps(1) keep track of processes.  timespec can
    be specified in any of the following formats (DD/HH/MM are numeric for
    days, hours and minutes respectively): DD:HH:MM; HH:MM; DDd; HHh; MMm;
    DDdHHhMMm; DDdHHh; DDdMMm; HHhMMm.  Periods when the process is frozen
    are automatically added to that estimate.
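
    The timespec formats listed above can be illustrated with a small
    Python sketch that converts each accepted form into minutes.  This is
    not mosrun's own parser (which is not shown here); the function name
    timespec_to_minutes and the regular expression are assumptions made
    for the example only.

```python
import re

# Illustrative only: accepts DDd, HHh, MMm and their documented combinations.
_UNIT_FORM = re.compile(r"^(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?$")


def timespec_to_minutes(spec):
    """Convert a -D style timespec into a total number of minutes."""
    if ":" in spec:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 3:      # DD:HH:MM
            days, hours, minutes = parts
        elif len(parts) == 2:    # HH:MM
            days = 0
            hours, minutes = parts
        else:
            raise ValueError("expected DD:HH:MM or HH:MM: %r" % spec)
    else:
        match = _UNIT_FORM.match(spec)
        if not spec or match is None:
            raise ValueError("unrecognised timespec: %r" % spec)
        # Missing units (for example no hours in DDdMMm) count as zero.
        days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return (days * 24 + hours) * 60 + minutes
```

    For instance, under these assumptions both 01:02:03 and 1d2h3m
    describe the same estimate of one day, two hours and three minutes.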
 
    The -m{mb} argument states that the program requires a certain amount of
    memory (in Megabytes) and should not run with less.  This has the effect
    of:
    1. Combined with the -b flag, the program will only consider starting on
       nodes with at least {mb} Megabytes of available memory: the program
       will not even start until at least one such node is found.
    2. The program will not automatically migrate to nodes with less than
       {mb} Megabytes free memory (with the exception of the home node, when
       the program must move back home).
    3. The queuing system (see below) will take the program's memory require-
       ments into account when deciding which and how many jobs to allow to
       run at any point in time.
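
    As a usage illustration, the following Python sketch assembles a
    mosrun command line that combines -b with -m{mb}, as described in
    item 1 above.  The helper name mosrun_argv and its parameters are
    hypothetical; only the -b and -m flags themselves come from this
    manual, and the sketch merely builds the argument list without
    running anything.

```python
def mosrun_argv(program, args=(), best_node=False, min_memory_mb=None):
    """Build an argument list of the form: mosrun [-b] [-m{mb}] program [args]."""
    argv = ["mosrun"]
    if best_node:
        argv.append("-b")                       # let the program pick the best node
    if min_memory_mb is not None:
        argv.append("-m%d" % min_memory_mb)     # required memory in Megabytes
    argv.append(program)
    argv.extend(args)
    return argv
```

    The resulting list could be handed to a process-launching routine
    such as subprocess.run on a node where mosrun is installed.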
 
    Most system-calls are supported by MOSIX, but a few are not (such as map-
    ping shared memory or cloning - see the "LIMITATIONS" section below).  By
    default, when a program under mosrun encounters an unsupported system-
    call, it is killed.  The -e flag, however, allows the program to continue
    and behave as follows:
 
    1. mmap(2) with (flags & MAP_SHARED) - but !(prot & PROT_WRITE), replaces
       the MAP_SHARED with MAP_PRIVATE (this combination seems unusual or
       even faulty, but is unnecessarily used within some Linux libraries).
 
    2. All other unsupported system-calls return -1 and "errno" is set to
       ENOSYS.
 
    The -w flag is the same as -e, but it also causes mosrun to print an
    error message to the standard-error when an unsupported system-call is
    encountered.  The -u flag returns to the default of killing the process.
 
    System calls and I/O operations are monitored and taken into account in
    automatic migration considerations, tending to pull processes towards
    their home-nodes.  The -c flag tells mosrun not to take system calls and
    I/O operations into account in migration considerations.  The -n flag
    reverts to taking them into account.
 
    Even when running elsewhere, programs running under MOSIX obtain the
    results of the gettimeofday(2) system-call from their home-nodes.  The -t
    flag tells mosrun to take the time from the local node (where the process
    currently runs), thus reducing the communication overhead with the home-
    node. Note that this can be a problem when the clocks are not synchro-
    nized.  The -T flag reverses the effect of -t.
 
    The -d{decay} argument, where decay is an integer between 0 and 10000,
    sets the rate of decay of process-statistics as a fraction of 10000 per
    second (see mosix(7)).
 
    The -z flag states that the program's arguments begin at argument #0 -
    otherwise, the arguments (if any) are assumed to begin at argument #1 and
    argument #0 is assumed to be identical to the program-name.
 
    mosrun can send batch jobs to other nodes of the local cluster-partition.
    There are two types of batch jobs: those produced by the -E argument are
    native Linux jobs, while those produced by the -M argument are MOSIX jobs
    - but possibly with a different home-node.
 
    Batch jobs are executed from binaries in another node and preserve only
    some of the caller's environment: they receive the environment variables;
    they can read from their standard-input and write to their standard out-
    put and error, but not from/to other open files; they receive signals,
    but after forking, signals are delivered to the whole process-group
    rather than just the parent; they cannot communicate with other processes
    on the local node using pipes and sockets (other than standard input/out-
    put/error), semaphores, messages, etc.  and can only receive signals, but
    not send them.  The main advantage of batch jobs is that they save time
    by not needing to refer to the home-node to perform system-calls, so tem-
    porary files for example, can be created on the node where they start,
    preventing the calling node from becoming a bottleneck.  This approach is
    recommended for programs that perform a significant amount of I/O.
 
    Batch jobs use the path of the current directory as their current-direc-
    tory on the other node.  It is possible to override that path by specify-
    ing a different directory in the -E{/cwd} or -M{/cwd} arguments.
  
    The -i flag states that all the standard-input of a batch job is for its
    exclusive use: it is especially recommended when the input of a batch job
    is redirected from a file.  Programs that use poll(2) or select(2) to
    check for input before reading from their standard-input can only work in
    batch mode with the -i flag.  This flag can also improve the performance.
    An example of when the -i flag cannot be used is when an interactive
    shell places a batch job in the background (because typed input that is
    intended for the shell may go to the batch job instead).
    
    MOSIX-specific arguments (-G, -F, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T,
    -A, -N, -C), do not apply to native Linux batch jobs that are started
    with the -E argument, but they do apply to jobs started with the -M argu-
    ment.
 
    Permission is required from the other node to send batch jobs there (see
    mosix(7) for more information).
 
    The following arguments: -G, -L, -l, -m, -d, -c, -n, -e, -u, -t, -T are
    inherited by child processes: see however in mosix(7) how those can be
    changed at run time from within the program.
 
    The variant mosenv is used to circumvent the loss of certain environment
    variables by the GLIBC library due to the fact that mosrun is a "setuid"
    program: if your program relies on the settings of dynamic-linking envi-
    ronment variables (such as LD_LIBRARY_PATH) or malloc(3) debugging (MAL-
    LOC_CHECK_), use mosenv instead of mosrun.
 
CHECKPOINTS
    Most CPU-intensive processes running under mosrun can be checkpointed:
    this means that an image of those processes is saved to a file, and when
    necessary, the process can later recover itself from that file and con-
    tinue to run from that point.
 
    For successful checkpoint and recovery, the process must not depend heav-
    ily on its Linux environment.  Specifically, the following processes can-
    not be checkpointed at all:
 
    1. Processes with setuid/setgid privileges (for security reasons).
    2. Processes with open pipes or sockets.
 
    The following processes can be checkpointed, but may not run correctly
    after being recovered:
 
    1. Processes that rely on process-ID's of themselves or other processes
       (parent, sons, etc.).
    2. Processes that rely on parent-child relations (e.g. use wait(2), Ter-
       minal job-control, etc.).
    3. Processes that coordinate their input/output with other running pro-
       cesses.
    4. Processes that rely on timers and alarms.
    5. Processes that cannot afford to lose signals.
    6. Processes that use system-V IPC (semaphores and messages).
 
    The -C{filename} argument specifies where to save checkpoints: when a new
    checkpoint is saved, that file-name is given a consecutive numeric exten-
    sion (unless it already has one). For example, if the argument -Cmysave
    is given, then the first checkpoint will be saved to mysave.1, the second
    to mysave.2, etc., and if the argument -Csave.4 is given, then the first
    checkpoint will be saved to save.4, the second to save.5, etc.  If the -C
    argument is not provided, then the checkpoints will be saved to the
    default: ckpt.{pid}.1, ckpt.{pid}.2  ...  The -C argument is NOT inher-
    ited by child processes.
 
    The -N{max} argument specifies the maximum number of checkpoints to pro-
    duce before recycling the checkpoint versions.  This is mainly needed in
    order to save disk space.  For example, when running with the arguments:
    -Csave.4 -N3, checkpoints will be saved in save.4, save.5, save.6,
    save.4, save.5, save.6, save.4 ...
    The -N0 argument returns to the default of unlimited checkpoints; an
    argument of -N1 is risky, because if there is a crash just at the time
    when a backup is taken, there could be no remaining valid checkpoint
    file.  Similarly, if the process can possibly have open pipe(s) or
    socket(s) at the time a checkpoint is taken, a checkpoint file will be
    created and counted - but containing just an error message, hence this
    argument should have a large-enough value to accommodate this possibil-
    ity.  The -N argument is NOT inherited by child processes.
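
    The naming and recycling rules of -C and -N can be modelled with the
    following Python sketch.  It only reproduces the sequences of file
    names given as examples above; the generator name checkpoint_names,
    the placeholder process ID 12345 and the test used to detect an
    existing numeric extension are assumptions, not mosrun's actual code.

```python
import re
from itertools import islice


def checkpoint_names(c_arg=None, max_versions=0, pid=12345):
    """Yield checkpoint file names in the order they would be written.

    c_arg         text given to -C (None selects the default ckpt.{pid})
    max_versions  number given to -N (0 means unlimited, no recycling)
    pid           placeholder process ID, used only for the default name
    """
    if c_arg is None:
        base, first = "ckpt.%d" % pid, 1
    else:
        # Assumption: "a numeric extension" means the name ends in ".<digits>".
        match = re.fullmatch(r"(.*)\.(\d+)", c_arg)
        if match:
            base, first = match.group(1), int(match.group(2))
        else:
            base, first = c_arg, 1
    count = 0
    while True:
        # With -N{max}, version numbers wrap around after max files.
        offset = count % max_versions if max_versions > 0 else count
        yield "%s.%d" % (base, first + offset)
        count += 1


# First seven names for the arguments -Csave.4 -N3
names = list(islice(checkpoint_names("save.4", 3), 7))
```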
 
    Checkpoints can be triggered by the program itself, by a manual request
    (see migrate(1)) and/or at regular time intervals.  The -A{minutes} argu-
    ment requests that checkpoints be automatically taken every given number
    of minutes.  Note that if the process is within a blocking system-call
    (such as reading from a terminal) when the time for a checkpoint comes,
    the checkpoint will be delayed until after the completion of that system
    call.  Also, when the process is frozen, it will not produce a checkpoint
    until unfrozen.  The -A argument is NOT inherited by child processes.
 
    With the -R{filename} argument, mosrun recovers the process from its
    saved checkpoint file and continues to run it.  Program options are not
    permitted with -R, since their values are recovered from the checkpoint
    file.
 
    It is not always possible (or desirable) for a recovered program to con-
    tinue to use the same files that were open at the time of checkpoint:
    mosrun -I{filename} inspects a checkpoint file and lists the open files,
    along with their modes, flags and offsets, then the -O argument allows
    the recovered program to continue using different files.  Files specified
    using this option, will be opened (or created) with the previous modes,
    flags and offsets.  The format of this argument is usually a comma-sepa-
    rated list of entries, each a file-descriptor integer followed by a '='
    sign and a file-name, for example: -O1=oldstdout,2=oldstderr,5=tmpfile.
    If one or more file-names contain a comma, the argument may instead begin
    with a different separator character, for example:
    -O@1=file,with,commas@2=oldstderr@5=tmpfile.
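
    The two forms of the -O argument can be illustrated by a short Python
    sketch that splits the text following -O into file-descriptor and
    file-name pairs.  The function name parse_open_files is hypothetical,
    and the rule "a leading non-digit character is the separator" is an
    assumption inferred from the two examples above rather than a
    documented guarantee.

```python
def parse_open_files(value):
    """Parse the text that follows -O into {file_descriptor: file_name}."""
    if not value:
        return {}
    if value[0].isdigit():
        separator = ","                      # usual comma-separated form
    else:
        separator, value = value[0], value[1:]   # alternative separator form
    mapping = {}
    for entry in value.split(separator):
        fd, _, name = entry.partition("=")   # split at the first '=' only
        mapping[int(fd)] = name
    return mapping
```

    With the alternative separator, a name such as file,with,commas is
    kept intact because only '@' ends an entry.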
 
    In the absence of the -O argument, regular files and directories are re-
    opened with the previous modes, flags and offsets.
 
    Files that were already unlinked at the time of checkpoint, are assumed
    to be temporary files belonging to the process, and are also saved and
    recovered along with the process (an exception is if an unlinked file was
    opened for write-only).  Unpredictable results may occur if such files
    are used to communicate with other processes.
 
    As for special files (most commonly the user's terminal, used as standard
    input, output or error) that were open at the time of checkpoint - if
    mosrun is called with their file-descriptors open, then the existing open
    files are used (and their modes, flags and offsets are not modified).
    Special files that are neither specified in the -O argument, nor open
    when calling mosrun, are replaced with /dev/null.
 
    While a checkpoint is being taken, the partially-written checkpoint file
    has no permissions (chmod 0).  When the checkpoint is complete, its mode
    is changed to 0400 (read-only).
 
QUEUING
    MOSIX incorporates a queuing system that allows users to submit a number
    of jobs that will be scheduled to run when resources are available.
    Although the number of queued jobs can be large, it is limited by the
    number of Linux processes (about 30000 for all users): to queue more
    jobs, see the "RUNNING MULTIPLE JOBS" section below.
    
    The queuing system is common to each cluster-partition and using it is
    optional.  It is recommended to adopt a policy under which either all
    users of a cluster use it, or none do.  Queued jobs can also be con-
    trolled using mosq(1).
    
    The -q argument causes the whole mosrun command to be queued and post-
    poned until the queuing system launches it.
  
    The letter q may optionally be followed by a non-negative integer, speci-
    fying the job's priority - the lower the number, the higher the priority
    (in the absence of this number, a pre-configured, per-node default of 50
    is used, unless configured otherwise by the system-administrator).
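
    The priority rule (lower number, higher priority, default 50) can be
    sketched in Python as follows.  This is a simplified model: the real
    queuing system also weighs available resources, and the order among
    jobs of equal priority is not stated in this manual, so submission
    order is used here purely as an assumption.  The names
    parse_queue_flag and launch_order are hypothetical.

```python
DEFAULT_PRIORITY = 50   # documented default when no number follows the letter


def parse_queue_flag(flag):
    """Split a flag such as '-q', '-q20' or '-Q5' into (letter, priority)."""
    if flag[:2] not in ("-q", "-Q"):
        raise ValueError("not a queuing flag: %r" % flag)
    digits = flag[2:]
    if digits and not digits.isdigit():
        raise ValueError("priority must be a non-negative integer: %r" % flag)
    return flag[1], int(digits) if digits else DEFAULT_PRIORITY


def launch_order(flags):
    """Return job indices sorted by priority; ties keep submission order."""
    return sorted(range(len(flags)),
                  key=lambda i: (parse_queue_flag(flags[i])[1], i))
```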
     
    Queued programs are shown by mosps(1) and ps(1) as "mosqueue".
     
    The -Q argument is similar to -q, except that if MOSIX is stopped (or
    restarted) while the program is queued, or if the queuing system attempts
    to abort the job (see mosq(1)), with -q the program will be killed, while
    with -Q it will bypass the queuing system and begin running.
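
    The difference between the two queuing arguments can be sketched as
    follows.  The helper queue_flag and the program ./my_program are
    hypothetical; the command lines are only printed, not executed:

```shell
# queue_flag PRIORITY [keep]
# Prints the queuing argument for the given priority: -q{pri} by default,
# or -Q{pri} when the second argument is "keep" (the job should start
# running instead of being killed if MOSIX is stopped or the queuing system
# tries to abort it).  Rejects anything but a non-negative integer.
queue_flag() {
    case "$1" in
        ''|*[!0-9]*)
            echo "priority must be a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "${2:-}" = "keep" ]; then
        echo "-Q$1"
    else
        echo "-q$1"
    fi
}

# A lower number means a higher priority (10 ranks ahead of a default of 50).
echo "mosrun $(queue_flag 10) ./my_program"         # prints: mosrun -q10 ./my_program
echo "mosrun $(queue_flag 80 keep) ./my_program"    # prints: mosrun -Q80 ./my_program
```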
  
    The -P{parallel_processes} argument informs the queuing system that the
    job may split into a given number of parallel processes (hence more
    resources must be reserved for it).
  
    The -J{JobID} argument allows bundling of several instances of mosrun
    with a single "job" ID for easy identification and manipulation (the con-
    cept of what a "job" means is left for each user to define).  "Jobs" can
    then be handled collectively by mosq(1), migrate(1), mosps(1) and
    moskillall(1).
  
    Job-IDs can be either a non-negative integer or a token from the file
    $HOME/.jobids: if this file exists, each line in it contains a number
    (JobID) followed by a token that can be used as a synonym for that JobID.
    The default JobID is 0.
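
    mosrun performs the token lookup itself; the sketch below only
    illustrates the file layout described above.  It assumes that the number
    and the token are separated by whitespace, and the helper name
    resolve_jobid and the sample tokens are hypothetical:

```shell
# resolve_jobid TOKEN [FILE]
# Prints the numeric JobID for TOKEN.  A plain non-negative integer is
# returned unchanged; anything else is looked up in FILE (by default
# $HOME/.jobids), where each line is "<number> <token>".
# Assumption: fields are whitespace-separated.
resolve_jobid() {
    token=$1
    file=${2:-$HOME/.jobids}
    case "$token" in
        '')          return 1 ;;                  # nothing to resolve
        *[!0-9]*)    ;;                           # a token: look it up below
        *)           echo "$token"; return 0 ;;   # already a number
    esac
    [ -r "$file" ] || return 1
    awk -v t="$token" '
        $2 == t { print $1; found = 1; exit }
        END     { exit !found }
    ' "$file"
}

# Demonstration with a scratch file instead of the real $HOME/.jobids.
ids=$(mktemp)
printf '%s\n' '12 render' '13 simulate' > "$ids"
resolve_jobid simulate "$ids"    # prints: 13
resolve_jobid 7 "$ids"           # prints: 7
rm -f "$ids"
```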
  
    Job IDs are inherited by child processes.
  
    This argument is ignored for batch jobs originating from other nodes.
 
RUNNING MULTIPLE JOBS
    The -S{maxjobs} option runs multiple command-lines under mosrun, taken
    from the file specified by commands-file, each with the given mosrun
    arguments.
 
    This option is commonly used to run the same program with many different
    sets of arguments.  For example, the contents of commands-file could be:
 
               my_program -a1 < ifile1 > output1
               my_program -a2 < ifile2 > output2
               my_program -a3 < ifile3 > output3
 
    Command-lines are started in the order they appear in commands-file.
    While the number of command-lines is unlimited, mosrun will run concur-
    rently up to maxjobs (1-30000) command-lines at any given time: when any
    command-line terminates, a new command-line is started.
 
    Command lines are interpreted by the standard shell (bash(1)).  Note
    that when redirection is used, bash spawns a child-process to run the
    command: if the number of processes is an issue, it is recommended to
    prepend the keyword exec to each command line that uses redirection.
    For example:
 
               exec my_program -a1 < ifile1 > output1
               exec my_program -a2 < ifile2 > output2
               exec my_program -a3 < ifile3 > output3
 
    The exit status of mosrun -S{maxjobs} is the number of command-lines that
    failed (255 if more than 255 command-lines failed).
 
    As a further option, the commands-file argument can be followed by a
    comma and another file-name: commands-file,failed-commands.  Mosrun will
    create the second file and write to it the list of all the commands (if
    any) that failed (this provides an easy way to re-run only those commands
    that failed).
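
    A commands-file of this kind is usually generated rather than typed.
    The following sketch builds one and prints the matching mosrun command
    lines without executing them; my_program, the ifile/output names and
    the scratch locations are placeholders:

```shell
# Build a commands-file with one command line per input file.
workdir=$(mktemp -d)
cmds="$workdir/commands"
failed="$workdir/failed"

: > "$cmds"
for i in 1 2 3; do
    # "exec" avoids the extra shell process that redirection would cost.
    echo "exec my_program -a$i < ifile$i > output$i" >> "$cmds"
done
cat "$cmds"

# Run at most 2 command lines at a time and record the failures
# (printed here rather than executed):
echo "mosrun -S2 $cmds,$failed"

# The exit status is the number of failed command-lines (at most 255).
# A later invocation could then retry only the failures:
echo "mosrun -S2 $failed"
```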
 
    The -S{maxjobs} option combines well with the queuing system (the -q
    argument): -S sets an absolute upper limit on the number of simultaneous
    jobs, whereas the number of jobs that the queuing system allows to run
    depends on the available grid-resources.  With this combination, to
    prevent an unnecessary and excessive number of waiting processes, no more
    than 10 jobs will be queued at any given moment.
 
PRIVATE TEMPORARY FILES
    Normally, all files are created on the home-node and all file-operations
    are performed there.  This is important because programs often share
    files, but can be costly: many programs use temporary files which they
    never share - they create those files as secondary-memory and discard
    them when they terminate.  It is best to migrate such files with the pro-
    cess rather than keep them in the home-node.
 
    The -X{/directory} argument tells mosrun that a given directory is only
    used for private temporary files: all files that the program creates in
    this directory are kept with the process that created them and migrate
    with it.
 
    The -X argument may be repeated, specifying up to 10 private temporary
    directories.  The directories must start with '/'; can be up to 256 char-
    acters long; cannot include ".."; and for security reasons cannot be
    within "/etc", "/proc", "/sys" or "/dev".
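
    The naming rules above can be checked before mosrun is called.  A
    minimal sketch; the helper name valid_private_dir is hypothetical, it
    rejects any occurrence of ".." (a conservative reading of the rule), and
    it treats the four system directories themselves as forbidden too:

```shell
# valid_private_dir DIR
# Returns 0 if DIR satisfies the stated rules for a private temporary
# directory, non-zero otherwise.  It checks only the name, not whether the
# directory exists, and does not enforce the limit of 10 directories.
valid_private_dir() {
    d=$1
    case "$d" in
        /*) ;;                          # must start with '/'
        *)  return 1 ;;
    esac
    [ "${#d}" -le 256 ] || return 1     # at most 256 characters
    case "$d" in
        *..*) return 1 ;;               # must not include ".."
    esac
    case "$d/" in                       # not within these system trees
        /etc/*|/proc/*|/sys/*|/dev/*) return 1 ;;
    esac
    return 0
}

for d in /tmp/scratch tmp/scratch /etc/mosix /tmp/../etc; do
    if valid_private_dir "$d"; then
        echo "ok:       -X$d"
    else
        echo "rejected: $d"
    fi
done
```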
 
    Only regular files are permitted within private temporary directories: no
    sub-directories, links, symbolic-links or special files are allowed
    (except that sub-directories can be specified by an extra -X argument).
 
    Private temporary file names must begin with '/' (no relative pathnames)
    and contain no ".." components.  The only file operations currently sup-
    ported for private temporary files are: open, creat, lseek, read, write,
    close, chmod, fchmod, unlink, truncate, ftruncate, access, stat.
 
    File-access permissions on private temporary files are provided for com-
    patibility, but are not enforced: the stat(2) system-call returns 0 in
    st_uid and st_gid.  stat(2) also returns the file-modification times
    according to the node where the process was running when making the last
    change to the file.
  
    The per-process maximum total size of all private temporary files is set
    by the system-administrator.  Different maximum values can be imposed
    when running on the home-node, in the local cluster and on the grid -
    exceeding this maximum will cause a process to migrate back to its home-
    node.
  
ALTERNATIVE FREEZING SPACE
    MOSIX processes can sometimes be frozen (you can freeze your processes
    manually and the system-administrator usually sets an automatic-freezing
    policy - see mosix(7)).
  
    The memory-image of frozen processes is saved to disk.  Normally the sys-
    tem-administrator determines where on disk to store your frozen pro-
    cesses, but you can override this default and set your own freezing-
    space.  One possible reason to do so is to ensure that your processes (or
    some of them) have sufficient freezing space regardless of what other
    users do.  Another possible reason is to protect other users if you
    believe that your processes (or some of them) may require so much memory
    that they could disturb other users.
  
    Setting your own freezing space can be done either by setting the envi-
    ronment-variable FREEZE_DIR to an alternative directory (starting with
    '/'); or if you wish to specify more than one freeze-directory, by creat-
    ing a file: $HOME/.freeze_dirs where each line contains a directory-name
    starting with '/'.  For more details, read about "lines starting with
    '/'" within the section about configuring /etc/mosix/freeze.conf in the
    mosix(7) manual.
  
    You must have write-access to your alternative freeze-directory(s).
    The space available in alternative freeze-directories is subject to pos-
    sible disk quotas.
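
    Both ways of setting a private freezing space can be sketched as
    follows.  The helper name write_freeze_dirs and all directory names are
    hypothetical, and the demonstration writes to a scratch location rather
    than to the real $HOME/.freeze_dirs:

```shell
# write_freeze_dirs FILE DIR...
# Writes one directory-name per line to FILE, skipping entries that do not
# start with '/' or that are not writable directories.
write_freeze_dirs() {
    out=$1
    shift
    : > "$out" || return 1
    for dir in "$@"; do
        case "$dir" in
            /*) ;;
            *)  echo "skipped (not absolute): $dir" >&2
                continue ;;
        esac
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "$dir" >> "$out"
        else
            echo "skipped (not a writable directory): $dir" >&2
        fi
    done
}

# Several freeze-directories: one per line in the file.
demo=$(mktemp -d)
mkdir "$demo/freeze1" "$demo/freeze2"
write_freeze_dirs "$demo/freeze_dirs" "$demo/freeze1" "$demo/freeze2" relative/dir
cat "$demo/freeze_dirs"

# A single freeze-directory: the environment variable is enough.
FREEZE_DIR="$demo/freeze1"
export FREEZE_DIR
```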
   
RECURSIVE MOSRUN
    It is possible to run mosrun within an already-running mosrun: this can
    happen, for example, when a shell-script that contains calls to mosrun is
    itself run by mosrun, or when running mosrun make with a Makefile that
    contains calls to mosrun.
  
    The following arguments (and only those) of the outer mosrun will be pre-
    served by the inner mosrun (unless the inner mosrun explicitly requests
    otherwise): -c, -d, -e, -J, -G, -L, -l, -m, -n, -T, -t, -u, -w.
    
FOR THE SYSTEM ADMINISTRATOR
    Some installations want to restrict access to mosrun, or control its
    allowed parameters according to local policies (for example, enforce
    queuing).  If you want to do this:
    
    1.  Allocate a special (preferably new) user-group for mosrun (we shall
        call it "mos" for the instructions below).
    2.  chgrp mos /bin/mosrun
    3.  chmod 4750 /bin/mosrun
    4.  Write a wrapper program which receives the same parameters as
        "mosrun", then checks and/or modifies its parameters according to the
        desired local policies, then executes:
        /bin/mosrun -g {mosrun-parameters}
    5.  chgrp mos /bin/wrapper
    6.  chmod 2755 /bin/wrapper
    7.  Tell your users to use "wrapper" (or any other name you choose)
        instead of "mosrun".
      
    
LIMITATIONS
    32-bit processes must have a 32-bit home-node (but they can be assigned
    or migrated to 64-bit nodes).  Attempts to execute a 32-bit binary under
    a 64-bit home-node will turn the process into a native Linux process (and
    if that process has open private-temporary-files or uses direct communi-
    cation, it will be killed).  Obviously, 64-bit processes cannot run on
    32-bit nodes.
  
    Batch jobs from 64-bit nodes are currently not permitted to run on 32-bit 
    nodes.
   
    Some system-calls are not supported by mosrun, including system-calls
    that are tightly connected to resources of the local node or intended for
    system-administration.  These are:
  
    acct, add_key, adjtimex, afs_syscall(x86_64), alloc_hugepages(i386),
    bdflush, capget, capset, chroot, clock_getres, clock_nanosleep,
    clock_settime, create_module(x86_64), delete_module, epoll_create,
    epoll_ctl, epoll_pwait, epoll_wait, eventfd, free_hugepages(i386), futex,
    get_kernel_syms(x86_64), get_mempolicy, get_robust_list, getcpu,
    getpmsg(x86_64), init_module, inotify_add_watch, inotify_init, ino-
    tify_rm_watch, io_cancel, io_destroy, io_getevents, io_setup, io_submit,
    ioperm, iopl, ioprio_get, ioprio_set, kexec_load(x86_64), keyctl,
    lookup_dcookie, madvise, mbind, migrate_pages, mlock, mlockall,
    move_pages, mq_getsetattr, mq_notify, mq_open, mq_timedreceive, mq_timed-
    send, mq_unlink, munlock, munlockall, nfsservctl, personality,
    pivot_root, ptrace, quotactl, reboot, remap_file_pages, request_key,
    rt_sigqueueinfo, rt_sigtimedwait, sched_get_priority_max, sched_get_pri-
    ority_min, sched_getaffinity, sched_getparam, sched_getscheduler,
    sched_rr_get_interval, sched_setaffinity, sched_setparam, sched_setsched-
    uler, security(x86_64), set_mempolicy, setdomainname, sethostname,
    set_robust_list, settimeofday, shmat, signalfd, swapoff, swapon, syslog,
    timer_create, timer_delete, timer_getoverrun, timer_gettime,
    timer_settime, timerfd, timerfd_gettime, timerfd_settime, tuxcall(x86_64),
    unshare, uselib, vm86(i386), vmsplice, waitid.
    
    In addition, mosrun supports only limited options for the following sys-
    tem-calls:
 
    clone  The only permitted flags are CLONE_CHILD_SETTID, CLONE_PARENT_SET-
           TID, CLONE_CHILD_CLEARTID, and the combination
           CLONE_VFORK|CLONE_VM; the child-termination signal must be SIGCLD
           and the stack-pointer (child_stack) must be NULL.
    getpriority
           may refer only to the calling process.
    ioctl  The following requests are not supported: TIOCSERGSTRUCT, TIOCSER-
           GETMULTI, TIOCSERSETMULTI, SIOCSIFFLAGS, SIOCSIFMETRIC, SIOC-
           SIFMTU, SIOCSIFMAP, SIOCSIFHWADDR, SIOCSIFSLAVE, SIOCADDMULTI,
           SIOCDELMULTI, SIOCSIFHWBROADCAST, SIOCSIFTXQLEN, SIOCSMIIREG,
           SIOCBONDENSLAVE, SIOCBONDRELEASE, SIOCBONDSETHWADDR, SIOCBOND-
           SLAVEINFOQUERY, SIOCBONDINFOQUERY, SIOCBONDCHANGEACTIVE, SIOCBRAD-
           DIF, SIOCBRDELIF.  Non-standard requests that are defined in
           drivers that are not part of the standard Linux kernel are also
           likely to not be supported.
    ipc    the following SYSV-IPC calls are not supported: shmat, semtimedop,
           new-version calls (bit 16 set in call-number).
    mmap   MAP_SHARED and mapping of special-character devices are not per-
           mitted.
    prctl  only the PR_SET_DEATHSIG and PR_GET_DEATHSIG options are sup-
           ported.
    setpriority
           may refer only to the calling process.
    setrlimit
           it is not permitted to modify the maximum number of open files
           (RLIMIT_NOFILE): mosrun fixes this limit at 1024.
 
    Programs that fail to run because they call an unsupported system-call
    can still run in batch mode ('mosrun -E').
 
    Users are not permitted to send the SIGSTOP signal to MOSIX processes:
    SIGTSTP should be used instead (and moskillall(1) changes SIGSTOP to
    SIGTSTP).
 
SEE ALSO
    migrate(1), mosq(1), moskillall(1), mosps(1), direct_communication(7),
    mosix(7).
 
MOSIX                              February 2009                              MOSIX