Direct communication
From MosixWiki
DIRECT COMMUNICATION(M7) MOSIX Description DIRECT COMMUNICATION(M7) NAME DIRECT COMMUNICATION - migratable sockets between MOSIX processes PURPOSE Normally, MOSIX processes do all their I/O (and most system-calls) via their home-node: this can be slow because operations are limited by the network speed and latency. Direct communication allows processes to pass messages directly between them, bypassing their home-nodes. For example, if process X whose home-node is A and runs on node B wishes to send a message over a socket to process Y whose home-node is C and runs on node D, then the message has to pass over the network from B to A to C to D. Using direct communication, the message will pass directly send messages to mailboxes of other processes anywhere within the multi- cluster Grid (that are willing to accept them). Direct communication makes the location of processes transparent, so the senders do not need to know where the receivers run, but only to identify them by their home-node and process-ID (PID) in their home-node. Direct communication guarantees that the order of messages per receiver is preserved, even when the sender(s) and receiver migrate - no matter where to and how many times they migrate. SENDING MESSAGES To start sending messages to another process, use: them = open("/proc/mosix/mbox/{a.b.c.d}/{pid}", 1); where {a.b.c.d} is the IP address of the receiver's home-node and {pid} is the process-ID of the receiver. To send messages to a process with the same home-node, you can use 0.0.0.0 instead of the local IP address (this is even preferable, because it allows the communication to proceed in the rare event when the home-node is shut-down from its cluster). The returned value (them) is not a standard (POSIX) file-descriptor: it can only be used within the following system calls: w = write(them, message, length); fcntl(them, F_SETFL, O_NONBLOCK); fcntl(them, F_SETFL, 0); dup2(them, 1); dup2(them, 2); close(them); Zero-length messages are allowed. Each process may at any time have up to 128 open direct communication file-descriptors for sending messages to other processes. These file- descriptors are inherited by child processes (after fork(2)). When dup2 is used as above, the corresponding file-descriptor (1 for standard-output; 2 for standard-error) is associated with sending mes- sages to the same process as them. In that case, only the above calls (write, fcntl, close, but not dup2) can then be used with that descriptor. RECEIVING MESSAGES To start receiving messages, create a mailbox: my_mbox = open("/proc/mosix/mybox", O_CREAT, flags); where flags is any combination (bitwise OR) of the following: 1 Allow receiving messages from other users of the same group (GID). 2 Allow receiving messages from all other users. 4 Allow receiving messages from processes with other home-nodes. 8 Do not delay: normally when attempting to receive a message and no fitting message was received, the call blocks until either a message or a signal arrives, but with this flag, the call returns immedi- ately a value of -1 (with errno set to EAGAIN). 16 Receive a SIGIO signal (See signal(7)) when a message is ready to be read (for assynchroneous operation). 32 Normally, when attempting to read and the next message does not fit in the read buffer (the message length is bigger than the count parameter of the read(2) system-call), the next message is trun- cated. When this bit is set, the first message that fits the read- buffer will be read (even if out of order): if none of the pending messages fits the buffer, the receiving process either waits for a new message that fits the buffer to arrive, or if bit 8 ("do not delay") is also set, returns -1 with errno set to EAGAIN. 64 Treat zero-length messages as an end-of-file condition: once a zero- length message is read, all further reads will return 0 (pending and future messages are not deleted, so they can still be read once this flag is cleared). The returned value (my_mbox) is not a standard (POSIX) file-descriptor: it can only be used within the following system calls: r = read(my_mbox, buf, count); dup2(my_mbox, 0); close(my_mbox); ioctl(my_mbox, SIOCINTERESTED, addr); (see FILTERING below). Reading my_mbox always reads a single message at a time, even when count allows reading more messages. A message can have zero-length, but count must be positive. unlike in "SENDING MESSAGES" above, my_mbox is NOT inherited by child processes. When dup2 is used as above, file-descriptor 0 (standard-input) is associ- ated with receiving messages from other processes, but only the read and close system-calls can then be used with file-descriptor 0. Closing my_mbox (or close(0) if dup2(my_mbox, 0) was used - whichever is closed last) discards all pending messages. To change the flags of the mailbox without losing any pending messages, open it again (without using close): my_mbox = open("/proc/mosix/mybox", O_CREAT, new_flags); Note that when removing permission-flags (1, 2 and 4) from new_flags, messages that were already sent earlier will still arrive, even from senders that are no longer allowed to send messages to the current pro- cess. Re-opening always returns the same value (my_mbox) as the initial open (unless an error occurs and -1 is returned). Also note that if dup2(my_mbox, 0) was used, new_flags will immediately apply to file- descriptor 0 as well. Extra information is available about the latest message that was read. To get this information, you should first define the following macro: static inline unsigned int GET_IP(char *file_name) { int ip = open(file_name, 0); return((unsigned int)((ip==-1 && errno>255) ? -errno: ip)); } To find the IP address of the sender's home, use: sender_home = GET_IP("/proc/self/sender_home"); To find the process-ID (PID) of the sender, use: sender_pid = open("/proc/self/sender_pid", 0); To find the IP address of the node where the sender was running when the message was sent, use: sender_location = GET_IP("/proc/self/sender_location"); (this can be used, for example, to make a manual migration request in order to bring together communicating processes to the same node) To find the length of the last message, use: bytes = open("/proc/self/message_length", 0); (this makes it possible to detect truncated messages: if the last message was truncated, bytes will contain the original length) There is one more tool to check whether the last message arrived from the local cluster or not (e.g., from other clusters in the Grid): in_my_cluster = open("/proc/self/in_cluster", O_CREAT, sender_home); This is a general-purpose tool that can be used to find whether any given IP address is part of the caller's cluster: it is not strictly part of direct communication and should not be called too often (eg. per received-message) because it requires the home-node's assistance, so using it extensively would defeat the purpose and efficiency of direct communication. FILTERING The following facility allows the receiver to select which types of mes- sages it is interested to receive: struct interested { unsigned char conditions; /* bitmap of conditions */ unsigned char testlen; /* length of test-pattern (1-8 bytes) */ short pid; /* Process-ID of sender */ unsigned int home; /* home-node of sender (0 = same home) */ int minlen; /* minimum message length */ int maxlen; /* maximum message length */ int testoffset; /* offset of test-pattern within message */ unsigned char testdata[8]; /* expected test-pattern */ }; /* conditions: */ #define INTERESTED_IN_PID 1 #define INTERESTED_IN_HOME 2 #define INTERESTED_IN_MINLEN 4 #define INTERESTED_IN_MAXLEN 8 #define INTERESTED_IN_PATTERN 16 struct interested filter; #define SIOCINTERESTED 0x8985 A call to: ioctl(my_mbox, SIOCINTERESTED, &filter); starts applying the given filter, while a call to: ioctl(my_mbox, SIOCINTERESTED, NULL); cancels the filtering. Closing my_mbox also cancels the filtering (but re-opening with different flags does not cancel the filtering). When filtering is applied, only messages that comply with the filter are received: if there are no complying messages, the receiving process either waits for a complying message to arrive, or if bit 8 ("do not delay") of the flags from open("/proc/self/mybox", O_CREAT, flags) is set, read(my_mbox,...) returns -1 with errno set to EAGAIN. Different types of messages can be received simply by modifying the con- tents of the filter between calls to read(my_mbox,...). filter.conditions is a bit-map indicating which condition(s) to consider: When INTERESTED_IN_PID is set, the process-ID of the sender must match filter.pid. When INTERESTED_IN_HOME is set, the home-node of the sender must match filter.home (a value of 0 can be used to match senders from the same home-node). When INTERESTED_IN_MINLEN is set, the message length must be at least filter.minlen bytes long. When INTERESTED_IN_MAXLEN is set, the message length must be no longer than filter.maxlen bytes. When INTERESTED_IN_PATTERN is set, the message must contain a given pat- tern of data at a given offset. The offset within the message is given by filter.testoffset, the pattern's length (1 to 8 bytes) in filter.testlen and its expected contents in filter.testdata. ERRORS Sender errors: ENOENT Invalid pathname in open: the specified IP address is not part of this cluster/Grid, or the process-ID is out of range (must be 2-32767). ESRCH No such process (this error is detected only when attempting to send - not when opening the connection). EACCES No permission to send to that process. ENOSPC Non-blocking (O_NONBLOCK) was requested and the receiver has no more space to accept this message - perhaps try again later. ECONNABORTED The home-node of the receiver is no longer in our multi- cluster Grid. EMFILE The maximum of 128 direct communicaiton file-descriptors is already in use. EINVAL When opening, the second parameter does not contain the bit "1"; When writing, the length is negative or more than 32MB. ETIMEDOUT Failed to establish connection with the mail-box managing daemon (postald). ECONNREFUSED The mail-box managing (postald) refused to serve the call (proba- bly a MOSIX installation error). EIO Communication breakdown with the mail-box managing daemon (postald). Receiver errors: EAGAIN No message is currently available for reading and the "Do not delay" flag is set. EXFULL Messages were possibly lost (usually due to insufficient memory): the receiver may still be able to receive new messages. ENOMSG The receiver had insufficient memory to store the last message. Despite this error, it is still possible to find out who sent the last message and its original length. EINVAL One or more values in the filtering structure are illegal or their combination makes it impossible to receive any message (for example, the offset of the data-pattern is beyond the maximum message length). Errors that are common to both sender and receiver: EINTR Read/write interrupted by a signal. ENOMEM Insufficient memory to complete the operation. EFAULT Bad read/write buffer address. ENETUNREACH Could not establish a connection with the mail-box managing dea- mon (postald). ECONNRESET Connection lost with the mail-box managing daemon (postald). POSSIBLE APPLICATIONS The scope of direct communication is very wide: almost any program that requires communication between related processes can benefit. Following are a few examples: 1. Use direct communication within standard communication packages and libraries, such as MPI. 2. Pipe-like applications where one process' output is the other's input: write your own code or use the existing mospipe(1) MOSIX utility. 3. Direct communiction can be used to implement fast I/O for migrated processes (with the cooperation of a local process on the node where the migrated process is running). In particular, it can be used to give migrated processes access to data from a common NFS server without causing their home-node to become a bottleneck. LIMITATIONS Processes that are involved in direct communication (having open file- descriptors for either sending or receiving messages) cannot be check- pointed and cannot execute mosrun recursively or native (see mosrun(1)). SEE ALSO mosrun(1), mospipe(1), mosix(7). MOSIX January 2008 MOSIX