A comparison of three microkernels
Journal of Supercomputing
The future of supercomputing lies in massively parallel computers. The nodes of these machines will need a different kind of operating system than current computers have. Many researchers in the field believe that microkernels provide the kind of functionality and performance required. In this paper we discuss three popular microkernels, Amoeba, Mach, and Chorus, to show what they can do. They are compared and contrasted in the areas of process management, memory management, and communication.
INTRODUCTION

Future advances in supercomputing will require the use of massively parallel computers, containing thousands of powerful CPUs. To perform well, these parallel supercomputers will require operating systems radically different from current ones. Most researchers in the operating systems field believe that these new operating systems will have to be much smaller than traditional ones to achieve the efficiency and flexibility needed.

In this paper we first summarize the state of the art in operating systems for parallel supercomputers and point out their shortcomings. Then we discuss a new trend, microkernels, and explain why there is interest in them. The rest of the paper deals with three modern microkernels: Amoeba, Mach, and Chorus. After a brief introduction to each one, we look at the three main areas that any operating system must handle: process management, memory management, and communication, and describe how each of these example microkernels deals with that area.

OPERATING SYSTEMS FOR PARALLEL SUPERCOMPUTERS

Except for the simplest embedded applications, every CPU needs an operating system to manage its resources and hide the details of the hardware from the application programs. Typical functions of an operating system are to let processes create, destroy, and manage subprocesses; to allocate and reclaim memory, making sure that each process can access only the memory it is entitled to (by setting up the MMU properly); to handle communication between processes, both local and on other machines; and to perform input/output. Parallel supercomputers are no exception: they also need operating systems. First, the individual nodes have resources that must be managed: the processor must be scheduled, memory must be allocated, and so on. Second, the system as a whole has global resources that must be handled: processes have to be assigned to processors, I/O has to be performed, and so on.
Since parallel computing is in its infancy, no consensus has yet formed on the best way to tackle the problem, but it is becoming clear that simplicity, flexibility, and high performance are crucial here. At present, two approaches to parallel operating systems are widely used. Below we discuss both of them and show why they are converging on a third approach.

The first approach is the hosted system, in which the parallel supercomputer consists of a large number of (identical) processing nodes plus a host computer. The host computer, which can range from an ordinary workstation to a Cray Y-MP, runs a standard operating system, normally some version of UNIX®. The other nodes do not run any operating system at all. Instead, the application program is compiled with a special runtime library containing procedures for doing limited process and memory management, plus procedures for communicating with other nodes and with the host. The application program runs in kernel mode (if there is one) and takes over the whole machine. Most system calls are executed by sending messages to the host computer, which then executes them and sends back the results. Parallel supercomputers using this approach include the Cray T3D, Fujitsu AP1000, Meiko CS-1, Thinking Machines CM-2, and Intel iPSC/x.

The other model, the symmetric system, has a complete operating system running on each node. Again, some version of UNIX is the most popular choice. There is no need for a host, since each node runs a complete operating system (although a host can be used for program development). Parallel supercomputers using this approach include the Meiko CS-2, Thinking Machines CM-5, and IBM SP1.

Each of these models has its problems. The hosted model is very primitive, and harks back almost 50 years to the very first computers, which also used runtime libraries instead of operating systems.
It also means that it is impossible for multiple users to run jobs on the same parallel supercomputer at the same time in a protected way. In high-performance applications that are highly I/O bound, such as imaging or scientific visualization, a node that is idle waiting for data cannot simply switch to another process, because there are no other processes. Although in theory an application using this model could use virtual memory by programming the MMU itself, in practice the lack of an operating system means that virtual memory is not available and programs must fit into real memory. Finally, having a single host handle nearly all the system calls means that the host is potentially a bottleneck and is certainly a single point of failure that will bring the whole system down if it crashes. As a consequence of all these problems, the trend is definitely away from having the computing engines run only a small runtime library.

Putting a full version of UNIX on every node solves most of these problems, but unfortunately it creates new ones. The basic problem is that UNIX is too large and inefficient for many parallel applications. For example, it was designed on the assumption that every node controls disks, terminals, and other I/O devices, which are not present on all the nodes of a parallel supercomputer. If a system has 1024 processors, only a small number of which have a disk, it is wasteful to load each of the processors down with an operating system that contains a large, sophisticated file system. Another basic problem is that for parallel applications, communication is a major issue, if not the major issue. In most existing operating systems, including UNIX, communication was grafted on late in life, does not fit in well, and is inefficient.
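The remote system call mechanism of the hosted model can be sketched with a toy simulation (Python used here purely as pseudocode; the Host and Node classes and the in-memory file system are invented for illustration and are not taken from any of the machines named above):

```python
# Toy model of a hosted parallel system: the nodes run no operating
# system, so the runtime library forwards each system call to the host
# as a message; the host executes it and sends back the result.
class Host:
    """The host runs a full operating system and executes syscalls."""
    def __init__(self):
        self.files = {}                    # a trivial in-memory "file system"

    def handle(self, call, *args):
        # Execute a system call on behalf of a node and return the result.
        if call == "write":
            name, data = args
            self.files[name] = self.files.get(name, b"") + data
            return len(data)
        if call == "read":
            (name,) = args
            return self.files.get(name, b"")
        raise ValueError("unknown system call: " + call)

class Node:
    """A compute node whose runtime library ships syscalls to the host."""
    def __init__(self, host):
        self.host = host

    def syscall(self, call, *args):
        # On real hardware this would be a message over the interconnect.
        # Funneling every node through one host object also shows why the
        # host is a potential bottleneck and single point of failure.
        return self.host.handle(call, *args)

host = Host()
nodes = [Node(host) for _ in range(4)]
for i, node in enumerate(nodes):
    node.syscall("write", "results", b"node%d " % i)
print(host.files["results"])               # all four writes, via the host
```

If the single `Host` object is removed, no node can perform I/O at all, which is the single-point-of-failure property criticized above.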
MICROKERNELS

The solution appears to be a new kind of operating system that is effectively a compromise between having no operating system at all and having a large monolithic operating system that does many things that are not needed. At the heart of this approach is a small piece of code, called a microkernel, that is present on all machines. It runs in protected (i.e., kernel) mode, and provides basic process management, memory management, communication, and I/O services to the rest of the system. All of the other traditional operating system services, such as the file system, are supplied by server processes, generally running in user mode and only on those machines that need them. This division of labor allows the microkernel to be small and fast, and does not burden each CPU with facilities (such as a complete file system) that it does not need.

Communication in Amoeba

To broadcast a message in Amoeba, a kernel sends it to a central sequencer, which assigns it the next sequence number and broadcasts it to all machines. When another kernel receives a broadcast message, it checks whether the message bears the next sequence number in numerical order. If so, it passes the message up to the waiting user process or buffers it temporarily. If the kernel sees that it has missed a broadcast, it asks the sequencer for the message it missed. The protocol is described in more detail in [Kaashoek et al., 1993].

Communication in Mach

All communication in Mach is based on reliable, one-way message passing. Remote procedure call, reliable byte streams, and other models can be built on top of this message passing. The central concept in the message-passing system is the port, which keeps track of incoming messages. A thread can create as many ports as it needs. Messages go from a thread to a port, so for bidirectional communication two ports are needed. When a thread creates a port, a kind of capability is created for the port and put on a capability list stored in the kernel.
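The port abstraction just introduced can be illustrated with a toy model (Python as pseudocode; this is not the Mach interface, and the class and method names are invented):

```python
# Toy model of Mach-style ports: messages go from a thread to a port,
# and creating a port yields a small-integer handle into a per-process
# capability list kept by the kernel (much like a UNIX file descriptor).
from collections import deque

class Kernel:
    def __init__(self):
        self.capabilities = []             # per-process capability list

    def port_create(self):
        self.capabilities.append(deque())  # a port queues incoming messages
        return len(self.capabilities) - 1  # small integer names the capability

    def send(self, handle, message):
        self.capabilities[handle].append(message)

    def receive(self, handle):
        return self.capabilities[handle].popleft()

kernel = Kernel()
# Bidirectional communication needs two ports, one per direction.
request_port = kernel.port_create()
reply_port = kernel.port_create()
kernel.send(request_port, "read block 7")
kernel.send(reply_port, "block 7 contents")
print(kernel.receive(request_port))        # server side reads the request
print(kernel.receive(reply_port))          # client side reads the reply
```

The small-integer handle, rather than the port itself, is what user code manipulates; the kernel keeps the real object, which is what makes the scheme protected.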
Each process (not each thread) has a capability list for its ports. A small integer is returned to the caller giving the position of the capability in the list, analogous to a file descriptor in UNIX. This integer is used in subsequent calls involving the port. Capabilities contain rights telling what the holder can do with the corresponding port. The rights are RECEIVE, SEND, and SEND-ONCE. Each port has an owner, and only the owner holds the RECEIVE right and can receive from the port. Initially the process creating the port is the owner, but the capability with the RECEIVE right can be passed to another process in a message. Only one process at a time may hold the capability with the RECEIVE right for a port. The SEND and SEND-ONCE rights allow the holder to send to the port; the difference between the two is that the latter is good for sending a single message, after which the capability is automatically deleted by the kernel. Capabilities with the SEND-ONCE right are used to build RPC communication, in which the client gives the server a capability for exactly one reply message.

Ports can be grouped into sets. A port set also has a capability, but only with the RECEIVE right. It is not possible to send to a port set. When a thread reads from a port set, it gets the first message from one of the ports in the set, but no guarantee is given about which port will be chosen.

The entity sent from a thread to a port is a message. A message contains a short header and an arbitrarily long sequence of fields, each of which has a well-defined type. The kernel guarantees reliable delivery of messages, so users do not have to be concerned with lost messages or acknowledgements. Messages from a given port are delivered in the order they were received. They are never combined, so even if a port contains many small messages, a thread reading from the port gets only one message per read. Specially marked messages can carry capabilities from one process to another.
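The semantics of the three rights can be made concrete with a small simulation (illustrative Python, not the Mach interface; the class and function names are invented):

```python
# Toy model of Mach-style port rights: SEND can be used repeatedly,
# SEND-ONCE is consumed by a single message, and only the holder of the
# RECEIVE right can take messages out of the port's queue.
from collections import deque

class Port:
    def __init__(self):
        self.queue = deque()

class Capability:
    def __init__(self, port, right):
        self.port, self.right = port, right

def send(cap, message):
    if cap.right not in ("SEND", "SEND-ONCE"):
        raise PermissionError("capability does not allow sending")
    cap.port.queue.append(message)
    if cap.right == "SEND-ONCE":
        cap.right = None                   # the kernel deletes the capability

def receive(cap):
    if cap.right != "RECEIVE":
        raise PermissionError("capability does not allow receiving")
    return cap.port.queue.popleft()        # one message per call, never combined

port = Port()
owner = Capability(port, "RECEIVE")        # the creator starts out as owner
client = Capability(port, "SEND")          # reusable send right
reply_once = Capability(port, "SEND-ONCE") # good for exactly one message

send(client, "request 1")
send(reply_once, "single reply")           # consumes the SEND-ONCE right
print(receive(owner))
print(receive(owner))
```

A second send on `reply_once` raises an error in this model, mirroring the kernel's automatic deletion of a SEND-ONCE capability after use, which is what makes it suitable for guaranteeing exactly one RPC reply. Capabilities carried inside specially marked messages are managed by the kernels on each side of the transfer.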
Such capabilities are inserted by the sending kernel and removed by the receiving kernel. To optimize message transport, when a process receives a message, the message is not actually copied. Instead it is mapped in using the copy-on-write mechanism described above. Only when a page is written to does a fault occur and an actual copy get made.

The description of ports and messages given above is more or less directly descended from RIG, and applies only to messages sent to a receiver on the same machine as the sender. To extend communication beyond one machine, Mach introduced the concept of a network message server, a process in user space that acts as a proxy for remote processes. For example, a server on machine B can register with the network message server on machine A, which then creates a local port for the server on A and adds it to the port set for all its remote customers. Clients on A can send requests to this port. When the network message server gets a message, it forwards it to the destination machine using whatever network protocol is appropriate. This design puts the network code in user space, analogous to the external pager. However, several years of experience with this scheme demonstrated that the performance was unsatisfactory, so the design was changed, and in releases after 3.0 the networking code will be in the kernel. It is already in the OSF microkernel.

Communication in Chorus

In broad outline, communication in Chorus is similar to communication in Mach, but with numerous differences in the technical details. Like Mach, Chorus has ports and messages, and communication occurs when a thread sends a message to a port. Ports are named and protected by Chorus capabilities, which are managed in user space. Ports can also be moved from one machine to another, for example, to allow a new server to take over the workload when the machine the old one is on must go down for maintenance.
When a port is moved, its pending messages can optionally move with it or be deleted.
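Port migration and its two message policies can be sketched as follows (an illustrative model, not the Chorus API; all names are invented):

```python
# Toy model of Chorus-style port migration: when a port moves to another
# machine, its queued messages either travel with it or are discarded.
from collections import deque

class Machine:
    def __init__(self, name):
        self.name = name
        self.ports = {}                    # ports currently on this machine

class Port:
    def __init__(self, machine, name):
        self.name = name
        self.queue = deque()
        self.machine = machine
        machine.ports[name] = self

    def migrate(self, new_machine, keep_messages=True):
        # Move the port, e.g. so a new server can take over the work
        # when the old machine must go down for maintenance.
        del self.machine.ports[self.name]
        if not keep_messages:
            self.queue.clear()             # the "delete pending messages" policy
        self.machine = new_machine
        new_machine.ports[self.name] = self

a, b = Machine("A"), Machine("B")
p = Port(a, "service")
p.queue.extend(["req1", "req2"])
p.migrate(b, keep_messages=True)
print(p.machine.name, list(p.queue))       # port and its messages now on B
```

Because clients name the port rather than the machine, a migration of this kind is invisible to senders, which is what makes the technique useful for planned maintenance.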