Intrusion Survivability for Commodity Operating Systems
Digital Threats: Research and Practice
Despite the deployment of preventive security mechanisms to protect the assets and computing platforms of users, intrusions eventually occur. We propose a novel intrusion survivability approach to withstand ongoing intrusions. Our approach relies on an orchestration of fine-grained recovery and per-service responses (e.g., privilege removal). Such an approach may put the system into a degraded mode. This degraded mode prevents attackers from reinfecting the system or from achieving their goals if they managed to reinfect it. It maintains the availability of core functions while waiting for patches to be deployed. We devised a cost-sensitive response selection process to ensure that while the service is in a degraded mode, its core functions are still operating. We built a Linux-based prototype and evaluated the effectiveness of our approach against different types of intrusions. The results show that our solution removes the effects of the intrusions, that it can select appropriate responses, and that it allows services to survive when reinfected. In terms of performance overhead, we observed a small overhead in most cases, except in the rare case of services that write many small files asynchronously in a burst, where we observed a higher but acceptable overhead.

The idea of Intrusion Detection Systems (IDSs) dates back to the 1980s [1, 21]. Since then, more intrusion detection approaches have been introduced, refined, and transferred from academia to industry. Most of today's commodity Operating Systems (OSs) can be deployed with some kind of IDS. However, as the name suggests, IDSs focus only on detection and do not provide the ability to survive or withstand an intrusion once it has been detected.

To limit the damage done by security incidents, intrusion recovery systems help administrators restore a compromised system into a sane state. Common limitations are that they do not preserve availability [31, 35, 44] (e.g., they force a system shutdown) or that they neither stop intrusions from reoccurring nor withstand reinfections [31, 35, 44, 84, 87]. Even if the recovery mechanism restores the system to a sane state, the system continues to run with the same vulnerabilities, and nothing stops attackers from reinfecting it. Thus, the system could enter a loop of infections and recoveries.
Existing intrusion response systems, in contrast, apply responses to stop an intrusion or limit its impact on the system. However, they apply coarse-grained responses that affect the whole system and not just the compromised services (e.g., blocking port 80 for the whole system because a single compromised service uses this port maliciously). To select responses, they also rely on the strong assumption of having complete knowledge of the vulnerabilities present and used by the attacker [27, 73]. These limitations mean that they cannot respond to intrusions without affecting the availability of the system or of some services.

Whether it is due to business continuity, safety reasons, or the user experience, the availability of services is an important aspect of a computing platform. For example, while web sites, code repositories, or databases are not safety-critical, they can be important for a company or for the workflow of a user. Therefore, the problem that we address is the following: How can we design an OS so that its services can survive ongoing intrusions while maintaining availability?

Our approach distinguishes itself from prior work on three fronts. First, we combine the restoration of files and processes of a service with the ability to apply responses after the restoration to withstand a reinfection. Second, we apply per-service responses that affect the compromised services instead of the whole system (e.g., only one service views the file system as read-only). Third, after recovering a compromised service, the responses we apply can put the recovered service into a degraded mode, because they remove some privileges normally needed by the service.

The degraded mode is introduced on purpose. When the intrusion is detected, we either do not have precise enough information about the exploited vulnerabilities to patch them, or we do not have a patch available. The degraded mode allows the system to survive the intrusion for two reasons.
First, after the recovery, the degraded mode stops the attackers either from reinfecting the service or from achieving their malicious goals. Second, it keeps as many functions of the service available as possible, thus maintaining availability while waiting for a patch.

We maintain availability by ensuring that the core functions of services are still operating, while non-essential functions might not be working due to some responses. For example, a web server could have "provide read access to the website" as a core function and "log accesses" as a non-essential one. Thus, removing write access to the file system would degrade the service's state (i.e., it can no longer log accesses), but its core function would be maintained. We developed a cost-sensitive response selection where administrators describe a policy consisting of cost models for responses and malicious behaviors. Based on this policy, our solution then selects a response that maximizes effectiveness while minimizing its impact on the service. This approach gives administrators time to plan an update that fixes the vulnerabilities (e.g., wait for a vendor to release a patch). Finally, once they have patched the system, we can remove the responses, and the system can leave the degraded mode.

Contributions. Our main contributions are the following:
• We propose a novel intrusion survivability approach to withstand ongoing intrusions and maintain the availability of core functions of services (Sections 3.1 and 4).
• We introduce a cost-sensitive response selection process to help select optimal responses (Section 5).
• We develop a Linux-based prototype implementation by modifying the Linux kernel, systemd, CRIU, Linux audit, and snapper (Section 6).
• We evaluate our prototype by measuring the effectiveness of the responses applied, the ability to select appropriate responses, the availability cost of a checkpoint and a restore, the overhead of our solution, and the stability of the degraded services (Section 7).

Outline. The rest of this article is structured as follows. In Section 2, we introduce concepts related to our work and review the state of the art on intrusion recovery and response systems. In Section 3, we give an overview of our approach and define the scope of our work. In Section 4, we specify the requirements and architecture of our approach. In Section 5, we describe how we select cost-sensitive responses and maintain core functions. In Section 6, we describe a prototype implementation, which we then evaluate in Section 7. In Section 8, we discuss some limitations of our work and summarize the comparison with related work. We conclude and outline the next steps for our work in Section 9.
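
To make the idea of cost-sensitive response selection from the introduction more concrete, the following minimal Python sketch picks a response for the web server example given earlier. It is an illustration under stated assumptions, not the selection process of Section 5: the `Response` type, the `select_response` function, the scoring rule (effectiveness minus the summed cost of the disabled functions), and all numeric values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """A candidate per-service response (hypothetical model)."""
    name: str
    effectiveness: float           # assumed scale: 0.0 (useless) to 1.0 (stops the behavior)
    disabled_functions: frozenset  # service functions this response would break


def select_response(responses, core_functions, function_costs):
    """Return the candidate with the highest effectiveness-minus-impact score.

    Candidates that would disable a core function are discarded first.
    Returns None when no candidate preserves all core functions.
    """
    best, best_score = None, float("-inf")
    for response in responses:
        if response.disabled_functions & core_functions:
            continue  # breaks a core function, so it is never acceptable
        impact = sum(function_costs.get(f, 0.0)
                     for f in response.disabled_functions)
        score = response.effectiveness - impact  # assumed scoring rule
        if score > best_score:
            best, best_score = response, score
    return best


# Hypothetical policy for a web server: serving the site is a core function,
# logging accesses is non-essential.
CORE = frozenset({"serve_website"})
COSTS = {"log_accesses": 0.2, "serve_website": 1.0}
CANDIDATES = [
    Response("read_only_filesystem", 0.9, frozenset({"log_accesses"})),
    Response("block_network", 1.0, frozenset({"serve_website", "log_accesses"})),
    Response("no_response", 0.0, frozenset()),
]

chosen = select_response(CANDIDATES, CORE, COSTS)
print(chosen.name)  # prints "read_only_filesystem"
```

In this sketch, blocking the network is the most effective candidate but is rejected because it disables the core function, so the read-only file system is selected: the service can no longer log accesses, yet it keeps serving the website.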