|
| 1 | +# Hardware Memory Isolation |
| 2 | + |
| 3 | +## Table of Contents |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +## Introduction |
| 8 | + |
| 9 | +Another role of the hardware, besides computing, is to provide isolation, mainly between different software components. |
| 10 | +You have encountered (hopefully) the main protection mechanism that the hardware uses to ensure memory isolation: pages. |
| 11 | +There are other, less-known mechanisms for ensuring memory isolation, which we will go through this session: segments, privilege rings and memory protection keys. |
| 12 | +We will also dive into virtualization, focusing on the hardware-assisted kind. |
| 13 | +You will get to write a small hypervisor. |
| 14 | + |
| 15 | +## Memory Isolation |
| 16 | + |
| 17 | +### Pages and Segments |
| 18 | + |
| 19 | +The system needs a way to enforce ownership and permissions on the memory zones. |
| 20 | +For example, it needs to enforce that a certain memory zone can only be read and executed, not written. |
| 21 | +How can you do this? |
| 22 | +The first answer that the CPU designers had is called **segmentation**. |
| 23 | + |
| 24 | +#### Segmentation |
| 25 | + |
| 26 | +Segmentation is the x86 CPU feature that allows assigning permissions and ownership to a certain memory zone, using segments. |
| 27 | +Modern systems no longer rely on segmentation for usual operations. |
| 28 | +Segmentation is only required at the early stages of booting, when pages cannot be used. |
| 29 | +But segmentation is still there, and it is tied to many other components of the system, so knowledge of how segmentation works is useful. |
| 30 | +Let's start at the beginning of time, with something called the **Real mode**. |
| 31 | + |
| 32 | +##### Real Mode Memory Addressing |
| 33 | + |
| 34 | +At the beginning of time, x86 CPUs had only 16-bit registers. |
| 35 | +This meant that the maximum memory size that could be used, with the **flat memory addressing** model, was 64KB. |
| 36 | +This was thought to be enough memory, until it wasn't. |
| 37 | +Instead of expanding the registers, the CPU manufacturers came up with the **segmented memory addressing** model. |
| 38 | +This meant that each memory address, instead of using one register, used two: a normal register and a segment register, each holding 16 bits. |
| 39 | +The way the addresses were calculated was the following: |
| 40 | +``` |
| 41 | +addr = SEGMENT_REGISTER * 0x10 + OFFSET_REGISTER |
| 42 | +``` |
| 43 | +This raised the maximum addressable memory to about 1MB, which was thought to be enough. |
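
The formula above can be tried out with a short Python sketch. The function name `real_mode_address` and the 20-bit wrap-around (as on a CPU with 20 address lines) are assumptions made for illustration, not something taken from the session's code.

```python
def real_mode_address(segment: int, offset: int, address_bits: int = 20) -> int:
    """Translate a 16-bit segment:offset pair into a physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    linear = segment * 0x10 + offset
    # Assumption: a 20-bit address bus, so anything past 1MB wraps around.
    return linear & ((1 << address_bits) - 1)


# Two different segment:offset pairs can name the same byte.
print(hex(real_mode_address(0x1234, 0x0010)))  # 0x12350
print(hex(real_mode_address(0x1235, 0x0000)))  # 0x12350

# Without the wrap, the largest pair overshoots 1MB by almost 64KB.
print(hex(0xFFFF * 0x10 + 0xFFFF))  # 0x10ffef
```

Because segments start every 16 bytes and each one spans 64KB, they overlap heavily, which is why a segment here is "just a number" and not a protected region.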
| 44 | +Notice that, at that time, a segment was just a number. |
| 45 | +It didn't enforce any permissions on the memory zone, or ownership, because there was no user and kernel separation. |
| 46 | +The operating system had absolute power, and there was nothing but the operating system. |
| 47 | +Then came 32-bit registers, the need for applications, the need for isolation. |
| 48 | +The **protected mode** was born. |
| 49 | + |
| 50 | +##### Protected Mode Memory Addressing |
| 51 | + |
| 52 | +Well, all those things didn't appear at the same time. |
| 53 | +First, the registers were extended to 32 bits. |
| 54 | +Using the segmentation model, each segment would now span 4GB of memory. |
| 55 | +Applying the real-mode formula to 32-bit values, a system could address about 68GB of memory (16 * 4GB + 4GB). |
| 56 | +But that much memory wasn't available. |
| 57 | +You could barely reach 4GB of RAM. |
| 58 | +So, the addressing model was switched to the flat one: a memory address would be composed of only one register. |
| 59 | + |
| 60 | +The segments got another role: enforce isolation (pages still weren't a thing). |
| 61 | +A segment register would no longer hold a value used to compute an address, but an index into a table, the **GDT** (Global Descriptor Table). |
| 62 | + |
| 63 | +###### The GDT |
| 64 | + |
| 65 | +The GDT is a system table that contains descriptors of memory zones: where a zone starts, where it ends, how it grows, whether it is readable, writeable or executable, and who can access it. |
| 66 | +Each entry in the GDT is 8 bytes in size, and it looks like this: |
| 67 | + |
| 68 | +TODO: insert GDT entry diagram |
| 69 | + |
| 70 | +Let's break each field down: |
| 71 | + * Base: the address at which the segment begins |
| 72 | + * Limit: the first wacky one; the size of the segment, expressed in bytes or pages (see the G flag) |
| 73 | + * Access: |
| 74 | +     TODO: insert Access Byte diagram |
| 75 | +     * P: Present - 1 if the segment is valid |
| 76 | +     * DPL: Descriptor privilege level, where 0 is the highest privilege level, and 3 the lowest |
| 77 | +     * S: Descriptor type; not interesting for us today |
| 78 | +     * E: Executable; if 0, the segment is a data one; if 1, the segment is a code one |
| 79 | +     * D/C: its significance depends on the **E** bit |
| 80 | +         * if E is 0, the field is Direction; not interesting for today |
| 81 | +         * if E is 1, the field is Conforming; |
| 82 | +             * if the Conforming bit is 1, the code in this segment can be executed by equal or lower privilege level code; |
| 83 | +               for example, code in a segment with DPL equal to 1, with C equal to 1, can be executed by code running at privilege level 3 |
| 84 | +             * if the C bit is 0, the code in this segment can be executed only by code from segments with the same DPL |
| 85 | +     * R/W: another field depending on E |
| 86 | +         * if E is 0, the field is Writeable; data segments are always readable |
| 87 | +         * if E is 1, the field is Readable; code segments are never writeable |
| 88 | +     * A: Accessed; not interesting |
| 89 | + * Flags: |
| 90 | +     * G - Granularity: whether the Limit field is expressed in bytes (0) or pages (1) |
| 91 | +     * DB - Size: if 1, the segment is a 32-bit one, else it is a 16-bit one |
| 92 | +     * L - Long-mode code: if 1, the segment is a 64-bit code one |
| 93 | +     TODO: insert Flags diagram |
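
The field list above can be turned into a small decoder. This is a minimal Python sketch, assuming the standard 8-byte x86 descriptor layout; the function name `decode_gdt_entry` and the two example values are chosen for illustration (typical flat kernel-code and user-data descriptors), not taken from the `read-gdt` module's output.

```python
def decode_gdt_entry(entry: int) -> dict:
    """Split a 64-bit GDT descriptor into the fields described above."""
    access = (entry >> 40) & 0xFF   # Access byte: bits 40-47
    flags = (entry >> 52) & 0xF     # Flags nibble: bits 52-55
    # Limit and Base are each scattered over two places in the entry.
    limit = (entry & 0xFFFF) | (((entry >> 48) & 0xF) << 16)
    base = ((entry >> 16) & 0xFFFFFF) | (((entry >> 56) & 0xFF) << 24)
    return {
        "base": base,
        "limit": limit,
        "P": (access >> 7) & 1,
        "DPL": (access >> 5) & 3,
        "S": (access >> 4) & 1,
        "E": (access >> 3) & 1,
        "DC": (access >> 2) & 1,
        "RW": (access >> 1) & 1,
        "A": access & 1,
        "G": (flags >> 3) & 1,
        "DB": (flags >> 2) & 1,
        "L": (flags >> 1) & 1,
    }


# Assumed example descriptors, picked for illustration only.
EXAMPLES = {
    "kernel code": 0x00AF9B000000FFFF,
    "user data": 0x00CFF3000000FFFF,
}

for name, raw in EXAMPLES.items():
    d = decode_gdt_entry(raw)
    kind = "code" if d["E"] else "data"
    print(f"{name}: ring {d['DPL']}, {kind}, base={d['base']:#x}, "
          f"limit={d['limit']:#x}, G={d['G']}, DB={d['DB']}, L={d['L']}")
```

Splitting Base and Limit across non-adjacent bits is what makes manual decoding error-prone, so a helper like this is handy when reading raw entries from `dmesg`.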
| 94 | + |
| 95 | +As complicated as it looks, the GDT did its job of enforcing some memory protection, which is why a CPU that uses a GDT and 32-bit addresses is said to be in **Protected Mode**. |
| 96 | +This came at the cost of the programmer's sanity. |
| 97 | +If you are wondering what the designers of this model were smoking, you are not alone. |
| 98 | +Fortunately, in modern systems running in 64-bit mode, **Base** and **Limit** are ignored for code and data segments. |
| 99 | +A segment always covers the entire address space. |
| 100 | +Does that mean that the entire address space is in both code and data segments? |
| 101 | +Yes? |
| 102 | +Then how do we ensure separation between executable and writeable memory zones? |
| 103 | +Also, is the entire memory accessible by all privilege levels? |
| 104 | +This doesn't seem right. |
| 105 | +Enter pages. |
| 106 | +But first, something about privilege levels, also known as privilege rings. |
| 107 | + |
| 108 | +##### Privilege Rings |
| 109 | + |
| 110 | +You may have heard about kernel-space and user-space. |
| 111 | +How do we know if a memory zone belongs to the kernel-space, or the user-space? |
| 112 | +That memory zone is part of a segment, or page, as we will see later, that belongs to either kernel or user. |
| 113 | +The kernel-space is, in fact, any memory zone that belongs to **ring 0**, or DPL 0, and the user-space belongs to **ring 3**, or DPL 3. |
| 114 | +What about the other rings, 1 and 2? |
| 115 | +They can be used, but almost no one does. |
| 116 | +Some drivers use those rings, but it's not a common practice. |
| 117 | + |
| 118 | +At this point, things get weird. |
| 119 | +What if you want software with higher privileges than the kernel, like a hypervisor? |
| 120 | +You get ring -1. |
| 121 | +But what if you want a piece of code that is run by the hardware in critical moments? |
| 122 | +You get ring -2. |
| 123 | +Ring -3? |
| 124 | +Someone got there. |
| 125 | +Fear not, we will explore these weird notions later. |
| 126 | +For now, let's do something practical. |
| 127 | + |
| 128 | +##### Tutorial: Reading the GDT of the Linux Kernel |
| 129 | + |
| 130 | +Go to the [`read-gdt`](./activities/read-gdt/) folder. |
| 131 | +There you have a simple kernel module that reads the GDT of the operating system, then prints each field. |
| 132 | +Run `make` to build the module, then `sudo insmod read_gdt.ko` to insert the module. |
| 133 | +By running `sudo dmesg` you should see 16 GDT entries listed, the total size of the GDT and the virtual address where it is placed. |
| 134 | +Only 16 entries are listed, because the ones after that are null. |
| 135 | +Take a look at the entries, and figure out what entries 1 to 6 represent. |
| 136 | +You should find 3 kernel entries, and 3 user ones. |
| 137 | +Notice that entries 0 and 7 are null. |
| 138 | +Entry 0 should always be null. |
| 139 | +Entries from 8 onward are TSS and LDT entries, which won't be detailed in today's session. |
| 140 | + |
| 141 | +Note that a special instruction, `sgdt`, is used to retrieve the GDT pointer descriptor. |
| 142 | +Its counterpart, `lgdt`, loads a new one. |
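
On x86-64, the descriptor stored by `sgdt` is 10 bytes long: a 2-byte limit (the table size minus one) followed by an 8-byte base address. The Python sketch below only parses such a buffer; the helper name `parse_gdtr` and the sample bytes are made up for illustration, and the real values are the ones your kernel module prints.

```python
import struct


def parse_gdtr(raw: bytes) -> tuple:
    """Return (base, size_in_bytes, entry_count) from a 10-byte sgdt result."""
    if len(raw) != 10:
        raise ValueError("a 64-bit GDT pointer descriptor is exactly 10 bytes")
    # Little-endian, no padding: 16-bit limit, then 64-bit base.
    limit, base = struct.unpack("<HQ", raw)
    size = limit + 1            # the limit is the last valid byte offset
    return base, size, size // 8  # each GDT entry is 8 bytes


# Hypothetical sample: a 128-byte table at an invented kernel address.
sample = struct.pack("<HQ", 0x007F, 0xFFFFFE0000001000)
base, size, entries = parse_gdtr(sample)
print(f"GDT at {base:#x}, {size} bytes, {entries} entries")
```

With this made-up sample, the limit `0x7F` corresponds to 128 bytes, that is 16 entries of 8 bytes each.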
| 143 | + |
| 144 | +#### Paging |
| 145 | + |
| 146 | +Soon enough, people got tired of dealing with segmentation; |
| 147 | +a new method to divide the memory was needed. |
| 148 | +Pages were born. |
| 149 | +Unlike segments, which can be of any size, pages have a fixed size, usually 4KB. |
| 150 | +There are also huge pages, which usually have 2MB, 4MB or 1GB. |
| 151 | +Pages are organised hierarchically, in a tree-like structure. |
| 152 | +A hardware component, called the MMU (Memory Management Unit), manages this structure. |
| 153 | +We won't go into details about how that structure is organised, so as not to turn this session into an Operating Systems design session. |
| 154 | +What is important to know is that each page has permissions, which are checked by the MMU at every access. |
| 155 | +The hardware doesn't, however, check if a memory page is accessed by the process that should be able to access it. |
| 156 | +That is the role of the OS. |
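
The per-access check can be modelled in a few lines of Python. This is a deliberately simplified sketch: it looks at a single x86-64 page-table entry (Present, Writable, User and No-Execute bits) and ignores how permissions combine across the levels of the tree, as well as control-register settings. The function name `page_allows` is an assumption made for the example.

```python
PTE_PRESENT = 1 << 0      # the page is mapped
PTE_WRITABLE = 1 << 1     # writes are allowed
PTE_USER = 1 << 2         # user-mode (ring 3) accesses are allowed
PTE_NO_EXECUTE = 1 << 63  # instruction fetches are forbidden


def page_allows(pte: int, access: str, from_user: bool) -> bool:
    """Simplified model of the permission check done on one page-table entry."""
    if access not in ("read", "write", "execute"):
        raise ValueError(f"unknown access type: {access}")
    if not pte & PTE_PRESENT:
        return False
    if from_user and not pte & PTE_USER:
        return False
    if access == "write":
        return bool(pte & PTE_WRITABLE)
    if access == "execute":
        return not pte & PTE_NO_EXECUTE
    return True  # a present, reachable page is always readable


# A user page holding code: readable and executable, but not writable.
code_page = PTE_PRESENT | PTE_USER
print(page_allows(code_page, "execute", from_user=True))  # True
print(page_allows(code_page, "write", from_user=True))    # False

# A kernel data page: user mode cannot touch it at all.
kernel_page = PTE_PRESENT | PTE_WRITABLE | PTE_NO_EXECUTE
print(page_allows(kernel_page, "read", from_user=True))   # False
```

Note that nothing in the entry says which process owns the page; the OS achieves that by giving each process its own set of page tables.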
| 157 | + |
| 158 | +#### Memory Protection Keys |
| 159 | + |
| 160 | +We have the following scenario: |
| 161 | +an application wants to change an area of its memory from read-write to read-only, for reasons. |
| 162 | +To do this it will call `mprotect` on that area. |
| 163 | +Behind the scenes, the OS changes the permissions of each page that is part of the memory area, then flushes the TLB. |
| 164 | +This is costly time-wise. |
| 165 | +As a solution, Intel proposed MPK (Memory Protection Keys), a feature that can quickly change permissions for an area of memory of any size. |
| 166 | +How does this work? |
| 167 | +Up to the moment when MPK was proposed, page-table entries had 4 bits that weren't used. |
| 168 | +These 4 bits are transformed into 16 possible `keys`. |
| 169 | +Furthermore, a per-thread register, `PKRU`, is added to hold the permissions for each of those keys. |
| 170 | +This allows an application to assign its pages to a `protection domain`. |
| 171 | +When accessing a page, instead of checking only the page permissions, the MMU will also check the protection domain permissions. |
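
The combined check can be sketched in Python. The model assumes the x86-64 layout, where the key sits in bits 59 to 62 of the page-table entry and `PKRU` holds two bits per key: access-disable (bit `2 * key`) and write-disable (bit `2 * key + 1`). The helper names are hypothetical, and the model only covers user-mode data accesses.

```python
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_USER = 1 << 2
PKEY_SHIFT = 59  # the 4-bit protection key lives in PTE bits 59-62


def pkey_of(pte: int) -> int:
    """Extract the protection key stored in a page-table entry."""
    return (pte >> PKEY_SHIFT) & 0xF


def pkru_allows(pkru: int, key: int, access: str) -> bool:
    """Check the two PKRU bits that belong to the given key."""
    access_disabled = (pkru >> (2 * key)) & 1
    write_disabled = (pkru >> (2 * key + 1)) & 1
    if access_disabled:
        return False
    return not (access == "write" and write_disabled)


def can_write(pte: int, pkru: int) -> bool:
    """A write needs both the page permission and the key permission."""
    return bool(pte & PTE_WRITABLE) and pkru_allows(pkru, pkey_of(pte), "write")


# A read-write user page tagged with key 1.
page = PTE_PRESENT | PTE_WRITABLE | PTE_USER | (1 << PKEY_SHIFT)

# PKRU with only the write-disable bit of key 1 set (bit 3).
pkru = 1 << 3
print(pkru_allows(pkru, pkey_of(page), "read"))  # True
print(can_write(page, pkru))                     # False

# Clearing that single register bit re-enables writes; no page table changes.
print(can_write(page, 0))                        # True
```

The speed gain comes from that last step: changing one register value affects every page tagged with the key, with no page-table walk and no TLB flush.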
| 172 | + |
| 173 | +Let's take a practical example: |
| 174 | +Application A has a page with read-write permissions. |
| 175 | +It allocates a `protection domain` with read-only permissions, then adds the page to that protection domain. |
| 176 | +When it performs a write on that page, it receives a Segmentation Fault because, even though the page has the right permissions, the protection domain does not. |
| 177 | +Everything sounds nice, doesn't it? |
| 178 | +Well, it is not. |
| 179 | +The reason for this is that the instruction used to modify `PKRU` is unprivileged. |
| 180 | +So, if an attacker gains the ability to execute arbitrary code, the whole mechanism can be bypassed. |
| 181 | +Another problem, detailed in [this paper](https://arxiv.org/pdf/1811.07276v1.pdf), is that after an application frees a protection domain, the key isn't removed from the page-table entries. |
| 182 | +So, if the same key is allocated again, it will still cover the previous pages, which should no longer be under a protection domain. |
| 183 | +A classic example of `use-after-free`. |
| 184 | +The final problem is that there are only 16 possible keys. |
| 185 | +Per process, with key 0 serving as the default for every page. |
| 186 | +An application that wants to isolate many components, or many threads, from each other has very few keys to hand out. |
| 187 | +You can see how this can go wrong. |
| 188 | + |
| 189 | +### Control-Flow Enforcement |
| 190 | + |
| 191 | +#### Invalid Jump Detection |
| 192 | + |
| 193 | +#### Hardware Shadow Stack |
| 194 | + |
| 195 | +### Intel MPX |