|
| 1 | +# Hardware Memory Isolation |
| 2 | + |
| 3 | +## Table of Contents |
| 4 | + |
| 5 | +## Prerequisites |
| 6 | + |
| 7 | +## Introduction |
| 8 | + |
| 9 | +Besides computing, another role of the hardware is to provide isolation, mainly between the operating system and the user applications. |
| 10 | +You have encountered (hopefully) the main protection mechanism that the hardware uses to ensure memory isolation: pages. |
| 11 | +There are other, lesser-known mechanisms for ensuring memory isolation, which we will go through during this session: segments, privilege rings, and memory protection keys. |
| 12 | +We will also dive into virtualization, focusing on the hardware-assisted one. |
| 13 | + |
| 14 | +## Memory Isolation |
| 15 | + |
| 16 | +### Pages and Segments |
| 17 | + |
| 18 | +The system needs a way to enforce ownership and permissions on the memory zones. |
| 19 | +For example, it needs to enforce that a certain memory zone can only be read and executed, not written. |
| 20 | +How can this be achieved? |
| 21 | +The first answer that the CPU designers came up with is called **segmentation**. |
| 22 | + |
| 23 | +#### Segmentation |
| 24 | + |
| 25 | +Segmentation is the x86 CPU feature that allows assigning permissions and ownership to a certain memory zone, using segments. |
| 26 | +Segments differ from pages through their size and their organization, as you will see next. |
| 27 | +Today, modern systems no longer use segmentation for usual operations. |
| 28 | +Segmentation is only required at the early stages of booting, when pages cannot be used. |
| 29 | +But segmentation is still there, and it is tied to many other components of the system, so knowledge of how segmentation works is useful. |
| 30 | +Let's start at the beginning of time, with something called the **Real mode**. |
| 31 | + |
| 32 | +##### Real Mode Memory Addressing |
| 33 | + |
| 34 | +At the beginning of time, x86 CPUs only had 16-bit registers. |
| 35 | +This meant that the maximum memory size that could be used, with the **flat memory addressing** model, was 64KB (2^16 bytes). |
| 36 | +It was thought that this amount of memory is enough, until it wasn't enough. |
| 37 | +Instead of expanding the registers, the CPU manufacturers came up with the **segment memory addressing** model. |
| 38 | +This meant that each memory address, instead of using one register, used 2: a normal register and a segment register, each fitting 16 bits. |
| 39 | +The way the addresses were calculated was the following: |
| 40 | + |
| 41 | +``` |
| 42 | +address = SEGMENT_REGISTER * 0x10 + OFFSET_REGISTER |
| 43 | +``` |
| 44 | + |
| 45 | +This led to a new memory maximum of roughly 1MB, which was thought to be enough. |
| 46 | +Notice that, at that time, a segment was just a number. |
| 47 | +It didn't enforce any permissions on the memory zone, or ownership, because there was no user-kernel separation. |
| 48 | +The operating system had absolute power, and there was nothing but the operating system. |
| 49 | +Then came 32-bit registers, the need for applications, the need for isolation. |
| 50 | +The **protected mode** was born. |
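The real-mode address computation described above can be sketched in a few lines of Python. This is only an illustration: the function name `real_mode_address` is made up for this example, and the segment and offset values are arbitrary.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Return the linear address for a real-mode segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    return segment * 0x10 + offset


# Different segment:offset pairs can name the same byte of memory.
print(hex(real_mode_address(0x07C0, 0x0000)))  # 0x7c00
print(hex(real_mode_address(0x0000, 0x7C00)))  # 0x7c00

# The largest possible pair lands slightly above 2**20 (1MB).
print(hex(real_mode_address(0xFFFF, 0xFFFF)))  # 0x10ffef
```

Note that the segment is nothing more than a number that gets multiplied and added: no permission or ownership check appears anywhere in the computation.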
| 51 | + |
| 52 | +Note: real mode still exists. |
| 53 | +Every x86 CPU starts in real mode, and must be switched to other modes, like the **Protected Mode**. |
| 54 | + |
| 55 | +##### Protected Mode Memory Addressing |
| 56 | + |
| 57 | +Well, all those things didn't appear at the same time. |
| 58 | +First, the registers were extended to 32 bits. |
| 59 | +Using the segmentation model, each segment would now fit 4GB of memory. |
| 60 | +This meant that a system could use 68GB of memory. |
| 61 | +But that memory wasn't available. |
| 62 | +You could barely reach 4GB of RAM. |
| 63 | +So, the addressing model was switched to the flat one: a memory address would be composed of only one register. |
| 64 | + |
| 65 | +The segments got another role: enforce isolation (pages still weren't a thing). |
| 66 | +A segment would not be a value used to compute an address, but an index in a table, the **GDT** (Global Descriptor Table). |
| 67 | + |
| 68 | +###### The GDT |
| 69 | + |
| 70 | +The GDT is a system table that contains descriptors of memory zones: where they start, where they end, how they grow, whether they are readable, writeable, executable, and who can access them. |
| 71 | +Each entry in the GDT is 8 bytes in size, and it looks like this: |
| 72 | + |
| 73 | +TODO: insert GDT entry diagram |
| 74 | + |
| 75 | +Let's break each field down: |
| 76 | + * **Base**: the address at which the segment begins |
| 77 | + * **Limit**: the first wacky one; how many bytes or pages are contained in the segment |
| 78 | + * **Access**: |
| 79 | +     TODO: insert Access Byte diagram |
| 80 | +     * **P**: Present - 1 if the segment is valid |
| 81 | +     * **DPL**: Descriptor privilege level, where 0 is the highest privilege level, and 3 the lowest |
| 82 | +     * **S**: Descriptor type; not interesting for us today |
| 83 | +     * **E**: Executable; if 0, the segment is a data one; if 1, the segment is a code one |
| 84 | +     * **D/C**: its significance depends on the **E** bit |
| 85 | +         * if **E** is 0, the field is Direction; not interesting for today |
| 86 | +         * if **E** is 1, the field is Conforming; |
| 87 | +             * if the Conforming bit is 1, the code in this segment can be executed by equal or lower privilege level code; |
| 88 | +               for example, code in a segment with **DPL** equal to 1, with **C** equal to 1, can be executed by code running at privilege level 3 |
| 89 | +             * if the **C** bit is 0, the code in this segment can be executed only by code from segments with the same **DPL** |
| 90 | +     * **R/W**: another field depending on **E** |
| 91 | +         * if **E** is 0, the field is Writeable; data segments are always readable |
| 92 | +         * if **E** is 1, the field is Readable; code segments are never writeable |
| 93 | +     * **A**: Accessed; not interesting |
| 94 | + * **Flags**: |
| 95 | +     * **G** - Granularity: whether the **Limit** field is expressed in bytes (0) or in 4KB pages (1) |
| 96 | +     * **DB** - Size: if 1, the segment is a 32-bit one, else it is a 16-bit one |
| 97 | +     * **L** - Long-mode code: if 1, the segment is a 64-bit code one |
| 98 | +     TODO: insert Flags diagram |
| 99 | + |
| 100 | +As complicated as it looks, the GDT did its job of enforcing some memory protection, hence a CPU with a GDT and 32-bit addresses is operating in the **Protected Mode**. |
| 101 | +This came at the cost of the programmer's sanity. |
| 102 | +If you are wondering what the designers of this model were smoking, you are not alone. |
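To make the fields listed above more concrete, here is a rough sketch that splits one 8-byte descriptor into them. The bit positions follow the standard x86 descriptor layout, which the text above does not spell out (the diagrams are still to be added), so double-check them against the diagrams once they are in place. The function name `decode_gdt_entry` and the example value are chosen only for illustration; the value is the kind of descriptor a flat 64-bit kernel code segment would have.

```python
def decode_gdt_entry(raw: int) -> dict:
    """Split a 64-bit GDT descriptor into the fields described above."""
    access = (raw >> 40) & 0xFF   # Access byte
    flags = (raw >> 52) & 0xF     # upper nibble of byte 6 (lowest bit unused here)
    return {
        # Base is scattered over bits 16-39 and 56-63.
        "base": ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24),
        # Limit is scattered over bits 0-15 and 48-51.
        "limit": (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16),
        "P": (access >> 7) & 1,
        "DPL": (access >> 5) & 3,
        "S": (access >> 4) & 1,
        "E": (access >> 3) & 1,
        "DC": (access >> 2) & 1,
        "RW": (access >> 1) & 1,
        "A": access & 1,
        "G": (flags >> 3) & 1,
        "DB": (flags >> 2) & 1,
        "L": (flags >> 1) & 1,
    }


# Illustrative value, not taken from any specific system.
EXAMPLE_DESCRIPTOR = 0x00AF9B000000FFFF

for name, value in decode_gdt_entry(EXAMPLE_DESCRIPTOR).items():
    print(f"{name:>5} = {value:#x}")
```

For the example value, the decoded fields describe a present, ring 0, readable code segment with page granularity and the long-mode bit set, with a base of 0 and the maximum limit.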
| 103 | + |
| 104 | +Fortunately, in modern systems, **Base** and **Limit** are ignored. |
| 105 | +A segment always covers the entire address space. |
| 106 | +Does that mean that the entire address space is in both the code and data segments? |
| 107 | +Yes. |
| 108 | +Then how do we ensure separation between executable and writeable memory zones? |
| 109 | +Also, is the entire memory accessible by all privilege levels? |
| 110 | +This doesn't seem right. |
| 111 | +Enter pages. |
| 112 | +But first, something about privilege levels, also known as **privilege rings**. |
| 113 | + |
| 114 | +##### Privilege Rings |
| 115 | + |
| 116 | +You should have heard about kernel-space and user-space. |
| 117 | +How do we know if a memory zone belongs to the kernel-space, or the user-space? |
| 118 | +That memory zone is part of a segment / a page, as we will see later, that belongs to either the kernel or the user. |
| 119 | +The kernel-space is, in fact, any memory zone that belongs to **ring 0**, or DPL 0, and the user-space belongs to **ring 3**, or DPL 3. |
| 120 | +What about the other rings, 1 and 2? |
| 121 | +They can be used, but almost no one does. |
| 122 | +Some drivers use those rings, but it's not a common practice. |
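One way to see the ring a piece of code runs in is to look at the value held in its code segment register. That value is a selector: the upper 13 bits are the index into the table, bit 2 chooses between the GDT and the LDT, and the lowest 2 bits hold the privilege level. The sketch below decodes two selectors. The helper name `decode_selector` is hypothetical, and `0x10` and `0x33` are assumed here to be the kernel and user code selectors typically seen on x86-64 Linux, so treat them as illustrative values.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into index, table and privilege level."""
    return {
        "index": selector >> 3,                       # entry number in the table
        "table": "LDT" if selector & 0x4 else "GDT",  # table indicator bit
        "ring": selector & 0x3,                       # privilege level, 0 to 3
    }


# Assumed example selectors: kernel code and user code.
for selector in (0x10, 0x33):
    print(hex(selector), decode_selector(selector))
```

With these values, the first selector points to GDT entry 2 in ring 0, and the second one to GDT entry 6 in ring 3.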
| 123 | + |
| 124 | +At this point, things get weird. |
| 125 | +What if you want software with higher privileges than the kernel, like a hypervisor? |
| 126 | +You get ring -1. |
| 127 | +But what if you want a piece of code that is run by the hardware in critical moments? |
| 128 | +You get ring -2. |
| 129 | +Ring -3? |
| 130 | +Someone got there. |
| 131 | +Fear not, we will explore these weird notions later. |
| 132 | +For now, let's do something practical. |
| 133 | + |
| 134 | +##### Tutorial: Reading the GDT of the Linux Kernel |
| 135 | + |
| 136 | +Go to the [`read-gdt`](./activities/read-gdt/) folder. |
| 137 | +There you have a simple kernel module that reads the GDT of the operating system, then prints each field. |
| 138 | +Run `make` to build the module, then `sudo insmod read_gdt.ko` to insert the module. |
| 139 | +By running `sudo dmesg` you should see 16 GDT entries listed, the total size of the GDT and the virtual address where it is placed. |
| 140 | +Only 16 entries are listed, because the ones after that are null. |
| 141 | +Take a look at the entries, and figure out what entries 1 to 6 represent. |
| 142 | +You should find 3 kernel entries, and 3 user ones. |
| 143 | +Notice that entries 0 and 7 are null. |
| 144 | +Entry 0 should always be null. |
| 145 | +Entries from 8 onward are TSS and LDT entries, which won't be detailed in today's session. |
| 146 | + |
| 147 | +Note that a special instruction, `sgdt`, is used to retrieve the GDT pointer descriptor. |
| 148 | +The opposite instruction is `lgdt`. |
| 149 | + |
| 150 | +#### Paging |
| 151 | + |
| 152 | +Soon enough, people got tired of dealing with segmentation; |
| 153 | +a new method to divide the memory was needed. |
| 154 | +Pages were born. |
| 155 | +Unlike segments, which can be of any size, pages have fixed sizes: the standard one is 4KB, but other sizes exist. |
| 156 | +There are also huge pages, usually of 2MB, 4MB or 1GB. |
| 157 | +Pages are organised hierarchically, in a tree-like structure. |
| 158 | +A hardware component, called the MMU (Memory Management Unit), uses this structure to translate virtual addresses into physical ones. |
| 159 | +We won't go into details about how that structure is organised, so as not to turn this session into an Operating Systems design session. |
| 160 | +What is important to know is that each page has permissions, that are checked by the MMU at every access. |
| 161 | +The hardware doesn't, however, check if a memory page is accessed by the process that should be able to access it. |
| 162 | +That is the role of the OS. |
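The per-page permission check can be pictured with a toy model. This is not how the MMU is implemented (real page tables are hierarchical and use different permission bits); the flat dictionary, the `READ` / `WRITE` / `EXEC` constants and the page numbers below are all assumptions made for the sake of the example.

```python
PAGE_SIZE = 4096
READ, WRITE, EXEC = 1, 2, 4

# Hypothetical mapping: virtual page number -> permission bits.
page_permissions = {
    0x400: READ | EXEC,   # a code page
    0x600: READ | WRITE,  # a data page
}


def check_access(address: int, wanted: int) -> bool:
    """Return True if the page containing `address` grants all `wanted` bits."""
    page_number = address // PAGE_SIZE
    granted = page_permissions.get(page_number, 0)  # unmapped pages grant nothing
    return (granted & wanted) == wanted


print(check_access(0x400000, READ | EXEC))  # code page can be executed
print(check_access(0x400000, WRITE))        # but not written
print(check_access(0x600010, WRITE))        # data page can be written
```

The check only looks at the address and the requested kind of access. Nothing in it knows which process is asking, which is why the OS has to give each process its own set of mappings.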
| 163 | + |
| 164 | +TODO: develop paging - at least MMU and TLB |
| 165 | + |
| 166 | +#### Memory Protection Keys |
| 167 | + |
| 168 | +Let's consider the following scenario: |
| 169 | +an application wants to change an area of its memory from read-write to read-only, for reasons. |
| 170 | +To do this it will call `mprotect()` on that area. |
| 171 | +Behind the scenes, the OS changes the permissions of each page that is part of the memory area, then flushes the TLB. |
| 172 | +This is costly time-wise. |
| 173 | +As a solution, Intel proposed the `MPK` set of instructions, that can quickly change permissions for an area of memory of any size. |
| 174 | +How does this work? |
| 175 | +Up to the moment when MPK was proposed, page-table entries had 4 bits that weren't used. |
| 176 | +These 4 bits now hold a key number, which gives 16 possible `keys`. |
| 177 | +Furthermore, a register, `PKRU`, is added to hold the permissions for each of those keys, local to each thread. |
| 178 | +This allows an application to allocate its pages to a _protection domain_. |
| 179 | +When accessing a page, instead of checking only the page permissions, the MMU will also check the protection domain permissions. |
| 180 | + |
| 181 | +Let's take a practical example: |
| 182 | +Application A has a page with read-write permissions. |
| 183 | +It allocates a `protection domain` with read permissions, then adds the page to that protection domain. |
| 184 | +When performing a write on that page, a Segmentation Fault will be received, because, even though the page has the right permissions, the protection domain does not. |
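The scenario above can be modelled in a few lines. The sketch assumes the usual `PKRU` layout, with two bits per key: an access-disable bit at position `2 * key` and a write-disable bit at position `2 * key + 1`. The function names and the choice of key 1 are made up for this example, and the model ignores everything else the MMU does.

```python
READ, WRITE = 1, 2


def pkru_allows(pkru: int, key: int, wanted: int) -> bool:
    """Check the protection-domain permissions stored in PKRU for one key."""
    access_disable = (pkru >> (2 * key)) & 1
    write_disable = (pkru >> (2 * key + 1)) & 1
    if access_disable:
        return False
    if (wanted & WRITE) and write_disable:
        return False
    return True


def access_allowed(page_perms: int, key: int, pkru: int, wanted: int) -> bool:
    """Both the page permissions and the protection domain must agree."""
    return (page_perms & wanted) == wanted and pkru_allows(pkru, key, wanted)


page_perms = READ | WRITE   # the page itself is read-write
key = 1                     # hypothetical key assigned to the page
pkru = 0b10 << (2 * key)    # write-disable set for that key: read-only domain

print(access_allowed(page_perms, key, pkru, READ))   # read goes through
print(access_allowed(page_perms, key, pkru, WRITE))  # write is refused
```

Changing the permissions of the whole domain only means writing a new value into `pkru`; no page-table entry is touched, which is where the speed gain comes from.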
| 185 | + |
| 186 | +Everything sounds nice, doesn't it? |
| 187 | +Well, it is not. |
| 188 | +The reason for this is that the instruction used to modify `PKRU` is unprivileged. |
| 189 | +So, if an attacker gains the ability to execute arbitrary code, the whole mechanism can be bypassed. |
| 190 | +Another problem is, as detailed by [this paper](https://arxiv.org/pdf/1811.07276v1.pdf), the fact that, after an application frees a protection domain, the key isn't deleted from the page-table entries. |
| 191 | +So, if the same key is allocated again, it will still cover the previous pages, that should no longer be under a protection domain. |
| 192 | +A classic example of `use-after-free`. |
| 193 | +The final problem is that there are only 16 possible keys. |
| 194 | +Not for each memory area, but for the whole process, with all its threads. |
| 195 | +An application that needs to isolate hundreds, if not thousands, of memory areas runs out of keys quickly. |
| 196 | +You can see how this can go wrong. |
| 197 | + |
| 198 | +##### Tutorial: MPK Basics |
| 199 | + |
| 200 | +TODO: code where the students use `pkey_mprotect` |
| 201 | + |
| 202 | +##### Activity: I Do This For Your Own Good |
| 203 | + |
| 204 | +TODO: one program tries to read from a non-readable zone, enforced through PKU. The students must write a program to change the permissions of the `pkey` |
| 205 | + |
| 206 | +### Control-Flow Enforcement |
| 207 | + |
| 208 | +Don't you hate it when someone exploits your binary, using methods like `Return Oriented Programming` (ROP)? |
| 209 | +(By the way, we have 2 sessions dedicated to ROP, [here]() and [here]()). |
| 210 | +Well, other people hate it too, so they searched for a solution. |
| 211 | +Intel's solution was adding control-flow enforcement in hardware, with the new `Control-Flow Enforcement Technology` instructions. |
| 212 | +Those instructions are split into 2 parts: |
| 213 | +* `Indirect Branch Tracking`, which checks if an indirect `jmp` or `call` instruction targets a valid code address, marked accordingly by the programmer / compiler. |
| 214 | +* `Shadow Stack`, which checks if the return address was altered in any way. |
| 215 | + |
| 216 | +But wait, don't we have the `Stack Canary` for the last one? |
| 217 | +The main problem with that solution is that the canary is placed in the same memory zone as the return address. |
| 218 | +If we can modify the return address, what is stopping someone from reading the canary first, then modifying the return address, while keeping the canary intact? |
| 219 | +The shadow stack places a copy of the return address in a separate memory zone, that can be accessed using only some special instructions. |
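The idea can be sketched as a toy model with two stacks. It only mirrors the principle described above (a protected copy that is compared on return) and says nothing about how the hardware stores or protects the shadow stack; the class and exception names are invented for this example.

```python
class ShadowStackViolation(Exception):
    """Raised when a return address differs from its protected copy."""


class ToyCpu:
    def __init__(self):
        self.stack = []         # regular stack, writable by the program
        self.shadow_stack = []  # protected copy of the return addresses

    def call(self, return_address: int) -> None:
        # A call saves the return address in both places.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self) -> int:
        # A return compares the two copies before using the address.
        address = self.stack.pop()
        expected = self.shadow_stack.pop()
        if address != expected:
            raise ShadowStackViolation(
                f"return address {address:#x} does not match shadow copy {expected:#x}"
            )
        return address


cpu = ToyCpu()
cpu.call(0x401234)
cpu.stack[-1] = 0xDEADBEEF  # the attacker overwrites the return address
try:
    cpu.ret()
except ShadowStackViolation as error:
    print("blocked:", error)
```

Unlike with a canary, leaking values from the regular stack does not help the attacker here: the copy used for the comparison lives somewhere the overflow cannot reach.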
| 220 | + |
| 221 | +#### Indirect Branch Tracking |
| 222 | + |
| 223 | +The main idea behind some attacks is to chain small pieces of code (gadgets), usually in order to call a system function with arguments chosen by the attacker. |
| 224 | +Those gadgets aren't real functions that should be normally called, but rather pieces of a normal function, that end with `jmp` (Jump-Oriented Programming) or `call` (Call-Oriented Programming). |
| 225 | +What if we say that the program can use `jmp` / `call` only to certain instructions? |
| 226 | +That's what `Indirect Branch Tracking` does: |
| 227 | +a new instruction is added, `endbr`. |
| 228 | +Every time an indirect `jmp` or `call` instruction is used (one whose target is taken from a register or from memory), the CPU expects the next instruction that is executed to be `endbr` (`endbr32` or `endbr64`, depending on the mode). |
| 229 | +Otherwise, an exception is raised, and the program is terminated. |
| 230 | +Direct branches, whose target is fixed in the code, are not checked. |
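A toy model of this check for indirect branches is sketched below. The "program" is just a dictionary from made-up addresses to instruction strings, and the names `indirect_branch` and `ControlProtectionFault` are hypothetical; the only point is that a jump into the middle of a function is refused, while a jump to its marked entry is accepted.

```python
class ControlProtectionFault(Exception):
    """Raised when an indirect branch lands on something other than endbr64."""


# Hypothetical function: marked entry point, followed by its body.
PROGRAM = {
    0x1000: "endbr64",
    0x1004: "push rbp",
    0x1005: "mov rbp, rsp",
    0x1008: "pop rdi",   # tempting gadget in the middle of the function
    0x1009: "ret",
}


def indirect_branch(program: dict, target: int) -> int:
    """Accept the branch only if the target instruction is endbr64."""
    if program.get(target) != "endbr64":
        raise ControlProtectionFault(f"no endbr64 at {target:#x}")
    return target


for target in (0x1000, 0x1008):
    try:
        indirect_branch(PROGRAM, target)
        print(f"{target:#x}: allowed")
    except ControlProtectionFault:
        print(f"{target:#x}: blocked")
```

The gadget at `0x1008` is still present in memory, but it can no longer be reached through an indirect `jmp` or `call`.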
| 232 | + |
| 233 | +#### Hardware Shadow Stack |
| 234 | + |
| 235 | + |
| 236 | + |
| 237 | +### Intel MPX ? |