diff --git a/hardware-isolation/README.md b/hardware-isolation/README.md
deleted file mode 100644
index 0f09692..0000000
--- a/hardware-isolation/README.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Hardware Isolation
-
-## Virtualization
-
-### Intel VMX
-
-## Memory Protection
-
-### Pages and Segments
-
-### Protection Rings
-
-### Protection Keys
-
-### IOMMU
-
-## Enclaves and Trusted Execution Environments
diff --git a/hardware-memory-isolation/README.md b/hardware-memory-isolation/README.md
new file mode 100644
index 0000000..5f0ffc1
--- /dev/null
+++ b/hardware-memory-isolation/README.md
@@ -0,0 +1,237 @@
+# Hardware Memory Isolation
+
+## Table of Contents
+
+## Prerequisites
+
+## Introduction
+
+Besides computing, another role of the hardware is to provide isolation, mainly between the operating system and the user applications.
+You have (hopefully) already encountered the main protection mechanism that the hardware uses to ensure memory isolation: pages.
+There are other, less-known mechanisms for ensuring memory isolation, which we will go through during this session: segments, privilege rings and memory protection keys.
+We will also dive into virtualization, focusing on the hardware-assisted kind.
+
+## Memory Isolation
+
+### Pages and Segments
+
+The system needs a way to enforce ownership and permissions on memory zones.
+For example, it needs to enforce that a certain memory zone can only be read and executed, not written.
+How can this be achieved?
+The first answer that the CPU designers came up with is called **segmentation**.
+
+#### Segmentation
+
+Segmentation is the x86 CPU feature that allows assigning permissions and ownership to a certain memory zone, using segments.
+Segments differ from pages in their size and their organization, as you will see next.
+Today, modern systems no longer use segmentation for their usual operations.
+Segmentation is only required in the early stages of booting, when pages cannot be used yet.
+But segmentation is still there, and it is tied to many other components of the system, so knowing how it works is useful.
+Let's start at the beginning of time, with something called **Real Mode**.
+
+##### Real Mode Memory Addressing
+
+At the beginning of time, x86 CPUs only had 16-bit registers.
+This meant that the maximum memory size that could be used with the **flat memory addressing** model was 64 KB (2^16 bytes).
+It was thought that this amount of memory was enough, until it wasn't.
+Instead of widening the registers, the CPU manufacturers came up with the **segmented memory addressing** model.
+This meant that each memory address, instead of using one register, used two: a normal register and a segment register, each holding 16 bits.
+The addresses were calculated as follows:
+
+```
+address = SEGMENT_REGISTER * 0x10 + OFFSET_REGISTER
+```
+
+This led to a new memory maximum of roughly 1 MB, which was thought to be enough.
+Notice that, at that time, a segment was just a number.
+It didn't enforce any permissions or ownership on the memory zone, because there was no user-kernel separation.
+The operating system had absolute power, and there was nothing but the operating system.
+Then came 32-bit registers, the need for applications, the need for isolation.
+The **protected mode** was born.
+
+Note: real mode still exists.
+Every x86 CPU starts in real mode, and must be switched to other modes, like the **Protected Mode**.
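+
+Before moving on to protected mode, if you want to play with the real-mode formula above, here is a minimal user-space sketch (plain C; the helper name `real_mode_address` is our own, not a real API) that computes physical addresses the same way a real-mode CPU does:
+
+```
+#include <stdio.h>
+#include <stdint.h>
+
+/* Real-mode address computation: segment * 0x10 + offset. */
+static uint32_t real_mode_address(uint16_t segment, uint16_t offset)
+{
+	return (uint32_t)segment * 0x10 + offset;
+}
+
+int main(void)
+{
+	/* 0x1234:0x0010 maps to physical address 0x12350. */
+	printf("0x%05x\n", real_mode_address(0x1234, 0x0010));
+
+	/* The highest reachable address, 0xFFFF:0xFFFF, is 0x10FFEF: slightly over 1 MB. */
+	printf("0x%05x\n", real_mode_address(0xFFFF, 0xFFFF));
+
+	return 0;
+}
+```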
+
+##### Protected Mode Memory Addressing
+
+Well, all those things didn't appear at the same time.
+First, the registers were extended to 32 bits.
+Using the segmented model, each segment could now span 4 GB of memory, so, in theory, a system could address much more than that.
+But that memory wasn't available.
+You could barely reach 4 GB of RAM.
+So, the addressing model was switched to the flat one: a memory address would be composed of only one register.
+
+The segments got another role: enforcing isolation (pages still weren't a thing).
+A segment would no longer be a value used to compute an address, but an index into a table, the **GDT** (Global Descriptor Table).
+
+###### The GDT
+
+The GDT is a system table that contains descriptors of memory zones: where they start, where they end, how they grow, whether they are readable, writeable or executable, and who can access them.
+Each entry in the GDT is 8 bytes in size, and it looks like this:
+
+TODO: insert GDT entry diagram
+
+Let's break each field down:
+ * **Base**: the address at which the segment begins
+ * **Limit**: the first wacky one; how many bytes or pages the segment contains
+ * **Access**:
+   TODO: insert Access Byte diagram
+   * **P**: Present - 1 if the segment is valid
+   * **DPL**: Descriptor Privilege Level, where 0 is the highest privilege level and 3 the lowest
+   * **S**: Descriptor type; not interesting for us today
+   * **E**: Executable; if 0, the segment is a data one; if 1, the segment is a code one
+   * **D/C**: its significance depends on the **E** bit
+     * if **E** is 0, the field is Direction; not interesting for today
+     * if **E** is 1, the field is Conforming:
+       * if the Conforming bit is 1, the code in this segment can be executed by equal or lower privilege level code;
+         for example, code running at privilege level 3 can jump to a conforming segment with **DPL** equal to 1
+       * if the **C** bit is 0, the code in this segment can be executed only by code running at the privilege level given by its **DPL**
+   * **R/W**: another field depending on **E**
+     * if **E** is 0, the field is Writeable; data segments are always readable
+     * if **E** is 1, the field is Readable; code segments are never writeable
+   * **A**: Accessed; not interesting
+ * Flags:
+   * **G** - Granularity: is the **Limit** field expressed in bytes (0) or in pages (1)?
+   * **DB** - Size: if 1, the segment is a 32-bit one, else it is a 16-bit one
+   * **L** - Long-mode code: if 1, the segment is a 64-bit code one
+   TODO: insert Flags diagram
+
+As complicated as it looks, the GDT did its job of enforcing some memory protection, hence a CPU with a GDT and 32-bit addresses is operating in **Protected Mode**.
+This came at the cost of the programmer's sanity.
+If you are wondering what the designers of this model were smoking, you are not alone.
+
+Fortunately, in modern systems, **Base** and **Limit** are ignored.
+A segment always covers the entire address space.
+Does that mean that the entire address space is in both the code and data segments?
+Yes.
+Then how do we ensure separation between executable and writeable memory zones?
+And is the entire memory accessible from all privilege levels?
+This doesn't seem right.
+Enter pages.
+But first, something about privilege levels, also known as **privilege rings**.
+
+##### Privilege Rings
+
+You have probably heard about kernel space and user space.
+How do we know if a memory zone belongs to the kernel space or the user space?
+That memory zone is part of a segment (or a page, as we will see later) that belongs to either the kernel or the user.
+The kernel space is, in fact, any memory zone that belongs to **ring 0**, or DPL 0, while the user space belongs to **ring 3**, or DPL 3.
+What about the other rings, 1 and 2?
+They can be used, but almost no one does it.
+Some drivers use those rings, but it's not a common practice.
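+
+You can actually ask the CPU which ring your code is currently running in: the lowest two bits of the `cs` segment register hold the current privilege level. Below is a minimal sketch (x86 user-space C with GCC/Clang inline assembly); run as a normal process, it should report ring 3.
+
+```
+#include <stdio.h>
+
+int main(void)
+{
+	unsigned short cs;
+
+	/* Copy the code segment selector into a general-purpose register. */
+	asm volatile("mov %%cs, %0" : "=r"(cs));
+
+	/* Bits 0-1 of a segment selector hold the privilege level. */
+	printf("running in ring %d\n", cs & 3);
+
+	return 0;
+}
+```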
+
+At this point, things get weird.
+What if you want software with higher privileges than the kernel, like a hypervisor?
+You get ring -1.
+But what if you want a piece of code that is run by the hardware in critical moments?
+You get ring -2.
+Ring -3?
+Someone got there.
+Fear not, we will explore these weird notions later.
+For now, let's do something practical.
+
+##### Tutorial: Reading the GDT of the Linux Kernel
+
+Go to the [`read-gdt`](./activities/read-gdt/) folder.
+There you have a simple kernel module that reads the GDT of the operating system, then prints each field.
+Run `make` to build the module, then `sudo insmod read_gdt.ko` to insert it.
+By running `sudo dmesg` you should see 16 GDT entries listed, together with the total size of the GDT and the virtual address where it is placed.
+Only 16 entries are listed, because the ones after that are null.
+Take a look at the entries and figure out what entries 1 to 6 represent.
+You should find 3 kernel entries and 3 user ones.
+Notice that entries 0 and 7 are null.
+Entry 0 should always be null.
+Entries from 8 onward are TSS and LDT entries, which won't be detailed in today's session.
+
+Note that a special instruction, `sgdt`, is used to retrieve the GDT pointer descriptor.
+The opposite instruction is `lgdt`.
+
+#### Paging
+
+Soon enough, people got tired of dealing with segmentation;
+a new method of dividing the memory was needed.
+Pages were born.
+Unlike segments, which can be of any size, pages have fixed sizes: the standard one is 4 KB, but other sizes exist as well.
+There are also huge pages, which are usually 2 MB, 4 MB or 1 GB.
+Pages are organised hierarchically, in a tree-like structure.
+A hardware component, called the MMU (Memory Management Unit), walks this structure.
+We won't go into details about how that structure is organised, so as not to turn this session into an Operating Systems design one.
+What is important to know is that each page has permissions, which are checked by the MMU at every access.
+The hardware doesn't, however, check whether a memory page is accessed by the process that should be able to access it.
+That is the role of the OS.
+
+TODO: develop paging - at least MMU and TLB
+
+#### Memory Protection Keys
+
+Let's consider the following scenario:
+an application wants to change an area of its memory from read-write to read-only, for reasons.
+To do this, it will call `mprotect()` on that area.
+What happens behind the scenes is that the OS changes the permissions of each page that is part of the memory area, then flushes the corresponding TLB entries.
+This is costly time-wise.
+As a solution, Intel proposed `MPK` (Memory Protection Keys), a mechanism that can quickly change permissions for a memory area of any size.
+How does this work?
+Up to the moment when MPK was proposed, page-table entries had 4 bits that weren't used.
+These 4 bits are turned into 16 possible `keys`.
+Furthermore, a register, `PKRU`, is added to hold the permissions for each of those keys, local to each thread.
+This allows an application to assign its pages to a _protection domain_.
+When accessing a page, instead of checking only the page permissions, the MMU will also check the protection domain's permissions.
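+
+On Linux, this is exposed through `pkey_alloc()`, `pkey_mprotect()` and `pkey_set()` (glibc wrappers available since glibc 2.27, on CPUs with MPK support). Below is a minimal sketch of the idea: the page stays mapped read-write the whole time, and switching access on and off only touches the thread's `PKRU`.
+
+```
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+
+int main(void)
+{
+	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (page == MAP_FAILED)
+		return 1;
+
+	/* Allocate a key; its domain starts with full access rights. */
+	int pkey = pkey_alloc(0, 0);
+	if (pkey == -1)
+		return 1;
+
+	/* Tag the page with the key; the page itself stays read-write. */
+	if (pkey_mprotect(page, 4096, PROT_READ | PROT_WRITE, pkey) != 0)
+		return 1;
+	strcpy(page, "hello");
+
+	/* One register write, no page-table change, no TLB flush. */
+	pkey_set(pkey, PKEY_DISABLE_WRITE);
+	printf("%s\n", page);	/* reads still work */
+	/* page[0] = 'H'; */	/* this write would now fault */
+
+	pkey_set(pkey, 0);	/* writes allowed again */
+	page[0] = 'H';
+
+	pkey_free(pkey);
+	munmap(page, 4096);
+	return 0;
+}
+```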
+
+Let's take a practical example:
+Application A has a page with read-write permissions.
+It allocates a `protection domain` with read-only permissions, then adds the page to that protection domain.
+When performing a write on that page, a Segmentation Fault will be received because, even though the page has the right permissions, the protection domain does not.
+
+Everything sounds nice, doesn't it?
+Well, it is not that nice.
+The reason for this is that the instruction used to modify `PKRU` is unprivileged.
+So, if an attacker gains the ability to execute arbitrary code, the whole mechanism can be bypassed.
+Another problem is, as detailed by [this paper](https://arxiv.org/pdf/1811.07276v1.pdf), the fact that, after an application frees a protection domain, the key isn't deleted from the page-table entries.
+So, if the same key is allocated again, it will still cover the previous pages, which should no longer be under a protection domain.
+A classic example of `use-after-free`.
+The final problem is that there are only 16 possible keys.
+For each process.
+A process that may load many libraries and run many threads, each wanting its own protection domains.
+You can see how this can go wrong.
+
+##### Tutorial: MPK Basics
+
+TODO: code where the students use `pkey_mprotect`
+
+##### Activity: I Do This For Your Own Good
+
+TODO: one program tries to read from a non-readable zone, enforced through PKU. The students must write a program to change the permissions of the `pkey`
+
+### Control-Flow Enforcement
+
+Don't you hate it when someone exploits your binary, using methods like `Return Oriented Programming` (ROP)?
+(By the way, we have 2 sessions dedicated to ROP, [here]() and [here]()).
+Well, other people hate it too, so they searched for a solution.
+Intel's solution was adding control-flow enforcement in hardware, with the new `Control-Flow Enforcement Technology` (CET) instructions.
+Those instructions are split into two parts:
+* `Indirect Branch Tracking`, which checks whether an indirect `jmp` or `call` instruction targets a valid code address, marked accordingly by the programmer / compiler.
+* `Shadow Stack`, which checks whether the return address was altered in any way.
+
+But wait, don't we have the `Stack Canary` for the last one?
+The main problem with that solution is that the canary is placed in the same memory zone as the return address.
+If we can modify the return address, what is stopping someone from reading the canary first, then modifying the return address, while keeping the canary intact?
+The shadow stack places a copy of the return address in a separate memory zone, which can only be accessed using special instructions.
+
+#### Indirect Branch Tracking
+
+The main idea behind some attacks is to chain small pieces of code (gadgets), usually in order to call a system function in a certain way.
+Those gadgets aren't real functions that should normally be called, but rather pieces of a normal function, ending with a `jmp` (Jump-Oriented Programming) or a `call` (Call-Oriented Programming).
+What if we say that the program can use `jmp` / `call` to reach only certain instructions?
+That's what `Indirect Branch Tracking` does:
+a new instruction is added, `endbr`.
+Every time an indirect `jmp` or `call` instruction is executed, the CPU expects the instruction at the target address to be `endbr`.
+Otherwise, an exception is raised, and the program is terminated.
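+
+To see what this looks like in practice, here is a minimal sketch (assuming a recent GCC; the flag is `-fcf-protection=branch`, and the file name in the comment is just for illustration): any function whose address is taken can be the target of an indirect call, so the compiler marks it as a valid landing pad.
+
+```
+/* Build with: gcc -fcf-protection=branch ibt_demo.c
+ * Then run `objdump -d` on the binary: `handler` starts with an `endbr64` instruction. */
+#include <stdio.h>
+
+static void handler(void)
+{
+	printf("reached through an indirect call\n");
+}
+
+int main(void)
+{
+	void (*fp)(void) = handler;	/* address taken: handler is an indirect branch target */
+
+	fp();	/* on IBT-enabled hardware, the CPU checks that the target starts with endbr */
+
+	return 0;
+}
+```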
+
+#### Hardware Shadow Stack
+
+
+### Intel MPX ?
diff --git a/hardware-memory-isolation/activities/mpk-basics/Makefile b/hardware-memory-isolation/activities/mpk-basics/Makefile
new file mode 100644
index 0000000..2a7e171
--- /dev/null
+++ b/hardware-memory-isolation/activities/mpk-basics/Makefile
@@ -0,0 +1,6 @@
+
+
+main: main.c
+
+clean:
+	rm main
\ No newline at end of file
diff --git a/hardware-memory-isolation/activities/mpk-basics/main.c b/hardware-memory-isolation/activities/mpk-basics/main.c
new file mode 100644
index 0000000..39c4ff6
--- /dev/null
+++ b/hardware-memory-isolation/activities/mpk-basics/main.c
@@ -0,0 +1,60 @@
+#define _GNU_SOURCE
+#include <stdio.h>
+
+#include <sys/mman.h>
+#include <string.h>
+#include <errno.h>
+
+#define STRING "Hello MPK\0"
+#define STRING_LEN 10
+
+/* Fallbacks, in case the libc headers don't define the pkey access flags. */
+#ifndef PKEY_DISABLE_ACCESS
+#define PKEY_DISABLE_ACCESS 0x1
+#endif
+#ifndef PKEY_DISABLE_WRITE
+#define PKEY_DISABLE_WRITE 0x2
+#endif
+
+int main()
+{
+    int ret, pkey;
+    char *string = mmap(0, STRING_LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+
+    if(string == MAP_FAILED)
+    {
+        fprintf(stderr, "mmap failed: %s\n", strerror(errno));
+        return 1;
+    }
+    memcpy(string, STRING, STRING_LEN);
+
+    printf("%s\n", string);
+
+    ret = mprotect(string, STRING_LEN, PROT_READ);
+    if(ret != 0)
+    {
+        fprintf(stderr, "mprotect failed: %s\n", strerror(errno));
+        return 1;
+    }
+
+    // writing would now segfault: the page is read-only
+    // string[0] = 'a';
+
+    pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
+    if(pkey == -1)
+    {
+        fprintf(stderr, "pkey_alloc failed: %s\n", strerror(errno));
+        return 1;
+    }
+
+    printf("pkey: %d\n", pkey);
+
+    ret = pkey_mprotect(string, STRING_LEN, PROT_READ | PROT_WRITE, pkey);
+    if(ret != 0)
+    {
+        fprintf(stderr, "pkey_mprotect failed: %s\n", strerror(errno));
+        return 1;
+    }
+
+    // expected to segfault: the page permissions allow the read,
+    // but the page's key was allocated with PKEY_DISABLE_ACCESS
+    printf("%s\n", string);
+
+    munmap(string, STRING_LEN);
+    pkey_free(pkey);
+
+    return 0;
+}
+
diff --git a/hardware-memory-isolation/activities/read-gdt/Makefile b/hardware-memory-isolation/activities/read-gdt/Makefile
new file mode 100644
index 0000000..edc8c85
--- /dev/null
+++ b/hardware-memory-isolation/activities/read-gdt/Makefile
@@ -0,0 +1,7 @@
+obj-m += read_gdt.o
+
+all:
+	make -C /home/cristi/WSL2-Linux-Kernel M=$(shell pwd) modules
+
+clean:
+	make -C /home/cristi/WSL2-Linux-Kernel M=$(shell pwd) clean
diff --git a/hardware-memory-isolation/activities/read-gdt/read_gdt.c b/hardware-memory-isolation/activities/read-gdt/read_gdt.c
new file mode 100644
index 0000000..c817050
--- /dev/null
+++ b/hardware-memory-isolation/activities/read-gdt/read_gdt.c
@@ -0,0 +1,74 @@
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+
+MODULE_DESCRIPTION("Read GDT Kernel Module");
+MODULE_LICENSE("GPL");
+
+/* Mirrors what `sgdt` stores: the GDT limit and its base address. */
+struct gdt_desc
+{
+	unsigned short size;
+	unsigned long address;
+} __attribute__((packed));
+
+/* Layout of a code/data segment descriptor, as described in the README. */
+struct gdt_entry
+{
+	unsigned short limit0;
+	unsigned short base0;
+	unsigned short base1: 8, a: 1, rw: 1, dc: 1, e: 1, s: 1, dpl: 2, p: 1;
+	unsigned short limit1: 4, res: 1, l: 1, d: 1, g: 1, base2: 8;
+} __attribute__((packed));
+
+/* System descriptors (TSS, LDT) have a 4-bit type instead of the access bits. */
+struct gdt_system_entry
+{
+	unsigned short limit0;
+	unsigned short base0;
+	unsigned short base1: 8, type: 4, s: 1, dpl: 2, p: 1;
+	unsigned short limit1: 4, res: 1, l: 1, d: 1, g: 1, base2: 8;
+} __attribute__((packed));
+
+static void print_gdt_entry(struct gdt_entry *entry)
+{
+	pr_info("\tlimit0: %hu, limit1: %hu\n", entry->limit0, entry->limit1);
+	pr_info("\tbase0: %hu, base1: %hu, base2: %hu\n", entry->base0, entry->base1, entry->base2);
+
+	if (entry->s)
+		pr_info("\te: %hu, dc: %hu, rw: %hu, a: %hu, s: %hu, dpl: %hu, p: %hu",
+			entry->e, entry->dc, entry->rw, entry->a, entry->s, entry->dpl, entry->p);
+	else
+		pr_info("\ttype: %hu, s: %hu, dpl: %hu, p: %hu",
+			((struct gdt_system_entry *)entry)->type, entry->s, entry->dpl, entry->p);
+
+	pr_info("\tl: %hu, d: %hu, g: %hu\n", entry->l, entry->d, entry->g);
+}
+
+static int __init gdt_read_init(void)
+{
+	int i;
+
+	struct gdt_desc desc;
+	struct gdt_entry *entries;
+
+	/* `sgdt` stores the size and virtual address of the current GDT. */
+	asm volatile("sgdt %0" : "=m" (desc));
+
+	pr_info("GDT size: %hu", desc.size);
+	pr_info("GDT address: 0x%lx", desc.address);
+
+	entries = (struct gdt_entry *)desc.address;
+
+	/* Print only the first 16 entries; the ones after that are null. */
+	for (i = 0; i < 16; i++)
+	{
+		pr_info("Entry number %d\n", i);
+		print_gdt_entry(entries + i);
+	}
+
+	return 0;
+}
+
+static void __exit gdt_read_exit(void)
+{
+	pr_debug("Bye\n");
+}
+
+module_init(gdt_read_init);
+module_exit(gdt_read_exit);