Commit
1 parent 9141fe1, commit fc35c4c
Showing 4 changed files with 276 additions and 17 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,195 @@ | ||
# Hardware Memory Isolation | ||
|
||
## Table of Contents | ||
|
||
## Prerequisites | ||
|
||
## Introduction | ||
|
||
Another role of the hardware, besides computing, is to provide isolation, mainly between different software components. | ||
You have encountered (hopefully) the main protection mechanism that the hardware uses to ensure memory isolation: pages. | ||
There are other, lesser-known mechanisms for ensuring memory isolation, which we will go through in this session: segments, privilege rings and memory protection keys. | ||
We will also dive into virtualization, focusing on hardware-assisted virtualization. | ||
You get to write a small hypervisor. | ||
|
||
## Memory Isolation | ||
|
||
### Pages and Segments | ||
|
||
The system needs a way to enforce ownership and permissions on memory zones. | ||
For example, it needs to enforce that a certain memory zone can only be read and executed, not written. | ||
How can you do this? | ||
The first answer that the CPU designers had is called **segmentation**. | ||
|
||
#### Segmentation | ||
|
||
Segmentation is the x86 CPU feature that allows assigning permissions and ownership to a certain memory zone, using segments. | ||
Modern systems no longer rely on segmentation for usual operations. | ||
Segmentation is only required at the early stages of booting, when pages cannot be used. | ||
But segmentation is still there, and it is tied to many other components of the system, so knowledge of how segmentation works is useful. | ||
Let's start at the beginning of time, with something called **real mode**. | ||
|
||
##### Real Mode Memory Addressing | ||
|
||
At the beginning of time, x86 CPUs had only 16-bit registers. | ||
This meant that the maximum memory size that could be used with the **flat memory addressing** model was 64KB. | ||
This was thought to be enough, until it wasn't. | ||
Instead of widening the registers, the CPU manufacturers came up with the **segmented memory addressing** model. | ||
This meant that each memory address, instead of using one register, used two: a normal register and a segment register, each 16 bits wide. | ||
The addresses were calculated as follows: | ||
``` | ||
addr = SEGMENT_REGISTER * 0x10 + OFFSET_REGISTER | ||
``` | ||
This led to a new memory maximum of almost 1MB, which was thought to be enough. | ||
Notice that, at that time, a segment was just a number. | ||
It didn't enforce any permissions on the memory zone, or ownership, because there was no user and kernel separation. | ||
The operating system had absolute power, and there was nothing but the operating system. | ||
Then came 32-bit registers, the need for applications, the need for isolation. | ||
The **protected mode** was born. | ||
|
||
##### Protected Mode Memory Addressing | ||
|
||
Well, all those things didn't appear at the same time. | ||
First, the registers were extended to 32 bits. | ||
Using the segmentation model, each segment would now fit 4GB of memory. | ||
Applying the real-mode formula to 32-bit values would let a system address about 68GB of memory (`0xFFFFFFFF * 0x10 + 0xFFFFFFFF`). | ||
But that much memory wasn't available. | ||
You could barely reach 4GB of RAM. | ||
So, the addressing model was switched back to the flat one: a memory address would be composed of only one register. | ||
|
||
The segments got another role: enforce isolation (pages still weren't a thing). | ||
A segment register would no longer hold a value used to compute an address, but an index into a table, the **GDT** (Global Descriptor Table). | ||
|
||
###### The GDT | ||
|
||
The GDT is a system table that contains descriptors of memory zones: where each zone starts, where it ends, how it grows, whether it is readable, writeable or executable, and who can access it. | ||
Each entry in the GDT is 8 bytes in size, and it looks like this: | ||
|
||
TODO: insert GDT entry diagram | ||
|
||
Let's break each field down: | ||
* Base: the address at which the segment begins | ||
* Limit: the first wacky one; the size of the segment, counted in bytes or in pages, depending on the G flag | ||
* Access: | ||
TODO: insert Access Byte diagram | ||
  * P: Present - 1 if the segment is valid | ||
  * DPL: Descriptor privilege level, where 0 is the highest privilege level, and 3 the lowest | ||
  * S: Descriptor type; not interesting for us today | ||
  * E: Executable; if 0, the segment is a data one; if 1, the segment is a code one | ||
  * D/C: its significance depends on the **E** bit | ||
    * if E is 0, the field is Direction; not interesting for today | ||
    * if E is 1, the field is Conforming; | ||
      * if the Conforming bit is 1, the code in this segment can be executed from an equal or lower privilege level; | ||
        for example, code in a segment with DPL equal to 1, with C equal to 1, can be executed by code running at privilege level 3 | ||
      * if the C bit is 0, the code in this segment can be executed only by code running at the same privilege level as the segment's DPL | ||
  * R/W: another field depending on E | ||
    * if E is 0, the field is Writeable; data segments are always readable | ||
    * if E is 1, the field is Readable; code segments are never writeable | ||
  * A: Accessed; not interesting | ||
* Flags: | ||
  * G - Granularity: is the Limit field counted in bytes (0), or in pages (1)? | ||
  * DB - Size: if 1, the segment is a 32-bit one, else it is a 16-bit one | ||
  * L - Long-mode code: if 1, the segment is a 64-bit code one | ||
TODO: insert Flags diagram | ||
|
||
As complicated as it looks, the GDT did its job of enforcing some memory protection; a CPU that uses a GDT and 32-bit addresses is said to be in **protected mode**. | ||
This came at the cost of the programmer's sanity. | ||
If you are wondering what the designers of this model were smoking, you are not alone. | ||
Fortunately, in modern 64-bit systems, **Base** and **Limit** are mostly ignored. | ||
A segment always covers the entire address space. | ||
Does that mean that the entire address space is in both code and data segments? | ||
Yes. | ||
Then how do we ensure separation between executable and writeable memory zones? | ||
Also, is the entire memory accessible by all privilege levels? | ||
This doesn't seem right. | ||
Enter pages. | ||
But first, something about privilege levels, also known as privilege rings. | ||
|
||
##### Privilege Rings | ||
|
||
You may have heard about kernel-space and user-space. | ||
How do we know if a memory zone belongs to the kernel-space, or the user-space? | ||
That memory zone is part of a segment, or page, as we will see later, that belongs to either kernel or user. | ||
The kernel-space is, in fact, any memory zone that belongs to **ring 0**, or DPL 0, and the user-space belongs to **ring 3**, or DPL 3. | ||
What about the other rings, 1 and 2? | ||
They can be used, but almost no one does. | ||
Some drivers use those rings, but it's not a common practice. | ||
|
||
At this point, things get weird. | ||
What if you want software with higher privileges than the kernel, like a hypervisor? | ||
You get ring -1. | ||
But what if you want a piece of code that is run by the hardware in critical moments? | ||
You get ring -2. | ||
Ring -3? | ||
Someone got there. | ||
Fear not, we will explore these weird notions later. | ||
For now, let's do something practical. | ||
|
||
##### Tutorial: Reading the GDT of the Linux Kernel | ||
|
||
Go to the [`read-gdt`](./activities/read-gdt/) folder. | ||
There you have a simple kernel module that reads the GDT of the operating system, then prints each field. | ||
Run `make` to build the module, then `sudo insmod read_gdt.ko` to insert the module. | ||
By running `sudo dmesg` you should see 16 GDT entries listed, along with the total size of the GDT and the virtual address where it is placed. | ||
Only 16 entries are listed, because the ones after that are null. | ||
Take a look at the entries, and figure out what entries 1 to 6 represent. | ||
You should find 3 kernel entries, and 3 user ones. | ||
Notice that entries 0 and 7 are null. | ||
Entry 0 should always be null. | ||
Entries from 8 onward are TSS and LDT entries, which won't be detailed in today's session. | ||
|
||
Note that a special instruction, `sgdt`, is used to retrieve the GDT pointer descriptor. | ||
The opposite instruction, which loads it, is `lgdt`. | ||
|
||
#### Paging | ||
|
||
Soon enough, people got tired of dealing with segmentation; | ||
a new method to divide the memory was needed. | ||
Pages were born. | ||
Unlike segments, which can be of any size, pages have a fixed size, usually 4KB. | ||
There are also huge pages, which are usually 2MB, 4MB or 1GB in size. | ||
Pages are organised hierarchically, in a tree-like structure. | ||
A hardware component, called the MMU (Memory Management Unit), walks this structure to translate addresses. | ||
We won't go into details about how that structure is organised, so as not to turn this session into an Operating Systems design session. | ||
What is important to know is that each page has permissions, which are checked by the MMU at every access. | ||
The hardware doesn't, however, check if a memory page is accessed by the process that should be able to access it. | ||
That is the role of the OS. | ||
|
||
#### Memory Protection Keys | ||
|
||
We have the following scenario: | ||
an application wants to change an area of its memory from read-write to read-only, for reasons. | ||
To do this, it will call `mprotect` on that area. | ||
Behind the scenes, the OS will change the permissions of each page that is part of the memory area, then flush the TLB. | ||
This is costly time-wise. | ||
As a solution, Intel introduced MPK (Memory Protection Keys), a feature that can quickly change permissions for an area of memory of any size. | ||
How does this work? | ||
Up to the moment when MPK was proposed, page-table entries had 4 bits that weren't used. | ||
These 4 bits are transformed into 16 possible `keys`. | ||
Furthermore, a register, `PKRU`, is added to hold the permissions for each of those keys, local to each thread. | ||
This allows an application to assign its pages to a `protection domain`. | ||
When accessing a page, instead of checking only the page permissions, the MMU will also check the protection domain permissions. | ||
|
||
Let's take a practical example: | ||
Application A has a page with read-write permissions. | ||
It allocates a `protection domain` with read permissions, then adds the page to that protection domain. | ||
When performing a write on that page, a Segmentation Fault will be received, because, even though the page has the right permissions, the protection domain does not. | ||
Everything sounds nice, doesn't it? | ||
Well, it is not. | ||
The reason for this is that the instruction used to modify `PKRU`, `WRPKRU`, is unprivileged. | ||
So, if an attacker gains the ability to execute arbitrary code, the whole mechanism can be bypassed. | ||
Another problem, as detailed by [this paper](https://arxiv.org/pdf/1811.07276v1.pdf), is that, after an application frees a protection domain, the key isn't removed from the page-table entries. | ||
So, if the same key is allocated again, it will still cover the previous pages, which should no longer be under a protection domain. | ||
A classic example of `use-after-free`. | ||
The final problem is that there are only 16 possible keys. | ||
That limit applies to each process, no matter how many threads or protection domains it needs. | ||
You can see how this can go wrong. | ||
|
||
### Control-Flow Enforcement | ||
|
||
#### Invalid Jump Detection | ||
|
||
#### Hardware Shadow Stack | ||
|
||
### Intel MPX |
7 changes: 7 additions & 0 deletions
hardware-memory-isolation/activities/read-gdt/public/Makefile
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
obj-m += read_gdt.o | ||
|
||
# Kernel build tree; override with `make KDIR=...` if yours lives elsewhere | ||
KDIR ?= /home/cristi/WSL2-Linux-Kernel | ||
|
||
all: | ||
	$(MAKE) -C $(KDIR) M=$(shell pwd) modules | ||
|
||
clean: | ||
	$(MAKE) -C $(KDIR) M=$(shell pwd) clean |
74 changes: 74 additions & 0 deletions
hardware-memory-isolation/activities/read-gdt/public/read_gdt.c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
#include <linux/kernel.h> | ||
#include <linux/init.h> | ||
#include <linux/module.h> | ||
|
||
MODULE_DESCRIPTION("Read GDT Kernel Module"); | ||
MODULE_LICENSE("GPL"); | ||
|
||
struct gdt_desc | ||
{ | ||
unsigned short size; | ||
unsigned long address; | ||
} __attribute__((packed)); | ||
|
||
struct gdt_entry | ||
{ | ||
unsigned short limit0; | ||
unsigned short base0; | ||
unsigned short base1: 8, a: 1, rw: 1, dc: 1, e: 1, s: 1, dpl: 2, p: 1; | ||
unsigned short limit1: 4, res: 1, l: 1, d: 1, g: 1, base2: 8; | ||
} __attribute__((packed)); | ||
|
||
struct gdt_system_entry | ||
{ | ||
unsigned short limit0; | ||
unsigned short base0; | ||
unsigned short base1: 8, type: 4, s: 1, dpl: 2, p: 1; | ||
unsigned short limit1: 4, res: 1, l: 1, d: 1, g: 1, base2: 8; | ||
} __attribute__((packed)); | ||
|
||
static void print_gdt_entry(struct gdt_entry *entry) | ||
{ | ||
pr_info("\tlimit0: %hu, limit1: %hu\n", entry->limit0, entry->limit1); | ||
pr_info("\tbase0: %hu, base1: %hu, base2: %hu\n", entry->base0, entry->base1, entry->base2); | ||
|
||
if (entry->s) | ||
pr_info("\te: %hu, dc: %hu, rw: %hu, a: %hu, s: %hu, dpl: %hu, p: %hu\n", | ||
entry->e, entry->dc, entry->rw, entry->a, entry->s, entry->dpl, entry->p); | ||
else | ||
pr_info("\ttype: %hu, s: %hu, dpl: %hu, p: %hu\n", | ||
((struct gdt_system_entry *)entry)->type, entry->s, entry->dpl, entry->p); | ||
|
||
pr_info("\tl: %hu, d: %hu, g: %hu\n", entry->l, entry->d, entry->g); | ||
} | ||
|
||
static int __init gdt_read_init(void) | ||
{ | ||
int i; | ||
|
||
struct gdt_desc desc; | ||
struct gdt_entry *entries; | ||
|
||
asm volatile("sgdt %0" : "=m" (desc)); | ||
|
||
pr_info("GDT size: %hu\n", desc.size); | ||
pr_info("GDT address: 0x%lx\n", desc.address); | ||
|
||
entries = (struct gdt_entry *)desc.address; | ||
|
||
/* desc.size is the GDT limit (size in bytes minus 1); never read past it */ | ||
for (i = 0; i < 16 && i < (desc.size + 1) / (int)sizeof(*entries); i++) | ||
{ | ||
pr_info("Entry number %d\n", i); | ||
print_gdt_entry(entries + i); | ||
} | ||
|
||
return 0; | ||
} | ||
|
||
static void __exit gdt_read_exit(void) | ||
{ | ||
pr_debug("Bye\n"); | ||
} | ||
|
||
module_init(gdt_read_init); | ||
module_exit(gdt_read_exit); |