Commit b6dd527

[release 1.10] backport GC developer docs to 1.10 (#52616)
The current GC developer docs are fairly out-of-date with the actual implementation. This PR should be strictly an NFC.
1 parent ed79752 commit b6dd527

File tree

2 files changed: +34 −55 lines changed

doc/src/devdocs/gc.md (+34 −55)
## Introduction

Julia has a non-moving, partially concurrent, parallel, generational, and mostly precise mark-sweep collector (an interface for conservative stack scanning is provided as an option for users who wish to call Julia from C).

## Allocation

Julia uses two types of allocators; the size of the allocation request determines which one is used. Objects up to 2k bytes are allocated through a per-thread free-list pool allocator, while objects larger than 2k bytes are allocated through libc's malloc.

Julia's pool allocator partitions objects into different size classes, so that a memory page managed by the pool allocator (which spans 4 operating-system pages on 64-bit platforms) only contains objects of the same size class. Each memory page from the pool allocator is paired with some page metadata stored on per-thread lock-free lists. The page metadata contains information such as whether the page has any live objects at all, the number of free slots, and offsets to the first and last objects in the free-list contained in that page. This metadata is used to optimize the collection phase: a page which has no live objects at all may be returned to the operating system without any need to scan it, for example.

While a page that has no objects may be returned to the operating system, its associated metadata is permanently allocated and may outlive the given page. As mentioned above, metadata for allocated pages is stored on per-thread lock-free lists. Metadata for free pages, however, may be stored in three separate lock-free lists depending on whether the page has been mapped but never accessed (`page_pool_clean`), has been lazily swept and is waiting to be madvised by a background GC thread (`page_pool_lazily_freed`), or has already been madvised (`page_pool_freed`).

Julia's pool allocator follows a "tiered" allocation discipline. When requesting a memory page for the pool allocator, Julia will:

- Try to claim a page from `page_pool_lazily_freed`, which contains pages which were empty on the last stop-the-world phase, but have not yet been madvised by a concurrent sweeper GC thread.

- If it failed to claim a page from `page_pool_lazily_freed`, it will try to claim a page from `page_pool_clean`, which contains pages which were mmapped on a previous page allocation request but never accessed.

- If it failed to claim a page from `page_pool_clean` and from `page_pool_lazily_freed`, it will try to claim a page from `page_pool_freed`, which contains pages which have already been madvised by a concurrent sweeper GC thread and whose underlying virtual address can be recycled.

- If it failed in all of the attempts mentioned above, it will mmap a batch of pages, claim one page for itself, and insert the remaining pages into `page_pool_clean`.

![Diagram of tiered pool allocation](./img/gc-tiered-allocation.jpg)

## Marking and Generational Collection

Julia's mark phase is implemented through a parallel iterative depth-first search over the object graph. Julia's collector is non-moving, so object age information can't be determined through the memory region in which the object resides alone, but has to be encoded in the object header or in a side table. The lowest two bits of an object's header are used to store, respectively, a mark bit that is set when an object is scanned during the mark phase and an age bit for the generational collection.

Generational collection is implemented through sticky bits: objects are only pushed to the mark-stack, and therefore traced, if their mark-bits are not set. When objects reach the oldest generation, their mark-bits are not reset during the so-called "quick-sweep", which leads to these objects not being traced in a subsequent mark phase. A "full-sweep", however, causes the mark-bits of all objects to be reset, leading to all objects being traced in a subsequent mark phase. Objects are promoted to the next generation during every sweep phase they survive. On the mutator side, field writes are intercepted through a write barrier that pushes an object's address into a per-thread remembered set if the object is in the last generation and the object being written into the field is not. Objects in this remembered set are then traced during the mark phase.

## Sweeping

Sweeping of object pools may fall into two categories: if a given page managed by the pool allocator contains at least one live object, then a free-list must be threaded through its dead objects; if a given page contains no live objects at all, then its underlying physical memory may be returned to the operating system through, for instance, the use of madvise system calls on Linux.

The first category of sweeping is currently serial and performed in the stop-the-world phase. For the second category of sweeping, if concurrent page sweeping is enabled through the flag `--gcthreads=X,1`, we perform the madvise system calls in a background sweeper thread, concurrently with the mutator threads. During the stop-the-world phase of the collector, pool-allocated pages which contain no live objects are initially pushed into `page_pool_lazily_freed`. The background sweeping thread is then woken up and is responsible for removing pages from `page_pool_lazily_freed`, calling madvise on them, and inserting them into `page_pool_freed`. As described above, `page_pool_lazily_freed` is also shared with mutator threads. This implies that on allocation-heavy multithreaded workloads, mutator threads will often avoid a page fault on allocation (coming from accessing a fresh mmapped page or accessing a madvised page) by directly allocating from a page in `page_pool_lazily_freed`, while the background sweeper thread needs to madvise a reduced number of pages, given that some of them were already claimed by the mutators.

## Heuristics

GC heuristics tune the GC by changing the size of the allocation interval between garbage collections. If a GC was unproductive, then we increase the size of the allocation interval to allow objects more time to die. If a GC returns a lot of space, we can shrink the interval. The goal is to find a steady state where we are allocating just about the same amount as we are collecting.