@def title = "Storage Changes coming to Dagger.jl"
@def hascode = true
@def date = Date(2022, 07, 01)
@def rss = "Storage Changes coming to Dagger.jl"
@def tags = ["dagger", "storage", "news"]
# Storage Changes coming to Dagger.jl
In my last blog post I wrote about how Dagger.jl, a pure-Julia graph scheduler, executes programs represented as DAGs (directed acyclic graphs). Part of executing a DAG involves moving data around between computers, and keeping track of results from each node of the DAG. When the DAG being executed can fit entirely in memory, this works out excellently. But what happens when it *doesn't* fit?
Let's bring this concept back to Dagger to ground it in reality. When Dagger executes a DAG, it has exclusive access to the results of every node of the DAG, meaning that if Dagger "forgets" about a result, Julia's GC will delete the memory used by that result. Similarly, when the user asks for the result of a node, it's up to Dagger to figure out how to get that result to the user, but *how* Dagger does that is opaque, and thus flexible. So, theoretically, between the result being created and the result being provided to the user, Dagger could have saved the result to disk, deleted the copy in memory, and then later read the result back into memory from disk, effectively swapping our result out of and into memory on demand.
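
To make that swap-out/swap-in idea concrete, here is a toy sketch in plain Julia using the `Serialization` standard library. It only illustrates the concept described above; the function names are made up and this is not how Dagger or MemPool actually implement it:

```julia
using Serialization

# "Swap out": write a result to disk so the in-memory copy can be dropped.
function swap_out(result)
    path = tempname()
    serialize(path, result)
    return path
end

# "Swap in": read the result back into memory on demand.
swap_in(path) = deserialize(path)

result = rand(10_000)      # pretend this is the output of one DAG node
path = swap_out(result)    # the result now also lives on disk
result = nothing           # drop our reference; the GC can reclaim the memory
result = swap_in(path)     # ...and later bring it back when it's needed again
```
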
The cool thing is, this *was* a theory, but now it's a reality - Dagger has the ability to do exactly this with results generated within a DAG as I just described. More specifically, our memory management library, MemPool.jl, gained "storage awareness" (just like we described above), and Dagger simply makes it possible to use this functionality automatically.
Let's see how we can do this in practice with Dagger's `DTable`. We first need to configure MemPool with a memory manager device. MemPool has a built-in Most-Recently Used (MRU) allocator that we can enable by setting a few environment variables:
```sh
JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 # Enable the MRU allocator globally
JULIA_MEMPOOL_EXPERIMENTAL_MEMORY_BOUND=$((1024**3)) # Set a 1GB limit for in-memory data
JULIA_MEMPOOL_EXPERIMENTAL_DISK_BOUND=$((32 * (1024**3))) # Set a 32GB limit for on-disk data
```
We can now launch Julia and do some table operations:
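
As a rough sketch of the kind of workload we mean (the table contents, chunk size, and specific operations below are illustrative, not prescriptive):

```julia
using Dagger, DataFrames

# Build a moderately large table and wrap it in a DTable,
# partitioned into chunks of 100_000 rows.
df = DataFrame(a = rand(1:10, 1_000_000), b = rand(1_000_000))
dt = Dagger.DTable(df, 100_000)

# Chunk-wise table operations:
dt = filter(row -> row.a > 5, dt)                  # keep rows where a > 5
dt = map(row -> (a = row.a, b2 = 2 * row.b), dt)   # derive a new column

# Materialize the result back into a DataFrame
result = fetch(dt, DataFrame)
```
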
Let's now ask MemPool's memory manager how much memory and disk it's using:
```julia
println(MemPool.GLOBAL_DEVICE[])
TODO
```
And we can check manually that our data is (at least partially) stored on disk:
```sh
du -sh ~/.mempool/
TODO
```
This is really cool! With a small amount of code, our table operations suddenly start operating out-of-core and let us scale beyond the amount of RAM in our computer. In fact, it's possible to scale even further with some tricks. In the example above, some of the data being stored is pretty repetitive; maybe we can get a bit fancy and compress our data before storing it to disk? Doing this is easy; we just need to tell MemPool to do data compression and decompression for us automatically:
TODO: Demo of inline data compression/decompression
```julia
# In a fresh Julia session
using DataFrames, PooledArrays, Dagger
using MemPool, CodecZlib
```

If you instead use an invocation like the above (using all the fancy flags to `Dagger.tochunk`), your per-file times would look more like 100 nanoseconds, *irrespective* of how slow the networked filesystem is, leading to a comfortable wait time of 8 milliseconds. How is this possible?! By cheating :) Instead of loading the file on the spot, this invocation just registers the path within MemPool's datastore, and only later (when a read of the data for a file is attempted) is the file actually opened and parsed. This means that if you never ask MemPool to access a file's data, it will never pass it to CSV.jl to be opened and parsed, so your program's users never have to spend time waiting on loading data that they don't use.
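
For intuition, here is a toy sketch of that deferred-loading pattern in plain Julia. It is not the actual `Dagger.tochunk`/MemPool machinery, and the file names are made up, but it shows why registering a path is nearly free while parsing happens only on first access:

```julia
using CSV, DataFrames

# A placeholder that records *where* the data lives, not the data itself.
struct LazyCSV
    path::String
end

# "Registering" a file just stores its path, which is effectively instantaneous
# no matter how slow the (networked) filesystem is.
register(path::AbstractString) = LazyCSV(path)

# Only when someone actually reads the data do we open and parse the file.
load(f::LazyCSV) = CSV.read(f.path, DataFrame)

# Registering tens of thousands of files is cheap
# (80,000 files at ~100 nanoseconds each is the ~8 millisecond wait above)...
files = [register("table_$i.csv") for i in 1:80_000]

# ...and only the files that are actually accessed ever get parsed:
# df = load(files[1])
```
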