
Commit f091b5c

fixup! Add Dagger Storage news post
1 parent 3997a5c commit f091b5c

1 file changed (+17 -13 lines):

news/2022/07/dagger-storage.md

@@ -1,3 +1,9 @@
@def title = "Storage Changes coming to Dagger.jl"
@def hascode = true
@def date = Date(2022, 07, 01)
@def rss = "Storage Changes coming to Dagger.jl"
@def tags = ["dagger", "storage", "news"]

# Storage Changes coming to Dagger.jl

In my last blog post I wrote about how Dagger.jl, a pure-Julia graph scheduler, executes programs represented as DAGs (directed acyclic graphs). Part of executing a DAG involves moving data around between computers, and keeping track of results from each node of the DAG. When the DAG being executed can fit entirely in memory, this works out excellently. But what happens when it *doesn't* fit?
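
For readers who didn't catch that post, here's a toy sketch of what executing a small DAG with Dagger looks like (an arbitrary computation, just to set the stage):

```julia
using Dagger

# Each `Dagger.@spawn` adds a node to the DAG; passing one task to another
# creates an edge, and Dagger moves the intermediate results around as needed.
a = Dagger.@spawn rand(1000, 1000)
b = Dagger.@spawn rand(1000, 1000)
c = Dagger.@spawn a * b       # depends on the results of `a` and `b`
fetch(Dagger.@spawn sum(c))   # block until the final result is available
```
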
@@ -14,19 +20,19 @@ Maybe instead of relying on the OS, we should try to do this ourselves. We'll al

Let's bring this concept back to Dagger to ground it in reality. When Dagger executes a DAG, it has exclusive access to the results of every node of the DAG, meaning that if Dagger "forgets" about a result, Julia's GC will delete the memory used by that result. Similarly, when the user asks for the result of a node, it's up to Dagger to figure out how to get that result to the user, but *how* Dagger does that is opaque, and thus flexible. So, theoretically, between the result being created and the result being provided to the user, Dagger could have saved the result to disk, deleted the copy in memory, and then later read the result back into memory from disk, effectively swapping our result out of and into memory on demand.
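
As a minimal illustration of that idea (a conceptual sketch only, not how Dagger or MemPool actually implement it):

```julia
using Serialization

# Conceptual sketch: "swap out" a result by writing it to disk and dropping
# the in-memory reference, then "swap in" by reading it back on demand.
result = rand(10_000, 1_000)       # pretend this is a DAG node's result
path = tempname()
serialize(path, result)            # the result now lives on disk
result = nothing                   # let the GC reclaim the in-memory copy
# ...later, when the user asks for the result...
result = deserialize(path)         # swapped back into memory on demand
```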

The cool thing is, this *was* a theory, but now it's a reality - Dagger has the ability to do exactly this with results generated within a DAG as I just described. More specifically, our memory management library, MemPool.jl, gained "storage awareness" (just like we described above), and Dagger simply makes it possible to use this functionality automatically.

Let's see how we can do this in practice with Dagger's `DTable`. We first need to configure MemPool with a memory manager device. MemPool has a built-in Most-Recently Used (MRU) allocator that we can enable by setting a few environment variables:

```sh
JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 # Enable the MRU allocator globally
JULIA_MEMPOOL_EXPERIMENTAL_MEMORY_BOUND=$((1024 ** 3)) # Set a 1GB limit for in-memory data
JULIA_MEMPOOL_EXPERIMENTAL_DISK_BOUND=$((32 * (1024 ** 3))) # Set a 32GB limit for on-disk data
```
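
These variables need to be present in the environment before Julia (and therefore MemPool) starts. Once inside Julia, a quick sanity check like the following (a hypothetical helper, not part of MemPool's API) confirms the limits that MemPool will see:

```julia
# Hypothetical sanity check: read back the bounds we set in the shell above.
gib(n) = n * 1024^3
mem_bound  = parse(Int, get(ENV, "JULIA_MEMPOOL_EXPERIMENTAL_MEMORY_BOUND", "0"))
disk_bound = parse(Int, get(ENV, "JULIA_MEMPOOL_EXPERIMENTAL_DISK_BOUND", "0"))
@assert mem_bound == gib(1) && disk_bound == gib(32)
```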

We can now launch Julia and do some table operations:

```julia
using DataFrames, PooledArrays
using Dagger

@@ -37,27 +43,27 @@ strings = ["alpha",

fetch(DTable(i->DataFrame(a=PooledArray(rand(strings, 1024^2)),
                          b=PooledArray(rand(UInt8, 1024^2))),
             200))
```
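
Just to show the kind of thing you can do with the result, here's a rough sketch of a follow-up operation (hedged: this assumes the table above is bound to a variable `dt`, and that `DTable` supports a row-predicate `filter`; the exact operation set may differ):

```julia
# Sketch of a row-wise operation on the DTable built above.
dt = DTable(i->DataFrame(a=PooledArray(rand(strings, 1024^2)),
                         b=PooledArray(rand(UInt8, 1024^2))),
            200)
small = filter(row -> row.b > 0x80, dt)  # lazy: evaluated chunk-by-chunk
fetch(small)                             # materialize the filtered result
```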

Let's now ask MemPool's memory manager how much memory and disk it's using:

```julia
println(MemPool.GLOBAL_DEVICE[])
TODO
```

And we can check manually that our data is (at least partially) stored on disk:

```sh
du -sh ~/.mempool/
TODO
```

This is really cool! With a small amount of code, our table operations suddenly operate out-of-core, letting us scale beyond the amount of RAM in our computer. In fact, it's possible to scale even further with some tricks. In the example above, some of the data being stored is pretty repetitive; maybe we can get a bit fancy and compress our data before storing it to disk? Doing this is easy: we just need to tell MemPool to do data compression and decompression for us automatically:
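
To ground the idea before the demo below, here's a standalone sketch of compress-on-write and decompress-on-read using CodecZlib directly; this is just the concept, not MemPool's actual hooks:

```julia
using Serialization, CodecZlib

# Serialize a repetitive result, compress it before it would hit disk,
# and decompress it again on read.
buf = IOBuffer()
serialize(buf, fill("alpha", 1024^2))            # highly repetitive data
raw = take!(buf)
compressed = transcode(GzipCompressor, raw)      # what would be written to disk
println("raw: $(length(raw)) bytes, compressed: $(length(compressed)) bytes")
restored = deserialize(IOBuffer(transcode(GzipDecompressor, compressed)))
```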

TODO: Demo of inline data compression/decompression
```julia
# In a fresh Julia session
using DataFrames, PooledArrays, Dagger
using MemPool, CodecZlib
@@ -104,6 +110,4 @@ Now, if you just naively passed all of those files paths into the `DTable` const

If you instead use an invocation like the above (using all the fancy flags to `Dagger.tochunk`), your per-file times would look more like 100 nanoseconds, *irrespective* of how slow the networked filesystem is, leading to a comfortable wait time of 8 milliseconds. How is this possible?! By cheating :) Instead of loading the file on the spot, this invocation just registers the path within MemPool's datastore, and only later (when the file's data is actually read) is the file opened and parsed. This means that if you never ask MemPool to access a file's data, it will never be passed to CSV.jl to be opened and parsed, so your program's users never have to spend time loading data that they don't use.
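
To illustrate the principle, here's a toy sketch of that kind of lazy loading (not the actual `Dagger.tochunk` invocation; the file list is hypothetical):

```julia
using CSV, DataFrames

# Toy sketch of the "cheat": registering a file is just storing its path;
# parsing only happens if and when the data is actually requested.
struct LazyFile
    path::String
end
load(f::LazyFile) = CSV.read(f.path, DataFrame)   # pays the I/O cost only here

paths = ["data/part-$i.csv" for i in 1:80_000]    # hypothetical file list
handles = map(LazyFile, paths)                    # near-instant; nothing is opened
# df = load(handles[1])                           # first real access parses the file
```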

{{addcomments}}
