@def title = "Storage Changes coming to Dagger.jl"
@def hascode = true
@def date = Date(2022, 07, 01)
@def rss = "Storage Changes coming to Dagger.jl"
@def tags = ["dagger", "storage", "news"]
# Storage Changes coming to Dagger.jl
In my last blog post I wrote about how Dagger.jl, a pure-Julia graph scheduler, executes programs represented as DAGs (directed acyclic graphs). Part of executing a DAG involves moving data around between computers, and keeping track of results from each node of the DAG. When the DAG being executed can fit entirely in memory, this works out excellently. But what happens when it *doesn't* fit?
Let's bring this concept back to Dagger to ground it in reality. When Dagger executes a DAG, it has exclusive access to the results of every node of the DAG, meaning that if Dagger "forgets" about a result, Julia's GC will delete the memory used by that result. Similarly, when the user asks for the result of a node, it's up to Dagger to figure out how to get that result to the user, but *how* Dagger does that is opaque, and thus flexible. So, theoretically, between the result being created and the result being provided to the user, Dagger could have saved the result to disk, deleted the copy in memory, and then later read the result back into memory from disk, effectively swapping our result out of and into memory on demand.
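
To make that swap-out/swap-in idea concrete, here is a toy sketch in plain Julia using the `Serialization` standard library. It only illustrates the concept described above; the function names are made up and this is not how Dagger or MemPool actually implement it:

```julia
using Serialization

# "Swap out": write a result to disk so the in-memory copy can be dropped.
function swap_out(result)
    path = tempname()
    serialize(path, result)
    return path
end

# "Swap in": read the result back into memory on demand.
swap_in(path) = deserialize(path)

result = rand(10_000)      # pretend this is the output of one DAG node
path = swap_out(result)    # the result now also lives on disk
result = nothing           # drop our reference; the GC can reclaim the memory
result = swap_in(path)     # ...and later bring it back when it's needed again
```
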
The cool thing is, this *was* a theory, but now it's a reality - Dagger has the ability to do exactly this with results generated within a DAG as I just described. More specifically, our memory management library, MemPool.jl, gained "storage awareness" (just like we described above), and Dagger simply makes it possible to use this functionality automatically.
Let's see how we can do this in practice with Dagger's `DTable`. We first need to configure MemPool with a memory manager device. MemPool has a built-in Most-Recently Used (MRU) allocator that we can enable by setting a few environment variables:
```sh
JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 # Enable the MRU allocator globally
JULIA_MEMPOOL_EXPERIMENTAL_MEMORY_BOUND=$((1024**3)) # Set a 1GB limit for in-memory data
JULIA_MEMPOOL_EXPERIMENTAL_DISK_BOUND=$((32 * (1024**3))) # Set a 32GB limit for on-disk data
```
We can now launch Julia and do some table operations:
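
As a rough sketch of the kind of workload we mean (the table contents, chunk size, and specific operations below are illustrative, not prescriptive):

```julia
using Dagger, DataFrames

# Build a moderately large table and wrap it in a DTable,
# partitioned into chunks of 100_000 rows.
df = DataFrame(a = rand(1:10, 1_000_000), b = rand(1_000_000))
dt = Dagger.DTable(df, 100_000)

# Chunk-wise table operations:
dt = filter(row -> row.a > 5, dt)                  # keep rows where a > 5
dt = map(row -> (a = row.a, b2 = 2 * row.b), dt)   # derive a new column

# Materialize the result back into a DataFrame
result = fetch(dt, DataFrame)
```
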
Let's now ask MemPool's memory manager how much memory and disk it's using:
```julia
println(MemPool.GLOBAL_DEVICE[])
TODO
```
And we can check manually that our data is (at least partially) stored on disk:
```sh
du -sh ~/.mempool/
TODO
```
This is really cool! With a small amount of code, our table operations suddenly start operating out-of-core and let us scale beyond the amount of RAM in our computer. In fact, it's possible to scale even further with some tricks. In the example above, some of the data being stored is pretty repetitive; maybe we can get a bit fancy and compress our data before storing it to disk? Doing this is easy; we just need to tell MemPool to do data compression and decompression for us automatically:
TODO: Demo of inline data compression/decompression
```julia
# In a fresh Julia session
using DataFrames, PooledArrays, Dagger
using MemPool, CodecZlib
```

If you instead use an invocation like the above (using all the fancy flags to `Dagger.tochunk`), your per-file times would look more like 100 nanoseconds, *irrespective* of how slow the networked filesystem is, leading to a comfortable wait time of 8 milliseconds. How is this possible?! By cheating :) Instead of loading the file on the spot, this invocation just registers the path within MemPool's datastore, and only later (when a read of the data for a file is attempted) is the file actually opened and parsed. This means that if you never ask MemPool to access a file's data, it will never pass it to CSV.jl to be opened and parsed, so your program's users never have to spend time waiting on loading data that they don't use.
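
For intuition, here is a toy sketch of that deferred-loading pattern in plain Julia. It is not the actual `Dagger.tochunk`/MemPool machinery, and the file names are made up, but it shows why registering a path is nearly free while parsing happens only on first access:

```julia
using CSV, DataFrames

# A placeholder that records *where* the data lives, not the data itself.
struct LazyCSV
    path::String
end

# "Registering" a file just stores its path, which is effectively instantaneous
# no matter how slow the (networked) filesystem is.
register(path::AbstractString) = LazyCSV(path)

# Only when someone actually reads the data do we open and parse the file.
load(f::LazyCSV) = CSV.read(f.path, DataFrame)

# Registering tens of thousands of files is cheap
# (80,000 files at ~100 nanoseconds each is the ~8 millisecond wait above)...
files = [register("table_$i.csv") for i in 1:80_000]

# ...and only the files that are actually accessed ever get parsed:
# df = load(files[1])
```
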