## What is graphchain?
Graphchain is like [joblib.Memory](https://joblib.readthedocs.io/en/latest/memory.html) for dask graphs. [Dask graph computations](https://docs.dask.org/en/latest/spec.html) are cached to a local or remote location of your choice, specified by a [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html).
When you change your dask graph (by changing a computation's implementation or its inputs), graphchain will take care to only recompute the minimum number of computations necessary to fetch the result. This allows you to iterate quickly over your graph without spending time on recomputing previously computed keys.
For example, consider the following dask graph:

```python
import dask
import graphchain
import pandas as pd

def create_dataframe(num_rows, num_cols):
    # Definition elided in the source; reconstructed from the graph below and
    # the "Creating DataFrame..." output so the example runs end to end.
    print("Creating DataFrame...")
    return pd.DataFrame(data=[range(num_cols)] * num_rows)

def expensive_computation(df, num_quantiles):
    print("Running expensive computation on DataFrame...")
    return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])

def summarize_dataframes(*dfs):
    print("Summing DataFrames...")
    return sum(df.sum().sum() for df in dfs)

dsk = {
    "df_a": (create_dataframe, 10_000, 1000),
    "df_b": (create_dataframe, 10_000, 1000),
    "df_c": (expensive_computation, "df_a", 2048),
    "df_d": (expensive_computation, "df_b", 2048),
    "result": (summarize_dataframes, "df_c", "df_d")
}
```
Using `dask.get` to fetch the `"result"` key takes about 6 seconds:
```python
>>> %time dask.get(dsk, "result")

Creating DataFrame...
Running expensive computation on DataFrame...
Creating DataFrame...
Running expensive computation on DataFrame...
Summing DataFrames...

CPU times: user 7.39 s, sys: 686 ms, total: 8.08 s
Wall time: 6.19 s
```
On the other hand, using `graphchain.get` for the first time to fetch `"result"` takes only 4 seconds:
```python
>>> %time graphchain.get(dsk, "result")

Creating DataFrame...
Running expensive computation on DataFrame...
Summing DataFrames...

CPU times: user 4.7 s, sys: 519 ms, total: 5.22 s
Wall time: 4.04 s
```
The reason `graphchain.get` is faster than `dask.get` is that it can load `df_b` and `df_d` from cache after `df_a` and `df_c` have been computed and cached. Note that graphchain will only cache the result of a computation if loading that computation from cache is estimated to be faster than simply running the computation.
Running `graphchain.get` a second time to fetch `"result"` will be almost instant since this time the result itself is also available from cache:
```python
>>> %time graphchain.get(dsk, "result")

CPU times: user 4.79 ms, sys: 1.79 ms, total: 6.58 ms
Wall time: 5.34 ms
```
Now let's say we want to change how the result is summarized from a sum to an average:
```python
def summarize_dataframes(*dfs):
    print("Averaging DataFrames...")
    return sum(df.mean().mean() for df in dfs) / len(dfs)
```
If we then ask graphchain to fetch `"result"`, it will detect that only `summarize_dataframes` has changed and therefore only recompute this function with inputs loaded from cache:
```python
>>> %time graphchain.get(dsk, "result")

Averaging DataFrames...

Wall time: 86.6 ms
```
Graphchain's cache is by default `./__graphchain_cache__`, but you can ask graphchain to use a cache at any [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html) such as `s3://mybucket/__graphchain_cache__`:
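(A minimal sketch, assuming `graphchain.get` accepts the cache location via a `location` keyword argument.)

```python
graphchain.get(dsk, "result", location="s3://mybucket/__graphchain_cache__")
```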
Graphchain can also be used as a graph optimizer for [dask.delayed](https://docs.dask.org/en/latest/delayed.html) computations by registering it as dask's `delayed_optimize` function. A sketch of how this might look (`optimize_s3` is assumed to bind `graphchain.optimize` to the S3 cache location above, and `result` is assumed to be a `dask.delayed` object built from functions like those in the example):

```python
import dask
import graphchain
from functools import partial

# Assumed setup (not shown in the source): bind graphchain's dask graph
# optimizer to the S3 cache location.
optimize_s3 = partial(graphchain.optimize, location="s3://mybucket/__graphchain_cache__/")

# `result` is assumed to be a dask.delayed object.
with dask.config.set(scheduler="sync", delayed_optimize=optimize_s3):
    print(result.compute())
```
### Using a custom serializer/deserializer
By default graphchain caches dask computations with [joblib.dump](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and LZ4 compression. However, you may also supply custom `serialize` and `deserialize` functions that write and read computations to and from a [PyFilesystem filesystem](https://docs.pyfilesystem.org/en/latest/introduction.html), respectively. For example, the following snippet shows how to serialize dask DataFrames with [dask.dataframe.to_parquet](https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html), while other objects are serialized with joblib:
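(A sketch only: the `serialize(obj, fs, key)` / `deserialize(fs, key)` signatures and the `graphchain.get` keyword arguments are assumptions, not graphchain's confirmed API; the Parquet/joblib split is from the source.)

```python
import dask.dataframe as dd
import graphchain
import joblib

def custom_serialize(obj, fs, key):
    """Write dask DataFrames as Parquet; pickle everything else with joblib."""
    if isinstance(obj, dd.DataFrame):
        # Assumes a filesystem with a local system path (e.g. OSFS).
        obj.to_parquet(fs.getsyspath(f"{key}.parquet"))
    else:
        with fs.open(f"{key}.joblib", "wb") as f:
            joblib.dump(obj, f)

def custom_deserialize(fs, key):
    """Load a cached computation written by custom_serialize."""
    if fs.exists(f"{key}.parquet"):
        return dd.read_parquet(fs.getsyspath(f"{key}.parquet"))
    with fs.open(f"{key}.joblib", "rb") as f:
        return joblib.load(f)

# Hypothetical usage: supply the pair when fetching a key.
result = graphchain.get(dsk, "result", serialize=custom_serialize, deserialize=custom_deserialize)
```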