
Commit ae960a0

feat: add (de)serialization customization (#76)
1 parent 225303b commit ae960a0

10 files changed: +559, -225 lines

.flake8 (+1, -1)

@@ -3,7 +3,7 @@
 # https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length
 # TODO: https://github.com/PyCQA/flake8/issues/234
 doctests = True
-ignore = B019,DAR103,E203,E501,FS003,S101,W503
+ignore = DAR103,E203,E501,FS003,S101,W503
 max_line_length = 100
 max_complexity = 10

README.md (+86, -38)
@@ -4,7 +4,7 @@

 ## What is graphchain?

-Graphchain is like [joblib.Memory](https://joblib.readthedocs.io/en/latest/memory.html#memory) for dask graphs. [Dask graph computations](http://dask.pydata.org/en/latest/spec.html) are cached to a local or remote location of your choice, specified by a [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html).
+Graphchain is like [joblib.Memory](https://joblib.readthedocs.io/en/latest/memory.html) for dask graphs. [Dask graph computations](https://docs.dask.org/en/latest/spec.html) are cached to a local or remote location of your choice, specified by a [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html).

 When you change your dask graph (by changing a computation's implementation or its inputs), graphchain will take care to only recompute the minimum number of computations necessary to fetch the result. This allows you to iterate quickly over your graph without spending time on recomputing previously computed keys.
@@ -23,7 +23,7 @@ Additionally, the result of a computation is only cached if it is estimated that

 Install graphchain with pip to get started:

-```bash
+```sh
 pip install graphchain
 ```

@@ -35,35 +35,35 @@ import graphchain
 import pandas as pd

 def create_dataframe(num_rows, num_cols):
-    print('Creating DataFrame...')
+    print("Creating DataFrame...")
     return pd.DataFrame(data=[range(num_cols)]*num_rows)

-def complicated_computation(df, num_quantiles):
-    print('Running complicated computation on DataFrame...')
+def expensive_computation(df, num_quantiles):
+    print("Running expensive computation on DataFrame...")
     return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])

-def summarise_dataframes(*dfs):
-    print('Summing DataFrames...')
+def summarize_dataframes(*dfs):
+    print("Summing DataFrames...")
     return sum(df.sum().sum() for df in dfs)

 dsk = {
-    'df_a': (create_dataframe, 10_000, 1000),
-    'df_b': (create_dataframe, 10_000, 1000),
-    'df_c': (complicated_computation, 'df_a', 2048),
-    'df_d': (complicated_computation, 'df_b', 2048),
-    'result': (summarise_dataframes, 'df_c', 'df_d')
+    "df_a": (create_dataframe, 10_000, 1000),
+    "df_b": (create_dataframe, 10_000, 1000),
+    "df_c": (expensive_computation, "df_a", 2048),
+    "df_d": (expensive_computation, "df_b", 2048),
+    "result": (summarize_dataframes, "df_c", "df_d")
 }
 ```

-Using `dask.get` to fetch the `'result'` key takes about 6 seconds:
+Using `dask.get` to fetch the `"result"` key takes about 6 seconds:

 ```python
->>> %time dask.get(dsk, 'result')
+>>> %time dask.get(dsk, "result")

 Creating DataFrame...
-Running complicated computation on DataFrame...
+Running expensive computation on DataFrame...
 Creating DataFrame...
-Running complicated computation on DataFrame...
+Running expensive computation on DataFrame...
 Summing DataFrames...

 CPU times: user 7.39 s, sys: 686 ms, total: 8.08 s
@@ -73,10 +73,10 @@ Wall time: 6.19 s
 On the other hand, using `graphchain.get` for the first time to fetch `'result'` takes only 4 seconds:

 ```python
->>> %time graphchain.get(dsk, 'result')
+>>> %time graphchain.get(dsk, "result")

 Creating DataFrame...
-Running complicated computation on DataFrame...
+Running expensive computation on DataFrame...
 Summing DataFrames...

 CPU times: user 4.7 s, sys: 519 ms, total: 5.22 s
@@ -85,10 +85,10 @@ Wall time: 4.04 s

 The reason `graphchain.get` is faster than `dask.get` is because it can load `df_b` and `df_d` from cache after `df_a` and `df_c` have been computed and cached. Note that graphchain will only cache the result of a computation if loading that computation from cache is estimated to be faster than simply running the computation.

-Running `graphchain.get` a second time to fetch `'result'` will be almost instant since this time the result itself is also available from cache:
+Running `graphchain.get` a second time to fetch `"result"` will be almost instant since this time the result itself is also available from cache:

 ```python
->>> %time graphchain.get(dsk, 'result')
+>>> %time graphchain.get(dsk, "result")

 CPU times: user 4.79 ms, sys: 1.79 ms, total: 6.58 ms
 Wall time: 5.34 ms
@@ -97,15 +97,15 @@ Wall time: 5.34 ms
 Now let's say we want to change how the result is summarised from a sum to an average:

 ```python
-def summarise_dataframes(*dfs):
-    print('Averaging DataFrames...')
+def summarize_dataframes(*dfs):
+    print("Averaging DataFrames...")
     return sum(df.mean().mean() for df in dfs) / len(dfs)
 ```

-If we then ask graphchain to fetch `'result'`, it will detect that only `summarise_dataframes` has changed and therefore only recompute this function with inputs loaded from cache:
+If we then ask graphchain to fetch `"result"`, it will detect that only `summarize_dataframes` has changed and therefore only recompute this function with inputs loaded from cache:

 ```python
->>> %time graphchain.get(dsk, 'result')
+>>> %time graphchain.get(dsk, "result")

 Averaging DataFrames...

@@ -118,49 +118,97 @@ Wall time: 86.6 ms
 Graphchain's cache is by default `./__graphchain_cache__`, but you can ask graphchain to use a cache at any [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html) such as `s3://mybucket/__graphchain_cache__`:

 ```python
-graphchain.get(dsk, 'result', location='s3://mybucket/__graphchain_cache__')
+graphchain.get(dsk, "result", location="s3://mybucket/__graphchain_cache__")
 ```

 ### Excluding keys from being cached

 In some cases you may not want a key to be cached. To avoid writing certain keys to the graphchain cache, you can use the `skip_keys` argument:

 ```python
-graphchain.get(dsk, 'result', skip_keys=['result'])
+graphchain.get(dsk, "result", skip_keys=["result"])
 ```

 ### Using graphchain with dask.delayed

 Alternatively, you can use graphchain together with dask.delayed for easier dask graph creation:

 ```python
+import dask
+import pandas as pd
+
 @dask.delayed
 def create_dataframe(num_rows, num_cols):
-    print('Creating DataFrame...')
+    print("Creating DataFrame...")
     return pd.DataFrame(data=[range(num_cols)]*num_rows)

 @dask.delayed
-def complicated_computation(df, num_quantiles):
-    print('Running complicated computation on DataFrame...')
+def expensive_computation(df, num_quantiles):
+    print("Running expensive computation on DataFrame...")
     return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])

 @dask.delayed
-def summarise_dataframes(*dfs):
-    print('Summing DataFrames...')
+def summarize_dataframes(*dfs):
+    print("Summing DataFrames...")
     return sum(df.sum().sum() for df in dfs)

-df_a = create_dataframe(num_rows=50_000, num_cols=500, seed=42)
-df_b = create_dataframe(num_rows=50_000, num_cols=500, seed=42)
-df_c = complicated_computation(df_a, window=3)
-df_d = complicated_computation(df_b, window=3)
-result = summarise_dataframes(df_c, df_d)
+df_a = create_dataframe(num_rows=10_000, num_cols=1000)
+df_b = create_dataframe(num_rows=10_000, num_cols=1000)
+df_c = expensive_computation(df_a, num_quantiles=2048)
+df_d = expensive_computation(df_b, num_quantiles=2048)
+result = summarize_dataframes(df_c, df_d)
 ```

 After which you can compute `result` by setting the `delayed_optimize` method to `graphchain.optimize`:

 ```python
-with dask.config.set(scheduler='sync', delayed_optimize=graphchain.optimize):
-    result.compute(location='s3://mybucket/__graphchain_cache__')
+import graphchain
+from functools import partial
+
+optimize_s3 = partial(graphchain.optimize, location="s3://mybucket/__graphchain_cache__/")
+
+with dask.config.set(scheduler="sync", delayed_optimize=optimize_s3):
+    print(result.compute())
+```
+
+### Using a custom serializer/deserializer
+
+By default graphchain will cache dask computations with [joblib.dump](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and LZ4 compression. However, you may also supply a custom `serialize` and `deserialize` function that writes and reads computations to and from a [PyFilesystem filesystem](https://docs.pyfilesystem.org/en/latest/introduction.html), respectively. For example, the following snippet shows how to serialize dask DataFrames with [dask.dataframe.to_parquet](https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html), while other objects are serialized with joblib:
+
+```python
+import dask.dataframe
+import graphchain
+import fs.osfs
+import joblib
+import os
+from functools import partial
+from typing import Any
+
+def custom_serialize(obj: Any, fs: fs.osfs.OSFS, key: str) -> None:
+    """Serialize dask DataFrames with to_parquet, and other objects with joblib.dump."""
+    if isinstance(obj, dask.dataframe.DataFrame):
+        obj.to_parquet(os.path.join(fs.root_path, "parquet", key))
+    else:
+        with fs.open(f"{key}.joblib", "wb") as fid:
+            joblib.dump(obj, fid)
+
+def custom_deserialize(fs: fs.osfs.OSFS, key: str) -> Any:
+    """Deserialize dask DataFrames with read_parquet, and other objects with joblib.load."""
+    if fs.exists(f"{key}.joblib"):
+        with fs.open(f"{key}.joblib", "rb") as fid:
+            return joblib.load(fid)
+    else:
+        return dask.dataframe.read_parquet(os.path.join(fs.root_path, "parquet", key))
+
+optimize_parquet = partial(
+    graphchain.optimize,
+    location="./__graphchain_cache__/custom/",
+    serialize=custom_serialize,
+    deserialize=custom_deserialize
+)
+
+with dask.config.set(scheduler="sync", delayed_optimize=optimize_parquet):
+    print(result.compute())
 ```

 ## Contributing
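
Editor's note, not part of the diff above: the `serialize` and `deserialize` hooks introduced by this commit follow the `(obj, fs, key) -> None` and `(fs, key) -> Any` signatures shown in the README example. As a minimal sketch of how the same hooks might special-case another object type, the snippet below stores NumPy arrays with `numpy.save`/`numpy.load` and falls back to joblib for everything else. The function names, the ndarray special case, and the cache location are illustrative assumptions, not part of this commit.

```python
import numpy as np
import joblib
import fs.osfs
from functools import partial
from typing import Any

import graphchain

def numpy_serialize(obj: Any, fs: fs.osfs.OSFS, key: str) -> None:
    """Store NumPy arrays with np.save, and any other object with joblib.dump (illustrative)."""
    if isinstance(obj, np.ndarray):
        with fs.open(f"{key}.npy", "wb") as fid:
            np.save(fid, obj, allow_pickle=False)
    else:
        with fs.open(f"{key}.joblib", "wb") as fid:
            joblib.dump(obj, fid)

def numpy_deserialize(fs: fs.osfs.OSFS, key: str) -> Any:
    """Load NumPy arrays with np.load, and any other object with joblib.load (illustrative)."""
    if fs.exists(f"{key}.npy"):
        with fs.open(f"{key}.npy", "rb") as fid:
            return np.load(fid)
    else:
        with fs.open(f"{key}.joblib", "rb") as fid:
            return joblib.load(fid)

# Plugged into graphchain.optimize the same way as the optimize_parquet example
# in the README diff above; the location subdirectory is an arbitrary choice.
optimize_numpy = partial(
    graphchain.optimize,
    location="./__graphchain_cache__/numpy/",
    serialize=numpy_serialize,
    deserialize=numpy_deserialize,
)
```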
