Dict memoization sorting before normalizing keys is unsafe #2140

Open
sirosen opened this issue Oct 28, 2021 · 4 comments · May be fixed by #3157
Labels
memo/checkpoint, outreachy (Good initial contributions for Outreachy applicants)

Comments


sirosen commented Oct 28, 2021

This line struck me as odd:

keys = sorted(denormalized_dict)

My first thought was that it might not give a stable sort if objects are hashable (i.e. usable as dict keys) but define something meaningless for comparison (my thought was comparison of memory addresses).

The case I found when trying to exploit this to produce instability was a type which is hashable but raises an error on comparison. Functions, for example:

>>> def foo(): return 1
...
>>> def bar(): return 2
...
>>> d = {foo: 1, bar: 2}
>>> sorted(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'function' and 'function'

To show that this doesn't only happen in "silly" cases, I also tried out enum.auto, and found that it is also hashable but not comparable:

>>> import enum
>>> class Foo(enum.Enum):
...     x = enum.auto()
...     y = enum.auto()
...
>>> Foo.x < Foo.y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'Foo' and 'Foo'

In theory we might also be able to find a type which does define comparison, but does so in an unreliable way tied to the state of the current process. Given the presence of types which define __hash__ but not __lt__, I think we should assume that such objects exist.
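
(As a hedged illustration, not from the original report: the hypothetical class below is hashable and sortable, but its ordering is derived from id(), i.e. memory addresses, so the order of sorted dict keys, and hence any hash computed from them, is not tied to anything persistent across processes.)

class IdOrdered:
    """Hypothetical example: comparison tied to process state."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, IdOrdered) and self.name == other.name

    def __lt__(self, other):
        # orders by memory address: well-defined within one process,
        # but arbitrary between processes
        return id(self) < id(other)

d = {IdOrdered("a"): 1, IdOrdered("b"): 2}
print([k.name for k in sorted(d)])  # ordering reflects allocation addresses, not any persistent property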

Resolution

To solve this, I believe all we need to do is run the key normalization step first, saving the normalization results in a temporary dict, and then operate on the normalized keyspace.

Something to the effect of

def id_for_memo_dict(d):
    # map each normalized key back to its original key
    keymap = {id_for_memo(k): k for k in d}
    # sort over the normalized keys (which are always comparable),
    # not over the original, possibly uncomparable, keys
    normed_keys = sorted(keymap)

    normalized_list = []
    for k in normed_keys:
        normalized_list.append(k)
        normalized_list.append(id_for_memo(d[keymap[k]]))
    return serialize(normalized_list)
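
(A quick sanity check of the idea, as a sketch rather than working Parsl code: the normalization step turns keys into byte strings, and bytes always support sorting, so the uncomparable originals never need to define <. Here pickle.dumps stands in for id_for_memo / serialize.)

import enum
import pickle

class Foo(enum.Enum):
    x = enum.auto()
    y = enum.auto()

d = {Foo.x: 1, Foo.y: 2}

# normalized key -> original key; sorting bytes never raises
keymap = {pickle.dumps(k): k for k in d}
for nk in sorted(keymap):
    print(nk[-8:], d[keymap[nk]])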
@benclifford (Collaborator)

crossref #1591 which also talks about function memoization

benclifford added the outreachy (Good initial contributions for Outreachy applicants) label on Mar 5, 2024
@MundiaNderi

Hey @benclifford, I'd like to learn more about memoization and work on this issue.

@benclifford (Collaborator)

@MundiaNderi ok, some places to start:

Do the Parsl tutorial to get a basic understanding of what we're trying to do here. It's linked from the front page of parsl-project.org and is: https://mybinder.org/v2/gh/Parsl/parsl-tutorial/master

There's a section in the user guide, https://parsl.readthedocs.io/en/stable/userguide/checkpoints.html, where the words "memoization", "caching" and "checkpointing" are used fairly interchangeably (although they have some subtle differences); this section will give you some background information.

There are some test cases that you can look at the source code for and check you can run them:

pytest parsl/tests/test_python_apps/test_memoize_* --config parsl/tests/configs/local_threads.py

and

pytest parsl/tests/test_checkpointing/ --config local

That should give you some understanding of the background for this issue and the sort of things we test here.

Then a basic development sequence might be:

i) make a new test case that fails, using pytest (a sketch of what such a test might look like follows this list)
ii) make changes to the code - maybe based around the changes @sirosen wrote about in the issue above, which would go in this file: parsl/dataflow/memoization.py
iii) see that your new test case doesn't fail any more
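
A minimal sketch of such a failing test (this is only an assumption about its shape: the app, the test name and the use of the local marker are made up, and it relies on the test harness loading a config via --config as in the commands above):

import enum

import pytest

from parsl import python_app


class Color(enum.Enum):
    RED = enum.auto()
    BLUE = enum.auto()


@python_app(cache=True)
def count_entries(d):
    return len(d)


@pytest.mark.local
def test_memoize_dict_with_uncomparable_keys():
    # the dict keys are hashable but do not support '<', which currently
    # makes the sorted() call in id_for_memo_dict raise TypeError
    d = {Color.RED: 1, Color.BLUE: 2}
    assert count_entries(d).result() == 2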

@benclifford (Collaborator)

A note from Matthias Diener that I'm adding here for future reference:

The persistent hashing that we are doing is done by pytools.KeyBuilder: https://github.com/inducer/pytools/blob/.../pytools/persistent_dict.py#L206

In particular, for dictionaries and sets, KeyBuilder uses an unordered hashing function (https://github.com/inducer/pytools/blob/3e9d9d2f32b4d2eed4ca2340a9d02e396ac167.../pytools/__init__.py#L2832) to ensure a stable hash independent of element order (without needing to sort). Maybe something like that could be beneficial in the context of #2140.
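
For illustration only (a sketch, not pytools' actual code): an order-independent combination of per-entry digests, for example a commutative modular sum, produces the same dict hash regardless of iteration order, so the keys never need to be sorted or compared. In Parsl the per-entry bytes would presumably come from id_for_memo rather than the repr() stand-in used here.

import hashlib

def _digest(b: bytes) -> int:
    # hash one serialized item to a fixed-width integer
    return int.from_bytes(hashlib.sha256(b).digest()[:16], "big")

def unordered_dict_digest(d: dict) -> int:
    # combine per-entry digests with modular addition (commutative), so the
    # result is independent of the order in which entries are visited
    total = 0
    for k, v in d.items():
        entry = _digest(repr(k).encode()) ^ _digest(repr(v).encode())
        total = (total + entry) % (1 << 128)
    return total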
