Dict memoization sorting before normalizing keys is unsafe #2140

Open
sirosen opened this issue Oct 28, 2021 · 4 comments · May be fixed by #3157
Labels
memo/checkpoint, outreachy (Good initial contributions for Outreachy applicants)

Comments


sirosen commented Oct 28, 2021

This line struck me as odd:

keys = sorted(denormalized_dict)

My first thought was that it might not give a stable sort if objects are hashable (i.e. usable as dict keys) but define something meaningless for comparison (my thought was comparison of memory addresses).

The case I found when trying to exploit this to produce instability was a type which is hashable but raises an error on comparison. Functions, for example:

>>> def foo(): return 1
...
>>> def bar(): return 2
...
>>> d = {foo: 1, bar: 2}
>>> sorted(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'function' and 'function'

To show that this doesn't only happen in "silly" cases, I also tried out enum.auto, and found that it is also hashable but not comparable:

>>> import enum
>>> class Foo(enum.Enum):
...     x = enum.auto()
...     y = enum.auto()
...
>>> Foo.x < Foo.y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'Foo' and 'Foo'

In theory we might also be able to find a type which does define comparison, but does so in an unreliable way tied to the state of the current process. Given the presence of types which define __hash__ but not __lt__, I think we should assume that such objects exist.
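
(As a hedged illustration, not from the original report: the hypothetical class below is hashable and sortable, but its ordering is derived from id(), i.e. memory addresses, so the order of sorted dict keys, and hence any hash computed from them, is not tied to anything persistent across processes.)

class IdOrdered:
    """Hypothetical example: comparison tied to process state."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, IdOrdered) and self.name == other.name

    def __lt__(self, other):
        # orders by memory address: well-defined within one process,
        # but arbitrary between processes
        return id(self) < id(other)

d = {IdOrdered("a"): 1, IdOrdered("b"): 2}
print([k.name for k in sorted(d)])  # ordering reflects allocation addresses, not any persistent property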

Resolution

To solve this, I believe all we need to do is run the key normalization step first, saving the normalization results in a temporary dict, and then operate on the normalized keyspace.

Something to the effect of

def id_for_memo_dict(d):
    # map each normalized key back to its original key
    keymap = {id_for_memo(k): k for k in d}
    # sort over the normalized keys (which are always comparable),
    # not over the original, possibly uncomparable, keys
    normed_keys = sorted(keymap)

    normalized_list = []
    for k in normed_keys:
        normalized_list.append(k)
        normalized_list.append(id_for_memo(d[keymap[k]]))
    return serialize(normalized_list)
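
(A quick sanity check of the idea, as a sketch rather than working Parsl code: the normalization step turns keys into byte strings, and bytes always support sorting, so the uncomparable originals never need to define <. Here pickle.dumps stands in for id_for_memo / serialize.)

import enum
import pickle

class Foo(enum.Enum):
    x = enum.auto()
    y = enum.auto()

d = {Foo.x: 1, Foo.y: 2}

# normalized key -> original key; sorting bytes never raises
keymap = {pickle.dumps(k): k for k in d}
for nk in sorted(keymap):
    print(nk[-8:], d[keymap[nk]])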
@benclifford (Collaborator)

crossref #1591 which also talks about function memoization

benclifford added the outreachy (Good initial contributions for Outreachy applicants) label on Mar 5, 2024
@MundiaNderi

Hey @benclifford, I'd like to learn more about memoization and work on this issue.

@benclifford (Collaborator)

@MundiaNderi ok, some places to start:

Do the Parsl tutorial to get a basic understanding of what we're trying to do here. It's linked from the front page of parsl-project.org and is: https://mybinder.org/v2/gh/Parsl/parsl-tutorial/master

There's a section in the user guide, https://parsl.readthedocs.io/en/stable/userguide/checkpoints.html, where the words "memoization", "caching" and "checkpointing" are used fairly interchangeably (although they have some subtle differences); this section will give you some background information.

There are some test cases that you can look at the source code for and check you can run them:

pytest parsl/tests/test_python_apps/test_memoize_* --config parsl/tests/configs/local_threads.py

and

pytest parsl/tests/test_checkpointing/ --config local

That should give you some understanding of the background for this issue and the sort of things we test here.

Then a basic development sequence might be:

i) make a new test case that fails, using pytest (a sketch of what such a test might look like follows this list)
ii) make changes to the code - maybe based around the changes @sirosen wrote about in the issue above, which would go in this file: parsl/dataflow/memoization.py
iii) see that your new test case doesn't fail any more
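
A minimal sketch of such a failing test (this is only an assumption about its shape: the app, the test name and the use of the local marker are made up, and it relies on the test harness loading a config via --config as in the commands above):

import enum

import pytest

from parsl import python_app


class Color(enum.Enum):
    RED = enum.auto()
    BLUE = enum.auto()


@python_app(cache=True)
def count_entries(d):
    return len(d)


@pytest.mark.local
def test_memoize_dict_with_uncomparable_keys():
    # the dict keys are hashable but do not support '<', which currently
    # makes the sorted() call in id_for_memo_dict raise TypeError
    d = {Color.RED: 1, Color.BLUE: 2}
    assert count_entries(d).result() == 2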

@benclifford (Collaborator)

A note from Matthias Diener that I'm adding here for future reference:

The persistent hashing that we are doing is done by pytools.KeyBuilder: https://github.com/inducer/pytools/blob/.../pytools/persistent_dict.py#L206

In particular, for dictionaries and sets, KeyBuilder uses an unordered hashing function (https://github.com/inducer/pytools/blob/3e9d9d2f32b4d2eed4ca2340a9d02e396ac167.../pytools/__init__.py#L2832) to ensure a stable hash independent of element order (without needing to sort). Maybe something like that could be beneficial in the context of #2140.
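
For illustration only (a sketch, not pytools' actual code): an order-independent combination of per-entry digests, for example a commutative modular sum, produces the same dict hash regardless of iteration order, so the keys never need to be sorted or compared. In Parsl the per-entry bytes would presumably come from id_for_memo rather than the repr() stand-in used here.

import hashlib

def _digest(b: bytes) -> int:
    # hash one serialized item to a fixed-width integer
    return int.from_bytes(hashlib.sha256(b).digest()[:16], "big")

def unordered_dict_digest(d: dict) -> int:
    # combine per-entry digests with modular addition (commutative), so the
    # result is independent of the order in which entries are visited
    total = 0
    for k, v in d.items():
        entry = _digest(repr(k).encode()) ^ _digest(repr(v).encode())
        total = (total + entry) % (1 << 128)
    return total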
