Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use rdFingerprinGenerator #99

Merged
merged 6 commits into from
Aug 13, 2024
Merged

Conversation

isty2e
Copy link
Contributor

@isty2e isty2e commented Jun 20, 2024

Changelogs

This PR addresses the deprecation warning when rdkit>=2014 is installed. See #98 and datamol-io/datamol#225 for more details.

  • Refactoring fingerprint atom pair, rdkit, morgan, topological, and their count fingerprints to FingerprintGenerator.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted. Eventually consider making a new tutorial for new features.
  • Write concise and explanatory changelogs below.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Mostly copy-pasted from datamol-io/datamol#226. However, the radii for ECFP and ECFP-count remains unchanged unlike the original PR.

Update:

  • Since copy.deepcopy(FP_DEF_PARAMS[method]) does not work due to rdkit.Chem.rdFingerprintGenerator.AtomInvariantsGenerator not being serializable, The class is wrapped with SerializableMorganFeatureAtomInvGen in molfeat/calc/_serializable_classes.py.
  • For extensibility in similar future cases, such serializable classes are now stored in molfeat.calc._serializable_classes.SERIALIZABLE_CLASSES.
  • Deep copying such classes will result in different instances of wrapped classes, so different FPCalculator instances will have distinct AtomInvariantsGenerator instances.
  • Since identity between different FPCalculator instances is affected by this, __gestate__, __setstate__, and to_state_dict() methods in FPCalculator have been changed.
  • Modified test cases.

@isty2e isty2e requested a review from maclandrol as a code owner June 20, 2024 04:51
@maclandrol
Copy link
Member

@isty2e tests are failing (the fingerprints have changed), also needs to run black and ruff.

@isty2e
Copy link
Contributor Author

isty2e commented Jun 21, 2024

Upon fixing the nBits to fpSize in tests/test_state:test_fp_state(), I encountered another error:

______________________________________________________________________ test_fp_state _______________________________________________________________________

    def test_fp_state():
        expected_res = [
            {
                "name": "FPCalculator",
                "module": "molfeat.calc.fingerprints",
                "args": {"length": 512, "method": "ecfp", "counting": False, "fpSize": 512},
                "_molfeat_version": MOLFEAT_VERSION,
            },
            {
                "name": "FPCalculator",
                "module": "molfeat.calc.fingerprints",
                "args": {
                    "length": 241,
                    "method": "fcfp-count",
                    "counting": True,
                    "fpSize": 241,
                },
                "_molfeat_version": MOLFEAT_VERSION,
            },
            {
                "name": "FPCalculator",
                "module": "molfeat.calc.fingerprints",
                "args": {"length": 2048, "method": "maccs", "counting": False},
                "_molfeat_version": MOLFEAT_VERSION,
            },
        ]
        for i, (fp_name, fp_len) in enumerate(zip(["ecfp", "fcfp-count", "maccs"], [512, 241, 2048])):
>           calc = FPCalculator(fp_name, fp_len)

tests/test_state.py:284:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
molfeat/calc/fingerprints.py:229: in __init__
    default_params = copy.deepcopy(FP_DEF_PARAMS[method])
../../../../conda/envs/molfeat-dev/lib/python3.11/copy.py:146: in deepcopy
    y = copier(x, memo)
../../../../conda/envs/molfeat-dev/lib/python3.11/copy.py:231: in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = <rdkit.Chem.rdFingerprintGenerator.AtomInvariantsGenerator object at 0x14f212ff0>
memo = {5623304640: {'countBounds': None, 'countSimulation': False, 'fpSize': 2048, 'includeChirality': False, ...}}, _nil = []

    def deepcopy(x, memo=None, _nil=[]):
        """Deep copy operation on arbitrary Python objects.

        See the module's __doc__ string for more info.
        """

        if memo is None:
            memo = {}

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        cls = type(x)

        copier = _deepcopy_dispatch.get(cls)
        if copier is not None:
            y = copier(x, memo)
        else:
            if issubclass(cls, type):
                y = _deepcopy_atomic(x, memo)
            else:
                copier = getattr(x, "__deepcopy__", None)
                if copier is not None:
                    y = copier(memo)
                else:
                    reductor = dispatch_table.get(cls)
                    if reductor:
                        rv = reductor(x)
                    else:
                        reductor = getattr(x, "__reduce_ex__", None)
                        if reductor is not None:
>                           rv = reductor(4)
E                           RuntimeError: Pickling of "rdkit.Chem.rdFingerprintGenerator.AtomInvariantsGenerator" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)

../../../../conda/envs/molfeat-dev/lib/python3.11/copy.py:161: RuntimeError

The attempt to deepcopy the default params in FP_DEF_PARAMS failed because of "atomInvariantsGenerator": rdFingerprintGenerator.GetMorganFeatureAtomInvGen() in fcfp and rdkit.Chem.rdFingerprintGenerator.AtomInvariantsGenerator not being serializable. I can think of a few solutions, but is there any recommendation?

@maclandrol
Copy link
Member

You can try to update __getstate__ and __setstate__ to manage the serialization in a custom way. That should take care of FPCalculator (note that you also needs to check that to_state_dict is working properly).

For the test set, it's very likely that you might need to modify the test_state.py file a bit.

If you do have another approach in mind, would be happy to discuss it. Also let me know if you are stuck and need help.

@isty2e
Copy link
Contributor Author

isty2e commented Jun 22, 2024

Since rdkit.Chem.rdFingerprintGenerator.AtomInvariantsGenerator is implemented in C++, using __getstate__ will require wrapping the invariantsgenerator in another class, which might be inefficient. A few alternative solutions:

  1. Implementing a custom deepcopy-like class which tries deep copying but falls back to shallow copying for a non-serializable instance - This may share the AtomInvariantsGenerator instance over multiple instances of FPCalcualtor.
  2. Doing similar things but initialising another AtomInvariantsGenerator instance upon failure - Each FPCalculator will have its own AtomInvariantsGenerator instance as an attribute. This will also involve defining a factory function for that purpose, but if more non-serializable objects will be added to the params, the implementation can be a bit boilerplate-like and messy.
  3. Shallow copying - likely problematic I think?

@maclandrol
Copy link
Member

I was thinking of some simple custom class loading so in set state, you define a serializable proxy representation of the C++ object that cannot be pickled, then in getstate, you map that back into the rdkit object.

I like 2, as I am not sure what sharing the same object would implies on the long term. Overall only 3 is a no go IMO.

@isty2e
Copy link
Contributor Author

isty2e commented Jun 24, 2024

Considering extensibility, I just implemented a wrapper for AtomInvariantsGenerator and updated changelogs of this PR. This got more complicated than I expected, but I hope it works well.

molfeat/calc/_serializable_classes.py Outdated Show resolved Hide resolved
molfeat/calc/_serializable_classes.py Show resolved Hide resolved
molfeat/calc/fingerprints.py Outdated Show resolved Hide resolved
molfeat/calc/fingerprints.py Show resolved Hide resolved
molfeat/calc/fingerprints.py Outdated Show resolved Hide resolved
tests/test_state.py Show resolved Hide resolved
@maclandrol
Copy link
Member

Awesome, please run black and ruff as the linting is failing.

@isty2e
Copy link
Contributor Author

isty2e commented Jul 18, 2024

Awesome, please run black and ruff as the linting is failing.

I thought I've run them but somehow the CI has been blocked by linting issues. Maybe some configuration or version mismatch here? A pre-commit hook would be appreciated.

@zhu0619
Copy link
Contributor

zhu0619 commented Aug 12, 2024

@isty2e Thank you for your suggestion. We address the code-check issue in a separate PR to ensure this one stays focused on the primary changes.
For this PR, could you update the ruff version to 0.5.7 and re-run ruff check ..
Additionally, please merge the latest changes from main into this branch.
Once those are done, we should be good to go.

@isty2e
Copy link
Contributor Author

isty2e commented Aug 13, 2024

Merged main and pushed the changes.

@zhu0619 zhu0619 requested a review from maclandrol August 13, 2024 01:00
@zhu0619
Copy link
Contributor

zhu0619 commented Aug 13, 2024

@maclandrol Could you take a final look and resolve the remaining conversations?

Copy link
Member

@maclandrol maclandrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhu0619 zhu0619 merged commit 9f6062b into datamol-io:main Aug 13, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants