Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix NamedEntityExtractor serde #7684

Merged
merged 4 commits into from
May 14, 2024
Merged

fix: Fix NamedEntityExtractor serde #7684

merged 4 commits into from
May 14, 2024

Conversation

vblagoje
Copy link
Member

@vblagoje vblagoje commented May 13, 2024

Why:

Fixes serialization and deserialization (serde) in the NamedEntityExtractor component. Important for usecases where NamedEntityExtractor needs to be serialized to a pipeline yaml file.

What:

  • Modified the serialization method to correctly serialize the backend type by calling .name on the _backend.type attribute, ensuring that it is stored as a string rather than an object reference.
  • Adjusted the deserialization method to transform the backend string back into the correct NamedEntityExtractorBackend enumeration before attempting to use it in constructing a NamedEntityExtractor instance.
  • Included a new test ensuring that the NamedEntityExtractor can be serialized and deserialized correctly, particularly when used within a pipeline configuration.

How can it be used:

  • Saving the configuration of a NamedEntityExtractor to a yaml file
  • Loading a previously saved NamedEntityExtractor configuration to resume processing or to replicate a specific setup across different environments or applications.

Example of serializing and deserializing a pipeline including a NamedEntityExtractor:

from haystack import Pipeline
from haystack.components.extractors import NamedEntityExtractor, NamedEntityExtractorBackend

# Create an extractor and add it to a pipeline
extractor = NamedEntityExtractor(backend=NamedEntityExtractorBackend.HUGGING_FACE, model="dslim/bert-base-NER")
pipeline = Pipeline()
pipeline.add_component(instance=extractor, name="extractor")

# Serialize the pipeline to a file
with open("path_to_file/test_pipeline.yaml", "w") as f:
    pipeline.dump(f)

# Later or elsewhere, deserialize the pipeline from the file
with open("path_to_file/test_pipeline.yaml", "r") as f:
    loaded_pipeline = Pipeline.load(f)

How did you test it:

  • Updated the unit test for the NamedEntityExtractor to include checks for both serialization and deserialization functionality, confirming that the process retains the correct component configurations.
  • Added a new test that incorporates the NamedEntityExtractor within a pipeline configuration, ensuring that the entire pipeline can be serialized and deserialized correctly. This test verifies that the pipeline's dictionary representation remains consistent before and after the serde.

Notes for the reviewer:

  • Pay special attention to the changes in the serialization and deserialization methods, ensuring that the implementation aligns with standard practices for handling enumeration types in Python.

@vblagoje vblagoje added the ignore-for-release-notes PRs with this flag won't be included in the release notes. label May 13, 2024
@vblagoje vblagoje removed the ignore-for-release-notes PRs with this flag won't be included in the release notes. label May 13, 2024
@vblagoje vblagoje added this to the 2.1.2 milestone May 13, 2024
@coveralls
Copy link
Collaborator

coveralls commented May 13, 2024

Pull Request Test Coverage Report for Build 9076600918

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 20 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 90.423%

Files with Coverage Reduction New Missed Lines %
components/extractors/named_entity_extractor.py 20 68.67%
Totals Coverage Status
Change from base Build 9074873557: 0.02%
Covered Lines: 6534
Relevant Lines: 7226

💛 - Coveralls

@vblagoje vblagoje marked this pull request as ready for review May 13, 2024 10:44
@vblagoje vblagoje requested review from a team as code owners May 13, 2024 10:44
@vblagoje vblagoje requested review from dfokina and davidsbatista and removed request for a team May 13, 2024 10:44
@@ -216,6 +216,7 @@ def from_dict(cls, data: Dict[str, Any]) -> "NamedEntityExtractor":
try:
init_params = data["init_parameters"]
init_params["device"] = ComponentDevice.from_dict(init_params["device"])
init_params["backend"] = NamedEntityExtractorBackend[init_params["backend"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused; where is this init_params used? it it's never used inside this function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, nevermind, I understood now what's going on :)

Copy link
Contributor

@davidsbatista davidsbatista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vblagoje
Copy link
Member Author

vblagoje commented May 14, 2024

@davidsbatista there was a conflict with main branch commit that I needed to resolve. I also took the opportunity to remove unit test markers as we don't use them any more. Please double check again everything 🙏

@davidsbatista davidsbatista merged commit 4352b16 into main May 14, 2024
25 checks passed
@davidsbatista davidsbatista deleted the nee_ser_fix branch May 14, 2024 10:24
davidsbatista pushed a commit that referenced this pull request May 16, 2024
* Fix NamedEntityExtractor serde

* Add release note

* Linting, remove unit markers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(De-) Serialization is not properly working for NamedEntityExtractor
3 participants