Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Entity Extractor + HierarchicalDocument #601

Merged
merged 28 commits into from
Jul 31, 2024
Merged

Conversation

RitxmSaha
Copy link
Contributor

@RitxmSaha RitxmSaha commented Jul 29, 2024

This PR adds the EntityExtractor to the graph extractor class and also adds HierarchicalDocument as a new experimental data model.

GraphExtractor Class:

  • Implemented EntityExtractor
  • Standardized resolve to work with all types of GraphExtractor

HierarchicalDocument:

  • Implemented as inherited class of document, no longer uses elements property, uses children which is list[HierarchicalDocument]
  • Implemented recursive version of explode if isinstance(doc,HierarchicalDocument)

@RitxmSaha RitxmSaha changed the title Added Entity Extractor + HierarchicalDocument + async openai api calls. Added Entity Extractor + HierarchicalDocument Jul 30, 2024
@RitxmSaha RitxmSaha requested review from baitsguy and dtecuci July 30, 2024 20:39
Copy link
Contributor

@baitsguy baitsguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, a few clarifications and wanted some feedback on the threading

def __init__(self, document=None, **kwargs):
super().__init__(document)
if "doc_id" not in self.data:
self.doc_id = str(uuid.uuid4())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and below can simplify

self.doc_id = self.data.get("doc_id", str(uuid.uuid4()))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed(really cool solution now i see the power of .get)


@property
def children(self) -> list["HierarchicalDocument"]:
"""TODO"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the todo here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


@property
def elements(self) -> list[Element]:
raise ValueError("MetadataDocument does not have elements")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


@elements.setter
def elements(self, elements: list[Element]):
raise ValueError("MetadataDocument does not have elements")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

def elements(self, elements: list[Element]):
raise ValueError("MetadataDocument does not have elements")

def __str__(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something that each Document type needs or can it be put in the base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried putting it in the base class, but the base class does not have the children property that I use in HierarchicalDocument. My hope was that as we transition over to a hierarchical data model, this would not be a seperate class anymore so the bloat wouldn't be as big.

from collections import defaultdict


class TestGraphExtractor:
docs = [
Document(
metadata_docs = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name hierarchical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly sure what this means. I put metadata_docs for the documents used for test_metadata_extractor and entity_docs for test_entity_extractor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a naming thing, we have a MetadataDocument, so this was confusing

cur.doc_id = cur.data["doc_id"]
else:
cur.doc_id = str(uuid.uuid4())
cur.doc_id = str(uuid.uuid4())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the previous implementation that I made. It was to make sure each entity in the entities property did not have their doc_id rewritten(since I had already built relationships between those id's and parent documents). We don't need this anymore since we switched over to using hierarchicaldocument class where the id's are assigned to children when they are converted to hierarchicaldocuments.

@@ -30,6 +34,20 @@ def __init__(self, nodeKey: str, nodeLabel: str, relLabel: str):
self.relLabel = relLabel


class GraphEntity(GraphData):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is GraphNode a more generally used convention or entity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that would be better since it is technically used to extract a specific type of node, not a singular node.

EntityExtractor uses the label and description of GraphEntity s to extract all nodes that fit that specific label + description.

# Get list[Document] representation of docset, trigger execute with take_all()
execution = Execution(docset.context, docset.plan)
dataset = execution.execute(docset.plan)
docs = dataset.take_all(None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove the None, a little confusing with it there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return docset

def _extract(self, doc: HierarchicalDocument) -> HierarchicalDocument:
from multiprocessing import Pool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about doing this in here. It's unclear that an EntityExtractor can internal spawn threads with its own configurations. Maybe calling it MultithreadedEntityExtractor or something will help?

@alexaryn @eric-anderson I'm curious about your thoughts here since you've played with this stuff in other parts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think adding a multithreading flag to entity extractor would be a good solution here?

@RitxmSaha RitxmSaha requested a review from bsowell July 30, 2024 21:52
@RitxmSaha RitxmSaha requested a review from baitsguy July 31, 2024 21:17
Copy link
Contributor

@baitsguy baitsguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@RitxmSaha RitxmSaha merged commit 22af2c2 into main Jul 31, 2024
10 checks passed
@RitxmSaha RitxmSaha deleted the ritam-hierarchical-test branch August 1, 2024 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants