-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Entity Extractor + HierarchicalDocument #601
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, a few clarifications and wanted some feedback on the threading
def __init__(self, document=None, **kwargs): | ||
super().__init__(document) | ||
if "doc_id" not in self.data: | ||
self.doc_id = str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here and below can simplify
self.doc_id = self.data.get("doc_id", str(uuid.uuid4()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed(really cool solution now i see the power of .get)
|
||
@property | ||
def children(self) -> list["HierarchicalDocument"]: | ||
"""TODO""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the todo here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
@property | ||
def elements(self) -> list[Element]: | ||
raise ValueError("MetadataDocument does not have elements") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
@elements.setter | ||
def elements(self, elements: list[Element]): | ||
raise ValueError("MetadataDocument does not have elements") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
def elements(self, elements: list[Element]): | ||
raise ValueError("MetadataDocument does not have elements") | ||
|
||
def __str__(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this something that each Document type needs or can it be put in the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried putting it in the base class, but the base class does not have the children property that I use in HierarchicalDocument. My hope was that as we transition over to a hierarchical data model, this would not be a seperate class anymore so the bloat wouldn't be as big.
from collections import defaultdict | ||
|
||
|
||
class TestGraphExtractor: | ||
docs = [ | ||
Document( | ||
metadata_docs = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name hierarchical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly sure what this means. I put metadata_docs for the documents used for test_metadata_extractor and entity_docs for test_entity_extractor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a naming thing, we have a MetadataDocument, so this was confusing
cur.doc_id = cur.data["doc_id"] | ||
else: | ||
cur.doc_id = str(uuid.uuid4()) | ||
cur.doc_id = str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did we change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the previous implementation that I made. It was to make sure each entity in the entities property did not have their doc_id rewritten(since I had already built relationships between those id's and parent documents). We don't need this anymore since we switched over to using hierarchicaldocument class where the id's are assigned to children when they are converted to hierarchicaldocuments.
@@ -30,6 +34,20 @@ def __init__(self, nodeKey: str, nodeLabel: str, relLabel: str): | |||
self.relLabel = relLabel | |||
|
|||
|
|||
class GraphEntity(GraphData): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is GraphNode a more generally used convention or entity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if that would be better since it is technically used to extract a specific type of node, not a singular node.
EntityExtractor uses the label and description of GraphEntity s to extract all nodes that fit that specific label + description.
# Get list[Document] representation of docset, trigger execute with take_all() | ||
execution = Execution(docset.context, docset.plan) | ||
dataset = execution.execute(docset.plan) | ||
docs = dataset.take_all(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove the None, a little confusing with it there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
return docset | ||
|
||
def _extract(self, doc: HierarchicalDocument) -> HierarchicalDocument: | ||
from multiprocessing import Pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about doing this in here. It's unclear that an EntityExtractor can internal spawn threads with its own configurations. Maybe calling it MultithreadedEntityExtractor or something will help?
@alexaryn @eric-anderson I'm curious about your thoughts here since you've played with this stuff in other parts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think adding a multithreading flag to entity extractor would be a good solution here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
This PR adds the EntityExtractor to the graph extractor class and also adds HierarchicalDocument as a new experimental data model.
GraphExtractor Class:
HierarchicalDocument: