Added Entity Extractor + HierarchicalDocument #601

Merged

RitxmSaha merged 28 commits into main from ritam-hierarchical-test

Jul 31, 2024

Contributor

RitxmSaha commented Jul 29, 2024 • edited

This PR adds the EntityExtractor to the graph extractor class and also adds HierarchicalDocument as a new experimental data model.

GraphExtractor Class:

Implemented EntityExtractor
Standardized resolve to work with all types of GraphExtractor

HierarchicalDocument:

Implemented as a subclass of Document; it no longer uses the elements property and instead uses children, which is a list[HierarchicalDocument]
Implemented a recursive version of explode that is used when isinstance(doc, HierarchicalDocument)
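
To illustrate the children-based model and the recursive explode described above, here is a minimal sketch. It does not use Sycamore's classes: `Node`, `parent_id`, and this `explode` function are hypothetical stand-ins, and recording the parent id during flattening is an assumption made for illustration, not something the PR description states.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a hierarchical document (not Sycamore's class)."""

    text: str
    children: list["Node"] = field(default_factory=list)
    # Every node gets its own id at construction time.
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent_id: Optional[str] = None


def explode(doc: Node) -> list[Node]:
    """Flatten the tree depth-first, recording each child's parent id."""
    flat = [doc]
    for child in doc.children:
        child.parent_id = doc.doc_id
        flat.extend(explode(child))
    return flat
```

Calling `explode` on a root with nested children returns the root followed by its descendants in depth-first order, each carrying the id of its direct parent.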

RitxmSaha and others added 20 commits

July 25, 2024 22:45


          added supervised extractor + docset graph_extractor bugfix

6e41633


          close multiprocessing pool

4bed154


          pushing architecture change

27649ea


          added hierarchical document to sycamore data model + adjusted explode

e607ef2


          updated metadata extractor + entity extractor to work with new data model

f000070


          formatting

69f41a0


          formatting

aefe37f


          linted

75a37d3


          Merge branch 'main' into ritam-hierarchical-test

ebe9a33


          mypy additions

9f722a7


          mypy

e1a86b0


          mypy

fa075ab


          switched back to multiprocessing

608d122


          formatting

3d23d83


          linted

aa09ad3


          bug fix


          bugfix

33fd558


          make mypy happy

77c6cc4


          mypy happy

30a679c


          linted

f0c8a52

RitxmSaha changed the title ~~Added Entity Extractor + HierarchicalDocument + async openai api calls.~~ Added Entity Extractor + HierarchicalDocument

RitxmSaha added 2 commits

July 30, 2024 20:19


          fixed resolve function

4cd760e


          bugfix

27bea21

RitxmSaha requested review from baitsguy and dtecuci

July 30, 2024 20:39

baitsguy reviewed


Contributor

baitsguy left a comment

Looks good overall. I have a few clarifications and would like some feedback on the threading.

lib/sycamore/sycamore/data/document.py Outdated

+                  def __init__(self, document=None, **kwargs):
+                      super().__init__(document)
+                      if "doc_id" not in self.data:
+                          self.doc_id = str(uuid.uuid4())

Contributor

baitsguy Jul 30, 2024

Here and below, this can be simplified to:

self.doc_id = self.data.get("doc_id", str(uuid.uuid4()))

Contributor Author

RitxmSaha Jul 30, 2024

Fixed (really cool solution, now I see the power of .get)
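
For comparison, the two patterns discussed in this thread can be sketched on a plain dict. This is only an illustration: the function names are hypothetical, and the real code assigns through `self.doc_id` on a document object rather than writing to a dict directly. One difference that may matter is that the default passed to `.get` is evaluated eagerly, so a UUID is generated on every call even when an id already exists.

```python
import uuid


def assign_id_if_missing(data: dict) -> str:
    # Pattern from the original diff: generate an id only when none exists.
    if "doc_id" not in data:
        data["doc_id"] = str(uuid.uuid4())
    return data["doc_id"]


def assign_id_with_get(data: dict) -> str:
    # Reviewer's one-liner. str(uuid.uuid4()) runs before .get is called,
    # so a fresh UUID is created and discarded when "doc_id" is present.
    data["doc_id"] = data.get("doc_id", str(uuid.uuid4()))
    return data["doc_id"]
```

Both versions leave an existing id untouched and create one otherwise; the one-liner trades a cheap wasted UUID for brevity.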

lib/sycamore/sycamore/data/document.py Outdated

+                  @property
+                  def children(self) -> list["HierarchicalDocument"]:
+                      """TODO"""

Contributor

baitsguy Jul 30, 2024

what's the todo here?

Contributor Author

RitxmSaha Jul 30, 2024

fixed

lib/sycamore/sycamore/data/document.py Outdated

+                  @property
+                  def elements(self) -> list[Element]:
+                      raise ValueError("MetadataDocument does not have elements")

Contributor

baitsguy Jul 30, 2024

string fix

Contributor Author

RitxmSaha Jul 30, 2024

fixed

lib/sycamore/sycamore/data/document.py Outdated

+                  @elements.setter
+                  def elements(self, elements: list[Element]):
+                      raise ValueError("MetadataDocument does not have elements")

Contributor

baitsguy Jul 30, 2024

string fix

Contributor Author

RitxmSaha Jul 30, 2024

fixed

lib/sycamore/sycamore/data/document.py

+                  def elements(self, elements: list[Element]):
+                      raise ValueError("MetadataDocument does not have elements")
+                  def __str__(self) -> str:

Contributor

baitsguy Jul 30, 2024

is this something that each Document type needs or can it be put in the base class?

Contributor Author

RitxmSaha Jul 30, 2024

I tried putting it in the base class, but the base class does not have the children property that I use in HierarchicalDocument. My hope was that as we transition over to a hierarchical data model, this would not be a separate class anymore, so the bloat wouldn't be as big.

lib/sycamore/sycamore/tests/unit/transforms/test_graph_extractor.py

               from collections import defaultdict
               class TestGraphExtractor:
-                  docs = [
-                      Document(
+                  metadata_docs = [

Contributor

baitsguy Jul 30, 2024

name hierarchical?

Contributor Author

RitxmSaha Jul 30, 2024

Not exactly sure what this means. I put metadata_docs for the documents used for test_metadata_extractor and entity_docs for test_entity_extractor.

Contributor

baitsguy Jul 31, 2024

Just a naming thing, we have a MetadataDocument, so this was confusing

lib/sycamore/sycamore/transforms/explode.py

-                              cur.doc_id = cur.data["doc_id"]
-                          else:
-                              cur.doc_id = str(uuid.uuid4())
+                          cur.doc_id = str(uuid.uuid4())

Contributor

baitsguy Jul 30, 2024

why did we change this?

Contributor Author

RitxmSaha Jul 30, 2024

This was the previous implementation that I made. It was to make sure each entity in the entities property did not have its doc_id rewritten (since I had already built relationships between those ids and parent documents). We don't need this anymore since we switched over to using the HierarchicalDocument class, where the ids are assigned to children when they are converted to HierarchicalDocuments.

lib/sycamore/sycamore/transforms/extract_graph.py

		@@ -30,6 +34,20 @@ def __init__(self, nodeKey: str, nodeLabel: str, relLabel: str):
		self.relLabel = relLabel


		class GraphEntity(GraphData):

Contributor

baitsguy Jul 30, 2024

Is GraphNode the more commonly used convention, or is it entity?

Contributor Author

RitxmSaha Jul 30, 2024

Not sure if that would be better since it is technically used to extract a specific type of node, not a singular node.

EntityExtractor uses the label and description of each GraphEntity to extract all nodes that fit that specific label + description.
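
As a rough illustration of "a type of node described by a label and a description", here is a small sketch. `EntityType` and `build_prompt` are hypothetical names, and the prompt wording is invented for the example; this is not how GraphEntity or EntityExtractor is actually implemented in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityType:
    """Hypothetical analogue of GraphEntity: describes a kind of node."""

    label: str
    description: str


def build_prompt(entity_types: list[EntityType], text: str) -> str:
    """Assemble an extraction prompt listing each entity type (illustrative only)."""
    lines = ["Extract every entity of the following types from the text."]
    for entity_type in entity_types:
        lines.append(f"- {entity_type.label}: {entity_type.description}")
    lines.append("")
    lines.append(f"Text: {text}")
    return "\n".join(lines)
```

Each `EntityType` stands for a category such as "Company", so one spec can match many nodes in a document, which is the distinction drawn above against a name like GraphNode.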

lib/sycamore/sycamore/transforms/extract_graph.py Outdated

+                      # Get list[Document] representation of docset, trigger execute with take_all()
+                      execution = Execution(docset.context, docset.plan)
+                      dataset = execution.execute(docset.plan)
+                      docs = dataset.take_all(None)

Contributor

baitsguy Jul 30, 2024

you can remove the None, a little confusing with it there

Contributor Author

RitxmSaha Jul 30, 2024

fixed

lib/sycamore/sycamore/transforms/extract_graph.py Outdated

+                      return docset
+                  def _extract(self, doc: HierarchicalDocument) -> HierarchicalDocument:
+                      from multiprocessing import Pool

Contributor

baitsguy Jul 30, 2024

I'm a little worried about doing this in here. It's unclear that an EntityExtractor can internally spawn threads with its own configuration. Maybe calling it MultithreadedEntityExtractor or something will help?

@alexaryn @eric-anderson I'm curious about your thoughts here since you've played with this stuff in other parts

Contributor Author

RitxmSaha Jul 30, 2024

Do you think adding a multithreading flag to entity extractor would be a good solution here?
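
One way such a flag could look is sketched below. This is a hypothetical design, not the PR's code: `SketchExtractor`, `extract_one`, and `max_workers` are invented names, and the sketch uses `concurrent.futures.ThreadPoolExecutor` rather than the `multiprocessing.Pool` shown in the diff. Parallelism is opt-in, so callers can see from the constructor that workers may be spawned.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class SketchExtractor:
    """Hypothetical extractor; max_workers=None means run serially."""

    def __init__(
        self,
        extract_one: Callable[[str], list[str]],
        max_workers: Optional[int] = None,
    ):
        self._extract_one = extract_one
        self._max_workers = max_workers

    def extract(self, sections: list[str]) -> list[list[str]]:
        if self._max_workers is None:
            # Default path: no threads are created.
            return [self._extract_one(s) for s in sections]
        # The pool is created and shut down within this call, so no
        # worker threads outlive it. map() preserves input order.
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(pool.map(self._extract_one, sections))
```

Making the serial path the default keeps behavior predictable inside an execution framework that may already manage its own parallelism.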

RitxmSaha added 2 commits

July 30, 2024 21:48


          address vinayak comments

caece5d


          linted

ad6587c

RitxmSaha requested a review from bsowell

July 30, 2024 21:52


          fix typo

94c4b17

RitxmSaha added 3 commits

July 31, 2024 21:10


          removed multithreading

682ece2


          syntax for adding list

33ffd2f


          mypy happy

461ad5a

RitxmSaha requested a review from baitsguy

July 31, 2024 21:17

baitsguy approved these changes


Contributor

baitsguy left a comment

Nice work!

RitxmSaha merged commit 22af2c2 into main

10 checks passed

RitxmSaha deleted the ritam-hierarchical-test branch

August 1, 2024 20:27

