[llm unify 7/n] Summarize #1192

HenryL27 · 2025-02-21T01:21:00Z

Again, general lack of confidence in this.

Changes document summarization from an iterative folding strategy to a hierarchical strategy.
Uses math + jinja to generate a summary at every k-th element covering the next k elements (then repeats with batches of k^2 and a stride of k, etc., until k^n > n_elements).
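
For readers following along, here is a rough sketch of the batching schedule described above. It is not the code in this PR: the function name `hierarchical_schedule` and the dict-of-lists shape are made up for illustration, and it only computes which element indices feed each summary, without touching jinja or an LLM.

```python
def hierarchical_schedule(n_elements: int, k: int) -> list[dict[int, list[int]]]:
    """Return, per level, a mapping of {element that stores the summary: inputs read}.

    Hypothetical helper: level 1 reads k consecutive elements, level 2 reads the
    level-1 summaries (k apart) inside a window of k^2 elements, and so on.
    """
    if k < 2:
        raise ValueError("k must be at least 2, otherwise the levels never shrink")
    levels = []
    stride = 1  # distance between the inputs read at the current level
    while stride < n_elements:
        span = stride * k  # how many elements one summary covers at this level
        levels.append(
            {
                start: list(range(start, min(start + span, n_elements), stride))
                for start in range(0, n_elements, span)
            }
        )
        stride = span
    return levels


for depth, level in enumerate(hierarchical_schedule(10, 3), start=1):
    print(depth, level)
```

With 10 elements and k = 3 this gives three levels, and the last one leaves a single summary on element 0. The loop stops once one batch spans the whole document, which is one reading of the "until k^n > n_elements" condition.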

Integrates into summarize_data by slightly changing how that reduce happens. Had to make a separate (similar) prompt for that. Should probably factor most of the jinja logic out as fragments.

I probably broke something in luna, but all the unit tests passed, so idk.

Also not sure about some of the names.

Signed-off-by: Henry Lindeman <[email protected]>

eric-anderson · 2025-02-26T19:30:36Z

lib/sycamore/sycamore/llms/prompts/default_prompts.py

+        {% endif %}
+        """
+    ),
+    user=J_GET_ELEMENT_TEXT_MACRO


Can we force the question to always be present by having a default question of "What is the summary of this information?"

This feels overly complex, and experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.

eric-anderson · 2025-02-26T19:33:12Z

lib/sycamore/sycamore/llms/prompts/default_prompts.py

+        """
+    ),
+    user=J_GET_ELEMENT_TEXT_MACRO
+    + textwrap.dedent(


Do we want dedent, or do we want to remove all leading whitespace? It's weird to read code like:

if condition_a:
if condition_ab
"a&b"
else:
"a&!b"
else:
"!a"

I like to have the prompts render out with indentation for printability (though I think the llm doesn't really see it) - like

Element 0:
    properties.state_0: Alabama
    properties.state_1: Alaska
    ...
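
To make the two options concrete, a small sketch of what each does to an indented template. The template text is just the example from this thread, not an actual prompt from the PR; `textwrap.dedent` is the standard-library call already used in the diff.

```python
import textwrap

# Illustrative template only; indented the way it would sit inside a Python source file.
template = """
    Element 0:
        properties.state_0: Alabama
        properties.state_1: Alaska
"""

# Option 1: dedent removes only the whitespace common to every line,
# so the relative indentation under "Element 0:" survives.
dedented = textwrap.dedent(template)

# Option 2: strip all leading whitespace from every line,
# which flattens the nesting entirely.
flattened = "\n".join(line.lstrip() for line in template.splitlines())

print(dedented)
print(flattened)
```

The dedented version keeps the properties visually nested under their element; the flattened version is what makes nested jinja conditionals hard to read.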

eric-anderson · 2025-02-26T19:37:33Z

lib/sycamore/sycamore/query/execution/operations.py

    context: Optional[Context] = None,
+    docset_summarizer: Optional[Type[Summarizer]] = None,
+    summarizer_kwargs: dict[str, Any] = {},


Why do we need a class + kwargs rather than passing in an object?
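
A sketch of the two call shapes being compared, in case it helps the discussion. Everything here is a stand-in: `Summarizer` and `TruncatingSummarizer` are toy classes, not the sycamore ones, and the two functions are hypothetical rather than the real `summarize_data` signature.

```python
from typing import Any, Optional


class Summarizer:
    """Toy base class standing in for a real summarizer (hypothetical)."""

    def summarize(self, text: str) -> str:
        raise NotImplementedError


class TruncatingSummarizer(Summarizer):
    """Toy summarizer that just keeps the first max_chars characters."""

    def __init__(self, max_chars: int = 20):
        self.max_chars = max_chars

    def summarize(self, text: str) -> str:
        return text[: self.max_chars]


def summarize_with_class(
    text: str,
    summarizer_cls: type[Summarizer],
    summarizer_kwargs: Optional[dict[str, Any]] = None,  # None avoids a shared mutable default
) -> str:
    # The callee has to know how to construct every summarizer type.
    summarizer = summarizer_cls(**(summarizer_kwargs or {}))
    return summarizer.summarize(text)


def summarize_with_instance(text: str, summarizer: Summarizer) -> str:
    # The caller builds the object; the callee only uses the interface.
    return summarizer.summarize(text)


print(summarize_with_class("hello world", TruncatingSummarizer, {"max_chars": 5}))
print(summarize_with_instance("hello world", TruncatingSummarizer(max_chars=5)))
```

Passing the instance keeps construction details out of the operation, and it also means there is no dict default argument to worry about, since a `{}` default is shared across calls in Python.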

eric-anderson · 2025-02-26T19:39:34Z

lib/sycamore/sycamore/query/execution/operations.py

@@ -52,14 +58,13 @@ def math_operation(val1: int, val2: int, operator: str) -> Union[int, float]:
 @context_params
 def summarize_data(


Can we simplify this? It feels weird to have a summarize class and a whole ton of parameters in the call.

eric-anderson · 2025-02-26T19:40:15Z

lib/sycamore/sycamore/query/execution/operations.py

+            summaries_as_text=summaries_as_text,
+        )
+
+    # If data is not DocSets, text is this list here


Can we force data to always be DocSets? If it somehow isn't, convert it to a DocSet?

According to vinayak, if it's not DocSets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet, but that seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.

eric-anderson · 2025-02-26T19:42:25Z

lib/sycamore/sycamore/query/execution/operations.py

+
+
+def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
+    if summarizer_cls is LLMElementTextSummarizer:


This if class thing can't be the right way to do this.

eric-anderson · 2025-02-26T19:45:22Z

lib/sycamore/sycamore/transforms/summarize.py

@@ -143,7 +141,238 @@ def collapse(text: str, tokens_per_chunk: int, tokenizer: Tokenizer, summarizer_
    return cur_summary


-class DocumentSummarizer(Summarizer):
+class HeirarchicalDocumentSummarizer(Summarizer):


It would have been nice to not have this all together in a single PR; this probably could have been separate.

eric-anderson · 2025-02-26T19:46:59Z

lib/sycamore/sycamore/transforms/summarize.py

-class DocumentSummarizer(Summarizer):
+class HeirarchicalDocumentSummarizer(Summarizer):
+    """
+    Summarizes a document by constructing a heirarchical tree of batches of elements,


I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?
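
For comparison, a minimal sketch of the "group until the context window is full" approach mentioned above. This is an assumption about what that could look like, not code from the PR: `pack_until_full` is a made-up name, and the default token counter just splits on whitespace where a real tokenizer would go.

```python
from typing import Callable


def pack_until_full(
    texts: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),  # placeholder tokenizer
) -> list[list[str]]:
    """Greedily group texts in order, starting a new batch when the budget would overflow."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Flush the current batch if adding this text would exceed the budget.
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        # A single oversized text still gets its own batch rather than being dropped.
        current.append(text)
        used += n
    if current:
        batches.append(current)
    return batches


print(pack_until_full(["a b", "c d", "e f g", "h"], max_tokens=4))
```

If everything fits in one batch this degenerates to a single-stage summarize; a second stage over the batch summaries is only needed when it does not.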

eric-anderson · 2025-02-26T21:33:33Z

lib/sycamore/sycamore/transforms/summarize.py

+        return comptransform
+
+
+class MaxTokensHeirarchicalDocumentSummarizer(Summarizer):


Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How to split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?
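
On the last question, a small sketch of the "spread evenly" option by document count. Both helper names are hypothetical, and this deliberately ignores token counts, so an evenly sized batch could still overflow a window if documents vary a lot in length.

```python
def even_batch_sizes(n_items: int, n_batches: int) -> list[int]:
    """Split n_items into n_batches sizes that differ by at most one."""
    if n_batches < 1:
        raise ValueError("need at least one batch")
    base, extra = divmod(n_items, n_batches)
    # The first `extra` batches take one additional item each.
    return [base + (1 if i < extra else 0) for i in range(n_batches)]


def rebalance(packed_sizes: list[int]) -> list[int]:
    """Keep the number of batches a packing pass produced, but even out their sizes."""
    return even_batch_sizes(sum(packed_sizes), len(packed_sizes))


print(rebalance([5, 4, 1]))
```

Applied to the 10-document example, a packed 5,4,1 layout becomes 4,3,3 with the same number of LLM calls.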


HenryL27 added 5 commits February 20, 2025 13:23

initial heirarchical document summarize implementation

ff9ca26

Signed-off-by: Henry Lindeman <[email protected]>

ruff

53879e6

Signed-off-by: Henry Lindeman <[email protected]>

make some tests work

9dffee1

Signed-off-by: Henry Lindeman <[email protected]>

fix more tests

532f7e8

Signed-off-by: Henry Lindeman <[email protected]>

mypy

8d2b2f8

Signed-off-by: Henry Lindeman <[email protected]>

HenryL27 requested a review from baitsguy February 21, 2025 01:21

HenryL27 added 15 commits February 21, 2025 09:25

fix llm filter codegen

aa1e51f

Signed-off-by: Henry Lindeman <[email protected]>

put back collapsing summarizer

93fb7ec

Signed-off-by: Henry Lindeman <[email protected]>

fix names

41e7a19

Signed-off-by: Henry Lindeman <[email protected]>

add docset summarizer parametrization

d77df22

Signed-off-by: Henry Lindeman <[email protected]>

add roundrobin summarizer

8f10485

Signed-off-by: Henry Lindeman <[email protected]>

mypy and ruff

8066362

Signed-off-by: Henry Lindeman <[email protected]>

rename to RoundRobinOneshotDocumentSummarizer

6eb82c2

Signed-off-by: Henry Lindeman <[email protected]>

factor complicated common jinja logic to fragments

13d6041

Signed-off-by: Henry Lindeman <[email protected]>

add max tokens heirarchical summarizer

90e6e4e

Signed-off-by: Henry Lindeman <[email protected]>

ruff

a621b8d

Signed-off-by: Henry Lindeman <[email protected]>

fix unit tests

b690923

Signed-off-by: Henry Lindeman <[email protected]>

mypy

b1acb1e

Signed-off-by: Henry Lindeman <[email protected]>

add unit tests for summarizers

eb18e7e

Signed-off-by: Henry Lindeman <[email protected]>

a whole bunch of docstrings

7fce340

Signed-off-by: Henry Lindeman <[email protected]>

oops didn't mean to commit this

2c887b1

Signed-off-by: Henry Lindeman <[email protected]>

eric-anderson reviewed Feb 26, 2025


HenryL27 added 7 commits February 26, 2025 14:42

move complex prompts to be next to the complex code that sets them up

e22111a

Signed-off-by: Henry Lindeman <[email protected]>

have summmarize_data take a summarizer instance rather than a summari…

3fe1c14

…zer class and all the ingredients needed to instantiate it duh Signed-off-by: Henry Lindeman <[email protected]>

fix unit tests

687d1e4

Signed-off-by: Henry Lindeman <[email protected]>

inline get text macro since it's only used by one template

d18a03f

Signed-off-by: Henry Lindeman <[email protected]>

remove collapse document summarizer

6a373ad

Signed-off-by: Henry Lindeman <[email protected]>

apparently that lets me get rid of collapse and qasummarizer too, nice

5b267a7

Signed-off-by: Henry Lindeman <[email protected]>

mypy + tests

d156b03

Signed-off-by: Henry Lindeman <[email protected]>

ruff

9104aff

Signed-off-by: Henry Lindeman <[email protected]>
