[llm unify 7/n] Summarize #1192

Open
wants to merge 28 commits into
base: main

Conversation

HenryL27
Collaborator

@HenryL27 HenryL27 commented Feb 21, 2025

Again, general lack of confidence in this.

Switches document summarization from an iterative folding strategy to a hierarchical strategy.
Uses math + jinja to generate a summary every k elements covering the next k elements (then repeats with k^2 and a stride of k, etc., until k^n > n_elements).
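
For readers following along, here is a minimal sketch of one reading of that scheme, written in plain Python rather than jinja. The function name `hierarchical_summarize`, the `summarize` callable (a stand-in for an LLM call), and the exact handling of the last partial batch are all assumptions for illustration, not the code in this PR.

```python
from typing import Callable, List


def hierarchical_summarize(
    elements: List[str],
    k: int,
    summarize: Callable[[List[str]], str],
) -> str:
    """Reduce `elements` to one summary by summarizing batches of k, level by level."""
    if k < 2:
        raise ValueError("k must be at least 2 or the tree never shrinks")
    if not elements:
        return ""

    # slots[i] holds the most recent summary rooted at index i.
    slots = list(elements)
    n = len(slots)
    span = 1  # distance between live slots at the current level

    # Level 1 summarizes k neighbours, level 2 summarizes k level-1 summaries
    # (every k^2 elements, reading slots with a stride of k), and so on.
    while span < n:
        next_span = span * k
        for start in range(0, n, next_span):
            stop = min(start + next_span, n)
            batch = [slots[i] for i in range(start, stop, span)]
            slots[start] = summarize(batch)
        span = next_span

    return slots[0]


if __name__ == "__main__":
    # Stub summarizer that just shows the grouping.
    def show_grouping(batch: List[str]) -> str:
        return "(" + "+".join(batch) + ")"

    print(hierarchical_summarize(["a", "b", "c", "d", "e"], 2, show_grouping))
```

With a real summarizer each level costs roughly n / k^level calls, so the total number of calls stays linear in the number of elements.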

Integrates into summarize_data by slightly changing how that reduce happens. Had to make a separate (similar) prompt for that. Should probably factor most of the jinja logic out as fragments.

I probably broke something in luna, but all the unit tests passed, so idk.

Also not sure about some of the names.

Signed-off-by: Henry Lindeman <[email protected]>
@HenryL27 HenryL27 requested a review from baitsguy February 21, 2025 01:21
{% endif %}
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
Collaborator

Can we force the question to always be present by having a default question of "What is the summary of this information?"

This feels overly complex, and experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.

"""
),
user=J_GET_ELEMENT_TEXT_MACRO
+ textwrap.dedent(
Collaborator

Do we want dedent, or to remove all leading whitespace? It's weird to read code like:

if condition_a:
      if condition_ab:
"a&b"
      else:
"a&!b"
else:
"!a"

Collaborator Author

I like to have the prompts render out with indentation for printability (though I think the LLM doesn't really see it), like:

Element 0:
    properties.state_0: Alabama
    properties.state_1: Alaska
    ...
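
To make the two options in this thread concrete, here is a small sketch comparing `textwrap.dedent` with stripping all leading whitespace from every line. The `PROMPT` string is a made-up stand-in modeled on the example above, not the actual prompt template in the PR.

```python
import textwrap

# Hypothetical prompt text, indented to match the surrounding Python code.
PROMPT = """
    Element 0:
        properties.state_0: Alabama
        properties.state_1: Alaska
"""

# dedent removes only the whitespace common to every line,
# so the nesting under "Element 0:" survives.
dedented = textwrap.dedent(PROMPT)

# Stripping each line removes all leading whitespace,
# so the rendered prompt is flat.
flattened = "\n".join(line.lstrip() for line in PROMPT.splitlines())

if __name__ == "__main__":
    print(dedented)
    print(flattened)
```

The trade-off is the one raised above: dedent keeps the rendered prompt readable, while stripping keeps the rendered text independent of how the template is indented in the source file.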

context: Optional[Context] = None,
docset_summarizer: Optional[Type[Summarizer]] = None,
summarizer_kwargs: dict[str, Any] = {},
Collaborator

Why do we need a class + kwargs rather than passing in an object?

Collaborator Author

duh. fixed
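
For context, a sketch of the "pass an object" shape suggested in this thread. Everything here is hypothetical: the `Summarizer` protocol, `TruncatingSummarizer`, and `summarize_data_sketch` are illustrative stand-ins, not the signatures the PR ended up with.

```python
from typing import Optional, Protocol


class Summarizer(Protocol):
    """Hypothetical minimal interface: anything with a summarize() method."""

    def summarize(self, text: str) -> str: ...


class TruncatingSummarizer:
    """Toy summarizer that keeps the first max_chars characters."""

    def __init__(self, max_chars: int = 20) -> None:
        self.max_chars = max_chars

    def summarize(self, text: str) -> str:
        return text[: self.max_chars]


def summarize_data_sketch(
    text: str,
    docset_summarizer: Optional[Summarizer] = None,
) -> str:
    # The caller configures the summarizer up front and hands over the instance,
    # so no class + kwargs pair (and no mutable {} default) is needed here.
    if docset_summarizer is None:
        docset_summarizer = TruncatingSummarizer()
    return docset_summarizer.summarize(text)


if __name__ == "__main__":
    print(summarize_data_sketch("abcdefghij", TruncatingSummarizer(max_chars=3)))
```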

@@ -52,14 +58,13 @@ def math_operation(val1: int, val2: int, operator: str) -> Union[int, float]:
@context_params
def summarize_data(
Collaborator

Can we simplify this? It feels weird to have a summarize class and a whole ton of parameters in the call.

summaries_as_text=summaries_as_text,
)

# If data is not DocSets, text is this list here
Collaborator

Can we force data to always be DocSets? If it somehow isn't, convert it to a DocSet?

Collaborator Author

According to vinayak, if it's not DocSets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet. Seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.



def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
if summarizer_cls is LLMElementTextSummarizer:
Collaborator

This if class thing can't be the right way to do this.

Collaborator Author

it's not.

@@ -143,7 +141,238 @@ def collapse(text: str, tokens_per_chunk: int, tokenizer: Tokenizer, summarizer_
return cur_summary


class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
Collaborator

It would have been nice to not have this all together in a single PR, this probably could have been separate.

class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
"""
Summarizes a document by constructing a heirarchical tree of batches of elements,
Collaborator

I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?
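
As a point of comparison, a minimal sketch of the "group until the context window is full" approach described here. The function name `pack_by_token_budget` and the `count_tokens` callable are assumptions for illustration; a real implementation would use the model's tokenizer and reserve room for the prompt itself.

```python
from typing import Callable, List


def pack_by_token_budget(
    texts: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[List[str]]:
    """Greedily pack consecutive texts into batches that fit within max_tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0

    for text in texts:
        cost = count_tokens(text)
        # Close the current batch when the next text would overflow it.
        # A single oversized text still gets a batch of its own.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost

    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Whitespace word count as a crude stand-in for a tokenizer.
    def word_count(text: str) -> int:
        return len(text.split())

    print(pack_by_token_budget(["a b c", "d e", "f", "g h i j"], 4, word_count))
```

If everything fits in one batch this degenerates to a single-stage summarize; multiple stages only appear once the batch summaries themselves need to be combined.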

return comptransform


class MaxTokensHeirarchicalDocumentSummarizer(Summarizer):
Collaborator

Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How to split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?
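
To illustrate the last question, here is a small sketch contrasting even spreading with packing when every document has the same size. Both helper names are hypothetical, and the fixed per-batch `capacity` is a simplification: with documents of differing sizes, packing can produce uneven splits such as 5,4,1.

```python
from typing import List


def even_batch_sizes(n_items: int, n_batches: int) -> List[int]:
    """Spread n_items across n_batches so sizes differ by at most one."""
    base, extra = divmod(n_items, n_batches)
    return [base + 1 if i < extra else base for i in range(n_batches)]


def packed_batch_sizes(n_items: int, capacity: int) -> List[int]:
    """Fill each batch to capacity and leave the remainder as a tail."""
    full, tail = divmod(n_items, capacity)
    return [capacity] * full + ([tail] if tail else [])


if __name__ == "__main__":
    print(even_batch_sizes(10, 3))    # even spread
    print(packed_batch_sizes(10, 4))  # packed with a tail
```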
