[llm unify 7/n] Summarize #1192
Signed-off-by: Henry Lindeman <[email protected]>
{% endif %}
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
Can we force the question to always be present by having a default question of "What is the summary of this information?"
This feels overly complex, and experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
+ textwrap.dedent(
Do we want dedent, or to remove all leading whitespace? It's weird to read code like:
if condition_a:
if condition_ab:
"a&b"
else:
"a&!b"
else:
"!a"
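For reference, a minimal sketch of the two options being compared, using only the standard library. The template below is a made-up stand-in, not the prompt from this PR, and the helper names are hypothetical.

```python
import textwrap

# Hypothetical stand-in for a prompt template; not the prompt from this PR.
TEMPLATE = """
    {% if a %}
        {% if b %}
        a&b
        {% else %}
        a&!b
        {% endif %}
    {% else %}
    !a
    {% endif %}
"""


def dedent_only(text: str) -> str:
    # Removes only the whitespace prefix common to all non-blank lines,
    # so nested lines keep their relative indentation in the output.
    return textwrap.dedent(text)


def strip_all_leading(text: str) -> str:
    # Removes every line's leading whitespace, so the template can be
    # indented freely in source without that leaking into the output.
    return "\n".join(line.lstrip() for line in text.splitlines())


dedented = dedent_only(TEMPLATE)
flattened = strip_all_leading(TEMPLATE)
```

With dedent, the source has to stay flat wherever the rendered output should be flat. Stripping all leading whitespace lets the source be indented like normal code, at the cost of losing any indentation that was meant to appear in the rendered prompt.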
I like to have the prompts render out with indentation for printability (though I think the LLM doesn't really see it), like
Element 0:
    properties.state_0: Alabama
    properties.state_1: Alaska
    ...
context: Optional[Context] = None,
docset_summarizer: Optional[Type[Summarizer]] = None,
summarizer_kwargs: dict[str, Any] = {},
Why do we need a class + kwargs rather than passing in an object?
duh. fixed
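A sketch of the difference raised in this thread, for anyone reading later. All names here (ToySummarizer, summarize_with_class, summarize_with_object) are hypothetical stand-ins and not the Summarizer classes or signatures from this PR.

```python
from typing import Any, Optional, Type


class ToySummarizer:
    """Hypothetical stand-in; not the Summarizer class from this PR."""

    def __init__(self, max_words: int = 5):
        self.max_words = max_words

    def summarize(self, text: str) -> str:
        # Crude "summary": keep the first max_words words.
        return " ".join(text.split()[: self.max_words])


# Style questioned in the review: the caller passes a class plus its kwargs,
# and the callee has to know how to construct the object.
def summarize_with_class(
    text: str,
    summarizer_cls: Type[ToySummarizer] = ToySummarizer,
    summarizer_kwargs: Optional[dict[str, Any]] = None,
) -> str:
    summarizer = summarizer_cls(**(summarizer_kwargs or {}))
    return summarizer.summarize(text)


# Alternative: the caller builds the object and passes it in.
def summarize_with_object(text: str, summarizer: Optional[ToySummarizer] = None) -> str:
    summarizer = summarizer or ToySummarizer()
    return summarizer.summarize(text)
```

The sketch also defaults the kwargs parameter to None rather than {}, since a mutable default argument in Python is shared across calls.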
@@ -52,14 +58,13 @@ def math_operation(val1: int, val2: int, operator: str) -> Union[int, float]:
@context_params
def summarize_data(
Can we simplify this? It feels weird to have a summarize class and a whole ton of parameters in the call.
summaries_as_text=summaries_as_text,
)

# If data is not DocSets, text is this list here
Can we force data to always be DocSets? If it somehow isn't, convert it to a DocSet?
According to Vinayak, if it's not DocSets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet. Seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.

def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
    if summarizer_cls is LLMElementTextSummarizer:
This if class thing can't be the right way to do this.
it's not.
@@ -143,7 +141,238 @@ def collapse(text: str, tokens_per_chunk: int, tokenizer: Tokenizer, summarizer_
return cur_summary

class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
It would have been nice to not have this all together in a single PR, this probably could have been separate.
class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
"""
Summarizes a document by constructing a heirarchical tree of batches of elements,
I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?
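To make the alternative concrete, here is a rough sketch of "group until the context window runs out". The function name and the whitespace-based token count are assumptions for illustration; a real version would use the project's tokenizer and account for prompt overhead.

```python
from typing import Callable


def pack_until_full(
    elements: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[list[str]]:
    """Greedily group consecutive elements so each group fits in max_tokens.

    count_tokens defaults to a whitespace word count, a crude stand-in for
    a real tokenizer. An element larger than max_tokens gets its own group.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    used = 0
    for element in elements:
        cost = count_tokens(element)
        # Close the current group if adding this element would overflow it.
        if current and used + cost > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(element)
        used += cost
    if current:
        groups.append(current)
    return groups
```

If everything fits in one group this degenerates to a single-stage summarize; otherwise each group would be summarized once and the group summaries combined in a second pass.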
return comptransform


class MaxTokensHeirarchicalDocumentSummarizer(Summarizer):
Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How to split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?
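For the last question, a small sketch of the "spread evenly" option, which turns 10 docs in 3 groups into 4,3,3. It counts documents rather than tokens, which is a simplification, and both function names are hypothetical.

```python
def even_group_sizes(n_items: int, n_groups: int) -> list[int]:
    """Sizes of n_groups groups that differ by at most one item."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    base, extra = divmod(n_items, n_groups)
    # The first `extra` groups take one additional item each.
    return [base + 1 if i < extra else base for i in range(n_groups)]


def split_evenly(items: list, n_groups: int) -> list[list]:
    """Split items into consecutive groups using even_group_sizes."""
    out, start = [], 0
    for size in even_group_sizes(len(items), n_groups):
        out.append(items[start : start + size])
        start += size
    return out


sizes = even_group_sizes(10, 3)  # [4, 3, 3]
```

The packed variant (5,4,1 in the example) fills each context window before starting the next, so the last call sees much less context than the others. The even variant needs the group count up front, which in practice would come from a token estimate.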
…zer class and all the ingredients needed to instantiate it duh Signed-off-by: Henry Lindeman <[email protected]>
Again, general lack of confidence in this.
Turns summarize document from an iterative folding strategy into a hierarchical strategy.
Uses math + jinja to generate a summary every k elements for the next k elements (and then repeats with k^2 and a stride of k, etc., until k^n > n_elements).
Integrates into summarize_data by slightly changing how that reduce happens. Had to make a separate (similar) prompt for that. Should probably factor most of the jinja logic out as fragments.
I probably broke something in luna, but all the unit tests passed so idk.
Also not sure about some of the names.
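A simplified sketch of the hierarchical idea, as opposed to folding elements into a running summary one at a time. It is not the implementation in this PR, which does the batching with math + jinja inside the prompt; here `combine` is a hypothetical stand-in for an LLM call and the batching is plain Python.

```python
from typing import Callable, Sequence


def hierarchical_reduce(
    items: Sequence[str],
    k: int,
    combine: Callable[[Sequence[str]], str],
) -> str:
    """Reduce items to one string by combining batches of k, level by level.

    Level 1 combines every k items, level 2 combines every k results of
    level 1 (k**2 original items), and so on until one result remains.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if not items:
        raise ValueError("items must not be empty")
    level = list(items)
    while len(level) > 1:
        level = [combine(level[i : i + k]) for i in range(0, len(level), k)]
    return level[0]


# Toy combine that just brackets its inputs, to make the tree visible.
tree = hierarchical_reduce(list("abcdefghij"), 3, lambda batch: "(" + "".join(batch) + ")")
```

With n elements and batch size k this takes roughly log_k(n) rounds, and the calls within a round are independent of each other, whereas the iterative fold needs one sequential call per chunk.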