Test-Time Compute for Topics: Embrace uncertainty to reduce hallucinations, with multi-sample and/or semantic entropy
This is a long-form issue, in three parts:
Introduce the basic concept and a simple algorithm.
Explain a more advanced variant using semantic entropy (Farquhar et al. 2024).
Extend that more advanced variant for better multi-class support.
Exploiting multi-sampling for Topics
A common way to improve LLM output, in particular against hallucinations, is to use test-time compute: make the LLM work extra, with multiple calls, to get a better answer -- with or without fine-tuning. See for example the introduction of (Snell et al. 2024) for an accessible overview and key references [the rest of the article is very thorough but more advanced than what we need for now]. This is also, at its simplest, what is done by algorithms that keep retrying until the LLM returns an answer that passes sanity checks (e.g. non-empty output).
Another place where sampling multiple times is used is Bayesian sampling: draw a sample from the distribution rather than a single “best” output. This gives a measure of how certain the model is of its answer. We can use this with LLMs too. Indeed, sampling multiple answers and looking at the distribution of these outputs is a form of compute at test time, where we use the whole distribution of outputs rather than keeping only the best one.
The algorithm looks as follows when applying it to the topic categorization task, given a list of pre-specified topics (e.g. provided by the user or discovered by a first step of the algorithm).
For each comment:
[Sampling] Repeat N times: generate the LLM answer, indicating which category the comment belongs to. With proper sanitization of the output, it returns one (or several) class(es). This gives you N samples from the LLM’s categorical distribution for that comment.
[Aggregation] Either:
Refuse to answer if the entropy of the distribution is too high -- or any other measure of spread of the distribution (probability distance between top categories, size of the support). The algorithm can’t categorize. This solves [Topic Models] Understand various shades of “Other/None of the Above” #1876.
Or pick one of several high-probability categories
If looking for a single topic: Keep just the mode
If accepting multi topics: Keep some of the categories, for example
All those above a certain probability threshold
Or only the roughly equi-probable most-probable categories
Or any variant of that selection algorithm
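For concreteness, here is a minimal sketch of that sampling-and-aggregation loop. The `sample_topic` helper is hypothetical (it stands in for one sanitized LLM call), and the values of N, the entropy threshold, and the probability threshold are purely illustrative, to be tuned as discussed below.

```python
import math
from collections import Counter

# Hypothetical helper standing in for one sanitized LLM call: prompt the model with
# the comment and the list of topics, parse/validate the answer, return one topic.
def sample_topic(comment: str, topics: list[str], temperature: float) -> str:
    raise NotImplementedError("replace with an actual LLM call")

def categorize(comment, topics, n_samples=20, temperature=1.0,
               entropy_threshold=0.5, prob_threshold=0.3):
    # [Sampling] N draws from the LLM's categorical distribution for this comment.
    counts = Counter(sample_topic(comment, topics, temperature) for _ in range(n_samples))
    probs = {topic: c / n_samples for topic, c in counts.items()}

    # Spread of the empirical distribution, measured here by its entropy (in nats).
    entropy = -sum(p * math.log(p) for p in probs.values())

    # [Aggregation] Refuse to answer when the distribution is too spread out.
    if entropy > entropy_threshold:
        return None  # "Other / can't categorize" (see #1876)

    mode = max(probs, key=probs.get)  # single-topic variant: keep just the mode
    multi = [t for t, p in probs.items() if p >= prob_threshold]  # multi-topic variant
    return mode, multi
```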
There are a few caveats for equating multi-topics with multi-modes: the entropy will be higher when several categories have high probability, thus the threshold needs to account for that. See below (extension of semantic entropy) for an alternative approach that could be more robust.
Notes on a few parameters and refinements:
There are a few parameters to this algorithm: the sampling temperature of the LLM (which controls how much variability comes from the LLM), and the thresholds used in the aggregation step. Ideally, we'd choose them by optimizing on a “training” set of human-labelled data (which we do have, in BG2018 and WB2022, see discussion in [Topic Models] Support multiple topics per comment #1877 ) -- typically with cross-validation, to get an idea of how robust that optimization is (with the caveat that two conversations don’t allow for much “out of distribution” measurement). Note that even the Nature paper (Farquhar et al. 2024) presenting a more advanced version of the idea here (which I’ll cover in a follow-up issue) is a bit fuzzy on how they chose these parameters ;-)
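As a rough illustration of what that tuning could look like (a sketch only: `run_pipeline` stands in for the categorization loop above, and `score` for whatever metric we pick on the gold labels, e.g. accuracy with a penalty for refusals):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def tune_parameters(comments, gold_labels, run_pipeline, score,
                    temperatures=(0.5, 1.0, 1.5), thresholds=(0.25, 0.5, 0.75),
                    n_folds=5):
    """Grid search over the two main knobs, with K-fold splits of the labelled data."""
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    results = {}
    for temp, thresh in product(temperatures, thresholds):
        fold_scores = []
        for _, val_idx in kfold.split(comments):
            preds = [run_pipeline(comments[i], temperature=temp, entropy_threshold=thresh)
                     for i in val_idx]
            fold_scores.append(score(preds, [gold_labels[i] for i in val_idx]))
        # The spread across folds gives a (rough) idea of how robust the choice is.
        results[(temp, thresh)] = (np.mean(fold_scores), np.std(fold_scores))
    return max(results, key=lambda k: results[k][0])
```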
If, as we should, we allow the LLM to return multiple classes (as should be the case if expecting multi-topics, see [Topic Models] Support multiple topics per comment #1877 ), the same simple algorithm can apply; alternatively, we can refine the estimation of the underlying distribution from the N samples by estimating the probabilities of a multilabel distribution. However, while more rigorous, that is also more complicated (exploit the without-replacement aspect, capture correlations maybe?), and I suspect it would be quite a lot of work for possibly marginal improvement. So I recommend pragmatically leaving it aside until we are in refinement territory.
Semantic entropy
A more advanced version of the above is possible and was recently published in Nature (Farquhar et al. 2024), and got some press in TIME Magazine (Perrigo 2024).
In a nutshell, its nice refinement is that, instead of using the entropy of the distribution of the raw answers, it first groups the answers by meaning, to account for various output formats (“The capital of France is Paris” vs “Paris is the capital of France” vs “Paris”). All the answers that are equivalent to one another (in the sense of “double entailment”, i.e. A implies B and B implies A) are collapsed into the same equivalence class/category. Entailment is judged by asking the LLM “does A imply B?”. Efficient grouping is done via a classical connected-components algorithm from graph theory.
This amounts to semantic entropy, in the sense that it takes the entropy over semantically distinct answer classes rather than merely character-distinct answers. They then decide that the algorithm cannot answer if the entropy is too high, or otherwise pick the mode of the distribution as the right answer -- and they demonstrate this reduces hallucinations. This is illustrated nicely in Figure 1.a of their article.
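A rough sketch of the grouping-then-entropy idea, using simple counts per semantic class (see the note on likelihoods just below). The `entails` helper is hypothetical and stands in for a single LLM entailment query:

```python
import math
from collections import Counter

# Hypothetical helper: ask the LLM whether answer `a` implies answer `b`.
def entails(a: str, b: str) -> bool:
    raise NotImplementedError("replace with an actual LLM entailment prompt")

def semantic_clusters(answers: list[str]) -> list[int]:
    """Group answers into equivalence classes via bidirectional entailment,
    using a small union-find (connected components) pass."""
    parent = list(range(len(answers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            # A and B are equivalent iff A implies B and B implies A.
            if entails(answers[i], answers[j]) and entails(answers[j], answers[i]):
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(answers))]

def semantic_entropy(answers: list[str]) -> float:
    """Discrete entropy over semantic classes (counts, not token likelihoods)."""
    counts = Counter(semantic_clusters(answers))
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```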
That is the gist we need. The main content of the paper discusses the following two extra ideas, but we can skip them for topics:
Likelihood vs discrete entropy: Instead of computing the entropy on the counts of each class, they discuss using the “average length-normalized token-likelihood” over the answers in each class. However, not every LLM provides the likelihood, and in practice their experiments show that there is rarely a difference, so let’s keep it simple and use the counts.
Splitting large documents: The basic idea above works well for short answers to short questions, which is what we need for topics, but doesn’t apply as easily to long documents with long questions and complicated answers -- which they spend a big part of the paper extending it to. We do not need that for topics, since those are short questions, but I discuss that approach in this separate GitHub issue on applying it to summarization: [LLM Summarization] Test-time compute to reduce hallucinations in summarization #1881
Choice of parameters (thresholds, temperatures), as always, is a question. In their article, the authors compare various methods and thus report AUROC (a.k.a. the classical AUC) and AURAC (a newer metric that takes non-response into account), both of which are integrated over thresholds. We do not have that luxury in practical applications, so we need to tune the threshold as discussed above. As to the temperature, they just pick one without explaining why.
Extending semantic entropy to multilabel
Generalization of semantic entropy to multilabel (aka multi-topics): the semantic entropy article assumes that there is a single correct answer to each question. To handle multilabel, we can either:
Option A: Modify to accept multiple correct answers (i.e. modes), similar to discussed for multi-sampling above, with each answer being a single category.
Option B: Or consider that there is a single correct answer, which contains the multiple classes.
Option A is nice and simple, but has a few potential issues:
Multi-modality increases entropy, so we need to be extra careful in choosing a threshold that accepts several categories but rules out “too many”. For our topics, this can be done nicely when we know all the categories we care about: unlike the semantic entropy equation (5), page 8, which needs to resort to an approximation based on the number of classes actually produced by the LLM, we can compute the semantic entropy over the actual number of classes. That will help mitigate the issue.
There is also a problem if, for example, the comment could belong to two classes, one more strongly than the other. Indeed, if asked to return a single class, the LLM might always favour the strongest one, completely masking the real but weaker one. We could mitigate this by choosing an adequately high value of the LLM temperature, but it’s not ideal.
Option B needs more careful thinking, but avoids these issues by explicitly treating the problem as multilabel and expecting, as the unique answer, a full set of labels:
The entailment question now includes a form of set-theory question.
And we can either
require the exact same set, for equivalence,
or consider taking the union or the intersection of sets when they have several classes in common.
This latter part requires a bit more thought (as mentioned in #1877 , a literature review on multilabel would be handy), but that’s a starting point!
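To make Option B a bit more concrete, here is a rough sketch of what the set-level equivalence test could look like, covering both the strict variant (exact same set) and a relaxed one (overlapping sets, which one could then merge by union or intersection). The `same_label` helper is hypothetical and stands in for a per-label bidirectional entailment query to the LLM:

```python
# Hypothetical helper: do labels `a` and `b` mean the same thing?
# (bidirectional entailment on individual labels, judged by the LLM)
def same_label(a: str, b: str) -> bool:
    raise NotImplementedError("replace with an actual LLM entailment prompt")

def has_match(label: str, label_set: set[str]) -> bool:
    """Is there a semantically equivalent label in the other set?"""
    return any(same_label(label, other) for other in label_set)

def sets_equivalent(s1: set[str], s2: set[str], strict: bool = True) -> bool:
    """Option B equivalence between two multilabel answers.

    strict=True  -> require exactly the same set of labels;
    strict=False -> relaxed variant: treat the answers as equivalent as soon as
                    they share at least one label (they could then be merged by
                    taking the union or the intersection of the two sets).
    """
    if strict:
        return all(has_match(l, s2) for l in s1) and all(has_match(l, s1) for l in s2)
    return any(has_match(l, s2) for l in s1)
```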
References
Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. ‘Detecting Hallucinations in Large Language Models Using Semantic Entropy’. Nature 630 (8017): 625–30. https://doi.org/10.1038/s41586-024-07421-0.
Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. ‘Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters’. arXiv. https://doi.org/10.48550/arXiv.2408.03314.