Questions about the update() function in EntropyBottleneck & GaussianConditional #311
-
Thank you for your outstanding open-source project. I've been studying it recently. First, I'm a bit unclear about the need to run the update() function once at test time before actual encoding. Simply put, what is this preparation for, and why is update() not needed during training? Second, I noticed that the update() functions for EntropyBottleneck and GaussianConditional are different. Could you briefly explain why? Third, what is the role of scale_table in GaussianConditional?
Replies: 1 comment
-
.update() fills in the _quantized_cdf table, which is used during runtime, but not during training.

During training, we measure the rate using the negative log-likelihoods (NLL) of $\hat{y}$. That gives us a differentiable function for rate. Thus, we don't actually need to losslessly compress and then losslessly decompress, since that's just the identity function anyways. (Also, it would be difficult to get a differentiable function for rate from just the length of a losslessly compressed bitstream.)

During runtime, we are no longer interested in minimizing the rate. (That was the job of training.) We now want an actual bitstream, so this time we do in fact losslessly compress. This requires running the lossless arithmetic coder, which runs faster when we give it a precomputed _quantized_cdf table. We could certainly precompute these CDFs during training, but they're not needed until later, so it's better to do it after the epoch ends, or after training ends.
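To make the two roles of the rate concrete, here is a minimal, self-contained sketch with a toy 5-symbol PMF (not CompressAI's actual implementation; the 16-bit precision below is an assumption modeled on typical range-coder setups):

```python
import math

# Toy PMF over 5 symbols, a stand-in for the learned distribution of y_hat.
pmf = [0.1, 0.2, 0.4, 0.2, 0.1]

# Training-time rate: expected negative log-likelihood in bits.
# This is differentiable w.r.t. whatever parameters produce the PMF,
# so it can be minimized directly -- no actual bitstream is needed.
rate_bits = -sum(p * math.log2(p) for p in pmf)

# Runtime: update() precomputes an integer "quantized CDF" once, so the
# arithmetic coder can do cheap table lookups instead of repeatedly
# evaluating the model's density.
PRECISION = 16  # assumed bit precision for the integer CDF
cdf = [0]
for p in pmf:
    cdf.append(cdf[-1] + round(p * (1 << PRECISION)))
```

This is simplified: a real implementation renormalizes so the CDF sums to exactly 2^PRECISION and typically reserves probability mass for out-of-range symbols.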
The colors represent different distributions in the figure below:

[figure]
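On the scale_table question, a hedged sketch of the idea: GaussianConditional rounds each predicted scale up to an entry of a small fixed table, so update() only has to precompute one CDF per table entry instead of one per arbitrary scale. A log-spaced table might be built like this (the constants are my assumptions, not verified against the library; see compressai's get_scale_table for the real values):

```python
import math

# Assumed defaults for illustration only.
SCALES_MIN, SCALES_MAX, LEVELS = 0.11, 256, 64

# Log-spaced table: covers both narrow and wide Gaussians with
# roughly constant *relative* precision.
log_min, log_max = math.log(SCALES_MIN), math.log(SCALES_MAX)
scale_table = [
    math.exp(log_min + i * (log_max - log_min) / (LEVELS - 1))
    for i in range(LEVELS)
]

def quantize_scale(s):
    # Round a predicted scale *up* to the nearest table entry, so the
    # coder never assumes a narrower distribution than predicted.
    return next(t for t in scale_table if t >= s)
```

Rounding up rather than to the nearest entry is the conservative choice: overestimating the scale costs a little rate, while underestimating it could assign (near-)zero probability to symbols that actually occur.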