A PyTorch reproduction of the Spatio-Temporal Entropy Model (STEM) for end-to-end learned video compression.
More details can be found in the following paper:
Spatiotemporal Entropy Model is All You Need for Learned Video Compression
Alibaba Group, arXiv 2021.4.13
Zhenhong Sun, Zhiyu Tan, Xiuyu Sun, Fangyi Zhang, Dongyang Li, Yichen Qian, Hao Li
Note that this is not an official implementation. For the official code, click here
The differences from the original paper include, but are not limited to, the following:
- The model uses fewer channels.
- The Encoder/Decoder in the original paper uses conditional convolutions [1] to support multiple rates in a single model, with the same architecture as [2]. I only use a single-rate Encoder/Decoder with the architecture of [2].
2023.4.6 update
- Added training and evaluation scripts for the variable-rate STEM with the SFT module from [4].
- Python == 3.7.10
- PyTorch == 1.7.1
- CompressAI
I use the Vimeo90k Septuplet Dataset to train the models. The dataset contains about 64,612 training sequences and 7,824 test sequences; each sequence contains 7 frames.
The training dataset folder structure is as follows:
.dataset/vimeo_septuplet/
│ sep_testlist.txt
│ sep_trainlist.txt
│ vimeo_septuplet.txt
│
├─sequences
│ ├─00001
│ │ ├─0001
│ │ │ f001.png
│ │ │ f002.png
│ │ │ f003.png
│ │ │ f004.png
│ │ │ f005.png
│ │ │ f006.png
│ │ │ f007.png
│ │ ├─0002
│ │ │ f001.png
│ │ │ f002.png
│ │ │ f003.png
│ │ │ f004.png
│ │ │ f005.png
│ │ │ f006.png
│ │ │ f007.png
...
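To make this layout concrete, here is a minimal sketch of how the entries in `sep_trainlist.txt` (e.g. `00001/0001`) map to the frame files above. The function names are my own, for illustration — they are not part of this repo:

```python
import os

def septuplet_frame_paths(root, seq_id):
    """Build the 7 frame paths for one Vimeo90k septuplet.

    `root` is the dataset directory (e.g. ".dataset/vimeo_septuplet")
    and `seq_id` is one line of sep_trainlist.txt, e.g. "00001/0001".
    """
    seq_dir = os.path.join(root, "sequences", seq_id)
    return [os.path.join(seq_dir, "f%03d.png" % i) for i in range(1, 8)]

def load_train_list(root):
    """Read sep_trainlist.txt and return the sequence ids, one per line."""
    with open(os.path.join(root, "sep_trainlist.txt")) as f:
        return [line.strip() for line in f if line.strip()]
```

A `torch.utils.data.Dataset` wrapping these helpers only needs to open the 7 PNGs per item and stack them into a tensor.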
I evaluate the model on the UVG and HEVC test sequence datasets. The test dataset folder structure is as follows:
.dataset/UVG/
├─PNG
│ ├─Beauty
│ │ f001.png
│ │ f002.png
│ │ f003.png
│ │ ...
│ │ f598.png
│ │ f599.png
│ │ f600.png
│ │
│ ├─HoneyBee
│ │ f001.png
│ │ f002.png
│ │ f003.png
│ │ ...
│ │ f598.png
│ │ f599.png
│ │ f600.png
│ │
│ │ ...
.dataset/HEVC/
├─BasketballDrill
│ f001.png
│ f002.png
│ f003.png
│ ...
│ f098.png
│ f099.png
│ f100.png
│
├─BasketballDrive
│ f001.png
│ f002.png
│ ...
```
python3 trainSTEM.py -d /path/to/your/image/dataset/vimeo_septuplet --lambda 0.01 -lr 1e-4 --batch-size 16 --model-save /path/to/your/model/save/dir --cuda --checkpoint /path/to/your/iframecompressor/checkpoint.pth.tar
```
I tried training with the Mean-Scale Hyperprior / Joint Autoregressive Hierarchical Priors / Cheng2020Attn models from the CompressAI library and found that a powerful I-frame compressor brings a significant performance benefit.
```
python3 evalSTEM.py --checkpoint /path/to/your/iframecompressor/checkpoint.pth.tar --entropy-model-path /path/to/your/stem/checkpoint.pth.tar
```
Evaluation is currently only supported on the UVG and HEVC test sequence datasets.
| UVG test dataset | PSNR | BPP | PSNR in paper | BPP in paper |
|---|---|---|---|---|
SpatioTemporalPriorModel_Res | 36.104 | 0.087 | 35.95 | 0.080 |
SpatioTemporalPriorModel | 36.053 | 0.080 | 35.95 | 0.082 |
SpatioTemporalPriorModelWithoutTPM | None | None | 35.95 | 0.100 |
SpatioTemporalPriorModelWithoutSPM | 36.066 | 0.080 | 35.95 | 0.087 |
SpatioTemporalPriorModelWithoutSPMTPM | 36.021 | 0.141 | 35.95 | 0.123 |
"PSNR in paper" and "BPP in paper" are estimated from Figure 6 of the original paper.
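For reference, the PSNR and BPP metrics reported above follow the standard definitions, sketched below in plain Python. This is an illustration of the formulas, not the exact code in `evalSTEM.py`:

```python
import math

def psnr(ref, rec, max_val=255.0):
    """PSNR in dB between two equal-length flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)

def bpp(num_stream_bytes, width, height):
    """Bits per pixel for one coded frame."""
    return num_stream_bytes * 8.0 / (width * height)
```

Per-sequence numbers are then the averages of the per-frame values.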
The context model SPM seems to bring little benefit in my experiments.
I look forward to receiving more feedback on the test results, and feel free to share your test results!
As stated in the original paper, a variable-rate auto-encoder is used to support multiple rates in a single model. I tried to train STEM with GainedVAE, which is also a variable-rate model. Some rate points achieve comparable R-D performance, while others degrade. Moreover, the interpolated rate points show even more cases of performance degradation.
We probably need a Loss Modulator [3] for variable-rate model training; see Oren Rippel's ICCV 2021 paper [3] for details.
2023.4.6: I use the SFT module from [4] to support variable-rate compression. As you can see in the results, performance degrades at higher rates.
The framework is based on CompressAI. I add the models in compressai.models.spatiotemporalpriors, and trainSTEM.py/evalSTEM.py are modified with reference to the CompressAI examples.
[1] [Variable Rate Deep Image Compression With a Conditional Autoencoder](https://openaccess.thecvf.com/content_ICCV_2019/html/Choi_Variable_Rate_Deep_Image_Compression_With_a_Conditional_Autoencoder_ICCV_2019_paper.html) (ICCV 2019)
[2] [Joint Autoregressive and Hierarchical Priors for Learned Image Compression](https://arxiv.org/abs/1809.02736)
[3] [ELF-VC: Efficient Learned Flexible-Rate Video Coding](https://arxiv.org/abs/2104.14335)
[4] [Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform](https://arxiv.org/abs/2108.09551) (ICCV 2021)

Feel free to contact me with any questions about the code or to discuss any problems in image and video compression. ([email protected])