Add support for zstd content compression #423
The other compression schemes supported in Matroska suffer from having to be applied on a per-packet level, with no global state. This means that they can't exploit any of the redundancy between packets, and any Huffman tables or other configuration have to be duplicated in every packet. This severely limits the attainable compression efficiency to approximately the level attained by zlib, which has led to the other algorithms receiving very little adoption by content authors and little support from tool vendors. The zstd library, on the other hand, supports generating a dictionary from a large number of small inputs (e.g. packets) and using that dictionary to compress similar inputs more efficiently. This improves performance on subtitle inputs dramatically. Muxing tools can either scan the input and pass its packets into zstd before muxing an individual file, or provide an auxiliary tool that generates a dictionary from the packets of one or more input files, then saves the result to a file that can be reused when muxing tracks with similar content.
It seems like the dictionary from zstd needs to either be included somewhere at the beginning of the Matroska file or be provided along with it. Is there a citable open specification for zstd, other than source code?
The dictionary goes in […]. Zstandard is defined by RFC8478.
Cool, I didn't know that had been published.
This seems like a nice addition, although for now we are concentrating on adding existing features (and removing unused ones). This seems like something that should go in the next version of Matroska. In particular, parsers up to v4 do not expect that value. When new elements are added, parsers know they can skip them. A new value in an enum like this one, however, affects how the block data is parsed, which could make new files unreadable by existing parsers. To avoid this, the muxer should mark the file as readable only from Matroska version 5. Technically, there is currently no way to define a minver/maxver value for an enum value, so we need to add that (something to add to the EBML Schema format). It's not defined, but you could just add […]. As for the compression algorithm itself, the fact that it's defined by an RFC is a big bonus (free to use). I wonder how practical it would be. It seems that you can only get a proper dictionary if you scan all your sources ahead of muxing. Otherwise you use a lowest-common-denominator dictionary for a particular codec, which is less efficient. And in that case, tools will need to create their own dictionary. It's feasible, but I wonder if there's much gain to expect over the other compression mechanisms. Compression already does poorly on compressed codecs (unlike header stripping), so that limits the compression to raw formats (audio, video, bitmap, text).
I'd expect this to mainly be useful for subtitle formats. Some quick testing on real files indicated theoretical improvements in the range of 2.5~3x over zlib in typesetting-heavy cases. Zlib can actually use user-supplied dictionaries as well, and in a very similar way to how zstd does. The problem is that there doesn't appear to be any decent tooling available to generate dictionaries for zlib, whereas the zstd library includes functions to generate a dictionary from a passed-in data set.
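To illustrate the point about per-packet compression with a shared dictionary, here is a minimal sketch using zlib's preset-dictionary support from the Python standard library. It is not part of the pull request. The subtitle-like packets and the hand-written dictionary are made up for illustration, and the helper names `compress_packet` and `decompress_packet` are hypothetical. With zstd, the dictionary would come from the library's training functions instead of being written by hand.

```python
import zlib


def compress_packet(packet: bytes, zdict: bytes = b"") -> bytes:
    """Compress one packet on its own, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(packet) + comp.flush()


def decompress_packet(data: bytes, zdict: bytes = b"") -> bytes:
    """Decompress one packet; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


# Made-up, typesetting-heavy subtitle packets (ASS-like event lines).
packets = [
    f"Dialogue: 0,0:00:{i:02d}.00,0:00:{i + 2:02d}.00,Default,,0,0,0,,"
    f"{{\\an8\\fs48\\c&HFFFFFF&}}Line {i}".encode()
    for i in range(20)
]

# Hand-written dictionary (an assumption): the text the packets have in common.
zdict = (
    b"Dialogue: 0,0:00:00.00,0:00:00.00,Default,,0,0,0,,"
    b"{\\an8\\fs48\\c&HFFFFFF&}Line "
)

plain = sum(len(compress_packet(p)) for p in packets)
with_dict = sum(len(compress_packet(p, zdict)) for p in packets)
print("raw bytes:", sum(len(p) for p in packets))
print("per-packet zlib, no dictionary:", plain)
print("per-packet zlib, preset dictionary:", with_dict)
```

Each packet is still compressed independently, so a reader can decode any packet in isolation as long as it has the dictionary. That is why the dictionary has to be stored in, or shipped alongside, the file.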
I think lzo1x can handle dictionaries too, but I couldn't find the code/link about that.
Now that zstd is RFC8478, that's an extra incentive to support it. Since we discourage the use of some compression values, we might as well add a new value that is also known not to work on all implementations. In v5 we could make it mandatory.
@rcombs could you rebase and adapt this so it can be merged?
Looking at RFC8478, I wonder if we should add constraints on how zstd would be used. There is a Magic Number at the start that could be stripped, although it could be combined into a separate […]. There are also metadata frames, which don't really make sense in the context of Block compression, as the container compression(s) is supposed to be transparent to the Block reader. We may mention that they SHOULD NOT be used. If some metadata are needed, there is […].
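As a rough sketch of what such constraints could look like in a muxer or demuxer, the snippet below strips and restores the 4-byte frame Magic Number and detects skippable frames. It assumes the "metadata frames" mentioned above are the skippable frames of RFC 8478. The magic values are the ones given in the RFC, while the function names are hypothetical and nothing here was adopted by the pull request.

```python
import struct

# Zstandard frame Magic_Number: 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = struct.pack("<I", 0xFD2FB528)

# Skippable frames use any magic value from 0x184D2A50 to 0x184D2A5F.
SKIPPABLE_MIN = 0x184D2A50
SKIPPABLE_MAX = 0x184D2A5F


def is_skippable_frame(frame: bytes) -> bool:
    """Return True if the data starts with a skippable-frame magic number."""
    if len(frame) < 4:
        return False
    (magic,) = struct.unpack_from("<I", frame)
    return SKIPPABLE_MIN <= magic <= SKIPPABLE_MAX


def strip_magic(frame: bytes) -> bytes:
    """Drop the leading magic number before storing the frame (hypothetical muxer step)."""
    if not frame.startswith(ZSTD_MAGIC):
        raise ValueError("not a Zstandard frame")
    return frame[len(ZSTD_MAGIC):]


def restore_magic(stored: bytes) -> bytes:
    """Re-add the magic number before handing the data to a zstd decoder."""
    return ZSTD_MAGIC + stored


# Placeholder bytes standing in for a real compressed frame body.
frame = ZSTD_MAGIC + b"\x00placeholder-frame-body"
stored = strip_magic(frame)
print(len(frame) - len(stored), "bytes saved per Block")
```

The saving is four bytes per Block, which only matters for streams made of many very small packets, such as subtitles.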
Looking at the latest RFC, it also seems dictionaries are not a thing yet.
Marking as Matroska v5, as v4 publication is close.
Based on ietf-wg-cellar#423. Co-authored-by: rcombs <[email protected]>