From b8689292dd786c6a73987f7a8fbaed6bc0eeb21d Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Fri, 23 Feb 2024 11:08:04 -0500 Subject: [PATCH 1/6] Added warning about v2 version string encoding --- spec/spec.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/spec/spec.md b/spec/spec.md index 6a367b6..cd0dbea 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -1220,6 +1220,21 @@ The first four characters, `PPPP` indicate the protocol. Each genus of a given C The next three characters, `VVV`, provide in Base64 notation the major and minor version numbers of the Version of the protocol specification. The first `V` character provides the major version number, and the final two `VV` characters provide the minor version number. For example, `CAA` indicates major version 2 and minor version 00 or in dotted-decimal notation, i.e., `2.00`. Likewise, `CAQ` indicates major version 2 and minor version decimal 16 or in dotted-decimal notation `1.16`. The Version part supports up to 64 major versions with 4096 minor versions per major version. +::: warning non-canonical base64 +This is a non-canonical encoding using Base64 indicies. Most libraries will drop bits that aren't on a byte boundary using many rfc4648 compliant libraries if you just call decode on these characters naively. + +example) in python (with padding character for demonstration) our version 2.00 maps to "CAA" as above. +```python +>>> base64.urlsafe_b64decode("CAA=") +b'\x08\x00' +``` + +Which is two bytes. However, three base64 characters in this version scheme encode 18 bits. base64 encoding works on 6-bit groupings so: `C -> 0b000010, A -> 0b000000, A -> 0b000000` which is two bytes + two bits when concatenated together. In the python example above we get back `b'\x08\x00'` -> `'0b00001000 0b00000000'` which is two bytes (16 bits) in hexidecimal notation. The canonical decoding by the library is stripping the last two bits per the RFC. 
Implementers should thus use a library capable of getting the index of the b64 characters according to the scheme (for this version string only) and not those written to give canonical decodings. + +See https://datatracker.ietf.org/doc/html/rfc4648#section-3.5 +::: + + The next four characters, `KKKK` indicate the serialization kind in uppercase. The four supported serialization kinds are `JSON`, `CBOR`, `MGPK`, and `CESR` for the JSON, CBOR, MessagePack, and CESR serialization standards, respectively [[spec: RFC4627]] [[spec: RFC4627]] [[spec: RFC8949]] [[ref: RFC8949]] [[3]] [[ref: CESR]]. The last one, CESR, is used to represent `CESR` when the field map is converted to an in-memory data object so that it might be converted more conveniently back to the appropriate serialization. The native CESR serialization of a field map does not need a serialization type. The next four characters, `BBBB`, provide in Base64 notation the total length of the serialization, inclusive of the Version String and any prefixed characters or bytes. This length is the total number of characters in the serialization of the field map. The maximum length of a given field map serialization is thereby constrained to be 64^4 = 2^24 = 16,777,216 characters in length. This is deemed generous enough for the vast majority of anticipated applications. For serializations that may exceed this size, a secure hash chain of Messages may be employed where the value of a field in one Message is the cryptographic digest, SAID of the following Message. The total size of the chain of Messages may, therefore, be some multiple of 2^24.
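The index-based decoding that the warning recommends for `VVV` (and that applies equally to the `BBBB` length field) might look like the minimal sketch below. It is not taken from the spec or from any reference implementation: the helper names `b64_to_int` and `decode_version` are hypothetical, and the URL-safe alphabet is assumed because the patch's own example calls `base64.urlsafe_b64decode`.

```python
import base64

# Assumption: the URL-safe Base64 alphabet from RFC 4648, in index order.
B64_URL_SAFE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
B64_INDEX = {ch: i for i, ch in enumerate(B64_URL_SAFE)}


def b64_to_int(text: str) -> int:
    """Read `text` as a big-endian number whose digits are Base64 indices.

    Hypothetical helper: each character contributes 6 bits and no bits are
    dropped, unlike a byte-oriented Base64 decode.
    """
    value = 0
    for ch in text:
        value = value * 64 + B64_INDEX[ch]  # KeyError on non-alphabet input
    return value


def decode_version(vvv: str) -> tuple[int, int]:
    """Split a three-character `VVV` field into (major, minor)."""
    if len(vvv) != 3:
        raise ValueError("VVV must be exactly three Base64 characters")
    return b64_to_int(vvv[0]), b64_to_int(vvv[1:])


print(decode_version("CAA"))  # (2, 0)
print(decode_version("CAQ"))  # (2, 16)

# The byte-oriented decode keeps only 16 of the 18 bits.
print(base64.urlsafe_b64decode("CAA="))  # b'\x08\x00'
```

The same `b64_to_int` helper reads a four-character `BBBB` field, where `____` gives the largest representable value, 64^4 - 1 = 16,777,215.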
From edc484fbe3937bfac71fbab786ccfbe071320637 Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Thu, 29 Feb 2024 10:12:37 -0500 Subject: [PATCH 2/6] Removed a parenthesis and made it a sentence --- spec/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/spec.md b/spec/spec.md index cd0dbea..df07087 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -1223,7 +1223,7 @@ The next three characters, `VVV`, provide in Base64 notation the major and minor ::: warning non-canonical base64 This is a non-canonical encoding using Base64 indicies. Most libraries will drop bits that aren't on a byte boundary using many rfc4648 compliant libraries if you just call decode on these characters naively. -example) in python (with padding character for demonstration) our version 2.00 maps to "CAA" as above. +For example in python (with padding character for demonstration) our using a semantic version 2.00 that would map to "CAA" as above. ```python >>> base64.urlsafe_b64decode("CAA=") b'\x08\x00' From 03b582ac1082f39179b4679ef51b975d06267633 Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Tue, 19 Mar 2024 13:35:32 -0400 Subject: [PATCH 3/6] Update spec.md Fixed sentence per @m00sey 's suggestion. --- spec/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/spec.md b/spec/spec.md index b08f249..d1217f1 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -1239,7 +1239,7 @@ For example in python (with padding character for demonstration) our using a sem b'\x08\x00' ``` -Which is two bytes. However, three base64 characters in this version scheme encode 18 bits. base64 encoding works on 6-bit groupings so: `C -> 0b000010, A -> 0b000000, A -> 0b000000` which is two bytes + two bits when concatenated together. In the python example above we get back `b'\x08\x00'` -> `'0b00001000 0b00000000'` which is two bytes (16 bits) in hexidecimal notation. The canonical decoding by the library is stripping the last two bits per the RFC. 
Implementers should thus use a library capable of getting the index of the b64 characters according to the scheme (for this version string only) and not those written to give canonical decodings. +Which is two bytes. However, there are three base64 characters in this version scheme which encode 18 bits. base64 encoding works on 6-bit groupings so: `C -> 0b000010, A -> 0b000000, A -> 0b000000` which is two bytes + two bits when concatenated together. In the python example above we get back `b'\x08\x00'` -> `'0b00001000 0b00000000'` which is two bytes (16 bits) in hexidecimal notation. The canonical decoding by the library is stripping the last two bits per the RFC. Implementers should thus use a library capable of getting the index of the b64 characters according to the scheme (for this version string only) and not those written to give canonical decodings. See https://datatracker.ietf.org/doc/html/rfc4648#section-3.5 ::: From 6b082b4014eef2c430271ef1522869d6f153f6c1 Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Fri, 22 Mar 2024 09:31:52 -0400 Subject: [PATCH 4/6] This commit doesn't change anything but should fix CLA error on github --- spec/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/spec.md b/spec/spec.md index b08f249..d1217f1 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -1239,7 +1239,7 @@ For example in python (with padding character for demonstration) our using a sem b'\x08\x00' ``` -Which is two bytes. However, three base64 characters in this version scheme encode 18 bits. base64 encoding works on 6-bit groupings so: `C -> 0b000010, A -> 0b000000, A -> 0b000000` which is two bytes + two bits when concatenated together. In the python example above we get back `b'\x08\x00'` -> `'0b00001000 0b00000000'` which is two bytes (16 bits) in hexidecimal notation. The canonical decoding by the library is stripping the last two bits per the RFC. 
Implementers should thus use a library capable of getting the index of the b64 characters according to the scheme (for this version string only) and not those written to give canonical decodings. +Which is two bytes. However, there are three base64 characters in this version scheme which encode 18 bits. base64 encoding works on 6-bit groupings so: `C -> 0b000010, A -> 0b000000, A -> 0b000000` which is two bytes + two bits when concatenated together. In the python example above we get back `b'\x08\x00'` -> `'0b00001000 0b00000000'` which is two bytes (16 bits) in hexidecimal notation. The canonical decoding by the library is stripping the last two bits per the RFC. Implementers should thus use a library capable of getting the index of the b64 characters according to the scheme (for this version string only) and not those written to give canonical decodings. See https://datatracker.ietf.org/doc/html/rfc4648#section-3.5 ::: From c64dbbc774189df5428c370c7909528b3f5be705 Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Tue, 26 Mar 2024 09:20:09 -0400 Subject: [PATCH 5/6] Added rfc4648 spec-up link as requested in review. Also fixed some grammer and wording issues in that section although the meaning remains. --- spec/spec.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/spec/spec.md b/spec/spec.md index 6a91536..1771fa9 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -1232,9 +1232,9 @@ The first four characters, `PPPP` indicate the protocol. Each genus of a given C The next three characters, `VVV`, provide in Base64 notation the major and minor version numbers of the Version of the protocol specification. The first `V` character provides the major version number, and the final two `VV` characters provide the minor version number. For example, `CAA` indicates major version 2 and minor version 00 or in dotted-decimal notation, i.e., `2.00`. 
Likewise, `CAQ` indicates major version 2 and minor version decimal 16 or in dotted-decimal notation `1.16`. The Version part supports up to 64 major versions with 4096 minor versions per major version. ::: warning non-canonical base64 -This is a non-canonical encoding using Base64 indicies. Most libraries will drop bits that aren't on a byte boundary using many rfc4648 compliant libraries if you just call decode on these characters naively. +This is a non-canonical encoding using Base64 indicies. Most [[spec: RFC4648]]-compliant libraries will drop bits that aren't on a byte boundary if you just call decode on these characters naively. -For example in python (with padding character for demonstration) our using a semantic version 2.00 that would map to "CAA" as above. +For example, in python (with padding character for demonstration), using a semantic version 2.00 that would map to "CAA" in our scheme as above. ```python >>> base64.urlsafe_b64decode("CAA=") b'\x08\x00' From 08126a2f39389318dbee3f9d41deffc34d3137f4 Mon Sep 17 00:00:00 2001 From: Charles Lanahan Date: Wed, 3 Apr 2024 15:17:40 -0400 Subject: [PATCH 6/6] Fixed typo in -0B code in tables --- spec/spec.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/spec/spec.md b/spec/spec.md index 0eb20a0..93015c0 100644 --- a/spec/spec.md +++ b/spec/spec.md @@ -969,7 +969,7 @@ For example, suppose some application uses a list (a universal but non-overridea | `-A##` | Generic pipeline group up to 4,095 quadlets/triplets | 4 | 2 | 4 | | `-0A#####` | Generic pipeline group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | | `-B##` | Message + attachments group up to 4,095 quadlets/triplets | 4 | 2 | 4 | -| `-0A#####` | Message + attachments group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | +| `-0B#####` | Message + attachments group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | | `-C##` | Attachments only group up to 4,095 quadlets/triplets | 4 | 2 | 4 | | `-0C#####` | Attachments 
only group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | @@ -1023,7 +1023,7 @@ This master table includes both the Primitive and Count Code types. The types ar | `-A##` | Generic pipeline group up to 4,095 quadlets/triplets | 4 | 2 | 4 | | `-0A#####` | Generic pipeline group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | | `-B##` | Message + attachments group up to 4,095 quadlets/triplets | 4 | 2 | 4 | -| `-0A#####` | Message + attachments group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | +| `-0B#####` | Message + attachments group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | | `-C##` | Attachments only group up to 4,095 quadlets/triplets | 4 | 2 | 4 | | `-0C#####` | Attachments only group up to 1,073,741,823 quadlets/triplets | 8 | 5 | 8 | | | Universal Count Codes that do not allow genus/version override | | | |