
Tokenizer fixes #113

Merged · merged 9 commits into main from tokenizer-fixes on Aug 19, 2024

Conversation

@pcuenca (Member) commented Aug 2, 2024

Addresses two issues found while thoroughly testing the Gemma tokenizer:

  • Added tokens are sorted by reverse length, to prevent early partial matches. This is similar to huggingface/transformers.js@c305c38 (thanks @xenova 🙌)
  • The vocab is stored using NSString instead of String, because String equality only considers the canonical Unicode representation. There are tokens in the Gemma vocab with the same representation, for example: à (0x61 0x300) and à (0xe0). Using a [String : Int] dictionary would randomly choose one and ignore the other (see the sketch below). This can potentially explain the crashes observed by @DePasqualeOrg in Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111 (thanks for reporting!).
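A minimal standalone sketch of the equality difference (not part of this PR's diff): Swift String equality uses canonical Unicode equivalence, while NSString compares the literal contents.

    import Foundation

    // "à" in two canonically-equivalent encodings, like the tokens in the Gemma vocab.
    let decomposed  = "a\u{300}"  // U+0061 U+0300
    let precomposed = "\u{E0}"    // U+00E0

    // Swift String equality uses canonical equivalence, so a [String: Int]
    // vocab would treat both tokens as the same key.
    assert(decomposed == precomposed)

    // NSString comparison is literal (UTF-16 code units), so an [NSString: Int]
    // vocab keeps the two tokens distinct.
    assert(!(decomposed as NSString).isEqual(to: precomposed))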

Old and new tests pass.

    return
    }

    // This should be 256_000, I believe
@pcuenca (Member, Author) commented on the diff:

This is a bit puzzling, I need to double check.

    let config = Config(dict)

    let vocab_nsdict = config.dictionary["vocab"] as! NSDictionary
    let vocab_nsstring = config.dictionary["vocab"] as! [NSString: Int]
@pcuenca (Member, Author) commented on the diff:

Users of Config must be aware that dictionaries with String keys may lose information. We could perhaps add dictionary getters instead, just like we do for arrays and other types.
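For illustration, a rough sketch of what such a getter could look like, following the cast used in the test above; the property name and Config internals are assumptions, not part of this PR:

    extension Config {
        /// Hypothetical accessor: returns the vocab with NSString keys intact,
        /// so canonically-equivalent tokens are not collapsed into a single entry.
        var vocabNSDictionary: [NSString: Int]? {
            dictionary["vocab"] as? [NSString: Int]
        }
    }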

@DePasqualeOrg (Contributor) commented:
This, along with ml-explore/mlx-swift-examples@885e520, fixes the crashes with Gemma 2 2B for me, and the problems with the quality of the model's output are also resolved. Thank you!

@pcuenca (Member, Author) commented Aug 2, 2024

Awesome! Were the quality issues related to the tokenizer?

@DePasqualeOrg (Contributor) commented:
I'm not sure and can't do an experiment now. It's possible that the quality issues were resolved by the improvements in the model implementation of Gemma 2.

@DePasqualeOrg (Contributor) commented Aug 2, 2024

This is hard to test because it's non-deterministic, but I just got another crash with Gemma 2 2B in split(by captureRegex:) as mentioned here, using the branch from this PR.

@pcuenca (Member, Author) commented Aug 4, 2024

Are you generating code by any chance? For some reason there are still 6 tokens that fail to import from the JSON file. These are the ones:

// : 77923
/* : 91007
/** : 211288
<? : 232181
 : 235316
# : 235345

The first 4 start with an invisible \ufeff prefix. 235316 is \r, and 235345 is simply #.
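For reference, one way to make the invisible prefix visible (an illustrative snippet; the literals below are reconstructed from the description above, not read from tokenizer.json):

    import Foundation

    // The first four offending tokens carry a leading U+FEFF (BOM);
    // the last two are "\r" and "#".
    let suspects = ["\u{FEFF}//", "\u{FEFF}/*", "\u{FEFF}/**", "\u{FEFF}<?", "\r", "#"]
    for token in suspects {
        let scalars = token.unicodeScalars.map { String(format: "U+%04X", $0.value) }
        print(scalars.joined(separator: " "))
    }
    // U+FEFF U+002F U+002F
    // U+FEFF U+002F U+002A
    // U+FEFF U+002F U+002A U+002A
    // U+FEFF U+003C U+003F
    // U+000D
    // U+0023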

I'm trying to get this fixed, although I'm not sure whether this will prevent the regexp crash from happening.

@DePasqualeOrg (Contributor) commented:
Thanks for delving into these issues. I've just been testing Gemma 2 2B with some very simple queries like "Who are you?" and follow-up questions.

@pcuenca (Member, Author) commented Aug 5, 2024

Regarding the missing tokens in the parsed vocabulary, here is my documentation after tracking down one of the issues.

First, we are parsing the JSON file (tokenizer.json) using JSONSerialization.jsonObject. This reads data as Foundation objects, parsing tokens from the vocab dictionary as NSString instances. This is a good thing. Strings cannot be used as keys in the vocab dictionary because equality only considers the Unicode canonical representation. Parsing the JSON and casting to [String : Int] would ignore multiple entries.

However, I found that JSONSerialization fails to correctly parse some strings. Consider the following test case:

    func testArrayParsingWithBOMPrefix() {
        // The second one starts with a BOM prefix
        let items = ["a", "\u{feff}a"]

        // Neither Strings nor NSStrings are equal
        XCTAssertNotEqual(items[0], items[1])
        XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

        // JSONDecoder works
        let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
        let decoder = JSONDecoder()
        let decoded = try! decoder.decode([String].self, from: jsonData)
        XCTAssertEqual(decoded, items)

        // JSONSerialization seems to ignore the BOM.
        // The decoded array contains two items, but they are the same NSString.
        let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
        XCTAssertEqual(ns_decoded.count, items.count)                               // passes
        XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString)   // fails
        XCTAssertEqual(ns_decoded as! [String], items)                              // fails

        // Compare unicodeScalars
        func scalars(_ string: String) -> [UInt32] {
            string.unicodeScalars.map { $0.value }
        }
        for (decoded, expected) in zip(ns_decoded, items) {
            let decodedScalars = scalars(decoded as! String)
            let expectedScalars = scalars(expected)
            XCTAssertEqual(decodedScalars, expectedScalars)         // first passes, second fails
        }
    }

There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two NSStrings, as confirmed by looking at the unicode scalars in the debugger. Unfortunately, the Gemma vocab contains some duplicate entries with/without a BOM prefix, so reading them into a dictionary skips some entries.

Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine:

        // If the BOM character is inside the String, all tests pass
//        let items = ["ab", "a\u{feff}b"]

I suspect the BOM is used for encoding detection during parsing, and the stream is incorrectly assumed to start with a BOM even though it appears in the middle of the actual JSON data.

Also interestingly, JSONDecoder works and can decode the two distinct String instances in the array. We are not using JSONDecoder in this project because:

  • The structure of the JSON files to be parsed is quite open and flexible, so I don't think it would be straightforward to write a decodable structure that represents it. Instead, we use dynamic member lookup to navigate the contents.
  • We can't use String instances for vocab keys, as mentioned above.

I'm not sure how to deal with this.

@pcuenca (Member, Author) commented Aug 19, 2024

Merging now, opened #116 for the remaining edge cases. Will try to find a workaround.

@pcuenca pcuenca merged commit 4c8cf07 into main Aug 19, 2024
1 check passed
@pcuenca pcuenca deleted the tokenizer-fixes branch August 19, 2024 10:26