
Tokenizer fixes #113

Merged · merged 9 commits into main from tokenizer-fixes on Aug 19, 2024

Conversation

@pcuenca (Member) commented Aug 2, 2024

Addresses two issues found while thoroughly testing the Gemma tokenizer:

  • Added tokens are sorted by reverse length, to prevent early partial matches. This is similar to huggingface/transformers.js@c305c38 (thanks @xenova 🙌)
  • The vocab is stored using NSString instead of String, because String equality only considers the canonical Unicode representation. There are tokens in the Gemma vocab with the same representation, for example: à (0x61 0x300) and à (0xe0). Using a [String : Int] dictionary would randomly choose one and ignore the other (see the sketch below). This can potentially explain the crashes observed by @DePasqualeOrg in Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111 (thanks for reporting!).
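A minimal standalone sketch of the equality difference (not part of this PR's diff): Swift String equality uses canonical Unicode equivalence, while NSString compares the literal contents.

    import Foundation

    // "à" in two canonically-equivalent encodings, like the tokens in the Gemma vocab.
    let decomposed  = "a\u{300}"  // U+0061 U+0300
    let precomposed = "\u{E0}"    // U+00E0

    // Swift String equality uses canonical equivalence, so a [String: Int]
    // vocab would treat both tokens as the same key.
    assert(decomposed == precomposed)

    // NSString comparison is literal (UTF-16 code units), so an [NSString: Int]
    // vocab keeps the two tokens distinct.
    assert(!(decomposed as NSString).isEqual(to: precomposed))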

Old and new tests pass.

    return
    }

    // This should be 256_000, I believe
@pcuenca (Member, Author) commented on the diff:

This is a bit puzzling, I need to double check.

    let config = Config(dict)

    let vocab_nsdict = config.dictionary["vocab"] as! NSDictionary
    let vocab_nsstring = config.dictionary["vocab"] as! [NSString: Int]
@pcuenca (Member, Author) commented on the diff:

Users of Config must be aware that dictionaries with String keys may lose information. We could perhaps add dictionary getters instead, just like we do for arrays and other types.
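For illustration, a rough sketch of what such a getter could look like, following the cast used in the test above; the property name and Config internals are assumptions, not part of this PR:

    extension Config {
        /// Hypothetical accessor: returns the vocab with NSString keys intact,
        /// so canonically-equivalent tokens are not collapsed into a single entry.
        var vocabNSDictionary: [NSString: Int]? {
            dictionary["vocab"] as? [NSString: Int]
        }
    }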

@DePasqualeOrg (Contributor) commented:
This, along with ml-explore/mlx-swift-examples@885e520, fixes the crashes with Gemma 2 2B for me, and the problems with the quality of the model's output are also resolved. Thank you!

@pcuenca (Member, Author) commented Aug 2, 2024

Awesome! Were the quality issues related to the tokenizer?

@DePasqualeOrg (Contributor) commented:
I'm not sure and can't do an experiment now. It's possible that the quality issues were resolved by the improvements in the model implementation of Gemma 2.

@DePasqualeOrg (Contributor) commented Aug 2, 2024

This is hard to test because it's non-deterministic, but I just got another crash with Gemma 2 2B in split(by captureRegex:) as mentioned here, using the branch from this PR.

@pcuenca (Member, Author) commented Aug 4, 2024

Are you generating code by any chance? For some reason there are still 6 tokens that fail to import from the JSON file. These are the ones:

// : 77923
/* : 91007
/** : 211288
<? : 232181
 : 235316
# : 235345

The first 4 start with an invisible \ufeff prefix. 235316 is \r, and 235345 is simply #.
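For reference, one way to make the invisible prefix visible (an illustrative snippet; the literals below are reconstructed from the description above, not read from tokenizer.json):

    import Foundation

    // The first four offending tokens carry a leading U+FEFF (BOM);
    // the last two are "\r" and "#".
    let suspects = ["\u{FEFF}//", "\u{FEFF}/*", "\u{FEFF}/**", "\u{FEFF}<?", "\r", "#"]
    for token in suspects {
        let scalars = token.unicodeScalars.map { String(format: "U+%04X", $0.value) }
        print(scalars.joined(separator: " "))
    }
    // U+FEFF U+002F U+002F
    // U+FEFF U+002F U+002A
    // U+FEFF U+002F U+002A U+002A
    // U+FEFF U+003C U+003F
    // U+000D
    // U+0023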

I'm trying to get this fixed, although I'm not sure whether this will prevent the regexp crash from happening.

@DePasqualeOrg (Contributor) commented:
Thanks for delving into these issues. I've just been testing Gemma 2 2B with some very simple queries like "Who are you?" and follow-up questions.

@pcuenca (Member, Author) commented Aug 5, 2024

Regarding the missing tokens in the parsed vocabulary, here is my documentation after tracking down one of the issues.

First, we are parsing the JSON file (tokenizer.json) using JSONSerialization.jsonObject. This reads data as Foundation objects, parsing tokens from the vocab dictionary as NSString instances. This is a good thing. Strings cannot be used as keys in the vocab dictionary because equality only considers the Unicode canonical representation. Parsing the JSON and casting to [String : Int] would ignore multiple entries.

However, I found that JSONSerialization fails to correctly parse some strings. Consider the following test case:

    func testArrayParsingWithBOMPrefix() {
        // The second one starts with a BOM prefix
        let items = ["a", "\u{feff}a"]

        // Neither Strings nor NSStrings are equal
        XCTAssertNotEqual(items[0], items[1])
        XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

        // JSONDecoder works
        let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
        let decoder = JSONDecoder()
        let decoded = try! decoder.decode([String].self, from: jsonData)
        XCTAssertEqual(decoded, items)

        // JSONSerialization seems to ignore the BOM.
        // The decoded array contains two items, but they are the same NSString.
        let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
        XCTAssertEqual(ns_decoded.count, items.count)                               // passes
        XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString)   // fails
        XCTAssertEqual(ns_decoded as! [String], items)                              // fails

        // Compare unicodeScalars
        func scalars(_ string: String) -> [UInt32] {
            string.unicodeScalars.map { $0.value }
        }
        for (decoded, expected) in zip(ns_decoded, items) {
            let decodedScalars = scalars(decoded as! String)
            let expectedScalars = scalars(expected)
            XCTAssertEqual(decodedScalars, expectedScalars)         // first passes, second fails
        }
    }

There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two NSStrings, as confirmed by looking at the unicode scalars in the debugger. Unfortunately, the Gemma vocab contains some duplicate entries with/without a BOM prefix, so reading them into a dictionary skips some entries.

Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine:

        // If the BOM character is inside the String, all tests pass
//        let items = ["ab", "a\u{feff}b"]

I suspect the BOM is used for encoding detection during parsing, and the stream is incorrectly assumed to start with a BOM even though it appears in the middle of the actual JSON data.

Also interestingly, JSONDecoder works and can decode the two distinct String instances in the array. We are not using JSONDecoder in this project because:

  • The structure of the JSON files to be parsed is quite open and flexible, so I don't think it would be straightforward to write a decodable structure that represents it. Instead, we use dynamic member lookup to navigate the contents.
  • We can't use String instances for vocab keys, as mentioned above.

I'm not sure how to deal with this.

@pcuenca (Member, Author) commented Aug 19, 2024

Merging now, opened #116 for the remaining edge cases. Will try to find a workaround.

@pcuenca pcuenca merged commit 4c8cf07 into main Aug 19, 2024
1 check passed
@pcuenca pcuenca deleted the tokenizer-fixes branch August 19, 2024 10:26