Tokenizer fixes #113
Conversation
Edge cases also added for other BPE tokenizers, but not for T5 yet.
```swift
    return
}

// This should be 256_000, I believe
```
This is a bit puzzling, I need to double check.
```swift
let config = Config(dict)

let vocab_nsdict = config.dictionary["vocab"] as! NSDictionary
let vocab_nsstring = config.dictionary["vocab"] as! [NSString: Int]
```
Users of `Config` must be aware that dictionaries with `String` keys may lose information. We could perhaps add dictionary getters instead, just like we do for arrays and other types.
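A minimal sketch of that idea, written as a free function over the underlying `NSDictionary` (a hypothetical helper for illustration, not the library's actual API):

```swift
import Foundation

/// Hypothetical helper: fetch a vocab-style dictionary while preserving
/// keys that differ only in their Unicode normalization form, by keeping
/// them as NSString instead of bridging to String.
func nsStringDictionary(in dict: NSDictionary, forKey key: String) -> [NSString: Int]? {
    dict[key] as? [NSString: Int]
}
```

A getter along these lines would also keep callers from reaching for the force casts shown in the diff above.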
This, along with ml-explore/mlx-swift-examples@885e520, fixes the crashes with Gemma 2 2B for me, and the problems with the quality of the output from the model are also resolved. Thank you!
Awesome! Were the quality issues related to the tokenizer?
I'm not sure and can't do an experiment now. It's possible that the quality issues were resolved by the improvements in the model implementation of Gemma 2.
This is hard to test because it's non-deterministic, but I just got another crash with Gemma 2 2B in …
Are you generating code by any chance? For some reason there are still 6 tokens that fail to import from the JSON file; they are these ones: …

The first 4 start with an invisible `\ufeff` prefix. 235316 is … I'm trying to get this fixed, although I'm not sure whether this will prevent the regexp crash from happening.
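One quick way to make such invisible prefixes visible is to dump a token's Unicode scalars (the token text below is made up for illustration):

```swift
// A leading "feff" in the scalar dump reveals the hidden BOM prefix.
let token = "\u{feff}code"  // hypothetical token content
print(token.unicodeScalars.map { String($0.value, radix: 16) })
// ["feff", "63", "6f", "64", "65"]
```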
Thanks for delving into these issues. I've just been testing Gemma 2 2B with some very simple queries like "Who are you?" and follow-up questions.
Regarding the missing tokens in the parsed vocabulary, this is my documentation after tracking one of the issues. First, we are parsing the JSON file (…). However, I found the following:

```swift
func testArrayParsingWithBOMPrefix() {
    // The second one starts with a BOM prefix
    let items = ["a", "\u{feff}a"]

    // Neither Strings nor NSStrings are equal
    XCTAssertNotEqual(items[0], items[1])
    XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

    // JSONDecoder works
    let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
    let decoder = JSONDecoder()
    let decoded = try! decoder.decode([String].self, from: jsonData)
    XCTAssertEqual(decoded, items)

    // JSONSerialization seems to ignore the BOM.
    // The decoded array contains two items, but they are the same NSString.
    let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
    XCTAssertEqual(ns_decoded.count, items.count) // passes
    XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString) // fails
    XCTAssertEqual(ns_decoded as! [String], items) // fails

    // Compare unicodeScalars
    func scalars(_ string: String) -> [UInt32] {
        string.unicodeScalars.map { $0.value }
    }
    for (decoded, expected) in zip(ns_decoded, items) {
        let decodedScalars = scalars(decoded as! String)
        let expectedScalars = scalars(expected)
        XCTAssertEqual(decodedScalars, expectedScalars) // first passes, second fails
    }
}
```

There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two … Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine: …
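For illustration, mid-string BOM items of the kind described might look like this (a hypothetical reconstruction, not the exact items used):

```swift
// With the BOM in the middle of the string, all assertions above pass.
let items = ["ab", "a\u{feff}b"]
```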
I suspect this is used for parsing, and the stream is incorrectly assumed to start with a BOM even though it's in the middle of the actual JSON data. Also interestingly, …
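As a small sketch of how affected entries could at least be detected after parsing (a hypothetical helper, not part of the library):

```swift
// Flags strings whose first scalar is U+FEFF, i.e. the BOM that gets
// mishandled during JSON parsing.
func startsWithBOM(_ s: String) -> Bool {
    s.unicodeScalars.first?.value == 0xFEFF
}

// startsWithBOM("\u{feff}a") == true; startsWithBOM("a\u{feff}b") == false
```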
I'm not sure how to deal with this.
Merging now, opened #116 for the remaining edge cases. Will try to find a workaround.
Addresses two issues found while thoroughly testing the Gemma tokenizer:

- `NSString` instead of `String`, because `String` equality only considers the canonical Unicode representation. There are tokens in the Gemma vocab with the same representation, for example: `à` (0x61 0x300) and `à` (0xe0). Using a `[String: Int]` dictionary would randomly choose one and ignore the other. This can potentially explain the crashes observed by @DePasqualeOrg in "Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B" (#111). Thanks for reporting!

Old and new tests pass.
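To make the vocab-key issue concrete, here is a minimal self-contained demo of the pitfall described above (written for this writeup, not code from the PR):

```swift
import Foundation

let composed = "\u{e0}"      // "à" as a single scalar
let decomposed = "a\u{300}"  // "a" + combining grave accent

// Swift String equality uses canonical equivalence: these are "equal".
assert(composed == decomposed)

// NSString comparison is a literal, code-unit comparison: they differ.
assert(!(composed as NSString).isEqual(to: decomposed))

// With String keys, the second insertion silently overwrites the first.
var stringKeyed: [String: Int] = [:]
stringKeyed[composed] = 1
stringKeyed[decomposed] = 2
assert(stringKeyed.count == 1)

// With NSString keys, both tokens survive as distinct entries.
var nsStringKeyed: [NSString: Int] = [:]
nsStringKeyed[composed as NSString] = 1
nsStringKeyed[decomposed as NSString] = 2
assert(nsStringKeyed.count == 2)
```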
dictionary would randomly choose one and ignore the other. This can potentially explain the crashes observed by @DePasqualeOrg in Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111 (thanks for reporting!).Old and new tests pass.