Case insensitive check for Unicode #223
Comments
Covering this for every letter is possible, but results in huge arrays. https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt lists all assigned Unicode characters. As described in https://www.unicode.org/L2/L1999/UnicodeData.html, the last three fields of each record tell you whether the character has an uppercase, lowercase, or titlecase counterpart. For example:
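As an illustration of those last three fields, here is a minimal sketch in Python (not the Delphi code from this thread). The sample records follow the UnicodeData.txt layout of 15 semicolon-separated fields and should be checked against the current file; the name `parse_case_fields` is made up for this example.

```python
# Sample records in the UnicodeData.txt layout (15 semicolon-separated fields).
# Fields 12, 13 and 14 hold the simple uppercase, lowercase and titlecase mappings.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
"""


def parse_case_fields(line):
    """Return (codepoint, upper, lower, title); a missing mapping is None."""
    fields = line.split(";")

    def to_int(text):
        # An empty field means "no mapping", i.e. the character maps to itself.
        return int(text, 16) if text else None

    return (int(fields[0], 16),) + tuple(to_int(f) for f in fields[12:15])


for record in SAMPLE.splitlines():
    cp, upper, lower, title = parse_case_fields(record)
    print(f"U+{cp:04X}: upper={upper}, lower={lower}, title={title}")
```

An empty field means the character is its own counterpart for that case, which is why most records contribute nothing to a case-mapping table.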
From those definitions I built arrays and lookup functions myself, covering all 2360 (BMP) + 520 (SMP) cases for converting a letter from one case to the other, plus 47 letters and how they should be normalized when trying to "fold" them for case insensitivity. It's written for Delphi 7, so searching for SMP characters would only work via |
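Why folding is needed on top of upper/lower pairs can be shown with a small sketch, again in Python for illustration only. It uses Python's built-in `str.casefold()`, which applies Unicode full case folding and is not the 47-entry table mentioned above; the two helper names are hypothetical.

```python
KELVIN = "\u212A"       # KELVIN SIGN, lowercases to "k"
SIGMA = "\u03C3"        # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA, uppercases to U+03A3


def match_upper_lower(pattern_ch, text_ch):
    """Pair-style comparison: text must equal lower(pattern) or upper(pattern)."""
    return text_ch in (pattern_ch.lower(), pattern_ch.upper())


def match_folded(pattern_ch, text_ch):
    """Fold both sides to a common form before comparing."""
    return pattern_ch.casefold() == text_ch.casefold()


# "k" and the Kelvin sign are case-insensitively equal, but the Kelvin sign
# is neither lower("k") nor upper("k"), so the pair-style check misses it.
print(match_upper_lower("k", KELVIN), match_folded("k", KELVIN))

# The two small sigmas share one capital letter, so the same thing happens.
print(match_upper_lower(SIGMA, FINAL_SIGMA), match_folded(SIGMA, FINAL_SIGMA))
```

Characters with more than two case variants are the ones where comparing against a single uppercase and a single lowercase form is not enough.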
Only 2360 + 520 possible cases? Of all Unicode characters, only these have a lowercase/uppercase pair? |
Yes. Capital and small letters are a concept that many languages and alphabets do not use, and signs that look like capital and small variants may still be different things that cannot be exchanged (e.g. Katakana). Have a read of https://en.wikipedia.org/wiki/Letter_case |
I haven't seen your code yet; I'm not at a PC. If the character codes that are changed by a case operation are located in a few compact ranges, we can write three simple change-case functions. Can you write these functions? |
Just look at the code: as I already described, there's a fully working sample program with it. You should be able to understand how the unit works. Ranges are difficult to come up with, because far from all letters are aligned like the Latin ones. The lookup per character (= code point), however, is already optimized for fairly fast access. If you still have questions, feel free to ask them. Preferably afterwards. ;) If |
I've seen the code: a lot of work has been done. Good. Now I ask: how big is the speed loss? Today, TRegExpr lowercase/uppercase conversions are very fast (but they need two 130 KB arrays: to-lower and to-upper). If it is not much slower: where can we find the three functions to-lower, to-upper, and to-title? I mean functions that map a WideChar to a WideChar. Maybe you can make the final patch (or a changed TRegExpr)? |
This is the place to change in regexpr.pas:
|
No, you haven't understood it thoroughly. Just because it's a function doesn't mean you have to care about the result: its parameter is passed by reference, not only by value. You could do:
The function result just tells you whether there was a match. Both other functions act the same way. Be aware that in those functions I cast the
Speed-wise it must be slower than your two giant arrays (at worst 177 jumps and 15 loop iterations for one character, at best 1 jump and 1 loop iteration), but on the other hand my approach should need much less memory/storage. If you want, I can benchmark both (or at least write code for it). Sometimes I'm eager for speed, sometimes I want to avoid bloat; both approaches have their advantages. Either way, code points beyond U+FFFF should be dealt with, also for the Unicode category (as "word character"). If picking one character out of a |
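The memory-versus-speed trade-off described above can be sketched as follows, in Python for illustration. This is not the jump-and-loop lookup from the Delphi unit: the sparse variant here swaps in a plain binary search over a sorted table, and the five-entry `PAIRS` table is a hypothetical miniature of the real one.

```python
from bisect import bisect_left

# Hypothetical miniature table: (lowercase, uppercase) code points, sorted by key.
PAIRS = [(0x61, 0x41), (0x62, 0x42), (0xE9, 0xC9), (0x3C9, 0x3A9), (0x44F, 0x42F)]
KEYS = [low for low, _ in PAIRS]


def to_upper_sparse(cp):
    """Binary search: O(log n) steps, storage only for characters that have a pair."""
    i = bisect_left(KEYS, cp)
    if i < len(KEYS) and KEYS[i] == cp:
        return PAIRS[i][1]
    return cp  # no mapping: the character is returned unchanged


# Dense variant: one slot per BMP code point, like a full to-upper array.
DENSE = list(range(0x10000))
for low, up in PAIRS:
    DENSE[low] = up


def to_upper_dense(cp):
    """Direct indexing: one array access, but 65536 entries per table."""
    return DENSE[cp] if cp < 0x10000 else cp


print(hex(to_upper_sparse(0x3C9)), hex(to_upper_dense(0x3C9)))
```

Both functions return the same results; the dense table spends 65536 slots per direction on mostly identity mappings, while the sparse table grows only with the number of cased letters and extends beyond U+FFFF without a larger array.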
I got it, so I can write the patch. |
You won't like it immediately, because it's again code that needs to be understood. But it has everything you want:
But it all works:
It still only works for the BMP (that is, U+0000 to U+FFFF) and not beyond. The test project may also serve as a base for benchmarking both approaches (this one against the huge arrays) for speed (and maybe memory consumption). |
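Going beyond the BMP with 16-bit characters means combining surrogate pairs into one code point before any lookup. A minimal sketch of that step, in Python for illustration (the function name `decode_utf16` is made up here, and nothing like it is claimed to exist in the Delphi unit):

```python
def decode_utf16(units):
    """Yield code points from a sequence of 16-bit UTF-16 code units."""
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point above U+FFFF.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            yield unit
            i += 1


# "A" followed by U+10400 (DESERET CAPITAL LETTER LONG I) as two code units.
print([hex(cp) for cp in decode_utf16([0x0041, 0xD801, 0xDC00])])
```

Once the text is seen as code points rather than 16-bit units, the same case lookup can serve both BMP and SMP characters.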
I got this comment:
So our approach has a weak spot. We match case-insensitively using two arrays: to-lowercase and to-uppercase.
How can we fix this?