Some Chinese names have unnecessary spaces at the end when transliterating #64

Open
taraskuzyk opened this issue May 20, 2021 · 1 comment


taraskuzyk commented May 20, 2021

When trying to transliterate

"马云"
I receive

"Ma Yun " (notice the space at the end) instead of

"Ma Yun"

Here's the code you can use to replicate this issue:

import unittest
import unidecode

class TestStrings(unittest.TestCase):
    def test_replace_non_ascii_letters_with_chinese_name(self):
        # assertEqual replaces the deprecated assertEquals alias
        self.assertEqual(unidecode.unidecode("马云"), "Ma Yun")

if __name__ == "__main__":
    unittest.main()

The test fails with the following error:

AssertionError: 'Ma Yun ' != 'Ma Yun'
- Ma Yun 
?       -
+ Ma Yun

Run on Python 3.8.5

EDIT:

Google Translate seems to handle this without a trailing space, but perhaps Google Translate's transliteration is the faulty one. Chinese speakers are welcome to correct me.
[Screenshot: Screen Shot 2021-05-20 at 4 21 49 PM]

avian2 (Owner) commented May 21, 2021

The technical reason the transliteration of each character ends with a space is that otherwise you would get no spaces between characters. In your example you would get "MaYun". Unidecode just does a simple mapping from each Unicode character to an ASCII sequence and doesn't know which character comes last in your name. Hence the last character leaves a trailing space.

I don't speak Chinese, but the original author of Unidecode thought it was better to have spaces, so I will leave it like that.
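
To illustrate the per-character mapping described above, here is a minimal sketch. The `TOY_TABLE` dictionary and `toy_transliterate` function are hypothetical stand-ins, not Unidecode's actual data or implementation; the two entries are simply inferred from the "Ma Yun " output reported in this issue. It shows why the trailing space appears and how the caller could strip it.

```python
# Hypothetical two-entry table standing in for Unidecode's real data.
# Each entry carries its own trailing space, as described in the comment above.
TOY_TABLE = {"马": "Ma ", "云": "Yun "}


def toy_transliterate(text: str) -> str:
    """Map each character independently; ASCII passes through unchanged."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            # Unknown characters are dropped in this toy version.
            pieces.append(TOY_TABLE.get(ch, ""))
    return "".join(pieces)


raw = toy_transliterate("马云")
print(repr(raw))          # 'Ma Yun '
print(repr(raw.strip()))  # 'Ma Yun'
```

Because the mapping never looks at neighbouring characters, the last character cannot know it is last. Under the same assumption, calling `.strip()` on the result of `unidecode.unidecode(...)` would be one way for the caller to drop the trailing space.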
