Skip to content

TffvVectorizer Enconding #15970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

TffvVectorizer Enconding #15970

wants to merge 1 commit into from

Conversation

DelgadoPanadero
Copy link

@DelgadoPanadero DelgadoPanadero commented Dec 25, 2019

With this PR I have implemented the TffvVectorizer class. This class is a text feature encoder that works similarly to TfidfVectorizer, however it performs better results in the problem of imbalanced class categorizacion, as it can be seen in this study

Liu, Y. et al., Imbalanced text classification:
doi:10.1016/j.eswa.2007.10.042

Personally I have checked the improvement in the performance of this encoding compared with the classical tfidf encoding for tweets classification. It would be a pleassure for me to contribute to sklearn, so any feedback would be great. Thank you!

@jnothman
Copy link
Member

jnothman commented Dec 25, 2019 via email

@rth
Copy link
Member

rth commented Dec 25, 2019

Yes, please make this PR to https://github.com/scikit-learn-contrib/scikit-learn-extra/ . Also I think the transformer would be enough, no need to add the vectorizer (see #14966).

@rth rth closed this Dec 25, 2019
@rth
Copy link
Member

rth commented Dec 25, 2019

To elaborate why this doesn't qualify for inclusion at the moment: there are a number of unsupervised or supervised weighting schemes that are proposed on top of TF-IDF (see e.g. https://github.com/textvec/textvec and also TF-IGM) for term weighting in text classification. Most claim to over-perform TF-IDF in some cases. Whether there are some that are consistently better / are generally accepted as such by the community at the moment is not clear, and that's why it would be better to start with adding this PR to scikit-learn-extra, then considering adding it to scikit-learn in the future if the user feedback is very positive.

@rth rth added the Move to scikit-learn-extra This PR should be moved to the scikit-learn-extras repository label Dec 25, 2019
@DelgadoPanadero
Copy link
Author

DelgadoPanadero commented Dec 27, 2019

Ok I see your point, thank you for your feedback and congratulation for the quickness in the answer, I didn't expect it to be so soon. I have a doubt. When tagging the issue with 'Move to scikit-learn-extra', do you mean that I should add the PR to scikit-learn-extra or you task to add it when you consider?

I wil think about other enhancements for the library, because I would really like to help to improve it. Thank you!

@rth
Copy link
Member

rth commented Dec 27, 2019

I made a related PR recently scikit-learn-contrib/scikit-learn-extra#45 . I would be good to use the same benchmarks to compare once that one is merged (and at least you should put you estimator in the same module).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Move to scikit-learn-extra This PR should be moved to the scikit-learn-extras repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants