This repository has been archived by the owner on Aug 27, 2024. It is now read-only.

Natural language processing : bag of words #31

Open
SamVanhoutte opened this issue Apr 29, 2020 · 0 comments
Labels
feature-suggestion All issues related suggestion of a new feature. These are nice to haves but not customer requests

Comments


SamVanhoutte commented Apr 29, 2020

Provide support for bag-of-words translation of documents
Bag of words is a method for converting unstructured text into numeric vectors: each document is represented by the counts of the words it contains. It is commonly used in NLP. An explanation of the concept can be found here.
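
To make the idea concrete, here is a minimal sketch that uses only the Python standard library. The helper name `bag_of_words` and the plain whitespace tokenization are assumptions made for illustration; they are not part of the proposed feature, and a real implementation would likely rely on a library vectorizer instead.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of text documents.

    Hypothetical helper for illustration: lowercases and splits on
    whitespace, with no punctuation handling or stop-word removal.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; words absent from the document get 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the dog sat on the cat"])
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each document becomes a fixed-length vector over the shared vocabulary, which is what allows it to be fed to a standard classifier.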

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def GetBagOfWords(train_set, test_set, text_property_name, additional_feature_names=None):
    # Convert the text column to bag-of-words counts
    train_text_values = train_set[text_property_name]
    test_text_values = test_set[text_property_name]

    # Learn the vocabulary on the training set only, then apply it to both sets
    count_vectorizer = CountVectorizer(binary=False)
    count_vectorizer.fit(train_text_values)
    train_set_bow = count_vectorizer.transform(train_text_values)
    test_set_bow = count_vectorizer.transform(test_text_values)

    # Reweight the counts with TF-IDF, fitted on the training set only
    transformer = TfidfTransformer(use_idf=True).fit(train_set_bow)
    train_set_bow = transformer.transform(train_set_bow)
    test_set_bow = transformer.transform(test_set_bow)

    # Optionally append extra numeric columns (pass a list of column names);
    # both sets are returned as CSR sparse matrices
    if additional_feature_names:
        train_set_bow = sparse.hstack((train_set_bow,
                                       train_set[additional_feature_names].values),
                                      format='csr')
        test_set_bow = sparse.hstack((test_set_bow,
                                      test_set[additional_feature_names].values),
                                     format='csr')

    return train_set_bow, test_set_bow
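
For readers unfamiliar with the TF-IDF step in the function above, the following is a rough pure-Python sketch of the reweighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is my understanding of the scikit-learn defaults; the function name `tfidf` is hypothetical and the sketch is not a substitute for `TfidfTransformer`.

```python
import math


def tfidf(count_vectors):
    """Reweight raw count vectors with smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: the number of documents in which each term occurs.
    df = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = []
    for row in count_vectors:
        scaled = [count * idf_j for count, idf_j in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # Rows that contain no terms stay all-zero instead of dividing by zero.
        weighted.append([v / norm if norm else 0.0 for v in scaled])
    return weighted


# Term 0 occurs in both documents; terms 1 and 2 occur in only one each.
rows = tfidf([[1, 0, 1], [1, 2, 0]])
print(rows)
```

Terms that occur in fewer documents receive a larger weight, so in the first row the rarer term 2 ends up weighted higher than the common term 0, even though both have a raw count of 1.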
@SamVanhoutte SamVanhoutte added the feature-suggestion All issues related suggestion of a new feature. These are nice to haves but not customer requests label Apr 29, 2020