
mzarnecki/companyDescriptionClassification


Text classification into industry categories

The goal of this project is to present three different approaches to text classification. Company descriptions are classified into 1835 industry codes from the official German classification WZ 2008. A complete description of the task and the approaches used can be found in this article.

classify_industry_codes_approches.png

Approach with LLM (Large Language Model) + RAG (Retrieval Augmented Generation)

This approach classifies text with the OpenAI GPT-4 model, accessed via the API. To improve classification results, the prompt is enriched with additional guideline text retrieved with RAG (Retrieval Augmented Generation).

classify-industry-with-LLM-and-RAG.ipynb
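
The retrieval and prompt-enrichment step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the bag-of-words retriever is a stand-in for whatever retrieval the notebook uses, the helper names (`retrieve`, `build_prompt`) are hypothetical, and the guideline passages are invented placeholders rather than text from WZ 2008.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(description, guidelines):
    """Combine retrieved guideline passages and the description into one prompt."""
    context = "\n".join(f"- {g}" for g in guidelines)
    return (
        "Classify the company description into a WZ 2008 industry code.\n"
        f"Relevant guidelines:\n{context}\n"
        f"Company description: {description}\n"
        "Answer with the code only."
    )


# Placeholder guideline passages, invented for illustration only.
GUIDELINES = [
    "Software development and programming activities belong to the information and communication section.",
    "Growing of cereals and other crops belongs to the agriculture section.",
    "Retail sale of clothing in specialised stores belongs to the trade section.",
]

description = "We build custom software and mobile apps for logistics companies."
top = retrieve(description, GUIDELINES, k=1)
prompt = build_prompt(description, top)
print(prompt)
```

The resulting prompt would then be sent to the model through the API. In practice an embedding-based retriever is a common replacement for the word-overlap scoring used here.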

Approach with a dedicated zero-shot transformer classification model

Sample classification using the MoritzLaurer/mDeBERTa-v3-base-mnli-xnli model downloaded from HuggingFace.

[classify-industry-with-zero-shot-tranformer.ipynb](https://github.com/mzarnecki/companyDescriptionClassification/blob/master/classify-industry-with-zero-shot-tranformer.ipynb)
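
A zero-shot classification call with this model might look like the sketch below. It assumes the Hugging Face `transformers` zero-shot pipeline; the helper names (`build_classifier`, `classify`, `demo`) and the candidate labels are hypothetical and not taken from the notebook.

```python
def build_classifier(model_name="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"):
    """Create a zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def classify(description, candidate_labels, classifier, top_k=3):
    """Return the top_k (label, score) pairs for one description."""
    result = classifier(description, candidate_labels, multi_label=False)
    # The pipeline returns labels and scores sorted from most to least likely.
    return list(zip(result["labels"], result["scores"]))[:top_k]


def demo():
    """Illustrative run; the labels are made-up examples, not WZ 2008 entries."""
    labels = ["software development", "agriculture", "retail trade"]
    clf = build_classifier()
    text = "We build custom software and mobile apps for logistics companies."
    for label, score in classify(text, labels, clf):
        print(f"{label}: {score:.3f}")
```

Calling `demo()` runs the example end to end. The pipeline scores each candidate label separately against the text, so running it over all 1835 codes costs one model pass per code for every description.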

Approach with a self-trained model - RandomForestClassifier

The notebook covers text normalization, tokenization, model training and evaluation, as well as model export and import. The presented approaches were used to classify company descriptions from data/data.csv into 2 industry categories.

classify-industry-with-self-train-model.ipynb
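
The steps listed above (normalization, tokenization, training, evaluation, export and import) could be wired together roughly as in this sketch. It assumes scikit-learn and joblib; the TF-IDF features, the hyperparameters, the file name `model.joblib` and the function names are illustrative assumptions, not the notebook's actual choices.

```python
import re


def normalize(text):
    """Lowercase, keep only ASCII letters, German umlauts/ß and spaces, collapse whitespace."""
    text = re.sub(r"[^a-zäöüß ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def train_and_export(texts, labels, model_path="model.joblib"):
    """Train a TF-IDF + random forest pipeline, evaluate it, and save it to disk."""
    # Imported lazily so that normalize() is usable without scikit-learn installed.
    import joblib
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Normalize outside the pipeline so the saved model does not depend on
    # this function being importable when it is loaded again.
    clean = [normalize(t) for t in texts]
    X_train, X_test, y_train, y_test = train_test_split(
        clean, labels, test_size=0.25, random_state=42, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(),  # tokenizes the text and builds TF-IDF features
        RandomForestClassifier(n_estimators=200, random_state=42),
    )
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    joblib.dump(model, model_path)
    return model, accuracy


def load_and_predict(texts, model_path="model.joblib"):
    """Load the exported pipeline and classify new descriptions."""
    import joblib

    model = joblib.load(model_path)
    return model.predict([normalize(t) for t in texts])
```

Here `texts` and `labels` are plain lists, for example two columns read from data/data.csv; the column names in that file are not assumed.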

To use these examples for your own classification problems, just replace the datasets.
