Text classification into industry categories

The goal of this projest is to present 3 different approaches for text classification. Company descriptions are classified into 1835 industry codes from official german classification WZ 2008. Complete description of the task and used approaches can be found in this article.

Approach with LLM (Large Language Model) + RAG (Retrieval Augmented Generation)

Project shows classification of text with OpenAI GPT-4 model accessed via API. To improve classification results prompt is enriched by additional guidelines text accessed with RAG (Retrieval Augmented Generation) is used.

classify-industry-with-LLM-and-RAG.ipynb

Approach with zero-shot transformer classification dedicated model

Sample classification using model MoritzLaurer/mDeBERTa-v3-base-mnli-xnli downloaded from HuggingFace.

[classify-industry-with-self-train-model.ipynb](https://github.com/mzarnecki/companyDescriptionClassification/blob/master/classify-industry-with-zero-shot-tranformer.ipynb)

Approach with self trained model - RandomForestClassifier

Project contains text normalization, tokenization, model training and evaluation as well as model export and import. Presented approaches were used to classify company descriptions from data/data.csv into 2 industry categories.

classify-industry-with-self-train-model.ipynb

In order to use these examples in your classification problems just replace datasets.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
data		data
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
classify-industry-peft-lora-fine-tuning.ipynb		classify-industry-peft-lora-fine-tuning.ipynb
classify-industry-with-LLM-and-RAG.ipynb		classify-industry-with-LLM-and-RAG.ipynb
classify-industry-with-self-train-model.ipynb		classify-industry-with-self-train-model.ipynb
classify-industry-with-zero-shot-tranformer.ipynb		classify-industry-with-zero-shot-tranformer.ipynb
classify_industry_codes_approches.png		classify_industry_codes_approches.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Text classification into industry categories

Approach with LLM (Large Language Model) + RAG (Retrieval Augmented Generation)

Approach with zero-shot transformer classification dedicated model

Approach with self trained model - RandomForestClassifier

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

mzarnecki/companyDescriptionClassification

Folders and files

Latest commit

History

Repository files navigation

Text classification into industry categories

Approach with LLM (Large Language Model) + RAG (Retrieval Augmented Generation)

Approach with zero-shot transformer classification dedicated model

Approach with self trained model - RandomForestClassifier

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages