The goal of this project is to present three different approaches to text classification. Company descriptions are classified into 1835 industry codes from the official German classification WZ 2008. A complete description of the task and the approaches used can be found in this article.
The project shows text classification with the OpenAI GPT-4 model, accessed via API. To improve classification results, the prompt is enriched with additional guideline text retrieved using RAG (Retrieval-Augmented Generation).
classify-industry-with-LLM-and-RAG.ipynb
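
The retrieval and prompt-enrichment step can be illustrated with a minimal sketch. It is not the notebook's implementation: a bag-of-words cosine similarity stands in for whatever retriever the notebook uses, and the guideline passages, the prompt wording, and the helper names `retrieve` and `build_prompt` are all invented for illustration.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(description: str, passages: list[str]) -> str:
    """Enrich the classification prompt with the retrieved guideline text."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Classify the company description into a WZ 2008 industry code.\n"
        f"Relevant guidelines:\n{context}\n"
        f"Company description: {description}\n"
        "Answer with the code only."
    )


# Invented guideline passages, for illustration only (not real WZ 2008 text).
GUIDELINES = [
    "Software development covers programming, modifying and testing software.",
    "Bakeries produce bread, rolls and fresh pastry goods.",
    "Road freight transport covers moving goods by truck.",
]

description = "We develop custom software and mobile apps for retailers."
top = retrieve(description, GUIDELINES, k=1)
prompt = build_prompt(description, top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the GPT-4 model through the API; a production retriever would more likely rank passages by embedding similarity than by word overlap.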
Sample zero-shot classification using the model MoritzLaurer/mDeBERTa-v3-base-mnli-xnli, downloaded from HuggingFace.
[classify-industry-with-zero-shot-tranformer.ipynb](https://github.com/mzarnecki/companyDescriptionClassification/blob/master/classify-industry-with-zero-shot-tranformer.ipynb)
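
As a rough sketch of how such a model can be called, the snippet below uses the `zero-shot-classification` pipeline from the `transformers` library. The candidate labels, the sample text, and the helper names `load_classifier`, `classify` and `demo` are assumptions made for illustration; the notebook may structure this differently.

```python
from typing import Callable, Sequence


def load_classifier():
    """Build the zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily; requires the `transformers` package and its dependencies.
    from transformers import pipeline

    return pipeline(
        "zero-shot-classification",
        model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
    )


def classify(classifier: Callable, text: str, labels: Sequence[str]) -> tuple[str, float]:
    """Return the best-scoring label and its score for one text."""
    result = classifier(text, candidate_labels=list(labels))
    # The pipeline returns labels and scores sorted by descending score.
    return result["labels"][0], result["scores"][0]


def demo() -> None:
    """Run one example against the real model (hypothetical labels and text)."""
    labels = ["software development", "bakery", "freight transport"]
    clf = load_classifier()
    print(classify(clf, "We build mobile apps for retailers.", labels))
```

Calling `demo()` downloads the model and prints the predicted label with its score. Because `classify` takes the classifier as an argument, it can also be exercised with a stub that returns the same dictionary shape.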
The project covers text normalization, tokenization, model training and evaluation, as well as model export and import. The presented approaches were used to classify company descriptions from data/data.csv into 2 industry categories.
classify-industry-with-self-train-model.ipynb
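
The stages listed above (normalization, tokenization, training, evaluation, export and import) can be sketched end to end with the standard library. This is only an illustration: a small multinomial Naive Bayes classifier stands in for the model trained in the notebook, JSON stands in for its export format, and the toy texts and labels are invented rather than taken from data/data.csv.

```python
import json
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Normalize (lowercase, strip punctuation) and split into word tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def train(texts, labels) -> dict:
    """Fit a multinomial Naive Bayes model as a JSON-serializable dict."""
    word_counts = {label: Counter() for label in set(labels)}
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    return {
        "class_counts": dict(Counter(labels)),
        "word_counts": {label: dict(c) for label, c in word_counts.items()},
    }


def predict(model: dict, text: str) -> str:
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in model["word_counts"].values() for w in counts}
    total = sum(model["class_counts"].values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total) + sum(
            math.log((counts.get(t, 0) + 1) / denom) for t in tokens
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model: dict, texts, labels) -> float:
    """Fraction of texts whose predicted label matches the true label."""
    hits = sum(predict(model, t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Invented toy data with two categories, for illustration only.
train_texts = [
    "We develop software and mobile apps.",
    "Cloud software consulting and programming.",
    "Our bakery sells fresh bread and cakes.",
    "Family bakery producing bread and pastry.",
]
train_labels = ["IT", "IT", "Food", "Food"]
test_texts = ["Custom software programming services", "Fresh bread every morning"]
test_labels = ["IT", "Food"]

model = train(train_texts, train_labels)
print("accuracy:", accuracy(model, test_texts, test_labels))

# Export the model to a JSON string and import it again.
restored = json.loads(json.dumps(model))
print(predict(restored, "Fresh bread every morning"))
```

Keeping the model as a plain dictionary makes export and import a single JSON round trip; a larger model would typically be saved with the serialization tools of its own library instead.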
To use these examples for your own classification problems, just replace the datasets.
