This project applies natural language processing and machine learning to text filtering. It is a classifier that distinguishes between spam and ham (non-spam) texts. The goal is to develop a model that can accurately identify and filter unwanted texts.
How It Works
- Data Collection and Labeling: We start with a dataset of messages that are already labeled as either spam or ham.
- Text Preprocessing: The raw message text is cleaned and converted into word-count features with CountVectorizer:
  - Removing punctuation and special characters.
  - Converting all text to lowercase.
  - Removing common, non-informative words (known as "stop words," such as "the," "is," "a").
- Model Training: We use a Naive Bayes classifier, a probabilistic algorithm that is well suited to text classification. The model is trained on the labeled SMS data to learn the patterns that differentiate spam from ham.
- Model Evaluation: The trained model is tested on a separate set of messages it has never seen before. We measure its accuracy and other metrics to check that it is effective.
- Classify Message: We can input any SMS message and the model will classify it as either spam or ham.
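
The steps above can be sketched end to end in plain Python. This is an illustrative sketch, not the project's code: it replaces scikit-learn's CountVectorizer with a small hand-written `tokenize` helper and implements multinomial Naive Bayes with add-one (Laplace) smoothing directly, so the arithmetic is visible. The function names, the short stop-word list, and the example messages are all made up for this illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "to", "you", "your",
              "at", "for", "are", "we", "me"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def train(messages, labels):
    """Count how often each word appears in each class."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = set().union(*word_counts.values())
    return {"class_counts": class_counts,
            "word_counts": word_counts,
            "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log posterior probability."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen (word, class) pairs above zero.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, labeled by hand for the example.
train_messages = [
    "WIN a FREE prize now!!!",
    "Free entry: claim your cash prize",
    "Urgent! Claim your free reward now",
    "Are we still meeting for lunch?",
    "Call me when you get home",
    "See you at the meeting tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Held-out messages the model never sees during training.
test_messages = ["Claim your free cash now", "Lunch at home tomorrow?"]
test_labels = ["spam", "ham"]

model = train(train_messages, train_labels)
predictions = [predict(model, m) for m in test_messages]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)

print(predictions)                              # ['spam', 'ham']
print(accuracy)                                 # 1.0
print(predict(model, "You won a free prize"))   # spam
```

Both held-out messages are classified correctly here only because the toy data is tiny and cleanly separated. With a real labeled dataset, CountVectorizer and a library Naive Bayes implementation would take the place of `tokenize`, `train`, and `predict`.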