The "naive" independence assumption in Naive Bayes is that all features (the attributes or characteristics of the data) are independent of each other, given the class label. This assumption simplifies the computation of conditional probabilities, but in reality, features are often correlated. Despite this simplification, Naive Bayes tends to perform well in practice, especially in problems like text classification.
In Naive Bayes, for any class (say, spam or not spam), the algorithm assumes that the probability of observing each feature is independent of the others. It does not consider any relationships or dependencies between the features. For instance, in text classification, the algorithm assumes that the presence of one word in an email does not affect the presence or absence of another word.
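Formally, for a class $C$ and features $F_1, F_2, \ldots, F_n$ (the same notation used in the formulas later in this section), the assumption lets the joint likelihood factor into a product of per-feature terms:
$$P(F_1, F_2, \ldots, F_n \mid C) = \prod_{i=1}^{n} P(F_i \mid C)$$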
Consider spam detection in email classification. The features here are the individual words in the email, and the class label is whether the email is spam or not spam.
Suppose you have the following data for an email:
- Features (words): ["buy", "cheap", "now", "limited", "offer"]
- Class Label: Spam or Not Spam
Naive Bayes assumes that, given the class (spam or not spam), each of these words appears in the email independently of the others.
- P("buy" | Spam) is the probability that "buy" appears in a spam email.
- P("cheap" | Spam) is the probability that "cheap" appears in a spam email.
- P("now" | Spam) is the probability that "now" appears in a spam email.
According to Naive Bayes, these probabilities are considered independent, meaning the presence of "buy" in the email does not affect the presence of "cheap" or "now," even though intuitively we might expect that certain words (like "buy" and "cheap") tend to appear together in spam emails.
This simplifies the overall probability calculation. The probability that an email is spam, given that it contains the words "buy," "cheap," and "now," is proportional to the prior times the individual word likelihoods:
$$P(\text{Spam} \mid \text{"buy"}, \text{"cheap"}, \text{"now"}) \propto P(\text{Spam}) \times P(\text{"buy"} \mid \text{Spam}) \times P(\text{"cheap"} \mid \text{Spam}) \times P(\text{"now"} \mid \text{Spam})$$
Say you have an email containing the following sentence:
"Buy now, limited offer on cheap products!"
A Naive Bayes classifier will look at the probability of each word appearing in a spam email individually, without considering whether these words might naturally occur together in a spam email:
$P(\text{"Buy"} | \text{Spam})$ $P(\text{"cheap"} | \text{Spam})$ $P(\text{"offer"} | \text{Spam})$
The classifier will then multiply these probabilities to determine the overall likelihood that the email is spam. It does not consider that "Buy," "cheap," and "offer" often occur together in spam emails.
In reality, these words are not independent. If an email contains "cheap," it is more likely to also contain "buy" and "offer" in a spam context. But Naive Bayes ignores this correlation and treats each word as if its appearance is independent of the others.
Despite this unrealistic assumption, Naive Bayes often works well because in many applications like spam filtering or document classification, the overall patterns of word occurrence are captured reasonably well by these independent probability estimates. Even though words may not be completely independent, the general frequency of words in spam versus non-spam emails allows the classifier to make good predictions.
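To make this concrete, here is a minimal Python sketch of the scoring step. The per-word probabilities below are made-up values for illustration only (they are not estimated from real data); each class is scored as its prior times the per-word likelihoods, with the sum taken in log-space to avoid numerical underflow, and the two scores are then normalized.

```python
import math

# Hypothetical, made-up probabilities for illustration only.
priors = {"spam": 0.3, "not_spam": 0.7}

# P(word | class): how often each word appears in emails of each class.
word_likelihoods = {
    "spam":     {"buy": 0.40, "cheap": 0.35, "now": 0.30, "limited": 0.25, "offer": 0.45},
    "not_spam": {"buy": 0.05, "cheap": 0.03, "now": 0.10, "limited": 0.02, "offer": 0.06},
}

def class_scores(words):
    """Score each class as log P(class) + sum of log P(word | class), then normalize."""
    log_scores = {}
    for label, prior in priors.items():
        total = math.log(prior)
        for word in words:
            # Naive assumption: each word contributes independently, given the class.
            total += math.log(word_likelihoods[label][word])
        log_scores[label] = total
    # Convert log-scores back to probabilities that sum to 1.
    norm = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / norm for label, s in log_scores.items()}

print(class_scores(["buy", "cheap", "now", "limited", "offer"]))
# e.g. {'spam': 0.9999..., 'not_spam': 0.0000...}
```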
Let's break down Bayes' theorem with an example that uses the formula:
$$P(C \mid F) = \frac{P(F \mid C) \times P(C)}{P(F)}$$
This means the probability of class $C$ given feature $F$.
You receive an email, and you want to predict whether it is spam ($S$) or not spam ($\neg S$) based on certain features in the email. For simplicity, let’s use one feature: whether the word "discount" appears in the email.
- $P(S)$: The prior probability that an email is spam (regardless of any features). Let’s say, from past data, 30% of all emails are spam: $P(S) = 0.30$
- $P(\neg S)$: The prior probability that an email is not spam: $P(\neg S) = 0.70$ (since 70% of emails are not spam).
- $P(F \mid S)$: The probability that the word "discount" appears in an email, given that the email is spam. Based on training data, we found that the word "discount" appears in 40% of spam emails: $P(F \mid S) = 0.40$
- $P(F \mid \neg S)$: The probability that the word "discount" appears in a non-spam email. Based on training data, the word "discount" appears in 5% of non-spam emails: $P(F \mid \neg S) = 0.05$
- $P(F)$: The total probability that the word "discount" appears in any email (spam or not). We’ll calculate this using the law of total probability:

$$P(F) = P(F \mid S) \times P(S) + P(F \mid \neg S) \times P(\neg S)$$
$$P(F) = (0.40 \times 0.30) + (0.05 \times 0.70)$$
$$P(F) = 0.12 + 0.035 = 0.155$$
Now, we want to calculate the probability that an email is spam, given that the word "discount" appears in the email.
Using Bayes' Theorem, we calculate $P(S \mid F)$, the probability that the email is spam given that it contains the word "discount":
$$P(S \mid F) = \frac{P(F \mid S) \times P(S)}{P(F)}$$
Substitute the values:
$$P(S \mid F) = \frac{0.40 \times 0.30}{0.155}$$
$$P(S \mid F) = \frac{0.12}{0.155} \approx 0.774$$
The probability that the email is spam given that it contains the word "discount" is 77.4%.
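This arithmetic is easy to reproduce in code. The helper below is a small, hypothetical convenience function that applies Bayes' theorem for a single feature, using the numbers from the example above:

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_not_spam):
    """Bayes' theorem for one feature: P(S|F) = P(F|S) * P(S) / P(F)."""
    p_not_spam = 1.0 - p_spam
    # Evidence P(F) via the law of total probability.
    p_word = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam
    return p_word_given_spam * p_spam / p_word

# Numbers from the "discount" example: P(S)=0.30, P(F|S)=0.40, P(F|not S)=0.05.
print(posterior_spam(0.30, 0.40, 0.05))  # ~0.7742
```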
- $P(S \mid F)$: This is the posterior probability, or the probability that the email is spam given that the word "discount" is present.
- $P(F \mid S)$: This is the likelihood, or the probability of observing the word "discount" given that the email is spam.
- $P(S)$: This is the prior probability, or the overall probability that any email is spam, based on prior knowledge.
- $P(F)$: This is the evidence, or the total probability of observing the word "discount" in any email, whether it is spam or not.
Even though not all emails containing the word "discount" are spam, Bayes’ theorem helps you weigh this likelihood based on past data, allowing you to make a more informed prediction about the likelihood of spam.
Naive Bayes classifiers in NLP apply this concept to classify documents, considering multiple features (words) to compute the likelihood that the document belongs to a specific class (e.g., spam or non-spam).
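As a fuller sketch of how this looks in practice, the snippet below trains a multinomial Naive Bayes classifier with scikit-learn on a tiny, made-up toy corpus (the example emails and labels are invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up toy corpus for illustration only.
emails = [
    "buy cheap products now limited offer",
    "cheap discount offer buy now",
    "meeting agenda for next week",
    "project status and quarterly report",
]
labels = ["spam", "spam", "not_spam", "not_spam"]

# Bag-of-words counts feeding a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

test = ["limited offer buy cheap now"]
print(model.predict(test))        # e.g. ['spam']
print(model.predict_proba(test))  # per-class probabilities (order given by model.classes_)
```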
- Prior Probability ($P(C)$): The prior probability represents the initial belief about an event before seeing any evidence. It’s the probability of an event or a hypothesis before you take any specific data into account.
  - Example: In the context of spam detection, $P(\text{spam}) = 0.30$.
- Posterior Probability ($P(C \mid F)$): The posterior probability is the updated probability of the event after considering the evidence. This is the probability of the hypothesis (or class) given the observed features.
  - Example: After seeing the word "discount" in an email, you use Bayes' Theorem to calculate the posterior probability that the email is spam: $P(S \mid F) = \frac{P(F \mid S) \times P(S)}{P(F)}$.
- Likelihood ($P(F \mid C)$): The likelihood represents the probability of observing the evidence given that the hypothesis is true. In other words, it’s the probability of seeing the feature $F$ (the evidence) if the event or class $C$ is true.
  - Example: The likelihood of seeing the word "discount" in spam emails is $P(\text{discount} \mid \text{spam}) = 0.40$.
- Evidence ($P(F)$): The evidence (also called the marginal likelihood) is the total probability of the feature $F$ being observed, regardless of the class.
  - Example: The total probability that the word "discount" appears in any email is $P(F) = P(F \mid \text{spam}) \times P(\text{spam}) + P(F \mid \text{not spam}) \times P(\text{not spam})$.
In the email spam classification example:
- Prior Probability ($P(\text{spam})$): The probability that an email is spam before looking at the specific word "discount".
- Posterior Probability ($P(\text{spam} \mid \text{discount})$): The updated probability that the email is spam after seeing that the word "discount" is present.
- Likelihood ($P(\text{discount} \mid \text{spam})$): The probability that the word "discount" appears in a spam email.
- Evidence ($P(\text{discount})$): The overall probability that the word "discount" appears in any email, whether it’s spam or not.
$$P(C \mid F) = \frac{P(F \mid C) \times P(C)}{P(F)}$$
Where:
- Prior Probability ($P(C)$) is your belief about $C$ (e.g., spam) before observing any feature.
- Likelihood ($P(F \mid C)$) is the chance of observing $F$ (e.g., the word "discount") if $C$ is true.
- Evidence ($P(F)$) is the overall chance of observing the feature $F$, regardless of $C$.
- Posterior Probability ($P(C \mid F)$) is your updated belief about $C$ after observing $F$.
This allows you to refine your predictions and better classify new data based on evidence.
Let's clarify why $P(\text{discount} \mid \text{spam})$ is called the likelihood and not the prior probability or posterior probability.
In Bayes' Theorem:
$$P(C \mid F) = \frac{P(F \mid C) \times P(C)}{P(F)}$$
- $P(C \mid F)$: Posterior probability, the probability of class $C$ (e.g., spam) given the feature $F$ (e.g., the word "discount").
- $P(F \mid C)$: Likelihood, the probability of observing the feature $F$ (e.g., the word "discount") given that the class $C$ (e.g., spam) is true.
- $P(C)$: Prior probability, the probability of class $C$ (spam) before seeing the feature $F$ (before considering any evidence).
- $P(F)$: Evidence, the total probability of the feature $F$ (the word "discount") appearing in any email, regardless of the class.
- Likelihood: $P(F \mid C)$ represents how likely we are to see the feature $F$ (the word "discount") if we already know that the email is spam (class $C$).
  - In this case, $P(\text{discount} \mid \text{spam})$ asks, "What is the probability that the word 'discount' will appear, given that the email is spam?"
  - This is called the likelihood because it reflects how likely a particular feature (e.g., "discount") is under a certain condition (spam).
  - $P(\text{discount} \mid \text{spam}) = 0.40$ means that 40% of spam emails contain the word "discount".
- Prior Probability: The prior probability $P(\text{spam})$ is our belief about the class (spam) before observing any feature (the word "discount"). It reflects the overall rate of spam emails, without considering any specific feature.
  - Example: $P(\text{spam}) = 0.30$ means that, based on historical data, 30% of all emails are spam, regardless of whether or not they contain the word "discount".
  - This is a class-level probability that doesn't depend on features like the word "discount".
- Posterior Probability: The posterior probability $P(\text{spam} \mid \text{discount})$ is what we calculate using Bayes' Theorem: it tells us the probability that an email is spam after we have observed that the email contains the word "discount."
  - Example: $P(\text{spam} \mid \text{discount}) \approx 0.774$ means that, after seeing the word "discount" in an email, we now believe there is a 77.4% chance that the email is spam.
  - This is a class-level probability updated by evidence (the feature "discount").
- $P(\text{discount} \mid \text{spam})$ is called the likelihood because it tells us how likely we are to observe the feature (the word "discount") given the class (spam).
- The prior probability is the overall probability of spam before seeing any features.
- The posterior probability is the probability of spam after we observe that the word "discount" appears.
In short, likelihood focuses on the probability of features given the class, while prior and posterior probabilities focus on the probability of the class itself.
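To tie the terminology back to the worked "discount" example, the annotated formula below lines up each named quantity with the value computed earlier:
$$\underbrace{P(\text{spam} \mid \text{discount})}_{\text{posterior}\ \approx\ 0.774} = \frac{\overbrace{P(\text{discount} \mid \text{spam})}^{\text{likelihood}\ =\ 0.40} \times \overbrace{P(\text{spam})}^{\text{prior}\ =\ 0.30}}{\underbrace{P(\text{discount})}_{\text{evidence}\ =\ 0.155}}$$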