Covariance, a statistical measure, provides insights into how two variables change together.
Imagine counting green and red apples in 5 different grocery stores, and estimating their mean and variance.
Note: for the explanation that follows, I will use Gene $X$ and Gene $Y$ instead of green apples and red apples.
When measurements are taken in pairs, like the Gene counts from the same cells, we can plot each pair as a single dot, combining the Gene $X$ value on the x-axis with the Gene $Y$ value on the y-axis.
In the left graph, both measurements are less than their respective mean values.
Covariance helps answer: "Do paired measurements reveal something that individual measurements do not?"
In the right graph, a visible trend shows that cells with lower values for Gene $X$ also tend to have lower values for Gene $Y$, and cells with higher values for Gene $X$ tend to have higher values for Gene $Y$.
- The main idea behind covariance is that it can classify three types of relationships:
  - Relationships with positive trends
  - Relationships with negative trends
  - No relationship, because there is no trend
The other main idea behind covariance is that, while it provides insight into the relationship between two variables, it is often not the final metric of interest, because it is sensitive to scale and hard to interpret. It is primarily used as a computational stepping stone to more insightful metrics, such as correlation.
The covariance between two random variables, $X$ and $Y$, is calculated using the following formula:

$$\text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Where:

- $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
- $X_i$ and $Y_i$ are individual data points from the datasets $X$ and $Y$, respectively.
- $\bar{X}$ and $\bar{Y}$ are the means (average values) of $X$ and $Y$, respectively.
- $n$ is the number of data points.
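To make the formula concrete, here is a minimal sketch in Python. The `gene_x` and `gene_y` values are hypothetical (the actual measurements from the example are not listed here); the function implements the sample covariance formula directly.

```python
def covariance(x, y):
    """Sample covariance of two equal-length lists of paired measurements."""
    n = len(x)
    x_bar = sum(x) / n  # mean of X
    y_bar = sum(y) / n  # mean of Y
    # Sum the products of each pair's deviations from the means,
    # then divide by n - 1 (the sample covariance).
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical Gene X and Gene Y counts with a positive trend:
gene_x = [1, 2, 3, 4, 5]
gene_y = [2, 4, 6, 8, 10]
print(covariance(gene_x, gene_y))  # 5.0 — positive, matching the upward trend
```

A positive result here just tells us the trend goes up; as discussed below, the magnitude itself is hard to interpret.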
To get an intuitive sense for how covariance is calculated, let's go back to the mean values for Gene $X$ and Gene $Y$.
Let's plug the Gene $X$ and Gene $Y$ measurements into the formula.
So we see that when the values for Gene $X$ and Gene $Y$ are both above their respective means, or both below them, each term $(X_i - \bar{X})(Y_i - \bar{Y})$ is positive.
Ultimately, we end up with a covariance of 116. Since the covariance value 116 is positive, it means that the slope of the relationship between Gene $X$ and Gene $Y$ is positive.
For a relationship with a negative trend, the same calculation yields a negative covariance value.
For calculating the covariance when there is no trend, consider three cases:
- When every value for Gene $X$ corresponds to the same value for Gene $Y$, the covariance = 0.
- When every value for Gene $Y$ corresponds to the same value for Gene $X$, the covariance = 0.
- Even though there are multiple values for Gene $X$ and Gene $Y$, there is still no trend, because as Gene $X$ increases, Gene $Y$ both increases and decreases. In other words, the negative term for the high point on the left is cancelled out by the positive term for the low point on the left. Thus the covariance is 0.
So we see that covariance = 0 when there is no relationship between Gene $X$ and Gene $Y$.
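As a sketch of the no-trend cases above (using hypothetical values, not the data from the figures), the same sample-covariance calculation returns 0 whenever one variable is constant or the up-and-down products cancel:

```python
def covariance(x, y):
    """Sample covariance of two equal-length lists of paired measurements."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Every Gene X value corresponds to the same Gene Y value:
print(covariance([1, 2, 3, 4], [5, 5, 5, 5]))  # 0.0
# Every Gene Y value corresponds to the same Gene X value:
print(covariance([5, 5, 5, 5], [1, 2, 3, 4]))  # 0.0
# Gene Y goes up and then back down as Gene X increases, so the
# positive and negative terms cancel:
print(covariance([1, 2, 3], [1, 2, 1]))  # 0.0
```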
Note, the covariance value itself isn't very easy to interpret and depends on the context. For example, the covariance value does not tell us if the slope of the line representing the relationship is steep or not steep. It just tells us the slope is positive. More importantly, the covariance value doesn't tell us if the points are relatively close to the dotted line or relatively far from the dotted line. Again, it just tells us that the slope of the relationship is positive.
Even though covariance is hard to interpret, it is a computational stepping stone to more interesting things.
Why is covariance hard to interpret?
Let's go all the way back to looking at just the original Gene $X$ and Gene $Y$ measurements. Then let's calculate the covariance. In this case, the covariance for Gene $X$ and Gene $Y$ is 102.
When we multiply the data by 2, the relative positions of the data did not change, and each dot still falls on the same straight line with positive slope. The only thing that changed was the scale that the data is on. However, when we do the math, we get covariance = 408, which is 4 times what we got before.
Thus, we see that the covariance value changes even when the relationship does not. In other words, covariance values are sensitive to the scale of the data, and this makes them difficult to interpret. The sensitivity to scale also prevents the covariance value from telling us if the data are close to the dotted line that represents the relationship or far from it.
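The scale sensitivity is easy to verify numerically. This sketch uses hypothetical data (not the values from the example) and NumPy's `np.cov`, whose off-diagonal entry is $\text{Cov}(X, Y)$:

```python
import numpy as np

# Hypothetical data falling on a straight line with positive slope.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

cov_original = np.cov(x, y)[0, 1]          # Cov(X, Y)
cov_doubled = np.cov(2 * x, 2 * y)[0, 1]   # Cov(2X, 2Y)

# Doubling both variables multiplies the covariance by 2 * 2 = 4,
# even though the relative positions of the points are unchanged.
print(cov_original, cov_doubled)
```

The relationship between the points is identical in both cases; only the units changed, yet the covariance quadrupled.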
In this example, the covariance on the left, when each point is on the dotted line, is 102, and the covariance on the right, when the data are relatively far from the dotted line, is 381. So in this case, when the data are far from the line, the covariance is larger.
Now, let's just change the scale on the right-hand side and recalculate the covariance, and now the covariance is less for the data that does not fall on the line.
Fortunately, there is something that describes relationships and is not sensitive to the scale of the data: correlation. Calculating covariance is the first step in calculating correlation.
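A quick sketch of that claim, with hypothetical data: NumPy's `np.corrcoef` rescales the covariance by the standard deviations of $X$ and $Y$, so rescaling the data changes the covariance but leaves the correlation untouched.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([2.0, 5.0, 3.0, 9.0, 7.0])

# Covariance changes when the data are rescaled...
print(np.cov(x, y)[0, 1], np.cov(2 * x, 2 * y)[0, 1])
# ...but correlation does not: it divides the covariance by the
# standard deviations of X and Y, cancelling the scale.
print(np.corrcoef(x, y)[0, 1], np.corrcoef(2 * x, 2 * y)[0, 1])
```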
The covariance value itself is difficult to interpret. However, it is useful for calculating correlation and in other computational settings.
Covariance values are used as stepping stones in a wide variety of analyses. For example, covariance values are used in Principal Component Analysis (PCA) and in other settings as computational stepping stones to other, more interesting things.
In summary, the sign of the covariance indicates the nature of the relationship between two variables:

- If $\text{Cov}(X, Y) > 0$, it suggests a positive relationship: as one variable increases, the other tends to increase as well.
- If $\text{Cov}(X, Y) < 0$, it indicates a negative relationship: as one variable increases, the other tends to decrease.
- If $\text{Cov}(X, Y) = 0$, it implies no linear relationship between the variables. However, it's essential to note that a covariance of zero does not necessarily mean there is no relationship; it only means there is no linear relationship.
Consider a case where the points are arranged symmetrically along a curve, for example a parabola: as $X$ increases, $Y$ first decreases and then increases, so the positive and negative terms cancel and the covariance is 0, even though $Y$ clearly depends on $X$.
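As a sketch of such a nonlinear case, here are hypothetical points on a symmetric parabola whose covariance is exactly zero:

```python
import numpy as np

# Points on a symmetric parabola: Y is completely determined by X,
# but the relationship is not linear.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# The positive and negative products cancel, so the covariance is 0
# even though there is a clear (nonlinear) relationship.
print(np.cov(x, y)[0, 1])  # 0.0
```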