Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in using contrasts and model matrix as shown in the tutorial #6

Open
hok-lee opened this issue Jul 28, 2022 · 5 comments
Open

Comments

@hok-lee
Copy link

hok-lee commented Jul 28, 2022

First of all, your tutorial is great. It gave me great insights about the inner working of DESeq2.

However, when I tried to follow the tutorial but did not get the result expected when I used the model matrix. I wrote some sample codes (R markdown and the resulting HTML) to show the disagreement.
https://github.com/hok-lee/error_tutorial_deseq_contrast.git

I am quite puzzled. I hope you could help. Thank you!

@tavareshugo
Copy link
Owner

tavareshugo commented Aug 4, 2022

Thanks for your question @hok-lee and for providing a nicely reproducible example.

I believe you may have missed one important aspect of interpreting the coefficients of models with two factors.
When you do this contrast:

results(dds_age_treatment, contrast = c("age", "30", "20"))

You are comparing the difference between ages 30 and 20 for the reference level of treatment, i.e. for treatment A only.

To define these contrasts with numeric vectors, you should create numeric vectors for every combination of both factors you want to account for:

matrix_model <- model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
age_20_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "20" & dds$treatment == "A", ])
age_30_treatment_a <- colMeans(matrix_model[dds_age_treatment$age == "30" & dds$treatment == "A", ])

results(dds_age_treatment, contrast = age_30_treatment_a - age_20_treatment_a)

The same logic then applies to other comparisons.

The comparison you were doing:

results(dds_age_treatment, contrast = array_for_contrast_30 - array_for_contrast_20 )

Is for the average difference between 30 and 20 across all the levels of treatment.
There's similar question recently on stackoverflow that may help illustrate this (see the second diagram, which illustrates the equivalent of your comparison between the average age difference across both levels of treatment).

I'll leave this issue open, as I may add a note to the materials to highlight this for future readers.

@hok-lee
Copy link
Author

hok-lee commented Aug 4, 2022

Thank you for the detailed explanation. I understand the issue now, it makes a lot more sense with the subsetting of the model matrix on both the age and the treatment condition. I appreciate you taking the time to explain it to me!

@joowkim
Copy link

joowkim commented May 1, 2024

Hi @tavareshugo , I have a quick question about interpreting the coefficient.

"""
The key point to remember about designs with interaction terms is that, unlike for a design ~genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level).

The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype.
"""

This is from Deseq2 vignettes. With interaction terms, the difference between different conditions only represents for the reference level of genotype. Without the interaction terms, it is the overall difference between the different conditions.

The design, design(dds_age_treatment) <- ~age + treatment, doesn't have an interaction term, however, the contrast, results(dds_age_treatment, contrast = c("age", "30", "20")), is for difference between 30 and 20 for treatment A (reference level).

This contradicts what the vignettes says... I'm a bit confused and am making a wild guess that this is because the design is an imbalanced design. (6 samples are age10 and 7 samples are age30)

> model.matrix(design(dds_age_treatment), colData(dds_age_treatment))
         (Intercept) age10 age30 treatmentB
sample1            1     1     0          0
sample2            1     0     1          0
sample3            1     0     0          0
sample4            1     0     1          0
sample5            1     0     1          0
sample6            1     0     1          0
sample7            1     1     0          0
sample8            1     0     0          0
sample9            1     0     0          0
sample10           1     0     1          0
sample11           1     0     0          1
sample12           1     0     1          1
sample13           1     1     0          1
sample14           1     0     0          1
sample15           1     0     0          1
sample16           1     0     1          1
sample17           1     1     0          1
sample18           1     1     0          1
sample19           1     1     0          1
sample20           1     0     0          1

@tavareshugo
Copy link
Owner

Yes, it can be a bit confusing and I might have made it more confusing with my earlier solution.

Anyway, let's take a step back and consider what the terms in the model mean.
Your model is additive only (no interaction): ~ age + treatment

You have the following levels in each variable:

  • age with 3 levels: 20 (ref), 10 and 30
  • treatment with 2 levels: A (ref) and B

So, your model will have the following coefficients:

  • Intercept = the average for the reference group A/20
  • age10 = the difference between age10 and age20
  • age30 = the difference between age30 and age20
  • treatmentB = the difference between treatmentB and treatmentA

Now, because there is no interaction, the coefficient age10 is assumed to be the same for both treatment groups. I.e. regardless of whether you're in treatmentA or treatmentB, the difference age10 - age20 is assumed to be the same. The same goes for age30.

Equally, the coefficient treatmentB is also assumed to be the same, regardless of age.

If that assumption doesn't seem right to you, then you should fit a model with an interaction.
At which point you will have more coefficients and therefore more possible contrasts.

@joowkim
Copy link

joowkim commented May 3, 2024

Thank you for taking the time to explain it to me. I should probably review your tutorial materials as well as my linear regression book again. Thanks again and have a great weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants