Add DE set comparisons (#489)

Zethson · web-flow · commit ec4fb3a0f954 · 2024-01-06T13:37:19.000+01:00
* Add DE set comparisons

Signed-off-by: zethson &lt;lukas.heumos@posteo.net&gt;

* Add docs

Signed-off-by: zethson &lt;lukas.heumos@posteo.net&gt;

---------

Signed-off-by: zethson &lt;lukas.heumos@posteo.net&gt;
diff --git a/docs/usage/usage.md b/docs/usage/usage.md
@@ -80,10 +80,6 @@ Simple functions for:
 
 Assigning guides based on thresholds. Each cell is assigned to the most expressed gRNA if it has at least the specified number of counts.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: preprocessing
@@ -111,6 +107,20 @@ ga.plot_heatmap(gdo, layer="assigned_guides")
 
 ## Tools
 
+### Differential gene expression
+
+Differential gene expression involves the quantitative comparison of gene expression levels between two or more groups,
+such as different cell types, tissues, or conditions to discern genes that are significantly up- or downregulated in response to specific biological contexts or stimuli.
+Pertpy provides utilities to conduct differential gene expression tests through a common interface that supports complex designs and methods.
+
+```{eval-rst}
+.. autosummary::
+    :toctree: preprocessing
+    :nosignatures:
+
+    tools.DifferentialGeneExpression
+```
+
 ### Pooled CRISPR screens
 
 #### Mixscape
@@ -122,10 +132,6 @@ Mixscape first tries to remove confounding sources of variation such as cell cyc
 Next, it determines which targeted cells were affected by the genetic perturbation (=KO) and which targeted cells were not (=NP) with the use of mixture models.
 Finally, it visualizes similarities and differences across different perturbations.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -155,10 +161,6 @@ See [mixscape tutorial](https://pertpy.readthedocs.io/en/latest/tutorials/notebo
 A Python implementation of Milo for differential abundance testing on KNN graphs, to ease interoperability with scverse pipelines for single-cell analysis.
 See [Differential abundance testing on single-cell data using k-nearest neighbor graphs](https://www.nature.com/articles/s41587-021-01033-z) for details on the statistical framework.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -195,10 +197,6 @@ milo.da_nhoods(mdata, design="~Status")
 Reimplementation of scCODA for identification of compositional changes in high-throughput sequencing count data and tascCODA for sparse, tree-aggregated modeling of high-throughput sequencing data.
 See [scCODA is a Bayesian model for compositional single-cell data analysis](https://www.nature.com/articles/s41467-021-27150-6) for statistical methodology and benchmarking performance of scCODA and [tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data](https://www.frontiersin.org/articles/10.3389/fgene.2021.766405/full) for statistical methodology and benchmarking performance of tascCODA.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -246,10 +244,6 @@ sccoda.plot_effects_barplot(
 A **work in progress (!)** Python implementation of DIALOGUE for the discovery of multicellular programs.
 See [DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data](https://www.nature.com/articles/s41587-022-01288-0) for more details on the methodology.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -286,10 +280,6 @@ all_results, new_mcps = dl.multilevel_modeling(
 
 #### Enrichment
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -312,10 +302,6 @@ pt_enricher.score(adata)
 General purpose functions for distances and permutation tests.
 Reimplements functions from [scperturb](http://projects.sanderlab.org/scperturb/) package.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -375,10 +361,6 @@ results["summary_metrics"]
 
 See [augur tutorial](https://pertpy.readthedocs.io/en/latest/tutorials/notebooks/augur.html) for a more elaborate tutorial.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -392,10 +374,6 @@ See [augur tutorial](https://pertpy.readthedocs.io/en/latest/tutorials/notebooks
 Reimplementation of scGen for perturbation response prediction of scRNA-seq data in Jax.
 See [scGen predicts single-cell perturbation responses](https://www.nature.com/articles/s41592-019-0494-8) for more details.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -435,10 +413,6 @@ CINEMA-OT separates confounding sources of variation from perturbation effects t
 These cell pairs represent causal perturbation responses permitting a number of novel analyses, such as individual treatment effect analysis, response clustering, attribution analysis, and synergy analysis.
 See [Causal identification of single-cell experimental perturbation effects with CINEMA-OT](https://www.biorxiv.org/content/10.1101/2022.07.31.502173v3.abstract) for more details.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -473,10 +447,6 @@ See [CINEMA-OT tutorial](https://pertpy.readthedocs.io/en/latest/tutorials/noteb
 
 Various modules for calculating and evaluating perturbation spaces.
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: tools
@@ -533,10 +503,6 @@ Available databases for mechanism of action metadata:
 
 -   [CLUE](https://clue.io/)
 
-```{eval-rst}
-.. currentmodule:: pertpy
-```
-
 ```{eval-rst}
 .. autosummary::
     :toctree: metadata
diff --git a/pertpy/tools/_differential_gene_expression.py b/pertpy/tools/_differential_gene_expression.py
@@ -5,9 +5,10 @@
 import decoupler as dc
 import numpy as np
 import numpy.typing as npt
+import pandas as pd
+from scipy.stats import kendalltau, pearsonr, spearmanr
 
 if TYPE_CHECKING:
-    import pandas as pd
     from anndata import AnnData
 
 
@@ -143,6 +144,78 @@ def filter_by_prop(self, adata: AnnData, min_prop: float = 0.2, min_samples: int
 
         return filtered_adata
 
+    def calculate_correlation(
+        self,
+        de_res_1: pd.DataFrame,
+        de_res_2: pd.DataFrame,
+        method: Literal["spearman", "pearson", "kendall-tau"] = "spearman",
+    ) -> pd.DataFrame:
+        """Calculate the Spearman correlation coefficient for 'pvals_adj' and 'logfoldchanges' columns.
+
+        Args:
+            de_res_1: A DataFrame with DE result columns.
+            de_res_2: Another DataFrame with the same DE result columns.
+            method: The correlation method to apply. One of `spearman`, `pearson`, `kendall-tau`.
+                    Defaults to `spearman`.
+
+        Returns:
+            A DataFrame with the Spearman correlation coefficients for 'pvals_adj' and 'logfoldchanges'.
+        """
+        columns_of_interest = ["pvals_adj", "logfoldchanges"]
+        correlation_data = {}
+        for col in columns_of_interest:
+            match method:
+                case "spearman":
+                    correlation, _ = spearmanr(de_res_1[col], de_res_2[col])
+                case "pearson":
+                    correlation, _ = pearsonr(de_res_1[col], de_res_2[col])
+                case "kendall-tau":
+                    correlation, _ = kendalltau(de_res_1[col], de_res_2[col])
+                case _:
+                    raise ValueError("Unknown correlation method.")
+            correlation_data[col] = correlation
+
+        return pd.DataFrame([correlation_data], columns=columns_of_interest)
+
+    def calculate_jaccard_index(self, de_res_1: pd.DataFrame, de_res_2: pd.DataFrame, threshold: float = 0.05) -> float:
+        """Calculate the Jaccard index for sets of significantly expressed genes/features based on a p-value threshold.
+
+        Args:
+            de_res_1: A DataFrame with DE result columns, including 'pvals'.
+            de_res_2: Another DataFrame with the same DE result columns.
+            threshold: A threshold for determining significant expression (default is 0.05).
+
+        Returns:
+            The Jaccard index.
+        """
+        significant_set_1 = set(de_res_1[de_res_1["pvals"] <= threshold].index)
+        significant_set_2 = set(de_res_2[de_res_2["pvals"] <= threshold].index)
+
+        intersection = significant_set_1.intersection(significant_set_2)
+        union = significant_set_1.union(significant_set_2)
+
+        return len(intersection) / len(union) if union else 0
+
+    def calculate_cohens_d(self, de_res_1: pd.DataFrame, de_res_2: pd.DataFrame) -> pd.Series:
+        """Calculate Cohen's D for the logfoldchanges.
+
+        Args:
+            de_res_1: A DataFrame with DE result columns, including 'logfoldchanges'.
+            de_res_2: Another DataFrame with the same DE result columns.
+
+        Returns:
+            A pandas Series containing Cohen's D for each gene/feature.
+        """
+        means_1 = de_res_1["logfoldchanges"].mean()
+        means_2 = de_res_2["logfoldchanges"].mean()
+        sd_1 = de_res_1["logfoldchanges"].std()
+        sd_2 = de_res_2["logfoldchanges"].std()
+
+        pooled_sd = np.sqrt((sd_1**2 + sd_2**2) / 2)
+        cohens_d = (means_1 - means_2) / pooled_sd
+
+        return cohens_d
+
     def de_analysis(
         self,
         adata: AnnData,
diff --git a/tests/tools/test_differential_gene_expression.py b/tests/tools/test_differential_gene_expression.py
@@ -0,0 +1,57 @@
+import pandas as pd
+import pertpy as pt
+import pytest
+
+
+@pytest.fixture
+def dummy_de_results():
+    data1 = {"pvals": [0.1, 0.2, 0.3, 0.4], "pvals_adj": [0.1, 0.25, 0.35, 0.45], "logfoldchanges": [1, 2, 3, 4]}
+    data2 = {"pvals": [0.1, 0.2, 0.3, 0.4], "pvals_adj": [0.15, 0.2, 0.35, 0.5], "logfoldchanges": [2, 3, 4, 5]}
+    de_res_1 = pd.DataFrame(data1)
+    de_res_2 = pd.DataFrame(data2)
+
+    return de_res_1, de_res_2
+
+
+@pytest.fixture
+def pt_de():
+    pt_de = pt.tl.DifferentialGeneExpression()
+    return pt_de
+
+
+def test_calculate_spearman_correlation(dummy_de_results, pt_de):
+    de_res_1, de_res_2 = dummy_de_results
+
+    result = pt_de.calculate_correlation(de_res_1, de_res_2, method="spearman")
+    assert result.shape == (1, 2)
+    assert all(column in result for column in ["pvals_adj", "logfoldchanges"])
+
+
+def test_calculate_pearson_correlation(dummy_de_results, pt_de):
+    de_res_1, de_res_2 = dummy_de_results
+
+    result = pt_de.calculate_correlation(de_res_1, de_res_2, method="pearson")
+    assert result.shape == (1, 2)
+    assert all(column in result for column in ["pvals_adj", "logfoldchanges"])
+
+
+def test_calculate_kendall_tau__correlation(dummy_de_results, pt_de):
+    de_res_1, de_res_2 = dummy_de_results
+
+    result = pt_de.calculate_correlation(de_res_1, de_res_2, method="kendall-tau")
+    assert result.shape == (1, 2)
+    assert all(column in result for column in ["pvals_adj", "logfoldchanges"])
+
+
+def test_jaccard_index(dummy_de_results, pt_de):
+    de_res_1, de_res_2 = dummy_de_results
+
+    jaccard_index = pt_de.calculate_jaccard_index(de_res_1, de_res_2)
+    assert 0 <= jaccard_index <= 1
+
+
+def test_calculate_cohens_d(dummy_de_results, pt_de):
+    de_res_1, de_res_2 = dummy_de_results
+
+    cohens_d = pt_de.calculate_cohens_d(de_res_1, de_res_2)
+    assert isinstance(cohens_d, float)