
Commit bfbab69

Adding g-formula stochastic treatments
1 parent 5fadacc commit bfbab69

2 files changed (+275, -0 lines changed)

@@ -0,0 +1,208 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Parametric g-formula: stochastic interventions\n",
8+
"In the previous tutorial we went over the basics of the parametric g-formula using `TimeFixedGFormula` for basic interventions. Additionally, we can use the g-formula to look at stochastic interventions. Stochastic interventions are treatment plans under which not necessarily everyone is treated, but some random percentage are treated.\n",
9+
"\n",
10+
"To estimate the g-formula for stochastic treatments, the process is fairly similar. However, instead of treating everyone, some percentage are treated. A random percentage are treated and then $\\hat{Y_i^a}$ are predicted and averaged. This process is repeated some number times and the average of the averaged potential outcomes is returned.\n",
11+
"\n",
12+
"For our example, we will return to the previous data set on ART among HIV-infected individuals and all-cause mortality. First, we will load the data (again ignoring missing data)"
13+
]
14+
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": 2,
18+
"metadata": {},
19+
"outputs": [
20+
{
21+
"name": "stdout",
22+
"output_type": "stream",
23+
"text": [
24+
"<class 'pandas.core.frame.DataFrame'>\n",
25+
"Int64Index: 517 entries, 0 to 546\n",
26+
"Data columns (total 9 columns):\n",
27+
"id 517 non-null int64\n",
28+
"male 517 non-null int64\n",
29+
"age0 517 non-null int64\n",
30+
"cd40 517 non-null int64\n",
31+
"dvl0 517 non-null int64\n",
32+
"art 517 non-null int64\n",
33+
"dead 517 non-null float64\n",
34+
"t 517 non-null float64\n",
35+
"cd4_wk45 430 non-null float64\n",
36+
"dtypes: float64(3), int64(6)\n",
37+
"memory usage: 40.4 KB\n"
38+
]
39+
}
40+
],
41+
"source": [
42+
"import numpy as np\n",
43+
"import pandas as pd\n",
44+
"\n",
45+
"from zepid import load_sample_data, spline\n",
46+
"from zepid.causal.gformula import TimeFixedGFormula\n",
47+
"\n",
48+
"df = load_sample_data(timevary=False)\n",
49+
"dfs = df.dropna(subset=['dead']).copy()\n",
50+
"dfs.info()\n",
51+
"\n",
52+
"dfs[['cd4_rs1', 'cd4_rs2']] = spline(dfs, 'cd40', n_knots=3, term=2, restricted=True)\n",
53+
"dfs[['age_rs1', 'age_rs2']] = spline(dfs, 'age0', n_knots=3, term=2, restricted=True)"
54+
]
55+
},
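{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the estimation procedure described above, the next cell sketches the stochastic g-formula loop by hand. This is a minimal sketch, not the internal implementation of `TimeFixedGFormula`; the `outcome_model` object, its `.predict()` method, and the default of 100 repetitions are assumptions used only for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only -- zEpid performs these steps internally.\n",
"# Assumes `outcome_model` is a fitted regression model with a .predict()\n",
"# method that accepts a DataFrame (e.g., a statsmodels formula-API result).\n",
"def stochastic_g_formula(data, outcome_model, p, n_iterations=100, seed=None):\n",
"    rng = np.random.RandomState(seed)\n",
"    averaged_outcomes = []\n",
"    for _ in range(n_iterations):\n",
"        d = data.copy()\n",
"        # randomly treat approximately p*100% of individuals\n",
"        d['art'] = rng.binomial(n=1, p=p, size=d.shape[0])\n",
"        # predict potential outcomes under the random assignment, then average\n",
"        averaged_outcomes.append(np.mean(outcome_model.predict(d)))\n",
"    # return the average of the averaged potential outcomes\n",
"    return np.mean(averaged_outcomes)"
]
},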
56+
{
57+
"cell_type": "markdown",
58+
"metadata": {},
59+
"source": [
60+
"Similar to the previous tutorial, we initialize the `TimeFixedGFormula` with the data set (`dfs`), our treatment variable (`art`), and binary outcome (`dead`). Then we fit a regression model predicting all-cause mortality as a function of ART and our set of confounding variables (age, CD4 T-cell count, detectable viral load, gender)"
61+
]
62+
},
63+
{
64+
"cell_type": "code",
65+
"execution_count": 3,
66+
"metadata": {},
67+
"outputs": [
68+
{
69+
"name": "stdout",
70+
"output_type": "stream",
71+
"text": [
72+
" Generalized Linear Model Regression Results \n",
73+
"==============================================================================\n",
74+
"Dep. Variable: dead No. Observations: 517\n",
75+
"Model: GLM Df Residuals: 507\n",
76+
"Model Family: Binomial Df Model: 9\n",
77+
"Link Function: logit Scale: 1.0000\n",
78+
"Method: IRLS Log-Likelihood: -202.83\n",
79+
"Date: Mon, 11 Mar 2019 Deviance: 405.67\n",
80+
"Time: 07:08:33 Pearson chi2: 534.\n",
81+
"No. Iterations: 6 Covariance Type: nonrobust\n",
82+
"==============================================================================\n",
83+
" coef std err z P>|z| [0.025 0.975]\n",
84+
"------------------------------------------------------------------------------\n",
85+
"Intercept -3.9822 2.621 -1.520 0.129 -9.119 1.154\n",
86+
"art -0.7278 0.393 -1.854 0.064 -1.497 0.042\n",
87+
"male -0.0773 0.334 -0.231 0.817 -0.732 0.578\n",
88+
"age0 0.1548 0.092 1.689 0.091 -0.025 0.334\n",
89+
"age_rs1 -0.0059 0.004 -1.493 0.135 -0.014 0.002\n",
90+
"age_rs2 0.0129 0.006 2.035 0.042 0.000 0.025\n",
91+
"cd40 -0.0121 0.004 -3.028 0.002 -0.020 -0.004\n",
92+
"cd4_rs1 1.887e-05 1.19e-05 1.581 0.114 -4.52e-06 4.23e-05\n",
93+
"cd4_rs2 -3.866e-05 4.57e-05 -0.846 0.398 -0.000 5.09e-05\n",
94+
"dvl0 -0.1254 0.398 -0.315 0.753 -0.905 0.654\n",
95+
"==============================================================================\n"
96+
]
97+
}
98+
],
99+
"source": [
100+
"g = TimeFixedGFormula(dfs, exposure='art', outcome='dead')\n",
101+
"g.outcome_model(model='art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')"
102+
]
103+
},
104+
{
105+
"cell_type": "markdown",
106+
"metadata": {},
107+
"source": [
108+
"However, this time we do some backgound research and find that one potential intervention to increase ART prescriptions increases the probability of ART treatment to 80%. As a result, it is potentially misleading to compare to compare the treat-all vs treat-none scenarios. Instead, we will compare the stochastic treatment where 80% of individuals are treated with ART to the scenario where no one is treated.\n",
109+
"\n",
110+
"## Stochastic Treatment Plans\n",
111+
"To do this using `TimeFixedGFormula` we will instead call `fit_stochastic()` function instead of `fit()`. This function allows us to estimate a stochastic treatment. We specify `p=0.8` to have 80% of the population treated at random. By default, `fit_stochastic()` repeats this process 100 times and takes the average of these repeated random treatments. I will also use the `seed` argument to get replicable results. Let's look at the example"
112+
]
113+
},
114+
{
115+
"cell_type": "code",
116+
"execution_count": 7,
117+
"metadata": {},
118+
"outputs": [
119+
{
120+
"name": "stdout",
121+
"output_type": "stream",
122+
"text": [
123+
"RD: -0.06041404870415\n"
124+
]
125+
}
126+
],
127+
"source": [
128+
"g.fit_stochastic(p=0.8, seed=1000191)\n",
129+
"r_80 = g.marginal_outcome\n",
130+
"\n",
131+
"g.fit(treatment='none')\n",
132+
"r_none = g.marginal_outcome\n",
133+
"\n",
134+
"print('RD:', r_80 - r_none)"
135+
]
136+
},
137+
{
138+
"cell_type": "markdown",
139+
"metadata": {},
140+
"source": [
141+
"Under the treatment plan where 80% of people are randomly treated, the risk of all-cause mortality would have been 6.0% points lower than if no one was treated. \n",
142+
"\n",
143+
"After reading some more articles, we find an alternative treatment plan. Under this plan, 75% of men and 90% of women start using HIV. For this plan, we are interested in a conditional stochastic treatment. Again, we want to compare this to the scenario where no one is treated\n",
144+
"\n",
145+
"## Conditional Stochastic Treatment Plans\n",
146+
"For conditionally stochastic treatments, we instead provide `p` a list of probabilities. Additionally, we specify the `conditional` argument with the group restrictions. Again, we will need to use the magic-g functionality. Below is the example of the stochastic plan where 75% of men are treated and 90% of women"
147+
]
148+
},
149+
{
150+
"cell_type": "code",
151+
"execution_count": 9,
152+
"metadata": {},
153+
"outputs": [
154+
{
155+
"name": "stdout",
156+
"output_type": "stream",
157+
"text": [
158+
"RD: -0.058656195525173926\n"
159+
]
160+
}
161+
],
162+
"source": [
163+
"g.fit_stochastic(p=[0.75, 0.90], conditional=[\"g['male']==1\", \"g['male']==0\"], seed=518012)\n",
164+
"r_cs = g.marginal_outcome\n",
165+
"\n",
166+
"print('RD:', r_cs - r_none)"
167+
]
168+
},
169+
{
170+
"cell_type": "markdown",
171+
"metadata": {},
172+
"source": [
173+
"Under the treatment plan where 75% of men and 90% of women are randomly treated, the risk of all-cause mortality would have been 5.9% points lower than if no one was treated. This plan reduces the marginal mortality less than the previous stochastic plan because our HIV-infected population is predominantly men. \n",
174+
"\n",
175+
"# Conclusion\n",
176+
"In this tutorial, I detailed stochastic treatment plans using the g-formula. While presented for a binary outcome, the same procedure can also be used to estimate stochastic treatments for continuous outcomes. Please view other tutorials for information other functions in *zEpid*\n",
177+
"\n",
178+
"## Further Readings\n",
179+
"Ahern et al. (2016). Predicting the population health impacts of community interventions: the case of alcohol outlets and binge drinking. *AJPH*, 106(11), 1938-1943.\n",
180+
"\n",
181+
"Snowden et al. (2011) \"Implementation of G-computation on a simulated data set: demonstration of a causal inference technique.\" *AJE* 173.7: 731-738.\n",
182+
"\n",
183+
"Robins. (1986) \"A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect.\" *Mathematical modelling* 7.9-12: 1393-1512"
184+
]
185+
}
186+
],
187+
"metadata": {
188+
"kernelspec": {
189+
"display_name": "Python 3",
190+
"language": "python",
191+
"name": "python3"
192+
},
193+
"language_info": {
194+
"codemirror_mode": {
195+
"name": "ipython",
196+
"version": 3
197+
},
198+
"file_extension": ".py",
199+
"mimetype": "text/x-python",
200+
"name": "python",
201+
"nbconvert_exporter": "python",
202+
"pygments_lexer": "ipython3",
203+
"version": "3.6.3"
204+
}
205+
},
206+
"nbformat": 4,
207+
"nbformat_minor": 2
208+
}
@@ -0,0 +1,67 @@
1+
Throughout the following tutorials in this branch, we will make the following identifiability assumptions.
2+
We will additionally assume no measurement error, no selection bias, and no interference.
3+
4+
# Assumptions
5+
6+
## Conditional Exchangeability
7+
Conditional exchangeability is the assumption that potential outcomes are independent of the treatment received
8+
conditional on some set of covariates. Using causal diagrams, this amounts to no open backdoor paths between the
9+
treatment and outcome. See the further reading list for publications on the assumption of conditional exchangeability
10+
and introductions to two different approaches to causal diagrams (directed acyclic graphs (DAG) and single-world
11+
intervention graphs (SWIG))
12+
13+
### Further Reading
14+
Hernán MA, Robins JM. (2006). Estimating causal effects from epidemiological data. *Journal of Epidemiology
15+
& Community Health*, 60(7), 578-586.
16+
17+
Greenland S, Pearl J, Robins JM. (1999). Causal diagrams for epidemiologic research. *Epidemiology*, 10, 37-48.
18+
19+
Richardson TS, Robins JM. (2013). Single world intervention graphs: a primer. *In Second UAI workshop on
20+
causal structure learning*, Bellevue, Washington.
21+
22+
Breskin A, Cole SR, Hudgens MG. (2018). A practical example demonstrating the utility of single-world
23+
intervention graphs. *Epidemiology*, 29(3), e20-e21.
24+
25+
## Positivity
26+
The positivity assumption is that there are treated and untreated individuals at every combination of covariates. There
27+
are two types of positivity violations: deterministic and random. Deterministic positivity violations cannot be resolved
28+
by collecting additional data. For an example of a deterministic positivity violation, consider the risk of death
29+
by hysterectomy. Since men lack a uterus, they are unable to receive a hysterectomy. Random positivity violations
30+
occur as a result of finite samples. In a small sample, it may just occur that we didn't observe anyone treated between
31+
ages 32-35. It isn't that no one could have been treated in that age group; we simply didn't observe it in our sample.
32+
For these scenarios, we will assume that our statistical model correctly interpolates over these areas (often a
33+
strong assumption in small data sets)
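
As a rough illustration (this snippet is not part of the original tutorials and the column names are hypothetical), one quick way to screen for random positivity violations is to cross-tabulate treatment against strata of a covariate and look for empty cells:

```python
import pandas as pd

# hypothetical data with a binary treatment and a continuous covariate
df = pd.DataFrame({'treatment': [0, 1, 0, 1, 0, 1, 0],
                   'age':       [31, 33, 34, 45, 52, 47, 38]})

# cross-tabulate treatment by age strata; cells with zero counts flag
# regions of the covariate space where no one was (un)treated in the sample
age_strata = pd.cut(df['age'], bins=[30, 35, 40, 45, 50, 55])
print(pd.crosstab(age_strata, df['treatment']))
```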
34+
35+
### Further Reading
36+
Westreich D, Cole SR. (2010). Invited commentary: positivity in practice. *American Journal of Epidemiology*,
37+
171(6), 674-677.
38+
39+
Cole SR, Hernán MA. (2008). Constructing inverse probability weights for marginal structural models.
40+
*American Journal of Epidemiology*, 168(6), 656-664.
41+
42+
## Causal Consistency
43+
Causal consistency is also referred to as treatment variation irrelevance. Under this assumption, either there
44+
is only one version of treatment (consistency) or any remaining differences between treatments are irrelevant
45+
(treatment variation irrelevance). For example, consider a study on 200mg daily aspirin and all-cause mortality. In our
46+
study, we may be willing to assume that taking aspirin in the morning versus at night is irrelevant to all-cause
47+
mortality. This is an example of assuming treatment variation irrelevance. Generally, defining the treatment more
48+
precisely can resolve this issue. There are also some additional approaches. I recommend reviewing the
49+
readings below for further discussion
50+
51+
### Further Reading
52+
Cole SR, Frangakis CE. (2009). The consistency statement in causal inference: a definition or an assumption?.
53+
*Epidemiology*, 20(1), 3-5.
54+
55+
VanderWeele TJ. (2009). Concerning the consistency assumption in causal inference. *Epidemiology*, 20(6), 880-883.
56+
57+
VanderWeele TJ. (2018). On well-defined hypothetical interventions in the potential outcomes framework.
58+
*Epidemiology*, 29(4), e24-e25.
59+
60+
## Correctly specified model
61+
Since we will be working with continuous and high-dimensional data, we will be using parametric regression models.
62+
We assume that these models are correctly specified. To make less restrictive assumptions regarding the functional
63+
forms of continuous variables, we will use splines throughout. Please refer to the Data Basics tutorial for an intro to
64+
using splines with *zEpid*
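
For example (a brief sketch using the same sample data as the other tutorials), restricted quadratic splines with three knots can be generated with zEpid's `spline` function:

```python
from zepid import load_sample_data, spline

df = load_sample_data(timevary=False)
# restricted quadratic spline terms (3 knots) for baseline CD4 T-cell count
df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)
```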
65+
66+
Additionally, we will sometimes use machine learning approaches to relax this assumption further (see the TMLE tutorials
67+
for some examples)
