
Commit bfbab69

Adding g-formula stochastic treatments
1 parent 5fadacc commit bfbab69

2 files changed (+275, -0 lines changed)

@@ -0,0 +1,208 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Parametric g-formula: stochastic interventions\n",
8+
"In the previous tutorial we went over the basics of the parametric g-formula using `TimeFixedGFormula` for basic interventions. Additionally, we can use the g-formula to look at stochastic interventions. Stochastic interventions are treatment plans under which not necessarily everyone is treated, but some random percentage are treated.\n",
9+
"\n",
10+
"To estimate the g-formula for stochastic treatments, the process is fairly similar. However, instead of treating everyone, some percentage are treated. A random percentage are treated and then $\\hat{Y_i^a}$ are predicted and averaged. This process is repeated some number times and the average of the averaged potential outcomes is returned.\n",
11+
"\n",
12+
"For our example, we will return to the previous data set on ART among HIV-infected individuals and all-cause mortality. First, we will load the data (again ignoring missing data)"
13+
]
14+
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": 2,
18+
"metadata": {},
19+
"outputs": [
20+
{
21+
"name": "stdout",
22+
"output_type": "stream",
23+
"text": [
24+
"<class 'pandas.core.frame.DataFrame'>\n",
25+
"Int64Index: 517 entries, 0 to 546\n",
26+
"Data columns (total 9 columns):\n",
27+
"id 517 non-null int64\n",
28+
"male 517 non-null int64\n",
29+
"age0 517 non-null int64\n",
30+
"cd40 517 non-null int64\n",
31+
"dvl0 517 non-null int64\n",
32+
"art 517 non-null int64\n",
33+
"dead 517 non-null float64\n",
34+
"t 517 non-null float64\n",
35+
"cd4_wk45 430 non-null float64\n",
36+
"dtypes: float64(3), int64(6)\n",
37+
"memory usage: 40.4 KB\n"
38+
]
39+
}
40+
],
41+
"source": [
42+
"import numpy as np\n",
43+
"import pandas as pd\n",
44+
"\n",
45+
"from zepid import load_sample_data, spline\n",
46+
"from zepid.causal.gformula import TimeFixedGFormula\n",
47+
"\n",
48+
"df = load_sample_data(timevary=False)\n",
49+
"dfs = df.dropna(subset=['dead']).copy()\n",
50+
"dfs.info()\n",
51+
"\n",
52+
"dfs[['cd4_rs1', 'cd4_rs2']] = spline(dfs, 'cd40', n_knots=3, term=2, restricted=True)\n",
53+
"dfs[['age_rs1', 'age_rs2']] = spline(dfs, 'age0', n_knots=3, term=2, restricted=True)"
54+
]
55+
},
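{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the estimation procedure described above, the next cell sketches the stochastic g-formula loop by hand. This is a minimal sketch, not the internal implementation of `TimeFixedGFormula`; the `outcome_model` object, its `.predict()` method, and the default of 100 repetitions are assumptions used only for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only -- zEpid performs these steps internally.\n",
"# Assumes `outcome_model` is a fitted regression model with a .predict()\n",
"# method that accepts a DataFrame (e.g., a statsmodels formula-API result).\n",
"def stochastic_g_formula(data, outcome_model, p, n_iterations=100, seed=None):\n",
"    rng = np.random.RandomState(seed)\n",
"    averaged_outcomes = []\n",
"    for _ in range(n_iterations):\n",
"        d = data.copy()\n",
"        # randomly treat approximately p*100% of individuals\n",
"        d['art'] = rng.binomial(n=1, p=p, size=d.shape[0])\n",
"        # predict potential outcomes under the random assignment, then average\n",
"        averaged_outcomes.append(np.mean(outcome_model.predict(d)))\n",
"    # return the average of the averaged potential outcomes\n",
"    return np.mean(averaged_outcomes)"
]
},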
56+
{
57+
"cell_type": "markdown",
58+
"metadata": {},
59+
"source": [
60+
"Similar to the previous tutorial, we initialize the `TimeFixedGFormula` with the data set (`dfs`), our treatment variable (`art`), and binary outcome (`dead`). Then we fit a regression model predicting all-cause mortality as a function of ART and our set of confounding variables (age, CD4 T-cell count, detectable viral load, gender)"
61+
]
62+
},
63+
{
64+
"cell_type": "code",
65+
"execution_count": 3,
66+
"metadata": {},
67+
"outputs": [
68+
{
69+
"name": "stdout",
70+
"output_type": "stream",
71+
"text": [
72+
" Generalized Linear Model Regression Results \n",
73+
"==============================================================================\n",
74+
"Dep. Variable: dead No. Observations: 517\n",
75+
"Model: GLM Df Residuals: 507\n",
76+
"Model Family: Binomial Df Model: 9\n",
77+
"Link Function: logit Scale: 1.0000\n",
78+
"Method: IRLS Log-Likelihood: -202.83\n",
79+
"Date: Mon, 11 Mar 2019 Deviance: 405.67\n",
80+
"Time: 07:08:33 Pearson chi2: 534.\n",
81+
"No. Iterations: 6 Covariance Type: nonrobust\n",
82+
"==============================================================================\n",
83+
" coef std err z P>|z| [0.025 0.975]\n",
84+
"------------------------------------------------------------------------------\n",
85+
"Intercept -3.9822 2.621 -1.520 0.129 -9.119 1.154\n",
86+
"art -0.7278 0.393 -1.854 0.064 -1.497 0.042\n",
87+
"male -0.0773 0.334 -0.231 0.817 -0.732 0.578\n",
88+
"age0 0.1548 0.092 1.689 0.091 -0.025 0.334\n",
89+
"age_rs1 -0.0059 0.004 -1.493 0.135 -0.014 0.002\n",
90+
"age_rs2 0.0129 0.006 2.035 0.042 0.000 0.025\n",
91+
"cd40 -0.0121 0.004 -3.028 0.002 -0.020 -0.004\n",
92+
"cd4_rs1 1.887e-05 1.19e-05 1.581 0.114 -4.52e-06 4.23e-05\n",
93+
"cd4_rs2 -3.866e-05 4.57e-05 -0.846 0.398 -0.000 5.09e-05\n",
94+
"dvl0 -0.1254 0.398 -0.315 0.753 -0.905 0.654\n",
95+
"==============================================================================\n"
96+
]
97+
}
98+
],
99+
"source": [
100+
"g = TimeFixedGFormula(dfs, exposure='art', outcome='dead')\n",
101+
"g.outcome_model(model='art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')"
102+
]
103+
},
104+
{
105+
"cell_type": "markdown",
106+
"metadata": {},
107+
"source": [
108+
"However, this time we do some backgound research and find that one potential intervention to increase ART prescriptions increases the probability of ART treatment to 80%. As a result, it is potentially misleading to compare to compare the treat-all vs treat-none scenarios. Instead, we will compare the stochastic treatment where 80% of individuals are treated with ART to the scenario where no one is treated.\n",
109+
"\n",
110+
"## Stochastic Treatment Plans\n",
111+
"To do this using `TimeFixedGFormula` we will instead call `fit_stochastic()` function instead of `fit()`. This function allows us to estimate a stochastic treatment. We specify `p=0.8` to have 80% of the population treated at random. By default, `fit_stochastic()` repeats this process 100 times and takes the average of these repeated random treatments. I will also use the `seed` argument to get replicable results. Let's look at the example"
112+
]
113+
},
114+
{
115+
"cell_type": "code",
116+
"execution_count": 7,
117+
"metadata": {},
118+
"outputs": [
119+
{
120+
"name": "stdout",
121+
"output_type": "stream",
122+
"text": [
123+
"RD: -0.06041404870415\n"
124+
]
125+
}
126+
],
127+
"source": [
128+
"g.fit_stochastic(p=0.8, seed=1000191)\n",
129+
"r_80 = g.marginal_outcome\n",
130+
"\n",
131+
"g.fit(treatment='none')\n",
132+
"r_none = g.marginal_outcome\n",
133+
"\n",
134+
"print('RD:', r_80 - r_none)"
135+
]
136+
},
137+
{
138+
"cell_type": "markdown",
139+
"metadata": {},
140+
"source": [
141+
"Under the treatment plan where 80% of people are randomly treated, the risk of all-cause mortality would have been 6.0% points lower than if no one was treated. \n",
142+
"\n",
143+
"After reading some more articles, we find an alternative treatment plan. Under this plan, 75% of men and 90% of women start using HIV. For this plan, we are interested in a conditional stochastic treatment. Again, we want to compare this to the scenario where no one is treated\n",
144+
"\n",
145+
"## Conditional Stochastic Treatment Plans\n",
146+
"For conditionally stochastic treatments, we instead provide `p` a list of probabilities. Additionally, we specify the `conditional` argument with the group restrictions. Again, we will need to use the magic-g functionality. Below is the example of the stochastic plan where 75% of men are treated and 90% of women"
147+
]
148+
},
149+
{
150+
"cell_type": "code",
151+
"execution_count": 9,
152+
"metadata": {},
153+
"outputs": [
154+
{
155+
"name": "stdout",
156+
"output_type": "stream",
157+
"text": [
158+
"RD: -0.058656195525173926\n"
159+
]
160+
}
161+
],
162+
"source": [
163+
"g.fit_stochastic(p=[0.75, 0.90], conditional=[\"g['male']==1\", \"g['male']==0\"], seed=518012)\n",
164+
"r_cs = g.marginal_outcome\n",
165+
"\n",
166+
"print('RD:', r_cs - r_none)"
167+
]
168+
},
169+
{
170+
"cell_type": "markdown",
171+
"metadata": {},
172+
"source": [
173+
"Under the treatment plan where 75% of men and 90% of women are randomly treated, the risk of all-cause mortality would have been 5.9% points lower than if no one was treated. This plan reduces the marginal mortality less than the previous stochastic plan because our HIV-infected population is predominantly men. \n",
174+
"\n",
175+
"# Conclusion\n",
176+
"In this tutorial, I detailed stochastic treatment plans using the g-formula. While presented for a binary outcome, the same procedure can also be used to estimate stochastic treatments for continuous outcomes. Please view other tutorials for information other functions in *zEpid*\n",
177+
"\n",
178+
"## Further Readings\n",
179+
"Ahern et al. (2016). Predicting the population health impacts of community interventions: the case of alcohol outlets and binge drinking. *AJPH*, 106(11), 1938-1943.\n",
180+
"\n",
181+
"Snowden et al. (2011) \"Implementation of G-computation on a simulated data set: demonstration of a causal inference technique.\" *AJE* 173.7: 731-738.\n",
182+
"\n",
183+
"Robins. (1986) \"A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect.\" *Mathematical modelling* 7.9-12: 1393-1512"
184+
]
185+
}
186+
],
187+
"metadata": {
188+
"kernelspec": {
189+
"display_name": "Python 3",
190+
"language": "python",
191+
"name": "python3"
192+
},
193+
"language_info": {
194+
"codemirror_mode": {
195+
"name": "ipython",
196+
"version": 3
197+
},
198+
"file_extension": ".py",
199+
"mimetype": "text/x-python",
200+
"name": "python",
201+
"nbconvert_exporter": "python",
202+
"pygments_lexer": "ipython3",
203+
"version": "3.6.3"
204+
}
205+
},
206+
"nbformat": 4,
207+
"nbformat_minor": 2
208+
}
@@ -0,0 +1,67 @@
1+
Throughout the following tutorials in this branch, we will make the following identifiability assumptions.
2+
We will additionally assume no measurement error, no selection bias, and no interference.
3+
4+
# Assumptions
5+
6+
## Conditional Exchangeability
7+
Conditional exchangeability is the assumption that potential outcomes are independent of the treatment received
8+
conditional on some set of covariates. Using causal diagrams, this amounts to no open backdoor paths between the
9+
treatment and outcome. See the further reading list for publications on the assumption of conditional exchangeability
10+
and introductions to two different approaches to causal diagrams (directed acyclic graphs (DAG) and single-world
11+
intervention graphs (SWIG))
12+
13+
### Further Reading
14+
Hernán MA, Robins JM. (2006). Estimating causal effects from epidemiological data. *Journal of Epidemiology
15+
& Community Health*, 60(7), 578-586.
16+
17+
Greenland S, Pearl J, Robins JM. (1999). Causal diagrams for epidemiologic research. *Epidemiology*, 10, 37-48.
18+
19+
Richardson TS, Robins JM. (2013). Single world intervention graphs: a primer. *In Second UAI workshop on
20+
causal structure learning*, Bellevue, Washington.
21+
22+
Breskin A, Cole SR, Hudgens MG. (2018). A practical example demonstrating the utility of single-world
23+
intervention graphs. *Epidemiology*, 29(3), e20-e21.
24+
25+
## Positivity
26+
The positivity assumption is that there are treated and untreated individuals at every combination of covariates. There
27+
are two types of positivity violations: deterministic and random. Deterministic positivity violations cannot be resolved
28+
by collecting additional data. For an example of a deterministic positivity violation, consider the risk of death
29+
by hysterectomy. Since men lack a uterus, they are unable to receive a hysterectomy. Random positivity violations
30+
occur as a result of finite samples. In a small sample, it may just occur that we didn't observe anyone treated between
31+
ages 32-35. It isn't that no one could have been treated in that age group; we simply didn't observe it in our sample.
32+
For these scenarios, we will assume that our statistical model correctly interpolates over these areas (often a
33+
strong assumption in small data sets)
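
As a rough illustration (this snippet is not part of the original tutorials and the column names are hypothetical), one quick way to screen for random positivity violations is to cross-tabulate treatment against strata of a covariate and look for empty cells:

```python
import pandas as pd

# hypothetical data with a binary treatment and a continuous covariate
df = pd.DataFrame({'treatment': [0, 1, 0, 1, 0, 1, 0],
                   'age':       [31, 33, 34, 45, 52, 47, 38]})

# cross-tabulate treatment by age strata; cells with zero counts flag
# regions of the covariate space where no one was (un)treated in the sample
age_strata = pd.cut(df['age'], bins=[30, 35, 40, 45, 50, 55])
print(pd.crosstab(age_strata, df['treatment']))
```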
34+
35+
### Further Reading
36+
Westreich D, Cole SR. (2010). Invited commentary: positivity in practice. *American Journal of Epidemiology*,
37+
171(6), 674-677.
38+
39+
Cole SR, Hernán MA. (2008). Constructing inverse probability weights for marginal structural models.
40+
*American Journal of Epidemiology*, 168(6), 656-664.
41+
42+
## Causal Consistency
43+
Causal consistency is also referred to as treatment variation irrelevance. Under this assumption, either there
44+
is only one version of treatment (consistency) or any remaining differences between treatments are irrelevant
45+
(treatment variation irrelevance). For example, consider a study on 200mg daily aspirin and all-cause mortality. In our
46+
study, we may be willing to assume that taking aspirin in the morning versus at night is irrelevant to all-cause
47+
mortality. This is an example of assuming treatment variation irrelevance. Generally, defining the treatment more
48+
precisely can resolve this issue. There are also some additional approaches. I recommend reviewing the
49+
readings below for further discussion
50+
51+
### Further Reading
52+
Cole SR, Frangakis CE. (2009). The consistency statement in causal inference: a definition or an assumption?.
53+
*Epidemiology*, 20(1), 3-5.
54+
55+
VanderWeele TJ. (2009). Concerning the consistency assumption in causal inference. *Epidemiology*, 20(6), 880-883.
56+
57+
VanderWeele TJ. (2018). On well-defined hypothetical interventions in the potential outcomes framework.
58+
*Epidemiology*, 29(4), e24-e25.
59+
60+
## Correctly specified model
61+
Since we will be working with continuous and high-dimensional data, we will be using parametric regression models.
62+
We assume that these models are correctly specified. To make less restrictive assumptions regarding the functional
63+
forms of continuous variables, we will use splines throughout. Please refer to the Data Basics tutorial for an intro to
64+
using splines with *zEpid*
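
For example (a brief sketch using the same sample data as the other tutorials), restricted quadratic splines with three knots can be generated with zEpid's `spline` function:

```python
from zepid import load_sample_data, spline

df = load_sample_data(timevary=False)
# restricted quadratic spline terms (3 knots) for baseline CD4 T-cell count
df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)
```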
65+
66+
Additionally, we will sometimes use machine learning approaches to relax this assumption further (see the TMLE tutorials
67+
for some examples)
