Skip to content

Commit 6d5d213

Browse files
committed
AIPTW tutorial
1 parent bfbab69 commit 6d5d213

File tree

1 file changed

+310
-0
lines changed
  • 3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments

1 file changed

+310
-0
lines changed
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,310 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Augmented Inverse Probability of Treatment Weights\n",
8+
"Augmented-IPTW (AIPTW) is a doubly robust estimator. Essentially, AIPTW combines the IPTW estimator and g-formula into a single estimate. Before continuing, I will briefly outline what a doubly-robust estimator is and why you would want to use one. In observational research with high-dimensional data, we (generally) are forced to use parametric models to adjust for many confounders. In this scenario, we assume that our parametric models are correctly specified. Our statistical model, $\\mathcal{M}$, must include the distribution that the data came from. \n",
9+
"\n",
10+
"With other estimators, like IPTW or g-formula, we have one chance to specify $\\mathcal{M}$ correctly. Doubly-robust estimators use a model to predict the treatment (like IPTW) and another model to predict the outcome (like g-formula). The estimator then combines the estimates, such that if either is correct, then our estimate will be consistent. Essentially, we get two chances to get the statistical model correct.\n",
11+
"\n",
12+
"A more in-depth description of doubly robust estimators is available in [this pre-print](https://statnav.files.wordpress.com/2017/10/doublerobustness-preprint.pdf)\n",
13+
"\n",
14+
"## AIPTW\n",
15+
"\n",
16+
"AIPTW takes the following form\n",
17+
"$$E[Y^a] = \\frac{1}{n} \\sum_i^n \\left(\\frac{Y \\times I(A=a)}{\\widehat{\\Pr}(A=a|L)} - \\frac{\\hat{E}[Y|A=a, L] \\times (I(A=a) - \\widehat{\\Pr}(A=a|L))}{1 - \\widehat{\\Pr}(A=a|L)}\\right)$$\n",
18+
"where $\\widehat{\\Pr}(A=a|L)$ comes from the IPTW model and $\\hat{E}[Y|A=a,L]$ comes from the g-formula. If we do some manipulations and assume an infinite sample size, we can end up with\n",
19+
"$$\\hat{E}^{IPW}[Y^a] \\times \\frac{\\Pr(A=a|L)}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})} - \\hat{E}^{STD}[Y^a] \\times \\frac{\\Pr(A=a|L) - \\widehat{\\Pr}(A=a|L, \\mathcal{M})}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})}$$\n",
20+
"from this form, we can see that if as long as one estimate is correct then AIPTW will be unbiased\n",
21+
"\n",
22+
"## An example\n",
23+
"To motivate our example, we will use a simulated data set included with *zEpid*. In the data set, we have a cohort of HIV-positive individuals. We are interested in the sample average treatment effect of antiretroviral therapy (ART) on all-cause mortality at 45-weeks. Based on substantive background knowledge, we believe that the treated and untreated population are exchangeable based gender, age, CD4 T-cell count, and detectable viral load. \n",
24+
"\n",
25+
"In this tutorial, we will focus on a complete case analysis. Therefore, we will drop the `cd4_wk45` column and all the missing data in `dead`. This will leave 517 observations with no missing data"
26+
]
27+
},
28+
{
29+
"cell_type": "code",
30+
"execution_count": 1,
31+
"metadata": {},
32+
"outputs": [
33+
{
34+
"name": "stdout",
35+
"output_type": "stream",
36+
"text": [
37+
"<class 'pandas.core.frame.DataFrame'>\n",
38+
"Int64Index: 517 entries, 0 to 546\n",
39+
"Data columns (total 12 columns):\n",
40+
"id 517 non-null int64\n",
41+
"male 517 non-null int64\n",
42+
"age0 517 non-null int64\n",
43+
"cd40 517 non-null int64\n",
44+
"dvl0 517 non-null int64\n",
45+
"art 517 non-null int64\n",
46+
"dead 517 non-null float64\n",
47+
"t 517 non-null float64\n",
48+
"age_rs1 517 non-null float64\n",
49+
"age_rs2 517 non-null float64\n",
50+
"cd4_rs1 517 non-null float64\n",
51+
"cd4_rs2 517 non-null float64\n",
52+
"dtypes: float64(6), int64(6)\n",
53+
"memory usage: 52.5 KB\n"
54+
]
55+
}
56+
],
57+
"source": [
58+
"import numpy as np\n",
59+
"import pandas as pd\n",
60+
"\n",
61+
"from zepid import load_sample_data, spline\n",
62+
"from zepid.causal.doublyrobust import AIPTW\n",
63+
"\n",
64+
"df = load_sample_data(False)\n",
65+
"df[['age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=3, term=2, restricted=True)\n",
66+
"df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)\n",
67+
"\n",
68+
"dfcc = df.drop(columns=['cd4_wk45']).dropna()\n",
69+
"dfcc.info()"
70+
]
71+
},
72+
{
73+
"cell_type": "markdown",
74+
"metadata": {},
75+
"source": [
76+
"Our data is now ready to conduct a complete case analysis using TMLE. First, we initialize TMLE with our complete-case data (dfcc), the treatment (art), and the outcome (dead)"
77+
]
78+
},
79+
{
80+
"cell_type": "code",
81+
"execution_count": 3,
82+
"metadata": {},
83+
"outputs": [],
84+
"source": [
85+
"aipw = AIPTW(dfcc, exposure='art', outcome='dead')"
86+
]
87+
},
88+
{
89+
"cell_type": "markdown",
90+
"metadata": {},
91+
"source": [
92+
"### Treatment Model\n",
93+
"First, we will specify our treatment model. We believe the sufficient set for the treatment model is gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). To relax the functional for assumptions, we will model age and CD4 using restricted quadratic splines"
94+
]
95+
},
96+
{
97+
"cell_type": "code",
98+
"execution_count": 4,
99+
"metadata": {},
100+
"outputs": [
101+
{
102+
"name": "stdout",
103+
"output_type": "stream",
104+
"text": [
105+
"\n",
106+
"----------------------------------------------------------------\n",
107+
"MODEL: art ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n",
108+
"-----------------------------------------------------------------\n",
109+
" Generalized Linear Model Regression Results \n",
110+
"==============================================================================\n",
111+
"Dep. Variable: art No. Observations: 517\n",
112+
"Model: GLM Df Residuals: 508\n",
113+
"Model Family: Binomial Df Model: 8\n",
114+
"Link Function: logit Scale: 1.0000\n",
115+
"Method: IRLS Log-Likelihood: -206.06\n",
116+
"Date: Mon, 11 Mar 2019 Deviance: 412.12\n",
117+
"Time: 08:24:05 Pearson chi2: 510.\n",
118+
"No. Iterations: 5 Covariance Type: nonrobust\n",
119+
"==============================================================================\n",
120+
" coef std err z P>|z| [0.025 0.975]\n",
121+
"------------------------------------------------------------------------------\n",
122+
"Intercept 1.4498 1.679 0.864 0.388 -1.841 4.741\n",
123+
"male -0.1159 0.321 -0.361 0.718 -0.745 0.513\n",
124+
"age0 -0.1026 0.059 -1.726 0.084 -0.219 0.014\n",
125+
"age_rs1 0.0048 0.003 1.706 0.088 -0.001 0.010\n",
126+
"age_rs2 -0.0077 0.006 -1.373 0.170 -0.019 0.003\n",
127+
"cd40 0.0041 0.004 0.964 0.335 -0.004 0.012\n",
128+
"cd4_rs1 -2.422e-05 1.2e-05 -2.014 0.044 -4.78e-05 -6.49e-07\n",
129+
"cd4_rs2 8.875e-05 4.55e-05 1.952 0.051 -3.81e-07 0.000\n",
130+
"dvl0 -0.0158 0.399 -0.040 0.968 -0.797 0.765\n",
131+
"==============================================================================\n"
132+
]
133+
}
134+
],
135+
"source": [
136+
"aipw.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')"
137+
]
138+
},
139+
{
140+
"cell_type": "markdown",
141+
"metadata": {},
142+
"source": [
143+
"`AIPTW` uses a logistic regression model to estimate the probabilities of treatment and the corresponding summary of the model fit are printed to the console. \n",
144+
"\n",
145+
"### Outcome Model\n",
146+
"Now, we will estimate the outcome model. We will model the outcomes as ART (`art`), gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). Again, we will model age and CD4 using restricted quadratic splines "
147+
]
148+
},
149+
{
150+
"cell_type": "code",
151+
"execution_count": 5,
152+
"metadata": {},
153+
"outputs": [
154+
{
155+
"name": "stdout",
156+
"output_type": "stream",
157+
"text": [
158+
"\n",
159+
"----------------------------------------------------------------\n",
160+
"MODEL: dead ~ art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n",
161+
"-----------------------------------------------------------------\n",
162+
" Generalized Linear Model Regression Results \n",
163+
"==============================================================================\n",
164+
"Dep. Variable: dead No. Observations: 517\n",
165+
"Model: GLM Df Residuals: 507\n",
166+
"Model Family: Binomial Df Model: 9\n",
167+
"Link Function: logit Scale: 1.0000\n",
168+
"Method: IRLS Log-Likelihood: -202.85\n",
169+
"Date: Mon, 11 Mar 2019 Deviance: 405.71\n",
170+
"Time: 08:25:18 Pearson chi2: 535.\n",
171+
"No. Iterations: 6 Covariance Type: nonrobust\n",
172+
"==============================================================================\n",
173+
" coef std err z P>|z| [0.025 0.975]\n",
174+
"------------------------------------------------------------------------------\n",
175+
"Intercept -4.0961 2.713 -1.510 0.131 -9.413 1.220\n",
176+
"art -0.7274 0.392 -1.853 0.064 -1.497 0.042\n",
177+
"male -0.0774 0.334 -0.232 0.817 -0.732 0.577\n",
178+
"age0 0.1605 0.096 1.670 0.095 -0.028 0.349\n",
179+
"age_rs1 -0.0058 0.004 -1.481 0.139 -0.013 0.002\n",
180+
"age_rs2 0.0128 0.006 2.026 0.043 0.000 0.025\n",
181+
"cd40 -0.0123 0.004 -2.987 0.003 -0.020 -0.004\n",
182+
"cd4_rs1 1.872e-05 1.18e-05 1.584 0.113 -4.45e-06 4.19e-05\n",
183+
"cd4_rs2 -3.868e-05 4.59e-05 -0.842 0.400 -0.000 5.13e-05\n",
184+
"dvl0 -0.1261 0.398 -0.317 0.751 -0.906 0.653\n",
185+
"==============================================================================\n"
186+
]
187+
}
188+
],
189+
"source": [
190+
"aipw.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')"
191+
]
192+
},
193+
{
194+
"cell_type": "markdown",
195+
"metadata": {},
196+
"source": [
197+
"Again, logistic regression is used to predict the outcome data\n",
198+
"\n",
199+
"### Estimation\n",
200+
"To estimate the risk difference and risk ratio, we will now call the `fit()` function. After this, `AIPTW` gains the attributes `risk_difference` and `risk_ratio`. Additionally, results can be printed to the console using the `summary()` function"
201+
]
202+
},
203+
{
204+
"cell_type": "code",
205+
"execution_count": 6,
206+
"metadata": {},
207+
"outputs": [
208+
{
209+
"name": "stdout",
210+
"output_type": "stream",
211+
"text": [
212+
"RD: -0.06857139263314216\n",
213+
"RR: 0.5844630051369846\n",
214+
"----------------------------------------------------------------------\n",
215+
"Risk Difference: -0.0686\n",
216+
"Risk Ratio: 0.5845\n",
217+
"----------------------------------------------------------------------\n"
218+
]
219+
}
220+
],
221+
"source": [
222+
"aipw.fit()\n",
223+
"\n",
224+
"print('RD:', aipw.risk_difference)\n",
225+
"print('RR:', aipw.risk_ratio)\n",
226+
"\n",
227+
"aipw.summary()"
228+
]
229+
},
230+
{
231+
"cell_type": "markdown",
232+
"metadata": {},
233+
"source": [
234+
"Interpreting the risk difference, we would conclude that had everyone in our cohort been treated with ART, the risk of all-cause mortality would have been 6.9% points lower than had no one been treated.\n",
235+
"\n",
236+
"### Confidence Intervals\n",
237+
"To obtain correct confidence intervals, we need to use a bootstrap procedure. Influence curves are an alternative, but not currently available"
238+
]
239+
},
240+
{
241+
"cell_type": "code",
242+
"execution_count": 8,
243+
"metadata": {},
244+
"outputs": [
245+
{
246+
"name": "stdout",
247+
"output_type": "stream",
248+
"text": [
249+
"95% LCL: -0.12324970574432371\n",
250+
"95% UCL -0.005298084807482341\n"
251+
]
252+
}
253+
],
254+
"source": [
255+
"rd_results = []\n",
256+
"for i in range(1000):\n",
257+
" dfs = dfcc.sample(n=dfcc.shape[0],replace=True)\n",
258+
" s = AIPTW(dfs,exposure='art',outcome='dead')\n",
259+
" s.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n",
260+
" print_results=False)\n",
261+
" s.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n",
262+
" print_results=False)\n",
263+
" s.fit()\n",
264+
" rd_results.append(s.risk_difference)\n",
265+
"\n",
266+
"\n",
267+
"print('95% LCL:', np.percentile(rd_results, q=2.5))\n",
268+
"print('95% UCL', np.percentile(rd_results, q=97.5))"
269+
]
270+
},
271+
{
272+
"cell_type": "markdown",
273+
"metadata": {},
274+
"source": [
275+
"Under the counterfactual of everyone receiving treatment with ART, the risk of all-cause mortality was 6.9% points lower (95% CL: ) than the counterfactual where no one had been treated.\n",
276+
"\n",
277+
"# Conclusion\n",
278+
"In this tutorial, I introduced the concept of doubly-robust estimators and detailed augmented-IPTW. I demonstrated estimation with `AIPTW` using *zEpid* and how to obtain confidence intervals. Please view other tutorials for information on other functionality within *zEpid*\n",
279+
"\n",
280+
"## References\n",
281+
"Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. (2011). Doubly robust estimation of causal effects. *AJE*, 173(7), 761-767.\n",
282+
"\n",
283+
"Keil AP et al. (2018). Resolving an apparent paradox in doubly robust estimators. *AJE*, 187(4), 891-892.\n",
284+
"\n",
285+
"Robins JM, Rotnitzky A, Zhao LP. (1994). Estimation of regression coefficients when some regressors are not always observed. *JASA*, 89(427), 846-866."
286+
]
287+
}
288+
],
289+
"metadata": {
290+
"kernelspec": {
291+
"display_name": "Python 3",
292+
"language": "python",
293+
"name": "python3"
294+
},
295+
"language_info": {
296+
"codemirror_mode": {
297+
"name": "ipython",
298+
"version": 3
299+
},
300+
"file_extension": ".py",
301+
"mimetype": "text/x-python",
302+
"name": "python",
303+
"nbconvert_exporter": "python",
304+
"pygments_lexer": "ipython3",
305+
"version": "3.6.3"
306+
}
307+
},
308+
"nbformat": 4,
309+
"nbformat_minor": 2
310+
}

0 commit comments

Comments
 (0)