|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Augmented Inverse Probability of Treatment Weights\n", |
| 8 | + "Augmented-IPTW (AIPTW) is a doubly robust estimator. Essentially, AIPTW combines the IPTW estimator and g-formula into a single estimate. Before continuing, I will briefly outline what a doubly-robust estimator is and why you would want to use one. In observational research with high-dimensional data, we (generally) are forced to use parametric models to adjust for many confounders. In this scenario, we assume that our parametric models are correctly specified. Our statistical model, $\\mathcal{M}$, must include the distribution that the data came from. \n", |
| 9 | + "\n", |
| 10 | + "With other estimators, like IPTW or g-formula, we have one chance to specify $\\mathcal{M}$ correctly. Doubly-robust estimators use a model to predict the treatment (like IPTW) and another model to predict the outcome (like g-formula). The estimator then combines the estimates, such that if either is correct, then our estimate will be consistent. Essentially, we get two chances to get the statistical model correct.\n", |
| 11 | + "\n", |
| 12 | + "A more in-depth description of doubly robust estimators is available in [this pre-print](https://statnav.files.wordpress.com/2017/10/doublerobustness-preprint.pdf)\n", |
| 13 | + "\n", |
| 14 | + "## AIPTW\n", |
| 15 | + "\n", |
| 16 | + "AIPTW takes the following form\n", |
| 17 | + "$$E[Y^a] = \\frac{1}{n} \\sum_i^n \\left(\\frac{Y \\times I(A=a)}{\\widehat{\\Pr}(A=a|L)} - \\frac{\\hat{E}[Y|A=a, L] \\times (I(A=a) - \\widehat{\\Pr}(A=a|L))}{1 - \\widehat{\\Pr}(A=a|L)}\\right)$$\n", |
| 18 | + "where $\\widehat{\\Pr}(A=a|L)$ comes from the IPTW model and $\\hat{E}[Y|A=a,L]$ comes from the g-formula. If we do some manipulations and assume an infinite sample size, we can end up with\n", |
| 19 | + "$$\\hat{E}^{IPW}[Y^a] \\times \\frac{\\Pr(A=a|L)}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})} - \\hat{E}^{STD}[Y^a] \\times \\frac{\\Pr(A=a|L) - \\widehat{\\Pr}(A=a|L, \\mathcal{M})}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})}$$\n", |
| 20 | + "from this form, we can see that if as long as one estimate is correct then AIPTW will be unbiased\n", |
| 21 | + "\n", |
| 22 | + "## An example\n", |
| 23 | + "To motivate our example, we will use a simulated data set included with *zEpid*. In the data set, we have a cohort of HIV-positive individuals. We are interested in the sample average treatment effect of antiretroviral therapy (ART) on all-cause mortality at 45-weeks. Based on substantive background knowledge, we believe that the treated and untreated population are exchangeable based gender, age, CD4 T-cell count, and detectable viral load. \n", |
| 24 | + "\n", |
| 25 | + "In this tutorial, we will focus on a complete case analysis. Therefore, we will drop the `cd4_wk45` column and all the missing data in `dead`. This will leave 517 observations with no missing data" |
| 26 | + ] |
| 27 | + }, |
| 28 | + { |
| 29 | + "cell_type": "code", |
| 30 | + "execution_count": 1, |
| 31 | + "metadata": {}, |
| 32 | + "outputs": [ |
| 33 | + { |
| 34 | + "name": "stdout", |
| 35 | + "output_type": "stream", |
| 36 | + "text": [ |
| 37 | + "<class 'pandas.core.frame.DataFrame'>\n", |
| 38 | + "Int64Index: 517 entries, 0 to 546\n", |
| 39 | + "Data columns (total 12 columns):\n", |
| 40 | + "id 517 non-null int64\n", |
| 41 | + "male 517 non-null int64\n", |
| 42 | + "age0 517 non-null int64\n", |
| 43 | + "cd40 517 non-null int64\n", |
| 44 | + "dvl0 517 non-null int64\n", |
| 45 | + "art 517 non-null int64\n", |
| 46 | + "dead 517 non-null float64\n", |
| 47 | + "t 517 non-null float64\n", |
| 48 | + "age_rs1 517 non-null float64\n", |
| 49 | + "age_rs2 517 non-null float64\n", |
| 50 | + "cd4_rs1 517 non-null float64\n", |
| 51 | + "cd4_rs2 517 non-null float64\n", |
| 52 | + "dtypes: float64(6), int64(6)\n", |
| 53 | + "memory usage: 52.5 KB\n" |
| 54 | + ] |
| 55 | + } |
| 56 | + ], |
| 57 | + "source": [ |
| 58 | + "import numpy as np\n", |
| 59 | + "import pandas as pd\n", |
| 60 | + "\n", |
| 61 | + "from zepid import load_sample_data, spline\n", |
| 62 | + "from zepid.causal.doublyrobust import AIPTW\n", |
| 63 | + "\n", |
| 64 | + "df = load_sample_data(False)\n", |
| 65 | + "df[['age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=3, term=2, restricted=True)\n", |
| 66 | + "df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)\n", |
| 67 | + "\n", |
| 68 | + "dfcc = df.drop(columns=['cd4_wk45']).dropna()\n", |
| 69 | + "dfcc.info()" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "metadata": {}, |
| 75 | + "source": [ |
| 76 | + "Our data is now ready to conduct a complete case analysis using TMLE. First, we initialize TMLE with our complete-case data (dfcc), the treatment (art), and the outcome (dead)" |
| 77 | + ] |
| 78 | + }, |
| 79 | + { |
| 80 | + "cell_type": "code", |
| 81 | + "execution_count": 3, |
| 82 | + "metadata": {}, |
| 83 | + "outputs": [], |
| 84 | + "source": [ |
| 85 | + "aipw = AIPTW(dfcc, exposure='art', outcome='dead')" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "metadata": {}, |
| 91 | + "source": [ |
| 92 | + "### Treatment Model\n", |
| 93 | + "First, we will specify our treatment model. We believe the sufficient set for the treatment model is gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). To relax the functional for assumptions, we will model age and CD4 using restricted quadratic splines" |
| 94 | + ] |
| 95 | + }, |
| 96 | + { |
| 97 | + "cell_type": "code", |
| 98 | + "execution_count": 4, |
| 99 | + "metadata": {}, |
| 100 | + "outputs": [ |
| 101 | + { |
| 102 | + "name": "stdout", |
| 103 | + "output_type": "stream", |
| 104 | + "text": [ |
| 105 | + "\n", |
| 106 | + "----------------------------------------------------------------\n", |
| 107 | + "MODEL: art ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n", |
| 108 | + "-----------------------------------------------------------------\n", |
| 109 | + " Generalized Linear Model Regression Results \n", |
| 110 | + "==============================================================================\n", |
| 111 | + "Dep. Variable: art No. Observations: 517\n", |
| 112 | + "Model: GLM Df Residuals: 508\n", |
| 113 | + "Model Family: Binomial Df Model: 8\n", |
| 114 | + "Link Function: logit Scale: 1.0000\n", |
| 115 | + "Method: IRLS Log-Likelihood: -206.06\n", |
| 116 | + "Date: Mon, 11 Mar 2019 Deviance: 412.12\n", |
| 117 | + "Time: 08:24:05 Pearson chi2: 510.\n", |
| 118 | + "No. Iterations: 5 Covariance Type: nonrobust\n", |
| 119 | + "==============================================================================\n", |
| 120 | + " coef std err z P>|z| [0.025 0.975]\n", |
| 121 | + "------------------------------------------------------------------------------\n", |
| 122 | + "Intercept 1.4498 1.679 0.864 0.388 -1.841 4.741\n", |
| 123 | + "male -0.1159 0.321 -0.361 0.718 -0.745 0.513\n", |
| 124 | + "age0 -0.1026 0.059 -1.726 0.084 -0.219 0.014\n", |
| 125 | + "age_rs1 0.0048 0.003 1.706 0.088 -0.001 0.010\n", |
| 126 | + "age_rs2 -0.0077 0.006 -1.373 0.170 -0.019 0.003\n", |
| 127 | + "cd40 0.0041 0.004 0.964 0.335 -0.004 0.012\n", |
| 128 | + "cd4_rs1 -2.422e-05 1.2e-05 -2.014 0.044 -4.78e-05 -6.49e-07\n", |
| 129 | + "cd4_rs2 8.875e-05 4.55e-05 1.952 0.051 -3.81e-07 0.000\n", |
| 130 | + "dvl0 -0.0158 0.399 -0.040 0.968 -0.797 0.765\n", |
| 131 | + "==============================================================================\n" |
| 132 | + ] |
| 133 | + } |
| 134 | + ], |
| 135 | + "source": [ |
| 136 | + "aipw.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')" |
| 137 | + ] |
| 138 | + }, |
| 139 | + { |
| 140 | + "cell_type": "markdown", |
| 141 | + "metadata": {}, |
| 142 | + "source": [ |
| 143 | + "`AIPTW` uses a logistic regression model to estimate the probabilities of treatment and the corresponding summary of the model fit are printed to the console. \n", |
| 144 | + "\n", |
| 145 | + "### Outcome Model\n", |
| 146 | + "Now, we will estimate the outcome model. We will model the outcomes as ART (`art`), gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). Again, we will model age and CD4 using restricted quadratic splines " |
| 147 | + ] |
| 148 | + }, |
| 149 | + { |
| 150 | + "cell_type": "code", |
| 151 | + "execution_count": 5, |
| 152 | + "metadata": {}, |
| 153 | + "outputs": [ |
| 154 | + { |
| 155 | + "name": "stdout", |
| 156 | + "output_type": "stream", |
| 157 | + "text": [ |
| 158 | + "\n", |
| 159 | + "----------------------------------------------------------------\n", |
| 160 | + "MODEL: dead ~ art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n", |
| 161 | + "-----------------------------------------------------------------\n", |
| 162 | + " Generalized Linear Model Regression Results \n", |
| 163 | + "==============================================================================\n", |
| 164 | + "Dep. Variable: dead No. Observations: 517\n", |
| 165 | + "Model: GLM Df Residuals: 507\n", |
| 166 | + "Model Family: Binomial Df Model: 9\n", |
| 167 | + "Link Function: logit Scale: 1.0000\n", |
| 168 | + "Method: IRLS Log-Likelihood: -202.85\n", |
| 169 | + "Date: Mon, 11 Mar 2019 Deviance: 405.71\n", |
| 170 | + "Time: 08:25:18 Pearson chi2: 535.\n", |
| 171 | + "No. Iterations: 6 Covariance Type: nonrobust\n", |
| 172 | + "==============================================================================\n", |
| 173 | + " coef std err z P>|z| [0.025 0.975]\n", |
| 174 | + "------------------------------------------------------------------------------\n", |
| 175 | + "Intercept -4.0961 2.713 -1.510 0.131 -9.413 1.220\n", |
| 176 | + "art -0.7274 0.392 -1.853 0.064 -1.497 0.042\n", |
| 177 | + "male -0.0774 0.334 -0.232 0.817 -0.732 0.577\n", |
| 178 | + "age0 0.1605 0.096 1.670 0.095 -0.028 0.349\n", |
| 179 | + "age_rs1 -0.0058 0.004 -1.481 0.139 -0.013 0.002\n", |
| 180 | + "age_rs2 0.0128 0.006 2.026 0.043 0.000 0.025\n", |
| 181 | + "cd40 -0.0123 0.004 -2.987 0.003 -0.020 -0.004\n", |
| 182 | + "cd4_rs1 1.872e-05 1.18e-05 1.584 0.113 -4.45e-06 4.19e-05\n", |
| 183 | + "cd4_rs2 -3.868e-05 4.59e-05 -0.842 0.400 -0.000 5.13e-05\n", |
| 184 | + "dvl0 -0.1261 0.398 -0.317 0.751 -0.906 0.653\n", |
| 185 | + "==============================================================================\n" |
| 186 | + ] |
| 187 | + } |
| 188 | + ], |
| 189 | + "source": [ |
| 190 | + "aipw.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')" |
| 191 | + ] |
| 192 | + }, |
| 193 | + { |
| 194 | + "cell_type": "markdown", |
| 195 | + "metadata": {}, |
| 196 | + "source": [ |
| 197 | + "Again, logistic regression is used to predict the outcome data\n", |
| 198 | + "\n", |
| 199 | + "### Estimation\n", |
| 200 | + "To estimate the risk difference and risk ratio, we will now call the `fit()` function. After this, `AIPTW` gains the attributes `risk_difference` and `risk_ratio`. Additionally, results can be printed to the console using the `summary()` function" |
| 201 | + ] |
| 202 | + }, |
| 203 | + { |
| 204 | + "cell_type": "code", |
| 205 | + "execution_count": 6, |
| 206 | + "metadata": {}, |
| 207 | + "outputs": [ |
| 208 | + { |
| 209 | + "name": "stdout", |
| 210 | + "output_type": "stream", |
| 211 | + "text": [ |
| 212 | + "RD: -0.06857139263314216\n", |
| 213 | + "RR: 0.5844630051369846\n", |
| 214 | + "----------------------------------------------------------------------\n", |
| 215 | + "Risk Difference: -0.0686\n", |
| 216 | + "Risk Ratio: 0.5845\n", |
| 217 | + "----------------------------------------------------------------------\n" |
| 218 | + ] |
| 219 | + } |
| 220 | + ], |
| 221 | + "source": [ |
| 222 | + "aipw.fit()\n", |
| 223 | + "\n", |
| 224 | + "print('RD:', aipw.risk_difference)\n", |
| 225 | + "print('RR:', aipw.risk_ratio)\n", |
| 226 | + "\n", |
| 227 | + "aipw.summary()" |
| 228 | + ] |
| 229 | + }, |
| 230 | + { |
| 231 | + "cell_type": "markdown", |
| 232 | + "metadata": {}, |
| 233 | + "source": [ |
| 234 | + "Interpreting the risk difference, we would conclude that had everyone in our cohort been treated with ART, the risk of all-cause mortality would have been 6.9% points lower than had no one been treated.\n", |
| 235 | + "\n", |
| 236 | + "### Confidence Intervals\n", |
| 237 | + "To obtain correct confidence intervals, we need to use a bootstrap procedure. Influence curves are an alternative, but not currently available" |
| 238 | + ] |
| 239 | + }, |
| 240 | + { |
| 241 | + "cell_type": "code", |
| 242 | + "execution_count": 8, |
| 243 | + "metadata": {}, |
| 244 | + "outputs": [ |
| 245 | + { |
| 246 | + "name": "stdout", |
| 247 | + "output_type": "stream", |
| 248 | + "text": [ |
| 249 | + "95% LCL: -0.12324970574432371\n", |
| 250 | + "95% UCL -0.005298084807482341\n" |
| 251 | + ] |
| 252 | + } |
| 253 | + ], |
| 254 | + "source": [ |
| 255 | + "rd_results = []\n", |
| 256 | + "for i in range(1000):\n", |
| 257 | + " dfs = dfcc.sample(n=dfcc.shape[0],replace=True)\n", |
| 258 | + " s = AIPTW(dfs,exposure='art',outcome='dead')\n", |
| 259 | + " s.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n", |
| 260 | + " print_results=False)\n", |
| 261 | + " s.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n", |
| 262 | + " print_results=False)\n", |
| 263 | + " s.fit()\n", |
| 264 | + " rd_results.append(s.risk_difference)\n", |
| 265 | + "\n", |
| 266 | + "\n", |
| 267 | + "print('95% LCL:', np.percentile(rd_results, q=2.5))\n", |
| 268 | + "print('95% UCL', np.percentile(rd_results, q=97.5))" |
| 269 | + ] |
| 270 | + }, |
| 271 | + { |
| 272 | + "cell_type": "markdown", |
| 273 | + "metadata": {}, |
| 274 | + "source": [ |
| 275 | + "Under the counterfactual of everyone receiving treatment with ART, the risk of all-cause mortality was 6.9% points lower (95% CL: ) than the counterfactual where no one had been treated.\n", |
| 276 | + "\n", |
| 277 | + "# Conclusion\n", |
| 278 | + "In this tutorial, I introduced the concept of doubly-robust estimators and detailed augmented-IPTW. I demonstrated estimation with `AIPTW` using *zEpid* and how to obtain confidence intervals. Please view other tutorials for information on other functionality within *zEpid*\n", |
| 279 | + "\n", |
| 280 | + "## References\n", |
| 281 | + "Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. (2011). Doubly robust estimation of causal effects. *AJE*, 173(7), 761-767.\n", |
| 282 | + "\n", |
| 283 | + "Keil AP et al. (2018). Resolving an apparent paradox in doubly robust estimators. *AJE*, 187(4), 891-892.\n", |
| 284 | + "\n", |
| 285 | + "Robins JM, Rotnitzky A, Zhao LP. (1994). Estimation of regression coefficients when some regressors are not always observed. *JASA*, 89(427), 846-866." |
| 286 | + ] |
| 287 | + } |
| 288 | + ], |
| 289 | + "metadata": { |
| 290 | + "kernelspec": { |
| 291 | + "display_name": "Python 3", |
| 292 | + "language": "python", |
| 293 | + "name": "python3" |
| 294 | + }, |
| 295 | + "language_info": { |
| 296 | + "codemirror_mode": { |
| 297 | + "name": "ipython", |
| 298 | + "version": 3 |
| 299 | + }, |
| 300 | + "file_extension": ".py", |
| 301 | + "mimetype": "text/x-python", |
| 302 | + "name": "python", |
| 303 | + "nbconvert_exporter": "python", |
| 304 | + "pygments_lexer": "ipython3", |
| 305 | + "version": "3.6.3" |
| 306 | + } |
| 307 | + }, |
| 308 | + "nbformat": 4, |
| 309 | + "nbformat_minor": 2 |
| 310 | +} |
0 commit comments