diff --git a/.gitignore b/.gitignore
index 1c08d93..da30c8e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,5 +9,7 @@ docs
 *.pdf
 *_notes_files/
 .quarto
+.vdoc.py
+.vdoc.r
 !sidebars-toggle.html
diff --git a/08_main.qmd b/08_main.qmd
index 01e8618..831bed9 100644
--- a/08_main.qmd
+++ b/08_main.qmd
@@ -2,6 +2,10 @@
 ## Learning Objectives
 
-- item 1
-- item 2
-- item 3
+- Use **decision trees** to model relationships between predictors and an outcome
+- Compare and contrast tree-based models with other model types
+- Use **tree-based ensemble methods** to improve predictive models
+- Compare and contrast the various methods of building tree ensembles: bagging, boosting, random forests, and Bayesian Additive Regression Trees (BART)
+
+Sources:
+https://github.com/JauntyJJS/islr2-bookclub-cohort3-chapter8, https://hastie.su.domains/ISLR2/Slides/Ch8_Tree_Based_Methods.pdf
diff --git a/08_notes.qmd b/08_notes.qmd
index 7a77083..20dfe88 100644
--- a/08_notes.qmd
+++ b/08_notes.qmd
@@ -1,2 +1,417 @@
-## Notes {.unnumbered}
+# Notes {-}
+
+## Introduction: Tree-based methods
+
+- Involve **stratifying** or **segmenting** the predictor space into a number of simple regions
+- Are simple and useful for interpretation
+- However, basic decision trees are NOT competitive with the best supervised learning approaches in terms of prediction accuracy
+- Thus, we also discuss **bagging**, **random forests**, and **boosting** (i.e., tree-based ensemble methods) to grow multiple trees which are then combined to yield a single consensus prediction
+- These can result in dramatic improvements in prediction accuracy (but some loss of interpretability)
+- Can be applied to both regression and classification
+
+## Regression Trees
+
+First, let's take a look at the `Hitters` dataset. 
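+Before looking at the data, here is a minimal sketch of how the regression tree discussed in this section could be grown in R. The choice of the `rpart` package is an assumption (the `tree` package would also work), and the chunk repeats the preprocessing done below so that it is self-contained:
+
+```{r}
+#| label: 08-rpart-sketch
+#| eval: false
+library(rpart)
+library(dplyr)
+library(tidyr)
+library(readr)
+
+# Drop players with missing Salary and log-transform, as in the chunk below
+hitters <- read_csv('./data/Hitters.csv') %>%
+  drop_na(Salary) %>%
+  mutate(log_Salary = log(Salary))
+
+# Grow a regression tree by recursive binary splitting (RSS criterion)
+fit <- rpart(log_Salary ~ Years + Hits, data = hitters)
+fit
+```
+
+Printing `fit` lists each split and the mean response in each terminal node, i.e., the same information as the tree diagram shown below.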
+```{r}
+#| label: 08-hitters-dataset
+#| echo: false
+library(dplyr)
+library(tidyr)
+library(readr)
+
+df <- read_csv('./data/Hitters.csv') %>%
+  select(Names, Hits, Years, Salary) %>%
+  drop_na() %>%
+  mutate(log_Salary = log(Salary))
+
+df
+```
+
+```{r}
+#| label: 08-reg-trees-intro
+#| echo: false
+#| out-width: 100%
+knitr::include_graphics("images/08_1_salary_data.png")
+
+knitr::include_graphics("images/08_2_basic_tree.png")
+```
+
+- For the Hitters data, a regression tree for predicting the log salary of a baseball player based on:
+
+    1. number of years that he has played in the major leagues
+    2. number of hits that he made in the previous year
+
+## Terminology
+
+```{r}
+#| label: 08-decision-trees-terminology-1
+#| echo: false
+#| out-width: 100%
+knitr::include_graphics("images/08_3_basic_tree_term.png")
+```
+
+```{r}
+#| label: 08-decision-trees-terminology-2
+#| echo: false
+#| fig-cap: The three-region partition for the Hitters data set from the regression tree
+#| out-width: 100%
+knitr::include_graphics("images/08_4_hitters_predictor_space.png")
+```
+
+- Overall, the tree stratifies or segments the players into three regions of predictor space:
+    - R1 = {X \| Years\< 4.5}
+    - R2 = {X \| Years\>=4.5, Hits\<117.5}
+    - R3 = {X \| Years\>=4.5, Hits\>=117.5}
+
+    where R1, R2, and R3 are **terminal nodes** (leaves) and green lines (where the predictor space is split) are the **internal nodes**
+
+- The number in each leaf/terminal node is the mean of the response for the observations that fall there
+
+## Interpretation of results: regression tree (Hitters data)
+
+```{r}
+#| label: 08-reg-trees-interpretation
+#| echo: false
+#| out-width: 100%
+knitr::include_graphics("images/08_2_basic_tree.png")
+```
+
+1. `Years` is the most important factor in determining `Salary`: players with less experience earn lower salaries than more experienced players
+2. 
Given that a player is less experienced, the number of `Hits` that he made in the previous year seems to play little role in his `Salary`
+3. But among players who have been in the major leagues for 5 or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries
+4. This is surely an over-simplification, but compared to a regression model, it is easy to display, interpret, and explain
+
+## Tree-building process (regression)
+
+1. Divide the predictor space --- that is, the set of possible values for $X_1,X_2, . . . ,X_p$ --- into $J$ distinct and **non-overlapping** regions, $R_1,R_2, . . . ,R_J$
+    - Regions can have ANY shape - they don't have to be boxes
+2. For every observation that falls into the region $R_j$, we make the same prediction: the **mean** of the response values in $R_j$
+3. The goal is to find regions (here boxes) $R_1, . . . ,R_J$ that **minimize** the RSS, given by
+
+$$\mathrm{RSS}=\sum_{j=1}^{J}\sum_{i \in R_j}(y_i - \hat{y}_{R_j})^2$$
+
+where $\hat{y}_{R_j}$ is the **mean** response for the training observations within the $j$th box
+
+- Unfortunately, it is **computationally infeasible** to consider every possible partition of the feature space into $J$ boxes.
+
+## Recursive binary splitting
+
+So, we take a top-down, greedy approach known as recursive binary splitting:
+
+- **top-down** because it begins at the top of the tree and then successively splits the predictor space
+- **greedy** because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step
+
+1. 
First, select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{X|X_j < s\}$ and $\{X|X_j \geq s\}$ leads to the greatest possible reduction in RSS
+
+## Bayesian Additive Regression Trees (BART)
+
+- In the $b$th iteration, to update the $k$th tree, we subtract from each response value the predictions from all *but* the $k$th tree, obtaining a **partial residual**
+
+$r_i = y_i - \sum_{k'<k}\hat{f}^{b}_{k'}(x_i) - \sum_{k'>k}\hat{f}^{b-1}_{k'}(x_i)$
+
+for the $i$th observation, $i = 1, …, n$
+
+- Rather than fitting a new tree to this partial residual, BART chooses a perturbation to the tree from a previous iteration $\hat{f}^{b-1}_{k}$ favoring perturbations that improve the fit to the partial residual
+- To perturb trees:
+    - change the structure of the tree by adding/pruning branches
+    - change the prediction in each terminal node of the tree
+- The output of BART is a collection of prediction models:
+
+$\hat{f}^b(x) = \sum_{k=1}^{K}\hat{f}^b_k(x)$
+
+for $b = 1, 2,…, B$
+
+## BART algorithm: figure
+
+```{r}
+#| label: 08-bart-algo
+#| echo: false
+#| out-width: 100%
+knitr::include_graphics("images/08_12_bart_algorithm.png")
+```
+
+- **Comment**: the first few prediction models obtained in the earlier iterations (known as the *burn-in* period; denoted by $L$) are typically thrown away since they tend to not provide very good results, like you throw away the first pancake of the batch
+
+## BART: additional details
+
+- A key element of BART is that a fresh tree is NOT fit to the current partial residual: instead, we improve the fit to the current partial residual by slightly modifying the tree obtained in the previous iteration (Step 3(a)ii)
+- This guards against overfitting since it limits how "hard" the data is fit in each iteration
+- Additionally, the individual trees are typically pretty small
+- BART, as the name suggests, can be viewed as a *Bayesian* approach to fitting an ensemble of trees:
+    - each time a tree is randomly perturbed to fit the residuals = drawing a new tree from a *posterior* distribution
+
+## To apply BART:
+
+- We must select the number of trees $K$, the number of iterations $B$ and the number of burn-in iterations $L$
+- Typically, large values are chosen for $B$ and $K$ and a moderate value for $L$: e.g. 
$K$ = 200, $B$ = 1,000 and $L$ = 100 +- BART has been shown to have impressive out-of-box performance - i.e., it performs well with minimal tuning diff --git a/_freeze/03_notes/execute-results/html.json b/_freeze/03_notes/execute-results/html.json index 09e50a6..a753dd4 100644 --- a/_freeze/03_notes/execute-results/html.json +++ b/_freeze/03_notes/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "3e85649170309cf3a5e235e9d4c6c1b8", + "hash": "08bcf4f45674dfc276d50a821a693bbd", "result": { - "markdown": "# Notes {-}\n\n## Questions to Answer\n\nRecall the `Advertising` data from **Chapter 2**. Here are a few important questions that we might seek to address:\n\n1. **Is there a relationship between advertising budget and sales?**\n2. **How strong is the relationship between advertising budget and sales?** Does knowledge of the advertising budget provide a lot of information about product sales?\n3. **Which media are associated with sales?**\n4. **How large is the association between each medium and sales?** For every dollar spent on advertising in a particular medium, by what amount will sales increase? \n5. **How accurately can we predict future sales?**\n6. **Is the relationship linear?** If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.\n7. 
**Is there synergy among the advertising media?** Or, in stats terms, is there an interaction effect?\n\n## Simple Linear Regression: Definition\n\n**Simple linear regression:** Very straightforward approach to predicting response $Y$ on predictor $X$.\n\n\n$$Y \\approx \\beta_{0} + \\beta_{1}X$$\n\n\n- Read \"$\\approx$\" as *\"is approximately modeled by.\"*\n- $\\beta_{0}$ = intercept\n- $\\beta_{1}$ = slope\n\n\n$$\\hat{y} = \\hat{\\beta}_{0} + \\hat{\\beta}_{1}x$$\n\n\n- $\\hat{\\beta}_{0}$ = our approximation of intercept\n- $\\hat{\\beta}_{1}$ = our approximation of slope\n- $x$ = sample of $X$\n- $\\hat{y}$ = our prediction of $Y$ from $x$\n- hat symbol denotes \"estimated value\" \n\n- Linear regression is a simple approach to supervised learning\n\n## Simple Linear Regression: Visualization\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For the `Advertising` data, the least squares fit for the regression of `sales` onto `TV` is shown. The fit is found by minimizing the residual sum of squares. Each grey line segment represents a residual. 
In this case a linear fit captures the essence of the relationship, although it overestimates the trend in the left of the plot.](images/fig3_1.jpg){width=100%}\n:::\n:::\n\n\n## Simple Linear Regression: Math\n\n- **RSS** = *residual sum of squares*\n\n\n$$\\mathrm{RSS} = e^{2}_{1} + e^{2}_{2} + \\ldots + e^{2}_{n}$$\n\n$$\\mathrm{RSS} = (y_{1} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{1})^{2} + (y_{2} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{2})^{2} + \\ldots + (y_{n} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{n})^{2}$$\n\n$$\\mathrm{RSS} = (y_{1} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{1})^{2} + (y_{2} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{2})^{2} + \\ldots + (y_{n} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{n})^{2}$$\n\n$$\\hat{\\beta}_{0} = \\bar{y} - \\hat{\\beta}_{1}\\bar{x}$$\n\n\n- $\\bar{x}$, $\\bar{y}$ = sample means of $x$ and $y$\n\n### Visualization of Fit\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Contour and three-dimensional plots of the RSS on the `Advertising` data, using `sales` as the response and `TV` as the predictor. 
The red dots correspond to the least squares estimates $\\\\hat\\\\beta_0$ and $\\\\hat\\\\beta_1$, given by (3.4).](images/fig3_2.jpg){width=100%}\n:::\n:::\n\n\n**Learning Objectives:**\n\n- Perform linear regression with a **single predictor variable.**\n\n## Assessing Accuracy of Coefficient Estimates\n\n\n$$Y = \\beta_{0} + \\beta_{1}X + \\epsilon$$\n\n\n- **RSE** = *residual standard error*\n- Estimate of $\\sigma$\n\n\n$$\\mathrm{RSE} = \\sqrt{\\frac{\\mathrm{RSS}}{n - 2}}$$\n\n$$\\mathrm{SE}(\\hat\\beta_0)^2 = \\sigma^2 \\left[\\frac{1}{n} + \\frac{\\bar{x}^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}\\right],\\ \\ \\mathrm{SE}(\\hat\\beta_1)^2 = \\frac{\\sigma^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}$$\n\n\n- **95% confidence interval:** a range of values such that with 95% probability, the range will contain the true unknown value of the parameter\n - If we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter\n\n\n$$\\hat\\beta_1 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_1)$$\n\n$$\\hat\\beta_0 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_0)$$\n\n\n**Learning Objectives:**\n\n- Estimate the **standard error** of regression coefficients.\n\n## Assessing the Accuracy of the Model\n\n- **RSE** can be considered a measure of the *lack of fit* of the model. 
\na\n- *$R^2$* statistic (also called coefficient of determination) provides an alternative that is in the form of a *proportion of the variance explained*, ranges from 0 to 1, a *good value* depends on the application.\n\n\n$$R^2 = 1 - \\frac{RSS}{TSS}$$\n\n\nwhere TSS is the *total sum of squarse*:\n\n$$TSS = \\Sigma (y_i - \\bar{y})^2$$\n\n\nQuiz: Can *$R^2$* be negative?\n\n[Answer](https://www.graphpad.com/support/faq/how-can-rsup2sup-be-negative/)\n\n## Multiple Linear Regression\n\n**Multiple linear regression** extends simple linear regression for *p* predictors:\n\n\n$$Y = \\beta_{0} + \\beta_{1}X_1 + \\beta_{2}X_2 + ... +\\beta_{p}X_p + \\epsilon_i$$\n\n\n- $\\beta_{j}$ is the *average* effect on $Y$ from $X_{j}$ holding all other predictors fixed. \n\n- Fit is once again choosing the $\\beta_{j}$ that minimizes the RSS.\n\n- Example in book shows that although fitting *sales* against *newspaper* alone indicated a significant slope (0.055 +- 0.017), when you include *radio* in a multiple regression, *newspaper* no longer has any significant effect. (-0.001 +- 0.006) \n\n### Important Questions\n\n1. *Is at least one of the predictors $X_1$, $X_2$, ... , $X_p$ useful in predicting\nthe response?*\n\n F statistic close to 1 when there is no relationship, otherwise greater then 1.\n\n\n$$F = \\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$\n\n\n2. *Do all the predictors help to explain $Y$ , or is only a subset of the\npredictors useful?*\n\n p-values can help identify important predictors, but it is possible to be mislead by this especially with large number of predictors. Variable selection methods include Forward selection, backward selection and mixed. Topic is continued in Chapter 6.\n\n3. *How well does the model fit the data?*\n\n **$R^2$** still gives *proportion of the variance explained*, so look for values \"close\" to 1. Can also look at **RSE** which is generalized for multiple regression as:\n \n\n$$RSE = \\sqrt{\\frac{1}{n-p-1}RSS}$$\n\n\n4. 
*Given a set of predictor values, what response value should we predict,\nand how accurate is our prediction?* \n\n Three sets of uncertainty in predictions:\n \n * Uncertainty in the estimates of $\\beta_i$\n * Model bias\n * Irreducible error $\\epsilon$\n\n## Qualitative Predictors\n\n* Dummy variables: if there are $k$ levels, introduce $k-1$ dummy variables which are equal to one (\"one hot\") when the underlying qualitative predictor takes that value. For example if there are 3 levels, introduce two new dummy variables and fit the model:\n\n\n$$y_i = \\beta_0 + \\beta_1 x_{i1} + \\beta_2 x_{i2} + \\epsilon_i$$\n\n\n| Qualitative Predicitor | $x_{i1}$ | $x_{i2}$ |\n| ---------------------- |:--------:|:--------:|\n| level 0 (baseline) | 0 | 0 |\n| level 1 | 1 | 0 |\n| level 2 | 0 | 1 |\n\n* Coefficients are interpreted the average effect relative to the baseline.\n\n* Alternative is to use index variables, a different coefficient for each level:\n\n\n$$y_i = \\beta_{0 1} + \\beta_{0 2} +\\beta_{0 3} + \\epsilon_i$$\n\n\n## Extensions\n\n- Interaction / Synergy effects\n \n Include a product term to account for synergy where one changes in one variable changes the association of the Y with another:\n \n\n$$Y = \\beta_{0} + \\beta_{1}X_1 + \\beta_{2}X_2 + \\beta_{3}X_1 X_2 + \\epsilon_i$$\n\n\n- Non-linear relationships (e.g. polynomial fits)\n\n\n$$Y = \\beta_{0} + \\beta_{1}X + \\beta_{2}X^2 + ... \\beta_{n}X^n + \\epsilon_i$$\n\n\n## Potential Problems\n\n1. *Non-linear relationships* \n\n Residual plots are useful tool to see if any remaining trends exist. If so consider fitting transformation of the data. \n \n2. *Correlation of Error Terms*\n\n Linear regression assumes that the error terms $\\epsilon_i$ are uncorrelated. Residuals may indicate that this is not correct (obvious *tracking* in the data). One could also look at the autocorrelation of the residuals. What to do about it?\n \n3. 
*Non-constant variance of error terms*\n\n Again this can be revealed by examining the residuals. Consider transformation of the predictors to remove non-constant variance. The figure below shows residuals demonstrating non-constant variance, and shows this being mitigated to a great extent by log transforming the data.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.11](images/fig3_11.png){width=100%}\n:::\n:::\n\n\n4. *Outliers*\n\n - Outliers are points with for which $y_i$ is far from value predicted by the model (including irreducible error). See point labeled '20' in figure 3.13.\n - Detect outliers by plotting studentized residuals (residual $e_i$ divided by the estimated error) and look for residuals larger then 3 standard deviations in absolute value.\n - An outlier may not effect the fit much but can have dramatic effect on the **RSE**. \n - Often outliers are mistakes in data collection and can be removed, but could also be an indicator of a deficient model. \n\n5. *High Leverage Points* \n\n - These are points with unusual values of $x_i$. Examples is point labeled '41' in figure 3.13.\n - These points can have large impact on the fit, as in the example, including point 41 pulls slope up significantly.\n - Use *leverage statistic* to identify high leverage points, which can be hard to identify in multiple regression.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.13](images/fig3_13.png){width=100%}\n:::\n:::\n\n\n6. *Collinearity*\n\n - Two or more predictor variables are closely related to one another.\n - Simple collinearity can be identified by looking at correlations between predictors. \n - Causes the standard error to grow (and p-values to grow)\n - Often can be dealt with by removing one of the highly correlated predictors or combining them. \n - *Multicollinearity* (involving 3 or more predictors) is not so easy to identify. 
Use *Variance inflation factor*, which is the ratio of the variance of $\\hat{\\beta_j}$ when fitting the full model to fitting the parameter on its own. Can be computed using the formula:\n \n\n$$VIF(\\hat{\\beta_j}) = \\frac{1}{1-R^2_{X_j|X_{-j}}}$$\n\n\nwhere $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all the other predictors.\n\n## Answers to the Marketing Plan questions\n\n1. **Is there a relationship between advertising budget and sales?**\n\n Tool: Multiple regression, look at F-statistic.\n\n2. **How strong is the relationship between advertising budget and sales?** \n\n Tool: **$R^2$** and **RSE**\n \n3. **Which media are associated with sales?**\n \n Tool: p-values for each predictor's *t-statistic*. Explored further in chapter 6.\n\n4. **How large is the association between each medium and sales?**\n\n Tool: Confidence intervals on $\\hat{\\beta_j}$\n\n5. **How accurately can we predict future sales?**\n\n Tool:: Prediction intervals for individual response, confidence intervals for average response.\n \n \n6. **Is the relationship linear?** \n\n Tool: Residual Plots\n \n7. **Is there synergy among the advertising media?** \n\n Tool: Interaction terms and associated p-vales.\n\n## Comparison of Linear Regression with K-Nearest Neighbors\n\n- This section examines the K-nearest neighbor (KNN) method (a non-parameteric method).\n- This is essentially a k-point moving average.\n- This serves to illustrate the Bias-Variance trade-off nicely.\n\n", + "markdown": "# Notes {-}\n\n## Questions to Answer\n\nRecall the `Advertising` data from **Chapter 2**. Here are a few important questions that we might seek to address:\n\n1. **Is there a relationship between advertising budget and sales?**\n2. **How strong is the relationship between advertising budget and sales?** Does knowledge of the advertising budget provide a lot of information about product sales?\n3. **Which media are associated with sales?**\n4. 
**How large is the association between each medium and sales?** For every dollar spent on advertising in a particular medium, by what amount will sales increase? \n5. **How accurately can we predict future sales?**\n6. **Is the relationship linear?** If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.\n7. **Is there synergy among the advertising media?** Or, in stats terms, is there an interaction effect?\n\n## Simple Linear Regression: Definition\n\n**Simple linear regression:** Very straightforward approach to predicting response $Y$ on predictor $X$.\n\n\n$$Y \\approx \\beta_{0} + \\beta_{1}X$$\n\n\n- Read \"$\\approx$\" as *\"is approximately modeled by.\"*\n- $\\beta_{0}$ = intercept\n- $\\beta_{1}$ = slope\n\n\n$$\\hat{y} = \\hat{\\beta}_{0} + \\hat{\\beta}_{1}x$$\n\n\n- $\\hat{\\beta}_{0}$ = our approximation of intercept\n- $\\hat{\\beta}_{1}$ = our approximation of slope\n- $x$ = sample of $X$\n- $\\hat{y}$ = our prediction of $Y$ from $x$\n- hat symbol denotes \"estimated value\" \n\n- Linear regression is a simple approach to supervised learning\n\n## Simple Linear Regression: Visualization\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For the `Advertising` data, the least squares fit for the regression of `sales` onto `TV` is shown. The fit is found by minimizing the residual sum of squares. Each grey line segment represents a residual. 
In this case a linear fit captures the essence of the relationship, although it overestimates the trend in the left of the plot.](images/fig3_1.jpg){width=100%}\n:::\n:::\n\n\n## Simple Linear Regression: Math\n\n- **RSS** = *residual sum of squares*\n\n\n$$\\mathrm{RSS} = e^{2}_{1} + e^{2}_{2} + \\ldots + e^{2}_{n}$$\n\n$$\\mathrm{RSS} = (y_{1} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{1})^{2} + (y_{2} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{2})^{2} + \\ldots + (y_{n} - \\hat{\\beta}_{0} - \\hat{\\beta}_{1}x_{n})^{2}$$\n\n$$\\hat{\\beta}_{1} = \\frac{\\sum_{i=1}^{n}{(x_{i}-\\bar{x})(y_{i}-\\bar{y})}}{\\sum_{i=1}^{n}{(x_{i}-\\bar{x})^{2}}}$$\n\n$$\\hat{\\beta}_{0} = \\bar{y} - \\hat{\\beta}_{1}\\bar{x}$$\n\n\n- $\\bar{x}$, $\\bar{y}$ = sample means of $x$ and $y$\n\n### Visualization of Fit\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Contour and three-dimensional plots of the RSS on the `Advertising` data, using `sales` as the response and `TV` as the predictor. The red dots correspond to the least squares estimates $\\\\hat\\\\beta_0$ and $\\\\hat\\\\beta_1$, given by (3.4).](images/fig3_2.jpg){width=100%}\n:::\n:::\n\n\n**Learning Objectives:**\n\n- Perform linear regression with a **single predictor variable.**\n\n## Assessing Accuracy of Coefficient Estimates\n\n\n$$Y = \\beta_{0} + \\beta_{1}X + \\epsilon$$\n\n\n- **RSE** = *residual standard error*\n- Estimate of $\\sigma$\n\n\n$$\\mathrm{RSE} = \\sqrt{\\frac{\\mathrm{RSS}}{n - 2}}$$\n\n$$\\mathrm{SE}(\\hat\\beta_0)^2 = \\sigma^2 \\left[\\frac{1}{n} + \\frac{\\bar{x}^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}\\right],\\ \\ \\mathrm{SE}(\\hat\\beta_1)^2 = \\frac{\\sigma^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}$$\n\n\n- **95% confidence interval:** a range of values such that with 95% probability, the range will contain the true unknown value of the parameter\n - If we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the 
parameter\n\n\n$$\\hat\\beta_1 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_1)$$\n\n$$\\hat\\beta_0 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_0)$$\n\n\n**Learning Objectives:**\n\n- Estimate the **standard error** of regression coefficients.\n\n## Assessing the Accuracy of the Model\n\n- **RSE** can be considered a measure of the *lack of fit* of the model. \na\n- *$R^2$* statistic (also called coefficient of determination) provides an alternative that is in the form of a *proportion of the variance explained*, ranges from 0 to 1, a *good value* depends on the application.\n\n\n$$R^2 = 1 - \\frac{RSS}{TSS}$$\n\n\nwhere TSS is the *total sum of squarse*:\n\n$$TSS = \\Sigma (y_i - \\bar{y})^2$$\n\n\nQuiz: Can *$R^2$* be negative?\n\n[Answer](https://www.graphpad.com/support/faq/how-can-rsup2sup-be-negative/)\n\n## Multiple Linear Regression\n\n**Multiple linear regression** extends simple linear regression for *p* predictors:\n\n\n$$Y = \\beta_{0} + \\beta_{1}X_1 + \\beta_{2}X_2 + ... +\\beta_{p}X_p + \\epsilon_i$$\n\n\n- $\\beta_{j}$ is the *average* effect on $Y$ from $X_{j}$ holding all other predictors fixed. \n\n- Fit is once again choosing the $\\beta_{j}$ that minimizes the RSS.\n\n- Example in book shows that although fitting *sales* against *newspaper* alone indicated a significant slope (0.055 +- 0.017), when you include *radio* in a multiple regression, *newspaper* no longer has any significant effect. (-0.001 +- 0.006) \n\n### Important Questions\n\n1. *Is at least one of the predictors $X_1$, $X_2$, ... , $X_p$ useful in predicting\nthe response?*\n\n F statistic close to 1 when there is no relationship, otherwise greater then 1.\n\n\n$$F = \\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$\n\n\n2. *Do all the predictors help to explain $Y$ , or is only a subset of the\npredictors useful?*\n\n p-values can help identify important predictors, but it is possible to be mislead by this especially with large number of predictors. 
Variable selection methods include Forward selection, backward selection and mixed. Topic is continued in Chapter 6.\n\n3. *How well does the model fit the data?*\n\n **$R^2$** still gives *proportion of the variance explained*, so look for values \"close\" to 1. Can also look at **RSE** which is generalized for multiple regression as:\n \n\n$$RSE = \\sqrt{\\frac{1}{n-p-1}RSS}$$\n\n\n4. *Given a set of predictor values, what response value should we predict,\nand how accurate is our prediction?* \n\n Three sets of uncertainty in predictions:\n \n * Uncertainty in the estimates of $\\beta_i$\n * Model bias\n * Irreducible error $\\epsilon$\n\n## Qualitative Predictors\n\n* Dummy variables: if there are $k$ levels, introduce $k-1$ dummy variables which are equal to one (\"one hot\") when the underlying qualitative predictor takes that value. For example if there are 3 levels, introduce two new dummy variables and fit the model:\n\n\n$$y_i = \\beta_0 + \\beta_1 x_{i1} + \\beta_2 x_{i2} + \\epsilon_i$$\n\n\n| Qualitative Predicitor | $x_{i1}$ | $x_{i2}$ |\n| ---------------------- |:--------:|:--------:|\n| level 0 (baseline) | 0 | 0 |\n| level 1 | 1 | 0 |\n| level 2 | 0 | 1 |\n\n* Coefficients are interpreted the average effect relative to the baseline.\n\n* Alternative is to use index variables, a different coefficient for each level:\n\n\n$$y_i = \\beta_{0 1} + \\beta_{0 2} +\\beta_{0 3} + \\epsilon_i$$\n\n\n## Extensions\n\n- Interaction / Synergy effects\n \n Include a product term to account for synergy where one changes in one variable changes the association of the Y with another:\n \n\n$$Y = \\beta_{0} + \\beta_{1}X_1 + \\beta_{2}X_2 + \\beta_{3}X_1 X_2 + \\epsilon_i$$\n\n\n- Non-linear relationships (e.g. polynomial fits)\n\n\n$$Y = \\beta_{0} + \\beta_{1}X + \\beta_{2}X^2 + ... \\beta_{n}X^n + \\epsilon_i$$\n\n\n## Potential Problems\n\n1. *Non-linear relationships* \n\n Residual plots are useful tool to see if any remaining trends exist. 
If so consider fitting transformation of the data. \n \n2. *Correlation of Error Terms*\n\n Linear regression assumes that the error terms $\\epsilon_i$ are uncorrelated. Residuals may indicate that this is not correct (obvious *tracking* in the data). One could also look at the autocorrelation of the residuals. What to do about it?\n \n3. *Non-constant variance of error terms*\n\n Again this can be revealed by examining the residuals. Consider transformation of the predictors to remove non-constant variance. The figure below shows residuals demonstrating non-constant variance, and shows this being mitigated to a great extent by log transforming the data.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.11](images/fig3_11.png){width=100%}\n:::\n:::\n\n\n4. *Outliers*\n\n - Outliers are points with for which $y_i$ is far from value predicted by the model (including irreducible error). See point labeled '20' in figure 3.13.\n - Detect outliers by plotting studentized residuals (residual $e_i$ divided by the estimated error) and look for residuals larger then 3 standard deviations in absolute value.\n - An outlier may not effect the fit much but can have dramatic effect on the **RSE**. \n - Often outliers are mistakes in data collection and can be removed, but could also be an indicator of a deficient model. \n\n5. *High Leverage Points* \n\n - These are points with unusual values of $x_i$. Examples is point labeled '41' in figure 3.13.\n - These points can have large impact on the fit, as in the example, including point 41 pulls slope up significantly.\n - Use *leverage statistic* to identify high leverage points, which can be hard to identify in multiple regression.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.13](images/fig3_13.png){width=100%}\n:::\n:::\n\n\n6. *Collinearity*\n\n - Two or more predictor variables are closely related to one another.\n - Simple collinearity can be identified by looking at correlations between predictors. 
\n - Causes the standard error to grow (and p-values to grow)\n - Often can be dealt with by removing one of the highly correlated predictors or combining them. \n - *Multicollinearity* (involving 3 or more predictors) is not so easy to identify. Use *Variance inflation factor*, which is the ratio of the variance of $\\hat{\\beta_j}$ when fitting the full model to fitting the parameter on its own. Can be computed using the formula:\n \n\n$$VIF(\\hat{\\beta_j}) = \\frac{1}{1-R^2_{X_j|X_{-j}}}$$\n\n\nwhere $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all the other predictors.\n\n## Answers to the Marketing Plan questions\n\n1. **Is there a relationship between advertising budget and sales?**\n\n Tool: Multiple regression, look at F-statistic.\n\n2. **How strong is the relationship between advertising budget and sales?** \n\n Tool: **$R^2$** and **RSE**\n \n3. **Which media are associated with sales?**\n \n Tool: p-values for each predictor's *t-statistic*. Explored further in chapter 6.\n\n4. **How large is the association between each medium and sales?**\n\n Tool: Confidence intervals on $\\hat{\\beta_j}$\n\n5. **How accurately can we predict future sales?**\n\n Tool:: Prediction intervals for individual response, confidence intervals for average response.\n \n \n6. **Is the relationship linear?** \n\n Tool: Residual Plots\n \n7. 
**Is there synergy among the advertising media?** \n\n Tool: Interaction terms and associated p-vales.\n\n## Comparison of Linear Regression with K-Nearest Neighbors\n\n- This section examines the K-nearest neighbor (KNN) method (a non-parameteric method).\n- This is essentially a k-point moving average.\n- This serves to illustrate the Bias-Variance trade-off nicely.\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/08_notes/execute-results/html.json b/_freeze/08_notes/execute-results/html.json new file mode 100644 index 0000000..17482ad --- /dev/null +++ b/_freeze/08_notes/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "ee34acabde75dba55f5aa78d7199e10e", + "result": { + "markdown": "# Notes {-}\n\n## Introduction: Tree-based methods\n\n- Involve **stratifying** or **segmenting** the predictor space into a number of simple regions\n- Are simple and useful for interpretation\n- However, basic decision trees are NOT competitive with the best supervised learning approaches in terms of prediction accuracy\n- Thus, we also discuss **bagging**, **random forests**, and **boosting** (i.e., tree-based ensemble methods) to grow multiple trees which are then combined to yield a single consensus prediction\n- These can result in dramatic improvements in prediction accuracy (but some loss of interpretability)\n- Can be applied to both regression and classification\n\n## Regression Trees\n\nFirst, let's take a look at `Hitters` dataset.\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'dplyr'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:stats':\n\n filter, lag\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:base':\n\n intersect, setdiff, setequal, union\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nRows: 322 Columns: 21\n── Column specification 
────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (4): Names, League, Division, NewLeague\ndbl (17): AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, CHmRun...\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 263 × 5\n Names Hits Years Salary log_Salary\n \n 1 -Alan Ashby 81 14 475 6.16\n 2 -Alvin Davis 130 3 480 6.17\n 3 -Andre Dawson 141 11 500 6.21\n 4 -Andres Galarraga 87 2 91.5 4.52\n 5 -Alfredo Griffin 169 11 750 6.62\n 6 -Al Newman 37 2 70 4.25\n 7 -Argenis Salazar 73 3 100 4.61\n 8 -Andres Thomas 81 2 75 4.32\n 9 -Andre Thornton 92 13 1100 7.00\n10 -Alan Trammell 159 10 517. 6.25\n# ℹ 253 more rows\n```\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/08_1_salary_data.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/08_2_basic_tree.png){width=100%}\n:::\n:::\n\n\n- For the Hitters data, a regression tree for predicting the log salary of a baseball player based on:\n\n 1. number of years that he has played in the major leagues\n 2. 
number of hits that he made in the previous year\n\n## Terminology\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/08_3_basic_tree_term.png){width=100%}\n:::\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![The three-region partition for the Hitters data set from the regression tree](images/08_4_hitters_predictor_space.png){width=100%}\n:::\n:::\n\n\n- Overall, the tree stratifies or segments the players into three regions of predictor space:\n - R1 = {X \\| Years\\< 4.5}\n - R2 = {X \\| Years\\>=4.5, Hits\\<117.5}\n - R3 = {X \\| Years\\>=4.5, Hits\\>=117.5}\n \n where R1, R2, and R3 are **terminal nodes** (leaves) and green lines (where the predictor space is split) are the **internal nodes**\n\n- The number in each leaf/terminal node is the mean of the response for the observations that fall there\n\n## Interpretation of results: regression tree (Hitters data)\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/08_2_basic_tree.png){width=100%}\n:::\n:::\n\n\n1. `Years` is the most important factor in determining `Salary`: players with less experience earn lower salaries than more experienced players\n2. Given that a player is less experienced, the number of `Hits` that he made in the previous year seems to play little role in his `Salary`\n3. But among players who have been in the major leagues for 5 or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries\n4. This is surely an over-simplification, but compared to a regression model, it is easy to display, interpret and explain\n\n## Tree-building process (regression)\n\n1. Divide the predictor space --- that is, the set of possible values for $X_1,X_2, . . . ,X_p$ --- into $J$ distinct and **non-overlapping** regions, $R_1,R_2, . . . ,R_J$\n - Regions can have ANY shape - they don't have to be boxes\n2. 
For every observation that falls into the region $R_j$, we make the same prediction: the **mean** of the response values in $R_j$\n3. The goal is to find regions (here, boxes) $R_1, . . . ,R_J$ that **minimize** the $RSS$, given by\n\n\n$$\\mathrm{RSS}=\\sum_{j=1}^{J}\\sum_{i \\in R_j}(y_i - \\hat{y}_{R_j})^2$$\n\n\nwhere $\\hat{y}_{R_j}$ is the **mean** response for the training observations within the $j$th box\n\n- Unfortunately, it is **computationally infeasible** to consider every possible partition of the feature space into $J$ boxes.\n\n## Recursive binary splitting\n\nSo, we take a top-down, greedy approach known as recursive binary splitting:\n\n- **top-down** because it begins at the top of the tree and then successively splits the predictor space\n- **greedy** because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step\n\n1. First, select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\\{X|X_j < s\\}$ and $\\{X|X_j \\geq s\\}$ leads to the greatest possible reduction in RSS\n\n## Bayesian Additive Regression Trees (BART)\n\n- BART fits an ensemble of $K$ regression trees that are updated one at a time across $B$ iterations: in the $b$th iteration, to update the $k$th tree, we subtract from each response value the predictions of all the other trees to obtain the partial residual\n\n$r_i = y_i - \\sum_{k'<k}\\hat{f}^{b}_{k'}(x_i) - \\sum_{k'>k}\\hat{f}^{b-1}_{k'}(x_i)$\n\nfor the $i$th observation, $i = 1, …, n$\n\n- Rather than fitting a new tree to this partial residual, BART chooses a perturbation to the tree from a previous iteration $\\hat{f}^{b-1}_{k}$ favoring perturbations that improve the fit to the partial residual\n- To perturb trees:\n - change the structure of the tree by adding/pruning branches\n - change the prediction in each terminal node of the tree\n- The output of BART is a collection of prediction models:\n\n$\\hat{f}^b(x) = \\sum_{k=1}^{K}\\hat{f}^b_k(x)$\n\nfor $b = 1, 2,…, B$\n\n## BART algorithm: figure\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/08_12_bart_algorithm.png){width=100%}\n:::\n:::\n\n- **Comment**: the first few prediction models obtained in the earlier iterations (known as the *burn-in* period; denoted by $L$) are typically thrown away since they tend not to provide 
very good results, like you throw away the first pancake of the batch\n\n## BART: additional details\n\n- A key element of BART is that a fresh tree is NOT fit to the current partial residual: instead, we improve the fit to the current partial residual by slightly modifying the tree obtained in the previous iteration (Step 3(a)ii)\n- This guards against overfitting since it limits how \"hard\" the data is fit in each iteration\n- Additionally, the individual trees are typically pretty small\n- BART, as the name suggests, can be viewed as a *Bayesian* approach to fitting an ensemble of trees:\n - each time a tree is randomly perturbed to fit the residuals = drawing a new tree from a *posterior* distribution\n\n## To apply BART:\n\n- We must select the number of trees $K$, the number of iterations $B$ and the number of burn-in iterations $L$\n- Typically, large values are chosen for $B$ and $K$ and a moderate value for $L$: e.g. $K$ = 200, $B$ = 1,000 and $L$ = 100\n- BART has been shown to have impressive out-of-box performance - i.e., it performs well with minimal tuning\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/data/Hitters.csv b/data/Hitters.csv new file mode 100644 index 0000000..cf86793 --- /dev/null +++ b/data/Hitters.csv @@ -0,0 +1,323 @@ +"Names","AtBat","Hits","HmRun","Runs","RBI","Walks","Years","CAtBat","CHits","CHmRun","CRuns","CRBI","CWalks","League","Division","PutOuts","Assists","Errors","Salary","NewLeague" +"-Andy Allanson",293,66,1,30,29,14,1,293,66,1,30,29,14,"A","E",446,33,20,NA,"A" +"-Alan Ashby",315,81,7,24,38,39,14,3449,835,69,321,414,375,"N","W",632,43,10,475,"N" +"-Alvin Davis",479,130,18,66,72,76,3,1624,457,63,224,266,263,"A","W",880,82,14,480,"A" +"-Andre Dawson",496,141,20,65,78,37,11,5628,1575,225,828,838,354,"N","E",200,11,3,500,"N" +"-Andres 
Galarraga",321,87,10,39,42,30,2,396,101,12,48,46,33,"N","E",805,40,4,91.5,"N" +"-Alfredo Griffin",594,169,4,74,51,35,11,4408,1133,19,501,336,194,"A","W",282,421,25,750,"A" +"-Al Newman",185,37,1,23,8,21,2,214,42,1,30,9,24,"N","E",76,127,7,70,"A" +"-Argenis Salazar",298,73,0,24,24,7,3,509,108,0,41,37,12,"A","W",121,283,9,100,"A" +"-Andres Thomas",323,81,6,26,32,8,2,341,86,6,32,34,8,"N","W",143,290,19,75,"N" +"-Andre Thornton",401,92,17,49,66,65,13,5206,1332,253,784,890,866,"A","E",0,0,0,1100,"A" +"-Alan Trammell",574,159,21,107,75,59,10,4631,1300,90,702,504,488,"A","E",238,445,22,517.143,"A" +"-Alex Trevino",202,53,4,31,26,27,9,1876,467,15,192,186,161,"N","W",304,45,11,512.5,"N" +"-Andy VanSlyke",418,113,13,48,61,47,4,1512,392,41,205,204,203,"N","E",211,11,7,550,"N" +"-Alan Wiggins",239,60,0,30,11,22,6,1941,510,4,309,103,207,"A","E",121,151,6,700,"A" +"-Bill Almon",196,43,7,29,27,30,13,3231,825,36,376,290,238,"N","E",80,45,8,240,"N" +"-Billy Beane",183,39,3,20,15,11,3,201,42,3,20,16,11,"A","W",118,0,0,NA,"A" +"-Buddy Bell",568,158,20,89,75,73,15,8068,2273,177,1045,993,732,"N","W",105,290,10,775,"N" +"-Buddy Biancalana",190,46,2,24,8,15,5,479,102,5,65,23,39,"A","W",102,177,16,175,"A" +"-Bruce Bochte",407,104,6,57,43,65,12,5233,1478,100,643,658,653,"A","W",912,88,9,NA,"A" +"-Bruce Bochy",127,32,8,16,22,14,8,727,180,24,67,82,56,"N","W",202,22,2,135,"N" +"-Barry Bonds",413,92,16,72,48,65,1,413,92,16,72,48,65,"N","E",280,9,5,100,"N" +"-Bobby Bonilla",426,109,3,55,43,62,1,426,109,3,55,43,62,"A","W",361,22,2,115,"N" +"-Bob Boone",22,10,1,4,2,1,6,84,26,2,9,9,3,"A","W",812,84,11,NA,"A" +"-Bob Brenly",472,116,16,60,62,74,6,1924,489,67,242,251,240,"N","W",518,55,3,600,"N" +"-Bill Buckner",629,168,18,73,102,40,18,8424,2464,164,1008,1072,402,"A","E",1067,157,14,776.667,"A" +"-Brett Butler",587,163,4,92,51,70,6,2695,747,17,442,198,317,"A","E",434,9,3,765,"A" +"-Bob Dernier",324,73,4,32,18,22,7,1931,491,13,291,108,180,"N","E",222,3,3,708.333,"N" +"-Bo 
Diaz",474,129,10,50,56,40,10,2331,604,61,246,327,166,"N","W",732,83,13,750,"N" +"-Bill Doran",550,152,6,92,37,81,5,2308,633,32,349,182,308,"N","W",262,329,16,625,"N" +"-Brian Downing",513,137,20,90,95,90,14,5201,1382,166,763,734,784,"A","W",267,5,3,900,"A" +"-Bobby Grich",313,84,9,42,30,39,17,6890,1833,224,1033,864,1087,"A","W",127,221,7,NA,"A" +"-Billy Hatcher",419,108,6,55,36,22,3,591,149,8,80,46,31,"N","W",226,7,4,110,"N" +"-Bob Horner",517,141,27,70,87,52,9,3571,994,215,545,652,337,"N","W",1378,102,8,NA,"N" +"-Brook Jacoby",583,168,17,83,80,56,5,1646,452,44,219,208,136,"A","E",109,292,25,612.5,"A" +"-Bob Kearney",204,49,6,23,25,12,7,1309,308,27,126,132,66,"A","W",419,46,5,300,"A" +"-Bill Madlock",379,106,10,38,60,30,14,6207,1906,146,859,803,571,"N","W",72,170,24,850,"N" +"-Bobby Meacham",161,36,0,19,10,17,4,1053,244,3,156,86,107,"A","E",70,149,12,NA,"A" +"-Bob Melvin",268,60,5,24,25,15,2,350,78,5,34,29,18,"N","W",442,59,6,90,"N" +"-Ben Oglivie",346,98,5,31,53,30,16,5913,1615,235,784,901,560,"A","E",0,0,0,NA,"A" +"-Bip Roberts",241,61,1,34,12,14,1,241,61,1,34,12,14,"N","W",166,172,10,NA,"N" +"-BillyJo Robidoux",181,41,1,15,21,33,2,232,50,4,20,29,45,"A","E",326,29,5,67.5,"A" +"-Bill Russell",216,54,0,21,18,15,18,7318,1926,46,796,627,483,"N","W",103,84,5,NA,"N" +"-Billy Sample",200,57,6,23,14,14,9,2516,684,46,371,230,195,"N","W",69,1,1,NA,"N" +"-Bill Schroeder",217,46,7,32,19,9,4,694,160,32,86,76,32,"A","E",307,25,1,180,"A" +"-Butch Wynegar",194,40,7,19,29,30,11,4183,1069,64,486,493,608,"A","E",325,22,2,NA,"A" +"-Chris Bando",254,68,2,28,26,22,6,999,236,21,108,117,118,"A","E",359,30,4,305,"A" +"-Chris Brown",416,132,7,57,49,33,3,932,273,24,113,121,80,"N","W",73,177,18,215,"N" +"-Carmen Castillo",205,57,8,34,32,9,5,756,192,32,117,107,51,"A","E",58,4,4,247.5,"A" +"-Cecil Cooper",542,140,12,46,75,41,16,7099,2130,235,987,1089,431,"A","E",697,61,9,NA,"A" +"-Chili Davis",526,146,13,71,70,84,6,2648,715,77,352,342,289,"N","W",303,9,9,815,"N" +"-Carlton 
Fisk",457,101,14,42,63,22,17,6521,1767,281,1003,977,619,"A","W",389,39,4,875,"A" +"-Curt Ford",214,53,2,30,29,23,2,226,59,2,32,32,27,"N","E",109,7,3,70,"N" +"-Cliff Johnson",19,7,0,1,2,1,4,41,13,1,3,4,4,"A","E",0,0,0,NA,"A" +"-Carney Lansford",591,168,19,80,72,39,9,4478,1307,113,634,563,319,"A","W",67,147,4,1200,"A" +"-Chet Lemon",403,101,12,45,53,39,12,5150,1429,166,747,666,526,"A","E",316,6,5,675,"A" +"-Candy Maldonado",405,102,18,49,85,20,6,950,231,29,99,138,64,"N","W",161,10,3,415,"N" +"-Carmelo Martinez",244,58,9,28,25,35,4,1335,333,49,164,179,194,"N","W",142,14,2,340,"N" +"-Charlie Moore",235,61,3,24,39,21,14,3926,1029,35,441,401,333,"A","E",425,43,4,NA,"A" +"-Craig Reynolds",313,78,6,32,41,12,12,3742,968,35,409,321,170,"N","W",106,206,7,416.667,"N" +"-Cal Ripken",627,177,25,98,81,70,6,3210,927,133,529,472,313,"A","E",240,482,13,1350,"A" +"-Cory Snyder",416,113,24,58,69,16,1,416,113,24,58,69,16,"A","E",203,70,10,90,"A" +"-Chris Speier",155,44,6,21,23,15,16,6631,1634,98,698,661,777,"N","E",53,88,3,275,"N" +"-Curt Wilkerson",236,56,0,27,15,11,4,1115,270,1,116,64,57,"A","W",125,199,13,230,"A" +"-Dave Anderson",216,53,1,31,15,22,4,926,210,9,118,69,114,"N","W",73,152,11,225,"N" +"-Doug Baker",24,3,0,1,0,2,3,159,28,0,20,12,9,"A","W",80,4,0,NA,"A" +"-Don Baylor",585,139,31,93,94,62,17,7546,1982,315,1141,1179,727,"A","E",0,0,0,950,"A" +"-Dann Bilardello",191,37,4,12,17,14,4,773,163,16,61,74,52,"N","E",391,38,8,NA,"N" +"-Daryl Boston",199,53,5,29,22,21,3,514,120,8,57,40,39,"A","W",152,3,5,75,"A" +"-Darnell Coles",521,142,20,67,86,45,4,815,205,22,99,103,78,"A","E",107,242,23,105,"A" +"-Dave Collins",419,113,1,44,27,44,12,4484,1231,32,612,344,422,"A","E",211,2,1,NA,"A" +"-Dave Concepcion",311,81,3,42,30,26,17,8247,2198,100,950,909,690,"N","W",153,223,10,320,"N" +"-Darren Daulton",138,31,8,18,21,38,3,244,53,12,33,32,55,"N","E",244,21,4,NA,"N" +"-Doug DeCinces",512,131,26,69,96,52,14,5347,1397,221,712,815,548,"A","W",119,216,12,850,"A" +"-Darrell 
Evans",507,122,29,78,85,91,18,7761,1947,347,1175,1152,1380,"A","E",808,108,2,535,"A" +"-Dwight Evans",529,137,26,86,97,97,15,6661,1785,291,1082,949,989,"A","E",280,10,5,933.333,"A" +"-Damaso Garcia",424,119,6,57,46,13,9,3651,1046,32,461,301,112,"A","E",224,286,8,850,"N" +"-Dan Gladden",351,97,4,55,29,39,4,1258,353,16,196,110,117,"N","W",226,7,3,210,"A" +"-Danny Heep",195,55,5,24,33,30,8,1313,338,25,144,149,153,"N","E",83,2,1,NA,"N" +"-Dave Henderson",388,103,15,59,47,39,6,2174,555,80,285,274,186,"A","W",182,9,4,325,"A" +"-Donnie Hill",339,96,4,37,29,23,4,1064,290,11,123,108,55,"A","W",104,213,9,275,"A" +"-Dave Kingman",561,118,35,70,94,33,16,6677,1575,442,901,1210,608,"A","W",463,32,8,NA,"A" +"-Davey Lopes",255,70,7,49,35,43,15,6311,1661,154,1019,608,820,"N","E",51,54,8,450,"N" +"-Don Mattingly",677,238,31,117,113,53,5,2223,737,93,349,401,171,"A","E",1377,100,6,1975,"A" +"-Darryl Motley",227,46,7,23,20,12,5,1325,324,44,156,158,67,"A","W",92,2,2,NA,"A" +"-Dale Murphy",614,163,29,89,83,75,11,5017,1388,266,813,822,617,"N","W",303,6,6,1900,"N" +"-Dwayne Murphy",329,83,9,50,39,56,9,3828,948,145,575,528,635,"A","W",276,6,2,600,"A" +"-Dave Parker",637,174,31,89,116,56,14,6727,2024,247,978,1093,495,"N","W",278,9,9,1041.667,"N" +"-Dan Pasqua",280,82,16,44,45,47,2,428,113,25,61,70,63,"A","E",148,4,2,110,"A" +"-Darrell Porter",155,41,12,21,29,22,16,5409,1338,181,746,805,875,"A","W",165,9,1,260,"A" +"-Dick Schofield",458,114,13,67,57,48,4,1350,298,28,160,123,122,"A","W",246,389,18,475,"A" +"-Don Slaught",314,83,13,39,46,16,5,1457,405,28,156,159,76,"A","W",533,40,4,431.5,"A" +"-Darryl Strawberry",475,123,27,76,93,72,4,1810,471,108,292,343,267,"N","E",226,10,6,1220,"N" +"-Dale Sveum",317,78,7,35,35,32,1,317,78,7,35,35,32,"A","E",45,122,26,70,"A" +"-Danny Tartabull",511,138,25,76,96,61,3,592,164,28,87,110,71,"A","W",157,7,8,145,"A" +"-Dickie Thon",278,69,3,24,21,29,8,2079,565,32,258,192,162,"N","W",142,210,10,NA,"N" +"-Denny 
Walling",382,119,13,54,58,36,12,2133,594,41,287,294,227,"N","W",59,156,9,595,"N" +"-Dave Winfield",565,148,24,90,104,77,14,7287,2083,305,1135,1234,791,"A","E",292,9,5,1861.46,"A" +"-Enos Cabell",277,71,2,27,29,14,15,5952,1647,60,753,596,259,"N","W",360,32,5,NA,"N" +"-Eric Davis",415,115,27,97,71,68,3,711,184,45,156,119,99,"N","W",274,2,7,300,"N" +"-Eddie Milner",424,110,15,70,47,36,7,2130,544,38,335,174,258,"N","W",292,6,3,490,"N" +"-Eddie Murray",495,151,17,61,84,78,10,5624,1679,275,884,1015,709,"A","E",1045,88,13,2460,"A" +"-Ernest Riles",524,132,9,69,47,54,2,972,260,14,123,92,90,"A","E",212,327,20,NA,"A" +"-Ed Romero",233,49,2,41,23,18,8,1350,336,7,166,122,106,"A","E",102,132,10,375,"A" +"-Ernie Whitt",395,106,16,48,56,35,10,2303,571,86,266,323,248,"A","E",709,41,7,NA,"A" +"-Fred Lynn",397,114,23,67,67,53,13,5589,1632,241,906,926,716,"A","E",244,2,4,NA,"A" +"-Floyd Rayford",210,37,8,15,19,15,6,994,244,36,107,114,53,"A","E",40,115,15,NA,"A" +"-Franklin Stubbs",420,95,23,55,58,37,3,646,139,31,77,77,61,"N","W",206,10,7,NA,"N" +"-Frank White",566,154,22,76,84,43,14,6100,1583,131,743,693,300,"A","W",316,439,10,750,"A" +"-George Bell",641,198,31,101,108,41,5,2129,610,92,297,319,117,"A","E",269,17,10,1175,"A" +"-Glenn Braggs",215,51,4,19,18,11,1,215,51,4,19,18,11,"A","E",116,5,12,70,"A" +"-George Brett",441,128,16,70,73,80,14,6675,2095,209,1072,1050,695,"A","W",97,218,16,1500,"A" +"-Greg Brock",325,76,16,33,52,37,5,1506,351,71,195,219,214,"N","W",726,87,3,385,"A" +"-Gary Carter",490,125,24,81,105,62,13,6063,1646,271,847,999,680,"N","E",869,62,8,1925.571,"N" +"-Glenn Davis",574,152,31,91,101,64,3,985,260,53,148,173,95,"N","W",1253,111,11,215,"N" +"-George Foster",284,64,14,30,42,24,18,7023,1925,348,986,1239,666,"N","E",96,4,4,NA,"N" +"-Gary Gaetti",596,171,34,91,108,52,6,2862,728,107,361,401,224,"A","W",118,334,21,900,"A" +"-Greg Gagne",472,118,12,63,54,30,4,793,187,14,102,80,50,"A","W",228,377,26,155,"A" +"-George 
Hendrick",283,77,14,45,47,26,16,6840,1910,259,915,1067,546,"A","W",144,6,5,700,"A" +"-Glenn Hubbard",408,94,4,42,36,66,9,3573,866,59,429,365,410,"N","W",282,487,19,535,"N" +"-Garth Iorg",327,85,3,30,44,20,8,2140,568,16,216,208,93,"A","E",91,185,12,362.5,"A" +"-Gary Matthews",370,96,21,49,46,60,15,6986,1972,231,1070,955,921,"N","E",137,5,9,733.333,"N" +"-Graig Nettles",354,77,16,36,55,41,20,8716,2172,384,1172,1267,1057,"N","W",83,174,16,200,"N" +"-Gary Pettis",539,139,5,93,58,69,5,1469,369,12,247,126,198,"A","W",462,9,7,400,"A" +"-Gary Redus",340,84,11,62,33,47,5,1516,376,42,284,141,219,"N","E",185,8,4,400,"A" +"-Garry Templeton",510,126,2,42,44,35,11,5562,1578,44,703,519,256,"N","W",207,358,20,737.5,"N" +"-Gorman Thomas",315,59,16,45,36,58,13,4677,1051,268,681,782,697,"A","W",0,0,0,NA,"A" +"-Greg Walker",282,78,13,37,51,29,5,1649,453,73,211,280,138,"A","W",670,57,5,500,"A" +"-Gary Ward",380,120,5,54,51,31,8,3118,900,92,444,419,240,"A","W",237,8,1,600,"A" +"-Glenn Wilson",584,158,15,70,84,42,5,2358,636,58,265,316,134,"N","E",331,20,4,662.5,"N" +"-Harold Baines",570,169,21,72,88,38,7,3754,1077,140,492,589,263,"A","W",295,15,5,950,"A" +"-Hubie Brooks",306,104,14,50,58,25,7,2954,822,55,313,377,187,"N","E",116,222,15,750,"N" +"-Howard Johnson",220,54,10,30,39,31,5,1185,299,40,145,154,128,"N","E",50,136,20,297.5,"N" +"-Hal McRae",278,70,7,22,37,18,18,7186,2081,190,935,1088,643,"A","W",0,0,0,325,"A" +"-Harold Reynolds",445,99,1,46,24,29,4,618,129,1,72,31,48,"A","W",278,415,16,87.5,"A" +"-Harry Spilman",143,39,5,18,30,15,9,639,151,16,80,97,61,"N","W",138,15,1,175,"N" +"-Herm Winningham",185,40,4,23,11,18,3,524,125,7,58,37,47,"N","E",97,2,2,90,"N" +"-Jesse Barfield",589,170,40,107,108,69,6,2325,634,128,371,376,238,"A","E",368,20,3,1237.5,"A" +"-Juan Beniquez",343,103,6,48,36,40,15,4338,1193,70,581,421,325,"A","E",211,56,13,430,"A" +"-Juan Bonilla",284,69,1,33,18,25,5,1407,361,6,139,98,111,"A","E",122,140,5,NA,"N" +"-John 
Cangelosi",438,103,2,65,32,71,2,440,103,2,67,32,71,"A","W",276,7,9,100,"N" +"-Jose Canseco",600,144,33,85,117,65,2,696,173,38,101,130,69,"A","W",319,4,14,165,"A" +"-Joe Carter",663,200,29,108,121,32,4,1447,404,57,210,222,68,"A","E",241,8,6,250,"A" +"-Jack Clark",232,55,9,34,23,45,12,4405,1213,194,702,705,625,"N","E",623,35,3,1300,"N" +"-Jose Cruz",479,133,10,48,72,55,17,7472,2147,153,980,1032,854,"N","W",237,5,4,773.333,"N" +"-Julio Cruz",209,45,0,38,19,42,10,3859,916,23,557,279,478,"A","W",132,205,5,NA,"A" +"-Jody Davis",528,132,21,61,74,41,6,2641,671,97,273,383,226,"N","E",885,105,8,1008.333,"N" +"-Jim Dwyer",160,39,8,18,31,22,14,2128,543,56,304,268,298,"A","E",33,3,0,275,"A" +"-Julio Franco",599,183,10,80,74,32,5,2482,715,27,330,326,158,"A","E",231,374,18,775,"A" +"-Jim Gantner",497,136,7,58,38,26,11,3871,1066,40,450,367,241,"A","E",304,347,10,850,"A" +"-Johnny Grubb",210,70,13,32,51,28,15,4040,1130,97,544,462,551,"A","E",0,0,0,365,"A" +"-Jerry Hairston",225,61,5,32,26,26,11,1568,408,25,202,185,257,"A","W",132,9,0,NA,"A" +"-Jack Howell",151,41,4,26,21,19,2,288,68,9,45,39,35,"A","W",28,56,2,95,"A" +"-John Kruk",278,86,4,33,38,45,1,278,86,4,33,38,45,"N","W",102,4,2,110,"N" +"-Jeffrey Leonard",341,95,6,48,42,20,10,2964,808,81,379,428,221,"N","W",158,4,5,100,"N" +"-Jim Morrison",537,147,23,58,88,47,10,2744,730,97,302,351,174,"N","E",92,257,20,277.5,"N" +"-John Moses",399,102,3,56,34,34,5,670,167,4,89,48,54,"A","W",211,9,3,80,"A" +"-Jerry Mumphrey",309,94,5,37,32,26,13,4618,1330,57,616,522,436,"N","E",161,3,3,600,"N" +"-Joe Orsulak",401,100,2,60,19,28,4,876,238,2,126,44,55,"N","E",193,11,4,NA,"N" +"-Jorge Orta",336,93,9,35,46,23,15,5779,1610,128,730,741,497,"A","W",0,0,0,NA,"A" +"-Jim Presley",616,163,27,83,107,32,3,1437,377,65,181,227,82,"A","W",110,308,15,200,"A" +"-Jamie Quirk",219,47,8,24,26,17,12,1188,286,23,100,125,63,"A","W",260,58,4,NA,"A" +"-Johnny Ray",579,174,7,67,78,58,6,3053,880,32,366,337,218,"N","E",280,479,5,657,"N" +"-Jeff 
Reed",165,39,2,13,9,16,3,196,44,2,18,10,18,"A","W",332,19,2,75,"N" +"-Jim Rice",618,200,20,98,110,62,13,7127,2163,351,1104,1289,564,"A","E",330,16,8,2412.5,"A" +"-Jerry Royster",257,66,5,31,26,32,14,3910,979,33,518,324,382,"N","W",87,166,14,250,"A" +"-John Russell",315,76,13,35,60,25,3,630,151,24,68,94,55,"N","E",498,39,13,155,"N" +"-Juan Samuel",591,157,16,90,78,26,4,2020,541,52,310,226,91,"N","E",290,440,25,640,"N" +"-John Shelby",404,92,11,54,49,18,6,1354,325,30,188,135,63,"A","E",222,5,5,300,"A" +"-Joel Skinner",315,73,5,23,37,16,4,450,108,6,38,46,28,"A","W",227,15,3,110,"A" +"-Jeff Stone",249,69,6,32,19,20,4,702,209,10,97,48,44,"N","E",103,8,2,NA,"N" +"-Jim Sundberg",429,91,12,41,42,57,13,5590,1397,83,578,579,644,"A","W",686,46,4,825,"N" +"-Jim Traber",212,54,13,28,44,18,2,233,59,13,31,46,20,"A","E",243,23,5,NA,"A" +"-Jose Uribe",453,101,3,46,43,61,3,948,218,6,96,72,91,"N","W",249,444,16,195,"N" +"-Jerry Willard",161,43,4,17,26,22,3,707,179,21,77,99,76,"A","W",300,12,2,NA,"A" +"-Joel Youngblood",184,47,5,20,28,18,11,3327,890,74,419,382,304,"N","W",49,2,0,450,"N" +"-Kevin Bass",591,184,20,83,79,38,5,1689,462,40,219,195,82,"N","W",303,12,5,630,"N" +"-Kal Daniels",181,58,6,34,23,22,1,181,58,6,34,23,22,"N","W",88,0,3,86.5,"N" +"-Kirk Gibson",441,118,28,84,86,68,8,2723,750,126,433,420,309,"A","E",190,2,2,1300,"A" +"-Ken Griffey",490,150,21,69,58,35,14,6126,1839,121,983,707,600,"A","E",96,5,3,1000,"N" +"-Keith Hernandez",551,171,13,94,83,94,13,6090,1840,128,969,900,917,"N","E",1199,149,5,1800,"N" +"-Kent Hrbek",550,147,29,85,91,71,6,2816,815,117,405,474,319,"A","W",1218,104,10,1310,"A" +"-Ken Landreaux",283,74,4,34,29,22,10,3919,1062,85,505,456,283,"N","W",145,5,7,737.5,"N" +"-Kevin McReynolds",560,161,26,89,96,66,4,1789,470,65,233,260,155,"N","W",332,9,8,625,"N" +"-Kevin Mitchell",328,91,12,51,43,33,2,342,94,12,51,44,33,"N","E",145,59,8,125,"N" +"-Keith Moreland",586,159,12,72,79,53,9,3082,880,83,363,477,295,"N","E",181,13,4,1043.333,"N" +"-Ken 
Oberkfell",503,136,5,62,48,83,10,3423,970,20,408,303,414,"N","W",65,258,8,725,"N" +"-Ken Phelps",344,85,24,69,64,88,7,911,214,64,150,156,187,"A","W",0,0,0,300,"A" +"-Kirby Puckett",680,223,31,119,96,34,3,1928,587,35,262,201,91,"A","W",429,8,6,365,"A" +"-Kurt Stillwell",279,64,0,31,26,30,1,279,64,0,31,26,30,"N","W",107,205,16,75,"N" +"-Leon Durham",484,127,20,66,65,67,7,3006,844,116,436,458,377,"N","E",1231,80,7,1183.333,"N" +"-Len Dykstra",431,127,8,77,45,58,2,667,187,9,117,64,88,"N","E",283,8,3,202.5,"N" +"-Larry Herndon",283,70,8,33,37,27,12,4479,1222,94,557,483,307,"A","E",156,2,2,225,"A" +"-Lee Lacy",491,141,11,77,47,37,15,4291,1240,84,615,430,340,"A","E",239,8,2,525,"A" +"-Len Matuszek",199,52,9,26,28,21,6,805,191,30,113,119,87,"N","W",235,22,5,265,"N" +"-Lloyd Moseby",589,149,21,89,86,64,7,3558,928,102,513,471,351,"A","E",371,6,6,787.5,"A" +"-Lance Parrish",327,84,22,53,62,38,10,4273,1123,212,577,700,334,"A","E",483,48,6,800,"N" +"-Larry Parrish",464,128,28,67,94,52,13,5829,1552,210,740,840,452,"A","W",0,0,0,587.5,"A" +"-Luis Rivera",166,34,0,20,13,17,1,166,34,0,20,13,17,"N","E",64,119,9,NA,"N" +"-Larry Sheets",338,92,18,42,60,21,3,682,185,36,88,112,50,"A","E",0,0,0,145,"A" +"-Lonnie Smith",508,146,8,80,44,46,9,3148,915,41,571,289,326,"A","W",245,5,9,NA,"A" +"-Lou Whitaker",584,157,20,95,73,63,10,4704,1320,93,724,522,576,"A","E",276,421,11,420,"A" +"-Mike Aldrete",216,54,2,27,25,33,1,216,54,2,27,25,33,"N","W",317,36,1,75,"N" +"-Marty Barrett",625,179,4,94,60,65,5,1696,476,12,216,163,166,"A","E",303,450,14,575,"A" +"-Mike Brown",243,53,4,18,26,27,4,853,228,23,101,110,76,"N","E",107,3,3,NA,"N" +"-Mike Davis",489,131,19,77,55,34,7,2051,549,62,300,263,153,"A","W",310,9,9,780,"A" +"-Mike Diaz",209,56,12,22,36,19,2,216,58,12,24,37,19,"N","E",201,6,3,90,"N" +"-Mariano Duncan",407,93,8,47,30,30,2,969,230,14,121,69,68,"N","W",172,317,25,150,"N" +"-Mike Easler",490,148,14,64,78,49,13,3400,1000,113,445,491,301,"A","E",0,0,0,700,"N" +"-Mike 
Fitzgerald",209,59,6,20,37,27,4,884,209,14,66,106,92,"N","E",415,35,3,NA,"N" +"-Mel Hall",442,131,18,68,77,33,6,1416,398,47,210,203,136,"A","E",233,7,7,550,"A" +"-Mickey Hatcher",317,88,3,40,32,19,8,2543,715,28,269,270,118,"A","W",220,16,4,NA,"A" +"-Mike Heath",288,65,8,30,36,27,9,2815,698,55,315,325,189,"N","E",259,30,10,650,"A" +"-Mike Kingery",209,54,3,25,14,12,1,209,54,3,25,14,12,"A","W",102,6,3,68,"A" +"-Mike LaValliere",303,71,3,18,30,36,3,344,76,3,20,36,45,"N","E",468,47,6,100,"N" +"-Mike Marshall",330,77,19,47,53,27,6,1928,516,90,247,288,161,"N","W",149,8,6,670,"N" +"-Mike Pagliarulo",504,120,28,71,71,54,3,1085,259,54,150,167,114,"A","E",103,283,19,175,"A" +"-Mark Salas",258,60,8,28,33,18,3,638,170,17,80,75,36,"A","W",358,32,8,137,"A" +"-Mike Schmidt",20,1,0,0,0,0,2,41,9,2,6,7,4,"N","E",78,220,6,2127.333,"N" +"-Mike Scioscia",374,94,5,36,26,62,7,1968,519,26,181,199,288,"N","W",756,64,15,875,"N" +"-Mickey Tettleton",211,43,10,26,35,39,3,498,116,14,59,55,78,"A","W",463,32,8,120,"A" +"-Milt Thompson",299,75,6,38,23,26,3,580,160,8,71,33,44,"N","E",212,1,2,140,"N" +"-Mitch Webster",576,167,8,89,49,57,4,822,232,19,132,83,79,"N","E",325,12,8,210,"N" +"-Mookie Wilson",381,110,9,61,45,32,7,3015,834,40,451,249,168,"N","E",228,7,5,800,"N" +"-Marvell Wynne",288,76,7,34,37,15,4,1644,408,16,198,120,113,"N","W",203,3,3,240,"N" +"-Mike Young",369,93,9,43,42,49,5,1258,323,54,181,177,157,"A","E",149,1,6,350,"A" +"-Nick Esasky",330,76,12,35,41,47,4,1367,326,55,167,198,167,"N","W",512,30,5,NA,"N" +"-Ozzie Guillen",547,137,2,58,47,12,2,1038,271,3,129,80,24,"A","W",261,459,22,175,"A" +"-Oddibe McDowell",572,152,18,105,49,65,2,978,249,36,168,91,101,"A","W",325,13,3,200,"A" +"-Omar Moreno",359,84,4,46,27,21,12,4992,1257,37,699,386,387,"N","W",151,8,5,NA,"N" +"-Ozzie Smith",514,144,0,67,54,79,9,4739,1169,13,583,374,528,"N","E",229,453,15,1940,"N" +"-Ozzie Virgil",359,80,15,45,48,63,7,1493,359,61,176,202,175,"N","W",682,93,13,700,"N" +"-Phil 
Bradley",526,163,12,88,50,77,4,1556,470,38,245,167,174,"A","W",250,11,1,750,"A" +"-Phil Garner",313,83,9,43,41,30,14,5885,1543,104,751,714,535,"N","W",58,141,23,450,"N" +"-Pete Incaviglia",540,135,30,82,88,55,1,540,135,30,82,88,55,"A","W",157,6,14,172,"A" +"-Paul Molitor",437,123,9,62,55,40,9,4139,1203,79,676,390,364,"A","E",82,170,15,1260,"A" +"-Pete O'Brien",551,160,23,86,90,87,5,2235,602,75,278,328,273,"A","W",1224,115,11,NA,"A" +"-Pete Rose",237,52,0,15,25,30,24,14053,4256,160,2165,1314,1566,"N","W",523,43,6,750,"N" +"-Pat Sheridan",236,56,6,41,19,21,5,1257,329,24,166,125,105,"A","E",172,1,4,190,"A" +"-Pat Tabler",473,154,6,61,48,29,6,1966,566,29,250,252,178,"A","E",846,84,9,580,"A" +"-Rafael Belliard",309,72,0,33,31,26,5,354,82,0,41,32,26,"N","E",117,269,12,130,"N" +"-Rick Burleson",271,77,5,35,29,33,12,4933,1358,48,630,435,403,"A","W",62,90,3,450,"A" +"-Randy Bush",357,96,7,50,45,39,5,1394,344,43,178,192,136,"A","W",167,2,4,300,"A" +"-Rick Cerone",216,56,4,22,18,15,12,2796,665,43,266,304,198,"A","E",391,44,4,250,"A" +"-Ron Cey",256,70,13,42,36,44,16,7058,1845,312,965,1128,990,"N","E",41,118,8,1050,"A" +"-Rob Deer",466,108,33,75,86,72,3,652,142,44,102,109,102,"A","E",286,8,8,215,"A" +"-Rick Dempsey",327,68,13,42,29,45,18,3949,939,78,438,380,466,"A","E",659,53,7,400,"A" +"-Rich Gedman",462,119,16,49,65,37,7,2131,583,69,244,288,150,"A","E",866,65,6,NA,"A" +"-Ron Hassey",341,110,9,45,49,46,9,2331,658,50,249,322,274,"A","E",251,9,4,560,"A" +"-Rickey Henderson",608,160,28,130,74,89,8,4071,1182,103,862,417,708,"A","E",426,4,6,1670,"A" +"-Reggie Jackson",419,101,18,65,58,92,20,9528,2510,548,1509,1659,1342,"A","W",0,0,0,487.5,"A" +"-Ricky Jones",33,6,0,2,4,7,1,33,6,0,2,4,7,"A","W",205,5,4,NA,"A" +"-Ron Kittle",376,82,21,42,60,35,5,1770,408,115,238,299,157,"A","W",0,0,0,425,"A" +"-Ray Knight",486,145,11,51,76,40,11,3967,1102,67,410,497,284,"N","E",88,204,16,500,"A" +"-Randy Kutcher",186,44,7,28,16,11,1,186,44,7,28,16,11,"N","W",99,3,1,NA,"N" +"-Rudy 
Law",307,80,1,42,36,29,7,2421,656,18,379,198,184,"A","W",145,2,2,NA,"A" +"-Rick Leach",246,76,5,35,39,13,6,912,234,12,102,96,80,"A","E",44,0,1,250,"A" +"-Rick Manning",205,52,8,31,27,17,12,5134,1323,56,643,445,459,"A","E",155,3,2,400,"A" +"-Rance Mulliniks",348,90,11,50,45,43,10,2288,614,43,295,273,269,"A","E",60,176,6,450,"A" +"-Ron Oester",523,135,8,52,44,52,9,3368,895,39,377,284,296,"N","W",367,475,19,750,"N" +"-Rey Quinones",312,68,2,32,22,24,1,312,68,2,32,22,24,"A","E",86,150,15,70,"A" +"-Rafael Ramirez",496,119,8,57,33,21,7,3358,882,36,365,280,165,"N","W",155,371,29,875,"N" +"-Ronn Reynolds",126,27,3,8,10,5,4,239,49,3,16,13,14,"N","E",190,2,9,190,"N" +"-Ron Roenicke",275,68,5,42,42,61,6,961,238,16,128,104,172,"N","E",181,3,2,191,"N" +"-Ryne Sandberg",627,178,14,68,76,46,6,3146,902,74,494,345,242,"N","E",309,492,5,740,"N" +"-Rafael Santana",394,86,1,38,28,36,4,1089,267,3,94,71,76,"N","E",203,369,16,250,"N" +"-Rick Schu",208,57,8,32,25,18,3,653,170,17,98,54,62,"N","E",42,94,13,140,"N" +"-Ruben Sierra",382,101,16,50,55,22,1,382,101,16,50,55,22,"A","W",200,7,6,97.5,"A" +"-Roy Smalley",459,113,20,59,57,68,12,5348,1369,155,713,660,735,"A","W",0,0,0,740,"A" +"-Robby Thompson",549,149,7,73,47,42,1,549,149,7,73,47,42,"N","W",255,450,17,140,"N" +"-Rob Wilfong",288,63,3,25,33,16,10,2682,667,38,315,259,204,"A","W",135,257,7,341.667,"A" +"-Reggie Williams",303,84,4,35,32,23,2,312,87,4,39,32,23,"N","W",179,5,3,NA,"N" +"-Robin Yount",522,163,9,82,46,62,13,7037,2019,153,1043,827,535,"A","E",352,9,1,1000,"A" +"-Steve Balboni",512,117,29,54,88,43,6,1750,412,100,204,276,155,"A","W",1236,98,18,100,"A" +"-Scott Bradley",220,66,5,20,28,13,3,290,80,5,27,31,15,"A","W",281,21,3,90,"A" +"-Sid Bream",522,140,16,73,77,60,4,730,185,22,93,106,86,"N","E",1320,166,17,200,"N" +"-Steve Buechele",461,112,18,54,54,35,2,680,160,24,76,75,49,"A","W",111,226,11,135,"A" +"-Shawon Dunston",581,145,17,66,68,21,2,831,210,21,106,86,40,"N","E",320,465,32,155,"N" +"-Scott 
Fletcher",530,159,3,82,50,47,6,1619,426,11,218,149,163,"A","W",196,354,15,475,"A" +"-Steve Garvey",557,142,21,58,81,23,18,8759,2583,271,1138,1299,478,"N","W",1160,53,7,1450,"N" +"-Steve Jeltz",439,96,0,44,36,65,4,711,148,1,68,56,99,"N","E",229,406,22,150,"N" +"-Steve Lombardozzi",453,103,8,53,33,52,2,507,123,8,63,39,58,"A","W",289,407,6,105,"A" +"-Spike Owen",528,122,1,67,45,51,4,1716,403,12,211,146,155,"A","W",209,372,17,350,"A" +"-Steve Sax",633,210,6,91,56,59,6,3070,872,19,420,230,274,"N","W",367,432,16,90,"N" +"-Tony Armas",16,2,0,1,0,0,2,28,4,0,1,0,0,"A","E",247,4,8,NA,"A" +"-Tony Bernazard",562,169,17,88,73,53,8,3181,841,61,450,342,373,"A","E",351,442,17,530,"A" +"-Tom Brookens",281,76,3,42,25,20,8,2658,657,48,324,300,179,"A","E",106,144,7,341.667,"A" +"-Tom Brunansky",593,152,23,69,75,53,6,2765,686,133,369,384,321,"A","W",315,10,6,940,"A" +"-Tony Fernandez",687,213,10,91,65,27,4,1518,448,15,196,137,89,"A","E",294,445,13,350,"A" +"-Tim Flannery",368,103,3,48,28,54,8,1897,493,9,207,162,198,"N","W",209,246,3,326.667,"N" +"-Tom Foley",263,70,1,26,23,30,4,888,220,9,83,82,86,"N","E",81,147,4,250,"N" +"-Tony Gwynn",642,211,14,107,59,52,5,2364,770,27,352,230,193,"N","W",337,19,4,740,"N" +"-Terry Harper",265,68,8,26,30,29,7,1337,339,32,135,163,128,"N","W",92,5,3,425,"A" +"-Toby Harrah",289,63,7,36,41,44,17,7402,1954,195,1115,919,1153,"A","W",166,211,7,NA,"A" +"-Tommy Herr",559,141,2,48,61,73,8,3162,874,16,421,349,359,"N","E",352,414,9,925,"N" +"-Tim Hulett",520,120,17,53,44,21,4,927,227,22,106,80,52,"A","W",70,144,11,185,"A" +"-Terry Kennedy",19,4,1,2,3,1,1,19,4,1,2,3,1,"N","W",692,70,8,920,"A" +"-Tito Landrum",205,43,2,24,17,20,7,854,219,12,105,99,71,"N","E",131,6,1,286.667,"N" +"-Tim Laudner",193,47,10,21,29,24,6,1136,256,42,129,139,106,"A","W",299,13,5,245,"A" +"-Tom O'Malley",181,46,1,19,18,17,5,937,238,9,88,95,104,"A","E",37,98,9,NA,"A" +"-Tom Paciorek",213,61,4,17,22,3,17,4061,1145,83,488,491,244,"A","W",178,45,4,235,"A" +"-Tony 
Pena",510,147,10,56,52,53,7,2872,821,63,307,340,174,"N","E",810,99,18,1150,"N" +"-Terry Pendleton",578,138,1,56,59,34,3,1399,357,7,149,161,87,"N","E",133,371,20,160,"N" +"-Tony Perez",200,51,2,14,29,25,23,9778,2732,379,1272,1652,925,"N","W",398,29,7,NA,"N" +"-Tony Phillips",441,113,5,76,52,76,5,1546,397,17,226,149,191,"A","W",160,290,11,425,"A" +"-Terry Puhl",172,42,3,17,14,15,10,4086,1150,57,579,363,406,"N","W",65,0,0,900,"N" +"-Tim Raines",580,194,9,91,62,78,8,3372,1028,48,604,314,469,"N","E",270,13,6,NA,"N" +"-Ted Simmons",127,32,4,14,25,12,19,8396,2402,242,1048,1348,819,"N","W",167,18,6,500,"N" +"-Tim Teufel",279,69,4,35,31,32,4,1359,355,31,180,148,158,"N","E",133,173,9,277.5,"N" +"-Tim Wallach",480,112,18,50,71,44,7,3031,771,110,338,406,239,"N","E",94,270,16,750,"N" +"-Vince Coleman",600,139,0,94,29,60,2,1236,309,1,201,69,110,"N","E",300,12,9,160,"N" +"-Von Hayes",610,186,19,107,98,74,6,2728,753,69,399,366,286,"N","E",1182,96,13,1300,"N" +"-Vance Law",360,81,5,37,44,37,7,2268,566,41,279,257,246,"N","E",170,284,3,525,"N" +"-Wally Backman",387,124,1,67,27,36,7,1775,506,6,272,125,194,"N","E",186,290,17,550,"N" +"-Wade Boggs",580,207,8,107,71,105,5,2778,978,32,474,322,417,"A","E",121,267,19,1600,"A" +"-Will Clark",408,117,11,66,41,34,1,408,117,11,66,41,34,"N","W",942,72,11,120,"N" +"-Wally Joyner",593,172,22,82,100,57,1,593,172,22,82,100,57,"A","W",1222,139,15,165,"A" +"-Wayne Krenchicki",221,53,2,21,23,22,8,1063,283,15,107,124,106,"N","E",325,58,6,NA,"N" +"-Willie McGee",497,127,7,65,48,37,5,2703,806,32,379,311,138,"N","E",325,9,3,700,"N" +"-Willie Randolph",492,136,5,76,50,94,12,5511,1511,39,897,451,875,"A","E",313,381,20,875,"A" +"-Wayne Tolleson",475,126,3,61,43,52,6,1700,433,7,217,93,146,"A","W",37,113,7,385,"A" +"-Willie Upshaw",573,144,9,85,60,78,8,3198,857,97,470,420,332,"A","E",1314,131,12,960,"A" +"-Willie Wilson",631,170,9,77,44,31,11,4908,1457,30,775,357,249,"A","W",408,4,3,1000,"A" diff --git a/images/08_10_boosting_algorithm.png 
b/images/08_10_boosting_algorithm.png new file mode 100644 index 0000000..7340d7f Binary files /dev/null and b/images/08_10_boosting_algorithm.png differ diff --git a/images/08_11_boosting_gene_exp_data.png b/images/08_11_boosting_gene_exp_data.png new file mode 100644 index 0000000..599a7c8 Binary files /dev/null and b/images/08_11_boosting_gene_exp_data.png differ diff --git a/images/08_12_bart_algorithm.png b/images/08_12_bart_algorithm.png new file mode 100644 index 0000000..7d6934f Binary files /dev/null and b/images/08_12_bart_algorithm.png differ diff --git a/images/08_1_salary_data.png b/images/08_1_salary_data.png new file mode 100644 index 0000000..663c527 Binary files /dev/null and b/images/08_1_salary_data.png differ diff --git a/images/08_2_basic_tree.png b/images/08_2_basic_tree.png new file mode 100644 index 0000000..73b6226 Binary files /dev/null and b/images/08_2_basic_tree.png differ diff --git a/images/08_3_basic_tree_term.png b/images/08_3_basic_tree_term.png new file mode 100644 index 0000000..564ada1 Binary files /dev/null and b/images/08_3_basic_tree_term.png differ diff --git a/images/08_4_hitters_predictor_space.png b/images/08_4_hitters_predictor_space.png new file mode 100644 index 0000000..c671ff0 Binary files /dev/null and b/images/08_4_hitters_predictor_space.png differ diff --git a/images/08_5_hitters_unpruned_tree.png b/images/08_5_hitters_unpruned_tree.png new file mode 100644 index 0000000..d5e5cc4 Binary files /dev/null and b/images/08_5_hitters_unpruned_tree.png differ diff --git a/images/08_6_hitters_mse.png b/images/08_6_hitters_mse.png new file mode 100644 index 0000000..71d738c Binary files /dev/null and b/images/08_6_hitters_mse.png differ diff --git a/images/08_7_classif_tree_heart.png b/images/08_7_classif_tree_heart.png new file mode 100644 index 0000000..c828ab7 Binary files /dev/null and b/images/08_7_classif_tree_heart.png differ diff --git a/images/08_8_var_importance.png b/images/08_8_var_importance.png new file mode 
100644 index 0000000..6b74a26 Binary files /dev/null and b/images/08_8_var_importance.png differ diff --git a/images/08_9_rand_forest_gene_exp.png b/images/08_9_rand_forest_gene_exp.png new file mode 100644 index 0000000..4681f30 Binary files /dev/null and b/images/08_9_rand_forest_gene_exp.png differ