Home
Welcome to the A-B-Testing-Udacity wiki. Here I describe the analysis I did for the final project. The answers were verified on the quiz page of the course.
Suppose there are two options on a course overview page students can click:
- start free trial: Students will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.
- access course materials: Students will be able to view the videos and take the quizzes for free, but they will not receive a verified certificate.
In this experiment we tested the following change: students who clicked "start free trial" were asked how many hours per week they would dedicate to the course. If the student indicated 5 or more hours, they would be taken through the checkout process as usual. Otherwise, a warning would appear indicating that courses usually require more time. At this point, the student would have the option to continue enrolling in the free trial, or to access the course materials for free instead. This screenshot shows what the experiment looks like.
The hypothesis was that this might set clearer expectations for students upfront, reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course.
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward.
To view detailed instructions go to this link
Possible metrics (absolute practical significance level dmin in parentheses; a short sketch of how each is computed follows the list):
- Number of cookies: number of unique cookies to view the course overview page. (dmin=3000)
- Number of user-ids: number of users who enroll in the free trial. (dmin=50)
- Number of clicks: number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
- Click-through-probability: number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
- Gross conversion: number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
- Retention: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
- Net conversion: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
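As a quick illustration of how these metrics are computed, here is a small R sketch using hypothetical one-day counts (the pageview, click, and enrollment numbers match the rough daily figures quoted later in this page; the payment count is made up purely for the example):

# Hypothetical one-day counts, only to illustrate the metric definitions
pageviews   <- 40000  # unique cookies viewing the course overview page
clicks      <- 3200   # unique cookies clicking "Start free trial"
enrollments <- 660    # user-ids completing checkout
payments    <- 350    # user-ids remaining enrolled past 14 days (hypothetical)

ctp              <- clicks / pageviews      # click-through-probability
gross_conversion <- enrollments / clicks    # P(enroll | click)
retention        <- payments / enrollments  # P(pay | enroll)
net_conversion   <- payments / clicks       # P(pay | click)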
For each metric, we now discuss whether it is expected to change under the experiment, which determines whether it can serve as an invariant metric or as an evaluation metric:
- Number of cookies: this is not expected to change, since the warning message appears only after the student has already viewed the course overview page.
- Number of clicks: this is not expected to change, since the warning message appears only after clicking the "Start free trial" button.
- Click-through-probability: not expected to change, by the same argument as for the two metrics above.
- Gross conversion: by definition this is sensitive to the change we are testing, so it is a good evaluation metric.
- Net conversion: this is essential to track, since ultimately we want to test whether the warning message sets clearer expectations (reducing the number of frustrated students) without reducing the number of students who continue past the free trial and make a payment.
This spreadsheet contains rough estimates of the baseline values for each metric on a sample size of 40000 cookies per day.
To estimate the SD analytically, we first need to make an assumption about the distribution behind each metric. For the metrics considered, we can assume a binomial distribution, since each unit has just two outcomes (e.g., a student who clicks either enrolls or does not). The standard deviation of a binomial proportion is:

$$SD = \sqrt{\frac{p\,(1-p)}{N}} \qquad (1)$$
Net conversion estimate:
This metric represents the probability that a student who clicked will pay:

$$p_{net} = \frac{\#\,payments}{\#\,clicks} = 0.10931$$

We calculate the number of clicks for 5000 cookies and substitute it in the denominator of Eq.(1). We know that for 40000 cookies there are 3200 clicks, so for 5000 cookies we estimate 400 clicks. Then,

$$SD_{net} = \sqrt{\frac{0.10931\,(1-0.10931)}{400}} \approx 0.0156$$
Gross conversion estimate:
Analogously, this metric represents the probability that a student who clicked will enroll:

$$p_{gross} = \frac{\#\,enrollments}{\#\,clicks} = 0.20625$$

Then, again with 400 clicks in the denominator of Eq.(1),

$$SD_{gross} = \sqrt{\frac{0.20625\,(1-0.20625)}{400}} \approx 0.0202$$
Retention estimate:
If we had chosen retention as an evaluation metric (although it looks like a perfect candidate, we will see later why this choice would not be convenient), here is how we would compute it. It represents the probability that a student who enrolled remains enrolled past the 14-day boundary:

$$p_{retention} = \frac{\#\,payments}{\#\,enrollments} = 0.53$$

Before substituting in Eq.(1), we need the number of enrollments for 5000 cookies. We know that for 40000 cookies there were 660 enrollments, so for 5000 cookies we estimate 82.5 enrollments. Then,

$$SD_{retention} = \sqrt{\frac{0.53\,(1-0.53)}{82.5}} \approx 0.0549$$
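These three calculations can be reproduced with a few lines of R. This is only a small sketch, using the baseline figures quoted above (click-through-probability of 0.08 and 660 enrollments per 40000 pageviews):

# Analytical SD of a binomial proportion: sqrt(p * (1 - p) / N)
binom_sd <- function(p, n) sqrt(p * (1 - p) / n)

pageviews <- 5000
clicks <- pageviews * 0.08             # 400 clicks (CTP = 3200/40000)
enrollments <- pageviews * 660 / 40000 # 82.5 enrollments

sd_net <- binom_sd(0.10931, clicks)         # ~0.0156
sd_gross <- binom_sd(0.20625, clicks)       # ~0.0202
sd_retention <- binom_sd(0.53, enrollments) # ~0.0549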
Empirical vs Analytical estimate
For all these metrics, do we expect different results between the empirical and analytical variance calculations? This will be the case whenever the unit of analysis differs from the unit of diversion. To identify the unit of analysis we just need to look at the denominator of each metric. For gross and net conversion this is the number of cookies, which matches the unit of diversion, so in this case we expect the empirical and analytical variances to match. For retention, however, the unit of analysis is the user-id, so we expect the analytical variance to underestimate the empirical one. For more information on this effect you can look at this paper.
To estimate the size (number of pageviews) needed in order to see a statistically and practically significant change (if it exists), we used this online calculator.
The confidence and power we want are 95% confidence (alpha = 0.05) and 80% power (beta = 0.2).
The following estimates were obtained from the calculator:
| Metric | Baseline conversion rate | Minimum detectable effect | Sample size per variation |
|---|---|---|---|
| Gross Conversion | 20.63% | 1% | 25835 |
| Net Conversion | 10.931% | 0.75% | 27413 |
| Retention | 53% | 1% | 39115 |
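As a rough cross-check (not the method used here), base R's power.prop.test can approximate these numbers; it uses a slightly different approximation than the online calculator, so expect sample sizes in the same ballpark rather than identical values:

# Approximate per-variation sample size for gross conversion
# (baseline 20.625%, minimum detectable effect 1%, alpha = 0.05, power = 0.8)
power.prop.test(p1 = 0.20625, p2 = 0.20625 - 0.01,
                sig.level = 0.05, power = 0.8)

# Same idea for net conversion (baseline 10.931%, MDE 0.75%)
power.prop.test(p1 = 0.10931, p2 = 0.10931 - 0.0075,
                sig.level = 0.05, power = 0.8)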
We now have the sample size per variation (number of cookies clicking the "Start free trial" button). To get the number of pageviews, we need to convert it to the number of cookies visiting the course overview page. We also need to account for the two groups (control and experiment), so we double those amounts. We know that the click-through-probability is 8% and P(enroll | click) is 20.625%, so we have:
| Metric | Pageviews |
|---|---|
| Gross Conversion | 25835*2/0.08 = 645875 |
| Net Conversion | 27413*2/0.08 = 685325 |
| Retention | 39115*2/(0.20625 * 0.08) = 4741212 |
Here you can see that if we had chosen retention, we would have needed 4741212 pageviews, which would make for a very long experiment. That is why we selected gross conversion and net conversion as the evaluation metrics. We therefore need 685325 total pageviews to achieve the desired power.
To estimate the time required to collect the pageviews needed for the experiment, we first need to decide what fraction of traffic to divert to it. Since the experiment does not collect any sensitive data beyond what is already collected, it seems reasonable to divert essentially all of our traffic (fraction = 1). Given that we have 40000 pageviews per day, for 685325 pageviews we will need 685325/(1*40000) ≈ 17 days, i.e., about 2.5 weeks.
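The same arithmetic in R (sample sizes taken from the calculator output above):

# Convert per-variation sample sizes (in clicks) into total pageviews,
# accounting for two groups and a click-through-probability of 0.08
ctp <- 0.08
pageviews_gross <- 25835 * 2 / ctp                  # 645875
pageviews_net <- 27413 * 2 / ctp                    # 685325
pageviews_retention <- 39115 * 2 / (0.20625 * ctp)  # ~4741212

# Duration in days when diverting all traffic (40000 pageviews per day)
traffic_fraction <- 1
days_needed <- pageviews_net / (traffic_fraction * 40000)  # ~17.1 days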
The data to analyze is here.
Let's first load the data:
control <- read.csv("Control.csv", stringsAsFactors = FALSE)
experiment <- read.csv("Experiment.csv", stringsAsFactors = FALSE)
We expect no differences between the control and experiment groups for the invariant metrics; here we validate this. We calculate the confidence interval for counts using the following code:
#Calculate confidence interval and observed value for counts
calculate_ci_count <- function(n_control, n_experiment, alpha=0.05, p=0.5){
n <- n_control + n_experiment
z <- -qnorm(alpha/2)
#standard error for the proportion
SE <- sqrt((p * (1 - p))/n)
#margin of error
margin <- SE * z
#observed p
p_obs <- n_control/n
#Confidence interval
CI <- c(p - margin, p+margin)
return(list("p_obs"=p_obs, "Confidence Interval"=CI))
}
#expect fraction to be 0.5
ncookies_cont <- sum(control$Pageviews)
ncookies_exp <- sum(experiment$Pageviews)
calculate_ci_count(ncookies_cont, ncookies_exp)
which gives:
## $p_obs
## [1] 0.5006397
##
## $`Confidence Interval`
## [1] 0.4988204 0.5011796
We see that the p_obs falls within the CI centered at the expected value of p = 0.5, so this metric passes the sanity check.
nclicks_cont <- sum(control$Clicks)
nclicks_exp <- sum(experiment$Clicks)
calculate_ci_count(nclicks_cont, nclicks_exp)
which gives:
## $p_obs
## [1] 0.5004673
##
## $`Confidence Interval`
## [1] 0.4958846 0.5041154
Therefore, the same conclusion as before: the number of clicks passes the sanity check.
For the click-through-probability we need a different formula, since it is a proportion rather than a simple count:
calculate_ci_proportion <- function(Xcont, Xexp, Ncont, Nexp, alpha=0.05, metric="invariant"){
#Xcont - successes in control
#Xexp - successes in experiment
#Ncont - number of observations in control
#Nexp - number of observations in experiment
#
p_cont <- Xcont/Ncont
p_exp <- Xexp/Nexp
diff <- p_exp - p_cont
if(metric == "invariant"){
diff_obs <- 0
}else if(metric == "evaluation"){
diff_obs <- diff
}else{
stop(paste(metric, "is not a valid value for metric. Try 'invariant' or 'evaluation'" ))
}
p_pool <- (Xcont+Xexp)/(Ncont+Nexp)
SEpool <- sqrt(p_pool * (1- p_pool) * (1/Nexp+1/Ncont))
#alpha and z (alpha is taken from the function argument)
z <- -qnorm(alpha/2)
#margin of error
margin <- z * SEpool
CI <- c(diff_obs-margin, diff_obs+margin)
return(list("diff"=diff, "Confidence Interval"=CI))
}
calculate_ci_proportion(nclicks_cont, nclicks_exp, ncookies_cont, ncookies_exp)
## $diff
## [1] 5.662709e-05
##
## $`Confidence Interval`
## [1] -0.001295655 0.001295655
The observed difference is inside the confidence interval, therefore this metric also passes the sanity check.
#Gross conversion
control_subset <- control[complete.cases(control), ]
experiment_subset <- experiment[complete.cases(experiment), ]
Xcont <- sum(control_subset$Enrollments)
Ncont <- sum(control_subset$Clicks)
Xexp <- sum(experiment_subset$Enrollments)
Nexp <- sum(experiment_subset$Clicks)
calculate_ci_proportion(Xcont, Xexp, Ncont, Nexp, metric="evaluation")
## $diff
## [1] -0.02055487
##
## $`Confidence Interval`
## [1] -0.02912320 -0.01198655
The entire confidence interval lies below zero and below the negative practical significance boundary (-0.01). The difference in gross conversion is therefore both statistically and practically significant.
#Net Conversion
Xcont <- sum(control_subset$Payments)
Ncont <- sum(control_subset$Clicks)
Xexp <- sum(experiment_subset$Payments)
Nexp <- sum(experiment_subset$Clicks)
calculate_ci_proportion(Xcont, Xexp, Ncont, Nexp, metric="evaluation")
## $diff
## [1] -0.004873723
##
## $`Confidence Interval`
## [1] -0.011604501 0.001857055
The change in net conversion is neither statistically nor practically significant. However, because the lower bound of the confidence interval is below the negative practical significance boundary (-0.0075), there is a chance that net conversion decreased by an amount that matters to the business.
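To make the comparison with the practical significance boundaries explicit, here is a small helper. It is just one simple way of reading off the conclusions, assuming we are checking for a decrease; the dmin values are the ones listed in the metric definitions, and the confidence intervals are the ones printed above:

# Compare an effect-size confidence interval against a practical
# significance boundary dmin (checking for a decrease in the metric)
check_practical <- function(ci_lower, ci_upper, dmin){
  list(
    statistically_significant = (ci_upper < 0) || (ci_lower > 0),
    practically_significant_decrease = ci_upper < -dmin,
    possibly_practically_relevant = ci_lower < -dmin
  )
}

check_practical(-0.02912320, -0.01198655, 0.01)     # gross conversion
check_practical(-0.011604501, 0.001857055, 0.0075)  # net conversion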
Suppose we have 3 metrics, each tested at a significance level of 0.05. Then the chance of at least one false positive (FP) across all the metrics is P(FP >= 1) = 1 - 0.95^3 ≈ 0.143. Here we assumed independence between metrics, so in reality this is an overestimate. Using multiple metrics makes it easier to realize whether a significant effect is just a statistical fluctuation, but on the other hand the probability of a false positive increases with the number of metrics. One way to fix this is to use a higher confidence level for each individual metric (for a given overall significance level). Assuming independence, we compute the individual level as follows:

$$\alpha_{individual} = 1 - (1 - \alpha_{overall})^{1/n}$$
However, there is another method that is used more often in practice and which makes no assumption about the correlation between metrics: the Bonferroni correction. It is also very conservative and is guaranteed to give an overall significance level at least as small as specified, i.e., it controls false positives.
To calculate the individual significance level needed for a desired overall level with n metrics, use the following:

$$\alpha_{individual} = \frac{\alpha_{overall}}{n}$$
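For example, with an overall alpha of 0.05 and three metrics:

alpha_overall <- 0.05
n_metrics <- 3

# Individual significance level assuming independent metrics
alpha_independent <- 1 - (1 - alpha_overall)^(1/n_metrics)  # ~0.01695

# Bonferroni correction (no independence assumption, more conservative)
alpha_bonferroni <- alpha_overall / n_metrics               # ~0.01667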
The disadvantage of this method is that it is too conservative for correlated metrics: the correction comes at the cost of increasing the probability of false negatives, i.e., reducing statistical power. In our experiment we have correlated metrics, and that is why we decided not to use the Bonferroni correction.
The sign test amounts to performing a binomial test on the day-by-day differences. For gross conversion we obtain a p-value of 0.0026, which indicates statistical significance: there is a significant difference in gross conversion between the control and experiment groups.
For net conversion, the p-value is 0.6776: there is no significant difference in net conversion between the control and experiment groups.
control_subset$gross_conversion <- control_subset$Enrollments / control_subset$Clicks
control_subset$net_conversion <- control_subset$Payments / control_subset$Clicks
experiment_subset$gross_conversion <- experiment_subset$Enrollments / experiment_subset$Clicks
experiment_subset$net_conversion <- experiment_subset$Payments / experiment_subset$Clicks
sign_test_gross <- control_subset$gross_conversion - experiment_subset$gross_conversion
ammount_plus_gross <- sum(sign_test_gross > 0)
n <- length(sign_test_gross)
binom.test(ammount_plus_gross, n)
##
## Exact binomial test
##
## data: ammount_plus_gross and n
## number of successes = 19, number of trials = 23, p-value =
## 0.002599
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.6121881 0.9504924
## sample estimates:
## probability of success
## 0.826087
sign_test_net <- control_subset$net_conversion - experiment_subset$net_conversion
ammount_plus_net <- sum(sign_test_net > 0)
n <- length(sign_test_net)
binom.test(ammount_plus_net, n)
##
## Exact binomial test
##
## data: ammount_plus_net and n
## number of successes = 13, number of trials = 23, p-value = 0.6776
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3449466 0.7680858
## sample estimates:
## probability of success
## 0.5652174
Since we require all evaluation metrics to meet their criteria before launching, our experiment is already vulnerable to type II errors, and this would not be helped by using the Bonferroni correction (Wikipedia, 2016; Davies, 2010; Discussions.udacity.com, 2016).
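For reference, applying the correction to the two sign-test p-values anyway would not change the conclusions; this is just an illustration using base R's p.adjust:

# Bonferroni-adjusted p-values for the two sign tests
# (0.0026 * 2 = 0.0052 for gross conversion; 0.6776 * 2 is capped at 1)
p.adjust(c(0.0026, 0.6776), method = "bonferroni")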
From the effect-size tests we observe a significant difference in gross conversion and no significant difference in net conversion. The sign tests agree with these conclusions.
Gross conversion was reduced while net conversion stayed the same. It seems we achieved our goal of reducing the number of students who enroll and may later become frustrated, without reducing the number of students who continue past the free trial.
However, looking at the net conversion results, the lower bound of the confidence interval is below the negative practical significance boundary. This means there is a chance of a decrease in net conversion that would be relevant to the Udacity business. Given this, we should be careful and run further experiments to understand whether this risk is real.