Updated: Jul 12, 2020
A/B testing is a very interesting concept, and I finally got the chance to write a few lines about it. A/B testing (a.k.a. split testing) is a process for identifying which of two variants (A vs. B) performs better. It has predominantly been used in web and user-interaction studies: two variants of the same web page are shown to different visitors at the same time, and the results are then compared using statistical tests.
Let's say there are two variants of a web page with slightly different layouts, and the business wants to understand which version is better (e.g. more appealing or attractive to customers).
Using the A/B testing concept, both variants go live and web traffic is redirected randomly (50/50) to A or B. Then one metric, let's say conversion rate, is measured from the actions taken or clicks made on the website.
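To make the random 50/50 split concrete, here is a minimal simulation sketch. The function names (`run_split`, `conversion_rate`), the visitor count and the "true" conversion rates of 10% and 12% are all made up for illustration; a real site would log actual visits and conversions instead of simulating them.

```python
import random

def run_split(visitors, true_rates, seed=0):
    """Randomly assign each visitor to A or B (50/50) and simulate a conversion.

    `true_rates` holds hypothetical conversion probabilities per variant.
    """
    rng = random.Random(seed)
    counts = {
        "A": {"visitors": 0, "conversions": 0},
        "B": {"visitors": 0, "conversions": 0},
    }
    for _ in range(visitors):
        # Coin flip decides which variant this visitor sees.
        variant = "A" if rng.random() < 0.5 else "B"
        counts[variant]["visitors"] += 1
        # Simulated outcome: did the visitor convert?
        if rng.random() < true_rates[variant]:
            counts[variant]["conversions"] += 1
    return counts

def conversion_rate(stats):
    """Conversions divided by visitors (0.0 if the variant saw no traffic)."""
    if stats["visitors"] == 0:
        return 0.0
    return stats["conversions"] / stats["visitors"]

counts = run_split(10_000, {"A": 0.10, "B": 0.12})
for variant, stats in counts.items():
    print(variant, stats["visitors"], round(conversion_rate(stats), 4))
```

The two observed rates will differ from run to run (change `seed` to see this), which is exactly why a raw comparison of the two numbers is not enough on its own.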
Naively, we could compare the average conversion rates recorded over a specific period. The flaw in this approach lies in the fact that an average can sometimes be misleading and is not fully representative of the population. For example, if we pick two exactly identical populations A and B and then add one or two different outliers to each, the averages of A and B will no longer be the same number; one will be smaller than the other, even though the two populations are not really (not significantly) different.
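A tiny numeric illustration of the outlier effect, using invented numbers: the two samples below share the same base values and differ only in a single added outlier, yet their means move apart while their medians stay equal.

```python
from statistics import mean, median

# Hypothetical identical base populations (values are made up).
base = [10, 11, 9, 10, 12, 10, 11, 9]

# Add one different outlier to each copy.
a = base + [60]
b = base + [35]

mean_a, mean_b = mean(a), mean(b)
median_a, median_b = median(a), median(b)

print("mean A:", round(mean_a, 2), "mean B:", round(mean_b, 2))
print("median A:", median_a, "median B:", median_b)
```

Judging only by the means, A looks "better" than B, although the underlying populations are the same apart from one value each.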
There are many statistical tests that can help in understanding the underlying difference between A and B, such as the Z-test, t-test and Chi-square test. The number of required samples is determined based on the desired lift (the improvement gap to detect). Each statistical method has its own assumptions and minimum sample size requirements.
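As a rough sketch of how the desired lift drives the sample size, the snippet below uses the common normal-approximation formula for a two-sided, two-proportion Z-test. The function name, the default significance level (0.05) and the default power (0.80) are my assumptions, and the lift is treated as an absolute difference in conversion rate; other tests or definitions of lift would give different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion Z-test.

    baseline: conversion rate of variant A (e.g. 0.10)
    lift:     absolute improvement to detect (e.g. 0.02 means 0.10 -> 0.12)
    """
    p1 = baseline
    p2 = baseline + lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n_small_lift = sample_size_per_variant(0.10, 0.01)
n_large_lift = sample_size_per_variant(0.10, 0.05)
print(n_small_lift, n_large_lift)
```

The smaller the lift we want to detect, the more samples each variant needs, which is why the experiment duration has to be planned before the test starts.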
In the machine learning (ML) field, A/B testing has another use: improving the performance of our models.
Let's say we have two ML models that have been trained and validated on the same data set. The overall performance of both models is very close, and we cannot anticipate which model will perform better in production, since the real data might have a different distribution or characteristics than the training data.
The A/B test paradigm can be used here to run both models in parallel, randomly applying A or B to each piece of incoming data. The experiment should run long enough to satisfy the minimum sample size of the underlying statistical test (which depends on the lift (sensitivity), the power and the type of test). Then the accuracy of both models should be compared statistically to identify the more accurate model.
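In this scenario, the null hypothesis is that the two models perform the same. We reject it, and conclude that the performance is significantly different, only if the p-value is less than the desired significance level; otherwise we cannot claim that one model is better than the other.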
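One possible way to run this comparison is a two-proportion Z-test on the accuracy of each model, since accuracy is just the proportion of correct predictions. The sketch below is only one option among the tests mentioned earlier; the function name and the counts in the usage lines are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided Z-test for the difference between two accuracies.

    Returns (z, p_value). Null hypothesis: both models have the same accuracy.
    """
    acc_a = correct_a / n_a
    acc_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both models are always right or always wrong: no evidence of a difference.
        return 0.0, 1.0
    z = (acc_b - acc_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical production counts: model A got 850 of 1000 right, model B 900 of 1000.
z, p_value = two_proportion_z_test(850, 1000, 900, 1000)
alpha = 0.05  # assumed significance level
print("z =", round(z, 2), "p =", round(p_value, 4))
print("reject null hypothesis" if p_value < alpha else "cannot reject null hypothesis")
```

If the p-value comes out above the chosen significance level, the honest conclusion is that the experiment did not show a difference, not that the models are proven equal.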