So you changed your website. How do you now measure success and avoid common A/B mistakes?

Writers Jouni Seppänen and Janne Sinkkonen are hands-deep in machine learning, AI development, and trying to solve the intricacies of A/B testing, too. 

Imagine for a moment that you run a webshop for bicycle components and have now decided to redesign the site. The partner you commissioned to do the work promises this new design is practically guaranteed to increase sales, improve the server latency by a bazillion percent, and will provide a great shopping experience to your customer thanks to the brand-spanking new AI recommender system. 

Such promises are nice.

But as the saying goes: in God we trust, all others must bring data.

And when it comes to figuring out whether this grand redesign of yours made any difference, you should only trust data.

Web developers are only human after all. They’re just as subject to biases as the rest of us. When you work on a project for a long time – any project, for that matter – you become so psychologically invested in it that it’s only natural to want it to be deemed a success. And as humans, we’re great at fooling ourselves into believing we’ve come out on top.

But did we?

Luckily, this challenge is not unique to website owners. In fields such as science and medicine – where quantifiable results matter and can be obscured by biases – researchers have long been seeking objective verification. Even more, they’ve also come up with useful methods, which apply to measuring the effectiveness of website changes too.

The seemingly obvious way to start is to measure your sales before and after the redesign.

Did the sales go up or down after the launch? Up means success, down means failure, right?

Unfortunately, wrong. Such changes could be caused by a myriad of reasons. If you deploy the site in the spring when people are just getting their bikes out, you could see a spike in sales and mistakenly believe it was due to your new design. If you deploy the changes when the economy is taking a downturn, you could see no change in sales because few people are looking for bike components, and conclude that the redesign was useless.

The problem here is that you compared the redesigned site in the future to the old site in the past. Any improvements from the redesign get mixed up with differences between the future and the past. If we could communicate with a parallel universe, you could compare results properly without being confounded by other developments in the timeline. Sadly, we don’t have that kind of communication technology yet, and in scientific terms, you’ve forgotten to control your experiment.
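To make the confounding concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the seasonal conversion curve, the visitor counts, and the function names are assumptions, not data from a real shop. The redesign in this toy world has no effect at all, yet a before/after comparison still reports a healthy "lift", purely because spring arrived.

```python
import random


def conversion_rate(day: int) -> float:
    """Hypothetical seasonal baseline: conversion drifts upward as spring arrives."""
    return 0.02 + 0.0002 * day


def simulate_sales(days, visitors_per_day, rng):
    """Return (visitors, sales) for the given days.

    The site design plays no role here: only the season changes the rate.
    """
    visitors = 0
    sales = 0
    for day in days:
        rate = conversion_rate(day)
        for _ in range(visitors_per_day):
            visitors += 1
            if rng.random() < rate:
                sales += 1
    return visitors, sales


rng = random.Random(42)  # fixed seed so the run is repeatable

# Days 0-29: old design. Days 30-59: new design that changes nothing.
before_visitors, before_sales = simulate_sales(range(0, 30), 2000, rng)
after_visitors, after_sales = simulate_sales(range(30, 60), 2000, rng)

before_rate = before_sales / before_visitors
after_rate = after_sales / after_visitors
apparent_lift = after_rate / before_rate - 1

print(f"before: {before_rate:.4f}  after: {after_rate:.4f}  "
      f"apparent lift: {apparent_lift:+.1%}")
```

The "lift" printed at the end belongs entirely to the calendar, not to the designer.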

If we can’t talk to a parallel universe, we might still talk to different parts of our own universe. 

Perhaps you have multiple bike shop locations. You could deploy the redesign in only some of them and compare them to each other. Your Finnish location gets a redesign while your French location keeps the old design. Now, if Finland sees an increase in sales, it’s surely the redesign.

Perhaps yes, perhaps not. Maybe spring came around earlier in France than in Finland, or maybe a Finnish competitor just raised their prices, or maybe Finns really got into cycling because a Finn just won an international cycling race.

Again, the comparison mixes two possible causes of the difference in sales: the redesign and the location. This attempt might be more useful than the previous one since historical data could at least tell you if there is a time lag between French and Finnish shops picking up sales.

This is a familiar problem to medical researchers who are investigating whether a new treatment works.

In this kind of research, the gold standard is a randomized controlled trial: You give some patients the real treatment and others a placebo (or the best existing treatment), and you make the choice in a way that is completely random.

Randomization schemes in the context of the web are often called A/B testing. 

A simple A/B test for your bike parts shop could look like this: You show the old design to some browsers and the new design to others. Then you see which one makes more sales.

You want to keep showing the same design to the same person, because if every other page is from a different design, not only will the customer be confused about your site’s identity, you yourself will be confused about how to attribute the sales. The obvious solution is to set a cookie in the browser (but in Europe, that seems to necessitate an obnoxious GDPR modal for every user of your site). Now one person, as long as they only use one browser, will see a consistent storefront.
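One common way to get a consistent storefront is to derive the arm from the visitor's identifier instead of rolling the dice on every page load. The sketch below assumes that a random, unique visitor ID is already stored in the cookie; the function name `assign_arm`, the experiment label, and the ID format are hypothetical choices made for this example.

```python
import hashlib


def assign_arm(visitor_id: str, experiment: str = "site-redesign",
               share_b: float = 0.5) -> str:
    """Deterministically map a visitor ID to arm "A" or "B".

    Hashing the ID together with an experiment label means the same
    visitor always lands in the same arm of this experiment, while a
    different experiment label reshuffles the visitors independently.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# The same (hypothetical) visitor gets the same answer on every call.
print(assign_arm("visitor-123"), assign_arm("visitor-123"))

# Across many visitors the split should come out close to 50/50.
arms = [assign_arm(f"visitor-{i}") for i in range(10_000)]
print("share of B:", arms.count("B") / len(arms))
```

Note that the scheme is only as good as the ID: if many visitors share one identifier, they all end up in the same arm.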

So far so good!

Now imagine that at the end of the testing period (say, one week) you have 50,000 clicks, 1,200 sales, and revenue of $36,000 from users of the old site (arm A of the randomized test) and 55,000 clicks, 1,100 sales, and revenue of $37,000 from the new one (arm B).

Can you decide which one is better? 

There are at least two entangled questions here. 

Question #1: What is the target of the redesign? 

Number of site visits, number of sales, amount of revenue, average profit margin per sale, or what? 

That’s a strategic decision where the buck stops with the business owner, not the web designer or a data scientist. Of course, a data scientist should help with choosing and evaluating suitable metrics, especially when direct business metrics are not available. If you’re in the business of selling products to customers for money, revenue or profits are quite natural metrics to use and can be measured from the software running the shop, but many websites justify their existence in more indirect ways – brand awareness or directing customers to other retail channels. Then you might end up following metrics such as the number of clicks or time spent on the pages. The connection between these metrics and business goals is more nebulous.
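The choice of metric matters even in our small example. The sketch below takes the week's totals from the example above and computes a few candidate metrics per arm; the derived metric names (`sales_per_click` and so on) are our own labels for this illustration, and profit margin is left out because the example gives no cost data.

```python
from dataclasses import dataclass


@dataclass
class ArmTotals:
    clicks: int
    sales: int
    revenue: float

    def metrics(self) -> dict:
        """Raw totals plus a few simple ratios derived from them."""
        return {
            "clicks": self.clicks,
            "sales": self.sales,
            "revenue": self.revenue,
            "sales_per_click": self.sales / self.clicks,
            "revenue_per_sale": self.revenue / self.sales,
            "revenue_per_click": self.revenue / self.clicks,
        }


# Totals from the one-week example in the text.
arm_a = ArmTotals(clicks=50_000, sales=1_200, revenue=36_000.0)
arm_b = ArmTotals(clicks=55_000, sales=1_100, revenue=37_000.0)

metrics_a = arm_a.metrics()
metrics_b = arm_b.metrics()

# For each metric, which arm has the larger raw value?
winners = {
    name: "A" if metrics_a[name] > metrics_b[name] else "B"
    for name in metrics_a
}

for name in metrics_a:
    print(f"{name:>18}: A={metrics_a[name]:>12.4f}  "
          f"B={metrics_b[name]:>12.4f}  larger: {winners[name]}")
```

On raw numbers, B is ahead on clicks, revenue, and revenue per sale, while A is ahead on sales, sales per click, and revenue per click. Which of these counts as "winning" is exactly the strategic decision described above.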

Question #2: How reliable is the difference between the two?

If, say, total revenue is agreed to be a good metric, arm B in the example looks like the winner. But, the data scientist says, these are just raw numbers! 

What could be wrong with raw numbers? Everything of course, but the data scientist probably means something subtler: We don’t know the underlying reality, we only have a finite sample of random people, and… random is random. More on that in a minute, but let’s first look at the procedural issues, the “everything” part.

The measurement itself could be faulty. 

Clicks and time on the page are typically not followed on the actual web server sending the pages to the client but by using tracking scripts. Some users will block these scripts in the name of privacy. Pings of some devices are not accurately synced to what the customer does. Some pages may include programming errors that prevent the tracking script from loading. Some devices will be behind slow and unreliable links where not every script gets loaded consistently. As long as all such sources of error are independent of the randomized group assignments A vs. B, glitches should occur equally on both sides.

But this assumption does a lot of hidden work. What if the redesign includes larger images and causes the slower links to time out before the tracking script is loaded? What if the programming error only occurs on the older site? 

As wet and messy as the physical environment may be, the technological environments of the web and mobile devices have pathological discontinuities. 

Patients don’t typically time out because you offer them placebo pills with too-hi-res JPEG images, nor do your bottles have syntax errors preventing the patients from taking their pills.

Also, randomization can go wrong. This seems impossible with proper virtual dice, but too often the dice are less proper, just a lazy distinction in a device script. You may, for example, accidentally end up with all Android phones in arm A, and the property of being an Android user (as opposed to an Apple user) could very well be correlated with some personality or cultural trait, or willingness to spend money on ostentatious purchases.

Speaking of Android phones, one particular randomization scheme we witnessed was based on the “Secure Android ID”, a value that was supposedly different on each phone and constant (up until a factory reset or major upgrade). But a large number of cheaper phones from one manufacturer shared an ID and therefore ended up on the same side of the randomization. Oopsy daisy!

How can we even know if we have problems like this? 

In some cases, imbalanced randomization is revealed by a metric that shouldn’t be different between the groups. For example, the number of times a new user comes to the front page shouldn’t really depend on the content they see afterward. Another technique is called the “A/A test”: run an A/B test but give both arms the same treatment, be it content, a recommendation algorithm, or whatever you are going to test.

For bonus points, run the hypothetical new algorithm on arm B, but discard its result and serve the same treatment that arm A gets, so you incur timing problems somewhat similar to the real test. If you see a significant difference between the groups, something is obviously wrong.
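A simple version of such a sanity check is to compare how many new users landed in each arm with the split you intended. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom; the function name and the user counts are hypothetical, and a 50/50 intended split is assumed.

```python
import math


def split_check_p_value(count_a: int, count_b: int,
                        expected_share_a: float = 0.5) -> float:
    """P-value for 'the observed split matches the intended split'.

    Chi-square goodness-of-fit with one degree of freedom. For one
    degree of freedom the tail probability of the chi-square
    distribution equals erfc(sqrt(chi2 / 2)).
    """
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts of new users per arm.
p_plausible = split_check_p_value(10_000, 10_050)   # close to 50/50
p_suspicious = split_check_p_value(10_000, 11_000)  # clearly lopsided

print(f"10,000 vs 10,050 users: p = {p_plausible:.3f}")
print(f"10,000 vs 11,000 users: p = {p_suspicious:.2e}")
```

A very small p-value here says nothing about which design is better. It says the assignment or the tracking is broken, and the rest of the test results should not be trusted until that is fixed.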

Next, let’s assume the procedural side is fine.

The data scientist is still paranoid about raw numbers! 

And indeed, raw numbers don’t tell how our users will react tomorrow, site-wide, for we have just yesterday’s test with small samples and a single page. The data scientist wants to catch some parts of this uncertainty, preferably all of them. 

The word significant appeared above. If you took a statistics course in college, you might remember that there are “significance tests” where you get a p-value and, if it’s smaller than a magic threshold, you have a significant difference, otherwise there is no difference.

Significance tests are a complex topic. 

But in short, they are a way to accept or reject measured differences, on the basis of the size of the differences, and the volume of data used to compute them. If you have five users on both sides, common sense makes you doubt conclusions drawn from their differences. But if users are in the millions, even a tiny measured difference naturally has weight. Significance tests are an effort to formalize these intuitions.
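As one hedged illustration of that formalization, here is a two-proportion z-test applied to the sales-per-click numbers from the example week. It is a sketch, not the one true test: it treats every click as an independent trial, which is a simplification because a single visitor produces many clicks, and it relies on a normal approximation that needs reasonably large samples. The function name is our own.

```python
import math


def two_proportion_z_test(successes_a: int, trials_a: int,
                          successes_b: int, trials_b: int):
    """Return (z, two-sided p-value) for the difference of two rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the assumption that both arms behave the same.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Sales per click in the example: 1,200 / 50,000 (A) vs 1,100 / 55,000 (B).
z_full, p_full = two_proportion_z_test(1_200, 50_000, 1_100, 55_000)
print(f"full week:     z = {z_full:.2f}, p = {p_full:.2g}")

# The same 2.4% vs 2.0% rates with only 500 clicks per arm
# (hypothetical numbers, to show the effect of data volume).
z_small, p_small = two_proportion_z_test(12, 500, 10, 500)
print(f"small sample:  z = {z_small:.2f}, p = {p_small:.2g}")
```

The rates are identical in both calls. Only the volume of data differs, and with it the weight the test is willing to give the difference.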

So, there’s quite a lot to take into account in A/B testing to actually know anything.

Keeping these basics in mind, you’ll be able to get started. More things to consider – which we’ll get to in the next chapter:

  • How long to test, and why stopping too early may be bad. Just because your favorite cyclist is in the lead when you stop watching a bike race doesn’t mean she’s going to win.
  • What could go wrong if you have multiple criteria by which to decide? Our cyclist wasn’t the fastest, but since she switched gears the fewest times, I’d pick her as the winner on quality.
  • How to improve the tests with other data? Think demographics of users or categories of items!

All this being said, don’t be discouraged! 

Start with something simple and find a data scientist to help. Pursuing the truth is one of the hardest challenges we face, but definitely a pursuit worth the effort. 

A/B testing is one way of monitoring the effectiveness and quality of deployed machine learning models. Reaktor takes part in the European Industrial Grade Machine Learning for Enterprises research program, which develops such things.
If you found yourself alternately nodding in agreement, wishing that more people understood this, or shaking your head because we omitted something everyone should know about A/B testing, head over to our careers page. We have just the position for you, as you are.
