Tuesday, October 27, 2015

Lessons learned from taking an online sales website live

We recently finished migrating a website for a large insurer from PHP to ASP.NET MVC. The website is the primary source of new business for the insurer, which means that any problem with the site may result in lost sales.

The process we followed basically consisted of working through the old code base line by line and converting it to the equivalent .NET code. In some cases it was simply easier to run the old site, grab the generated HTML and work backwards from there.

We worked very hard for 3 months and I personally spent many weekends and 18-hour days on the project. In the last two days before going live someone made an innocent-looking change to the code that went unnoticed...

When we finally took the site live, it broke spectacularly in production. We ended up having to switch back to the legacy site, and the client was fairly unhappy.

Jumping forward a few days, we managed to get the site running and all seemed well until we noticed that the share of visitors reaching the website who actually bought something (the conversion rate) was a lot lower than on the legacy site. We threw as many people as we could find at the site as testers but we just couldn't see anything wrong.

In a last desperate attempt, before giving up, we installed Inspectlet on the site. Inspectlet allows you to play back the session of a user and see what they saw when using your site. Within an hour we managed to spot a recurring pattern of behaviour in a certain group of users. We fixed the bug and suddenly the conversion rate was back to normal.

What we did right

Deploy at any time of day

It is crucial that you have the ability to update your site as many times in a day as you want and at any time of day. Sometimes we deployed 10 times a day as we fixed minor bugs.

You also need the ability to roll back if you get it wrong. Previously we had adopted an approach of only ever rolling forward: just fix any problems and move on. On a high-traffic public site you need to be able to recover very quickly, before you go and find that bug.

Using a VIP swap on Azure cloud services gave us the ability to do deployments with zero downtime. In order to be able to do this we didn't use session state at all. This meant our users were not kicked off the site every time we deployed. If you need to keep track of logged-in users, you can use JSON Web Tokens instead of user sessions.
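To illustrate why tokens survive a deployment where server-side sessions do not, here is a minimal sketch in Python (not the ASP.NET code we used) of issuing and verifying an HS256-signed JSON Web Token with only the standard library. The names SECRET, issue_token and verify_token are made up for this example, and a real site would use a maintained JWT library and keep the key in secure configuration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical signing key; every server instance must share the same key.
SECRET = b"change-me"


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    mac = hmac.new(SECRET, signing_input.encode("ascii"), hashlib.sha256)
    return _b64url(mac.digest())


def issue_token(user_id: str, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> str:
    """Return a signed token that carries the user id and an expiry time."""
    now = time.time() if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": int(now) + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(header + '.' + payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        header, payload, signature = token.split(".")
        if not hmac.compare_digest(signature, _sign(header + "." + payload)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except (ValueError, TypeError):
        return None
    if claims.get("exp", 0) < now:
        return None
    return claims.get("sub")
```

Because everything needed to identify the user travels in the token, any freshly deployed instance that holds the same key can verify it, so nothing is lost when the old instances are swapped out.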

Backwards compatible database scripts

In order to move from one version of the website to another with zero downtime, your database needs to be compatible with the old as well as the new version of the website. We ended up writing two scripts in some cases.

One script to run now. The resulting database needs to support the current version as well as the future version of the site.  This is to allow updating the database while the old version is running, rolling forward to the next version, as well as rolling back if you need to.

Another script to run later. Once you are confident that there is no chance of rolling back to the previous version, it is then safe to run the final script that breaks compatibility with the previous version.
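As a sketch of the two-script idea, the example below renames a column in two steps using Python's built-in sqlite3 module. The customers table and the phone and contact_number columns are invented for illustration, and the sketch assumes the new version of the site falls back to the old column when the new one is empty; our real scripts targeted a different database.

```python
import sqlite3

# Script to run now: additive only, so both the old and the new site keep working.
EXPAND = """
ALTER TABLE customers ADD COLUMN contact_number TEXT;
UPDATE customers SET contact_number = phone;
"""

# Script to run later: removes the old column, so the old site can no longer run.
# It first backfills rows the old site wrote after the first script ran.
CONTRACT = """
UPDATE customers SET contact_number = phone WHERE contact_number IS NULL;
CREATE TABLE customers_new (id INTEGER PRIMARY KEY, name TEXT, contact_number TEXT);
INSERT INTO customers_new (id, name, contact_number)
    SELECT id, name, contact_number FROM customers;
DROP TABLE customers;
ALTER TABLE customers_new RENAME TO customers;
"""


def column_names(db: sqlite3.Connection, table: str) -> list:
    """List the column names of a table, in definition order."""
    return [row[1] for row in db.execute(f"PRAGMA table_info({table})")]


db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
INSERT INTO customers (name, phone) VALUES ('Ann', '555-0100');
""")

# Step 1: run while the old site is still live.
db.executescript(EXPAND)
# The old site is unaware of the new column and keeps writing only `phone`.
db.execute("INSERT INTO customers (name, phone) VALUES ('Bob', '555-0101')")
after_expand = column_names(db, "customers")

# Step 2: run only once rolling back is no longer on the table.
db.executescript(CONTRACT)
after_contract = column_names(db, "customers")
rows = db.execute("SELECT name, contact_number FROM customers ORDER BY id").fetchall()
```

Between the two steps the table carries both columns, which is what makes rolling forward and rolling back equally safe.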

What we learned

Know what the business measures

We knew that total sales were important and we were monitoring them closely. What we did not know was that the business values conversion rate even more, because they pay for clicks and not only for sales. We should have known every key metric that we needed to measure, as well as the target range for each.
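A tiny sketch of what monitoring such a metric against a target range could look like. All numbers here are invented for illustration, and the function names are mine, not part of any tool we used.

```python
def conversion_rate(visitors: int, sales: int) -> float:
    """Fraction of visitors who ended up buying something."""
    if visitors == 0:
        return 0.0
    return sales / visitors


def within_target(rate: float, low: float, high: float) -> bool:
    """True when the measured rate falls inside the agreed target range."""
    return low <= rate <= high


# Hypothetical figures: same traffic, fewer sales on the new site.
legacy_rate = conversion_rate(visitors=10_000, sales=300)
new_rate = conversion_rate(visitors=10_000, sales=180)

# Hypothetical target range agreed with the business up front.
TARGET_LOW, TARGET_HIGH = 0.025, 0.035

legacy_ok = within_target(legacy_rate, TARGET_LOW, TARGET_HIGH)
new_ok = within_target(new_rate, TARGET_LOW, TARGET_HIGH)
```

With a check like this in place, a drop in conversion rate shows up as a failed target rather than something a person has to notice by eye.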

Session recordings

Some users are infinitely more stupid than you can imagine. During testing we tried to come up with every possible scenario for users interacting with the screen. Once we ran Inspectlet and saw what some of them were doing we were just shaking our heads. This led to lots of little changes to make the site easier to navigate, not to mention the subtle bugs we found.

Release to a subset of users

We should have built the necessary infrastructure to enable us to run the old and the new site side by side. We should have synced the data between the sites so that a user can jump between them at any time and just carry on. Then we should have controlled the percentage of business that goes to each site and gradually increased the volume going to the new site as our confidence grew.  This would have limited the impact of any bugs to a small number of users instead of letting all users suffer.
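One common way to control the percentage, sketched here in Python under the assumption that every visitor has some stable identifier (a cookie value, for instance), is to hash that identifier into one of 100 buckets. The names bucket_for and route are hypothetical; we never built this.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, new_site_percent: int) -> str:
    """Send the user to the new site if their bucket is below the rollout percentage."""
    return "new" if bucket_for(user_id) < new_site_percent else "legacy"
```

Because the bucket depends only on the identifier, a user keeps landing on the same site between visits, and anyone already on the new site at 10% stays there when the percentage is raised to 50%.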

Smoke tests

If I could turn back time... After the system went up in flames during the initial cutover, we wrote two functional tests that exercised the two main paths of the system. They probably only took about 4 hours to write in the end. I always meant to get to them but there was just never time... Next time I will make time, no matter what. In subsequent releases they caught so many bugs it is not even funny.
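Our real tests walked through the full purchase flow. As a much simpler sketch of the same idea, the Python below requests a few pages and checks that each contains some expected text. The paths, the expected text and the base URL in the usage note are placeholders, and the fetch parameter exists only so the check can be exercised without a live site.

```python
import urllib.request
from typing import Callable, Dict, List


def fetch_page(url: str) -> str:
    """Download a page body; urlopen raises on HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def run_smoke_tests(base_url: str, checks: Dict[str, str],
                    fetch: Callable[[str], str] = fetch_page) -> List[str]:
    """Return one failure message per broken path; an empty list means all passed."""
    failures = []
    for path, expected_text in checks.items():
        try:
            body = fetch(base_url + path)
        except Exception as error:  # any failure to load the page counts
            failures.append(f"{path}: request failed ({error})")
            continue
        if expected_text not in body:
            failures.append(f"{path}: expected text {expected_text!r} not found")
    return failures
```

Run after every deployment, for example as run_smoke_tests("https://staging.example.com", {"/quote": "Get a quote"}), and treat a non-empty result as the signal to roll back.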
