How to Scale from 0 to 30 QPS in 2 Weeks
The Problem
When running Bask, one of our early corporate partners required us to agree to a pretty aggressive SLA in order to sign on for a 1-month test of our recommendation engine. The key requirement: sustained throughput of 30 queries per second (QPS).
Before this integration, we had only pushed our recommendation engine to 0.5 QPS (one query every 2 seconds) in a production environment. We needed to reach 30 QPS, and we had only 2 weeks to get it done.
The Solution
We had built our environment on the AWS (Amazon Web Services) platform, so we could throw compute power at the problem. Our partner was footing the bill, and although it wasn't part of the SLA, we knew we couldn't hit them with a $5,000 bill each month. Our solution had two components:
Simplifying the Software Stack
Before this deployment, we had performance firmly under control, with average response times under 500ms. But our stress testing showed that under load those averages went way up, and 30 QPS certainly represented load. Our analysis showed the latency was due primarily to two bottlenecks in our stack.
First, a quick explanation of the original stack. Our recommendation engine was built in Python and exposed as an API on the web. Requests were proxied from Apache through to PHP, which would grab some data out of MySQL and pass the relevant data to Python, which then sent our personalized recommendations back up the stack.
From our analysis, we knew we needed to drop the PHP layer. We also wanted to replace MySQL with something faster. So we wired Apache directly to Python (using mod_wsgi) and replaced MySQL with a Redis cache. We now had the leanest stack we were comfortable deploying in production.
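To make the lean stack concrete, here is a minimal sketch of what the request path looks like once Apache hands requests straight to a WSGI callable that serves from a cache. The `recommend` function and the dict-based cache are hypothetical stand-ins (in production the cache would be a Redis client with the same get/set shape); this is an illustration of the pattern, not our actual code.

```python
import json

cache = {}  # stand-in for a Redis client (same get/set usage pattern)

def recommend(user_id):
    # Placeholder for the real recommendation engine.
    return ["item-1", "item-2"]

def application(environ, start_response):
    """WSGI entry point: mod_wsgi calls this directly, no PHP in between."""
    user_id = environ.get("QUERY_STRING", "").removeprefix("user=")
    cached = cache.get(user_id)
    if cached is None:
        # Cache miss: compute recommendations once, then serve from cache.
        cached = json.dumps(recommend(user_id))
        cache[user_id] = cached
    start_response("200 OK", [("Content-Type", "application/json")])
    return [cached.encode()]
```

With mod_wsgi, pointing `WSGIScriptAlias` at the file containing `application` is all the plumbing Apache needs, which is what let us delete the PHP layer entirely.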
Auto Scaling - 2 Ways
We then went to work setting up the most efficient auto scaling infrastructure we could in AWS. We knew our partner's traffic was highly time-sensitive (this was an online food ordering site, after all), so most of the traffic came at lunch and dinner time. We therefore set up time-based scaling, automatically doubling capacity from 11am to 2pm and again from 5pm until 10pm.
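On AWS this kind of schedule maps onto Auto Scaling scheduled actions. The fragment below sketches the lunch window; the group name and capacity numbers are illustrative, not our actual values, and the dinner window (5pm to 10pm) would be two more actions of the same shape.

```shell
# Double capacity at 11am, return to baseline at 2pm (recurrence is in UTC).
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name rec-engine-asg \
  --scheduled-action-name lunch-scale-up \
  --recurrence "0 11 * * *" \
  --desired-capacity 4

aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name rec-engine-asg \
  --scheduled-action-name lunch-scale-down \
  --recurrence "0 14 * * *" \
  --desired-capacity 2
```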
In addition to the time-based scaling, we set up CPU-based scaling, configured to fire up new instances whenever the average CPU load across all running machines exceeded 80%.
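The trigger condition itself is simple enough to state in a few lines. This sketch mirrors the policy described above; in AWS the averaging is done by CloudWatch and the threshold lives in a scaling policy, so the metric-collection side is assumed and not shown.

```python
CPU_THRESHOLD = 80.0  # percent, matching the policy described above

def should_scale_out(cpu_loads):
    """Return True when the fleet-wide average CPU exceeds the threshold.

    cpu_loads: per-instance CPU utilization percentages.
    """
    if not cpu_loads:
        return False  # no running instances reporting yet
    return sum(cpu_loads) / len(cpu_loads) > CPU_THRESHOLD
```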
The Result
After implementing the solution outlined above, we saw 100% uptime during the 1-month test, peaking at 35 QPS with an average response time of 360ms. The infrastructure cost $1,200 for the month. Despite early concern that our small team couldn't scale to their needs, our partner was hugely impressed by how resilient our product was under their traffic.