DevOps Best Practices For Digital Sports Entertainment, Part Two
Chris Johnson | VP, Engineering
August 22, 2017
In our previous post, we talked about the work done to get Major League Soccer’s (MLS Digital) AWS platform into a reliably identifiable healthy state. With that in place, the platform was stable enough for us to begin making sure capacity was being used efficiently.
Remember that each club site was its own Drupal instance: MLS had chosen to host each site on a separate copy of the codebase, which gave them the flexibility to roll out updates slowly across the clubs.
The drawback was that the size of the codebase and its system requirements, multiplied across 20 sites, placed significant constraints on the infrastructure. The benefits of this setup did not actually require a distinct codebase per site; only one codebase per live version was needed. In the end, we moved to a traditional multi-site approach.
Choosing a Multi-site Architecture
Before multi-site, the deployed codebase for all 20 sites had an aggregate footprint of 2,500 MB (roughly 125 MB per site), and the RAM required by the separate site instances forced us onto M3.large instances at AWS. Switching to a traditional multi-site architecture cut our deployed footprint from 2,500 MB to approximately 150 MB.
Since all sites now shared a common codebase, PHP’s opcode cache only needed to store each compiled script once, rather than caching the same code separately for 20+ sites. This allowed us to keep more of the site in the opcode cache and to serve pages faster.
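To illustrate how one checkout of the code can serve every club, here is a minimal sketch of a Drupal multi-site mapping. The domains and directory names are hypothetical, not MLS’s actual configuration; Drupal resolves per-site directories by hostname, and sites/sites.php can make that mapping explicit.

<?php
// sites/sites.php -- hypothetical sketch of a Drupal multi-site mapping.
// Each hostname points at a per-site directory that holds only that
// site's settings.php and files; the module and theme codebase is shared.
$sites['www.club-one.example.com'] = 'club_one';
$sites['www.club-two.example.com'] = 'club_two';
// ...one entry per club, 20 in total.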
By reducing the RAM required, we were able to move from an M3.large to the newer C4.xlarge instance type, which was both cheaper and faster. Another huge performance leap came from upgrading to PHP 5.5. The PHP 5.5 upgrade required some minor code changes, mostly to how variables are referenced, and we also needed a few changes in the codebase to support the move to multi-site. Overall, these two changes gained us a 40% performance increase.
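The post doesn’t list the exact code changes, but the reference-related fixes were likely of the kind forced by PHP 5.4’s removal of call-time pass-by-reference. A hypothetical before/after, with made-up function names, looks like this:

<?php
// Hypothetical example of the kind of reference fix a PHP 5.5 upgrade
// forces: call-time pass-by-reference was removed in PHP 5.4, so the &
// must appear in the function signature, not at the call site.

// Before (fatal "Call-time pass-by-reference has been removed" on 5.4+):
//   mymodule_add_class(&$element);

// After: declare the parameter by reference and drop the & from the call.
function mymodule_add_class(array &$element) {
  $element['#attributes']['class'][] = 'club-branding';
}

$element = array('#attributes' => array('class' => array()));
mymodule_add_class($element);
print_r($element);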
Load Testing and Performance Tuning
Having addressed what we could at the architecture and system level, we turned to application-level improvements. The load tests run while investigating the health check issue had shown that the database was slow to execute certain queries under high load, which we defined as ~100 concurrent cache-busted requests to a single server.
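The real tests were driven by dedicated load-testing tools, but as a rough illustration of what a cache-busted request means, the sketch below fires ~100 concurrent requests whose unique query strings make Varnish treat each one as a miss, so every request falls through to the LEMP stack and the database. The URL is a placeholder.

<?php
// Illustrative sketch only: ~100 concurrent cache-busted requests.
$base = 'https://www.example-club-site.com/';  // placeholder URL
$mh = curl_multi_init();
$handles = array();

for ($i = 0; $i < 100; $i++) {
  // A unique query string per request bypasses the Varnish cache.
  $ch = curl_init($base . '?cachebust=' . uniqid('', TRUE));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_multi_add_handle($mh, $ch);
  $handles[] = $ch;
}

// Drive all transfers to completion.
do {
  curl_multi_exec($mh, $running);
  curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $ch) {
  printf("%d %s\n", curl_getinfo($ch, CURLINFO_HTTP_CODE),
    curl_getinfo($ch, CURLINFO_EFFECTIVE_URL));
  curl_multi_remove_handle($mh, $ch);
  curl_close($ch);
}
curl_multi_close($mh);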
We profiled the site and generated flame graphs from the data. Our lead developer analyzed the flame graphs and traced the problem to a slow Views query in Drupal. He rebuilt the feature with custom code in a Bean instead of Views, which led to massive performance improvements on those pages.
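The post doesn’t include the replacement code, so the following is only a hypothetical sketch of the general pattern: a Bean plugin (Drupal 7, Bean module) whose view() runs one targeted query instead of letting Views assemble the equivalent listing. All names here are made up.

<?php
// Hypothetical sketch of the Views-to-Bean pattern described above.
class ClubHeadlinesBean extends BeanPlugin {

  public function view($bean, $content, $view_mode = 'full', $langcode = NULL) {
    // One indexed query for the five most recent published headlines.
    $nids = db_query_range(
      'SELECT nid FROM {node} WHERE type = :type AND status = 1 ORDER BY created DESC',
      0, 5, array(':type' => 'headline')
    )->fetchCol();

    if ($nids) {
      // Hand the results to the normal node rendering pipeline.
      $content['headlines'] = node_view_multiple(node_load_multiple($nids), 'teaser');
    }
    return $content;
  }

}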
An Infrastructure Built For Traffic Spikes
In the lead-up to Decision Day, when every team in the league faced off against one another over the course of four hours, we needed to predict how the MLS sites and infrastructure would react to a large influx of traffic. We turned to our preferred load testing stack: JMeter load tests run through BlazeMeter’s excellent load testing service. We also set up New Relic and its PHP agent to pull transaction traces and aggregate PHP performance information without having to collect traces and generate flame graphs with a tool like XHProf.
We devised a set of load tests that simulated real-world access patterns to the site, including some random 404s and other flows designed to exercise the LEMP stack and not just Varnish. As we tested, we noticed two major drags on site performance.
The MLS platform was originally architected to run in two AWS regions at any given time, with a read/write RDS database in the primary region, a read-only database in the secondary region, and latency-based routing to connect visitors to an appropriate datacenter. Since Drupal’s object cache was stored in Amazon ElastiCache, the database in the secondary region could be run read-only with a small core patch and a module called Orbital Cache Nuke that coordinated cache clears across the two regions.
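For context, pointing Drupal 7’s object cache at an ElastiCache memcached cluster is typically a settings.php change along these lines. The endpoint and module path are placeholders, not MLS’s actual configuration.

<?php
// settings.php (excerpt): route Drupal 7 cache bins to memcached via the
// memcache module, with a placeholder ElastiCache configuration endpoint.
$conf['cache_backends'][] = 'sites/all/modules/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';

// Keep the form cache in the database; it does not tolerate eviction well.
$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';

$conf['memcache_servers'] = array(
  'my-cluster.abc123.cfg.use1.cache.amazonaws.com:11211' => 'default',  // placeholder endpoint
);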
We also consulted with MLS Digital stakeholders and determined that serving the site out of two regions simultaneously was not a priority for their business, so long as the secondary region remained available for disaster recovery. As we dug into slow transactions, we discovered that a considerable amount of PHP execution time was being spent in the Autoslave driver.
Autoslave works by providing its own database driver and a default cache class that wraps the other cache classes. In the course of testing we discovered that any time the memcache driver had a miss, Autoslave would recalculate the list of which tables to fetch from which server, at a considerable cost in CPU. In some traces, Autoslave accounted for 60% or more of the CPU time spent generating the request. As a band-aid, we moved to a larger ElastiCache instance, which reduced the number of memcache misses triggering Autoslave’s expensive recalculation.
Even worse, the read/write split we were getting from Autoslave was simply not very high, usually hovering around 5% of queries sent to the read-only replica server. Given this low split percentage, we decided to run another load test with Autoslave disabled. The throughput more than tripled on the same hardware, and we knew that disabling Autoslave was the correct next move.
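In Drupal 7 terms, disabling Autoslave roughly amounts to going back to a single read/write connection on core’s mysql driver, with no replica-aware wrapper in the request path. A hypothetical settings.php excerpt, with all values as placeholders, might look like this:

<?php
// settings.php (excerpt): hypothetical "Autoslave disabled" configuration,
// i.e. a single read/write connection using core's mysql driver.
$databases['default']['default'] = array(
  'driver'   => 'mysql',
  'database' => 'mls_sites',        // placeholder
  'username' => 'drupal',           // placeholder
  'password' => 'example-password', // placeholder
  'host'     => 'primary.abc123.us-east-1.rds.amazonaws.com',  // placeholder
  'prefix'   => '',
);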
The net result, as shown in New Relic, looks like this. This view was prepared by Brian Aznar, MLS’s Director of Engineering, to share the work we did to improve the MP7 platform. The event marked #3 is when we increased the size of the ElastiCache instance, and #4 is when we disabled Autoslave completely.
Here’s an additional view from DataDog, MLS’s metrics and monitoring system. Note the considerable drop in load average and CPU usage, as well as the fact that the autoscaling pool was able to stay at two instances consistently after Autoslave was disabled. Later in the month we reduced the instance size in the autoscaling pool, and two instances were still sufficient to handle the traffic.
These load tests and performance improvements actually spanned the period in which we worked on the third requirement for autoscaling: bringing capacity online quickly. I’ll cover that in the next post.
This iterative approach is typical: fixing one issue reveals the next bottleneck, or the next most valuable step for your business. The ability to bring capacity online quickly and automatically may be more important than whether that capacity is used with maximum efficiency, if it buys your team time to dig into whatever issues remain.