Tuesday, January 22nd, 2013

Keeping up with +500 new PHPUnit tests per week



Let’s start with some numbers:

  • 13,500 PHPUnit tests
  • 105 processor-minutes to run them all in sequence (due to many of them being integration tests)
  • 500 new patchsets submitted to code review each week
  • 10 minutes to provide test results on each patchset in accordance with our SLA
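As a back-of-the-envelope check on those numbers, the short Python calculation below estimates the minimum parallelism the SLA implies. It assumes perfectly balanced work, no queueing or startup overhead, and that every patchset runs the full suite; the function names are made up for illustration.

```python
import math

TOTAL_PROCESSOR_MINUTES = 105  # full suite, run in sequence
SLA_MINUTES = 10               # time allowed to report results
PATCHSETS_PER_WEEK = 500


def min_parallel_workers(total_minutes, sla_minutes):
    """Lower bound on workers needed to finish within the SLA."""
    return math.ceil(total_minutes / sla_minutes)


def weekly_processor_hours(total_minutes, patchsets_per_week):
    """Processor-hours per week if every patchset runs everything."""
    return total_minutes * patchsets_per_week / 60


print(min_parallel_workers(TOTAL_PROCESSOR_MINUTES, SLA_MINUTES))           # 11
print(weekly_processor_hours(TOTAL_PROCESSOR_MINUTES, PATCHSETS_PER_WEEK))  # 875.0
```

In practice the real requirement is higher than 11 workers per patchset, because work is never perfectly balanced and several patchsets are usually in flight at once.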

Providing this level of service requires a serious commitment to our testing infrastructure. But more importantly, once that service level is achieved, how do we maintain it in the face of 500 new tests being added each week? We’ve found a number of ways to tackle this incrementally over the last few months.


The most essential component is a system we call PHPUnit Cluster Runner (PhpCR) which provides the parallelization required to run our test suite in a reasonable amount of time. It divides up our test files, spreads the work across n slave VMs, and collates the results.

We began with a very simplistic method for dividing tests between slaves, but later discovered there are considerable performance gains from improving the test distribution strategy.

Per-file invocation of PHPUnit

Background: Running PHPUnit on a directory or a list of tests executes all of those tests in sequence within a single PHP process. As a side effect, any global state changes that are not explicitly reset carry over to the next test. Having an unknown global state when a test starts is bad! A test that passes individually may fail when run in sequence with other tests because it makes assumptions about global state that the other tests don’t adhere to. And since the sequence order can determine whether these global state issues manifest, your passing tests might suddenly start failing when PhpCR distributes files differently among its slaves.

Our solution was obvious: run each test file individually with its own invocation of PHPUnit and combine the results — isolating any global state changes from affecting other test files. We hesitated to try this at first because we assumed performance would be significantly degraded, but surprisingly, we actually found a marginal increase in overall speed. Another lesson in making untested assumptions about performance!

Our switch to per-file invocation shaved 10 processor-minutes off our total run time and solved our stability problems.
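The idea of per-file invocation can be sketched in a few lines of Python. This is an illustration, not PhpCR's actual code: it assumes a phpunit binary on the PATH and uses PHPUnit's --log-junit option to write one result file per test file. The function name, the result file naming, and the injectable runner are all hypothetical.

```python
import subprocess
from pathlib import Path


def run_per_file(test_files, results_dir, runner=subprocess.run):
    """Run every test file in its own PHPUnit process.

    A fresh process per file means global state cannot leak between files.
    `runner` is injectable so the sketch can be exercised without PHPUnit.
    """
    outcomes = {}
    for index, test_file in enumerate(test_files):
        # One numbered result file per invocation, so runs never collide.
        result_path = Path(results_dir) / f"result-{index:05d}.xml"
        command = ["phpunit", "--log-junit", str(result_path), str(test_file)]
        completed = runner(command)
        # PHPUnit exits non-zero when a test fails or errors.
        outcomes[str(test_file)] = (completed.returncode == 0, result_path)
    return outcomes
```

The per-file result files can then be combined into a single report once all invocations have finished.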

Test file distribution

Our initial implementation of PHPUnit Cluster Runner divided the test files evenly between slave VMs. Unfortunately, the execution time of our test files is highly variable due to differences in the number of tests per file and the execution speed of each test. Observing output from the slave VMs made it clear that one or two slaves were getting a disproportionate amount of work and often ran significantly longer than their siblings.

By simply changing our distribution strategy to account for the number of tests in each file, much of the workload disparity was eliminated and we realized an immediate 20% decrease in execution time — a sizeable improvement for minimal effort.
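One simple way to account for the number of tests in each file is a greedy assignment: take the files from largest to smallest and always hand the next one to the slave with the fewest tests so far. The Python sketch below shows that approach. The post does not say which algorithm PhpCR uses, so treat the greedy strategy and all names here as assumptions.

```python
import heapq


def distribute(test_counts, num_slaves):
    """Assign files to slaves so that per-slave test counts stay balanced.

    test_counts maps a file name to the number of tests in that file.
    Returns one list of file names per slave.
    """
    # Heap of (tests assigned so far, slave index); lightest slave on top.
    heap = [(0, slave) for slave in range(num_slaves)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_slaves)]
    # Place the largest files first, each onto the currently lightest slave.
    ordered = sorted(test_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, count in ordered:
        load, slave = heapq.heappop(heap)
        assignments[slave].append(name)
        heapq.heappush(heap, (load + count, slave))
    return assignments
```

Compared with an even split by file count, this keeps a few very large test files from all landing on the same slave.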

Excluding unnecessary test frameworks

We’ve talked about how to make test execution faster, but of course, the easiest way to improve test performance is to not run irrelevant tests! We came up with a mapping of file types (.js, .tmpl, .php, .cs, .yaml) to test types (PHPUnit, QUnit, API) which allows us to skip certain test types on certain patchsets. For example, we are now skipping PHPUnit on patchsets that only modify .js files. This was another easy win, which gave us around 20% more hardware availability.
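A mapping like this can be as small as a lookup table. In the Python sketch below, only the rule that .js-only patchsets skip PHPUnit comes from the text above; every other entry in the table is invented for illustration, as is the decision to run everything for unrecognized file types.

```python
from pathlib import PurePosixPath

ALL_TEST_TYPES = frozenset({"PHPUnit", "QUnit", "API"})

# Hypothetical mapping: only ".js does not need PHPUnit" is from the post.
TEST_TYPES_BY_EXTENSION = {
    ".js": {"QUnit"},
    ".php": {"PHPUnit", "API"},
    ".tmpl": {"PHPUnit", "QUnit"},
    ".yaml": {"PHPUnit", "API"},
}


def select_test_types(changed_files):
    """Return the set of test types a patchset needs to run."""
    needed = set()
    for path in changed_files:
        extension = PurePosixPath(path).suffix
        # Unknown file types conservatively trigger every test type.
        needed |= TEST_TYPES_BY_EXTENSION.get(extension, ALL_TEST_TYPES)
    return needed
```

For example, select_test_types(["ui/menu.js", "ui/list.js"]) returns only QUnit, so PHPUnit is skipped for that patchset.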

Killing obsolete builds

Here’s a common scenario: a developer pushes a patchset to code review, quickly spots a problem, and then pushes a second patchset with a fix. Great! Except now we are wasting resources testing both patchsets, when we only care about the second one.

In response, whenever someone pushes a patchset, we now check for queued builds from the previous patchset and kill them. This is mostly useful when developers are swarming the system in anticipation of a release deadline. This was also a common request from developers who felt discouraged from pushing quick follow-up patchsets because of the additional strain it was putting on the test infrastructure.
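The logic amounts to a small filter over the build queue. Here is a rough Python sketch of it, using an in-memory list in place of the real build system. The Build fields and the BuildQueue class are hypothetical names, and a real implementation would also need to cancel the jobs it removes.

```python
from collections import namedtuple

# Hypothetical record: which change and patchset a queued job belongs to.
Build = namedtuple("Build", ["change_id", "patchset", "job"])


class BuildQueue:
    def __init__(self):
        self.queued = []

    def push_patchset(self, change_id, patchset, jobs):
        """Queue builds for a new patchset and drop older ones for that change.

        Returns the obsolete builds so the caller can cancel them.
        """
        obsolete = [
            build for build in self.queued
            if build.change_id == change_id and build.patchset < patchset
        ]
        # Keep everything that is not obsolete, preserving queue order.
        self.queued = [build for build in self.queued if build not in obsolete]
        self.queued.extend(Build(change_id, patchset, job) for job in jobs)
        return obsolete
```

Builds for other changes are untouched, so one developer's follow-up patchset never affects anyone else's position in the queue.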

PHPUnit cluster management in Jenkins

Don’t do it! Our initial thought was to register all the PhpCR slaves in Jenkins, maintaining Jenkins as the single source of information about our test automation VMs. But it quickly became apparent that didn’t make sense, for two reasons:

  1. Jenkins does not control the slaves. It controls the PhpCR masters and those masters control their slaves. Having the slaves registered in Jenkins is unnecessary and allows for confusion about which system is responsible for what. Each slave should have only one ‘manager’ to avoid conflicts.
  2. Managing 100+ individual VMs in the Jenkins GUI is a nightmare. Adding or moving a group of slaves between masters is likely to give someone carpal tunnel. Jenkins is great for a lot of things, but mass updates and additions are not among them.

We manage the cluster masters in Jenkins, and each cluster master reads a global configuration file to determine its slaves. This makes it easy to update / rearrange the clusters and provides clear lines of responsibility.
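To make the idea concrete, here is a Python sketch of what such a global configuration and its lookup might look like. The JSON layout, the host names, and both helper functions are invented for illustration, since the post does not describe the actual file format.

```python
import json

# Hypothetical configuration: each master lists the slaves it manages.
EXAMPLE_CONFIG = """
{
  "clusters": {
    "master-01": ["slave-001", "slave-002", "slave-003"],
    "master-02": ["slave-004", "slave-005"]
  }
}
"""


def slaves_for(master_name, config_text):
    """Return the slaves a given master is responsible for."""
    clusters = json.loads(config_text)["clusters"]
    return clusters.get(master_name, [])


def find_shared_slaves(config_text):
    """Return slaves listed under more than one master (should be none)."""
    owners = {}
    for master, slaves in json.loads(config_text)["clusters"].items():
        for slave in slaves:
            owners.setdefault(slave, []).append(master)
    return {slave: masters for slave, masters in owners.items() if len(masters) > 1}
```

Moving a group of slaves between masters then becomes a one-line edit to the file, and a check like find_shared_slaves can enforce the rule that each slave has exactly one manager.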

The Future

The steps we’ve described so far have resulted in significant improvements, allowing us to keep up with increasing demand. But there’s a lot more we can do (fortunately, since next week there will be 500 more tests!):

Coverage-based test selection

We can achieve a massive reduction in the number of tests run if we can determine which tests exercise the code that was changed. Fortunately, PHPUnit can generate this information in the form of code coverage reports. Using these reports, we’re building a mapping of SUT (system under test) files to test files for every merged commit, which tells us which tests should run for a given patchset. Our trial runs have shown this could reduce the number of PHPUnit tests run per patchset by a factor of ~10.

To guarantee our optimizations don’t allow any test breakages to sneak by, we will only employ this system for testing patchsets in code review. All PHPUnit tests will still be run when a commit is merged.
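The selection step can be sketched in Python as follows. The sketch assumes the coverage reports have already been parsed into a plain dictionary from each test file to the source files it covers; parsing PHPUnit's coverage output is not shown. The function names, and the choice to fall back to the full suite for files with no coverage data, are assumptions rather than a description of the system being built.

```python
def build_coverage_map(coverage_by_test):
    """Invert {test file: covered source files} into {source file: test files}."""
    tests_by_source = {}
    for test_file, sources in coverage_by_test.items():
        for source in sources:
            tests_by_source.setdefault(source, set()).add(test_file)
    return tests_by_source


def select_tests(changed_files, tests_by_source, all_tests, in_review=True):
    """Pick the test files to run for a set of changed files."""
    # Merged commits always run the full suite.
    if not in_review:
        return set(all_tests)
    selected = set()
    for path in changed_files:
        if path in all_tests:
            selected.add(path)  # a changed test file must itself run
        elif path in tests_by_source:
            selected |= tests_by_source[path]
        else:
            # No coverage data (for example, a new file): run everything.
            return set(all_tests)
    return selected
```

A change to a widely used utility file will still select many tests, so the savings depend on how localized a typical patchset is.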

Optimizing test distribution among slaves

Scoring each test file by the number of tests contained in it has improved the ‘fairness’ of test distribution among the slaves considerably — but there are still gains to be made. For example, distributing test files by their typical execution time is a better strategy. Alternatively, we could keep the test files in a common queue and dole one out to each slave, allowing them to request the next file when they finish.
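The common-queue alternative can be sketched in Python with threads standing in for slave VMs. This is only an illustration of the pull model: run_file is a hypothetical callback that would execute one test file on a given slave, and a real cluster would pull work over the network rather than from an in-process queue.

```python
import queue
import threading


def run_with_shared_queue(test_files, num_slaves, run_file):
    """Let each slave pull the next test file as soon as it is free."""
    pending = queue.Queue()
    for test_file in test_files:
        pending.put(test_file)

    results = {}
    lock = threading.Lock()

    def slave_loop(slave_id):
        while True:
            try:
                test_file = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = run_file(slave_id, test_file)
            with lock:
                results[test_file] = (slave_id, outcome)

    threads = [
        threading.Thread(target=slave_loop, args=(slave_id,))
        for slave_id in range(num_slaves)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With this model no up-front scoring is needed: a slave that draws a slow file simply pulls fewer files overall, although starting the slowest files first would still help avoid a long tail at the end of the run.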

Final Thoughts

Without the improvements we’ve made in the last six months, we would be in completely over our heads: swamped with tests and unable to provide fast feedback.

We are regularly surprised by (and proud of) the rate of test development at Box, which has ultimately required a wide variety of tricks just to keep up. We’re now in a position to build on our success and (hopefully) make significant reductions in our time-to-feedback metric over the next few months. It’s hard to imagine a more immediately effective way of increasing our agility as an engineering organization.

  • http://twitter.com/lox Lachlan Donald

    Very interesting. How are you guys handling aggregating coverage data from the distributed test runners?

    (We have similar challenges at 99designs, we recently open-sourced a piece of our test distribution environment https://github.com/99designs/testcloud)

    • http://twitter.com/euphoria83 Aman Goel

      Each phpunit execution creates a result.xml file. We copy them over to the Cluster-Runner Master, read in the XML, append them together to form a larger XML and then analyze it for errors and failures.

      Does this answer your question? Feel free to get in touch with us if you want to discuss this further.
