Thursday, December 6th, 2012

Optimizing MySQL Performance at Scale with Anemometer

Anyone who spends time optimizing a Relational Database Management System (RDBMS) can tell you that the largest performance gains come from query optimization. Selecting good hardware and tuning the OS and server settings certainly matter, but tuning a single query can have a drastic impact on execution time — in some cases going from minutes to milliseconds. A busy database server may have tens of thousands of unique queries running through it, so how do you identify which ones are worth your time to optimize? And how do you track changes in performance over time?

While some tools already exist for this purpose in the MySQL world, none of them met all of our needs at Box. There are also more sophisticated commercial products on the market, but their licensing fees can be prohibitive when you run many databases. This is why we created Anemometer. We built it on top of existing open source tools which, though they provided a lot of value, didn’t take the final step of visualizing the data and making it as useful as it could be. This is what Anemometer is: an open source query performance visualization tool with powerful interactive reporting features.

In this post I’ll show some common methods used to gain insights into MySQL queries and explain why we needed to take the next step. In the end, we were able to increase our total throughput by 4x while simultaneously reducing our average query latency by 5x. That’s an astounding result — we scaled up our database usage and made it all happen faster at the same time!

What’s the server doing?

The first and best place to start gathering performance data is MySQL’s Slow Query Log. This feature writes a record of each query after it executes, so you can see detailed information about what happened internally. You can set a threshold and log every query that runs longer than that amount of time.

When the slow query log was first introduced, its time resolution was a full second — you couldn’t log anything that wasn’t slower than one second, and this was a major limitation. Now, however, you can set subsecond values, which means you can effectively log all queries. This is a huge improvement, because one of the most important metrics is the aggregate execution time for all queries that follow the same pattern — not the execution time of a single query. If you have a heavy reporting query which runs once an hour and takes 60 seconds, and a fast query which takes 50 milliseconds and executes 10 times per second, which one should you worry about? The first consumes 24 * 60 = 1,440 seconds per day, but the second consumes 24 * 60 * 60 * 10 * 0.050 = 43,200 seconds per day! That’s a huge difference. If you could cut the execution time of just one of these queries in half, the second would be a much bigger win. MySQL isn’t really affected much by an occasional long-running query, but it is often overwhelmed by many small queries, even if each one is normally very fast.
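
On any reasonably recent MySQL server you can change this at runtime (a minimal sketch, assuming you have the SUPER privilege; as explained below, you won’t want to leave it this way on a busy server):

-- subsecond values are accepted; 0 effectively logs every query
SET GLOBAL long_query_time = 0;
SET GLOBAL slow_query_log = ON;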

So you’ve got the slow log on and now you’re logging everything. But now you have a new problem. This is an absolute torrent of information — not only is the log unreadable from top to bottom, you also can’t leave it enabled all the time on a busy server!

The solution is sampling. You can enable the log for a short period of time and then disable it again. To process these logs you’ll also need a tool such as pt-query-digest from Percona, which aggregates the raw log into a readable report.
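
A complete sampling pass might look like the following sketch (the 30-second window and the log path are assumptions; pick values that fit your traffic):

# capture roughly 30 seconds of full query traffic, then turn logging back off
# (note: a GLOBAL long_query_time applies only to connections opened after the change)
mysql -e "SET GLOBAL long_query_time = 0; SET GLOBAL slow_query_log = ON;"
sleep 30
mysql -e "SET GLOBAL slow_query_log = OFF;"

# aggregate the sample, ranking query classes by total execution time
pt-query-digest --order-by Query_time:sum /var/log/mysql/mysql-slow.log > slow-report.txt

Running pt-query-digest over a captured sample produces a report like this: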

# Profile
# Rank Query ID           Response time Calls R/Call Apdx V/M   Item
# ==== ================== ============= ===== ====== ==== ===== ==========
#    1 0x9A77D60B6E48BE4D  0.1615 31.9%   260 0.0006 1.00  0.00 SELECT customer address
#    2 0xB4EB3788763D984E  0.0964 19.0%     1 0.0964 1.00  0.00 SELECT information_schema.COLUMNS
#    3 0xF47B7F81FA0E3307  0.0882 17.4%     1 0.0882 1.00  0.00 SELECT information_schema.PARTITIONS
#    4 0xAD94A0AA9EAD66FF  0.0175  3.5%     1 0.0175 1.00  0.00 SELECT information_schema.VIEWS
#    5 0xF34A9B78776B18DD  0.0143  2.8%   262 0.0001 1.00  0.00 SELECT actor

# Query 1: 2.77 QPS, 0.00x concurrency, ID 0x9A77D60B6E48BE4D at byte 979815
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | -^_    |
# Time range: 2012-09-24 14:25:18 to 14:26:52
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          9     260
# Exec time     31   162ms    15us     4ms   621us     2ms   572us   657us
# Lock time     83    19ms       0   345us    73us   176us    62us    89us
# Rows sent     56     166       0       2    0.64    0.99    0.48    0.99
# Rows examine  98 104.87k       0   1.17k  413.01  592.07  331.40  592.07
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     56     166       0       2    0.64    0.99    0.48    0.99
# Bytes sent     2 312.26k   1.18k   1.38k   1.20k   1.20k   30.00   1.14k
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     9  20.29k      78      80   79.90   76.28       0   76.28
# String:
# Databases    sakila
# Hosts        localhost
# InnoDB trxID 4EE9 (1/0%), 4EEA (1/0%), 4EEB (1/0%)... 162 more
# Last errno   0
# Users        root
# Query_time distribution
#   1us
#  10us  ################################################
# 100us  ################################################################
#   1ms  ############################
#  10ms
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `sakila` LIKE 'customer'\G
#    SHOW CREATE TABLE `sakila`.`customer`\G
#    SHOW TABLE STATUS FROM `sakila` LIKE 'address'\G
#    SHOW CREATE TABLE `sakila`.`address`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM customer JOIN address USING (address_id) WHERE postal_code='94065'\G

This sample report shows a lot of aggregated statistics about queries that executed while the slow query log was enabled. At the top is a summary of queries found in the log, ordered by total execution time. Below the summary is a more detailed analysis of a single query pattern. As you can see from the execution time, this query represents 31 percent of the time our sample database server spent running queries.

Such a report helps a lot. If you’ve never seen statistics like this about your database server, it can be a revelation. But if you really want to identify the queries in your system that need optimizing, a single report still may not be enough. It covers only a small slice of time — you really want to know how things change over time, so you can see whether the query that’s hammering your server today was there yesterday as well. You also probably have many database servers, and this report shows the logs from only one — you want statistics across all your servers, which means you need a way to aggregate the data. And where are the graphs? Information that’s hard to tease out of text files can be discovered instantly by graphing it over time.

These are the reasons we built Anemometer. But why throw away existing work? pt-query-digest can already process slow logs and save its digested data to a database. We decided to build on this powerful tool and provide the reporting and visualization layer that makes the data even more useful.
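
In practice the collection side can be a periodic job that feeds each sampled log through pt-query-digest with its review/history options (named --review and --history in recent Percona Toolkit releases), which write the aggregated results into tables instead of printing a report. A sketch, where the host, database, and table names are assumptions — the Anemometer repository documents the exact schema it expects:

pt-query-digest --no-report --limit=0% \
  --review  h=reportdb,D=slow_query_log,t=global_query_review \
  --history h=reportdb,D=slow_query_log,t=global_query_review_history \
  /var/log/mysql/mysql-slow.log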

What does it look like?

Again, we’re using existing open source tools to build something functional, attractive, and familiar to almost everyone out there. We used Bootstrap and jQuery for the interface elements and interactivity, and Flot for the amazing interactive HTML canvas graphs. The following sample shows a basic view of query data in Anemometer:

In this sample you can instantly notice that performance over time isn’t consistent at all — there are a lot of changes! In Anemometer, you have the option to change which data you want to graph. You can even quickly compare different database servers all at once. Below the graph is a table which displays all queries that executed during the selected time period. The graph supports zooming so you can highlight a time range to drill down — the table of queries updates dynamically so you can quickly see the exact queries which took place during that time. Graphing, grouping, sorting. It’s so simple, but it adds so much power to the task of exploring your query data.

Once you find a query you want to examine, you can get additional details (such as the explain plan and table structure), keep notes and see a history of query performance. This information is almost everything you need to decide how to optimize queries.

One important note: at Box we append the program backtrace to each query as it’s passed to the database server. This is a common best practice that often helps us troubleshoot query issues, and it makes it super easy to find the code associated with a query that needs work.
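
The exact mechanics aren’t shown in this post, but one common way to do this is to have the application’s database layer append a SQL comment naming the call site before the query is sent, so the server (and the slow log) sees something like this (the file and function names here are invented for illustration):

SELECT * FROM files WHERE folder_id = 42 /* api/folders.php:217 get_folder_contents() */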

Integrating it into Development

Anemometer also supports multiple datasources, so you can monitor your development environment in addition to production databases. Using the ability to create custom reports and return the results as JSON objects, we’ve integrated Anemometer into our development process, so we can catch bad queries before they hit our live servers.

Before a release is pushed, we query Anemometer with a pre-defined report which returns potentially bad queries. It checks for things such as full table scans, filesorts, temporary table operations, and low index usage. This is where the comment/review functionality really helps: we mark queries with a status, and then add a filter to the report so that reviewed queries won’t continue to appear in the results. By doing this, we flag queries that need to be improved and also mark queries which are OK to go live. When we first started reviewing queries this way, we had a lot of results; after working through that initial backlog, we now see only a list of new queries to review.

Since the interface uses all GET parameters, we can give you the exact query string of the report, and you can run it on your own data!

?datasource=Staging&dimension-ts_min_start=-7+day&dimension-ts_min_end=now&table_fields%5B%5D=checksum&table_fields%5B%5D=age&table_fields%5B%5D=snippet&table_fields%5B%5D=index_ratio&table_fields%5B%5D=query_time_avg&table_fields%5B%5D=rows_sent_avg&table_fields%5B%5D=examined_ratio&table_fields%5B%5D=ord&table_fields%5B%5D=checksum&table_fields%5B%5D=checksum&table_fields%5B%5D=ts_cnt&table_fields%5B%5D=Query_time_sum&table_fields%5B%5D=Lock_time_sum&table_fields%5B%5D=Rows_sent_sum&table_fields%5B%5D=Rows_examined_sum&table_fields%5B%5D=Full_scan_sum&table_fields%5B%5D=Tmp_table_sum&table_fields%5B%5D=Filesort_sum&fact-group=checksum&fact-order=age&fact-having=SUM%28dimension.Full_scan_sum%29+OR+SUM%28dimension.Full_join_sum%29+OR+%28ord+>%3D2+AND+index_ratio+>+20+AND+SUM%28dimension.ts_cnt%29%2FSUM%28dimension.Rows_examined_sum%29%29&fact-limit=10000&submit=Search&fact-where=fact.reviewed_on+IS+NULL+AND+fact.fingerprint+NOT+LIKE+%27EXPLAIN%25%27+AND+fact.fingerprint+NOT+LIKE+%27DESCRIBE%25%27&&action=report
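
That also makes the check easy to script. A pre-release gate might look like the following sketch (the hostname, the output=json parameter, and the jq test are assumptions about your deployment rather than documented Anemometer flags):

# fetch the pre-defined "bad queries" report as JSON; block the release if anything comes back
REPORT='datasource=Staging&...'   # the full query string shown above
curl -s "http://anemometer.example.com/index.php?${REPORT}&output=json" \
  | jq -e 'length == 0' > /dev/null || { echo "unreviewed slow queries found"; exit 1; }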

There are more features as well! You can grab the project from GitHub and see more information there. There’s also my Lightning Demo of Anemometer at Velocity Europe 2012.
