Ding!

Level up your MySQL query tuning

AlexanderRubin
tuning

A closer look at query optimization with a focus on queries with GROUP BY and ORDER BY, from Percona consultant Alexander Rubin.

 In
this article we will talk about query optimization with the focus
on the queries with GROUP BY and ORDER BY. We will start with the
basic concepts (Indexes, Explain plan in MySQL, etc) and then will
talk about the advanced query optimization techniques. We will
cover “loose index scan” and “tight index scan” optimizations and
show the benchmark results.

Indexes

In this article we will focus on the b-tree
indexes only (InnoDB and MyISAM only support B-tree indexes).
Figure 1 illustrates a basic
b-tree index
implementation.

 

B-tree supports both “equality” (where id = 1)
and range (where date > ‘2013-01-01’ and date < ‘2013-07-01’)
search. Figures 2 and 3 show examples of these,
illustrated.

Figure 2 is of an equality search with primary
or unique key: select
select * from table where id =
12
. In this scenario MySQL will be able to scan
through the tree and go directly to 1 leaf and then stop. That is
usually the fastest index scan operation.

In InnoDB primary key searches is
even faster as InnoDB “clusters” records around primary
key.

Range: select * from table where id
in (6, 12, 18)

In this scenario MySQL will need to scan thru
the tree and visit many leafs/nodes:

 

Table for the tests

I will use a couple of tables for the tests. The
first table we will use to illustrate is explained below. This
table is the part of MySQL “world” test db and can be downloaded
from
dev.mysql.com.

CREATE TABLE City (
    ID int(11) NOT NULL AUTO_INCREMENT,
    Name char(35) NOT NULL DEFAULT '',
    CountryCode char(3) NOT NULL DEFAULT '',
    District char(20) NOT NULL DEFAULT '',
    Population int(11) NOT NULL DEFAULT '0',
    PRIMARY KEY (ID),
    KEY CountryCode (CountryCode)
) Engine=InnoDB;

Explain plan

The main way to “profile” a query with MySQL is
to use
“explain”.
The output below shows an example. As we can see from the explain,
MySQL does not use any indexes (key: NULL) and will have to perform
a full table scan.

mysql> EXPLAIN select * from City where Name = 'London'G
*************************** 1. row
           id: 1
  select_type: SIMPLE
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4079
        Extra: Using where

In this case we can add an index to restrict the
number of rows:

mysql> alter table City add key (Name);
Query OK, 4079 rows affected (0.02 sec)
Records: 4079  Duplicates: 0  Warnings: 0

The new explain shows that MySQL will use an
index:

mysql> explain select * from City where Name = 'London'G
*********************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: Name
          key: Name
      key_len: 35
          ref: const
         rows: 1
        Extra: Using where

Index usages

MySQL will choose only 1 index per query (and
per table if the query joins multiple tables). In some cases MySQL
can also intersect indexes, however we will not cover it in this
article. MySQL uses index statistics to make a decision about the
best possible index.

Combined Indexes

Combined indexes are very important for MySQL
query optimizations. MySQL can use leftmost part of any index. For
example, if we have this index:
 

Comb(CountryCode, District, Population)
Then MySQL can use:
CountryCode only part
CountryCode + District
CountryCode + District + Population
From the explain plan in below we can understand which
part(s) will be used:
mysql> explain select * from City 
      where CountryCode = 'USA'G
********************** 1. row ******************
        table: City
         type: ref
possible_keys: comb
          key: comb
      key_len: 3
          ref: const
         rows: 273 
Note
the
key_len part – it shows: 3. This is
the number of bytes used from our index. As the

CountryCode field is declared as char(3), that
means that MySQL will use the first field from our combined
index.

Similarly, MySQL can use 2
leftmost fields or all 3 fields from the index. In this example,
the 2 leftmost fields are used:

mysql> explain select * from City 
where CountryCode = 'USA' and District = 'California'G
********************** 1. row ******************
        table: City
         type: ref
possible_keys: comb
          key: comb
      key_len: 23
          ref: const,const
         rows: 68

So MySQL will use 2 first fields from the comb key:

  • CountryCode = 3 chars
  • District = 20 chars
  • Total = 23

Whereas in this one, all 3 fields are used from the index:

mysql> explain select * from City 
where CountryCode = 'USA' and District = 'California’
and population > 10000G
********************** 1. row ******************
        table: City
         type: range
possible_keys: comb
          key: comb
      key_len: 27
          ref: NULL
         rows: 68
  • CountryCode = 3 chars/bytes
  • District = 20 chars/bytes
  • Population = 4 bytes (INT)
  • Total = 27

However, if the query does not have the first leftmost part of
an index, MySQL will not be able to use it:

mysql> explain select * from City where  
District = 'California' and population > 10000G
********************** 1. row ******************
 table: City
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 3868

As we do not have the CountryCode in the where clause, MySQL
will not be able to use our “comb” index.

Covered index

The covered index is an index that covers all fields in the
query. For example, for this query:

select name from City  where CountryCode = 'USA' and District = 'Alaska' 
and population > 10000

the following index will be a “covered” index:

cov1(CountryCode, District, population, name)

The above index uses all fields in the query in particular
order:

  1. Where part
  2. Group By/Order (not used now)
  3. Select part (here: name)

Here is an example:

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

“Using index” in the extra field of the explain output means
that MySQL will use our covered index. That also means that MySQL
will use an index only to satisfy the query: all the information
that MySQL needs is in the index. That is usually much faster,
especially if we have a large text or blob fields in the table.

Order of the fields in index

The order of the fields in the index is very important. The way
b-tree works, it is more beneficial to have a field which will be
used for “equality” comparison first and the fields with “range”
(more than and less than comparison) second.

For example, take the following query:

select * from City where district = 'California' and population > 30000

The best Index will be on (district, population), in this
particular order.

  1. (district, population) index [Figure 4]: In this case MySQL
    will be able to go “directly” (via index scan) to the correct
    district (“CA”) and do a range scan by population. All other nodes
    for the ”district” field (other US states in this example) will not
    be scanned.

  1. (population, district) index [Figure 5]: In this example, MySQL
    will have to do a “range” scan for the population and for each
    index record it will have to check for the correct district, which
    will be slower.

Complex Slow Queries

In this article we will be concentrating on 2 major query
types:

  • Queries with “group by”
  • Queries with “order by”

Those queries are usually the slowest ones. We will show how to
optimize those queries and decrease the query response time as well
as the application performance in general.

Group by example

Let’s look at this simple example, a query of how many cities
there are in each country.

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

As we can see the output above, MySQL does not use any indexes
(no proper indexes are available), but it also shows “Using
temporary; Using filesort”. MySQL will need to create a temporary
table to satisfy the “Group by” clause if there is no appropriate
index (You can find more information about the temporary
tables here).
However, MySQL can use a combined index to satisfy the “group by”
clause and avoid creating temporary table.

Group by and covered index

To illustrate the “group by” queries I will use the following
table:

CREATE TABLE ontime_2012 (
    YearD int(11) DEFAULT NULL,
    MonthD tinyint(4) DEFAULT NULL,
    DayofMonth tinyint(4) DEFAULT NULL,
    DayOfWeek tinyint(4) DEFAULT NULL,
    Carrier char(2) DEFAULT NULL,
    Origin char(5) DEFAULT NULL,
    DepDelayMinutes int(11) DEFAULT NULL,
    ...
) ENGINE=InnoDB DEFAULT CHARSET=latin1 

The table contains freely available airline
performance statistics
. The table is 6 million rows and
approximately 2GB in size.

  • From this data we want to:
  • Find maximum delay for flights on Sunday
  • Group by airline

Our example query is:

select max(DepDelayMinutes), 
carrier, dayofweek 
from ontime_2012 
where dayofweek = 7 
group by Carrier

The explain plan is:

...
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 4833086
       Extra: Using where; Using temporary; Using filesort

As we can see, MySQL does not use any index and have to scan ~4M
rows. In addition, it will have to create a large temporary
table.

If we will create an index on “dayofweek” it will only filter
out some rows and MySQL will still need to create a temporary
table:

mysql> explain select name from City  where CountryCode = 'USA' and District = 'Alaska' and population > 10000G
*************************** 1. row ***********
        table: City
         type: range
possible_keys: cov1
          key: cov1
      key_len: 27
          ref: NULL
         rows: 1
        Extra: Using where; Using index

However, we can create a covered index on
(dayofweek, Carrier, DepDelayMinutes) in this particular order. In
this case MySQL will be able to use this index and avoid creating a
temporary table:

mysql> alter table ontime_2012 
add key covered(dayofweek, Carrier, DepDelayMinutes);

explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 where dayofweek =7 group by CarrierG
...             
possible_keys: DayOfWeek,covered
         key: covered
     key_len: 2
         ref: const
        rows: 905138
       Extra: Using where; Using index

As we can see from the explain, MySQL will use our index and
will avoid creating a temporary table. This is the fastest possible
solution.

Note that MySQL will also be able to use non-covered index on
(dayofweek, Carrier) and avoid creating a temporary table. However,
covered index will be faster as MySQL will be able to satisfy the
whole query by just reading the index.

Group by and a range scan

The covered index works well if we have a “const” (where
dayofweek=N). However, MySQL will not be able to use an index and
avoid a filesort if we have a “range” scan in the where clause.
Here’s an example:

mysql> explain select max(DepDelayMinutes), Carrier, dayofweek from ontime_2012 
where dayofweek > 5 group by Carrier, dayofweekG
...

        type: range
possible_keys: covered
         key: covered
     key_len: 2
         ref: NULL
        rows: 2441781
       Extra: Using where; Using index; Using temporary; Using filesort

MySQL will still have to create a temporary table. To fix
that we can use a simple trick and rewrite the query into 2 parts
with UNION:

(select max(DepDelayMinutes), Carrier, dayofweek 
from ontime_2012 
where dayofweek = 6 
group by Carrier, dayofweek)
union
(select max(DepDelayMinutes), Carrier, dayofweek 
from ontime_2012 
where dayofweek = 7 
group by Carrier, dayofweek)

For each of the 2 queries in the union MySQL will be able
to use an index instead of creating a temporary table, as shown in
the explain plan below:

*************************** 1. row ***************************
       table: ontime_2012
         key: covered
...
       Extra: Using where; Using index
*************************** 2. row ***************************
       table: ontime_2012
         key: covered
...
       Extra: Using where; Using index
*************************** 3. row ***************************
          id: NULL
 select_type: UNION RESULT
       table: <union1,2>
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
       Extra: Using temporary

As we can see, MySQL uses our covered index for each of the 2
queries. It will still have to create a temporary table to merge
the results, however, it will probably be much smaller temporary
table as it will only need to store
the resultsets of 2 queries.

Group by and a loose index scan

Loose index scan is another MySQL algorithm to optimise Group By
queries. Loose index scan considers only a fraction of the keys in
an index and is very fast. The following limitations apply:

  • The query is over a single table.
  • The GROUP BY names only columns that form a leftmost prefix of
    the index and no other columns.
  • The only aggregate functions used in the select list (if any)
    are MIN() and MAX(), same column

More information can be found in MySQL documentation
on Loose
Index Scan
.

For loose index scan to work we will need to create an
additional index and start with columns in the “Group by” clause
and then all fields in the “where” clause (order of the fields in
the index matters!). For example, for our query:

select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012  where dayofweek = 5  group by Carrier, dayofweek

We will need to create this index:

KEY loose_index_scan (Carrier,DayOfWeek,DepDelayMinutes)

Note that looseindexscan is only a
name of the index, it can be any name:

mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012  where dayofweek = 5  group by Carrier, dayofweek

       table: ontime_2012
        type: range
possible_keys: covered
         key: loose_index_scan
     key_len: 5
         ref: NULL
        rows: 201
       Extra: Using where; Using index for group-by

“Using index for group-by” in the where clause means that
MySQL will use the loose index scan. Loose index scan is very fast
as it only scans a fraction of the key. It will also work with the
range scan:

mysql> explain select max(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 
where dayofweek > 5 group by Carrier, dayofweek;

table: ontime_2012
        type: range
possible_keys: covered
         key: loose_index_scan
     key_len: 5
         ref: NULL
        rows: 213
       Extra: Using where; Using index for group-by; 

Benchmark

We have performed a benchmark to compare query speed with a
temporary table, a tight index scan (covered index) and a loose
index scan. The table is 6 million rows and approximately 2GB in
size.

As we can see, loose index scan index shows the best
performance. Unfortunately, loose index scan only works with 2
aggregate functions, min and max for the group by. “AVG” + group by
does not work with loose index scan. As we can see below, MySQL
will use a covered index (not loose
indexscan
index) and, because we have a range in the where clause (dayofweek
> 5), it will have to create a temporary
table.

mysql> explain select avg(DepDelayMinutes) as ddm, Carrier, dayofweek from ontime_2012 where dayofweek >5  group by Carrier, dayofweek G 

       table: ontime_2012
        type: range
         key: covered
     key_len: 2
         ref: NULL
        rows: 2961617
       Extra: Using where; Using index; Using temporary; Using fileso

ORDER BY and filesort

MySQL may have to perform a “filesort” operation when a query
uses the “order by” clause. You can find more information about
“filesort” here.

The filesort operation below is usually pretty slow operation
(even if it does not involve creating a file on disk), especially
if MySQL will have to sort a lots of rows:

table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4079
        Extra: Using where; Using filesort

To optimize this query we can use a combined
index:

mysql> alter table City 
add key my_sort2 (CountryCode, population);

mysql> explain select district, name, population from City where CountryCode = 'USA' order by population desc limit 10G

       table: City
        type: ref
         key: my_sort2
     key_len: 3
         ref: const
        rows: 207
       Extra: Using where

MySQL was able to use our combined index to avoid sorting: as
the index is sorted, MySQL was able to read the index leafs in the
correct order.

Sorting and limit

If we have a “LIMIT” clause in our query (and the limit is
relatively small (i.e. LIMIT 10 or LIMIT 100) compared to the
amount of rows in the table), MySQL can avoid using a filesort and
can utilize index instead.

Here’s an example:

mysql> alter table ontime_2012 add key (DepDelayMinutes);

We can create an index on DepDelayMinutes fields only and
run the explain below (note the query with LIMIT 10):

mysql> explain select * from ontime_2012 
where dayofweek in (6,7) order by DepDelayMinutes desc
limit 10G

        type: index
possible_keys: DayOfWeek,covered
         key: DepDelayMinutes
     key_len: 5
         ref: NULL
        rows: 24
       Extra: Using where
A closer look at query
optimization with a focus on queries with GROUP BY and ORDER BY,
from Percona consultant Alexander Rubin. This article was
originally published on webandphp.com.

Image by John Dronges.
Author
Comments
comments powered by Disqus