Mysql Derived Table Performance

1.9k Views Asked by At

I have a table with the following columns:

record_id

source_id

user_id

mobile

called_at

I am trying to run these two queries

SELECT
      t1.user_id,
      t1.mobile,
      COUNT(DISTINCT(t1.called_at )) AS cnt
    FROM
      (
        SELECT
          user_id,
          mobile,
          called_at
        FROM
          users
        WHERE
          called_at >= "2016-09-01" AND called_at < "2016-12-01" and user_id is NOT NULL
      ) t1
    GROUP BY t1.user_id, t1.mobile
    HAVING cnt > 1

And

SELECT
      user_id,
      mobile,
      COUNT(DISTINCT(called_at )) AS cnt
      FROM users
      WHERE called_at >= "2016-09-01" AND called_at < "2016-12-01" and user_id is NOT NULL
    GROUP BY user_id, mobile
    HAVING cnt > 1

Both queries are logically same and also give same output. But the first query is running very quickly ~ 3 seconds and the second one ~ 55 seconds.

Even explain says that the first query involves extra scan on derived table using filesort and still its much faster.

How is this possbile?

Explain Output:

+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
| id | select_type | table                 | type | possible_keys         | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
|  1 | PRIMARY     | <derived2>            | ALL  | NULL                  | NULL | NULL    | NULL | 1025150 | Using filesort |
|  2 | DERIVED     | users                 | ALL  | idx_fa_af,idx_a_di_um | NULL | NULL    | NULL | 2221923 | Using where    |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+

+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| id | select_type | table                 | type  | possible_keys         | key         | key_len | ref  | rows    | Extra       |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
|  1 | SIMPLE      | users                 | index | idx_fa_af,idx_a_di_um | idx_a_di_um | 23      | NULL | 2221923 | Using where |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+

| users | CREATE TABLE `users` (
  `record_id` varchar(100) NOT NULL,
  `source_id` int(11) NOT NULL,
  `user_id` int(11) DEFAULT NULL,
  `mobile` varchar(15) DEFAULT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `called_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
  UNIQUE KEY `idx_unique_a_ri_si` (`record_id`,`source_id`),
  KEY `idx_fa_af` (`called_at`),
  KEY `idx_fa_um` (`mobile`),
  KEY `idx_a_di_um` (`user_id`,`mobile`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |


+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 2       |
| Handler_mrr_init           | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 3676447 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1208173 |
+----------------------------+---------+

+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 2       |
| Handler_mrr_init           | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 2468272 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 0       |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
1

There are 1 best solutions below

12
On

Add INDEX(user_id, called_at, mobile), then run each query twice. Twice is to avoid caching issues that could be hiding I/O.

I suspect that the first query ran fast because it was all in RAM. And the second was using index idx_a_di_um which was not cached.

The index I am suggestion should make both of them run faster.

Is any combination of columns 'unique'? If so, make the combination the PRIMARY KEY. This will improve things further. If not, at least provide a id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY.

Why it will help

An index is a BTree. (See wikipedia for a good definition.) That index structure is separate from the data, which is in a separate BTree, ordered by the PRIMARY KEY. A BTree is very efficient at finding a row or set of consecutive rows. ("Consecutive" according to the index.) When using a secondary key (ie, not PRIMARY), first the row(s) of the index are located, then each data row is looked up using the PRIMARY KEY. Unless... If all the columns that are needed in a SELECT are in the secondary key, there is no need to reach over to the data. This is called 'covering'; EXPLAIN indicates it by saying "Using index". My index is a "covering" index for the subquery.

The order of the columns in any index is important. In this one case, the index has all user_id IS NOT NULL rows together. But that is about the only argument for the order of the 3 columns.

Handler trick

Here's a way to get more insight into what a query is doing, it it does not depend on caching, server-restart, etc:

FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';

Numbers that look like the size (rows) of the table indicate a table (or index) scan. Numbers that look like the size of the output indicate some final manipulation. Handler_write... indicates a tmp table. Etc.