I have a table with the following columns:
record_id
source_id
user_id
mobile
called_at
I am trying to run these two queries:
SELECT
t1.user_id,
t1.mobile,
COUNT(DISTINCT t1.called_at) AS cnt
FROM
(
SELECT
user_id,
mobile,
called_at
FROM
users
WHERE
called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
) t1
GROUP BY t1.user_id, t1.mobile
HAVING cnt > 1
And
SELECT
user_id,
mobile,
COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1
Both queries are logically the same and give the same output, but the first runs very quickly (~3 seconds) and the second very slowly (~55 seconds).
EXPLAIN even shows that the first query involves an extra scan over the derived table using filesort, yet it is still much faster.
How is this possible?
Explain Output:
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1025150 | Using filesort |
| 2 | DERIVED | users | ALL | idx_fa_af,idx_a_di_um | NULL | NULL | NULL | 2221923 | Using where |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| 1 | SIMPLE | users | index | idx_fa_af,idx_a_di_um | idx_a_di_um | 23 | NULL | 2221923 | Using where |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| users | CREATE TABLE `users` (
`record_id` varchar(100) NOT NULL,
`source_id` int(11) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`mobile` varchar(15) DEFAULT NULL,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `called_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE KEY `idx_unique_a_ri_si` (`record_id`,`source_id`),
KEY `idx_fa_af` (`called_at`),
KEY `idx_fa_um` (`mobile`),
KEY `idx_a_di_um` (`user_id`,`mobile`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 2 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1 |
| Handler_read_last | 0 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 3676447 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 1208173 |
+----------------------------+---------+
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 2 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1 |
| Handler_read_last | 0 |
| Handler_read_next | 2468272 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+---------+
Add INDEX(user_id, called_at, mobile), then run each query twice. Running each twice avoids caching issues that could be hiding I/O. I suspect that the first query ran fast because it was all in RAM, and that the second was using index idx_a_di_um, which was not cached. The index I am suggesting should make both queries run faster.
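For example, the suggested index could be added like this (a sketch; the index name is my own choice):

```sql
-- Covering index for the queries: filters on user_id (IS NOT NULL),
-- ranges over called_at, and includes mobile so the data rows
-- never need to be read.
ALTER TABLE users
    ADD INDEX idx_uid_called_mobile (user_id, called_at, mobile);
```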
Is any combination of columns 'unique'? If so, make that combination the PRIMARY KEY; this will improve things further. If not, at least provide an id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY.
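In this schema, the combination (record_id, source_id) already has a UNIQUE KEY, so it could be promoted (a sketch; note that changing the clustered key rebuilds the InnoDB table):

```sql
-- Promote the existing unique key to the primary (clustered) key:
ALTER TABLE users
    DROP INDEX idx_unique_a_ri_si,
    ADD PRIMARY KEY (record_id, source_id);

-- Or, if no column combination were unique, add a surrogate key:
-- ALTER TABLE users
--     ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
```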
Why it will help
An index is a BTree. (See Wikipedia for a good definition.) That index structure is separate from the data, which lives in a separate BTree ordered by the PRIMARY KEY. A BTree is very efficient at finding a row or a set of consecutive rows ("consecutive" according to the index). When using a secondary key (i.e., not the PRIMARY KEY), first the row(s) of the index are located, then each data row is looked up via the PRIMARY KEY. Unless... if all the columns needed in a SELECT are in the secondary key, there is no need to reach over to the data at all. This is called 'covering'; EXPLAIN indicates it by saying "Using index". My index is a "covering" index for the subquery.
The order of the columns in any index is important. In this one case, the index has all user_id IS NOT NULL rows together, but that is about the only argument for the order of the 3 columns.
Handler trick
Here's a way to get more insight into what a query is doing, and it does not depend on caching, server restarts, etc.:
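A sketch of the sequence (shown with the second query from the question; FLUSH STATUS zeroes the session counters so the SHOW output reflects just the one statement in between):

```sql
FLUSH STATUS;

SELECT user_id, mobile, COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= '2016-09-01' AND called_at < '2016-12-01'
  AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1;

SHOW SESSION STATUS LIKE 'Handler%';
```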
Numbers that look like the size (in rows) of the table indicate a table (or index) scan. Numbers that look like the size of the output indicate some final manipulation. A large Handler_write indicates a tmp table. Etc.