I have a table with the following columns:
record_id
source_id
user_id
mobile
called_at
I am trying to run these two queries:
SELECT
    t1.user_id,
    t1.mobile,
    COUNT(DISTINCT t1.called_at) AS cnt
FROM
(
    SELECT
        user_id,
        mobile,
        called_at
    FROM
        users
    WHERE
        called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
) t1
GROUP BY t1.user_id, t1.mobile
HAVING cnt > 1
And
SELECT
    user_id,
    mobile,
    COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1
Both queries are logically the same and give the same output, but the first query runs very quickly (~3 seconds) while the second one takes ~55 seconds.
EXPLAIN even says that the first query involves an extra scan over the derived table using a filesort, yet it is still much faster.
How is this possible?
Explain Output:
+----+-------------+------------+------+-----------------------+------+---------+------+---------+----------------+
| id | select_type | table      | type | possible_keys         | key  | key_len | ref  | rows    | Extra          |
+----+-------------+------------+------+-----------------------+------+---------+------+---------+----------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL                  | NULL | NULL    | NULL | 1025150 | Using filesort |
|  2 | DERIVED     | users      | ALL  | idx_fa_af,idx_a_di_um | NULL | NULL    | NULL | 2221923 | Using where    |
+----+-------------+------------+------+-----------------------+------+---------+------+---------+----------------+
+----+-------------+-------+-------+-----------------------+-------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys         | key         | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+-----------------------+-------------+---------+------+---------+-------------+
|  1 | SIMPLE      | users | index | idx_fa_af,idx_a_di_um | idx_a_di_um | 23      | NULL | 2221923 | Using where |
+----+-------------+-------+-------+-----------------------+-------------+---------+------+---------+-------------+
Table definition:
CREATE TABLE `users` (
  `record_id` varchar(100) NOT NULL,
  `source_id` int(11) NOT NULL,
  `user_id` int(11) DEFAULT NULL,
  `mobile` varchar(15) DEFAULT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `called_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  UNIQUE KEY `idx_unique_a_ri_si` (`record_id`,`source_id`),
  KEY `idx_fa_af` (`called_at`),
  KEY `idx_fa_um` (`mobile`),
  KEY `idx_a_di_um` (`user_id`,`mobile`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Handler status output:
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 2       |
| Handler_mrr_init           | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 3676447 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 1208173 |
+----------------------------+---------+
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 2       |
| Handler_mrr_init           | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 2468272 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 0       |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
Add INDEX(user_id, called_at, mobile), then run each query twice. (Twice is to avoid caching issues that could be hiding I/O.)
I suspect that the first query ran fast because it was all in RAM, and the second was using index idx_a_di_um, which was not cached. The index I am suggesting should make both of them run faster.
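As a concrete sketch of that suggestion (the index name below is just a made-up placeholder):

ALTER TABLE users ADD INDEX idx_ui_ca_mo (user_id, called_at, mobile);  -- idx_ui_ca_mo is an arbitrary name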
Is any combination of columns 'unique'? If so, make that combination the PRIMARY KEY; this will improve things further. If not, at least provide an id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY.
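Given the existing UNIQUE KEY idx_unique_a_ri_si on (record_id, source_id), that pair looks like a candidate; a sketch of either option:

ALTER TABLE users ADD PRIMARY KEY (record_id, source_id);  -- idx_unique_a_ri_si then becomes redundant and could be dropped

or, if no combination is truly unique:

ALTER TABLE users ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;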
Why it will help

An index is a BTree. (See Wikipedia for a good definition.) That index structure is separate from the data, which is in a separate BTree, ordered by the PRIMARY KEY. A BTree is very efficient at finding a row or a set of consecutive rows. ("Consecutive" according to the index.)
When using a secondary key (i.e., not the PRIMARY KEY), first the row(s) of the index are located, then each data row is looked up using the PRIMARY KEY. Unless... if all the columns that are needed in a SELECT are in the secondary key, there is no need to reach over to the data. This is called 'covering'; EXPLAIN indicates it by saying "Using index". My index is a "covering" index for the subquery.
The order of the columns in any index is important. In this one case, the index has all the user_id IS NOT NULL rows together, but that is about the only argument for the order of the 3 columns.
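After adding the suggested index, one way to confirm the covering behavior is to EXPLAIN the inner query and look for "Using index" in the Extra column:

EXPLAIN
SELECT user_id, mobile, called_at
FROM users
WHERE called_at >= "2016-09-01" AND called_at < "2016-12-01"
  AND user_id IS NOT NULL;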
Handler trick

Here's a way to get more insight into what a query is doing; it does not depend on caching, server restarts, etc.
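For example (one way to do it, all in a single session):

FLUSH STATUS;
-- run the query being investigated here
SHOW SESSION STATUS LIKE 'Handler%';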
Numbers that look like the size (rows) of the table indicate a table (or index) scan. Numbers that look like the size of the output indicate some final manipulation. Handler_write... indicates a tmp table. Etc.
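As a rough reading of the status output in the question (assuming the first block belongs to the first query and the second block to the second): Handler_read_rnd_next = 3,676,447 ≈ 2,468,274 rows for the full scan of users plus 1,208,173 rows for re-reading the tmp table that the 1,208,173 Handler_write calls built, while the other block's Handler_read_next = 2,468,272 is a single pass over the whole idx_a_di_um index. Both plans touch every row; per the caching point above, the difference in elapsed time comes down to what was already in RAM.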