The execution plans below seem disappointing and suboptimal, even though the queries are straightforward.
I'm using MySQL 5.7. Here is the fiddle (although it only offers 5.6).
CREATE TABLE `event` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(63) CHARSET ASCII COLLATE ASCII_BIN NOT NULL,
`is_sequenced` TINYINT(3) UNSIGNED NOT NULL,
`sequence_number` BIGINT(20) UNSIGNED DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `Name-SequenceNumber` (`name`,`sequence_number`),
KEY `Name-IsSequenced` (`name`,`is_sequenced`,`id`)
) ENGINE=INNODB
;
INSERT INTO `event`
(id, `name`, is_sequenced, sequence_number)
VALUES
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL)
;
We will be using the Name-IsSequenced
secondary index. Let's try the following EXPLAIN
. (The queries are in the Fiddle. Open "View execution plan" to see their EXPLAIN
result.)
EXPLAIN
SELECT * -- This part needs the PK
FROM `event` e
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id <= 3
;
So far, so good. Using index condition
makes sense: the entire condition can be resolved on the expected index Name-IsSequenced
, and then the PK is needed to get the remaining data for SELECT *
.
We should be able to improve that to Using index
if we only select something that is part of our secondary index, right? (Note that the PK is always part of any secondary index, but we can even ensure this by including id
at the end of our secondary index. The result is the same, as it should be.)
EXPLAIN
SELECT id
FROM `event` e
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id <= 3
;
Now, the result is Using where; Using index
. Wait, that's... worse?! We made it do less work, and the plan indicates that it is working harder.
Using index
should be achievable. Find the range where name=OrderCreated
, then inside it find the subrange where is_sequenced=0
, then inside that find the subrange where id<=3
.
Curiously, I have other experiments (with some more data) where I can get Using index
by changing id<=3
to id=3
(combined with FORCE INDEX
to prevent it from prefering to the PK). I see no reason for the difference. (If we try this with the Fiddle, it stays the same - perhaps because of the small data set.)
Can anybody explain why the execution plan does not indicate the expected efficient use of the secondary index? Is there are a way to straighten it out?
The rule is simple: Do the
=
columns first, in any order. Then you get one crack at a 'range'.Do not listen to the old-wives-tale about ordering them based on cardinality.
With
SELECT id
, that index contains all the columns needed, so it is "covering", as indicated byEXPLAIN's
"Using index".With
SELECT *
the index is missingsequence_number
. So, it has two ways to be executed:Plan A: Use the index; for each row relevant row in the index's BTree, reach over into the Data's BTree (via
id
) to find the missing column.Plan B: Shun the index and simply scan the data, which is ordered by the
PRIMARY KEY(id)
. But lo and behold,id < 3
is actually a pretty good use of the PK. TheEXPLAIN
will probably sayPRIMARY
andRange
.The Optimizer will make a semi-intelligent choice between the Plans and usually pick the better one.
Plan C: Plan A can be improved upon. Add
sequence_number
(on the end) to makeINDEX(name, is_sequenced, id, sequence_number)
. Now you get "covering" ("Using index") and the fastest possible index.More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
About 5.6 / 5.7 / 8.0, the Optimizer was revamped a lot. It moved to a "cost-based model", where it uses index statistics, etc, to compute estimates of how costly each possible execution plan would be. The change was rolled out with a lot of fanfare, but the net effect on query plans was minimal. One area where no model does well is if the
WHERE
clause has range criteria on two different tables in aJOIN
.ORDER BY
and/orLIMIT
throw extra monkey-wrenches into the fray.ANALYZE TABLE
used to be important for 'fixing' the statistics; 5.6 made a fundamental improvement in that. Still, the "statistics" are not perfect.id=3
-- Well, you are asking for all the columns, and using the PK has all the columns, so why even consider some secondary index. (The PK is "clustered" with the data.) Even if there is an index that is just as good, the data is more likely to be cached in RAM. (The cost model does not [yet] take into account caching or SSD vs HDD.)As a Rule of Thumb (empirically determined), a secondary index will be shunned if more than 20% of it is needed. The bouncing back and forth between the secondary BTree and the Data BTree is assumed to be more costly than simply scanning the data. In your tiny table, 30% of the index is needed. QED. (Actually this is another gray area in which the Optimizer sometimes "gets it wrong".)