"Using index" with composite index: A=, B=, C<=

The execution plans below seem disappointing and suboptimal, even though the queries are straightforward.

I'm using MySQL 5.7. Here is the fiddle (although it only offers 5.6).

CREATE TABLE `event` (
  `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(63) CHARSET ASCII COLLATE ASCII_BIN NOT NULL,
  `is_sequenced` TINYINT(3) UNSIGNED NOT NULL,
  `sequence_number` BIGINT(20) UNSIGNED DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `Name-SequenceNumber` (`name`,`sequence_number`),
  KEY `Name-IsSequenced` (`name`,`is_sequenced`,`id`)
) ENGINE=INNODB
;

INSERT INTO `event`
(id, `name`, is_sequenced, sequence_number)
VALUES
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL),
(NULL, 'OrderCreated', 0, NULL)
;

We will be using the Name-IsSequenced secondary index. Let's try the following EXPLAIN. (The queries are in the Fiddle. Open "View execution plan" to see their EXPLAIN result.)

EXPLAIN
SELECT * -- This part needs the PK
FROM `event` e
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id <= 3
;

So far, so good. Using index condition makes sense: the entire WHERE condition can be resolved on the expected index Name-IsSequenced, and then the PK is needed to fetch the remaining columns for SELECT *.
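
As an aside, if someone wants to poke at how much of that plan comes from index condition pushdown, the index_condition_pushdown flag in optimizer_switch can be toggled for the session and the EXPLAIN re-run. This is only a probing sketch; nothing in the question depends on it.

SET SESSION optimizer_switch = 'index_condition_pushdown=off';
-- re-run the EXPLAIN above and compare the Extra column
SET SESSION optimizer_switch = 'index_condition_pushdown=on';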

We should be able to improve that to Using index if we only select something that is part of our secondary index, right? (Note that the PK is always part of any secondary index, but we can even ensure this by including id at the end of our secondary index. The result is the same, as it should be.)

EXPLAIN
SELECT id
FROM `event` e
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id <= 3
;

Now, the result is Using where; Using index. Wait, that's... worse?! We made it do less work, and the plan indicates that it is working harder.

A plain Using index (without the extra Using where) should be achievable: find the range where name=OrderCreated, then inside it the subrange where is_sequenced=0, then inside that the subrange where id<=3.
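
To see exactly which parts of the WHERE end up as range bounds and which become a post-filter, the optimizer trace (available since 5.6) can help. This is only a sketch of the steps; the trace output itself is verbose.

SET SESSION optimizer_trace = 'enabled=on';
EXPLAIN
SELECT id
FROM `event` e
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id <= 3
;
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET SESSION optimizer_trace = 'enabled=off';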

Curiously, I have other experiments (with some more data) where I can get Using index by changing id<=3 to id=3 (combined with FORCE INDEX to keep it from preferring the PK). I see no reason for the difference. (If we try this in the Fiddle, the plan stays the same - perhaps because of the small data set.)
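
For concreteness, that experiment looks like this (the FORCE INDEX hint is only there to keep the optimizer off the PK; in the Fiddle's small data set the plan did not change for me):

EXPLAIN
SELECT id
FROM `event` e FORCE INDEX (`Name-IsSequenced`)
WHERE e.name = 'OrderCreated'
AND e.is_sequenced = 0
AND e.id = 3
;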

Can anybody explain why the execution plan does not indicate the expected efficient use of the secondary index? Is there a way to straighten it out?

There is 1 answer below.

WHERE e.name = 'OrderCreated'
  AND e.is_sequenced = 0
  AND e.id <= 3

The rule is simple: Do the = columns first, in any order. Then you get one crack at a 'range'.

INDEX(name, is_sequenced, -- in either order
      id)                 -- last

Do not listen to the old wives' tale about ordering them based on cardinality.

With SELECT id, that index contains all the columns needed, so it is "covering", as indicated by EXPLAIN's "Using index".

With SELECT * the index is missing sequence_number. So, it has two ways to be executed:

Plan A: Use the index; for each relevant row in the index's BTree, reach over into the Data's BTree (via id) to find the missing column.

Plan B: Shun the index and simply scan the data, which is ordered by the PRIMARY KEY(id). But lo and behold, id <= 3 is actually a pretty good use of the PK. The EXPLAIN will probably say PRIMARY and range.

The Optimizer will make a semi-intelligent choice between the Plans and usually pick the better one.
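If you want to see both plans side by side, forcing each access path is a blunt but workable way to do it (FORCE INDEX (PRIMARY) is valid for the clustered PK; which one is actually cheaper depends on your data):

EXPLAIN SELECT * FROM `event`
FORCE INDEX (PRIMARY)
WHERE name = 'OrderCreated' AND is_sequenced = 0 AND id <= 3;

EXPLAIN SELECT * FROM `event`
FORCE INDEX (`Name-IsSequenced`)
WHERE name = 'OrderCreated' AND is_sequenced = 0 AND id <= 3;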

Plan C: Plan A can be improved upon. Add sequence_number (on the end) to make INDEX(name, is_sequenced, id, sequence_number). Now you get "covering" ("Using index") and the fastest possible index.
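
Something like this (the index name here is just illustrative):

ALTER TABLE `event`
  ADD KEY `Name-IsSequenced-Covering` (`name`, `is_sequenced`, `id`, `sequence_number`);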

More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Across 5.6 / 5.7 / 8.0, the Optimizer was revamped a lot. It moved to a "cost-based model", where it uses index statistics, etc., to estimate how costly each possible execution plan would be. The change was rolled out with a lot of fanfare, but the net effect on query plans was minimal. One area where no cost model does well is when the WHERE clause has range criteria on two different tables in a JOIN. ORDER BY and/or LIMIT throw extra monkey wrenches into the fray.
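
In 5.7 you can look at those cost estimates directly; FORMAT=JSON is standard, though the numbers are only estimates:

EXPLAIN FORMAT=JSON
SELECT id FROM `event`
WHERE name = 'OrderCreated' AND is_sequenced = 0 AND id <= 3;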

ANALYZE TABLE used to be important for 'fixing' the statistics; 5.6 made a fundamental improvement in that. Still, the "statistics" are not perfect.
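
That is, still occasionally worth a refresh, together with a peek at the cardinalities the Optimizer will work from:

ANALYZE TABLE `event`;
SHOW INDEX FROM `event`;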

id=3 -- Well, you are asking for all the columns, and the PK has all the columns, so why even consider some secondary index? (The PK is "clustered" with the data.) Even if there is an index that is just as good, the data is more likely to be cached in RAM. (The cost model does not [yet] take caching or SSD vs HDD into account.)

As a Rule of Thumb (empirically determined), a secondary index will be shunned if more than about 20% of the table's rows would have to be fetched through it. The bouncing back and forth between the secondary BTree and the Data BTree is assumed to be more costly than simply scanning the data. In your tiny table, 3 of the 9 rows (about a third) are needed. QED. (Actually this is another gray area in which the Optimizer sometimes "gets it wrong".)
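
A rough way to test that threshold on this schema (purely illustrative: the CROSS JOIN merely inflates the row count so that id <= 3 becomes a small fraction of the table; re-run the EXPLAINs afterwards):

INSERT INTO `event` (id, `name`, is_sequenced, sequence_number)
SELECT NULL, a.`name`, 0, NULL
FROM `event` a CROSS JOIN `event` b CROSS JOIN `event` c;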