How to properly utilize a composite index in JOIN queries?

125 Views Asked by At

I am using MySQL Server 8 and have a MyISAM table containing daily statistics for 'cell' objects for a period of 2 years. In the test table described here there are between 51 and 57 cells (rows) for each day. In a production environment every day a new batch of 80+ cells with daily statistics for the previous day will be added. The columns in the table are many--about 860 counters. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a cluster of user defined list of cells. Each column in the result set is a KPI, which is calculated based on one or more raw counters. The query joins a table containing cluster definitions clusters_cust with the main statistical table h_cell. Each cell from the cluster is matched with the statistical records in table h_cell for the same cell. The user defines a reporting period of several days to several months, selects a defined cluster from table clusters_cust and the result set are calculated cells' statistics aggregated per cluster level for for each day of the reporting period.

An example for the contents of clusters_cust is below. This table contains a list of cells in the first column 'Cell' and their corresponding Cluster Names in the second column 'Cluster'. One cell can be a part of one or more clusters. Cells that are not a part of any cluster are not present in this table. In this table there could be many clusters (each user is able to define as many clusters as they want) and each cluster can contain a user-defined list of cells--from just a few to tens of thousands.

Cell Cluster
Cell_1 Cluster_1
Cell_2 Cluster_1
Cell_3 Cluster_1
Cell_10 Cluster_2
Cell_11 Cluster_2
Cell_12 Cluster_2
Cell_13 Cluster_2
Cell_2 Cluster_3
Cell_14 Cluster_3
Cell_15 Cluster_3

An example of the contents of table h_cell is below. This is the statistical table. It contains daily values of counters for each cell. Each row is the record for specific cell for specific date. Every day a new batch of statistics for 80.000+ cells are inserted.

Time Cell Counter_1 Counter_2 Counter_X Counter_860
2023-01-01 Cell_1 3 2 .. 2
2023-01-01 Cell_2 4 5 .. 5
2023-01-01 Cell_3 1 0 .. 0
.......... ...... .. .. .. ..
2023-01-02 Cell_1 6 5 .. 5
2023-01-02 Cell_2 3 4 .. 4
2023-01-02 Cell_3 4 3 .. 3
.......... ...... .. .. .. ..

Simplified explanation of the query:

  1. The user selects 'Cluster_1';
  2. Then the user selects the reporting period - from 2018-01-01 to 2018-01-05, which is 5 days;
  3. The optimizer should extract the contents of 'Cluster_1' from table clusters_cust, namely 'Cell_1', 'Cell_2' and 'Cell_3';
  4. Then it matches these 3 cells with the records for 'Cell_1', 'Cell_2' and 'Cell_3' in table h_cell for 2018-01-01, 2018-01-02, 2018-01-03, 2018-01-04 and 2018-01-05;
  5. Next step is to sum the values of ('Counter_1' divided by 'Counter_2') for all 3 cells, so that to aggregate them to one cluster value;
  6. Since grouing is done by date there should be one cluster value for each of the 5 days in the reporting period.
SELECT cluster, Time,
    ROUND(SUM(counter1)/SUM(counter1+counter2)*100,3) AS 'KPI1',
    SUM(counter1) AS 'KPI2',
    .......
    SUM(counterN) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62'
AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time 

I expected that this query would rely on a composite index in table h_cell, which consists of 2 columns--'Cell' and 'Time' (more information about table and index structure is posted below). But the optimizer was unable to use both columns of the index. It used either just the first column 'Cell' and not the second 'Time' or used the 'Time' part of another composite index designed for other type of queries (see below for outputs of EXPLAIN). I am joining two tables and applying filters on both of them and the optimizer doesn't know if it's better to start accessing table #1 and then scan table #2, or the other way around.

CREATE TABLE `h_cell` (
  Time date NOT NULL,
  Cell char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  LocalCI tinyint NOT NULL,
  Integrity varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  counter1 int DEFAULT NULL,
  counter2 int DEFAULT NULL,
  counter3 double DEFAULT NULL,
  counter4 float DEFAULT NULL,
  ...........
  counter860 int DEFAULT NULL,
  PRIMARY KEY (Cell,Time) USING BTREE,
  KEY `Time` (Time,LocalCI) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC

CREATE TABLE `clusters_cust` (
  Cell varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
  Cluster varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  Comment varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  PRIMARY KEY (Cluster,Cell),
  KEY `Cell` (Cell),
  KEY `Comment` (Comment,Cluster)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell |          0 | PRIMARY  |            1 | Cell        | A         |       58258 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          0 | PRIMARY  |            2 | Time        | A         |    39090988 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            1 | Time        | A         |         730 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| h_cell |          1 | Time     |            2 | LocalCI     | A         |       15081 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

The primary key is designed to serve queries on cell level--it shows statistics for a given cell and period. I was hoping that it will also help the join queries on cluster level, but this didn't happen.

  1. Below is the EXPLAIN result for a cluster of 62 cells and 2 months reporting period. It seems that from the composite primary key only the first column 'Cell' is used and not the 'Time' part:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        |   63 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
  1. For a larger cluster of 3.000 cells and reporting period of 2 months the situation is the same--only the first column 'Cell' of the index is used:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table         | partitions | type | possible_keys       | key     | key_len | ref                          | rows | filtered | Extra                                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | clusters_cust | NULL       | ref  | PRIMARY,Cell        | PRIMARY | 322     | const                        | 4067 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | h_cell        | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | ee_4g_hua.clusters_cust.Cell |  671 |     5.32 | Using index condition                     |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
  1. But for the same cluster of 3.000 cells and a shorter reporting period of 1 month the primary key is not used at all and another index 'Time' is used (this index was designed for another type of queries). Again only the first column of this index is used, but this was expected:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 1056817 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
  1. For an even larger cluster of 20.000 cells and reporting period of 2 months again the primary key is not used, but the 'Time' index (its first column):
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
  1. To get rid of the JOIN I executed a subquery in WHERE clause to get the needed list of cells and then IN operator was used in the query result. The idea was to make the WHERE clause as much as possible look like the PRIMARY KEY index structure: to have (cell=A or cell=B or ...) and range scan in Time column.
SELECT Time,
    ROUND(SUM(counter1)/SUM(counter1+counter2)*100,3) AS 'KPI1',
    SUM(counter1) AS 'KPI2',
    .......
    SUM(counterN) AS 'KPI840'
FROM h_cell
WHERE cell IN (
    SELECT cell
    FROM clusters_cust
    WHERE cluster='cluster20k')
AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time 

EXPLAIN showed the same plan as if there was a JOIN and again 'Time' index was used, instead of PRIMARY KEY:

+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type   | possible_keys       | key     | key_len | ref                         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
|  1 | SIMPLE      | h_cell        | NULL       | range  | PRIMARY,Time        | Time    | 3       | NULL                        | 2080777 |   100.00 | Using index condition    |
|  1 | SIMPLE      | clusters_cust | NULL       | eq_ref | PRIMARY,Cell        | PRIMARY | 368     | const,ee_4g_hua.h_cell.Cell |       1 |   100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
  1. Again JOIN was omitted and there was a subquery which directly gets the needed list of cells from a pre-filtered table clust. One more time IN operator was used.
SELECT Time,
    ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
    SUM(`counter1`) AS 'KPI2',
    .......
    SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust)
AND Time>='2018-05-01' and Time<='2018-06-30' 
GROUP BY Time 

The result of EXPLAIN is below. This time PRIMARY KEY index was used, but only the first column from it--'cell'--and not both columns--'cell' and 'Time'. Execution time was so poor that I was not able to wait for query to end.

+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type  | table       | partitions | type | possible_keys       | key     | key_len | ref              | rows  | filtered | Extra                        |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             |  NULL |   100.00 | Using where; Using temporary |
|  1 | SIMPLE       | h_cell      | NULL       | ref  | PRIMARY,Time        | PRIMARY | 32      | <subquery2>.cell |   671 |     5.32 | Using index condition        |
|  2 | MATERIALIZED | clust       | NULL       | ALL  | NULL                | NULL    | NULL    | NULL             | 20000 |   100.00 | NULL                         |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+

What should be changed, either in my query or in the table design, for the optimizer to use both 'Cell' and 'Time' members of the primary key index?

Is this possible?

Should there be another more effective index?

2

There are 2 best solutions below

5
DRapp On

Since alternatives have not worked, I would suggest this alternative.

on your h_cell table, have an index on ( time, cell ).

As for the query, I also added the keyword STRAIGHT_JOIN. I also qualified each column with the table (alias) to the corresponding columns to better follow which column comes from which table.

Now, from even your own description of data being daily of 51-57k records for a period of roughly 60 days is still running and computing over 300k records. Now, since you only care about 'cluster62' per this example, I would assume the count is less. Also, for clarity in the query, you should not have to add the tick characters around every column... table.column or alias.column gives the engine explicit qualification to prevent ambiguity of where data comes from, such as TIME may be a reserved word, but h.time is explicitly a column from the h (alias to h_cell table).

SELECT STRAIGHT_JOIN
        cc.cluster,
        h.Time,
        ROUND(  SUM( h.counter1 ) / SUM( h.counter1 + h.counter2 ) * 100,3) AS KPI1,
        SUM( h.counter1 ) AS KPI2,
        .......
        SUM( h.counterN ) AS KPI840
    FROM 
        h_cell h
            INNER JOIN clusters_cust cc
                ON  cc.cluster = 'cluster62' 
                AND h.cell = cc.cell
    WHERE 
            h.Time >= '2018-05-01' 
        AND h.Time <= '2018-06-30'
    GROUP BY 
        h.Time 
2
Rick James On

Is this what you have?

  • Several "clusters"
  • Each cluster has several "cells"
  • Each cell have several rows in h_cell, and more come in over time.
  • The query needs to fetch the data for all the cells in one cluster.

Let's restructure.

SHOW CREATE TABLE h_cell;
CREATE TABLE `h_cell` (
  `Time` date NOT NULL,
   ClusterId SMALLINT UNSIGNED NOT NULL,  -- added
   CellId MEDIUMINT UNSIGNED NOT NULL,    -- changed
  `LocalCI` tinyint NOT NULL,
  `Integrity` varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `counter1` int DEFAULT NULL,
  `counter2` int DEFAULT NULL,
  `counter3` double DEFAULT NULL,
  `counter4` float DEFAULT NULL,
  ...........
  `counter860` int DEFAULT NULL,
  PRIMARY KEY (`ClusterId, `Time`, `CellId`) USING BTREE,  -- Changed
  KEY `Time` (`Time`,`LocalCI`) USING BTREE
  -- (other queries may need other indexes)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC

Notes/questions:

  • What are LocalCI and Integrity for?

  • Are you sure MySQL 8 kept the table as MyISAM?

  • If 8.0 silently switched the table to InnoDB (or you are forced to later), The PK will probably need changing (for efficiency).

  • Check the size of the Ids that I added.

  • Make appropriate changes for the "Ids" in the other tables.

  • (minor) If you frequently do monthly scans, consider changing from

        AND  hc.Time>='2018-05-01'
        AND  hc.Time<='2018-06-30'
    

    to this formulation:

        AND  hc.Time>='2018-05-01'
        AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH
    

The principle here is to turn a lot of lookups into a simple index scan.

Resulting query:

SELECT  cc.cluster, hc.Time,
        ROUND ,,,
    FROM  clusters_cust AS cc
    JOIN  h_cell AS hc  ON cc.ClusterId = hc.ClusterId
    WHERE  cc.cluster='cluster62'
      AND  hc.Time>='2018-05-01'
      AND  hc.Time <'2018-05-01' + INTERVAL 1 MONTH
    GROUP BY  hc.Time