I am using MySQL Server 8 and have a MyISAM table containing daily statistics for 'cell' objects for a period of 2 years. In the test table described here there are between 51 and 57 cells (rows) for each day. In a production environment every day a new batch of 80+ cells with daily statistics for the previous day will be added. The columns in the table are many--about 860 counters. The database cannot be normalized, because all columns are equally important. The query produces about 840 columns of statistics for a cluster of user defined list of cells. Each column in the result set is a KPI, which is calculated based on one or more raw counters. The query joins a table containing cluster definitions clusters_cust with the main statistical table h_cell. Each cell from the cluster is matched with the statistical records in table h_cell for the same cell. The user defines a reporting period of several days to several months, selects a defined cluster from table clusters_cust and the result set are calculated cells' statistics aggregated per cluster level for for each day of the reporting period.
An example for the contents of clusters_cust is below. This table contains a list of cells in the first column 'Cell' and their corresponding Cluster Names in the second column 'Cluster'. One cell can be a part of one or more clusters. Cells that are not a part of any cluster are not present in this table. In this table there could be many clusters (each user is able to define as many clusters as they want) and each cluster can contain a user-defined list of cells--from just a few to tens of thousands.
| Cell | Cluster |
|---|---|
| Cell_1 | Cluster_1 |
| Cell_2 | Cluster_1 |
| Cell_3 | Cluster_1 |
| Cell_10 | Cluster_2 |
| Cell_11 | Cluster_2 |
| Cell_12 | Cluster_2 |
| Cell_13 | Cluster_2 |
| Cell_2 | Cluster_3 |
| Cell_14 | Cluster_3 |
| Cell_15 | Cluster_3 |
An example of the contents of table h_cell is below. This is the statistical table. It contains daily values of counters for each cell. Each row is the record for specific cell for specific date. Every day a new batch of statistics for 80.000+ cells are inserted.
| Time | Cell | Counter_1 | Counter_2 | Counter_X | Counter_860 |
|---|---|---|---|---|---|
| 2023-01-01 | Cell_1 | 3 | 2 | .. | 2 |
| 2023-01-01 | Cell_2 | 4 | 5 | .. | 5 |
| 2023-01-01 | Cell_3 | 1 | 0 | .. | 0 |
| .......... | ...... | .. | .. | .. | .. |
| 2023-01-02 | Cell_1 | 6 | 5 | .. | 5 |
| 2023-01-02 | Cell_2 | 3 | 4 | .. | 4 |
| 2023-01-02 | Cell_3 | 4 | 3 | .. | 3 |
| .......... | ...... | .. | .. | .. | .. |
Simplified explanation of the query:
- The user selects 'Cluster_1';
- Then the user selects the reporting period - from 2018-01-01 to 2018-01-05, which is 5 days;
- The optimizer should extract the contents of 'Cluster_1' from table
clusters_cust, namely 'Cell_1', 'Cell_2' and 'Cell_3'; - Then it matches these 3 cells with the records for 'Cell_1', 'Cell_2' and 'Cell_3' in table
h_cellfor 2018-01-01, 2018-01-02, 2018-01-03, 2018-01-04 and 2018-01-05; - Next step is to sum the values of ('Counter_1' divided by 'Counter_2') for all 3 cells, so that to aggregate them to one cluster value;
- Since grouing is done by date there should be one cluster value for each of the 5 days in the reporting period.
SELECT cluster, Time,
ROUND(SUM(counter1)/SUM(counter1+counter2)*100,3) AS 'KPI1',
SUM(counter1) AS 'KPI2',
.......
SUM(counterN) AS 'KPI840'
FROM h_cell
INNER JOIN clusters_cust ON clusters_cust.cell = h_cell.cell
WHERE cluster='cluster62'
AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
I expected that this query would rely on a composite index in table h_cell, which consists of 2 columns--'Cell' and 'Time' (more information about table and index structure is posted below). But the optimizer was unable to use both columns of the index. It used either just the first column 'Cell' and not the second 'Time' or used the 'Time' part of another composite index designed for other type of queries (see below for outputs of EXPLAIN). I am joining two tables and applying filters on both of them and the optimizer doesn't know if it's better to start accessing table #1 and then scan table #2, or the other way around.
CREATE TABLE `h_cell` (
Time date NOT NULL,
Cell char(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
LocalCI tinyint NOT NULL,
Integrity varchar(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
counter1 int DEFAULT NULL,
counter2 int DEFAULT NULL,
counter3 double DEFAULT NULL,
counter4 float DEFAULT NULL,
...........
counter860 int DEFAULT NULL,
PRIMARY KEY (Cell,Time) USING BTREE,
KEY `Time` (Time,LocalCI) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC
CREATE TABLE `clusters_cust` (
Cell varchar(11) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '',
Cluster varchar(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
Comment varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
PRIMARY KEY (Cluster,Cell),
KEY `Cell` (Cell),
KEY `Comment` (Comment,Cluster)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> SHOW INDEX FROM h_cell;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| h_cell | 0 | PRIMARY | 1 | Cell | A | 58258 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 0 | PRIMARY | 2 | Time | A | 39090988 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 1 | Time | A | 730 | NULL | NULL | | BTREE | | | YES | NULL |
| h_cell | 1 | Time | 2 | LocalCI | A | 15081 | NULL | NULL | | BTREE | | | YES | NULL |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
The primary key is designed to serve queries on cell level--it shows statistics for a given cell and period. I was hoping that it will also help the join queries on cluster level, but this didn't happen.
- Below is the
EXPLAINresult for a cluster of 62 cells and 2 months reporting period. It seems that from the composite primary key only the first column 'Cell' is used and not the 'Time' part:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 63 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
- For a larger cluster of 3.000 cells and reporting period of 2 months the situation is the same--only the first column 'Cell' of the index is used:
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | clusters_cust | NULL | ref | PRIMARY,Cell | PRIMARY | 322 | const | 4067 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | ee_4g_hua.clusters_cust.Cell | 671 | 5.32 | Using index condition |
+----+-------------+---------------+------------+------+---------------------+---------+---------+------------------------------+------+----------+-------------------------------------------+
- But for the same cluster of 3.000 cells and a shorter reporting period of 1 month the primary key is not used at all and another index 'Time' is used (this index was designed for another type of queries). Again only the first column of this index is used, but this was expected:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 1056817 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
- For an even larger cluster of 20.000 cells and reporting period of 2 months again the primary key is not used, but the 'Time' index (its first column):
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
- To get rid of the
JOINI executed a subquery inWHEREclause to get the needed list of cells and thenINoperator was used in the query result. The idea was to make theWHEREclause as much as possible look like thePRIMARY KEYindex structure: to have (cell=A or cell=B or ...) and range scan in Time column.
SELECT Time,
ROUND(SUM(counter1)/SUM(counter1+counter2)*100,3) AS 'KPI1',
SUM(counter1) AS 'KPI2',
.......
SUM(counterN) AS 'KPI840'
FROM h_cell
WHERE cell IN (
SELECT cell
FROM clusters_cust
WHERE cluster='cluster20k')
AND Time>='2018-05-01' AND Time<='2018-06-30'
GROUP BY Time
EXPLAIN showed the same plan as if there was a JOIN and again 'Time' index was used, instead of PRIMARY KEY:
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | h_cell | NULL | range | PRIMARY,Time | Time | 3 | NULL | 2080777 | 100.00 | Using index condition |
| 1 | SIMPLE | clusters_cust | NULL | eq_ref | PRIMARY,Cell | PRIMARY | 368 | const,ee_4g_hua.h_cell.Cell | 1 | 100.00 | Using where; Using index |
+----+-------------+---------------+------------+--------+---------------------+---------+---------+-----------------------------+---------+----------+--------------------------+
- Again
JOINwas omitted and there was a subquery which directly gets the needed list of cells from a pre-filtered tableclust. One more timeINoperator was used.
SELECT Time,
ROUND(SUM(`counter1`)/SUM(`counter1`+`counter2`)*100,3) AS 'KPI1',
SUM(`counter1`) AS 'KPI2',
.......
SUM(`counterN`) AS 'KPI840'
FROM h_cell
WHERE cell IN (SELECT * FROM clust)
AND Time>='2018-05-01' and Time<='2018-06-30'
GROUP BY Time
The result of EXPLAIN is below. This time PRIMARY KEY index was used, but only the first column from it--'cell'--and not both columns--'cell' and 'Time'. Execution time was so poor that I was not able to wait for query to end.
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | h_cell | NULL | ref | PRIMARY,Time | PRIMARY | 32 | <subquery2>.cell | 671 | 5.32 | Using index condition |
| 2 | MATERIALIZED | clust | NULL | ALL | NULL | NULL | NULL | NULL | 20000 | 100.00 | NULL |
+----+--------------+-------------+------------+------+---------------------+---------+---------+------------------+-------+----------+------------------------------+
What should be changed, either in my query or in the table design, for the optimizer to use both 'Cell' and 'Time' members of the primary key index?
Is this possible?
Should there be another more effective index?
Since alternatives have not worked, I would suggest this alternative.
on your h_cell table, have an index on ( time, cell ).
As for the query, I also added the keyword STRAIGHT_JOIN. I also qualified each column with the table (alias) to the corresponding columns to better follow which column comes from which table.
Now, from even your own description of data being daily of 51-57k records for a period of roughly 60 days is still running and computing over 300k records. Now, since you only care about 'cluster62' per this example, I would assume the count is less. Also, for clarity in the query, you should not have to add the
tickcharacters around every column... table.column or alias.column gives the engine explicit qualification to prevent ambiguity of where data comes from, such asTIMEmay be a reserved word, buth.timeis explicitly a column from theh(alias to h_cell table).