Why can't select records for each code having the maximum date with having date=max(date)?

154 Views Asked by At

Show create table structure with;

CREATE TABLE `quote` (
  `id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `code` text COLLATE utf8mb4_unicode_ci,
  `date` date DEFAULT NULL,
  `open` double DEFAULT NULL,
  `high` double DEFAULT NULL,
  `low` double DEFAULT NULL,
  `close` double DEFAULT NULL,
  `volume` bigint(15) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=17449887 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 

Find the records for each code having the maximum date with inner join such as:

SELECT q1.*
FROM quote q1
INNER JOIN
(
    SELECT code, MAX(date) AS max_date
    FROM quote
    GROUP BY code
) q2
    ON q2.code = q1.code AND
       q2.max_date = q1.date;

I wonder why the subquery can't get desired result as above:

select * from quote group by code having date=max(date);

Please explain the reason in detail.

3

There are 3 best solutions below

0
On BEST ANSWER

I wonder why the subquery can't get desired result as above:

select * from quote group by code having date=max(date);

To start with:

select * from quote group by code

This query itself is not a valid one from SQL Standard perspective.

It could be, if all other columns were functionally dependent on code, which is not the case based on table definition(code is not unique and not a primary key). Related reading: Group by clause in mySQL and postgreSQL, why the error in postgreSQL?

The query behaves kind of ANY_VALUE:

select code, ANY_VALUE(id), ANY_VALUE(`date`), ANY_VALUE(`open`)...
from quote
group by code

Regarding second part:

having date=max(date);
--
having any_value(date) = max(date) -- sidenote: it will work for single row per `code`

Here the condition in HAVING applies after aggregation, which means the comparison is between MAX(date) per code compared to "unspecified" date.


Illustrating with example(this code will work only if only_full_group_by is turned off):

CREATE TABLE `quote` (
  `id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `code` text COLLATE utf8mb4_unicode_ci,
  `date` date DEFAULT NULL,
  `open` double DEFAULT NULL,
  PRIMARY KEY (`id`)
) ;

INSERT INTO quote(`code`, `date`, `open`)
VALUES ('a', '2020-01-01',10),
       ('a', '2021-01-01',20),
       ('a', '2022-01-01',30);

And queries:

SELECT * FROM quote;
+-----+-------+-------------+------+
| id  | code  |    date     | open |
+-----+-------+-------------+------+
|  1  | a     | 2020-01-01  |   10 |
|  2  | a     | 2021-01-01  |   20 |
|  3  | a     | 2022-01-01  |   30 |
+-----+-------+-------------+------+

select * from quote group by code;

-- this part is unspecified, id/date/open are arbitrary
+-----+-------+-------------+------+
| id  | code  |    date     | open |
+-----+-------+-------------+------+
|  1  | a     | 2020-01-01  |    1 |
+-----+-------+-------------+------+

select *, MAX(date) from quote group by code;

-- MAX(date) is stable, date is arbitrary, comparison does not make sense at this point
+-----+-------+-------------+-------+------------+
| id  | code  |    date     | open  | MAX(date)  |
+-----+-------+-------------+-------+------------+
|  1  | a     | 2020-01-01  |   10  | 2022-01-01 |  
+-----+-------+-------------+-------+------------+

select * from quote group by code having date=max(date);

-- empty
+-----+-------+-------+------+
| id  | code  | date  | open |
+-----+-------+-------+------+

db<>fiddle demo


Saying so, in order to get all columns ranking functions MySQL 8.0+ could be used:

This section describes nonaggregate window functions that, for each row from a query, perform a calculation using rows related to that row

SELECT *
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY `code` ORDER BY `date` DESC) AS rn
      FROM `quote`) s   --RANK() if `date` is not unique per code
WHERE rn = 1;

db<>fiddle demo 2

0
On

@scrapy I believe the answer to your question has to do with the difference between having a sub-query vs how the other query is structured. The way having a sub-query works is MySQL creates a derived table from the query before the outer query is run, then the derived table is used when MySQL executes the outer query (here's the documentation on derived tables for you to refer to).

The sub query you have works because you are only selecting 1 column (code) and then you are getting an aggregate value from using the MAX(date) as your second column and lastly you are grouping by code on the last line of your sub-query.

In your second query, you are using SELECT * and then grouping by just code before you try to use the MAX(date) in your HAVING clause. This query doesn't work because you are selecting every column in the table by using SELECT * but you are only grouping by code in your GROUP BY clause. Starting in MySQL v5.7 and beyond there is something called only_full_group_by that does not allow you to run a query using GROUP BY unless you specify every column in your SELECT statement in your GROUP BY, IE: to get your second query to work you would have to have every column in your table listed in your GROUP BY clause because you are using a SELECT * for your select statement (here is the documentation that talks about only_full_group_by).

And lastly, in order to get the result set you are looking for you have to group by the right columns, like you are doing in your sub-query. If you use anything other than code in a query where you try to get the max date per code, the result set will not be the same because you would have to group by the extra columns which would throw off your result set.

0
On

This

select * from quote group by code having date=max(date);

has max(date) which makes sense in the context of GROUP BY code. But date does not. The issue is which row's date should be compared? Simply put, it is probably 'invalid' SQL.

See also discussions of "only_full_group_by". (Newer versions of MySQL would flag your query as invalid. That flag is a way to turn it off to get the old, wrong, evaluation.)

That leads to subqueries, such as the one you have. There are a few others. Here is my catalog of the optimal ways to do groupwise-max: http://mysql.rjweb.org/doc.php/groupwise_max

There are also many discussions; see the tag [groupwise-maximum] that I added.

Other issues: Is code a ticker symbol? If so, it does not need to be TEXT. By changing to, say, VARCHAR(15), you can gain a lot of performance:

Get rid of id and change to PRIMARY KEY(code, date). This will spread up that subquery significantly and will probably improve some other queries.