I have two tables, as follows (simplified from actual):
mysql> desc small_table; +-----------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------+---------------+------+-----+---------+-------+ | event_time | datetime | NO | | NULL | | | user_id | char(15) | NO | | NULL | | | other_data | int(11) | NO | MUL | NULL | | +-----------------+---------------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> desc large_table; +-----------------+---------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------+---------------+------+-----+---------+-------+ | event_time | datetime | NO | | NULL | | | user_id | char(15) | NO | | NULL | | | other_data | int(11) | NO | | NULL | | +-----------------+---------------+------+-----+---------+-------+ 3 rows in set (0.00 sec)
Now, small_table
is, well, small: for each user_id
there is usually only one row (though there are sometimes more). In large_table
, on the other hand, each user_id
appears numerous times.
mysql> select count(1) from small_table\G *************************** 1. row *************************** count(1): 20182 1 row in set (0.00 sec) mysql> select count(1) from large_table\G *************************** 1. row *************************** count(1): 2870522 1 row in set (0.00 sec)
However, and this is important, for each row in small_table
, there is at least one row in large_table
with the same user_id
, the same other_data
, and similar event_time
(the same within a few minutes, say).
I want to know whether small_table
has a row corresponding to the first, or the second, or the whateverth distinct row in large_table
for the same user_id
and similar event_time
. That is, I'd like:
- for each
user_id
, a count of distinct rows oflarge_table
in order byevent_time
, but only forevent_time
within, say, three hours; that is, I seek only the count of such rows as haveevent_time
within, say, three hours of one another; and - for each such collection of distinct rows, an identification of which row in that list (in order by
event_time
) has a corresponding row insmall_table
.
I don't seem able to write even a query that will do the first step, let alone a query that will do the second, and would appreciate any direction.
This should perhaps be a comment on Jonathan Leffler's detailed and helpful answer but (a) it's too long and (b) it does help answer my question, so I'm posting it as an answer.
The code titled "Multiple Event Ranges" in Jonathan Leffler's answer finds ranges where a second instance is soon after the first, and a penultimate instance is soon before the last, and no big breaks appear, but bars any big gap between interior instances, even if there are otherinstances between them. So, for example, if the limit is 3 hours, instances at 1, 2, 4, 6, and 7 would be barred because of the gap between 2 and 6. I think the correct code would instead be (building directly on Jonathan Leffler's):
which obviates the need for the last two
and exists
s in the code titled "Multiple Event Ranges" in Jonathan Leffler's answer and, indeed, obviates the need for the "Singleton ranges" and "Doubleton ranges" code in his answer.Unless I'm missing something.