Efficient point-in-time query of group membership

160 Views Asked by At

We have a scenario like this:

  • Millions of records (Record 1, Record 2, Record 3...)
  • Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
  • Membership gradually changes over time, i.e. a record may be reassigned to another group.

We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:

  1. To which group did Record 15544 belong, three years ago? (Call this Group g).
  2. What records belonged to Group g, three years ago?

Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:

  1. Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
  2. Find all records that have at any time belonged to Group g.
  3. For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.

This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?


ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.

1

There are 1 best solutions below

0
On

How about creating two tables:

  1. (recordID, time-> groupID) - key is recordID, time - sorted by recordID, and secondary by time (Let that be map1)
  2. (groupID, time-> List) - key is groupID, time - sorted by recordID, and secondary by time (Let that be map2)

At each record change:

  • Retrieve the current groupID of the record you are changing
  • set t <- current time
  • create a new entry to map1 for old group: (oldGroupID,t,list') - where list' is the same list, but without the entry you just moved out from there.
  • Add a new entry to map1 for new group: (newGroupId,t,list'') - where list'' is the old list for the new group, with the changed record added to it.
  • Add a new entry (recordId,t,newGroupId) to map1

During query:

  • You need to find the entry in map2 that is 'closest' and smaller than (recordId,desired_time) - this is classic O(logN) operation in sorted data structure.
  • This will give you the group g the element belonged to at the desired time.
  • Now, look in map1 similarly for the entry with key closest but smaller than (g,desired_time). The value is the list of all records that are at the group at the desired time.

This requires quite a bit of more space (at constant factor though...), but every operation is O(logN) - where N is the number of record changes.

An efficient sorted DS for entries that are mostly stored on disk is a B+ tree, which is also implemented by many relational DS implementations.