I want to compute similarity between items (0,1,2,3..) based on their temporal information. Temporal information may be time instant (startdate), time interval (startdate,enddate) or null (NaT); see an example of dataframe (df_for) below.
- {instant,instant} : if equal sim = 1, else sim =0
- {instant,null} or vice versa, sim =0
- {instant, interval}: if instant within interval, sim =1 or if an interval contains an instant, sim = 1
- {interval,interval} : if intervals overlaps, sim = intersection of both intervals / union of both intervals
- {interval,interval} : if an interval contains another, then sim = 1
The following python code obtains temporal information from the dataframe and performs the conditions above (1-5). The code is verbose, i wonder if there is a smart way/lib to calculate similarity between time periods and time instants using python.
m, k = df_for.shape
sim = np.zeros((m, m))
data = df_for.values
for i in range(m):
for j in range(m):
if i != j:
st1 = data[i][0]
ed1 = data[i][1]
st2 = data[j][0]
ed2 = data[j][1]
#both items are null values
if pd.isnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
sim[i][j] = 0.
# {instant, instant} => equal, not equal
if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
if st1 == st2:
sim[i][j] = 1.
else:
sim[i][j] = 0.
# {instant, null} => sim is 0
if pd.notnull(st1) and pd.isnull(ed1) and pd.isnull(st2) and pd.isnull(ed2):
sim[i][j] = 0.
# {instant, interval} => meets, during
if pd.notnull(st1) and pd.isnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
if(st2 <= st1 <= ed2):
sim[i][j] = 1. #a time is between two other times
else:
sim[i][j] = 0.
# {interval, instant} => meets, contains
if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.isnull(ed2):
if(st1 <= st2 <= ed1):
sim[i][j] = 1. #a time is between two other times
else:
sim[i][j] = 0.
# {interval, interval} => equal, overlaps, not overlaps
if pd.notnull(st1) and pd.notnull(ed1) and pd.notnull(st2) and pd.notnull(ed2):
if (st1 <= st2 <= ed1) or (st2 <= st1 <= ed2):
intersect = min(ed1,ed2)- max(st1,st2) # earliestend-lateststart
union = max(st1,st2,ed1,ed2) - min(ed1,ed2,st1,st2)
overlaps = intersect/union
#print(intersect/np.timedelta64(1, 'D'),union/np.timedelta64(1, 'D'))
if (st1 > st2 and ed1 < ed2) or (st1 < st2 and ed1 > ed2): # contains, during
overlaps = 1.0
sim[i][j]=overlaps
else:
sim[i][j] = 0.
else:
sim[i][j] = 1.
I can see several ways to simplify the code.
First, I'd suggest moving the comparison code to a separate function that takes
st1
,ed1
,st2
, anded2
as arguments. (As a side note, whyst
(start time) buted
(end date)? Consistent names might be nice.) You'd be able toreturn
your result to the calling code, which would be responsible for doing the assignment into the array of results.Speaking of that calling code... It doesn't need to call the comparison function for every pair of time-ranges. The results should always be symmetrical (e.g.
compare(data[i][0], data[i][1], data[j][0], data[j][1])
is going to return the same thing ascompare(data[j][0], data[j][1], data[i][0], data[i][1])
). So just call one of those, and assign the results to bothsim[i][j]
andsim[j][i]
.Rather than lots of
if some_test_expression: return 1; else: return 0
blocks, just return the results of the comparison:return some_test_expression
. Python'sbool
type is a subclass ofint
, so these should convert to numbers automatically when you assign them into a numpy array (you could also cast tofloat
explicitly if you want the function to always return the same type).Handle all cases with nulls at the top. They're an easy case, and if you deal with them first (and return, so the rest of the function doesn't run), you don't need to check as many things for null later. You don't need to handle null/null, null/instantand null/interval separately, just return zero if either start value is null:
if isnull(st1) or isnull(st2): return 0.
Instant/interval comparisons are symmetrical, so just write the logic one way and flip the parameters around if they're initially in the wrong order:
if isnull(ed2): st1, ed1, st2, ed2 = st2, ed2, st1, ed1
I don't see too much that can be improved with the specific implementations of your comparisons between instants and intervals or between two intervals (other than the general things mentioned already above). However, I would like to say that your formula for overlapping intervals will have a big discontinuity in similarity as a small interval slowly overlaps more with a larger one, then gets fully contained. For instance, a one second interval that starts just a millisecond before a one hour interval will result in a similarity value between the two of
0.000277
(0.999 / (3600 + 0.001)
), but one that starts at the same time as the larger interval does will have similarity of1.0
(a big difference). A more continuously changing measure of similarity would come from dividing the overlap amount by the size of the smaller interval (rather than the size of the union). This formula doesn't need a special case for one interval fully containing another, since the default calculation will give a similarity of1.0
already (the overlap will be the size of the smaller interval, so you'll be dividing it by itself).So, putting all that together, here's how I'd rewrite things:
A few notes that I couldn't fit into comments in the code:
No test for the values being intervals is needed at the last step of the
compare_interval
function, since all other cases (involving instants and nulls) have already been handled.I've changed the assignments to the
sim
array to use tuple indexes, as that's more natural than nested indexes for multidimensional numpy arrays (slices don't work if you use nested indexes). You might be able to do the same thing with thedata
lookups, but I've not changed them since I don't know pandas nearly as well as I know numpy.Speaking of my lack of pandas knowledge, there is probably a way to apply the comparison function directly to a "join" of the dataframe with itself. I don't know how to do that though. If it exists, it will probably be faster than the nested loops I left mostly unchanged in the calling code (even if it doesn't optimize out the symmetric cases).