I've got a sqlite database of music tracks, and I want to remove duplicates. I'd like to compare tracks based on title and duration. (I'll probably try to throw artists in later, but that's a separate table (multiple artists per track), but for now, I've got a text field for the title and an integer field for the duration (in seconds).) Duplicate tracks in this database tend to have similar titles (or at least with similar prefixes) and durations within 5-10 seconds of each other.
I'm trying to learn recordlinkage to detect the duplicates, and my first attempt was to make a full index, use Smith-Waterman to compare titles and a simple linear numeric comparison for the duration. No big surprise; the database was WAAY too big to do a full index. I could do a block index on the duration to limit down to pairs to durations that are identical, but the durations are often off by a few seconds. I could do sorted neighborhood, but if* I'm understanding this correctly (*a big "if"), that means that if I set a window to (for example) 10, each track will only be paired with the 10 closest tracks in terms of duration, which will pretty much always be identical durations and completely miss the durations that are close but not identical. It seems to me like having an "approximate blocking index" or something like that would be a natural step, but I can't seem to find any simple way to do that.
Can anyone help me out here?
Okay, answering my own question here because I believe I've figured out the misunderstanding in my original question.
I was misunderstanding how sorted neighborhood indexing works. I was thinking that if you set the window to (for example) 3, it would sort all the records by the key and then pair each record with exactly 3 neighbor records (the record itself, the one above it, and the one below it). So if there were more than 5 records with the same key value, this would actually result in fewer pairs than a block index. But I'm now pretty sure that it's actually grouping the values by the key first, so that a window of 3 will pair with all records with the exact same key value, all the records with the next highest key value and all the records with the next lowest key value.
Now this doesn't get me exactly what I asked for, but it gets me close enough. If I set a window size of 11 (or 21), then I'll be guaranteed to get all values within 5 seconds (or 10 seconds). If the data is sparse with respect to duration, there will be a bit more. (And this only works because it's integer data. If it were floating point numbers of arbitrary precision, then that would be a different matter.)