Look-ahead time analysis in R (data mining algorithm)


I have a file (dozens of columns and millions of rows) that essentially looks like this:

customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...

The records for one account might look like this:

1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...

When an account is closed, two things can happen. Typically, no further snapshots for that account will exist in the data. Occasionally, further records will continue to be added, but with the isOpen flag set to 0.

I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.

My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
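A direct transcription of that per-record scan might look like the following (a hedged sketch: the field names come from the sample above, and it assumes "counts forward 12 months" means "an open snapshot exists 12 or more months later"):

```python
def closed_in_yr_naive(rows):
    """rows: list of (customerID, accountID, snapshotDate, isOpen) tuples,
    with snapshotDate as a YYYYMM string. Returns one closedInYr flag per row.
    This is the slow O(n^2) approach: every row rescans the whole file."""
    def midx(d):                      # YYYYMM -> linear month count
        return int(d[:4]) * 12 + int(d[4:6])
    flags = []
    for cust, acct, date, _ in rows:
        still_open_later = any(
            c == cust and a == acct and o == 1 and midx(d) - midx(date) >= 12
            for c, a, d, o in rows
        )
        flags.append(0 if still_open_later else 1)
    return flags
```

The quadratic rescan is what makes this unacceptable on millions of rows.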

Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.

Thanks


There are 2 best solutions below


I suggest using the Linux "date" command to convert the dates to Unix timestamps. A Unix timestamp is the number of seconds elapsed since 1 January 1970, so a year is roughly 60s*60m*24h*365d seconds. If the difference between two timestamps is greater than this number, the gap is longer than a year.

Since the snapshot dates are in YYYYMM form, append a day of the month before converting. It will be something like this:

>date -u --date='20110601' "+%s"
1306886400

So if you use Perl, which is a pretty cool file-handling language, you can parse the whole file in a few lines and shell out to your date command with backticks.
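The same arithmetic can also be done without shelling out at all. Here is a hedged Python sketch (standard library only, treating each YYYYMM as the first of the month in UTC, so `to_epoch` plays the role of the `date "+%s"` call):

```python
import calendar
import time

YEAR_SECONDS = 60 * 60 * 24 * 365        # approximate year; ignores leap days

def to_epoch(yyyymm):
    """Unix timestamp (UTC) for the first day of a YYYYMM snapshot month."""
    return calendar.timegm(time.strptime(yyyymm + "01", "%Y%m%d"))

def more_than_a_year_apart(a, b):
    """True if the two snapshot months are more than ~365 days apart."""
    return abs(to_epoch(a) - to_epoch(b)) > YEAR_SECONDS
```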


If all the snapshots for a given account appear in one row, and accounts open for the same period of time have rows of the same length (i.e., snapshots were taken at regular intervals), then one possibility might be filtering on row length. If the longest open row has N snapshots and one year contributes M of them, then you know a row of length N-M was open, at longest, one year less than the longest. That approach doesn't handle the case where snapshots keep getting added with the open flag set to 0, but it might cut the work down by reducing the number of searches that need to be made per row.

At least, that's an idea. More generally, searching backward from the end of each account's history to find the last snapshot where isOpen == 1 might cut the search down a little...

Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?
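The "last snapshot where isOpen == 1" idea generalizes to a two-pass approach even when each snapshot is its own row. A hedged Python sketch (same tuple layout as the sample data; it assumes an account counts as closed within the year when its last open snapshot falls less than 12 months after the given row):

```python
from collections import defaultdict

def add_closed_in_yr(rows):
    """rows: list of (customerID, accountID, snapshotDate, isOpen) tuples,
    snapshotDate as a YYYYMM string. Appends a closedInYr flag to each row
    using two linear passes instead of a per-row forward scan."""
    def midx(d):                              # YYYYMM -> linear month count
        return int(d[:4]) * 12 + int(d[4:6])
    # Pass 1: latest open snapshot month per account.
    last_open = defaultdict(lambda: -1)
    for cust, acct, date, is_open in rows:
        if is_open == 1:
            key = (cust, acct)
            last_open[key] = max(last_open[key], midx(date))
    # Pass 2: closed within a year unless still open 12+ months later.
    return [
        (cust, acct, date, is_open,
         0 if last_open[(cust, acct)] - midx(date) >= 12 else 1)
        for cust, acct, date, is_open in rows
    ]
```

Two O(n) passes replace the per-row look-ahead, which should matter at millions of rows.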