I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
One record in the file might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
Thanks
I suggest to use the Linux "date" command to convert the date to the unix time stamps. Unix time stamp are the number of seconds elapsed since 1 January 1970. So basically a year is 60s*60m*24h*256d seconds. So, if the difference between the time stamps is more than this number then it is longer than a year.
It will be something like this:
So if you use perl, which is a pretty cool file handling language, you will parse your whole file in a few lines and use eval"you date command".