I was trying to solve the recent Analytics Vidhya hackathon LTFS (bank data), and there I ran into a problem that felt unusual, though it is probably not that unique. Let me explain.
Problem
There are a few columns in the Bureau dataset, named `REPORTED DATE - HIST`, `CUR BAL - HIST`, `AMT OVERDUE - HIST` and `AMT PAID - HIST`, which contain either blank values (`,,`) or more than one value per row, and the number of values is not the same in each row.
Here is part of the dataset (it is not the original data, because the original rows are very large):
**Requested Date - Hist**
20180430,20180331,
20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331
,
20121031,20120930,20120831,20120731,20120630,20120531,20120430,
----------------x-----------2nd column------------x-----------------------------------
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
-----x--other columns are similar---x---------------------
I am looking for a better option, if possible.
Previously, when I solved this kind of problem, it was for the genres column of the MovieLens project, and there I used the dummy column concept. It worked because the genres column did not contain too many distinct values and some of the values repeated across many rows, so it was quite easy. Here it seems much harder, for two reasons.
1st reason
Each cell can contain lots of values, and at the same time it may contain no value at all.
2nd reason
I do not know how to create a column for each unique value in a row, as I did in the MovieLens genre case:
**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action
# so here I extracted all unique values and created a column for each
**genre** | **action** | **adventure**| **Comedy**| **carton**| **sci-fi**| and so on...
action|adventure|comedy | 1 | 1 | 1 | 0 | 0 |
carton|scifi|action | 1 | 0 | 0 | 1 | 1 |
biopic|adventure|comedy | 0 | 1 | 1 | 0 | 0 |
Thrill|action | 1 | 0 | 0 | 0 | 0 |
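For reference, the dummy-column step described above can be sketched with pandas. This is only an illustration on the four sample rows shown here; the variable names are made up, and `Series.str.get_dummies` is one possible way to do it, not necessarily the way it was done in the MovieLens project.

```python
import pandas as pd

# The four sample genre strings from the table above
genres = pd.Series(
    [
        "action|adventure|comedy",
        "carton|scifi|action",
        "biopic|adventure|comedy",
        "Thrill|action",
    ],
    name="genre",
)

# Split each string on "|" and build one 0/1 indicator column per distinct token
dummies = genres.str.get_dummies(sep="|")
print(dummies)
```

This works well when the set of distinct tokens is small, which is exactly what does not hold for the numeric history columns.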
# but here it is different; how can I deal with this? I have no clue
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
Sparse matrices are common in recommender systems. Stored densely they can consume a lot of space (too many zeros or empty cells), so it may be better to move to a SciPy sparse matrix representation. Because this is so common in recommenders, there are excellent examples available for that setting.
Unfortunately I cannot use the original data; it would be good to have a smaller example as a CSV. So I will use an example from recommenders instead, since that case is very common as well.
Let's see what that looks like as a matrix:
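A minimal sketch of such a matrix, assuming a tiny made-up ratings table (the `user`, `item` and `rating` names and all values are hypothetical, not taken from the hackathon data):

```python
import pandas as pd

# Hypothetical toy ratings in "long" form: one row per (user, item) pair
ratings = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u3", "u3"],
        "item": ["i1", "i3", "i2", "i1", "i4"],
        "rating": [5, 3, 4, 2, 1],
    }
)

# Dense user x item matrix; missing pairs are filled with 0
dense = ratings.pivot(index="user", columns="item", values="rating").fillna(0)
print(dense)
```

Even in this toy case most cells are zero: 5 ratings in a 3 x 4 grid.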
We do not need to create that matrix; as a matter of fact, it is better to avoid it, since it could be very resource consuming.
We should convert that into a `csr_matrix`, which takes just a fraction of the size. That looks like:
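A minimal sketch of that conversion, again on a made-up ratings table (all names and values are hypothetical). It builds the `csr_matrix` directly from the long-form rows, without going through the dense matrix:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in "long" form
ratings = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u3", "u3"],
        "item": ["i1", "i3", "i2", "i1", "i4"],
        "rating": [5, 3, 4, 2, 1],
    }
)

# Map user and item labels to integer row and column positions
user_codes, user_labels = pd.factorize(ratings["user"])
item_codes, item_labels = pd.factorize(ratings["item"])

# Only the 5 observed ratings are stored, not the full 3 x 4 grid
sparse = csr_matrix(
    (ratings["rating"].to_numpy(), (user_codes, item_codes)),
    shape=(len(user_labels), len(item_labels)),
)
print(sparse)
```

`user_labels` and `item_labels` keep the mapping from positions back to the original labels, in order of first appearance.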
With the above you can see the process on a smaller data set.
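The same idea might be applied to a history column such as AMT OVERDUE along the following lines. This is only a sketch under assumptions: the rows are shortened, made-up variants of the sample in the question, the column index is simply the position of a value within its comma-separated string, and blank entries are skipped rather than stored. Whether position is a meaningful feature (for example, most recent month first) depends on the real data.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Shortened, illustrative rows in the same shape as the AMT OVERDUE sample
amt_overdue = pd.Series(
    [
        "37873,,",
        ",,,1452,,",
        "0,0,0,",
        ",,",
        "0,,0,0,3064,2972",
    ]
)

rows, cols, vals = [], [], []
for r, cell in enumerate(amt_overdue):
    for c, token in enumerate(cell.split(",")):
        if token.strip():  # skip blank entries
            rows.append(r)
            cols.append(c)
            vals.append(float(token))

# The width is set by the longest history that has a value
n_cols = max(cols) + 1
history = csr_matrix((vals, (rows, cols)), shape=(len(amt_overdue), n_cols))
print(history.toarray())
```

One caveat with this layout: once converted to dense form, a blank entry and a reported 0 both read back as 0, so if that distinction matters it has to be tracked separately.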