How to handle multi-value row of numerical value with unknown size in python?

99 Views Asked by At

I was actually trying to solve analytics vidya recent Hackathon LTFS(Bank Data), and there I faced something unique problem, actually not too unique. Let me explain

Problem

There are few columns in a Bureau dataset named REPORTED DATE - HIST, CUR BAL - HIST, AMT OVERDUE - HIST & AMT PAID - HIST which consists blank value ,, or more than one value in a row, and also there is not the same number of value in each row

Here is the part of the dataset (it's not original data, because of the big row size)

**Requested Date - Hist**                                                                   
20180430,20180331,
20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331
,
20121031,20120930,20120831,20120731,20120630,20120531,20120430,

----------------x-----------2nd column------------x-----------------------------------

**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866

-----x--other columns are similar---x---------------------

Seeking for a better option, if possible

Previously when I solved this kind of problem, it was genres of Movielens project and there I use used dummy column concept, it worked there because there had not too many values in genres columns and also some of the values are repeating value in many rows, so it was quite easy. But here it seems quite hard here because of two reasons

1st reason because it contains lots of value and at the same time it may contain no value

2nd reason how to create a column for each unique value or a row like in Movielens genre case

**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action

# so here I had extracted all unique value and created columns 

**genre**                 | **action** | **adventure**| **Comedy**| **carton**| **sci-fi**| and so on...
action|adventure|comedy   |   1        |     1        |      1    |     0     |      0    |    
carton|scifi|action       |   1        |     0        |      0    |     1     |      1    |
biopic|adventure|comedy   |   0        |     1        |      1    |     0     |      0    |
Thrill|action             |   1        |     0        |      0    |     0     |      0    |

# but here it's different how can I deal with this, I have no clue
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
1

There are 1 best solutions below

6
On

When in recommender is common to have sparse matrixes. Those can be very consuming space (too many zeros, or empty spaces), perhaps good to move to sparse matrix scipy representation, as in here. As mentioned it is common in recommenders, please find here excellent example.

Unfortunately I cannot use the original data, perhaps good to have a smaller example in csv. So I will use the example from recommender, since is as well very commmon.

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    'genres' : ["action|adventure|comedy", "carton|scifi|action","biopic|adventure|comedy","Thrill|action"],
})
print(df)
                    genres
0  action|adventure|comedy
1      carton|scifi|action
2  biopic|adventure|comedy
3            Thrill|action

Let see how that looks like as a matrix:

# To identify the genres so we can create our columns
genres = []
for G in df['genres'].unique():
    for i in G.split("|"):
        print(i)
        genres.append(i)
# To remove duplicates
genres = list(set(genres))

# Create a column for each genere
for g in genres:
    df[g] = df.genres.transform(lambda x: int(g in x))

# This is the sparse matrix with many 0
movie_genres = df.drop(columns=['genres'])
print(movie_genres)
   comedy  carton  adventure  Thrill  biopic  action  scifi
0       1       0          1       0       0       1      0
1       0       1          0       0       0       1      1
2       1       0          1       0       1       0      0
3       0       0          0       1       0       1      0

We do not need to create that matrix, as a matter of fact, it is better to avoid same it could be very resource consuming.

We should convert that into a csr_matrix, with just a portion of the size:

from scipy.sparse import csr_matrix

M = df.index.__len__()
N = genres.__len__()

user_mapper = dict(zip(np.unique(df.index), list(range(M))))
genres_mapper = dict(zip(genres, list(range(N))))

user_inv_mapper = {user_mapper[i]:i for i in user_mapper.keys()}
genres_inv_mapper = {genres_mapper[i]:i for i in genres_mapper.keys()}

user_index = []
genre_index = []
for user in df.index:
    print(user)
    print(df.loc[user,'genres'])
    for genre in df.loc[user,'genres'].split('|'):
        genre_index.append(genres_mapper[genre])
        user_index.append(user_mapper[user])

X = csr_matrix((np.ones(genre_index.__len__()),
                (user_index,genre_index)), shape=(M,N))

That looks like:

print(X)
  (0, 0)    1.0
  (0, 2)    1.0
  (0, 5)    1.0
  (1, 1)    1.0
  (1, 5)    1.0
  (1, 6)    1.0
  (2, 0)    1.0
  (2, 2)    1.0
  (2, 4)    1.0
  (3, 3)    1.0
  (3, 5)    1.0

With the above you can see the process with a smaller data set.