ALS algorithm in Dask optimization


I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this Stack Overflow thread and came up with this code:

    Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)), 
                            da.dot(Users, X))[0].T.compute()
    Items = np.where(Items < 0, 0, Items)

    Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)), 
                            da.dot(Items.T, X.T))[0].compute()
    Users = np.where(Users < 0, 0, Users)

But I don't think this works correctly, because the MSE is not decreasing.

Example input:

    n_factors = 2
    lambda_ = 0.1
    # We have 6 users and 4 items

The matrices X_train (6x4), R (4x6), Users (2x6) and Items (4x2) look like:

1  0  0  0  5  2        1 0 0 0    0.8  1.3     1.1  0.2  4.1  1.6
0  0  0  0  4  0        0 0 1 1    3.9  4.3     3.5  2.7  4.3  0.5
0  3  0  0  4  0        0 0 0 0    2.9  1.5
0  3  0  0  0  0        0 0 0 0    0.2  4.7
                        1 1 1 0    0.9  1.1
                        1 0 0 0    4.8  3.0

EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts, I set every entry of the X_train matrix that has no rating to 0.

    X_train = da.nan_to_num(X_train)

The reason is that the dot product works only on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings in the matrix, the algorithm fits these zeros.

Any help would be highly appreciated. <3

Answer:

One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017 Dask also supports them.

Defining a masked array in Dask is fairly simple and similar to NumPy. All supported functions are listed in the docs; here are some of the most commonly used approaches:

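    import dask.array as da

    data_set = da.array([[1, 2], [3, 4]])

    # Mask individual entries explicitly.
    masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True], [True, False]])
    # after .compute(): [[1, --], [--, 4]]

    # Mask every entry equal to a given value.
    masked_data_set_2 = da.ma.masked_equal(data_set, 4)
    # after .compute(): [[1, 2], [3, --]]

    # Mask every entry where a condition holds.
    masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
    # after .compute(): [[--, --], [3, 4]]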

In your case, you are trying to compute the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can build a masked array like this:

    # X != X is True only where X is NaN
    masked_X = da.ma.masked_where(X != X, X)
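The `X != X` condition works because NaN is the only floating-point value that compares unequal to itself, so it marks exactly the missing ratings. Here is a small sketch of that in plain NumPy (the 2x2 matrix is made up for illustration; the Dask `da.ma` functions are meant to mirror these calls, and `da.isnan(X)` should be a more readable way to write the same condition):

```python
import numpy as np

# Hypothetical ratings matrix with two missing entries.
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])

# NaN != NaN, so this is True exactly where a rating is missing.
nan_mask = X != X
assert np.array_equal(nan_mask, np.isnan(X))

# Masked entries are skipped by reductions instead of counting as 0.
masked_X = np.ma.masked_where(nan_mask, X)
print(masked_X.mask.tolist())  # [[False, True], [True, False]]
print(masked_X.sum())          # 5.0
```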

Now you can perform the dot product like this:

    da.ma.getdata(da.dot(Users, masked_X))
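For reference, the idea of fitting only the real ratings can also be written as one small least-squares solve per user and per item, using just the observed entries. The following is a minimal NumPy sketch of that idea, not a Dask implementation and not the code from the question. The function names `masked_als_step` and `regularized_loss`, the shapes (`U` is users x factors and `V` is items x factors, which differs from the question's `Users`/`Items` layout) and the toy matrix are all assumptions made for illustration, and the non-negativity clipping from the question is left out.

```python
import numpy as np


def masked_als_step(X, U, V, lambda_):
    """One ALS sweep that fits only the observed (non-NaN) entries of X.

    X: (n_users, n_items) ratings, NaN where there is no rating
    U: (n_users, k) user factors, V: (n_items, k) item factors
    U and V are updated in place and returned.
    """
    observed = ~np.isnan(X)
    reg = lambda_ * np.eye(U.shape[1])

    # Update each user's factors from the items that user actually rated.
    for u in range(X.shape[0]):
        idx = observed[u]
        if idx.any():
            V_obs = V[idx]
            U[u] = np.linalg.solve(V_obs.T @ V_obs + reg, V_obs.T @ X[u, idx])

    # Update each item's factors from the users who rated that item.
    for i in range(X.shape[1]):
        idx = observed[:, i]
        if idx.any():
            U_obs = U[idx]
            V[i] = np.linalg.solve(U_obs.T @ U_obs + reg, U_obs.T @ X[idx, i])
    return U, V


def regularized_loss(X, U, V, lambda_):
    """Squared error on observed entries plus the L2 penalty on the factors."""
    observed = ~np.isnan(X)
    err = (X - U @ V.T)[observed]
    return float(np.sum(err ** 2) + lambda_ * (np.sum(U ** 2) + np.sum(V ** 2)))


# Hypothetical toy data: 4 users, 3 items, NaN marks a missing rating.
X = np.array([[5.0, np.nan, 1.0],
              [4.0, 2.0, np.nan],
              [np.nan, 1.0, 5.0],
              [1.0, np.nan, 4.0]])

rng = np.random.default_rng(0)
U = rng.random((4, 2))
V = rng.random((3, 2))

losses = []
for _ in range(10):
    U, V = masked_als_step(X, U, V, lambda_=0.1)
    losses.append(regularized_loss(X, U, V, 0.1))
```

Because every update is an exact solve of its own regularized subproblem, the values in `losses` should not increase from one sweep to the next, which gives a quick sanity check that the zeros are no longer being fitted.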