Load NPZ sparse matrix in R


How can I read a sparse matrix in R that I saved from Python as a *.npz file? I have already come across two answers* on Stack Overflow, but neither seems to do the job in my case.

The data set was created with Python from a Pandas data frame via:

scipy.sparse.save_npz(
     "data.npz",
     scipy.sparse.csr_matrix(DataFrame.values)
     )

It seems like the first steps for importing the data set in R are as follows.

library(reticulate)
np = import("numpy")
npz1 <- np$load("data.npz")

However, this does not yield a data frame yet.

*1 Load sparce NumPy matrix into R

*2 Reading .npz files from R

Answer (by NoIdea):

I cannot access your dataset, so I can only speak from experience. When I try to load a sparse CSR matrix with numpy, it does not work: the resulting object has class numpy.lib.npyio.NpzFile, which I can't use in R.
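To illustrate why plain numpy is not enough, here is a small Python sketch of what ends up inside the file. The array `dense` and the temporary path are made-up stand-ins for the asker's `DataFrame.values` and `data.npz`. As far as I can tell, `scipy.sparse.save_npz` stores the CSR components as separate arrays, so `numpy.load` hands back an archive of pieces rather than a matrix, while `scipy.sparse.load_npz` reassembles them.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for DataFrame.values
dense = np.array([[0, 0, 3], [4, 0, 0], [0, 5, 6]])

# Write to a temporary location instead of the asker's "data.npz"
path = os.path.join(tempfile.mkdtemp(), "data.npz")
sp.save_npz(path, sp.csr_matrix(dense))

# Plain numpy only sees an archive of component arrays (an NpzFile)
with np.load(path) as archive:
    keys = sorted(archive.files)
print(keys)

# scipy.sparse.load_npz rebuilds the sparse matrix from those arrays
restored = sp.load_npz(path)
print(type(restored).__name__, restored.format, restored.shape)
```

This is the same distinction that shows up on the R side: loading through numpy gives the archive, loading through scipy.sparse gives the matrix.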

The way I found to import the matrix into an R object, as has been said in a post you've linked, is to use scipy.sparse.

library(reticulate)
scipy_sparse = import("scipy.sparse")
csr_matrix = scipy_sparse$load_npz("path_to_your_file")

csr_matrix, which was a scipy.sparse.csr_matrix object in Python (Compressed Sparse Row matrix), is automatically converted into a dgRMatrix from the R package Matrix. Note that if you had used scipy.sparse.csc_matrix in Python, you would get a dgCMatrix (Compressed Sparse Column matrix). The function doing the hard work of converting the Python object into something R can use is py_to_r.scipy.sparse.csr.csr_matrix, from the reticulate package.
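If you would rather end up with a dgCMatrix in R, one option is to convert to CSC on the Python side before saving. The sketch below assumes you can re-export the file; the `dense` array and the temporary path are hypothetical placeholders for your own data and file name.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for DataFrame.values
dense = np.array([[0, 0, 3], [4, 0, 0], [0, 5, 6]])

# Convert the row-compressed matrix to column-compressed form before saving
csc = sp.csr_matrix(dense).tocsc()

path = os.path.join(tempfile.mkdtemp(), "data_csc.npz")
sp.save_npz(path, csc)

# The storage format is kept in the file, so it survives the round trip
reloaded = sp.load_npz(path)
print(reloaded.format)
```

Whether this is worth doing depends on what you do with the matrix afterwards in R; it only changes the storage layout, not the values.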

If you want to convert the dgRMatrix into a data frame, you can simply use

df <- as.data.frame(as.matrix(csr_matrix))

although this might not be the best thing to do memory-wise if your dataset is big.
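To get a rough feel for the memory cost of densifying, you can compare the size of the sparse components with the size a dense array of the same shape would need. This is only a back-of-the-envelope sketch in Python with a made-up 1000 x 1000 identity matrix; the real numbers depend on your data, and an R data frame adds its own overhead on top.

```python
import scipy.sparse as sp

# Hypothetical example: a 1000 x 1000 identity matrix stored as CSR
m = sp.identity(1000, dtype="float64", format="csr")

# Bytes actually held by the sparse representation
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Bytes a dense array of the same shape and dtype would need
dense_bytes = m.shape[0] * m.shape[1] * m.dtype.itemsize

print(sparse_bytes, dense_bytes)
```

If the dense figure is close to or larger than your available RAM, it is probably better to keep working with the sparse matrix.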

I hope this helped!