cheapest way to create pandas.DataFrame or pandas.SparseDataFrame


Suppose we have a huge, sparse matrix: what is the cheapest way to load it into a pandas.DataFrame? More specifically, the matrix comes from a large dataset with many dummy variables, and its dense version occupies 150 GB+ of memory, which is clearly not feasible.

As a newcomer to pandas, I am trying to understand its memory management. The current dilemma is as follows:

  • Constructing a pd.DataFrame from a dense source matrix does not copy memory, but the dense matrix itself eats up most of the available space.
  • pd.DataFrame does not accept a scipy.sparse.csr_matrix as a constructor argument. Falling back to pd.SparseDataFrame instead, how can I avoid a memory copy?
  • One approach I found converts a scipy.sparse.csr_matrix to a pd.SparseDataFrame, but its for-loop is inefficient and causes a memory copy.
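For what it's worth, newer pandas versions (0.25+) deprecate pd.SparseDataFrame in favor of regular DataFrames with sparse-dtype columns, and pd.DataFrame.sparse.from_spmatrix builds such a frame directly from a scipy sparse matrix without materializing the dense version. A minimal sketch (the random test matrix is just a stand-in for the real data):

```python
import numpy as np
import pandas as pd
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((4, 5))
dense[dense < 0.8] = 0.0            # mostly zeros, as with dummy variables
csr = sparse.csr_matrix(dense)

# Build a DataFrame whose columns hold sparse data; the full dense
# representation is never allocated.
df = pd.DataFrame.sparse.from_spmatrix(csr)
print(df.dtypes.iloc[0])            # a Sparse[float64, 0] dtype
print(df.sparse.density)            # fraction of explicitly stored values
```

Each column is backed by a pandas SparseArray, so memory usage scales with the number of nonzeros rather than the full matrix size.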

Furthermore, I tried to preallocate a SparseDataFrame and assign values row by row, which ends up with:

import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
b = pd.DataFrame(a)
c = sparse.csr_matrix(a)
d = pd.SparseDataFrame(index=b.index, columns=b.columns)

elem = pd.SparseSeries(c[2].toarray().ravel())
d.loc[[2]] = [elem]  # raises NotImplementedError
elem = pd.Series(c[2].toarray().ravel())
b.loc[[2]] = [elem]  # works on the dense frame
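Part of the trouble is that pandas stores data column-major, so row-by-row assignment fights the storage layout. A column-by-column assembly from a CSC matrix keeps every column sparse end to end; here is a sketch, assuming pandas 0.25+ where pd.arrays.SparseArray.from_spmatrix is available:

```python
import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
a[a < 0.5] = 0.0
csc = sparse.csc_matrix(a)          # CSC makes column slicing cheap

# Assemble one sparse column at a time; each SparseArray stores only
# the nonzero values of its column, never a full dense row or frame.
d = pd.DataFrame({
    j: pd.arrays.SparseArray.from_spmatrix(csc[:, j])
    for j in range(csc.shape[1])
})
```

This avoids both the NotImplementedError above and the dense intermediate that c[2].toarray() creates for each row.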

Scripting languages are fine for this, no doubt; I just need a pointer in the right direction at this point.

Any help is appreciated in advance!
