Suppose we have a huge, sparse matrix: what would be the cheapest way to load it into a `pandas.DataFrame`? More specifically, the matrix comes from a large dataset with lots of dummy variables, and its dense version occupies 150 GB+ of memory, which is clearly not workable.
As a newcomer to pandas, I am trying to understand how it manages memory. The current dilemma is as follows:
- Building a `pd.DataFrame` from a dense source matrix does not cause a memory copy, but the dense matrix itself eats up most of the space.
- `pd.DataFrame` does not accept a `scipy.sparse.csr_matrix` as a constructor argument. Taking a step back, if we resort to `pd.SparseDataFrame`, how can I avoid a memory copy?
- There is a nice approach for converting a `scipy.sparse.csr_matrix` to a `pd.SparseDataFrame` (see the sketch after this list), but the for-loop it relies on is inefficient and it causes a memory copy.
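For reference, the kind of column-wise conversion I mean looks roughly like the following; the `csr_to_sparse_df` helper name is just mine for illustration. Each column gets densified through `toarray()` on its way into a `pd.SparseSeries`, which is exactly where the extra copies come from:

import pandas as pd
from scipy import sparse

def csr_to_sparse_df(csr, fill_value=0):
    csc = csr.tocsc()                      # column slicing is cheap on CSC
    columns = {}
    for j in range(csc.shape[1]):
        # each column is densified via toarray(), hence the memory copies
        dense_col = csc[:, j].toarray().ravel()
        columns[j] = pd.SparseSeries(dense_col, fill_value=fill_value)
    return pd.SparseDataFrame(columns)

X = sparse.random(1000, 50, density=0.01, format='csr')
sdf = csr_to_sparse_df(X)
print(sdf.density)                         # fraction of entries actually stored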
Going further, I tried to initialize a `SparseDataFrame` in a block of memory and assign values row by row, which ends up with:
import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
b = pd.DataFrame(a)                  # dense DataFrame for comparison
c = sparse.csr_matrix(a)
d = pd.SparseDataFrame(index=b.index, columns=b.columns)   # empty sparse frame

elem = pd.SparseSeries(c[2].toarray().ravel())
d.loc[[2]] = [elem]                  # raises NotImplementedError

elem = pd.Series(c[2].toarray().ravel())
b.loc[[2]] = [elem]                  # works on the dense DataFrame
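For completeness, if upgrading pandas is an option: if I read the release notes correctly, recent releases (0.20 or so) accept a scipy sparse matrix directly in the `pd.SparseDataFrame` constructor. A minimal sketch, assuming such a version is available; I have not checked whether it avoids the intermediate copies I am worried about:

import pandas as pd
from scipy import sparse

X = sparse.random(10000, 300, density=0.01, format='csr')
sdf = pd.SparseDataFrame(X)    # direct construction from a scipy sparse matrix
print(sdf.density)             # only the stored entries are materialised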
Scripting languages are undoubtedly convenient, but at this point I just need a pointer in the right direction.
Any help is appreciated in advance!