I am loading Excel files into pandas dataframes to work with their data.
The script runs fine on my machine, which has 16 GB of RAM, but it runs out of memory on the machine where it is supposed to run, which only has 8 GB.
Using df.info(memory_usage='deep'),
I have checked the size of the dataframes coming out of pd.read_excel. The largest dataframe is 4 MB.
My conclusion is that something must be happening inside the read_excel
call itself that causes the memory overflow.
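For reference, this is roughly how I check the sizes (the file path and sheet name below are placeholders):

import pandas as pd

# Placeholder path and sheet name.
df = pd.read_excel("report.xlsx", sheet_name="Data")

# Reports the dataframe's true in-memory size, including object/string columns.
df.info(memory_usage="deep")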
This post suggests that pandas may be reading what Excel considers the "used range", which can extend far beyond the range that actually holds data. Since the files are in active use by other people, it feels foolish to simply remove the blank cells in Excel and call it a day; users can always reintroduce them. It feels just as foolish to cap the read_excel call at a fixed number of rows/columns, because that limit might be exceeded in the future.
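To illustrate what I mean, something like this (a sketch using openpyxl, with placeholder path and sheet name) should show the range Excel reports as used, independent of pandas:

from openpyxl import load_workbook

# Placeholder path and sheet name; read_only avoids loading cell values eagerly.
wb = load_workbook("report.xlsx", read_only=True)
ws = wb["Data"]

# In read-only mode these come from the sheet's recorded dimensions,
# so they can be far larger than the range that actually holds data.
print(ws.max_row, ws.max_column)

wb.close()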
Is there a method to reduce memory usage when reading Excel files with used ranges that go far beyond the range where the data actually is?
Below is my code:
df = pd.read_excel(
    file_location,
    sheet_name=sheet_name,
    header=None,
    skiprows=number_of_rows_to_skip,
    usecols=used_cols,
    names=col_names,
    # For some reason if I don't set this the dataframe gets unreasonably large.
    nrows=1000000,
)
used_cols is a string containing the column letters, like "A, C, F".

col_names is a list of the names I want to give each column.

nrows is the strange one. One example: if I don't use it, my 88 KB dataframe grows to 136 MB. However, Excel only has just over 1 million rows, right? So why is there such a huge difference between parsing all rows and parsing just the first 1,000,000?
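This is essentially how I compared the two cases (a stripped-down version of the call above; path and sheet name are placeholders):

import pandas as pd

# Placeholder path and sheet name; the same sheet read twice, with and without nrows.
capped = pd.read_excel("report.xlsx", sheet_name="Data", nrows=1000000)
uncapped = pd.read_excel("report.xlsx", sheet_name="Data")

# deep=True counts the actual string contents, matching df.info(memory_usage='deep').
print(capped.memory_usage(deep=True).sum())    # ~88 KB in my case
print(uncapped.memory_usage(deep=True).sum())  # ~136 MB in my case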