Why do I get an error when I remove Chunksize?


I'm encountering an error while running my Python code and need assistance in resolving it. Below are the details:

import pandas as pd

df_list = []
file_path = 'houses.txt'

for chunk in pd.read_csv(file_path, chunksize=1000000, names=['Size()sqft', 'No of bedrooms', 'No of floors', 'Age of home', 'Price(1000s dollar)']):
    df_list.append(chunk)

df = pd.concat(df_list)

print(df_list)

Output :

0        952.0             2.0           1.0         65.0                271.5
1       1244.0             3.0           1.0         64.0                300.0
2       1947.0             3.0           2.0         17.0                509.8
3       1725.0             3.0           2.0         42.0                394.0
4       1959.0             3.0           2.0         15.0                540.0
..         ...             ...           ...          ...                  ...
95      1224.0             2.0           2.0         12.0                329.0
96      1432.0             2.0           1.0         43.0                388.0
97      1660.0             3.0           2.0         19.0                390.0
98      1212.0             3.0           1.0         20.0                356.0
99      1050.0             2.0           1.0         65.0                257.8

[100 rows x 5 columns]]

After removing 'chunksize', I get this error:

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

Kindly explain what the issue is.

1 Answer

Answered by sytech:

Passing chunksize is what changes the return type of read_csv: it makes the function return a TextFileReader object, which is what you are iterating over. From the pandas docs:

chunksize : int, optional
Number of lines to read from the file per chunk. Passing a value will cause the function to return a TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

When chunksize is not specified, it returns a DataFrame instead.
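You can see the difference in return type directly. This sketch uses a small in-memory CSV via io.StringIO as a stand-in for your houses.txt (the sample rows are hypothetical):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for houses.txt
csv_data = "952.0,2.0,1.0,65.0,271.5\n1244.0,3.0,1.0,64.0,300.0\n"
cols = ['Size()sqft', 'No of bedrooms', 'No of floors',
        'Age of home', 'Price(1000s dollar)']

# With chunksize: read_csv returns a TextFileReader for iteration
reader = pd.read_csv(io.StringIO(csv_data), chunksize=1, names=cols)
print(type(reader).__name__)  # TextFileReader

# Without chunksize: read_csv returns a DataFrame
df = pd.read_csv(io.StringIO(csv_data), names=cols)
print(type(df).__name__)  # DataFrame
```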

So, if you remove chunksize, the object you end up iterating over in your for loop is no longer a TextFileReader but a DataFrame. Iterating over a DataFrame yields its column labels as strings, so the items collected in df_list are strings rather than DataFrames, which is ultimately what causes the error when you call pd.concat on that list.
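A minimal sketch of that failure mode, using a small made-up DataFrame rather than your file:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Iterating a DataFrame yields its column labels, not row chunks
items = [x for x in df]
print(items)  # ['a', 'b']

# pd.concat then receives a list of strings and raises the same TypeError
try:
    pd.concat(items)
except TypeError as e:
    print(e)  # cannot concatenate object of type '<class 'str'>'; ...
```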

Chunking the file is useful when memory or responsiveness is a concern, or when you don't need the full DataFrame at once. If you do want the whole file as a single DataFrame, you can skip the iteration and subsequent concatenation and read it in one step:

df = pd.read_csv(file_path, names=['Size()sqft', 'No of bedrooms', 'No of floors', 'Age of home', 'Price(1000s dollar)'])