Handle missing values in pandas when using dtype to read files


I'm reading a bunch of CSV files, using dtype to specify the data type of each column:

dict_type = {"columns_1":"int","column_2":"str"}
pd.read_csv(path,dtype=dict_type)

The problem I'm facing is that columns with a non-float type (such as int) have missing values in some rows, which raises an error. How can I handle this?

I'd like to use a default value in such cases, such as 0 for numeric values and an empty string for names.

There are 2 best solutions below

0
On BEST ANSWER

Consider the converters argument, which takes a dictionary mapping columns to user-defined functions; each function is applied to the raw string of every value in its column as the file is read. The functions below use the built-in str.isdigit(), which returns True only if every character in the string is a digit (so it is False for an empty cell, and also for negative or decimal numbers), and str.isalpha(), its counterpart for letters. Adjust as needed, especially for strings, since you may want to allow digits in their content:

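import pandas as pd

# Each converter receives the cell as a string; an empty cell arrives as ''
cleanFloat = lambda x: float(x if x.isdigit() else 0)
cleanString = lambda x: str(x if x.isalpha() else '')

# Keys are column names (integer keys would be 0-based column positions)
dict_convert = {"columns_1": cleanFloat, "column_2": cleanString}

# dtype is left out for these columns: if a column has both a converter and
# a dtype, pandas warns and uses only the converter
df = pd.read_csv('Input.csv', converters=dict_convert)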
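
As a rough sketch of how this behaves, the snippet below runs converters on a small in-memory CSV. The sample data and the helper name to_int are made up for illustration and are not from the question. It uses try/except instead of isdigit() so that negative integers survive, which is one possible way to "adjust as needed":

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: one missing value in each column.
csv_data = """columns_1,column_2
3,alice
,bob
-7,
12,carol"""


def to_int(value):
    """Return the cell as an int, or 0 when it is empty or not an integer."""
    try:
        return int(value)
    except ValueError:
        return 0


df = pd.read_csv(
    StringIO(csv_data),
    # str leaves the text as-is; an empty cell stays an empty string.
    converters={"columns_1": to_int, "column_2": str},
)

print(df)
print(df.dtypes)
```

Because the converter returns a plain int for every row, the numeric column should come out as an integer column rather than float.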
0
On

One way to fill missing values with a placeholder is to do the fill after you've read the data into a DataFrame, like so:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from io import StringIO

import pandas as pd

# csv data with missing data in each of the 2 columns
csv_data = """number,colour
3,blue
12,
2,
2,red
,yellow
6,yellow
14,purple
4,green
18,green
11,orange"""

df = pd.read_csv(StringIO(csv_data))

df.number = df.number.fillna(-999)    # fill missing numbers w/ -999
df.colour = df.colour.fillna('UNK')   # fill missing categorical w/ UNK

print(df)

# In [1]: run test.py
#    number  colour
# 0     3.0    blue
# 1    12.0     UNK
# 2     2.0     UNK
# 3     2.0     red
# 4  -999.0  yellow
# 5     6.0  yellow
# 6    14.0  purple
# 7     4.0   green
# 8    18.0   green
# 9    11.0  orange
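
Note that the number column is still float in the output above. If it should end up as an integer, as in the question, one option is to cast after filling. The sketch below assumes the same number and colour columns, uses the question's defaults (0 and an empty string) instead of -999 and 'UNK', and runs on a shortened sample made up for illustration:

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical sample with a gap in each column.
csv_data = """number,colour
3,blue
12,
,yellow
6,yellow"""

df = pd.read_csv(StringIO(csv_data))

# fillna accepts a dict mapping column names to their fill values.
df = df.fillna({"number": 0, "colour": ""})

# The column was read as float64 because NaN cannot be stored in int64;
# once the gaps are filled it can be cast to an integer type.
df["number"] = df["number"].astype("int64")

print(df)
print(df.dtypes)
```

The cast only works after the fill; calling astype("int64") while the column still contains NaN raises an error.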