Issues replacing text in Pandas DataFrame where apostrophes appear

I'm using a Pandas DataFrame, which I have read in from Excel, and I want to find and replace contractions in text (e.g. don't -> do not). The code I'm using works when replacing text that doesn't contain apostrophes, but it doesn't work on words that include them.

I have defined a dictionary to specify what replacements to make. A sample from it is below, together with the code that performs the replacement.

import pandas as pd

contractions_dict = {
'ain\'t': 'is not', 'aren\'t': 'are not', 'can\'t': 'can not', '\'cause': "because",
'coz': "because", 'cos': "because", 'could\'ve': "could have", 'couldn\'t': "could not",
'didn\'t': "did not", 'doesn\'t': "does not", 'don\'t': 'do not',
'no contractions': 'TEST'
}

regex_dict = {r"(\b){}(\b)".format(k):r"\1{}\2".format(v) for k,v in contractions_dict.items()}
regex_dict


data = {'Text_with_contractions': ['Text with no contractions', "Text with contractions doesn't work", 'More text']}
df = pd.DataFrame(data)

df['Text_with_no_contractions'] = df['Text_with_contractions'].replace(regex_dict, regex=True)
df['Text_with_no_contractions'].iloc[1]

The strange thing is, the above code works when I test it on a dataframe I've created manually, but it doesn't work on the dataframe I've read in from Excel. Any ideas why?

This is the manually created dataframe it works on:

data = {'Text_with_contractions': ['Text with no contractions', "Text with contractions doesn't work", 'More text']}
df = pd.DataFrame(data)

This is the code I've used to read in the dataframe it doesn't work on:

df = pd.read_excel(path + "output.xlsx", encoding = "UTF-8")

I've tried using escape characters before the apostrophe (as above). I've also tried both double quotes and a single quote for the apostrophe.

I would be very grateful if someone can help identify why it doesn't work with the Excel-read data and suggest a solution.

There is 1 best solution below

OK, so I found what was wrong. The dictionary used the straight apostrophe ' (U+0027), but the text in the dataframe read from Excel used the curly apostrophe ’ (U+2019), so the patterns never matched.
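
For anyone hitting the same thing, a quick way to spot this kind of mismatch is to list the non-ASCII characters in a sample string. This is only a sketch using the standard library; the helper name `describe_non_ascii` is made up for illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# A string typed in code uses the straight apostrophe, so nothing is reported.
print(describe_non_ascii("doesn't"))

# Text from Excel or Word often carries the curly apostrophe instead.
print(describe_non_ascii("doesn\u2019t"))
```

Running it on a cell value such as `df['Text_with_contractions'].iloc[1]` should show whether the apostrophe is really the character the dictionary keys use.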

All working now.
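
One possible way to handle it (not necessarily what I'd recommend for every dataset) is to normalise curly single quotes to straight apostrophes before running the replacement. The sketch below uses a shortened dictionary and made-up sample rows, and it assumes the only problem characters are U+2018 and U+2019.

```python
import re

import pandas as pd

# Shortened, illustrative version of the contractions dictionary.
contractions_dict = {"doesn't": "does not", "don't": "do not"}

# Escape each key so any regex metacharacters in it are treated literally.
regex_dict = {
    r"\b{}\b".format(re.escape(k)): v for k, v in contractions_dict.items()
}

# Sample data: the first row uses the curly apostrophe (U+2019), as Excel text often does.
df = pd.DataFrame(
    {"Text_with_contractions": ["Text with contractions doesn\u2019t work", "I don't know"]}
)

# Assumption: only the left/right single quotation marks need normalising.
normalized = df["Text_with_contractions"].str.replace("[\u2018\u2019]", "'", regex=True)

df["Text_with_no_contractions"] = normalized.replace(regex_dict, regex=True)
print(df["Text_with_no_contractions"].tolist())
```

The alternative is to add the curly-apostrophe spellings to the dictionary, but normalising the text once keeps the dictionary shorter.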