Pandas read_html issue with &nbsp;


I'm using pandas read_html to read an HTML file and I'm running into an issue with non-breaking spaces. One column of the resulting data frame should contain a string like "ABCDEF   G" (three spaces between F and G). Instead I'm getting "ABCDEF G" (one space between F and G). When I inspect the HTML file it shows "ABCDEF&nbsp;&nbsp;&nbsp;G", so for some reason the three non-breaking spaces are being collapsed into a single space. Single non-breaking spaces in the HTML come through fine. Is there a way to get around this so that it retains the three spaces between F and G?

There is 1 solution below

It's not elegant, but for now I'm doing this:

    with open(htmllink, 'r') as r:
        data = r.read().replace('   ', '___')

Afterwards I go back and replace the underscores with three spaces. I'm still looking for a better way to do this, but it should work for now.
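For anyone who wants the whole round trip in one place, here is a rough sketch of the same placeholder idea. It is a variation on the snippet above, not a tested fix: it swaps every non-breaking space (entity or literal character) for a private-use placeholder character instead of underscores, so that real underscores in the data are not touched. The helper names (`protect_nbsp`, `restore_spaces`, `read_html_keep_nbsp`), the placeholder choice, and the UTF-8 encoding are all my own assumptions, and I am assuming the placeholder survives whichever parser read_html ends up using.

```python
import re
from io import StringIO

NBSP = "\u00a0"
# Assumption: this private-use character never appears in the real data.
PLACEHOLDER = "\ue000"

# Matches the named entity, the numeric entities, and the literal character.
_NBSP_PATTERN = re.compile("&nbsp;|&#160;|&#[xX][aA]0;|" + NBSP)


def protect_nbsp(html):
    """Replace each non-breaking space with a placeholder before parsing."""
    return _NBSP_PATTERN.sub(PLACEHOLDER, html)


def restore_spaces(value):
    """Turn placeholders back into ordinary spaces; leave non-strings alone."""
    if isinstance(value, str):
        return value.replace(PLACEHOLDER, " ")
    return value


def read_html_keep_nbsp(path):
    """Hypothetical wrapper: protect, parse with read_html, then restore."""
    import pandas as pd  # read_html also needs lxml, or bs4 plus html5lib

    with open(path, encoding="utf-8") as f:  # encoding is an assumption
        html = protect_nbsp(f.read())
    tables = pd.read_html(StringIO(html))
    # Restore cell values column by column; header labels are not handled here.
    return [t.apply(lambda col: col.map(restore_spaces)) for t in tables]
```

Calling `read_html_keep_nbsp(htmllink)` should give back the same list of data frames as read_html, with each non-breaking space turned into one ordinary space. If you need the actual non-breaking characters back, replace with `NBSP` instead of `" "` in `restore_spaces`.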