Encoding issue in Python scraping

<div class="features clearfix">
<span> <img src="/App_Theme/css/img/ico_area.png" width="36" height="36" class="imgvertical">
                78,00 a 207,00 m²             
</span>
<span><img src="/App_Theme/css/img/ico_bed.png" class="imgvertical"></i>  

                            Desde&nbsp;
                            2
            </span> 
<span><img src="/App_Theme/css/img/ico_bath.png" width="36" height="36" class="imgvertical">

                    Desde&nbsp;
                    2        
</span> 
<span><img src="/App_Theme/css/img/ico_garaje.png" width="36" class="imgvertical" height="36">  
                Sin especificar  
</span> 
</div>

I'm trying to scrape the data inside the tag above, but the output prints unreadable characters instead of the expected text.

My code

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
soup = BeautifulSoup(page.content, 'lxml')
# Locate the feature box, then collect the text of every <span> inside it
box_2 = soup.find('div', 'features clearfix')
box_2_1 = box_2.find_all('span')
box2 = []
for row2 in box_2_1:
    box2.append(row2.text)
print(box2)

But it prints output like this:

[' \r\n 78,00 a 207,00 m²\r\n \r\n ', ' \r\n \r\n Desde\xa0\r\n 2\r\n \r\n \r\n ', '\r\n \r\n Desde\xa0\r\n 2\r\n\r\n \r\n ', '\r\n \r\n Sin especificar\r\n \r\n ']

The expected output here is:

78,00 a 207,00 m² Desde 2 Desde 2 Sin especificar

I've already tried specifying UTF-8 encoding in the code, but it still gives the same output. How can I avoid these Unicode errors?

There is 1 best solution below

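What you are observing is not a Unicode problem. The text you extracted really does contain newlines ('\r\n') and non-breaking spaces ('\xa0', which is what the HTML entity &nbsp; is decoded to). They show up as escape sequences only because printing a list displays the repr of each string it contains.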
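To see this without hitting the website, here is a small sketch that uses only the standard library. The sample string is copied from the output in the question; the variable names (raw, as_list_item, nbsp) are just illustrative.

```python
import html
import unicodedata

# One <span>'s text, copied from the output shown in the question.
raw = ' \r\n \r\n Desde\xa0\r\n 2\r\n \r\n \r\n '

# A list prints the repr() of each element, so whitespace characters
# appear as escape sequences such as \r, \n and \xa0.
as_list_item = repr(raw)
print(as_list_item)

# The HTML entity &nbsp; decodes to U+00A0, which Python writes as '\xa0'.
nbsp = html.unescape('&nbsp;')
print(nbsp == '\xa0')
print(unicodedata.name(nbsp))
```

So the data is intact; it just contains more whitespace than you want.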

If you need to remove those characters, or perhaps replace them with spaces, you could modify your code like this:

for row2 in box_2_1:
    text = row2.text
    text = text.replace('\r\n', ' ')
    text = text.replace('\xa0', ' ')
    box2.append(text)
print(box2)

Note that this will still differ from the expected output you provided above. The replacements leave runs of consecutive spaces in each string, and because box2 is a list, printing it shows square brackets, quotes, and commas between the elements. To get rid of the list formatting, you can join the list into a single string, with elements separated by spaces, like this:

print(' '.join(box2))
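If you also want to collapse the leftover runs of whitespace, one possible approach (a variation on the replace calls, not the only way) is to split each string on whitespace and re-join it with single spaces. In Python 3, str.split() with no arguments treats '\r', '\n', ordinary spaces and '\xa0' all as whitespace. The sketch below runs on the strings from the question's output rather than the live page, and the names raw_spans, cleaned and result are made up for the example.

```python
# Strings copied from the output in the question, standing in for
# the values of row2.text.
raw_spans = [
    ' \r\n 78,00 a 207,00 m²\r\n \r\n ',
    ' \r\n \r\n Desde\xa0\r\n 2\r\n \r\n \r\n ',
    '\r\n \r\n Desde\xa0\r\n 2\r\n\r\n \r\n ',
    '\r\n \r\n Sin especificar\r\n \r\n ',
]

# split() with no arguments drops leading/trailing whitespace and splits
# on any run of whitespace; joining with ' ' leaves single spaces.
cleaned = [' '.join(text.split()) for text in raw_spans]

# Join the cleaned pieces into one line.
result = ' '.join(cleaned)
print(result)
```

In your own loop, that would mean appending ' '.join(row2.text.split()) instead of row2.text.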