<div class="features clearfix">
<span> <img src="/App_Theme/css/img/ico_area.png" width="36" height="36" class="imgvertical">
78,00 a 207,00 m²
</span>
<span><img src="/App_Theme/css/img/ico_bed.png" class="imgvertical"></i>
Desde
2
</span>
<span><img src="/App_Theme/css/img/ico_bath.png" width="36" height="36" class="imgvertical">
Desde
2
</span>
<span><img src="/App_Theme/css/img/ico_garaje.png" width="36" class="imgvertical" height="36">
Sin especificar
</span>
</div>
I'm trying to scrape the data inside the tag above, but the output contains stray, unreadable characters instead of the clean text.
My code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.fincaraiz.com.co/oceana-52/barranquilla/proyecto-nuevo-det-1041165.aspx')
soup = BeautifulSoup(page.content, 'lxml')
box_2 = soup.find('div', 'features clearfix')
box_2_1 = box_2.find_all('span')
box2 = []
for row2 in box_2_1:
    box2.append(row2.text)
print(box2)
But it prints output like this:
[' \r\n 78,00 a 207,00 m²\r\n \r\n ', ' \r\n \r\n Desde\xa0\r\n 2\r\n \r\n \r\n ', '\r\n \r\n Desde\xa0\r\n 2\r\n\r\n \r\n ', '\r\n \r\n Sin especificar\r\n \r\n ']
The expected output here is:
78,00 a 207,00 m² Desde 2 Desde 2 Sin especificar
I've already tried adding UTF-8 encoding to the code, but it still gives the same output. How can I avoid the unicode errors?
What you are observing is not a unicode problem. The text you extracted does in fact contain newlines ('\r\n') and non-breaking spaces, where the HTML entity &nbsp; is converted to '\xa0'. If you need to remove those characters, or perhaps replace them with spaces, you could modify your code like this:
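A minimal sketch of such a cleanup, not necessarily the exact code originally posted here. The helper name clean_text is made up for illustration, and the sample strings are copied from the output shown in the question rather than fetched from the live page:

```python
def clean_text(raw):
    """Collapse every run of whitespace into a single space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes '\r', '\n' and the non-breaking space '\xa0'.
    return " ".join(raw.split())


# Sample values copied from the printed list in the question.
raw_rows = [
    ' \r\n 78,00 a 207,00 m²\r\n \r\n ',
    ' \r\n \r\n Desde\xa0\r\n 2\r\n \r\n \r\n ',
]

box2 = [clean_text(text) for text in raw_rows]
print(box2)
```

In the loop from the question, this would amount to calling box2.append(clean_text(row2.text)) instead of box2.append(row2.text).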
Note that this will still differ from the expected output that you provided above. Your code creates a list in box2, so when you print that list you will see square brackets and commas separating the list elements. If you don't want that, you can join the list into a string, with elements separated by spaces, like this:
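As a small sketch, assuming box2 already holds the cleaned strings (hard-coded here for illustration; the name result is also just an example):

```python
# Cleaned strings, as they would look once the whitespace has been collapsed.
box2 = ['78,00 a 207,00 m²', 'Desde 2', 'Desde 2', 'Sin especificar']

# str.join places the separator between elements only, not at the ends.
result = " ".join(box2)
print(result)  # 78,00 a 207,00 m² Desde 2 Desde 2 Sin especificar
```

This gives a single string rather than a list, which matches the expected output given in the question.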