Python 3 and Requests-Html: Trying to scrape a website - not getting the "real" html code back


I'm trying to scrape a website, but I'm not getting the correct, analyzable code back.

I am using Python 3.12 and the Requests-HTML module to scrape several websites. For most of them this works without problems, but for "https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html" it doesn't, even though I use the render function of Requests-HTML to execute the JavaScript on the page. From inspecting the website, I know that the information I am looking for is contained in a tag with the attribute data-label="Künstler" ("artist"). But the HTML obtained after scraping and rendering does not contain a single such tag.

I don't know what to do, can someone help me and point me in the right direction?

from requests_html import HTMLSession


charts = {'ODC50': {
            'name': 'ODC50',
            'anz': 50,
            'url': 'https://www.mix1.de/charts/dance50.htm',
            'entry': 'div.charts-main-block',
            'date': '#mix1_content div.mybox_content'
        },
        'DDPHot50': {
            'name': 'DDP Hot50',
            'anz': 50,
            'url': 'https://www.deutsche-dj-playlist.de/hot-50/dance',
            'entry': 'div.list div.entry',
            'date': 'div.header div.title'
        },
        'Ostseewelle': {
            'name': 'Ostseewelle',
            'anz': 20,
            'url': 'https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html',
            'entry': 'section',
            'date': 'h3.text-center.titel1'
        }
}

choice = 'Ostseewelle'


# Look up the URL of the selected chart
chart_site = charts[choice]['url']
session = HTMLSession()
r = session.get(chart_site)
# Execute the page's JavaScript in headless Chromium before parsing
r.html.render(sleep=2, keep_page=True, scrolldown=5, timeout=30)

print(r.status_code)

html = r.html

# print(html.html)

# Count the table cells that should hold the artist names
tds = html.xpath('//td[@data-label="Künstler"]')
print(f'Entries found: {len(tds)}')

print('Program finished')

The HTML I get back is not what I expected to parse: the elements I am looking for are missing.

Best answer, by Andrej Kesely

The chart data you see on the page is loaded from an external URL. To get the artist information, you can request that URL directly, as in the following example:

import requests
from bs4 import BeautifulSoup

url = "https://enricoostendorf.de/top20/top20eo.php"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for k in soup.select('[data-label="Künstler"]'):
    l1, l2 = k.get_text(strip=True, separator="|||").split("|||")
    print(l1)
    print(l2)
    print("-" * 80)

Prints:

...

--------------------------------------------------------------------------------
Loi
"Am I Enough"
--------------------------------------------------------------------------------
Nico Santos & Fast Boy
"Where You Are"
--------------------------------------------------------------------------------
Ofenbach
"Overdrive" (feat. Norma Jean Martine)
--------------------------------------------------------------------------------
Robin Schulz, Rita Ora, Tiago PZK
"I'll Be There"
--------------------------------------------------------------------------------
Tate McRae
"greedy"
--------------------------------------------------------------------------------
Dua Lipa
"Houdini"
--------------------------------------------------------------------------------
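
If you run into the same problem on another site, one way to look for such an external source is to list the src attributes of the iframe, embed and script tags in the HTML you did receive. The following is only a minimal sketch using the standard library. The helper name find_embedded_sources, the class EmbeddedSourceFinder and the sample markup are hypothetical, and whether the Ostseewelle page actually embeds its chart through one of these tags is an assumption that is not confirmed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedSourceFinder(HTMLParser):
    """Collect the src URLs of <iframe>, <embed> and <script> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "embed", "script"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative URLs against the page URL.
                self.sources.append((tag, urljoin(self.base_url, src)))


def find_embedded_sources(html_text, base_url):
    """Return (tag, absolute_url) pairs for externally loaded content."""
    finder = EmbeddedSourceFinder(base_url)
    finder.feed(html_text)
    return finder.sources


# Hypothetical sample markup, NOT copied from the real page.
sample = """
<html><body>
<h3 class="text-center titel1">Charts</h3>
<iframe src="https://example.com/charts/top20.php"></iframe>
<script src="/static/app.js"></script>
</body></html>
"""

for tag, src in find_embedded_sources(sample, "https://example.com/page.html"):
    print(tag, src)
```

With a real page you would pass the response text and the page URL instead of the sample, then open each candidate URL and check whether it contains the data-label="Künstler" cells. The Network tab of the browser's developer tools shows the same information and also catches data fetched by JavaScript, which this static check cannot see.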