I'm trying to scrape the site https://www.pik.ru/search/vangarden/storehouse. I fetch the HTML from the web site successfully and write it to a file, but when I later try to process that HTML, lots of information is lost.
Examples:
what I got from the page (screen 1, that's not all of it)
what I got when I tried to process it (screen 2)
Please help me figure out what I'm doing wrong (thank you!). My code:
import requests
from bs4 import BeautifulSoup
import undetected_chromedriver
import time
import os

url = 'https://www.pik.ru/search/storehouse'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}
proxies = {
    'https': 'http://146.247.105.71:4827'
}


def download_pages_objects(url):
    if os.path.isfile(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt'):
        os.remove(
            r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt')
    list_links = []
    req = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, "lxml")
    for i in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
        list_links.append('https://www.pik.ru' + i.get('href') + '\n')
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'a') as file:
        for link in list_links:
            file.write(link)


def get_list_objects_links(url):
    download_pages_objects(url)
    list_of_links = []
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'r') as file:
        for item in file:
            list_of_links.append(item)
    return list_of_links


list_links = get_list_objects_links(url)
count = 0
for link in list_links:
    req = requests.get(link.replace('\n', ''),
                       headers=headers, proxies=proxies)
    with open('1.html', 'w') as file:
        file.write(req.text)
    soup = BeautifulSoup(req.text, 'lxml')
    print(req.text, '\n\n\n')
    print(soup.find_all('div', ''), '\n\n\n')
    with open('1.html', 'r') as file:
        scr = file.read()
    print(scr, '\n\n\n')
    soup = BeautifulSoup(scr, 'lxml')
    print(soup)
    count += 1
    if count == 1:
        break
I also tried processing the response without writing it to a file, and switched between lxml, xml, and html.parser; none of that helped (or I did something wrong).
As noted by JonSG, what you see in the browser is the result of the browser's engine executing JavaScript and modifying the page dynamically. requests fetches only the raw HTML the server sends, and BeautifulSoup parses exactly that; no JavaScript is ever executed.
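You can see the problem in miniature without touching the network. The snippet below (a sketch; the markup is a made-up imitation of what a JavaScript-rendered page looks like before scripts run) parses "server HTML" that contains only an empty root div and a script tag, and your `find_all` call comes back empty, exactly as it does against pik.ru:

```python
from bs4 import BeautifulSoup

# A stripped-down imitation of what requests.get() receives from a
# JavaScript-rendered site: the server sends an empty container and
# a script; the browser builds the project cards afterwards.
raw_html = """
<html><body>
  <div id="root"></div>
  <script>/* builds the project-card links client-side */</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
# The anchor tags the question's code searches for are simply not
# present in the static markup, so find_all() returns an empty list.
links = soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx")
print(links)  # []
```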
What you're looking for is a web driver, for example, Selenium: Download web page content using Selenium Webdriver and HtmlUnit
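A minimal sketch of that approach with Selenium is below. It assumes `selenium` and a matching ChromeDriver are installed, and it reuses the CSS class from your code, which may change whenever the site redeploys. The key difference from `requests` is that `driver.page_source` returns the markup *after* the browser has executed the page's JavaScript:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.pik.ru/search/storehouse")
    # Wait until the page's JavaScript has inserted at least one link.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )
    html = driver.page_source  # rendered markup, after script execution
finally:
    driver.quit()

# From here on, the original BeautifulSoup logic works unchanged.
soup = BeautifulSoup(html, "lxml")
for a in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
    print("https://www.pik.ru" + a.get("href"))
```

Since you already import `undetected_chromedriver`, you could also swap `webdriver.Chrome` for its `Chrome` class if the site blocks plain headless Chrome.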