Why does part of the content magically disappear?


I'm trying to scrape the site https://www.pik.ru/search/vangarden/storehouse. I fetch the HTML from the site successfully and write it to a file, but when I then try to work with that HTML, a lot of the information is missing.

Examples: what I get from the page in the browser (screenshot 1; that's not all of it):

What I get when I try to process it (screenshot 2):

Please help me figure out what I'm doing wrong (thank you!). My code:

import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.pik.ru/search/storehouse'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}

proxies = {
    'https': 'http://146.247.105.71:4827'
}


def download_pages_objects(url):
    if os.path.isfile(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt'):
        os.remove(
            r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt')

    list_links = []
    req = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, "lxml")

    for i in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
        list_links.append('https://www.pik.ru'+i.get('href')+'\n')

    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'a') as file:
        for link in list_links:
            file.write(link)


def get_list_objects_links(url):
    download_pages_objects(url)

    list_of_links = []
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'r') as file:
        for item in file:
            list_of_links.append(item)

    return list_of_links


list_links = get_list_objects_links(url)

count = 0
for link in list_links:
    req = requests.get(link.replace('\n', ''),
                       headers=headers, proxies=proxies)

    with open('1.html', 'w', encoding='utf-8') as file:
        file.write(req.text)

    soup = BeautifulSoup(req.text, 'lxml')
    print(req.text, '\n\n\n')
    print(soup.find_all('div', ''), '\n\n\n')

    with open('1.html', 'r', encoding='utf-8') as file:
        scr = file.read()

    print(scr, '\n\n\n')
    soup = BeautifulSoup(scr, 'lxml')
    print(soup)

    count += 1
    if count == 1:
        break

I also tried processing it without writing to the files, and switched between the lxml, xml, and html.parser parsers. Nothing helps (or I'm doing something wrong).

There is 1 answer below.

Best answer, by a small orange:

As noted by JonSG, what you see in the browser is the result of the browser's engine executing JavaScript and modifying the page dynamically. requests fetches the page's contents as-is, so BeautifulSoup only ever sees the un-rendered HTML.
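You can see this for yourself by looking at what the static HTML actually contains. The snippet below uses a simplified stand-in for the response body (not the real pik.ru markup) to show the typical shape of a client-side-rendered page: an empty container plus the script bundles that build the page in the browser.

```python
from bs4 import BeautifulSoup

# Stand-in for req.text: a typical client-side-rendered "shell" page.
# The real response looks much the same before JavaScript runs.
html = """
<html><head><title>Search</title></head>
<body>
  <div id="root"></div>
  <script src="/bundle.js"></script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", id="root")

# The container is empty: the cards you see in the browser only exist
# after JavaScript runs, which requests never does.
print(repr(root.get_text(strip=True)))  # prints ''
```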

What you're looking for is a web driver, for example, Selenium: Download web page content using Selenium Webdriver and HtmlUnit
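A minimal sketch of that approach with Selenium (assumes `pip install selenium` and a local Chrome install; Selenium 4 downloads the matching chromedriver itself). The fixed sleep is a crude placeholder; waiting on a specific card selector with `WebDriverWait` would be more robust.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless=new")  # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.pik.ru/search/storehouse")
    time.sleep(5)  # give the page's JavaScript time to render the cards
    html = driver.page_source  # the DOM *after* JavaScript has run
finally:
    driver.quit()

# Now BeautifulSoup sees the same content the browser shows.
soup = BeautifulSoup(html, "lxml")
```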