I'm trying to scrape the site https://www.pik.ru/search/vangarden/storehouse. I fetch the HTML from the web site successfully and write it to a file, but when I later try to process that HTML, lots of information is lost.
Examples:
what I got from the page (screen 1, that's not all of it)
what I got when I tried to process it (screen 2)
Please help me figure out what I'm doing wrong (thank you!). My code:
import requests
from bs4 import BeautifulSoup
import undetected_chromedriver
import time
import os

url = 'https://www.pik.ru/search/storehouse'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}
proxies = {
    'https': 'http://146.247.105.71:4827'
}


def download_pages_objects(url):
    if os.path.isfile(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt'):
        os.remove(
            r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt')
    list_links = []
    req = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, "lxml")
    for i in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
        list_links.append('https://www.pik.ru' + i.get('href') + '\n')
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'a') as file:
        for link in list_links:
            file.write(link)


def get_list_objects_links(url):
    download_pages_objects(url)
    list_of_links = []
    with open(r'C:\Users\kraz1\OneDrive\Рабочий стол\Антон\python\парсинг\кладовочная\pik_links.txt', 'r') as file:
        for item in file:
            list_of_links.append(item)
    return list_of_links


list_links = get_list_objects_links(url)
count = 0
for link in list_links:
    req = requests.get(link.replace('\n', ''),
                       headers=headers, proxies=proxies)
    with open('1.html', 'w') as file:
        file.write(req.text)
    soup = BeautifulSoup(req.text, 'lxml')
    print(req.text, '\n\n\n')
    print(soup.find_all('div', ''), '\n\n\n')
    with open('1.html', 'r') as file:
        scr = file.read()
    print(scr, '\n\n\n')
    soup = BeautifulSoup(scr, 'lxml')
    print(soup)
    count += 1
    if count == 1:
        break
I also tried processing the response without writing it to a file, and switched between lxml, xml, and html.parser; none of that helped (or I did something wrong).
As noted by JonSG, what you see in the browser is the result of the browser's engine executing JavaScript and modifying the page dynamically. requests fetches only the raw HTML the server sends, and BeautifulSoup parses exactly that; no JavaScript is ever executed.
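You can see the problem in miniature without touching the network. The snippet below (a sketch; the markup is a made-up imitation of what a JavaScript-rendered page looks like before scripts run) parses "server HTML" that contains only an empty root div and a script tag, and your `find_all` call comes back empty, exactly as it does against pik.ru:

```python
from bs4 import BeautifulSoup

# A stripped-down imitation of what requests.get() receives from a
# JavaScript-rendered site: the server sends an empty container and
# a script; the browser builds the project cards afterwards.
raw_html = """
<html><body>
  <div id="root"></div>
  <script>/* builds the project-card links client-side */</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
# The anchor tags the question's code searches for are simply not
# present in the static markup, so find_all() returns an empty list.
links = soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx")
print(links)  # []
```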
What you're looking for is a web driver, for example, Selenium: Download web page content using Selenium Webdriver and HtmlUnit
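A minimal sketch of that approach with Selenium is below. It assumes `selenium` and a matching ChromeDriver are installed, and it reuses the CSS class from your code, which may change whenever the site redeploys. The key difference from `requests` is that `driver.page_source` returns the markup *after* the browser has executed the page's JavaScript:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.pik.ru/search/storehouse")
    # Wait until the page's JavaScript has inserted at least one link.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )
    html = driver.page_source  # rendered markup, after script execution
finally:
    driver.quit()

# From here on, the original BeautifulSoup logic works unchanged.
soup = BeautifulSoup(html, "lxml")
for a in soup.find_all("a", class_="styles__ProjectCard-uyo9w7-0 friPgx"):
    print("https://www.pik.ru" + a.get("href"))
```

Since you already import `undetected_chromedriver`, you could also swap `webdriver.Chrome` for its `Chrome` class if the site blocks plain headless Chrome.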