I'm attempting to scrape data from a website but I'm encountering issues with multiple pages. Somehow, my iterations always result in the error message 'All arrays must be of the same length'. Can somebody help me identify where I went wrong? Below is the code I'm using:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def replaced(text):
    return text.replace('\n\n\n\n\n', '')
total_page = 3
current_page = 1
judul_list = []
harga_list = []
distance = []
transmit = []
location = []
sp = []
rec_seller = []
while current_page <= total_page:
    url = f""
    headers = {"User-Agent": }
    page_request = requests.get(url, headers=headers)
    soup = BeautifulSoup(page_request.content, "html.parser")
    containers = soup.find_all('div', {'class': 'grid'})
    container = containers[0]
    judul = container.findAll('h2', {'class': 'listing__title epsilon flush'})
    judul_list += [replaced(i.text) for i in judul]
    harga = container.findAll('div', {'class': 'listing__price delta weight--bold'})
    harga_list += [replaced(j.text) for j in harga]
    specs = container.findAll('div', {'class': 'listing__specs soft-quarter--ends soft-half--sides milli'})
    specs_list = [replaced(k.text) for k in specs]
    distance += [k.split('|')[1].strip() for k in specs_list]
    transmit += [k.split('|')[2].strip() for k in specs_list]
    location += [k.split('|')[3].strip() for k in specs_list]
    sp += [k.split('|')[4].strip() for k in specs_list]
    rec_seller += [k.split('|')[5].strip() for k in specs_list]
    current_page += 1
tahun = [a.split()[0].strip('|') for a in judul_list]
merek = [a.split()[1].strip('|') for a in judul_list]
series = [a.split()[2].strip('|') for a in judul_list]
# Create DataFrame - this is the call that raises
# 'All arrays must be of the same length' when the lists differ
data = {
    'Tahun': tahun,
    'Merek': merek,
    'Series': series,
    'Harga': harga_list,
    'Distance': distance,
    'Transmit': transmit,
    'Location': location,
    'SP': sp,
    'Rec_Seller': rec_seller
}
df = pd.DataFrame(data)
In newer code, avoid the old syntax `findAll()`; instead use `find_all()` or `select()` with CSS selectors. For more, take a minute to check the docs.

Instead of using a multitude of different lists, whose equality in length cannot be guaranteed, just try a list of dictionaries - this also has the charming advantage that missing values are simply ignored during the transformation into a DataFrame. To do this, also change your strategy of selecting the elements: focus on the containers and iterate over these to extract the respective contents.
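For instance, on a minimal fragment built from the title markup in your question, both spellings return the same tags, and `select()` replaces the tag/attribute arguments with a CSS selector:

```python
from bs4 import BeautifulSoup

html = '<div class="grid"><h2 class="listing__title epsilon flush">2018 Toyota Avanza</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# the old camelCase alias and the current method return the same result
assert soup.findAll("h2") == soup.find_all("h2")

# select_one() takes a CSS selector instead of tag/attribute arguments
title = soup.select_one("h2.listing__title").get_text(strip=True)
print(title)  # 2018 Toyota Avanza
```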
Furthermore, in case of doubt, the generation of the specs could also be made more generic, and your `def` to replace the line breaks is not necessary - simply use `get_text()` with parameter `strip=True`.

Example
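A minimal sketch of the list-of-dictionaries approach. The class names come from your question; the inline HTML (standing in for one fetched page), the wrapping `<div class="listing">` per result, and the column names are assumptions - on the real site you would build the soup from `requests.get(url, headers=headers).content` inside your page loop:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for one fetched page; structure per listing is assumed.
html = """
<div class="grid">
  <div class="listing">
    <h2 class="listing__title epsilon flush">2018 Toyota Avanza</h2>
    <div class="listing__price delta weight--bold">Rp 150 Juta</div>
    <div class="listing__specs soft-quarter--ends soft-half--sides milli">
      Used | 50.000 km | Automatic | Jakarta | Dealer | Recommended
    </div>
  </div>
  <div class="listing">
    <h2 class="listing__title epsilon flush">2016 Honda Jazz</h2>
    <!-- no price here, to show how a missing value behaves -->
    <div class="listing__specs soft-quarter--ends soft-half--sides milli">
      Used | 80.000 km | Manual | Bandung | Dealer | Recommended
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
# One dict per listing: the fields of the same car always stay together,
# and a field that is absent simply becomes NaN in the DataFrame.
for e in soup.select("div.grid > div"):
    d = {}
    if (title := e.select_one("h2.listing__title")):
        d["Judul"] = title.get_text(strip=True)
    if (price := e.select_one("div.listing__price")):
        d["Harga"] = price.get_text(strip=True)
    if (specs := e.select_one("div.listing__specs")):
        # generic: number the spec fields instead of hard-coding five slots
        for i, part in enumerate(specs.get_text(strip=True).split("|")):
            d[f"spec_{i}"] = part.strip()
    data.append(d)

df = pd.DataFrame(data)
print(df)
```

Deriving `Tahun`/`Merek`/`Series` can then be done on the finished DataFrame, e.g. with `df["Judul"].str.split(expand=True)`, instead of on yet another parallel list.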