I can't seem to produce a dataframe from a webpage table


I'm not sure where the problem is, but my code does not return the DataFrame I expect from the webpage table. This is my first extraction project, and I can't identify what is going wrong.

This is the code:

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime 

url = 'https://en.wikipedia.org/wiki/List_of_largest_banks#By_market_capitalization'
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = '/home/project/Largest_banks_data.csv'
log_file = '/home/project/code_log.txt'  
table_attribs = {'Bank name': 'Name', 'Market Cap (US$ Billion)': 'MC_USD_Billion'}

###  Task 2 - Extract process

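def extract(url, table_attribs):
    """Scrape the market-capitalization table and return it as a DataFrame."""
    # Load the webpage for scraping
    html_page = requests.get(url).text

    # Parse the HTML content of the webpage
    data = BeautifulSoup(html_page, 'html.parser')

    # Find the first table that carries the 'wikitable' class (assumed here to
    # be the market-capitalization table). A multi-word string such as
    # class_='wikitable sortable' matches only when the class attribute is
    # exactly that string, so a single class name is the safer filter.
    main_table = data.find('table', class_='wikitable')
    if main_table is None:
        return None

    # table_attribs holds column names, not HTML attributes, so it must not be
    # passed to find_all(attrs=...); doing that matches nothing.
    extracted_data = []
    for row in main_table.find_all('tr'):
        cells = row.find_all('td')
        # Header rows contain only <th> cells, so they have no <td> cells.
        # Assumes each data row is laid out as: rank, bank name, market cap.
        if len(cells) >= 3:
            name = cells[1].get_text(strip=True)
            market_cap = cells[2].get_text(strip=True)
            extracted_data.append([name, market_cap])

    # Use pandas to create a DataFrame from the extracted data
    df = pd.DataFrame(extracted_data, columns=list(table_attribs.values()))

    return df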

# Calling the extract function
df = extract(url, table_attribs)

if df is not None and not df.empty:
    # Print the resulting DataFrame
    print(df)
else:
    print("Extraction failed.")

There is 1 solution below


You could just read the page directly into pandas (html_page is the HTML text that your extract function fetches with requests.get(url).text):

tables = pd.read_html(html_page)

This will load 3 DataFrames, one for each of the 3 tables on the page. You can then print (or otherwise process) each table separately; for example,

tables[0] 

will give you the first table ("By market capitalization"); outside a notebook or interactive session, use print(tables[0]) to display it.
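
As a minimal sketch of this approach, the example below parses a small inline HTML table instead of the live Wikipedia page, so it runs without network access. The sample rows ("Bank A", "Bank B" and their values) are placeholders invented for illustration, not real figures, and the three-column layout is an assumption about the page; table_attribs is the mapping from the question. The HTML is wrapped in io.StringIO because recent pandas versions deprecate passing a literal HTML string to read_html. Note that read_html needs an HTML parser such as lxml (or bs4 with html5lib) installed.

```python
from io import StringIO

import pandas as pd

# Placeholder HTML standing in for the downloaded page (values are made up).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Bank name</th><th>Market Cap (US$ Billion)</th></tr>
  <tr><td>1</td><td>Bank A</td><td>100.5</td></tr>
  <tr><td>2</td><td>Bank B</td><td>80.25</td></tr>
</table>
"""

# Column mapping from the question: page header -> desired column name.
table_attribs = {'Bank name': 'Name', 'Market Cap (US$ Billion)': 'MC_USD_Billion'}

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(sample_html))

# Keep only the mapped columns from the first table and rename them.
df = tables[0][list(table_attribs)].rename(columns=table_attribs)
print(df)
```

With the real page, the same selection and rename should work as long as the headers in table_attribs match the table's column names exactly; printing tables[0].columns first is a quick way to check.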