Spraping data from a table is slow but uncertain why

71 Views Asked by At

I am scraping data from a table using selenium with Java but is slow and I am not sure why. Is there a reason why and how can I speed it up? The other thing I noticed is that it seems to slow down more as it progresses. I noticed this by observing the print statements to the console.

Here is my code:

 package mypackage;

import java.io.IOException;
import java.time.Duration;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import com.seleniumpractice.utilities.XLUtils;

import io.github.bonigarcia.wdm.WebDriverManager;

public class CovidWebTable {
    static WebDriver driver;
    static XLUtils xl;
    static List<WebElement> header;
    static List<WebElement> rows;

    public static void main(String[] args) throws IOException {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.get("https://www.worldometers.info/coronavirus");
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(10));
        
        WebElement table = driver.findElement(By.xpath("//table[@id='main_table_countries_today']"));
        rows = table.findElements(By.xpath(".//tr[@role='row']"));
        System.out.println("Total rows: "+rows.size());
        
        xl = new XLUtils(".\\datafiles\\covid.xls");
        //xl.setCellData(null, rows, rows, null);
        
        //Add header
        header = table.findElements(By.xpath(".//thead//th"));
        System.out.println("Header cols: "+ header.size());
        
        for(int col=1; col<header.size()-1; col++) {
            xl.setCellData("Covid Data", 0, col-1, header.get(col).getText());
        }
        
        int xlRow = 1;
                
        for(int r=1; r<rows.size(); r++) {
            
            String a = rows.get(r).getText();
            
            if(rows.get(r).getText().equals("")) {
                System.out.println("Skipped row: "+r);
                continue;
            }
            System.out.println("Writing row "+r);   
            
            for(int c=1; c<header.size(); c++) {
                //String data = rows.get(r).findElement(By.xpath(".//td["+(c+1)+"]")).getText();
                xl.setCellData("Covid Data", xlRow, c-1, rows.get(r).findElement(By.xpath(".//td["+(c+1)+"]")).getText());
                
            }
            xlRow++;
            
        }
        System.out.println("Complete.");
        
        driver.close();
    
        }

}

The code that contains the code for writing to excel:

package com.internetBanking.utilities;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;

public class XLUtils {
    public static FileInputStream fi;
    public static FileOutputStream fo;
    public static HSSFWorkbook wb;
    public static HSSFSheet ws;
    public static HSSFRow row;
    public static HSSFCell cell;
    
    public static int getRowCount(String xlfile, String xlsheet) throws IOException {
        fi = new FileInputStream(xlfile);
        wb = new HSSFWorkbook(fi);
        ws = wb.getSheet(xlsheet);
        int rowcount = ws.getLastRowNum();
        wb.close();
        fi.close();
        return rowcount;
    }
    
    public static int getCellCount(String xlFile, String xlSheet, int rowNum) throws IOException {
        fi = new FileInputStream(xlFile);
        wb = new HSSFWorkbook(fi);
        ws = wb.getSheet(xlSheet);
        row = ws.getRow(rowNum);
        int cellCount = row.getLastCellNum();
        wb.close();
        fi.close();
        return cellCount;
        
    }
    
    public static String getCellData(String xlFile, String xlSheet, int rowNum, int colNum) throws IOException {
        fi = new FileInputStream(xlFile);
        wb = new HSSFWorkbook(fi);
        ws = wb.getSheet(xlSheet);
        row = ws.getRow(rowNum);
        cell = row.getCell(colNum);
        String data;
        try {
            String cellData = new DataFormatter().formatCellValue(cell);
            return cellData;
        }
        catch(Exception e) {
            data = "";
        }
        wb.close();
        fi.close();
        return data;
        
    }
    
    public static void setCellData(String xlFile, String xlSheet, int rowNum, int colNum, String data) throws IOException{
        fi = new FileInputStream(xlFile);
        wb = new HSSFWorkbook(fi);
        ws = wb.getSheet(xlSheet);
        row = ws.getRow(rowNum);
        Cell cell = row.createCell(colNum);
        cell.setCellValue(data);
        //cell = row.getCell(colNum);
        //cell.setCellValue(data);
        fo = new FileOutputStream(xlFile);
        wb.write(fo);
        wb.close();
        fi.close();
        fo.close();
    }
    
    

}

1

There are 1 best solutions below

2
Youans On

Ok getting to understand the code will give you a hint of what you should do, the code is iterating 231 read data from the table and write it to an excel file (daaa!) Ok but when writing into the excel file you write row by row (so!) You then iterate cell by cell on each row on each cell you call setCellData(...) in that XLUtils and here is when it starts to slow down!

In that setCellData each time you call it, it read a file from the disk, opens it, appends data then close the file and since you call it by cell, you end up calling it ~231 (rows) x 15 (col) = 3465

Imagine the time consumed (opening file/writing data/ closing file) 3465 times

So what to do, You need to create a list of list List<ArrayList> rows = new ArrayList()

This is a list of list ^ where each record in the list is another list

and on each row you read from the table (even headers), you create a list of cells and then you add this list to the rows list!

and eventually, you add some utility function in that XLUtils that accept List<ArrayList> that will Open the file once, Iterate on that rows list write their values, and close the file.

Imagine it as if you are trying to move from house to house, and you were using your personal small pickup truck car to move stuff. It won't take all the house stuff so you would go back and forth for a while. On the other hand, you could have used a furniture truck that picks up all furniture and loads it at once.