Reading a large number of HTML files with asyncio/aiofiles and parsing them into a pandas DataFrame


I have around 40,000 HTML files on disk and a function that parses each HTML file with Beautiful Soup and returns a dictionary for it. During reading/parsing I append all the dictionaries to a list and create a pandas DataFrame at the end.

It all works fine in synchronous mode, but it takes a long time to run, so I want to run it with aiofiles.

Currently my code looks like this:

# Function for fetching all ad info from a single page
async def getFullAdSoup(soup):
    ...
    adFullFInfo = {} # dictionary parsed from the Beautiful Soup object
    return await adFullFInfo


async def main():
    adExtendedDF = pd.DataFrame()
    adExtendedInfo = {}
    htmls = glob.glob("HTML_directory" + "/*.html") # Get all HTML files from directory

    htmlTasks = [] # Holds list of returned dictionaries
    for html in natsorted(htmls):
        async with aiofiles.open(html, mode='r', encoding='UTF-8', errors='strict', buffering=1) as f:
            contents = await f.read()
            htmlTasks.append(getFullAdSoup(BeautifulSoup(contents, features="lxml")))
        htmlDicts = await asyncio.gather(*htmlTasks)
    adExtendedDF = pd.DataFrame(data=htmlDicts, ignore_index=True)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

The error I'm getting is:

File "C:/Users/.../test.py", line 208, in getFullAdSoup
    return await adFullFInfo
TypeError: object dict can't be used in 'await' expression

I found a similar question here, but I'm unable to make it work. I don't know how to transform my parsing function into asynchronous mode, or how to iterate over the files while calling that function.

1 Answer

Your error happens because you await a dict. I'm guessing you misunderstood: you don't need to await in the return statement for the function to be asynchronous. I would refactor it like this:

# Function for fetching all ad info from a single page
async def getFullAdSoup(soup):
    ...
    adFullFInfo = {} # dictionary parsed from the Beautiful Soup object
    return adFullFInfo #****1****


async def main():
    adExtendedDF = pd.DataFrame()
    adExtendedInfo = {}
    htmls = glob.glob("HTML_directory" + "/*.html") # Get all HTML files from directory

    htmlTasks = [] # Holds list of returned dictionaries
    for html in natsorted(htmls):
        async with aiofiles.open(html, mode='r', encoding='UTF-8', errors='strict', buffering=1) as f:
            contents = await f.read()
            htmlTasks.append(asyncio.create_task( #****2****
                getFullAdSoup(BeautifulSoup(contents, features="lxml"))))
        await asyncio.sleep(0) #****3****
    htmlDicts = await asyncio.gather(*htmlTasks) #****4****
    adExtendedDF = pd.DataFrame(data=htmlDicts) # note: DataFrame() has no ignore_index argument


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

4 changes:

  1. No need to await the dict.
  2. Use asyncio.create_task to schedule the task to run ASAP (see the sketch after this list).
  3. sleep(0) releases the event loop so the scheduled tasks can start running.
  4. Move the gather call outside of the loop, so you gather all tasks at once instead of one at a time.
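
For illustration, here is a minimal, self-contained sketch of changes 2-4 in isolation; the parse_one coroutine and the file names are hypothetical stand-ins for the real parsing code:

import asyncio

async def parse_one(name):  # stand-in for getFullAdSoup
    await asyncio.sleep(0.1)  # simulate asynchronous work
    return {"file": name}

async def main():
    tasks = []
    for name in ["a.html", "b.html", "c.html"]:
        tasks.append(asyncio.create_task(parse_one(name)))  # ****2****: schedule immediately
        await asyncio.sleep(0)  # ****3****: yield so the scheduled task can start
    results = await asyncio.gather(*tasks)  # ****4****: collect all results at once
    print(results)  # gather preserves order: a.html, b.html, c.html

if __name__ == '__main__':
    asyncio.run(main())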

Changes 2 and 3 are optional, but I find they make a big difference in speed depending on what you are doing.
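
One caveat: aiofiles makes the file reads non-blocking, but the BeautifulSoup parsing itself is CPU-bound, so asyncio alone will not run it in parallel. If the parse step dominates the runtime, a common alternative is to offload it to a process pool via run_in_executor. Here is a rough sketch, assuming getFullAdSoup can be rewritten as a plain synchronous function (parse_ad below is a hypothetical stand-in):

import asyncio
import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
from bs4 import BeautifulSoup

def parse_ad(path):  # hypothetical synchronous version of getFullAdSoup
    with open(path, encoding='UTF-8') as f:
        soup = BeautifulSoup(f.read(), features="lxml")
    return {"path": path}  # fill in with the fields extracted from soup

async def main():
    htmls = glob.glob("HTML_directory/*.html")
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # each file is parsed in a separate process, sidestepping the GIL
        htmlDicts = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_ad, h) for h in htmls))
    return pd.DataFrame(data=htmlDicts)

if __name__ == '__main__':
    df = asyncio.run(main())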