I have around 40,000 HTML files on disk and a function that parses each one with Beautiful Soup and returns a dictionary per file. While reading/parsing I append all the dictionaries to a list and create a pandas DataFrame at the end.
It all works fine in synchronous mode, but it takes a long time to run, so I want to run it with aiofiles.
Currently my code looks like this:
# Function for fetching all ad info from a single page
async def getFullAdSoup(soup):
    ...
    adFullFInfo = {}  # dictionary parsed from the Beautiful Soup object
    return await adFullFInfo

async def main():
    adExtendedDF = pd.DataFrame()
    adExtendedInfo = {}
    htmls = glob.glob("HTML_directory" + "/*.html")  # Get all HTML files from directory
    htmlTasks = []  # Holds list of returned dictionaries
    for html in natsorted(htmls):
        async with aiofiles.open(html, mode='r', encoding='UTF-8', errors='strict', buffering=1) as f:
            contents = await f.read()
            htmlTasks.append(getFullAdSoup(BeautifulSoup(contents, features="lxml")))
    htmlDicts = await asyncio.gather(*htmlTasks)
    adExtendedDF = pd.DataFrame(data=htmlDicts, ignore_index=True)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
The error I'm getting is:

File "C:/Users/.../test.py", line 208, in getFullAdSoup
    return await adFullFInfo
TypeError: object dict can't be used in 'await' expression
I found a similar question here, but I'm unable to make it work. I don't know how to transform my parsing function to asynchronous mode, or how to iterate over the files while calling that function.
Your error happens because you await a dict. I'm guessing you misunderstood: you don't need to await in the return statement for a function to be async. I would refactor it like this.
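A minimal sketch of the fix, with a placeholder dict standing in for the real Beautiful Soup logic: the dict is a plain value, so you return it directly, and `await` belongs at the call site where the coroutine is consumed.

```python
import asyncio

async def getFullAdSoup(soup):
    adFullFInfo = {}                  # dictionary built from the soup object
    adFullFInfo["title"] = "example"  # placeholder for the real parsing
    return adFullFInfo                # NOT `return await adFullFInfo`

async def main():
    # await the coroutine here, where it is called
    return await getFullAdSoup(None)

print(asyncio.run(main()))  # {'title': 'example'}
```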
4 changes:
Changes 2 and 3 are optional, but I find they make a big speed difference depending on what you are doing.