SgmlReader infinite loop on large document?


I've got a project to scrape data from the SEC EDGAR site. Part of the task is to get the meat of each filing, and I was testing some of that today.

I ran into this somewhat large filing (https://www.sec.gov/Archives/edgar/data/355437/000119312520189547/0001193125-20-189547.txt) that's about 110 MB.

I was breaking the package up into its constituent <DOCUMENT> nodes and processing each one differently, based on its FILENAME node value. For the types that were HTML/XML based, I just used

SgmlReader.ReadInnerXml();

to grab the innards, but on this large filing it appears to go into an infinite loop. It ran for 15 minutes before I broke in with the debugger, and it was still stuck in that call.
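For context, the splitting step can be sketched roughly as below. This is a simplified illustration, not the actual project code: `EdgarSplitter` and `SplitDocuments` are hypothetical names, and it assumes that each `<DOCUMENT>`, `</DOCUMENT>` and `<FILENAME>` tag sits on its own line, which is how the full-submission text files appear to be laid out. It uses only the .NET base class library and leaves the SgmlReader call out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EdgarSplitter
{
    // Hypothetical helper: streams the submission line by line and yields
    // one (FileName, Body) pair per <DOCUMENT> ... </DOCUMENT> block.
    // Assumes the tags each start their own line (an assumption, see above).
    public static IEnumerable<(string FileName, string Body)> SplitDocuments(TextReader reader)
    {
        string fileName = null;      // value of the <FILENAME> line, if any
        StringBuilder body = null;   // non-null only while inside a block
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            string trimmed = line.Trim();

            if (trimmed == "<DOCUMENT>")
            {
                // Start of a new block: reset the per-document state.
                fileName = null;
                body = new StringBuilder();
            }
            else if (trimmed == "</DOCUMENT>")
            {
                // End of the block: hand it to the caller.
                if (body != null)
                    yield return (fileName, body.ToString());
                body = null;
            }
            else if (body != null)
            {
                // Remember the first <FILENAME> value seen in this block.
                if (fileName == null &&
                    trimmed.StartsWith("<FILENAME>", StringComparison.Ordinal))
                {
                    fileName = trimmed.Substring("<FILENAME>".Length).Trim();
                }
                body.AppendLine(line);
            }
        }
    }
}
```

Each yielded body can then be routed by the extension of its file name (for example via `Path.GetExtension`), and only the HTML/XML ones are handed to the SGML reader, so the hang can be tied to one specific document rather than to the whole 110 MB package.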

Has anyone ever run into that before?

I'm using SgmlReader 1.8.16.

I saw a very old comment on a changelog page saying that there was such a bug with improperly terminated HTML comments, but that was listed as fixed a good number of releases ago.

Thanks
