As an example, https://download.bls.gov/pub/time.series/ shows date / timestamp / file size information that doesn't appear to be enclosed by HTML tags. If we'd like to capture the date and timestamp information related to each link, what are the ideal techniques for doing so using JSoup?
<br> 9/14/2021 8:31 AM 2114 <A HREF="/pub/time.series/ap/ap.area">ap.area</A><br> 4/14/2005 2:53 PM 987 <A HREF="/pub/time.series/ap/ap.contacts">ap.contacts</A><br>
There is some debate about whether this sort of information can be parsed efficiently (see "Getting directory listing over http").
But if we examine your concrete example, we observe the following:
- The metadata is held in TextNodes inside the `pre` element.
- Each link (`a` element) has a direct sibling `br` that precedes it. Well, except for the root directory: https://download.bls.gov/. You have to treat that case separately.

This constitutes enough information for efficient queries:
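A minimal sketch of such a query, assuming the listing rows sit directly inside a `pre` element as observed above. The class name `DirectoryListing`, the helper `metaDataByLink`, and the inline `SAMPLE` markup (modeled on the two rows from the question) are illustrative assumptions, not part of any library. The selector `pre > br + a` keeps only links whose preceding element sibling is a `br`, and `previousSibling()` then yields the text node that carries the date, time, and size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class DirectoryListing {
    // Illustrative sample modeled on the question's rows; the <pre> wrapper is assumed.
    static final String SAMPLE =
            "<pre><br> 9/14/2021  8:31 AM         2114 "
          + "<A HREF=\"/pub/time.series/ap/ap.area\">ap.area</A>"
          + "<br> 4/14/2005  2:53 PM          987 "
          + "<A HREF=\"/pub/time.series/ap/ap.contacts\">ap.contacts</A><br></pre>";

    // Maps each link text to the raw metadata text that precedes the link.
    static Map<String, String> metaDataByLink(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        // "br + a" matches links whose preceding element sibling is a <br>.
        for (Element link : document.select("pre > br + a")) {
            // previousSibling() includes text nodes, unlike previousElementSibling().
            Node previous = link.previousSibling();
            if (previous instanceof TextNode) {
                String metaDataRow = ((TextNode) previous).text().trim();
                result.put(link.text(), metaDataRow);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        metaDataByLink(SAMPLE)
                .forEach((name, row) -> System.out.println(name + " -> " + row));
    }
}
```

Against the live page you would presumably obtain the `Document` with `Jsoup.connect(url).get()` instead of parsing an inline string; the rest of the loop stays the same.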
You can further split up the `metaDataRow` to extract timestamps like so:
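One possible way to do the split, sketched under the assumption that every row follows the layout seen in the question (`M/d/yyyy h:mm AM|PM size`). The class `MetaDataRowParser` and its methods are hypothetical names, and returning `-1` for a non-numeric size column (for example a directory marker) is an assumption about the listing, not something the question confirms.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class MetaDataRowParser {
    // Assumed row layout: "M/d/yyyy h:mm AM|PM <size>", as in the sample rows.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.US);

    // Splits on runs of whitespace, so column padding does not matter.
    static String[] tokens(String metaDataRow) {
        return metaDataRow.trim().split("\\s+");
    }

    // Joins the date, time, and AM/PM tokens and parses them as one timestamp.
    static LocalDateTime timestamp(String metaDataRow) {
        String[] t = tokens(metaDataRow);
        return LocalDateTime.parse(t[0] + " " + t[1] + " " + t[2], FORMAT);
    }

    // Returns the size in bytes, or -1 if the fourth column is missing or not a number.
    static long size(String metaDataRow) {
        String[] t = tokens(metaDataRow);
        if (t.length < 4) {
            return -1L;
        }
        try {
            return Long.parseLong(t[3]);
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        String metaDataRow = " 9/14/2021  8:31 AM         2114 ";
        System.out.println(timestamp(metaDataRow));
        System.out.println(size(metaDataRow));
    }
}
```

The listing carries no time zone, so `LocalDateTime` is used here rather than a zoned type; convert it only if you know which zone the server reports.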