How to parse unstructured data (e.g. from an HTML directory listing) using jsoup?

143 Views Asked by discord At 05 October 2021 at 16:27

For example, https://download.bls.gov/pub/time.series/ shows date, timestamp, and file size information that doesn't appear to be enclosed in HTML tags. If we'd like to associate the date and timestamp with each link, what are good techniques for capturing this information using jsoup?

<br> 9/14/2021  8:31 AM         2114 <A HREF="/pub/time.series/ap/ap.area">ap.area</A><br> 4/14/2005  2:53 PM          987 <A HREF="/pub/time.series/ap/ap.contacts">ap.contacts</A><br>

There is 1 best solution below

Janez Kuhar On 06 October 2021 at 10:16 BEST ANSWER

There is some debate about whether this sort of information can be parsed efficiently; see "Getting directory listing over http".

But if we examine your concrete example, we observe the following:

- your file/folder metadata is stored as TextNodes inside the pre element,
- every relevant file/folder link (a element) is immediately preceded by a sibling br element. The exception is the root directory, https://download.bls.gov/, which you have to treat separately.

This constitutes enough information for efficient queries:

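import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
Document doc = Jsoup.connect("https://download.bls.gov/pub/time.series/").get();
// Links that directly follow a <br>, i.e. the file/folder entries
Elements links = doc.select("pre br + a");
List<TextNode> metaData = doc.select("pre").textNodes();
for (int i = 0; i < links.size(); i++) {
    // getWholeText() returns the raw, unescaped text of the node
    String metaDataRow = metaData.get(i).getWholeText();
    System.out.println(metaDataRow + " | " + links.get(i));
}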

You can further split up the metaDataRow to extract timestamps like so:

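import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy pph:m a", Locale.ENGLISH);
// ...
// Assumes at least eight spaces separate the timestamp from the size column
String[] metaColumns = metaDataRow.split("        ");
LocalDateTime lastUpdated = LocalDateTime.parse(metaColumns[0].strip(), formatter);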
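
The fixed-width split above depends on how many spaces pad the size column, so here is a minimal sketch of an alternative that matches each row with a regular expression instead. It assumes rows look like the sample in the question (date, time, AM/PM, then a byte count), and additionally assumes that folder rows carry a `<dir>` marker in place of the size once the text is decoded; I have not verified that against every listing. The `ListingRowParser` and `Row` names are hypothetical, and the code uses only the standard library, so you would pass it the text of each TextNode.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingRowParser {
    // Assumed layout: date, time, AM/PM, then a size in bytes or "<dir>",
    // separated by runs of whitespace of any length.
    private static final Pattern ROW = Pattern.compile(
            "^\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2})\\s+(AM|PM)\\s+(\\d+|<dir>)\\s*$");

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yyyy h:mm a", Locale.ENGLISH);

    /** Hypothetical value holder; size is -1 for directory rows. */
    static final class Row {
        final LocalDateTime lastModified;
        final long size;

        Row(LocalDateTime lastModified, long size) {
            this.lastModified = lastModified;
            this.size = size;
        }
    }

    static Row parse(String text) {
        Matcher m = ROW.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized row: " + text);
        }
        // Rejoin the captured pieces with single spaces so the padding no longer matters.
        String stamp = m.group(1) + " " + m.group(2) + " " + m.group(3);
        LocalDateTime lastModified = LocalDateTime.parse(stamp, FORMAT);
        long size = "<dir>".equals(m.group(4)) ? -1L : Long.parseLong(m.group(4));
        return new Row(lastModified, size);
    }

    public static void main(String[] args) {
        Row row = parse(" 9/14/2021  8:31 AM         2114 ");
        System.out.println(row.lastModified + " | " + row.size);
    }
}
```

Inside the loop from the first snippet you could call `ListingRowParser.parse(metaDataRow)`. Rows that don't match throw an exception rather than silently yielding a wrong date, which makes layout changes on the server easier to notice.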
