Unable to parse element attribute with XOM


I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored in the src attribute of an <img> element, as seen below.

<rss version="2.0">
  <channel>
    <item>
      <title>Decision Paralysis</title>
      <link>https://xkcd.com/1801/</link>
      <description>
        <img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
      </description>
      <pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
      <guid>https://xkcd.com/1801/</guid>
    </item>
  </channel>
</rss>

Calling .getFirstChildElement("img") on the <description> element returns null, so my code crashes with a NullPointerException when I try to read the src attribute. Why is my program failing to find the <img> element, and how can I read it properly?

import nu.xom.*;

public class RSSParser {
    public static void main(String[] args) {
        try {
            Builder parser = new Builder();
            Document doc = parser.build("https://xkcd.com/rss.xml");
            Element rootElement = doc.getRootElement();
            Element channelElement = rootElement.getFirstChildElement("channel");
            Elements itemList = channelElement.getChildElements("item");

            // Iterate through itemList
            for (int i = 0; i < itemList.size(); i++) {
                Element item = itemList.get(i);
                Element descElement = item.getFirstChildElement("description");
                Element imgElement = descElement.getFirstChildElement("img");
                // Crashes with NullPointerException
                String imgSrc = imgElement.getAttributeValue("src");
            }
        }
        catch (Exception error) {
            error.printStackTrace();
            System.exit(1);
        }
    }
}

There are 2 best solutions below

Answer by Elliotte Rusty Harold:

There is no img element in the item, so getFirstChildElement("img") returns null. At a minimum, check for null before using the result:

  if (imgElement != null) {
    String imgSrc = imgElement.getAttributeValue("src");
  }

What the item contains is this:

<description>&lt;img
    src="http://imgs.xkcd.com/comics/us_state_names.png"
    title="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes."
    alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." /&gt;
</description>

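That's not an img element. The angle brackets are escaped as &lt; and &gt;, so the parser sees the whole tag as plain text inside <description>.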
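Because the description holds the tag as escaped text, one option is to take that text and parse it a second time as its own small XML document. The sketch below illustrates the idea with the JDK's built-in DOM parser rather than XOM, so that it runs without extra dependencies. The class and method names (EscapedImgDemo, imgSrcFromDescription) are made up for illustration, and it assumes the description text is a single well-formed element. HTML that is not well-formed XML would make the second parse throw.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XOM or of the original question.
class EscapedImgDemo {
    // Parses a standalone XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Treats the description's text as a tiny XML document whose root
    // is the <img> element, then reads its src attribute.
    // Assumption: the text is exactly one well-formed element.
    static String imgSrcFromDescription(String descriptionText) throws Exception {
        Document fragment = parse(descriptionText.trim());
        return fragment.getDocumentElement().getAttribute("src");
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the real feed: the <img> tag is escaped text.
        String item = "<item><description>"
                + "&lt;img src=\"https://imgs.xkcd.com/comics/decision_paralysis.png\" /&gt;"
                + "</description></item>";

        Document doc = parse(item);
        Element desc = (Element) doc.getElementsByTagName("description").item(0);

        // The first parse finds no <img> child element, only a text node.
        System.out.println(desc.getElementsByTagName("img").getLength()); // prints 0

        // The unescaped text can be parsed on its own to reach the attribute.
        System.out.println(imgSrcFromDescription(desc.getTextContent()));
        // prints https://imgs.xkcd.com/comics/decision_paralysis.png
    }
}
```

With XOM the same approach should carry over by passing descElement.getValue() to Builder.build(String document, String baseURI) and reading the attribute from the resulting root element, though that variant is not tested here.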

Answer by Stevoisiak:

I managed to come up with a somewhat hacky solution using regex and pattern matching.

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Compile once, outside the loop; group 1 captures the URL between the quotes
Pattern pattern = Pattern.compile("src=\"([^\"]*)\"");

// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
    Element item = itemList.get(i);
    // getValue() returns the unescaped text of <description>
    String descString = item.getFirstChildElement("description").getValue();

    // Parse image URL (hacky)
    String imgSrc = "";
    Matcher matcher = pattern.matcher(descString);
    if (matcher.find()) {
        imgSrc = matcher.group(1);
    }
}
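The pattern above matches the first src="..." anywhere in the description text and only handles double quotes. A slightly stricter variant, sketched below as a self-contained helper, anchors the match inside an <img> tag and accepts either quote style. The class and method names (ImgSrcRegex, findImgSrc) are hypothetical, and this remains a regex heuristic rather than a real HTML parser, so unusual markup can still defeat it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class ImgSrcRegex {
    // Finds src="..." or src='...' inside an <img ...> tag.
    // Group 1 is the opening quote character, group 2 is the URL.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the first image URL found, or an empty Optional if none.
    static Optional<String> findImgSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String desc = "<img src=\"http://imgs.xkcd.com/comics/us_state_names.png\" alt=\"x\" />";
        System.out.println(findImgSrc(desc).orElse("(no image)"));
        // prints http://imgs.xkcd.com/comics/us_state_names.png
    }
}
```

Returning an Optional instead of an empty string makes the "no image in this item" case explicit for the caller.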