Scrape only HTML+Microdata with Nokogiri

Problem

I need to scrape HTML pages and extract only the HTML that has Microdata in it.

Example

<div itemscope itemtype="http://schema.org/Movie">
  <span>SOMETHING ELSE</span>
  <script>SOMETHING</script>
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  <img url="DOESNT MATTER" />
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

<div>SOMETHING ELSE</div>

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="brand">ACME</span>
  <span itemprop="name">Executive Anvil</span>

  <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span>SOMETHING ELSE</span>
    <meta itemprop="priceCurrency" content="USD" />
    $<span itemprop="price">119.99</span>
  </span>
</div>

Goal: only the HTML with Microdata

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="brand">ACME</span>
  <span itemprop="name">Executive Anvil</span>
  <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <meta itemprop="priceCurrency" content="USD" />
    $<span itemprop="price">119.99</span>
  </span>
</div>

Attempt

I tried to use:

doc.css("*[itemtype]").each do |container|
  puts container.to_html
end

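But that doesn't work: the selector matches every element that has an itemtype, and to_html prints each match together with all of its nested items. When the loop later reaches those nested items, it prints them again, so the output is duplicated, i.e., MOVIE+PERSON > PERSON > PRODUCT+OFFER > OFFER. It also keeps the elements that carry no Microdata at all.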
