Scala string pattern matching for extracting the contents of an HTML tag


Given an HTML page fetched with

val html = io.Source.fromURL("http://example.org/aPage.html").mkString

how can the contents wrapped within a given tag be extracted? To illustrate, consider this HTML fragment and the tag <textarea>:

val html = """<p>Marginalia</p>
            <textarea rows="3" cols="10">Contents of interest</textarea>
            <p>More marginalia</p>"""

How can "Contents of interest" be obtained?

Best answer

There are two easy ways to do this:

Scala XML

Add the Scala XML dependency to your project:

libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.3"

Now you can parse your HTML code and select all textarea tags.

import scala.xml.XML
val htmlXml = XML.loadString(html)
val textareaContents = (htmlXml \\ "textarea").text

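This only works if the HTML is well-formed XML (every tag closed, a single root element); XML.loadString throws an exception otherwise. The fragment from the question has three top-level elements, so it would first have to be wrapped in one enclosing element. If that holds and you want to run several XPath-like queries, this may be the better way. Also check out this blogpost for more info on what \\ means or how to use the Scala-XML library.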

Regexp

Another simple way to do this is to define a regular expression and find the matches:

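// (?s) lets . span line breaks; [^>]* and .*? keep each match within one tag pair
val regex = """(?s)<textarea\b[^>]*>(.*?)</textarea>""".r
// "match" is a reserved word in Scala, so the binding is named m
val contents = regex.findAllMatchIn(html).map { m =>
  m.group(1) // text between the opening and closing tag
}.toList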
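
The regular expression approach can be generalised to any tag name. What follows is only a minimal sketch using the standard library: the names TagContents, tagPattern and contentsOf are made up for this illustration, and the pattern assumes the tag is not nested inside itself and does not appear inside comments or scripts.

```scala
import scala.util.matching.Regex

object TagContents {
  // Hypothetical helper: builds a pattern for the given tag name.
  // (?i) ignores case, (?s) lets '.' match line breaks, and the reluctant
  // .*? stops each match at the first closing tag.
  def tagPattern(tag: String): Regex = {
    val name = Regex.quote(tag)
    ("(?is)<" + name + "\\b[^>]*>(.*?)</" + name + "\\s*>").r
  }

  // Returns the text found between each opening/closing pair of `tag`.
  def contentsOf(html: String, tag: String): List[String] =
    tagPattern(tag).findAllMatchIn(html).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    val html =
      """<p>Marginalia</p>
        |<textarea rows="3" cols="10">Contents of interest</textarea>
        |<p>More marginalia</p>""".stripMargin

    println(contentsOf(html, "textarea")) // List(Contents of interest)
    println(contentsOf(html, "p"))        // List(Marginalia, More marginalia)
  }
}
```

For anything beyond simple, flat fragments like the one in the question, a real parser is likely to be the safer choice, since a regular expression cannot track nesting.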