Scala string pattern matching for html tagged contents extraction

740 Views Asked by elm At 11 June 2025 at 06:30

Given an HTML page fetched with

val html = io.Source.fromURL("http://example.org/aPage.html").mkString

how can the contents wrapped within a given tag be extracted? To illustrate, consider this HTML fragment and the tag <textarea>:

val html = """<p>Marginalia</p>
            <textarea rows="3" cols="10">Contents of interest</textarea>
            <p>More marginalia</p>"""

How can "Contents of interest" be obtained?


There is 1 best solution below

Akos Krivachy On 12 June 2015 at 01:21 BEST ANSWER

There are two easy ways to do this:

Scala XML

Add the Scala XML dependency to your project:

libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.3"

Now you can parse your HTML code and select all textarea tags.

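import scala.xml.XML
// XML.loadString needs well-formed XML with a single root element,
// so a multi-element fragment like the one in the question is wrapped first.
val htmlXml = XML.loadString(s"<root>$html</root>")
val textareaContents = (htmlXml \\ "textarea").text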

If your HTML is well-formed XML and you want to run several XPath-style queries against it, this may be the better way. Also check out this blogpost for more info on what \\ means or how to use the Scala-XML library.
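As a rough sketch of what several queries can look like, assuming the scala-xml dependency above is on the classpath: the object name TextareaXml and the sample page are made up for illustration. Note that calling .text directly on the result of \\ concatenates the text of every match, so mapping over the nodes keeps them separate.

```scala
import scala.xml.XML

object TextareaXml {
  // Illustrative page (an assumption, not from the question): well-formed, single root.
  val page: String =
    """<html><body>
      |<p>Marginalia</p>
      |<textarea rows="3" cols="10">Contents of interest</textarea>
      |<textarea rows="1" cols="5">Second area</textarea>
      |</body></html>""".stripMargin

  val doc = XML.loadString(page)

  // \\ searches all descendants, at any depth.
  val contents: List[String] = (doc \\ "textarea").map(_.text).toList

  // \ "@name" reads an attribute of a node.
  val rows: List[String] = (doc \\ "textarea").map(n => (n \ "@rows").text).toList
}
```

Here TextareaXml.contents should hold one string per textarea, and TextareaXml.rows the matching rows attributes.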

Regexp

Another simple way to do this is to define a regular expression and find the matches:

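// (?s) lets . span newlines; [^>]* skips attributes; (.*?) stops at the first closing tag
val regex = "(?s)<textarea[^>]*>(.*?)</textarea>".r
regex.findAllMatchIn(html).map { m =>
  m.group(1) // process the captured contents (`match` is a reserved word, hence `m`)
}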
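The regex approach can be generalised to any tag name. The following is a minimal sketch using only the standard library; TagContents, tagRegex, contentsOf and sample are hypothetical names introduced here, and the pattern assumes the tag is not nested inside itself. Regular expressions are not a real HTML parser, so this is only suitable for simple, predictable markup.

```scala
import scala.util.matching.Regex

object TagContents {
  // Sample fragment from the question, with the closing tag repaired.
  val sample: String =
    """<p>Marginalia</p>
      |<textarea rows="3" cols="10">Contents of interest</textarea>
      |<p>More marginalia</p>""".stripMargin

  // Hypothetical helper: builds a regex for one tag name.
  // (?i) ignores case, (?s) lets '.' match newlines, [^>]* skips attributes,
  // and the reluctant (.*?) stops at the first closing tag.
  def tagRegex(tag: String): Regex = {
    val t = Regex.quote(tag)
    raw"""(?is)<$t\b[^>]*>(.*?)</$t\s*>""".r
  }

  // Returns the trimmed contents of every occurrence of the tag.
  def contentsOf(tag: String, html: String): List[String] =
    tagRegex(tag).findAllMatchIn(html).map(_.group(1).trim).toList
}
```

For example, TagContents.contentsOf("textarea", TagContents.sample) yields List("Contents of interest"). findAllMatchIn is used rather than findAllIn because it exposes the capture group, whereas findAllIn iterates over the whole matched strings, tags included.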
