I'm trying to scrape a website, but it requires a Request Verification Token in the request.
To handle this, I first fetch the Request Verification Token from the home page of the website and then send the scraping request with that token included. I'm using Scala 2.13.1, sbt 1.2.8, and the libraries "org.jsoup" % "jsoup" % "1.15.3" and "net.ruippeixotog" %% "scala-scraper" % "3.0.0".
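To make the token-fetching step concrete, here is a minimal sketch of picking the anti-forgery cookie out of raw `Set-Cookie` header values by name instead of by header position. It uses only the standard library. `CookieToken` and `findByPrefix` are hypothetical names, and the assumption that the cookie name starts with `__RequestVerificationToken` (ASP.NET MVC may append a suffix) has not been checked against this site. Note also that ASP.NET MVC anti-forgery normally pairs this cookie with a separate hidden form field of the same name; the sketch covers only the cookie side.

```scala
// Hypothetical helper for illustration; not part of Akka HTTP or scala-scraper.
object CookieToken {

  /** Returns the value of the first cookie whose name starts with `namePrefix`.
    *
    * `setCookieValues` holds raw Set-Cookie header values such as
    * "name=value; path=/; HttpOnly". Attributes after the first ';' are ignored.
    */
  def findByPrefix(setCookieValues: Seq[String], namePrefix: String): Option[String] =
    setCookieValues.iterator
      .map(_.takeWhile(_ != ';').trim) // keep only the leading "name=value" pair
      .map(_.split("=", 2))            // split on the first '=' only
      .collectFirst {
        case Array(n, v) if n.trim.startsWith(namePrefix) => v.trim
      }
}
```

With Akka HTTP, the input could come from the response headers whose name is `Set-Cookie`, which avoids depending on the cookie being the seventh header as `tokenHeaders(6)` does.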
With Akka HTTP Client
implicit val system: ActorSystem = ActorSystem()
implicit val executionContext = system.dispatcher
// Define the URL you want to scrape
val url = "https://solutions.virginia.gov/Notary/Search/Search"
// Make a GET request to the URL
val requestForToken = HttpRequest(HttpMethods.GET, url)
val responseFutureForToken = Http().singleRequest(requestForToken)
// Extract the RequestVerificationToken from the response headers
val tokenFuture = responseFutureForToken.map { response =>
  val tokenHeaders = response.headers.toList
  val setCookieRequestVerificationToken = tokenHeaders(6).toString().split("\\s+|;|=").toList
  setCookieRequestVerificationToken(2)
}
// Wait for the token value and print it
val requestVerificationToken = Await.result(tokenFuture, 5.seconds)
println("Request Verification Code: " + requestVerificationToken)
val postBody = s"__RequestVerificationToken=$requestVerificationToken&Query.FirstName=ab&Query.LastName=ab&Query.NotaryId="
val request = HttpRequest(HttpMethods.POST, url, Nil, HttpEntity(ContentTypes.`application/x-www-form-urlencoded`, postBody))
val responseFuture = Http(system).singleRequest(request)
responseFuture
  .onComplete {
    case Success(res) =>
      logger.info("Result is: {} ", res)
      Unmarshal(res.entity.toStrict(180 seconds)).value.map { result =>
        val htmlStr = result.data.utf8String
        val browser = JsoupBrowser()
        val doc = browser.parseString(htmlStr)
        println(doc)
      }
    case Failure(e) =>
      println(ServerMessage.EXCEPTION)
  }
Output:
Request Verification Code: GcgyA-aWo0c0FhTCQUwxd4ne14KPvPbiU6hBNPmbFJHACjnZOmIINyqa2EPwXq_82JbGMttj21V0NIw5yQnSDo4_SjAK-oOlcdCoyDGucSM1
18:04:31.873 521 [default-akka.actor.default-dispatcher-8] LogFactory$ INFO - Result is: HttpResponse(302 Found,List(Cache-Control: private, Location: /Notary/Error/500?aspxerrorpath=/Notary/Search/Search, Server: Microsoft-IIS/10.0, X-AspNetMvc-Version: 5.2, X-Powered-By: ASP.NET, X-Frame-Options: SAMEORIGIN, Date: Fri, 26 May 2023 13:04:30 GMT),HttpEntity.Strict(text/html; charset=UTF-8,170 bytes total),HttpProtocol(HTTP/1.1))
JsoupDocument(<html>
<head>
<title>Object moved</title>
</head>
<body>
<h2>Object moved to <a href="/Notary/Error/500?aspxerrorpath=/Notary/Search/Search">here</a>.</h2>
</body>
</html>)
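The POST body in the attempt above is assembled by string interpolation, which leaves names and values unencoded. As a hedged sketch, the same body could be built with percent-encoding using only the standard library. `FormBody` and `encode` are hypothetical names introduced here for illustration, and this sketch is not claimed to fix the 500 response.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration; not part of Akka HTTP or scala-scraper.
object FormBody {

  // URL-encodes one name or value as application/x-www-form-urlencoded (UTF-8).
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  /** Joins the fields as name=value pairs separated by '&', encoding both sides. */
  def encode(fields: Seq[(String, String)]): String =
    fields.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")
}
```

The result can be passed to `HttpEntity` in place of the interpolated `postBody`. Akka HTTP's own `FormData(...).toEntity` should serve the same purpose without a custom helper.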
Without Akka HTTP Client for the POST (using JsoupBrowser)
implicit val system: ActorSystem = ActorSystem()
implicit val executionContext = system.dispatcher
// Define the URL you want to scrape
val url = "https://solutions.virginia.gov/Notary/Search/Search"
// Make a GET request to the URL
val requestForToken = HttpRequest(HttpMethods.GET, url)
val responseFutureForToken = Http().singleRequest(requestForToken)
// Extract the RequestVerificationToken from the response headers
val tokenFuture = responseFutureForToken.map { response =>
  val tokenHeaders = response.headers.toList
  val setCookieRequestVerificationToken = tokenHeaders(6).toString().split("\\s+|;|=").toList
  setCookieRequestVerificationToken(2)
}
// Wait for the token value and print it
val requestVerificationToken = Await.result(tokenFuture, 5.seconds)
println("Request Verification Code: " + requestVerificationToken)
val formData = Map(
  "__RequestVerificationToken" -> requestVerificationToken,
  "FirstName" -> "ab",
  "LastName" -> "ab",
  "NotaryId" -> "",
)
val browser = JsoupBrowser()
val doc = browser.post(url, formData)
println(doc)
Output:
Request Verification Code: x68MuhunwLyyHNxZjjVJzqf-VAIAiCNtyfNDlBabjyIA7rO1XtURuEfYhiryMOqKnZGfR-oPP4DmzU0Ju3Ed3ULnhtBnLP9GE8tD-MxLNKU1
JsoupDocument(<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Notary, Kay Coles James, Commonwealth, Secretary of the Commonwealth, Glenn Youngkin, governor, virginia, VA">
<meta name="robots" content="index,follow">
<meta name="author" content="[email protected]">
<title>500 - Error</title>
What am I doing wrong?
Any help is appreciated.