I tried to create a function that scrapes <h3> and <table> tags from an HTML page whose URL I pass in, and this works as it should: I get a sequence of <h3> and <table> elements. But when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags I get (nil nil nil ...).
Could you please help me resolve this issue, or explain what I am doing wrong?
Here is the code:
(ns Test2
  (:require [net.cgrand.enlive-html :as html])
  (:require [clojure.string :as string]))

(defn get-page
  "Gets the html page from passed url"
  [url]
  (html/html-resource (java.net.URL. url)))

(defn h3+table
  "returns sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))

(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
This line gives me a headache:
(html/select (h3+table url) [:table])
Could you please tell me what I am doing wrong?
Just to clarify my question: is it possible to use enlive's select function to extract only the table tags from the result of (h3+table url)?
As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying
(html/select raw-html selectors)
on the raw HTML. It seems you are trying to apply html/select multiple times, but this doesn't work: html/select parses the HTML into a Clojure data structure, so you can't apply it to that data structure again.
I found that parsing the website was actually a little involved, but I thought it might be a nice use case for multimethods, so I hacked something together. Maybe this will get you started:
(The code is ugly here; you can also check out this gist.)
A few words on what's going on:
content->string takes a data structure, collects its content into a string, and returns it, so you can apply it to content that may still contain nested subtags (like <br/>) that you want to ignore.
The derive statements establish an ad hoc hierarchy which we will later use in the multimethod parse-node. This is handy because we never quite know which data structures we're going to encounter, and we could easily add more cases later on.
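To make the content->string idea above concrete, here is a minimal sketch. It is not the code from the gist. It assumes that nodes are plain maps with :tag, :attrs and :content keys (the shape enlive produces) and that text appears as ordinary strings; the helper name text-of is made up for this illustration.

```clojure
(require '[clojure.string :as string])

(defn- text-of
  "Hypothetical helper: concatenates every string found in x, where x is a
  string, a node map, or a sequence of either. Tags themselves contribute
  nothing, so an empty <br/> node is simply skipped."
  [x]
  (cond
    (string? x)     x
    (map? x)        (text-of (:content x))
    (sequential? x) (apply str (map text-of x))
    :else           ""))            ; nil content, e.g. an empty <br/> node

(defn content->string
  "Collects the text inside `content` into one string, trimmed once at the end."
  [content]
  (string/trim (text-of content)))

;; A value followed by a <br/>, as it might sit in the :content of a <td>
(content->string [" 12.5 " {:tag :br :attrs nil :content nil}])
;; => "12.5"
```

Trimming only once, at the top level, keeps the whitespace between nested pieces of text intact.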
The tag-type function is actually a hack that mimics the hierarchy statements. AFAIK you can't create a hierarchy out of non-namespace-qualified keywords, so I did it like this.
The multimethod parse-node dispatches on the class of the node and, if the node is a map, additionally on the tag-type.
Now all we have to do is define the appropriate methods: if we're at a terminal node, we convert the contents to a string; otherwise we either recur on the content or map the parse-node function over the collection we're dealing with. The method for ::String is actually not even used, but I left it in for safety.
The h3+table function is pretty much what you had before. I simplified the selectors a bit and put them into a set; I'm not sure whether putting them into a map, as you did, works as intended.
Happy scraping!
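For readers without access to the gist, the dispatch scheme described above (derive, tag-type, parse-node) could be sketched roughly as follows. This is a reconstruction under assumptions, not the original code: the grouping of tags into ::terminal and ::container, the sample nodes, and the simplified handling of terminal nodes (only direct string children are kept) are all made up for illustration. Nodes are assumed to be maps with :tag, :attrs and :content keys.

```clojure
(require '[clojure.string :as string])

;; Ad hoc hierarchy: file the concrete classes under namespaced keywords.
(derive clojure.lang.IPersistentMap ::Map)
(derive clojure.lang.Sequential     ::Collection)
(derive java.lang.String            ::String)

(defn tag-type
  "Stands in for a hierarchy over tags: plain keywords such as :td have no
  namespace and cannot be passed to derive, so they are grouped by hand.
  The grouping itself is an assumption."
  [node]
  (case (:tag node)
    (:h3 :td :th)              ::terminal
    (:table :tbody :thead :tr) ::container
    ::unknown))

(defmulti parse-node
  "Dispatches on the class of the node and, for maps, also on the tag-type."
  (fn [node]
    (if (map? node)
      [(class node) (tag-type node)]
      (class node))))

;; Terminal node: keep its direct text, ignoring nested tags such as <br/>.
(defmethod parse-node [::Map ::terminal] [node]
  (string/trim (apply str (filter string? (:content node)))))

;; Container node: recur on its content.
(defmethod parse-node [::Map ::container] [node]
  (parse-node (:content node)))

;; Collection: parse each element and drop whatever yields nil.
(defmethod parse-node ::Collection [nodes]
  (vec (keep parse-node nodes)))

;; Bare strings between tags are usually just whitespace.
(defmethod parse-node ::String [s]
  (when-not (string/blank? s) (string/trim s)))

;; Anything else (unknown tags, nil) is ignored.
(defmethod parse-node :default [_] nil)

;; Made-up nodes in the shape enlive returns for an <h3> followed by a <table>.
(def sample-nodes
  [{:tag :h3 :attrs nil :content ["Basic data"]}
   {:tag :table :attrs nil
    :content ["\n"
              {:tag :tr :attrs nil
               :content [{:tag :td :attrs nil :content ["Symbol"]}
                         {:tag :td :attrs nil
                          :content ["VZAS" {:tag :br :attrs nil :content nil}]}]}
              "\n"]}])

(parse-node sample-nodes)
;; => ["Basic data" [["Symbol" "VZAS"]]]
```

Because the methods are keyed on ::Map, ::Collection and ::String rather than on concrete classes, the same methods would cover vectors, lists and lazy sequences, and supporting a new kind of tag only means extending tag-type or adding one more defmethod.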