Could someone explain how to scrape the content of the <td> tags in the row whose <th> has the content "Row1 title" (in this case I actually need the content of the <b> tag for the matching), but without scraping the <th> tag (or any of its content) in the process? Here is my test HTML:
<table class="table_class">
<tbody>
<tr>
<th>
<b>
Row1 title
</b>
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
<tr>
<th>
Row2 title
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
</tbody>
</table>
The data I want to extract should come from these tags:
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
I have managed to create a function which returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?
With enlive, a selector that matches the td nodes should give you a sequence of all of them, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware that enlive gives you a way to get the content of those nodes directly, though I could be wrong. You could then extract the content of the sequence with something along the lines of
(for [line ws-content] (apply str (:content line)))
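As a rough illustration of that extraction step, here is a minimal sketch that runs on hand-written sample data rather than a real page. The name td-nodes is hypothetical and stands in for ws-content. It assumes the td selection (for instance something like (html/select page [:table.table_class :td]), with html aliasing net.cgrand.enlive-html and page being your parsed document) returns maps of the shape described above.

```clojure
;; Hypothetical sample of the node maps a td selection might return.
;; In real code this would come from enlive, not be written by hand.
(def td-nodes
  [{:tag :td :attrs nil :content ["2.660.784"]}
   {:tag :td :attrs nil :content ["2.944.552"]}
   {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}])

;; Same form as above: join the :content of every node into one string.
(def td-texts
  (for [line td-nodes]
    (apply str (:content line))))

;; td-texts => ("2.660.784" "2.944.552" "Correct, has 3 td elements")
```

Note that (apply str (:content line)) only works cleanly when :content holds plain strings. If a td contains nested tags, you would need to walk the children instead.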
In regard to the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function so that it returns ::IgnoreNode for all nodes except :td, then it just gives you a sequence of the content of the :td nodes, which is probably close to what you want. Let me know if you need more help.
EDIT (in reply to comments below): I don't think selecting nodes based on their :content is possible with enlive alone, but you can certainly do so with Clojure. For example, filtering the sequence on (:content line) with ordinary sequence functions could work (you might have to tweak the (:content line) form a little).
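To make the filtering idea concrete, here is a sketch in plain Clojure that keeps only the td text from the row whose th text matches a given title. It is not the code from the original answer. The helper names node-text and tds-for-title and the hand-written rows data are all hypothetical, and the data only mimics the node shape enlive produces for the test HTML in the question.

```clojure
(require '[clojure.string :as str])

(defn node-text
  "Recursively concatenates all string content of an enlive-style node."
  [node]
  (if (string? node)
    node
    (apply str (map node-text (:content node)))))

(defn tds-for-title
  "Returns the trimmed text of every td in the rows whose th text equals title."
  [rows title]
  (for [row rows
        ;; drop whitespace strings that sit between the child elements
        :let [children (filter map? (:content row))
              th (first (filter #(= :th (:tag %)) children))]
        :when (and th (= title (str/trim (node-text th))))
        td children
        :when (= :td (:tag td))]
    (str/trim (node-text td))))

;; Hypothetical sample mirroring the two rows of the test HTML.
(def rows
  [{:tag :tr :attrs nil
    :content ["\n"
              {:tag :th :attrs nil
               :content ["\n" {:tag :b :attrs nil :content ["\nRow1 title\n"]} "\n"]}
              {:tag :td :attrs nil :content ["2.660.784"]}
              {:tag :td :attrs nil :content ["2.944.552"]}
              {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}]}
   {:tag :tr :attrs nil
    :content [{:tag :th :attrs nil :content ["\nRow2 title\n"]}
              {:tag :td :attrs nil :content ["2.660.784"]}
              {:tag :td :attrs nil :content ["2.944.552"]}
              {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}]}])

(tds-for-title rows "Row1 title")
;; => ("2.660.784" "2.944.552" "Correct, has 3 td elements")
```

Because node-text descends into nested nodes, the match works whether the title sits directly in the th or inside a b tag, and the th itself never ends up in the result.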