I have been experimenting with Jericho HTML Parser and Selenium IDE to extract text from a specific location in the HTML of multiple pages.
I have not found a simple example of how to do this, and I don't know Java.
For every HTML page in a folder, I would like to extract whatever text is in the 1st div of the 4th row of the 1st table:
<table>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
<tr class="abc"><td class="xyz"><div align="center">The Text I want</div></td></tr>
</table>
And print the selected text to a txt file in a list like this:
The Text I want
Another Text I want
All the source files are stored locally and may contain bad HTML, so I figured Jericho might be best for this purpose. However, I'm happy to learn any method that achieves the desired result.
Well, in the end I went with BeautifulSoup and used a Python script with something like this:
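The script itself is not reproduced here. A minimal sketch of that approach, not the original code, might look like the following. It assumes the `beautifulsoup4` package is installed, that the files end in `.html`, and that the first table has no nested tables (nested rows would shift the row count). The function names `extract_cell_text` and `extract_folder` are made up for illustration.

```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_cell_text(html, row_index=3):
    """Return the text of the first <div> in one row of the first table.

    row_index is zero-based, so 3 means the 4th row. Returns None when
    the table, the row, or the div is missing.
    """
    # "html.parser" ships with Python and copes with sloppy markup.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return None
    rows = table.find_all("tr")
    if len(rows) <= row_index:
        return None
    div = rows[row_index].find("div")
    return div.get_text(strip=True) if div is not None else None


def extract_folder(folder, out_file):
    """Write one extracted string per line for every .html file in folder."""
    results = []
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps going on files with odd encodings (assumption)
        html = path.read_text(encoding="utf-8", errors="replace")
        text = extract_cell_text(html)
        if text:
            results.append(text)
    Path(out_file).write_text("".join(t + "\n" for t in results), encoding="utf-8")
    return results
```

Calling `extract_folder("pages", "output.txt")` (both names are placeholders) would then produce the text file with one extracted string per line. Pages where the table, row, or div is missing are skipped rather than raising an error.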