Extract text from specific HTML location across multiple pages

I have been experimenting with Jericho HTML Parser and Selenium IDE to extract text from a specific location in the HTML of multiple pages.

I have not found a simple example of how to do this, and I don't know Java.

For every HTML page in a folder, I would like to find whatever text is in the 1st table, 4th row, 1st div:

<table>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>
 <tr class="abc"><td class="xyz"><div align="center">The Text I don't want</div></td></tr>    
 <tr class="abc"><td class="xyz"><div align="center">The Text I want</div></td></tr>
</table>

Then write the selected text to a txt file as a list like this:

    The Text I want
    Another Text I want

All the source files are stored locally and may contain bad HTML, so I figured Jericho might be best for this purpose. However, I'm happy to learn any method that achieves the desired result.

There is 1 solution below

Well, in the end I went with BeautifulSoup and used a Python script along these lines:

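from bs4 import BeautifulSoup

# html_pathname and result_file (a file opened for writing) are set elsewhere in the script
# open source html file
with open(html_pathname, 'r') as html_file:
    # parse the file with BeautifulSoup to get the html tag tree
    soup = BeautifulSoup(html_file, 'html.parser')
# find according to your criteria "1st table, 6th tr, 1st td, 1st div"
# ([4] among the siblings after the first tr is the 6th tr; [2] would be the 4th)
div = soup.html.body.table.tr.find_next_siblings('tr')[4].td.div
# write found text to result txt
print(' - writing to result txt')
result_file.write(div.get_text() + '\n')
print(' - ok!')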
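
The snippet above handles a single file. For the "all HTML pages in a folder" part of the question, a minimal sketch could look like the following. It assumes the bs4 package is installed, that the pages end in .html, and that they are readable as UTF-8; the function names extract_cell_text and extract_folder and the row_index parameter are made up for illustration and are not from my original script.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_cell_text(html, row_index=3):
    """Return the text of the first div in one row of the first table.

    row_index is zero-based, so 3 selects the 4th row. Returns None when
    the table, the row, or the div is missing from the page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return None
    # Note: find_all also descends into nested tables, if the pages have any.
    rows = table.find_all('tr')
    if len(rows) <= row_index:
        return None
    div = rows[row_index].find('div')
    if div is None:
        return None
    return div.get_text(strip=True)


def extract_folder(folder, result_pathname, row_index=3):
    """Write one extracted line per .html file in folder to result_pathname."""
    found = []
    for path in sorted(Path(folder).glob('*.html')):
        # errors='replace' keeps going when a page has stray bytes (assumption:
        # the pages are mostly UTF-8).
        html = path.read_text(encoding='utf-8', errors='replace')
        text = extract_cell_text(html, row_index)
        if text is not None:
            found.append(text)
    Path(result_pathname).write_text(
        ''.join(line + '\n' for line in found), encoding='utf-8')
    return found
```

Calling extract_folder('pages', 'result.txt') would then process every .html file in a (hypothetical) pages folder in name order and skip any page that lacks the table, the row, or the div instead of raising an error.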