Within a file, I would like to use grep, or perhaps the rm_between function from the qdapRegex package, to extract a whole section of HTML code containing a keyword, let's say "discount rate" for this example. Specifically, I want results that look like these code snippets:
<P>This is a paragraph containing the words discount rate including other things.</P>
and
<TABLE width="400">
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>Discount Rate</td>
<td>10.0%</td>
</tr>
<tr>
<td>February</td>
<td>$80</td>
</tr>
</TABLE>
- The trick here is that it must find "discount rate" first and then pull out the rest of the enclosing section.
- The section is always going to be between <P> and </P> or <TABLE and </TABLE>, and no other HTML tags.
A good sample .txt file for this can be found here:
https://www.sec.gov/Archives/edgar/data/66740/0000897101-04-000425.txt
You can treat the file as HTML and explore it as if you were scraping it with rvest.
For the <table> tags, you would not remove the first match:
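As a minimal sketch of the idea of treating the file as HTML, assuming rvest 1.0.0 or later is installed: parse the document, collect every <p> and <table> element, and keep only those whose text contains the keyword. The helper name extract_sections and the small inline document are hypothetical and stand in for the real filing. The pattern uses \\s+ between the two words on the assumption that they may be separated by a line break in the source.

```r
library(rvest)  # provides read_html(), html_elements(), html_text()

# Hypothetical helper: return the full markup of every <p> or <table>
# element whose text matches `pattern` (a regular expression,
# matched case-insensitively).
extract_sections <- function(doc, pattern, tags = c("p", "table")) {
  # Builds "//p | //table". The HTML parser lowercases tag names,
  # so <P> and <TABLE> in the source are found as well.
  xpath <- paste0("//", tags, collapse = " | ")
  nodes <- html_elements(doc, xpath = xpath)
  hits  <- grepl(pattern, html_text(nodes), ignore.case = TRUE)
  # as.character() serializes each matching node back to markup
  as.character(nodes[hits])
}

# Small inline document standing in for the downloaded filing
doc <- read_html('<html><body>
<P>This is a paragraph containing the words discount rate including other things.</P>
<P>An unrelated paragraph.</P>
<TABLE width="400"><tr><td>Discount Rate</td><td>10.0%</td></tr></TABLE>
</body></html>')

sections <- extract_sections(doc, "discount\\s+rate")
cat(sections, sep = "\n\n")
```

For the real file, you would pass a local copy of the .txt file to read_html() instead of the inline string. Note that if a matching <p> sits inside a matching <table>, this sketch returns both, so some de-duplication may be needed.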