I would like to do some text analytics on the text from the following web page: https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html
I don't know how to convert this HTML to a tidy text object (one data frame row per line of text).
For example, just applying the html_text() function doesn't help:
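library(rvest)  # attaching rvest also makes the %>% pipe available

url <- "https://narodne-novine.nn.hr/clanci/sluzbeni/full/2007_07_79_2491.html"
p <- rvest::read_html(url, encoding = "UTF-8") %>%
  rvest::html_text()
p  # the whole page comes back as one long string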
This returns all of the text as a single string, so I don't get separate rows.
That site has very well-structured HTML: the headers and the body text of each section are given their own align attributes. We can use that to extract your text by section.
You'll need to double-check that this approach doesn't miss anything. Even if it does, it should be straightforward to expand upon the answer.
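A minimal sketch of the idea, not a tested solution for the real page. It assumes headers are `<p align="center">` and body text is `<p align="justify">`; those attribute values and the small inline HTML are stand-ins, so inspect the actual page source and adjust the selector and values. `html_elements()` needs rvest 1.0 or later (older versions call it `html_nodes()`).

```r
library(rvest)

# Hypothetical stand-in for the page; the real markup may differ.
html <- '<html><body>
<p align="center">Article 1</p>
<p align="justify">First paragraph of article 1.</p>
<p align="justify">Second paragraph of article 1.</p>
<p align="center">Article 2</p>
<p align="justify">Only paragraph of article 2.</p>
</body></html>'

page <- read_html(html)
# For the real page, something like: page <- read_html(url, encoding = "UTF-8")

# Keep every paragraph that carries an align attribute, in document order.
nodes <- html_elements(page, "p[align]")
align <- html_attr(nodes, "align")
text  <- trimws(html_text(nodes))

# Assumption: centered paragraphs are headers, everything else is body text.
is_header  <- align == "center"
section_id <- cumsum(is_header)   # 0 for any body text before the first header
headers    <- text[is_header]

# One row per body paragraph, labelled with the header it falls under.
sections <- data.frame(
  section = section_id[!is_header],
  header  = c(NA_character_, headers)[section_id[!is_header] + 1],
  text    = text[!is_header],
  stringsAsFactors = FALSE
)
sections
```

Body text that appears before the first header gets section 0 and an NA header, which makes it easy to spot anything the rule missed.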
You can get individual lines broken out using the above as well:
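Here is a rough sketch of one way to do that in base R. It assumes you already have one row per section and that lines within a section are separated by newline characters; the `sections` data frame below is made-up sample data, not output from the real page.

```r
# Hypothetical input: one row per section, text may contain line breaks.
sections <- data.frame(
  header = c("Article 1", "Article 2"),
  text   = c("First line.\nSecond line.", "Only line."),
  stringsAsFactors = FALSE
)

# Split each section's text on newlines; the result is a list with one
# character vector per section.
pieces <- strsplit(sections$text, "\n", fixed = TRUE)

# Repeat each header once per line so the two columns stay aligned.
lines_df <- data.frame(
  header = rep(sections$header, lengths(pieces)),
  line   = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)

# Drop lines that are empty after trimming.
lines_df <- lines_df[nzchar(lines_df$line), , drop = FALSE]
lines_df
```

If the real text uses a different separator (for example `<br>` tags that were dropped by `html_text()`), the split pattern would need to change accordingly.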
The tidytext package has examples of how to perform further cleanup transformations to facilitate text mining.
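As a small illustration of that next step, here is a sketch that tokenizes lines into one word per row with `tidytext::unnest_tokens()`. The `lines_df` data frame and its column names are invented sample data; swap in whatever your extraction produced.

```r
library(tidytext)

# Hypothetical input: one row per line of text.
lines_df <- data.frame(
  line_id = 1:2,
  line    = c("First line of text.", "Second line."),
  stringsAsFactors = FALSE
)

# One row per word; by default tokens are lowercased and punctuation is dropped.
words <- unnest_tokens(lines_df, word, line)
words
```

From there you can apply the usual tidy tools, such as counting words or removing stop words. Note that tidytext's built-in stop word list is English, so Croatian text would need its own list.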