corpus() function in quanteda doesn't work because of kwic objects

I'm working on a big data project that analyzes press URLs to detect the most popular topics. My topic is football (the Mbappe contract), and I collected 180 URLs from Marca, a Spanish mass-media outlet, in a .txt file.

When I try to create a document matrix with the corpus() function from the quanteda package, I get this: Error: corpus() only works on character, corpus, Corpus, data.frame, kwic objects.

I think some URLs contain a kwic object (maybe a video, adverts...) that prevents me from working with plain text only. My guess is that when I select the HTML div with class = body, it automatically picks up these kwic objects.

Here is the code I use to read the URLs:

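library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_replace_all()

url_marca <- read.table("mbappe.txt", stringsAsFactors = FALSE)$V1

# Download one page and return the article body text without line breaks
get_marca_text <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("div.ue-c-article__body") %>%
    html_text() %>%
    str_replace_all("[\r\n]", "")
}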

text_marca_mbappe <- sapply(url_marca,get_marca_text)
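
One thing that may be worth checking (this is an assumption, not confirmed against these URLs): sapply() only simplifies to a character vector when every call returns a result of the same length. If the selector matches zero nodes on some pages or several on others, the result is a list, and corpus() lists no list input among the types it accepts. The minimal sketch below reproduces that behaviour offline; fake_scrape is a hypothetical stand-in for get_marca_text.

```r
# Hypothetical stand-in for the scraper: returns n strings, the way
# html_text() returns one string per matched node (possibly 0 or 2+).
fake_scrape <- function(n) rep("some text", n)

# Results of length 1, 0 and 2: sapply() cannot simplify, so it returns a list
uneven <- sapply(c(a = 1, b = 0, c = 2), fake_scrape)
class(uneven)    # "list"

# Collapsing each result to exactly one string yields a character vector
one_each <- vapply(uneven, paste, character(1), collapse = " ")
class(one_each)  # "character"
```

Running class(text_marca_mbappe) and table(lengths(text_marca_mbappe)) on the real data would show whether this is what is happening.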

Does anyone know whether this is caused by a mistake in html_nodes when inspecting the URL, or is it something different?