For my first post, I'd like to share my pet project: a machine learning algorithm that assigns buy/sell/hold positions to securities. The first step is to build a data frame containing each security's basic information along with relevant predictive indicators. I am using rvest to scrape data from two websites that provide stock information. Below is my code:
library(rvest)

# load all variables of interest
for (i in 1:nrow(stockdata)) {
  ticker <- stockdata[i, 1]
  # price (from Nasdaq); note paste0() has no sep argument
  url <- paste0('https://www.nasdaq.com/symbol/', tolower(ticker))
  html <- read_html(url)
  # select the text I want
  Price <- html_nodes(html, '#qwidget_lastsale')
  stockdata$Price[i] <- html_text(Price)
  # price change percentage (from Finviz)
  url <- paste0('https://finviz.com/quote.ashx?t=', ticker)
  html <- read_html(url)
  # select the text I want
  change <- html_nodes(html, '.table-dark-row:nth-child(12) .snapshot-td2:nth-child(12) b')
  stockdata$PriceChange[i] <- html_text(change)
}
I have truncated the code, but the above works for pulling the data. Unfortunately, the process is horrifically slow, and I have many more variables to pull, each of which slows it down further. I have a decent grasp of vectorization for speeding things up, but I'm not sure how to apply it here. Any tips on making this execute faster, or general advice on speedier iteration, would be greatly appreciated.
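One restructuring I've been considering (a sketch only, not tested against the live sites): since each `read_html()` call is a network round trip, fetch each page once per ticker and extract every field from that single parse, rather than re-fetching per variable. The selectors are the ones from my code above; the `scrape_one` helper and `results` names are just for illustration.

```r
library(rvest)

# Fetch each site once per ticker and pull all needed fields from one parse.
scrape_one <- function(ticker) {
  nasdaq <- read_html(paste0('https://www.nasdaq.com/symbol/', tolower(ticker)))
  finviz <- read_html(paste0('https://finviz.com/quote.ashx?t=', ticker))
  data.frame(
    Symbol      = ticker,
    Price       = html_text(html_nodes(nasdaq, '#qwidget_lastsale')),
    PriceChange = html_text(html_nodes(finviz,
                    '.table-dark-row:nth-child(12) .snapshot-td2:nth-child(12) b')),
    stringsAsFactors = FALSE
  )
}

# lapply + do.call(rbind, ...) replaces growing the data frame in place.
results <- do.call(rbind, lapply(stockdata[, 1], scrape_one))
```

Since the bottleneck is presumably the network rather than the loop itself, I assume the bigger win would come from fewer fetches (or parallelizing them) than from vectorizing the R code.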