I have a data scraper in Ruby that retrieves article data.
Another dev on my team needs my scraper to spin up a web server he can make a request to, so that he can import the data into a Node application he's built.
Being a junior, I do not understand the following:
a) Is there a proper convention in Rails that tells me where to place my scraper.rb file?
b) Once that file is properly placed, how would I get the server to accept connections and return the scraped data?
c) What (functionally) is the relationship between ports, sockets, and routing?
I understand this may be a "rookie question" but I honestly don't know.
Can someone please BREAK THIS DOWN.
I have already:
i) Set up a server.rb file listening on localhost:2000, but I'm not sure how to create a proper route or connection that lets someone use Postman to hit a valid route and get my data.
require 'socket'
require 'mechanize'
require 'awesome_print'

port = ENV.fetch("PORT", 2000).to_i
server = TCPServer.new(port)
puts "Listening on port #{port}..."
puts "Current Time : #{Time.now}"

loop do
  # Block until a client connects, then scrape every site before replying.
  client = server.accept
  client.puts "= Running Web Server ="

  general_sites = [
    "https://www.lovebscott.com/",
    "https://bleacherreport.com/",
    "https://balleralert.com/",
    "https://peopleofcolorintech.com/",
    "https://afrotech.com/",
    "https://bossip.com/",
    "https://www.itsonsitetv.com/",
    "https://theshaderoom.com/",
    "https://shadowandact.com/",
    "https://hollywoodunlocked.com/",
    "https://www.essence.com/",
    "http://karencivil.com/",
    "https://www.revolt.tv/"
  ]

  holder = []
  agent = Mechanize.new

  general_sites.each do |site|
    page = agent.get(site)
    # Keep every href longer than 50 characters (a rough filter for article links).
    page.search('a').each do |anchor|
      href = anchor.attr('href').to_s
      holder.push(href) if href.length > 50
    end
    pp "#{holder.length} [ posts total] ==> Now Scraping --> #{site}"
  end

  # Write one link per line and hang up. Note: this is a raw TCP payload,
  # not an HTTP response, so Postman has nothing it can parse.
  client.puts holder
  client.close
end
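That last point is the immediate Postman problem: the script never sends an HTTP status line or headers, so an HTTP client has nothing to parse. Purely as a sketch (reusing the holder array built by the scraping loop above), the reply would have to speak minimal HTTP, something like:

client = server.accept
client.gets                           # read and discard the request line, e.g. "GET / HTTP/1.1"
body = holder.join("\n")              # holder as built by the scraping loop above
client.print "HTTP/1.1 200 OK\r\n"
client.print "Content-Type: text/plain\r\n"
client.print "Content-Length: #{body.bytesize}\r\n"
client.print "\r\n"                   # blank line separates headers from body
client.print body
client.close

Doing that by hand gets tedious fast, and it is exactly the job that Rack and the application servers described below already do.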
In Rails you don't spin up a web server manually; that's done for you by rackup, Unicorn, Puma, or any other Rack-compatible application server.
Rails itself never "talks" to HTTP clients directly; it is just a specific application that exposes a Rack-compatible API: basically, an object that responds to call(hash) and returns [integer, hash, enumerable_of_strings]. The app server reads requests off Unix/TCP sockets and calls your application.
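To make that contract concrete, here is a minimal self-contained sketch of a Rack app (the HelloApp name is just for illustration; run it with rackup, which boots an app server on port 9292 by default):

# config.ru
class HelloApp
  # The whole Rack contract: take the env hash describing the request,
  # return [status, headers, enumerable of body strings].
  def call(env)
    [200, { "content-type" => "text/plain" }, ["You requested #{env['PATH_INFO']}\n"]]
  end
end

run HelloApp.new

The app server owns the listening socket and invokes HelloApp#call once per request; your code never touches ports or sockets directly.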
If you want to expose your scraper to an external consumer (provided it's fast enough), you can create a controller with an action that accepts some data, runs the scraper, and renders back the scraping results in some structured way. In the router you connect a URL to that controller action,
and then with a simple
POST yourserver/scrape/me?site=www.example.com
you will get back your data.
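A sketch of what that could look like (ScrapesController, the scrape/me path, and the ArticleScraper service object are illustrative names, not something Rails dictates; the Mechanize logic is lifted from the question's script, the mechanize gem is assumed to be in the Gemfile, and app/services/ is a common, if unofficial, home for plain Ruby classes like a scraper):

# config/routes.rb
Rails.application.routes.draw do
  post "/scrape/me", to: "scrapes#create"
end

# app/services/article_scraper.rb
class ArticleScraper
  def initialize(site)
    @site = site
  end

  # Same filter as the original script: keep hrefs longer than 50 characters.
  def call
    Mechanize.new.get(@site).search("a")
             .map { |anchor| anchor.attr("href").to_s }
             .select { |href| href.length > 50 }
  end
end

# app/controllers/scrapes_controller.rb
class ScrapesController < ActionController::API
  # ActionController::API skips CSRF checks, so Postman or the Node app
  # can POST without a token.
  def create
    links = ArticleScraper.new(params.require(:site)).call
    render json: { site: params[:site], links: links }
  end
end

With that in place, POST http://localhost:3000/scrape/me?site=https://www.example.com from Postman should come back as JSON, and the Node dev can consume the same endpoint.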