I am working on a pet project that scrapes data from websites. The application is written entirely in Java and runs for several hours at a time, scraping data from web pages.
Because of this, my IP address has been blocked by several websites on many occasions, which is why I am trying to access the websites through the Tor network.
I used the code from this Stack Overflow link to run the Tor service from Orchid.
After starting the Tor service, I use PhantomJS (v2.1) to scrape the websites. I run it as:
phantomjs --proxy-type=socks5 --proxy=127.0.0.1:9150 script.js
(The Tor service is listening on port 9150.)
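script.js contains:
var page = require('webpage').create();
var fs = require('fs');

// Once loading finishes, write the rendered HTML to disk and quit.
page.onLoadFinished = function(status) {
    if (status !== 'success') {
        // Report a failed load instead of silently writing an empty page.
        console.log('Page failed to load: ' + status);
    }
    fs.write('FILE_LOCATION', page.content, 'w');
    phantom.exit();
};

// Register the handler above before starting the request.
page.open("WEBSITE_ADDRESS");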
Now here is the problem. When I run PhantomJS this way, it returns almost immediately and writes an empty HTML file. But when I do the same with tor.exe (i.e. start tor.exe and then run PhantomJS with the same command as above), it works perfectly, for both HTTP and HTTPS. With Orchid running, neither HTTP nor HTTPS works.
One more thing: when I connect to a website from the Java class that starts the Tor service (using HttpURLConnection), I am able to access both HTTP and HTTPS websites. The Tor service itself is working, since I get a new IP address each time I visit whatismyip.com.
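For reference, the HttpURLConnection check described above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the project: the class name TorProxyCheck, the helper names torProxy and fetch, and the 30-second timeouts are assumptions made for the example, and it assumes a SOCKS listener is already running on 127.0.0.1:9150 as in the PhantomJS command.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
class TorProxyCheck {

    // Assumed to match the SOCKS listener used in the PhantomJS command.
    static final String SOCKS_HOST = "127.0.0.1";
    static final int SOCKS_PORT = 9150;

    // Builds a java.net.Proxy that points at the local SOCKS listener.
    static Proxy torProxy() {
        return new Proxy(Proxy.Type.SOCKS,
                new InetSocketAddress(SOCKS_HOST, SOCKS_PORT));
    }

    // Fetches a page through the SOCKS proxy and returns its body as text.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection(torProxy());
        conn.setConnectTimeout(30_000); // assumed timeout values
        conn.setReadTimeout(30_000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass a URL as the first argument; defaults to the IP-check site.
        String target = args.length > 0 ? args[0] : "http://whatismyip.com";
        System.out.println(fetch(target));
    }
}
```

Running this a few times against an IP-check page is one way to confirm that requests are actually leaving through Tor before involving PhantomJS at all.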
The reason I am not using Java-based web scraping libraries like jsoup is that the websites I am scraping rely heavily on JavaScript. With those libraries I always end up with an incomplete page, which is not the case with PhantomJS. I also don't want to keep using tor.exe to run the Tor service, because it makes the project heavy and I cannot control tor.exe completely from Java.
Please help me with this.
After a long struggle to get PhantomJS working with SOCKS (Tor), I finally decided to give up on that executable. This is a known issue with PhantomJS.
Instead, I am now using JBrowserDriver for web scraping. It works like a charm with the Orchid Tor service, and now that everything is in Java, I am able to control everything.
One more point: JBrowserDriver uses multiple threads while downloading a page, and it seems to be faster than PhantomJS.
Thanks, all, for your efforts.