I am new to Java and HtmlUnit and am trying to scrape news updates from a page that loads them through AJAX calls. Whatever I do, the updates never get loaded. What am I missing?
I have tried several ways of waiting for the JavaScript to finish, but to no avail. Clicking the button that loads more news, or firing its events, didn't seem to help either (see the sketch below).
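For reference, the waiting and clicking variants I tried look roughly like the fragment below. It slots in after the getPage call in the code further down; the XPath for the "show more" button is only a placeholder, not the real selector on the page.

// Variants of waiting for the background JavaScript to finish:
webClient.waitForBackgroundJavaScript(10 * 1000);
webClient.waitForBackgroundJavaScriptStartingBefore(10 * 1000);

// Polling until the news items appear (or giving up after ~10 s):
for (int i = 0; i < 20 && page.getByXPath("//ul[@id='news-feed']/li").isEmpty(); i++) {
    Thread.sleep(500);
}

// Clicking the "show more" button, or firing its click event directly:
HtmlElement showMore = page.getFirstByXPath("//button[contains(@class, 'show-more')]"); // placeholder XPath
if (showMore != null) {
    page = showMore.click();
    // or: showMore.fireEvent("click");
}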
I've been working under the assumption that I don't need to reassign my page instance after the JS scripts have finished. Is that right?
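To make clear what I mean by reassigning, this is the kind of thing I have not been doing (just a sketch, using the same webClient and page variables as in the code below):

// After waiting, pull the (possibly updated) page back out of the current window.
webClient.waitForBackgroundJavaScript(10 * 1000);
page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();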
I've also read that HtmlUnit's JavaScript engine doesn't work too well with some websites. Is that the case here, or am I simply missing something?
Thanks for your help!
Here's my code:
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
import java.util.List;
import org.junit.Assert;
public class ProblemDemo {

    public static void main(String[] args) throws IOException, InterruptedException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setTimeout(10000);
        webClient.setJavaScriptTimeout(10000);
        webClient.getOptions().setJavaScriptEnabled(true);

        // Login procedure
        HtmlPage page = webClient.getPage("https://login.xing.com/login");
        final HtmlForm form = (HtmlForm) page.getElementById("login-form");
        final HtmlInput userID = form.getInputByName("login_form[username]");
        final HtmlInput password = form.getInputByName("login_form[password]");
        final HtmlButton submit = form.getButtonByName("button");
        final HtmlInput remember = form.getInputByName("login_form[perm]");
        userID.setValueAttribute("user");
        password.setValueAttribute("pass");
        remember.setChecked(true);
        page = submit.click();
        Assert.assertEquals("Start | XING", page.getTitleText());

        // Navigate to the page to be scraped
        page = webClient.getPage("https://www.xing.com/companies/deutschepostag/updates");
        webClient.waitForBackgroundJavaScript(10 * 1000);
        System.out.println(page.getUrl().toString());
        System.out.println(page.asXml());

        // Print number of employees (works, not dynamic)
        HtmlElement result = page.getFirstByXPath("//div[@id='profile-nav-tabs']"
                + "/ul/li[@id='employees-tab']/a");
        System.out.println("Employees: " + result.getTextContent());

        // Print news (doesn't work)
        List<HtmlElement> results = (List<HtmlElement>) page.getByXPath("//div"
                + "[@id='company-updates']/ul[@id='news-feed']/li/div"
                + "[@class='activity-content']");
        System.out.println("News found: " + results.size());
        for (HtmlElement item : results) {
            System.out.println(" NEW ITEM");
            System.out.println(item.getTextContent());
        }
    }
}
Also, is the following warning relevant? Since HtmlUnit generates tons of JS warnings, I'm not really sure which ones are important and which ones aren't.
WARNING: Obsolete content type encountered: 'text/javascript'.
Setting setThrowExceptionOnScriptError to false prevents you from seeing errors.

EDIT: The latest snapshot contains a fix for performance.navigation.redirectCount. Please try it and revert.
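A minimal sketch of that suggestion, i.e. turning script errors back on while debugging so they actually surface:

// Re-enable script error exceptions while debugging.
webClient.getOptions().setThrowExceptionOnScriptError(true);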