I am writing a crawler in Java that examines an IMDb movie page and extracts some info such as the name, year, etc. The user types (or copy/pastes) the link to the title, and my program should do the rest.
After examining the HTML source of several IMDb pages and reading up on how crawlers work, I managed to write some code.
The info I get (for example title) is in my mother tongue. If there is no info in my mother tongue I get the original title. What I want is to get the title in a specific language of my choosing.
I'm fairly new to this, so correct me if I'm wrong, but I think I get the results in my mother tongue because IMDb "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?
edit: The program crawls like this: it takes the URL path as a String, converts it to a URL, reads the whole source with a BufferedReader, and inspects what it gets. I'm not sure that is the right way to do it, but it works (minus the language problem). Code:
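import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(uc.getInputStream(), "UTF-8"))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Print any line that contains the page's <title> tag
            if (inputLine.contains("<title>")) System.out.println(inputLine);
        }
    }
    return info;
}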
This code goes through a page and prints the main title to the console.
Take a look at the request headers your crawler sends. My browser's request contains
Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
so I get the title in French.
EDIT: I checked with the ModifyHeaders add-on in Google Chrome, and setting the value to en-US gets me the English title for the movie =)
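In Java, the same header can be set on the URLConnection before the page is read. The following is a minimal sketch, not a tested IMDb client: the class name EnglishTitleCrawler and the helpers openWithLanguage and printTitleLines are made up for illustration, and it assumes the site treats the header from a Java client the same way it did in the browser test above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this example.
class EnglishTitleCrawler {

    // Creates the connection object and sets the Accept-Language request header.
    // Nothing is sent yet; the request goes out when getInputStream() is called.
    static URLConnection openWithLanguage(String urlPath, String languageTag) throws IOException {
        URLConnection uc = URI.create(urlPath).toURL().openConnection();
        uc.setRequestProperty("Accept-Language", languageTag);
        return uc;
    }

    // Same loop as in the question, but asking for en-US content.
    static void printTitleLines(String urlPath) throws IOException {
        URLConnection uc = openWithLanguage(urlPath, "en-US");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("<title>")) System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the movie page URL as the first command-line argument.
        printTitleLines(args[0]);
    }
}
```

The header has to be set before getInputStream() is called, because that call is what sends the request. In the crawlUrl method from the question, the equivalent change would be a single setRequestProperty call right after url.openConnection().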