set the Accept-Language header for crawler in java

2.3k Views Asked by At

I'd like to find the correct method to set the Accept-Language header for my crawler? I read other related answers like Getting imdb movie titles in a specific language and How to set Accept-Language header on request from applet but they didn't work for me (I get this error: "the method is undefined for type connection" Here is part of code:

String baseUrl = "http://www.imdb.com/search/title?at=0&count=250";

org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");

Please help me, I am really new to java.

Thanks

2

There are 2 best solutions below

0
On BEST ANSWER

In JSoup, you use the header method to set request headers. So the last line of your code will become this. I've just added line breaks for readability.

org.jsoup.Connection con = Jsoup
     .connect(baseUrl)
     .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
     .header("Accept-Language", /* Put your language here */);

For example, to accept English, you'd write "en" in place of that last comment.

0
On

The Accept-Language request-header field is similar to Accept, but restricts the set of natural languages that are preferred as a response to the request. Language Tags here

   Accept-Language = "Accept-Language" ":"
                     1#( language-range [ ";" "q" "=" qvalue ] )
   language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )

Each language-range MAY be given an associated quality value which represents an estimate of the user's preference for the languages specified by that range. The quality value defaults to "q=1". For example,

   Accept-Language: da, en-gb;q=0.8, en;q=0.7

would mean: "I prefer Danish, but will accept British English and other types of English." A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first tag character following the prefix is "-". The special range "*", if present in the Accept-Language field, matches every tag not matched by any other range present in the Accept-Language field.

  Note: This use of a prefix matching rule does not imply that
  language tags are assigned to languages in such a way that it is
  always true that if a user understands a language with a certain
  tag, then this user will also understand all languages with tags
  for which this tag is a prefix. The prefix rule simply allows the
  use of prefix tags if this is the case.

The language quality factor assigned to a language-tag by the Accept-Language field is the quality value of the longest language- range in the field that matches the language-tag. If no language- range in the field matches the tag, the language quality factor assigned is 0. If no Accept-Language header is present in the request, the server

SHOULD assume that all languages are equally acceptable. If an Accept-Language header is present, then all languages which are assigned a quality factor greater than 0 are acceptable.

It might be contrary to the privacy expectations of the user to send an Accept-Language header with the complete linguistic preferences of the user in every request.

As intelligibility is highly dependent on the individual user, it is recommended that client applications make the choice of linguistic preference available to the user. If the choice is not made available, then the Accept-Language header field MUST NOT be given in the request.

Note: When making the choice of linguistic preference available to the user, we remind implementors of the fact that users are not familiar with the details of language matching as described above, and should provide appropriate guidance. As an example, users might assume that on selecting "en-gb", they will be served any kind of English document if British English is not available. A user agent might suggest in such a case to add "en" to get the best matching behavior.

Example:

connection.setRequestProperty("Accept-Language","<!-- Depends on Language you want -->");

Hope that helps!

Sources: