How to use unicode with enlive for web-scraping

Asked by pooya72 on 17 May 2012 at 17:46

I'm trying to scrape a few sites that require unicode support. For example, I'm trying to get the title of this book, but it returns jumbled characters:

(require '[net.cgrand.enlive-html :as enlive])  ; alias used in the snippets below
(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    enlive/html-resource
    (enlive/select [:h1#page-title])
    first :content)

Scraping an Arabic site returns ?????? all over the place:

(enlive/html-resource (java.net.URL. "http://www.aljazeera.net/portal"))

I'm not sure how I'm supposed to activate unicode support.

There are 2 best solutions below

Andrew On 17 May 2012 at 19:13 BEST ANSWER

Enlive does have unicode support because it uses Java strings. I ran your first example on my computer and got this result:

(Evliyā Çelebi's Book of Travels)

Perhaps the font that you are using doesn't have glyphs for the code points that you are trying to show?
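
One way to tell a font problem from a decoding problem is to look at the code points of the scraped string rather than at how it renders. A minimal sketch, assuming Java 8 or later; `code-points` is a hypothetical helper name, not part of Enlive:

```clojure
(defn code-points
  "Returns the Unicode code points of s as strings like \"U+0101\"."
  [^String s]
  (mapv #(format "U+%04X" %) (.toArray (.codePoints s))))

;; "\u0101" is the a-with-macron from "Evliya" in the book title.
(code-points "\u0101")   ; => ["U+0101"]
(code-points "?")        ; => ["U+003F"]
```

If the scraped title yields U+0101, the text was decoded correctly and only the display is at fault. U+003F (a literal question mark) or U+FFFD (the replacement character) suggests the characters were lost while decoding.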

pooya72 On 20 May 2012 at 11:00

Christophe Grand, the author of Enlive, was kind enough to reply on the Enlive email group. His suggestion was quite informative. I have copied the email below:

Hello,

Enlive is not (and does not include) a full-featured HTTP agent. When you pass a java.net.URL to html-resource, it calls .getContent on it, gets an InputStream, and then assumes UTF-8. However, if you know the actual encoding you can do:

(-> "http://www.brill.nl/publications/evliya-celebis-book-travels"
    java.net.URL.
    .getContent
    (java.io.InputStreamReader. "ENCODING GOES HERE")
    enlive/html-resource
    (enlive/select [:h1#page-title])
    first :content)

Or use an agent library which will detect the correct encoding and pass the resulting Reader to html-resource.

hth,

Christophe
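
When the encoding is not known in advance, one option is to read it from the HTTP Content-Type header before building the Reader. The following is only a sketch using plain java.net classes, not an agent library: `charset-from-content-type` and `fetch-resource` are hypothetical names, the fallback to UTF-8 is an assumption, and a charset declared only in an HTML meta tag is not detected.

```clojure
(require '[net.cgrand.enlive-html :as enlive])
(import '(java.io InputStreamReader) '(java.net URL))

(defn charset-from-content-type
  "Returns the charset named in a Content-Type header value,
  or default when the header is missing or names no charset."
  [content-type default]
  (or (some->> content-type
               (re-find #"(?i)charset\s*=\s*\"?([^\s;\"]+)")
               second)
      default))

(defn fetch-resource
  "Fetches url-string and parses it with Enlive, decoding the body with the
  charset the server declares (assumption: UTF-8 when none is declared)."
  [url-string]
  (let [conn    (.openConnection (URL. url-string))
        charset (charset-from-content-type (.getContentType conn) "UTF-8")]
    (with-open [rdr (InputStreamReader. (.getInputStream conn) ^String charset)]
      (enlive/html-resource rdr))))
```

With this in place, `(fetch-resource "http://www.aljazeera.net/portal")` could stand in for passing the java.net.URL directly, provided the server sends a correct charset in its header.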
