I have a web crawler that runs on different websites (Chinese ones in this case).
When I retrieve the data and display it on my website, the Chinese characters all end up as garbage. I read about character encoding and found that UTF-8 is generally the best encoding.
The problem is that when I use UTF-8, the data crawled from WEBSITE-1 is shown correctly, but not the data from WEBSITE-2.
For WEBSITE-2, the character encoding gb18030 works correctly.
My question is: is there a way to detect the character encoding of a website so that I can build a generic solution? That way I could render a page on my local site knowing which character encoding to use, handle the encoding in the backend, and not have to worry on the front end about which encoding a page requires.
Right now I have two pages: one for UTF-8 Chinese characters and one for GB18030 Chinese characters.
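For a generic solution, a crawler can look at the page's declared encoding before decoding: first the HTTP `Content-Type` response header, then the HTML meta tags in the document itself. Here is a minimal sketch in Python; `detect_charset` is a hypothetical helper name, and the `headers` dict and 4 KB meta-tag scan window are assumptions of this sketch, not part of any standard API.

```python
import re

def detect_charset(headers, body_bytes, default="utf-8"):
    """Guess a page's charset: HTTP header first, then HTML meta tags."""
    # 1. HTTP Content-Type header, e.g. "text/html; charset=gb18030"
    content_type = headers.get("Content-Type", "")
    m = re.search(r"charset=([\w-]+)", content_type, re.I)
    if m:
        return m.group(1).lower()
    # 2. <meta charset="..."> (HTML5) or the http-equiv form (HTML 4).
    #    Meta tags are ASCII, so decoding the prefix as latin-1 is safe.
    head = body_bytes[:4096].decode("latin-1", errors="replace")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return default

# Usage: decode the crawled bytes with whatever charset was detected.
html = '<html><head><meta charset="gb18030"></head><body>你好</body></html>'
raw = html.encode("gb18030")
charset = detect_charset({}, raw)  # no header here, so the meta tag wins
text = raw.decode(charset)
```

If neither the header nor a meta tag declares an encoding, a statistical detector (e.g. the chardet library) can serve as a last resort, but the declared encoding should be trusted first when present.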
Check the page's declared encoding: the HTTP `Content-Type` response header takes precedence, then the `<meta http-equiv="Content-Type">` tag for HTML 4, or the `<meta charset>` attribute for HTML5.
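For example, a page encoded in GB18030 would typically declare it with one of these tags in its `<head>` (gb18030 here stands in for whatever charset the page actually uses):

```html
<!-- HTML 4 / XHTML -->
<meta http-equiv="Content-Type" content="text/html; charset=gb18030">

<!-- HTML5 -->
<meta charset="gb18030">
```

Your crawler can read this tag from the raw bytes (meta tags are plain ASCII in practice) and decode the rest of the document with whatever charset it names.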
See the W3Schools charset reference.