I am trying to extract £ signs from an XML feed (full version is here > http://mjsiphone.com/scotjobsnet/)
A mini version is here > http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml
I have no control over the source of the feed or the source server or how the feeds are formatted, the headers they use etc or so on.
I just have to build a .NET 4.5 console application that will extract the data and save it on our own sites database.
Also I have to strip all HTML (from job descriptions) and remove any HTML Encoded characters and replace them with their real values.
Therefore I need to save real £ signs in nvarchar datatypes in an MS SQL 2008 DB not £20,000 or £20,000 etc.
When viewing the source of the feed it has UTF-8 at the top of it.
However when viewing the feed in a browsers source I don't see any mention of UTF-8 as a Request/Response header and in the Request Headers (Chrome) I only see:
Accept-Language:en-GB,en-US;q=0.8,en;q=0.6
When I copy and paste the characters from the browser or the console into SQL and check them they return 163 which is the correct ASCII character encoding e.g £
If you view the feed in a browser the pound signs show up fine.
When I just pipe out the content to a Windows Command console they show up fine as £ signs.
However when I try to save them to the DB or pipe the console debug out to a file in EditPlus (with character encoding set to UTF8 or ASCII) I just get squares in front of the numbers instead of the signs e.g in CMD
[path to .exe] > [path to debug.log file]
Either the console cannot pipe across the content to the editor correctly or I need to use the right encoding or pass along more headers or extract the XML differently.
Here is an example of the code I am using for testing this with, using just one field that uses £ signs in it and then breaking.
static void Main(string[] args)
{
Console.WriteLine("START");
XmlDocument xDoc = new XmlDocument();
string feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
WebClient webClient = new WebClient();
// need to pass a user-agent > 10 Chars to prevent blocking by OUR servers 403
webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
// piping out to console with this line below shows a £ but to a UTF-8 or ASCII file it's gibberish
webClient.Headers.Add("Content-Type", "application/xml; charset=utf-8");
// I tried this but still the console works but piping to an editor in UTF-8 or ASCII shows squares
webClient.Headers.Add("Accept-Language", "utf-8,en-GB,en-US;q=0.8,en;q=0.6");
// download as text - is this the problem? Should I be using a different method
string feedText = webClient.DownloadString(feedURL);
// load into XML object
xDoc.LoadXml(feedText);
if (xDoc != null)
{
XmlElement root = xDoc.DocumentElement;
XmlNodeList xNodelst = root.SelectNodes("job");
foreach (XmlNode node in xNodelst)
{
string salary = node.SelectSingleNode("candidateSalary").InnerText;
// piped to cmd console the £ signs show but to a UTF-8 file they are just squares
// I've tried adding the Encoding.UTF8 or Encoding.ASCII still no joy
// Console.WriteLine("candidateSalary = " + salary,Encoding.UTF8);
Console.WriteLine("candidateSalary = " + salary);
break;
}
}
Console.WriteLine("FINISH");
}
Any help will be much appreciated.
I am sure it's just a header I need to pass or maybe an issue with outputting the XML content to the Editor.
As I said before just viewing the output in a Windows console the £ show up fine.
Thanks
I expect the output of this command is not UTF-8:
There are two transcoding operations here:
The correct way to detect the XML document encoding is described in the XML specification. XmlDocument will do this for you.
The console encoding can be set to UTF-8 or you could serialize encoded bytes to STDOUT directly.
More on cmd.exe and Unicode here.