I have a problem getting data with Polish diacritics from `Invoke-WebRequest` or `Invoke-RestMethod`. When I retrieve the data I get strange characters instead of the correct Polish diacritics, for example: Plec : MÄżczyzna
When I try the same web request in Postman I get the correct diacritics: "Plec": "Mężczyzna",
When I copy the PowerShell script that Postman generates, I still do not get the correct diacritics. I have added this to the body:
`$body = [System.Text.Encoding]::UTF8.GetBytes($body)`
And also changed the headers to:
`$headers = @{
"Content-Type"="application/json; charset=utf-8";
"OData-MaxVersion"="4.0";
"OData-Version"="4.0";
};`
This is the request:
`$response = Invoke-RestMethod 'https://<URL>/api/MethodInvoker/InvokeServiceMethod' -Method 'POST' -Headers $headers -Body $body
`
I tried to use Postman, several different encodings, changed headers, etc.
tl;dr
This normally indicates the server is encoding the content into a response byte stream in one format (e.g. `utf8`) but the client is decoding the byte stream using a different format (e.g. `iso-8859-1`). As a result, the content decoded by the client doesn't match the original content encoded by the server.
This snippet shows the effect in action:
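A minimal sketch of the effect, assuming the server sent `utf8` bytes and the client decoded them as `iso-8859-1`. The variable names are illustrative, and the Polish word is built from code points so the encoding of the script file itself cannot interfere:

```powershell
# 'Mężczyzna' built from code points (U+0119 = ę, U+017C = ż)
$original = "M$([char]0x0119)$([char]0x017C)czyzna"

# What the server does: encode the text as UTF-8 bytes
$bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# What the client does: decode those bytes with the wrong encoding
$latin1  = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$mangled = $latin1.GetString($bytes)

# Each two-byte UTF-8 sequence has become two separate characters
"original: $($original.Length) chars, mangled: $($mangled.Length) chars"

# For this particular pair the mistake can be undone by reversing both steps
$recovered = [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($mangled))
$recovered -ceq $original
```

The round trip works here because `iso-8859-1` maps every byte to a character; other mismatched pairs can drop or replace bytes on the way.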
Unfortunately, reversing the process is not guaranteed to work: mis-decoding some inputs is lossy, so you can't always recover the original content by reversing the decoding and encoding steps. However, if you write the response to disk, PowerShell will just stream the raw response bytes into a file, and you can read it back using the server's encoding format to recover the original content:
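One way this could look, as a sketch that assumes the server really is sending `utf8` JSON. `Read-Utf8Json` and `Invoke-RestMethodAsUtf8` are hypothetical helper names, not built-in cmdlets:

```powershell
# Hypothetical helper: decode a file explicitly as UTF-8, then parse the JSON text
function Read-Utf8Json {
    param([Parameter(Mandatory)][string] $Path)
    # $Path should be absolute; .NET resolves relative paths against the process directory
    $text = [System.IO.File]::ReadAllText($Path, [System.Text.Encoding]::UTF8)
    return $text | ConvertFrom-Json
}

# Hypothetical wrapper: save the raw response bytes to a temp file, then decode them ourselves
function Invoke-RestMethodAsUtf8 {
    param(
        [Parameter(Mandatory)][string] $Uri,
        [string] $Method = 'Get',
        [hashtable] $Headers = @{},
        [object] $Body
    )
    $path = [System.IO.Path]::GetTempFileName()
    try {
        $request = @{ Uri = $Uri; Method = $Method; Headers = $Headers; OutFile = $path }
        if ($PSBoundParameters.ContainsKey('Body')) { $request.Body = $Body }
        # -OutFile writes the response body to disk without decoding it
        Invoke-RestMethod @request
        return Read-Utf8Json -Path $path
    }
    finally {
        Remove-Item -LiteralPath $path -ErrorAction SilentlyContinue
    }
}
```

With the request from the question this would be called as `Invoke-RestMethodAsUtf8 -Uri 'https://<URL>/api/MethodInvoker/InvokeServiceMethod' -Method 'POST' -Headers $headers -Body $body`.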
More Details
The root problem seems to be caused by different interpretations of what the default encoding should be for some content types. For example:
Some systems (including Windows PowerShell) appear to use an older heuristic that assumes content is encoded using `iso-8859-1` unless a `charset` optional parameter is specified on the content type - see RFC2616: Hypertext Transfer Protocol -- HTTP/1.1
For example, if Windows PowerShell receives a response with this header:
`Content-Type: application/json`
it will treat it like:
`Content-Type: application/json; charset=iso-8859-1`
whereas if the response contains this header:
`Content-Type: application/json; charset=utf-8`
Windows PowerShell will use `utf8` to decode it instead.
This interpretation was superseded in RFC7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, which removes the `iso-8859-1` default and leaves the default charset to each media type's definition, and since RFC8259: The JavaScript Object Notation (JSON) Data Interchange Format requires JSON exchanged between systems outside a closed ecosystem to be encoded as UTF-8, that's what some clients do, so for those systems this:
`Content-Type: application/json`
is treated like
`Content-Type: application/json; charset=utf-8`
and they use `utf8` even if no `charset` is specified.
You could fix the original issue by getting the owner of the website / api to add the `charset=utf-8` optional parameter onto the `content-type` header, which would improve interoperability with some clients, but it's not strictly necessary according to the various specs, and may not be straightforward to get applied if the site is owned by a third party.
And based on the above, the reason the `Content-Type: application/json` response header works in Postman is probably that it uses the newer interpretation of the specs and assumes `utf8` encoding for `application/json`, whereas Windows PowerShell is using the older interpretation of `iso-8859-1` encoding.
For reference, this GitHub issue was the key to understanding all of this behaviour.
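To make the two interpretations concrete, here is an illustrative sketch. `Get-AssumedCharset` is a hypothetical function written for this answer; it mimics the behaviour described above and is not the actual logic of Windows PowerShell, Postman, or any other client:

```powershell
# Hypothetical model of how a client might pick the charset for a response
function Get-AssumedCharset {
    param(
        [Parameter(Mandatory)][string] $ContentType,
        [switch] $LegacyDefault   # emulate the older "assume iso-8859-1" heuristic
    )
    # An explicit charset parameter wins under both interpretations
    if ($ContentType -match 'charset\s*=\s*"?([^";\s]+)') {
        return $Matches[1].ToLowerInvariant()
    }
    # Older interpretation: fall back to iso-8859-1 when nothing is specified
    if ($LegacyDefault) { return 'iso-8859-1' }
    # Newer interpretation: defer to the media type, and JSON means UTF-8
    if ($ContentType -match '^\s*application/json\b') { return 'utf-8' }
    # No assumption is modelled for other media types
    return $null
}

Get-AssumedCharset 'application/json' -LegacyDefault                # older clients
Get-AssumedCharset 'application/json'                               # newer clients
Get-AssumedCharset 'application/json; charset=utf-8' -LegacyDefault # explicit charset
```

The first call models the behaviour attributed to Windows PowerShell, the second the behaviour attributed to Postman, and the third shows why adding `charset=utf-8` on the server side makes both agree.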
Finally...
...if you want a script to help debug these sorts of issues in future, I wrote one a while ago in this answer - https://stackoverflow.com/a/67182420/3156906. It takes the original text and the mangled text and tries to work out what pair of mismatched encoding / decoding was used to mangle the text. When I ran it with your text it gave me this: