PHP - Replace JSON with the correct Unicode symbol


I have some JSON that I decode and then print out. Before the JSON is decoded, I use stripslashes() to remove extra slashes. The JSON contains website links, such as https://www.w3schools.com/php/default.asp, and descriptions like Hello World, I have u00249999999 dollars

When I print out the JSON, I would like it to print out Hello World, I have $9999999 dollars, but it prints out Hello World, I have u00249999999 dollars.

I assume that u0024 is not getting parsed because it has no backslash. The forward slashes in the website links aren't removed by stripslashes(), which is good, but I think the backslashes of the Unicode escapes are.

How do I get PHP to automatically detect and parse the Unicode dollar sign? I would also like to apply this rule to every Unicode symbol.


There are 3 best solutions below

1
On BEST ANSWER

According to the PHP documentation on stripslashes(), it

un-quotes a quoted string.

This means that it removes the backslashes used to escape characters (or to introduce Unicode escape sequences). Once they are gone, you can no longer be sure that a sequence such as "u0024" was meant to be a Unicode escape; your user could simply have typed that text.

Besides that, you will run into trouble when using stripslashes() on a JSON value that contains escaped quotes. Consider this example:

{
  "key": "\"value\""
}

This becomes invalid after stripslashes(), because it then looks like this:

{
  "key": ""value""
}

That is not parseable because it is no longer valid JSON. If you don't use stripslashes(), the JSON parser converts all escape sequences itself: json_decode() turns \u0024 into $ and \/ into /, so the decoded data already contains the real characters when you print it.

Conclusion: I'd suggest not using stripslashes() when dealing with JSON, as it may break things (as seen in the previous example, but also in your problem).
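
To make the point above concrete, here is a minimal sketch that decodes the same JSON with and without stripslashes(). The sample payload and the key names (url, text, quoted) are made up for illustration and are not taken from the asker's data.

```php
<?php
// Hypothetical payload; a nowdoc keeps every backslash exactly as written.
$raw = <<<'JSON'
{"url": "https:\/\/www.w3schools.com\/php\/default.asp", "text": "Hello World, I have \u00249999999 dollars", "quoted": "\"value\""}
JSON;

// Decode the untouched string: the parser resolves \/ , \u0024 and \" itself.
$data = json_decode($raw, true);
echo $data['text'], PHP_EOL;   // Hello World, I have $9999999 dollars
echo $data['url'], PHP_EOL;    // https://www.w3schools.com/php/default.asp
echo $data['quoted'], PHP_EOL; // "value"

// Strip the slashes first: the escaped quotes become bare quotes,
// the document is no longer valid JSON, and json_decode() returns null.
$broken = json_decode(stripslashes($raw), true);
var_dump($broken);                   // NULL
echo json_last_error_msg(), PHP_EOL; // describes the parse failure
```

If the slashes were added by some earlier layer (for example a double-encoded request body), it is probably safer to find and remove that layer than to compensate with stripslashes().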

1
On

Your assumption is correct: u0024 is not getting parsed because it has no backslash. You can use regex to add backslash back after the conversion.
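
One way to sketch that regex idea, assuming the backslashes are already lost and cannot be recovered upstream: re-encode the text as a JSON string, put a backslash back in front of every "u" followed by four hex digits, and let json_decode() resolve the escapes. The function name restoreUnicodeEscapes is hypothetical, and the approach will also convert ordinary text that merely looks like uXXXX.

```php
<?php
// Hypothetical helper, not a built-in: restores "\u" in front of uXXXX sequences.
function restoreUnicodeEscapes(string $text): string
{
    // Encode as a JSON string literal so quotes and control characters are escaped safely.
    $json = json_encode($text, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return $text; // e.g. invalid UTF-8: leave the input untouched
    }

    // Add a backslash before "u" + 4 hex digits, unless one is already there.
    $json = preg_replace('/(?<!\\\\)u([0-9a-fA-F]{4})/', '\\\\u$1', $json);

    // Let the JSON parser turn \uXXXX into the real character.
    $decoded = json_decode($json);

    // Fall back to the original text if the result is not valid JSON
    // (for example, a lone surrogate such as ud83d).
    return is_string($decoded) ? $decoded : $text;
}

echo restoreUnicodeEscapes('Hello World, I have u00249999999 dollars'), PHP_EOL;
// Hello World, I have $9999999 dollars
```

Because the decoding is done by the JSON parser, surrogate pairs (two consecutive uXXXX sequences) should be combined into a single character, which an entity-based replacement would not do.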

It looks like you have UTF-8 encoded strings internally and PHP outputs them properly, but your browser fails to auto-detect the encoding (it falls back to ISO 8859-1 or some other encoding).

The best way is to tell the browser that UTF-8 is being used by sending the corresponding HTTP header:

header("Content-Type: text/html; charset=UTF-8");

Then you can leave the rest of your code as-is and don't have to HTML-encode entities or add other workarounds.

If you want, you can additionally declare the encoding in the generated HTML by using the <meta> tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> for HTML <= 4.01
<meta charset="UTF-8"> for HTML5

The HTTP header has priority over the <meta> tag, but the latter may be useful if the HTML is saved to disk and then read locally.

5
On

The main question you have to answer is why you need to strip slashes at all. And, if it really is necessary to strip slashes, how do you manage the encoding? It is probably a good idea to convert the Unicode symbols before stripping slashes, not after, using html_entity_decode.

Anyway, you can try to fix the problem with this workaround:

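$string = "Hello World, I have u00249999999 dollars";
// Turn "u" + exactly 4 hex digits into a numeric HTML entity, e.g. u0024 -> &#x0024;
// (ordinary text that happens to look like uXXXX is converted too)
$string = preg_replace('/u([0-9A-Fa-f]{4})/', '&#x$1;', $string);
// Decode the entities into UTF-8 characters
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');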