right single quote not decoded correctly


Consider the following text:

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"

Notice that there is a regular single quote in "fox's" and a right single quote in "it’s"

My goal is to get the original text representation of the encoded characters in sample_text, but I have only partly succeeded.

I did the following:

>>> sample_text.encode().decode('unicode-escape')
"The fox's color was "brown" and itâ\x80\x99s speed was quick"

Is there any way to get the original right single quote back after decoding sample_text? As the output shows, my code gives me itâ\x80\x99s, but I want it’s.

Edit: As suggested in the comments, I'm adding the output of print(sample_text)

print(sample_text)
output: The fox's color was \u201Cbrown\u201D and it’s speed was quick

Edit: I'm using Python 3.8.10 on Ubuntu.


There are 3 best solutions below

kzi On

According to your post and your edits this should work for you:

>>> text_part_1 = "The fox's color was "
>>> text_part_2 = " and it’s speed was quick"
>>> color = "\u201Cbrown\u201D"
>>> color = color.encode().decode('unicode-escape')
>>> print(f'{text_part_1}{color}{text_part_2}')

To avoid confusion, I should add that this does not work for me. It gives me this instead:

>>> print(f'{text_part_1}{color}{text_part_2}')
The fox's color was âbrownâ and it’s speed was quick

(I'm using Python 3.10.6 on Ubuntu 22.04.2 under WSL2 right now.)

But since the color was output correctly in your code sample

>>> sample_text.encode().decode('unicode-escape')
"The fox's color was "brown" and itâ\x80\x99s speed was quick"

it should work for you.
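A likely explanation for the difference, offered as a sketch rather than a confirmed diagnosis: in the snippet above, color is a regular string literal, so Python has already resolved \u201C to the real character before encode() runs. The question's print output suggests the asker's string holds literal backslashes instead. The names regular, raw, broken and fixed below are illustrative only.

```python
# Regular literal: the parser has already turned \u201C into the real character.
regular = "\u201Cbrown\u201D"
# Raw literal: backslash, 'u' and the hex digits stay as separate characters.
raw = r"\u201Cbrown\u201D"

# encode() defaults to UTF-8; unicode-escape then reads non-escape bytes as Latin-1.
broken = regular.encode().decode("unicode-escape")
fixed = raw.encode().decode("unicode-escape")

print(repr(broken))  # 'â\x80\x9cbrownâ\x80\x9d'
print(fixed)         # “brown”
```

The invisible \x80 and \x9c/\x9d control characters are why the terminal shows only âbrownâ.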

JosefZ On

Read about unicode-escape in Python Specific Encodings (emphasis mine):

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Hence, .encode().decode('unicode_escape') produces mojibake: encode() emits UTF-8 bytes, and unicode_escape then decodes every non-escape byte as Latin-1, as follows:

'it’s'.encode()                            # b'it\xe2\x80\x99s'
'it’s'.encode().decode('unicode_escape')   #  'itâ\x80\x99s'
'it’s'.encode().decode('latin-1')          #  'itâ\x80\x99s'
'it’s'.encode().decode('unicode_escape') == 'it’s'.encode().decode('latin-1')
                                           # True
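Because this kind of mojibake is just UTF-8 bytes misread as Latin-1, it can in principle be reversed by going back through Latin-1. The sketch below shows that on the it’s fragment only; it assumes every character in the damaged string fits in Latin-1, which is not true of the full output in the question (the already-decoded “ and ” would raise UnicodeEncodeError). The variable names are illustrative.

```python
# Reproduce the damage: UTF-8 bytes read back as Latin-1.
mojibake = "it’s".encode("utf-8").decode("latin-1")

# Undo it: recover the original bytes via Latin-1, then decode them as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))  # 'itâ\x80\x99s'
print(repaired)        # it’s
```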

The solution is in the following code:

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)    # regular python text
sample_text = r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)    # raw python text
print(sample_text.encode('raw_unicode_escape').decode('unicode_escape'))

Linux:

~$ python3
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
>>> print(sample_text)
The fox's color was “brown” and it’s speed was quick
>>> sample_text =r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
>>> print(sample_text)
The fox's color was \u201Cbrown\u201D and it’s speed was quick
>>> print(sample_text.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
The fox's color was “brown” and it’s speed was quick
>>>

Windows:

Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
   ...: print(sample_text)
   ...: sample_text =r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
   ...: print(sample_text)
   ...: print(sample_text.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
   ...:
The fox's color was “brown” and it’s speed was quick
The fox's color was \u201Cbrown\u201D and it’s speed was quick
The fox's color was “brown” and it’s speed was quick
In [2]:
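An alternative that avoids the codec round trip altogether, sketched here under assumptions: replace only the literal \uXXXX sequences with a regular expression and leave every other character untouched. The helper name decode_u_escapes and the pattern are hypothetical, not part of any library, and the sketch handles four-digit escapes only (a surrogate pair written as two escapes would come out as two lone surrogates).

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

sample_text = r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(decode_u_escapes(sample_text))
# The fox's color was “brown” and it’s speed was quick
```

Unlike unicode_escape, this does not interpret other backslash sequences such as \n, which may or may not be what you want.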
Andj On

If I understand your question correctly, there are two parts to it:

  1. a concern about the presence of C-style Unicode escapes in your string, and
  2. how to handle the apostrophe-like character in "it’s".

Your question says you are using Python 3.8.10 on Ubuntu, so your environment will be using Unicode (UTF-8). There should therefore be no need for encode/decode pairs if your string is "The fox's color was \u201Cbrown\u201D and it’s speed was quick".

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)
# The fox's color was “brown” and it’s speed was quick

I'm using macOS (and thus Apple's BSD-derived libc) rather than Ubuntu (and glibc), but the behaviour should be the same.

For Python, the escaped character is the same as the actual character, so:

import unicodedata as ud
print('\u201C' == '“')
# True
print(ud.name("\u201C"))
# LEFT DOUBLE QUOTATION MARK
print(ud.name('“'))
# LEFT DOUBLE QUOTATION MARK
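One caveat, as a sketch: the equality above holds when the escape is written in a regular string literal. If the text arrived with literal backslashes (for example from a file, or written as a raw literal), the escape is six separate characters and is not equal to the quotation mark. The names regular and literal are illustrative.

```python
regular = "\u201C"    # one character: the parser resolves the escape
literal = r"\u201C"   # six characters: backslash, 'u', '2', '0', '1', 'C'

print(len(regular), len(literal))   # 1 6
print(regular == "“")               # True
print(literal == "“")               # False
print("\\u" in literal)             # True: an unresolved escape is still present
```

Checking for a literal backslash-u like this is a quick way to tell which of the two cases a given string is in.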

If you avoid the encode/decode pairs then it should resolve your second problem.

Your string has other issues, though. Looking at the words in it:

fox's uses U+0027 (APOSTROPHE), “brown” uses U+201C (LEFT DOUBLE QUOTATION MARK) and U+201D (RIGHT DOUBLE QUOTATION MARK), and it’s uses U+2019 (RIGHT SINGLE QUOTATION MARK)
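A small sketch of how that inventory can be produced with unicodedata. The helper name describe_quotes and the set of characters it treats as quote-like are assumptions made for illustration.

```python
import unicodedata as ud

def describe_quotes(text):
    """Return (char, code point, name) for each apostrophe- or quote-like character."""
    quote_like = "'\"\u2018\u2019\u201C\u201D"
    return [(ch, f"U+{ord(ch):04X}", ud.name(ch)) for ch in text if ch in quote_like]

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
for ch, code_point, name in describe_quotes(sample_text):
    print(code_point, name)
# U+0027 APOSTROPHE
# U+201C LEFT DOUBLE QUOTATION MARK
# U+201D RIGHT DOUBLE QUOTATION MARK
# U+2019 RIGHT SINGLE QUOTATION MARK
```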

You are using U+0027 and U+2019 for the same purpose. It would be useful to clean up the string. Since you are using smart quotes elsewhere:

sample_text = sample_text.replace('\u0027', '\u2019')
print(sample_text)
# The fox’s color was “brown” and it’s speed was quick

You mention needing the original text representation of your string. Your string may already be the original, as it is. The fact that you are using smart double quotes implies that your apostrophes should probably be right single quotation marks to match. What the original string was depends on which keystrokes and which editing controls were used to create it, but that takes you down a complex rabbit hole.

A cleaner approach is to think in terms of normalising your string, i.e. choosing a preferred Unicode character for apostrophe-like characters. That is the approach I took above, using str.replace() to apply smart quotes consistently throughout the string. You could equally normalise away from smart quotes to the Basic Latin (ASCII) quotes:

sample_text = sample_text.replace('\u2019', '\u0027').replace('\u201C', '"').replace('\u201D', '"')
print(sample_text)
# The fox's color was "brown" and it's speed was quick
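When several characters are being mapped, the chained str.replace() calls can be swapped for a single translation table. The following is a minimal sketch of that idea. The names TO_ASCII_QUOTES and asciify_quotes are made up for illustration, and adding U+2018 to the table is an assumption about what else you might want to normalise.

```python
# Map each smart quote to its Basic Latin (ASCII) counterpart.
TO_ASCII_QUOTES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})

def asciify_quotes(text):
    """Return text with smart quotes replaced by ASCII quotes in one pass."""
    return text.translate(TO_ASCII_QUOTES)

sample_text = "The fox’s color was “brown” and it’s speed was quick"
print(asciify_quotes(sample_text))
# The fox's color was "brown" and it's speed was quick
```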