How to convert a percent-encoded URL to a string with non-ASCII characters?

This should be an easy one, I hope. I have a URL:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg

that is saved into a json file with this code:

import json

paintings = get_all_paintings(marc_chagall)
with open('chagall.json', 'w') as fb:
    # json.dump writes to the file and returns None, so there is nothing to assign
    json.dump(paintings, fb)

In the file, the URL has become:

u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

I am able to get the original, usable, percent-encoded URL with this code:

import urllib

p = u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
# Encode to UTF-8 bytes first, then percent-encode everything except '/' and ':'
p = urllib.quote(p.encode('utf8'), safe='/:')
print repr(p)
> 'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg'

Now comes the tricky part. I want to get this string:

http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

with the non-ASCII character in napoléon intact. This is only for naming purposes in the storage bucket, nothing else. How can I produce this string?

There are 2 best solutions below

Best answer:

Just print the Unicode value:

>>> print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'
http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napoléon.jpg

Don't confuse the Python representation of the Unicode value (which deliberately uses escapes for non-ASCII characters, for ease of debugging and introspection) with the actual value.
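
The difference between the escaped representation and the value can be sketched as follows. This is written for Python 3 (an assumption; the question uses Python 2), where `ascii()` produces the same kind of escaped output that `repr()` gives for a `unicode` object in Python 2. The variable names are illustrative only.

```python
# Python 3 sketch; the question itself uses Python 2.
url = 'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

# ascii() returns an escaped, ASCII-only representation of the string,
# comparable to what Python 2's repr() shows for a unicode object.
escaped = ascii(url)
print(escaped)

# The value itself contains one character, U+00E9, not the four
# characters backslash, x, e, 9 that appear in the representation.
print(len('napol\xe9on'))   # 8
print(len(r'napol\xe9on'))  # 11 (raw string: the escape is kept literally)
```

Printing `url` itself, rather than `escaped`, shows the é on a terminal that can display it.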

Printing encodes the value using the codec of your console or terminal, provided Python was able to detect it. My terminal is set to UTF-8, so Python encoded the Unicode code point U+00E9 to the bytes C3 A9, and my terminal then interpreted those bytes as UTF-8 and displayed the é.
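
The encoding step described here can be reproduced by hand. The following is a minimal Python 3 sketch (again an assumption, since the question uses Python 2); it only illustrates the UTF-8 round trip and does not inspect the actual terminal settings.

```python
# Python 3 sketch; the question itself uses Python 2.
char = '\xe9'  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# Encoding the single code point to UTF-8 yields two bytes, C3 A9.
utf8_bytes = char.encode('utf-8')
print(utf8_bytes)       # b'\xc3\xa9'
print(len(utf8_bytes))  # 2

# A UTF-8 terminal decodes those two bytes back into the one character.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == char)  # True

# A consumer that assumes Latin-1 instead would see two characters.
latin1_view = utf8_bytes.decode('latin-1')
print(ascii(latin1_view))  # '\xc3\xa9'
```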

This all just means that you already have the right value, but were thrown by the debugging output.
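
If the starting point is the percent-encoded URL rather than the Unicode value, the standard library can decode it. A minimal sketch, assuming Python 3, where `urllib.parse.unquote` decodes percent-escapes as UTF-8 by default (in Python 2, `urllib.unquote` returns a byte string that still has to be decoded):

```python
# Python 3 sketch; the question itself uses Python 2.
from urllib.parse import quote, unquote

encoded = 'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol%C3%A9on.jpg'

# unquote() replaces %C3%A9 with the character those UTF-8 bytes represent.
decoded = unquote(encoded)
print(ascii(decoded))

# Going the other way reproduces the percent-encoded form from the question.
re_encoded = quote(decoded, safe='/:')
print(re_encoded == encoded)  # True
```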

Another answer:

You already have it:

print u'http://uploads4.wikiart.org/images/marc-chagall/kopeikin-and-napol\xe9on.jpg'

The value of p is already that string; it's only displayed differently.