Decode HTML entity in Python

I have a file that contains lines like this:

StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4

Corresponding to these lines, I have files on disk, but they are saved under the decoded names:

StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4

I need to take each file name from the list, work out the correct (decoded) file name, and rename the file to that second name. To do this I need to decode the HTML entities in the file name, so I did something like this:

import os
from html.parser import HTMLParser

fpListDwn = open('listDwn', 'r')

for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(HTMLParser().unescape(fileName))

but this has no effect when I run it. Part of the output is:

meysampg@freedom:~/Downloads/Practical Machine Learning$ python3 changeName.py
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4

StatsLearning_Lect1_2b_111213_v2_%5BLvaTokhYnDw%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4a_110613_%5BWjyuiK5taS8%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4b_110613_%5BUvxHOkYQl8g%5D_%5Btag22%5D.mp4

StatsLearning_Lect3_4c_110613_%5BVusKAosxxyk%5D_%5Btag22%5D.mp4

How can I fix this?

There are 2 best solutions below

0
On BEST ANSWER

This is actually "percent-encoding" (URL encoding), not HTML entity encoding, which is why the HTML parser leaves the names unchanged. See this question:

How to percent-encode URL parameters in Python?

Basically you want to use urllib.parse.unquote instead:

from urllib.parse import unquote
unquote('StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4')

Out[192]: 'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
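
To make the distinction concrete, here is a small sketch comparing the two decoders from the standard library: html.unescape only touches entities such as &amp;, while urllib.parse.unquote only touches %XX escapes. The sample strings are made up for illustration and are not taken from the asker's file.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical sample names, shortened for illustration.
percent_encoded = 'Lect1_%5Btag22%5D.mp4'
html_encoded = 'Tom &amp; Jerry &lt;1&gt;.mp4'

# HTML unescaping leaves %XX sequences untouched.
html_result = unescape(percent_encoded)
# Percent-decoding turns %5B and %5D into [ and ].
url_result = unquote(percent_encoded)
# HTML unescaping is the right tool only for &name; style entities.
entity_result = unescape(html_encoded)

print(html_result)    # Lect1_%5Btag22%5D.mp4
print(url_result)     # Lect1_[tag22].mp4
print(entity_result)  # Tom & Jerry <1>.mp4
```
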
0
On

I think you should use urllib.parse instead of html.parser:

>>> f="StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"
>>> import urllib.parse as parse
>>> f
'StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4'
>>> parse.unquote(f)
'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'

So your script should look like:

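import urllib.parse as parse

# The with block closes the list file once the loop finishes.
with open('listDwn', 'r') as fpListDwn:
    for fileName in fpListDwn:
        # Drop the trailing newline (it causes the blank lines in the output),
        # then decode the %XX escapes.
        print(parse.unquote(fileName.rstrip('\n')))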
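
For the renaming step mentioned in the question, something along these lines might work. It is only a sketch under one assumption: that a file still carrying the encoded name should be renamed to the decoded name (swap src and dst if your situation is the reverse). The helper names decoded_names and rename_to_decoded, and the directory parameter, are invented for this example.

```python
import os
from urllib.parse import unquote


def decoded_names(list_path):
    """Yield (encoded, decoded) name pairs from a list file, one name per line."""
    with open(list_path, 'r') as fp:
        for line in fp:
            encoded = line.strip()
            if encoded:  # skip blank lines
                yield encoded, unquote(encoded)


def rename_to_decoded(list_path, directory='.'):
    """Rename files that carry the encoded name to the decoded name.

    Assumption: the encoded name is the one to replace. Returns the list
    of decoded names that were actually renamed.
    """
    renamed = []
    for encoded, decoded in decoded_names(list_path):
        src = os.path.join(directory, encoded)
        dst = os.path.join(directory, decoded)
        # Only rename when the source exists and the target name is free.
        if os.path.exists(src) and not os.path.exists(dst):
            os.rename(src, dst)
            renamed.append(decoded)
    return renamed
```

Calling rename_to_decoded('listDwn') in the download directory would then rename whatever still has a percent-encoded name and leave files that are already decoded alone.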