Python 3 HTML parser


I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:

curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'

All I have in python3 so far is:

import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')

for lines in f.readlines():
    print(lines)

f.close()

Seriously, any suggestions would be welcome. Please don't just tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html: I have been learning Python for one day and get confused easily. A simple example would be amazing!


There are 3 best solutions below

0
On BEST ANSWER

This is based off of larsmans's answer, above.

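import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    # urlopen yields bytes, so the search string must be bytes too
    if b'align="center">' in line:
        # print the line after the match, as awk's getline does
        print(next(f).decode().rstrip())
f.close()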

Explanation:

for line in f iterates over the lines of the file-like object f. Python lets you iterate over the lines of a file just as you would over the items of a list.

if b'align="center">' in line checks whether the current line contains 'align="center">'. The b prefix makes the literal a bytes object rather than a string. urllib.request.urlopen returns the response as binary data (bytes), not as Unicode strings, and an unadorned 'align="center">' is a Unicode string, which cannot be searched for inside bytes. (That was the source of the TypeError above.)
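To make the bytes-versus-string point concrete, here is a minimal sketch that needs no network access. The sample line is made up for illustration; it only stands in for one line as urlopen would hand it over.

```python
# Hypothetical line, of the kind iterating over urlopen's result would yield.
line = b'<td align="center">\n'

# Searching bytes with a str needle raises TypeError in Python 3.
try:
    'align="center">' in line
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Either use a bytes literal for the needle...
found_as_bytes = b'align="center">' in line

# ...or decode the line to str first (assuming the page is UTF-8).
found_as_text = 'align="center">' in line.decode('utf-8')

print(mixed_ok, found_as_bytes, found_as_text)
```

Both fixes work; using the b prefix avoids decoding every line just to test it.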

next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (bytes objects have methods in Python, just as strings do) converts the binary data to a printable Unicode string; with no argument it assumes UTF-8. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
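If you want to try the same logic without hitting the website, something like the sketch below should do. It uses io.BytesIO as a stand-in for the response object; the helper name lines_after and the sample HTML are invented for illustration, and the real page's layout may differ.

```python
import io

def lines_after(stream, marker):
    """Yield the decoded line that follows each line containing marker."""
    for line in stream:
        if marker in line:
            # The default of None avoids StopIteration if the marker
            # happens to be on the very last line.
            following = next(stream, None)
            if following is not None:
                yield following.decode('utf-8').rstrip()

# Made-up stand-in for the page; it iterates line by line like urlopen's result.
sample = io.BytesIO(
    b'<table>\n'
    b'<td align="center">\n'
    b'203.0.113.7\n'
    b'</td>\n'
    b'</table>\n'
)
print(list(lines_after(sample, b'align="center">')))
```

Calling next() inside the for loop is safe here because the stream is its own iterator, so the loop simply carries on from the line after the one that was consumed.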

3
On

I would probably use regular expressions to get the ip itself:

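import re
import urllib.request

f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
# read() returns bytes; decode so the str pattern below can be applied
html_text = f.read().decode()
f.close()
print(re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0])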

This prints the first string of the form: 1-3 digits, period, 1-3 digits, and so on.

I take it you were looking for the whole line; you could simply extend the pattern passed to findall() to take care of that (see the Python docs for re for more details). By the way, the r in front of the pattern makes it a raw string, so you don't need to escape Python's escape characters inside it (but you still need to escape regex metacharacters).
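As a rough sketch of what "extending the pattern" could look like, the snippet below anchors the address to the align="center"> marker. The sample text is invented, and the assumption that the address directly follows the marker (with only whitespace in between) may not hold for the real page.

```python
import re

# Made-up text standing in for the decoded page.
html_text = (
    '<td align="center">\n'
    '203.0.113.7\n'
    '</td>\n'
)

# Raw string: "\d" and "\." reach the regex engine unchanged.
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# The marker, then any whitespace (including the newline), then the
# address captured in a group.
after_marker = re.compile(r'align="center">\s*(' + ip_pattern + r')')

match = after_marker.search(html_text)
first_ip = match.group(1) if match else None
print(first_ip)
```

Checking match before calling group() avoids an AttributeError when the marker is not found, much as indexing findall()[0] would fail on an empty list.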

Hope that helps

4
On
# no need for .readlines here
for ln in f:
    # lines from urlopen are bytes, so search with a bytes literal
    if b'align="center">' in ln:
        print(ln)

But be sure to read the Python tutorial.
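Since the question mentions html.parser, here is a small sketch of how that module could do the same job. The class name CenteredText and the sample markup are made up, and it assumes the value you want is the first non-blank text after a tag carrying align="center".

```python
from html.parser import HTMLParser

class CenteredText(HTMLParser):
    """Collect the first non-blank text after each tag with align="center"."""

    def __init__(self):
        super().__init__()
        self.capture = False  # True once a matching start tag has been seen
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed
        if ('align', 'center') in attrs:
            self.capture = True

    def handle_data(self, data):
        text = data.strip()
        if self.capture and text:
            self.found.append(text)
            self.capture = False

parser = CenteredText()
# Made-up markup; pass urlopen(...).read().decode() for a real page.
parser.feed('<table><tr><td align="center">\n203.0.113.7\n</td></tr></table>')
print(parser.found)
```

Unlike the line-based approaches, this does not depend on where the line breaks fall in the page source.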