Text parsing XML File with Python

So I have been able to query and receive an HTTP RSS webpage, convert it to a .txt file, and query the elements within the XML with minidom.

What I am trying to do next is build a list containing only the links that meet my requirements.

Here is an example XML file that has a similar architecture to my file:

<xml>
    <Document name = "example_file.txt">
        <entry id = "1">
            <link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
        </entry>
        <entry id = "2">
            <link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
        </entry>
        <entry id = "3">
            <link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
        </entry>
        <entry id = "4">
            <link href="http://wwww.examplesite.com/files/test_image_1.png"/>
        </entry>
        <entry id = "5">
            <link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
        </entry>
        <entry id = "6">
            <link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
        </entry>
        <entry id = "7">
            <link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
        </entry>
        <entry id = "8">
            <link href="http://wwww.examplesite.com/files/test_image_2.png"/>
        </entry>
    </Document>
</xml>

With minidom, I can reduce this to a list of just the links, but I think I could skip that step if I could build the list directly from text-search criteria. I do not want all the links, only these:

http://wwww.examplesite.com/files/test_image_1.jpg
http://wwww.examplesite.com/files/test_image_2.jpg

Being new to Python, I am not sure how to say: only grab links that do not have ".png", "Big", or "Small" in the link name.

My end goal is to have Python download these files one at a time. Would a list be best for this?

To make this even more complicated, I am limited to the standard library that ships with Python 2.6, so I cannot use any third-party packages.

There are 3 best solutions below

6
On BEST ANSWER

Using lxml and cssselect (both are third-party packages) this is easy:

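from pprint import pprint

import cssselect  # noqa: imported only because lxml's cssselect() requires it
from lxml.html import fromstring

# Parse the saved feed and collect the href of every <link> element.
with open("foo.html", "r") as f:
    doc = fromstring(f.read())
links = [e.attrib["href"] for e in doc.cssselect("link")]
pprint(links)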

Output:

['http://wwww.examplesite.com/files/test_image_1_Big.jpg',
 'http://wwww.examplesite.com/files/test_image_1.jpg',
 'http://wwww.examplesite.com/files/test_image_1_Small.jpg',
 'http://wwww.examplesite.com/files/test_image_1.png',
 'http://wwww.examplesite.com/files/test_image_2_Big.jpg',
 'http://wwww.examplesite.com/files/test_image_2.jpg',
 'http://wwww.examplesite.com/files/test_image_2_Small.jpg',
 'http://wwww.examplesite.com/files/test_image_2.png']

If you only want two of the links (which two?):

links = links[:2]

This is called slicing in Python.
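
As a quick illustration of slicing, here is a minimal sketch on a stand-in list (the short file names are placeholders, not the real links):

```python
# Stand-in data; the real list would come from the parsing step above.
links = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]

first_two = links[:2]     # items at index 0 and 1
last_two = links[-2:]     # the final two items
every_other = links[::2]  # items at index 0, 2, ...

print(first_two)
```

Slicing selects by position only, so it helps when you know where the wanted items sit in the list, not when you need to select by what the link contains.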

Being new to Python, I am not sure how to say: only grab links that do not have ".png", "Big", or "Small" in the link name. Any help would be great.

You can filter your list like this:

with open("foo.html", "r") as f:
    doc = fromstring(f.read())
links = [e.attrib["href"] for e in doc.cssselect("link")]

# Keep a link only if none of the unwanted substrings occurs in it.
unwanted = ("png", "Big", "Small")
links = [link for link in links if not any(s in link for s in unwanted)]
pprint(links)

This will give you:

['http://wwww.examplesite.com/files/test_image_1.jpg',
 'http://wwww.examplesite.com/files/test_image_2.jpg']
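
For the end goal mentioned in the question (downloading the files one at a time), a plain list works well because you can simply loop over it. The following is a rough sketch using only the standard library; the helper names local_name and download_all, and the retrieve parameter, are made up for illustration, and the import fallback is there so the same code can run on Python 2 and Python 3:

```python
import os
import posixpath

try:  # Python 3
    from urllib.parse import urlparse
    from urllib.request import urlretrieve
except ImportError:  # Python 2, including 2.6
    from urlparse import urlparse
    from urllib import urlretrieve


def local_name(url):
    """Return the last path component of the URL, e.g. 'test_image_1.jpg'."""
    return posixpath.basename(urlparse(url).path)


def download_all(links, dest_dir, retrieve=urlretrieve):
    """Fetch each link in turn and return the list of local paths written.

    `retrieve` is a hypothetical hook so the loop can be exercised
    without network access; by default it is urlretrieve.
    """
    saved = []
    for url in links:
        target = os.path.join(dest_dir, local_name(url))
        retrieve(url, target)  # blocks until this file has been fetched
        saved.append(target)
    return saved
```

Calling download_all(links, "downloads") would then fetch the filtered links sequentially, assuming the "downloads" directory already exists and the URLs are reachable.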
0
On
import re
from xml.dom import minidom

_xml = '''<?xml version="1.0" encoding="utf-8"?>
<xml >
    <Document name="example_file.txt">
        <entry id="1">
            <link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
        </entry>
        <entry id="2">
            <link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
        </entry>
        <entry id="3">
            <link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
        </entry>
        <entry id="4">
            <link href="http://wwww.examplesite.com/files/test_image_1.png"/>
        </entry>
        <entry id="5">
            <link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
        </entry>
        <entry id="6">
            <link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
        </entry>
        <entry id="7">
            <link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
        </entry>
        <entry id="8">
            <link href="http://wwww.examplesite.com/files/test_image_2.png"/>
        </entry>
    </Document>
</xml>
'''

doc = minidom.parseString(_xml)  # minidom.parse(path_to_file) gives the same result
entries = doc.getElementsByTagName('entry')
link_ref = (
    entry.getElementsByTagName('link').item(0).getAttribute('href')
    for entry in entries
)
plain_jpg = re.compile(r'.*\.jpg$')  # substitute whatever regex you need
result = (link for link in link_ref if plain_jpg.match(link))
print list(result)

This code prints the list below. The pattern only checks for the .jpg extension, so the Big and Small variants are still included:

[u'http://wwww.examplesite.com/files/test_image_1_Big.jpg',
 u'http://wwww.examplesite.com/files/test_image_1.jpg',
 u'http://wwww.examplesite.com/files/test_image_1_Small.jpg',
 u'http://wwww.examplesite.com/files/test_image_2_Big.jpg',
 u'http://wwww.examplesite.com/files/test_image_2.jpg',
 u'http://wwww.examplesite.com/files/test_image_2_Small.jpg']
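
To narrow the match to the plain .jpg links, one option (a sketch, not the only way) is to add negative lookbehinds so that the text immediately before the extension may not be "_Big" or "_Small". The short list here is a stand-in for the links extracted above:

```python
import re

links = [
    "http://wwww.examplesite.com/files/test_image_1_Big.jpg",
    "http://wwww.examplesite.com/files/test_image_1.jpg",
    "http://wwww.examplesite.com/files/test_image_1_Small.jpg",
    "http://wwww.examplesite.com/files/test_image_1.png",
]

# Must end in ".jpg", and the characters right before the extension
# must be neither "_Big" nor "_Small".
plain_jpg = re.compile(r'.*(?<!_Big)(?<!_Small)\.jpg$')

result = [link for link in links if plain_jpg.match(link)]
print(result)
```

This relies on the naming scheme shown in the question, where the size marker sits directly before the extension; a different scheme would need a different pattern.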

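That said, xml.etree.ElementTree may be a better fit: it is generally faster, uses less memory, and has a more convenient interface than minidom.

ElementTree is part of the standard library (since Python 2.5), so it is available on Python 2.6.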
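
A minimal ElementTree sketch of the same idea, assuming the structure from the question. The names SAMPLE, UNWANTED, and wanted_links are illustrative, and the sample is shortened to four entries:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
    <Document name="example_file.txt">
        <entry id="1"><link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/></entry>
        <entry id="2"><link href="http://wwww.examplesite.com/files/test_image_1.jpg"/></entry>
        <entry id="4"><link href="http://wwww.examplesite.com/files/test_image_1.png"/></entry>
        <entry id="6"><link href="http://wwww.examplesite.com/files/test_image_2.jpg"/></entry>
    </Document>
</xml>"""

# Substrings that disqualify a link, taken from the question.
UNWANTED = (".png", "Big", "Small")


def wanted_links(xml_text):
    """Return the href of every <link> that contains no unwanted substring."""
    root = ET.fromstring(xml_text)
    # ".//link" finds <link> elements at any depth below the root.
    hrefs = [link.get("href") for link in root.findall(".//link")]
    return [h for h in hrefs if h and not any(s in h for s in UNWANTED)]


print(wanted_links(SAMPLE))
```

For a file on disk, ET.parse(path).getroot() can take the place of ET.fromstring(xml_text).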

0
On
from feedparser import parse

data = parse("foo.html")
for elem in data['entries']:
    if 'link' in elem:
        print(elem['link'])

The feedparser library (a third-party package) parses the XML content and returns dictionary-like objects.