using lxml to find the literal text of url links


(Python 3.4.2) First off, I'm pretty new to python--more than a beginner but less than an intermediate user.

I'm trying to display the literal text of URL links on a page using lxml. I think I've ALMOST got it, but I'm missing something: I can get the actual URLs, but not their titles.

Example--from this,

<a class="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" aria-describedby="description-id-588180" data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&amp;ved=CAcQvxs&amp;feature=c4-videos-u" href="/watch?v=I2AcJG4112A&amp;list=UUrtZO4nmCBN4C9ySmi013oA">Zombie on Omegle!</a>

I want to get this:

'Zombie on Omegle!'

(I'll make that html tag a little more readable for you guys)

<a class="yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2"
   dir="ltr" aria-describedby="description-id-588180"
   data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&amp;ved=CAcQvxs&amp;feature=c4-videos-u"
   href="/watch?v=I2AcJG4112A&amp;list=UUrtZO4nmCBN4C9ySmi013oA">
       Zombie on Omegle!
</a>

I'm trying to do this from a YouTube page, and one of the problems is that YouTube doesn't specify a tag or an attribute for the titles of its links, if that makes sense.

Here's what I've tried:

import lxml.html
from lxml import etree
import urllib.request   #in Python 3, 'import urllib' alone doesn't expose urllib.request

url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()

href_list = tree.xpath('//a/@href')
#Perfect. A list of all URLs from the 'href' attributes.
href_res = [lxml.etree.tostring(href) for href in href_list]
#^TypeError: Type 'lxml.etree._ElementUnicodeResult' cannot be serialized.

#So I tried extracting the 'a' tag without the attribute 'href'.
a_list = tree.xpath('//a')
a_res = [lxml.etree.tostring(clas) for clas in a_list]
#^This works.

links_fail = lxml.html.find_rel_links(doc,'href')
#^I named it 'links_fail' because it doesn't work: the list is empty,
#    since find_rel_links() matches the 'rel' attribute of links
#    (e.g. rel="nofollow"), not 'href'.
#   But the 'links_success' list below works.
urls = doc.xpath('//a/@href')
links_success = [link for link in urls if link.startswith('/watch')]
links_success
#^Out: ['/watch?v=K_yEaIBByFo&list=UUrtZO4nmCBN4C9ySmi013oA', ...]
#Awesome! List of all URLs that begin with '/watch?v=...'
#Now only if I could get the titles of the links...

contents = [text.text_content() for text in urls if text.startswith('/watch')]
#^This doesn't work either: the items in 'urls' are strings,
#    not elements, so they don't have a text_content() method.

#I thought this paragraph below wouldn't work,
#   but I decided to try it anyway.
texts_fail = doc.xpath('//a/[@href="watch"]')
#^XPathEvalError: Invalid expression
#^The real problem is the '/' before the '[': a predicate attaches
#    directly to the step, as in '//a[@href="..."]'. So "fixing" the
#    string value (below) still leaves an invalid expression.
texts_fail = doc.xpath('//a/[@href="/watch"]')
#^XPathEvalError: Invalid expression
texts_false = doc.xpath('//a/@href="watch"')
texts_false
#^Out: False
#^This one is valid XPath, but it evaluates a comparison: it returns
#    True only if some href equals the string exactly, so the output
#    is a boolean, not a list.
texts_false = doc.xpath('//a/@href="/watch"')
texts_false
#^Out: False

target_tag = ''.join(('//a/@class=',
                        '"yt-uix-sessionlink yt-uix-tile-link  spf-link  ',
                        'yt-ui-ellipsis yt-ui-ellipsis-2"'))
texts_html = doc.xpath(target_tag)
#^Out: True
#But YouTube doesn't make attributes for link titles.
texts_tree = tree.xpath(target_tag)
#^Out: True

#I also tried this below, which I found in another stackoverflow question.
#It fails. The error is below.
doc_abs = doc.make_links_absolute(url)
#^make_links_absolute() modifies 'doc' in place and returns None,
#    which is why the loop below fails with the AttributeError.
text = []
text_content = []
notText = []
hasText = []
for each in doc_abs.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)   # list of elements that has text each.text is true
    text_content.append(each.text_content()) #the text for all elements 
    if each not in hasText:
        notText.append(each)
#AttributeError                            Traceback (most recent call last)
#<ipython-input-215-38c68f560efe> in <module>()
#----> 1 for each in doc_abs.iter():
#      2     if each.text:
#      3         text.append(each.text)
#      4         hasText.append(each)   # list of elements that has text each.text is true
#      5     text_content.append(each.text_content()) #the text for all elements
#
#AttributeError: 'NoneType' object has no attribute 'iter'
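A quick note on that first TypeError: attribute queries like `//a/@href` return lxml "smart strings" (`_ElementUnicodeResult`, a subclass of `str`), so there is nothing to serialize with `tostring()`; the results can be used as ordinary strings. A minimal sketch with inline HTML, so it runs without a network connection:

```python
import lxml.html

# inline sample standing in for the YouTube page
html = ('<div>'
        '<a href="/watch?v=abc">Zombie on Omegle!</a>'
        '<a href="/about">About</a>'
        '</div>')
doc = lxml.html.fromstring(html)

hrefs = doc.xpath('//a/@href')
# the results already behave like plain strings
assert all(isinstance(h, str) for h in hrefs)
watch_links = [h for h in hrefs if h.startswith('/watch')]
# watch_links == ['/watch?v=abc']
```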

I'm out of ideas. Anyone want to help this python padawan? :P

-----EDIT-----

I'm a step further, thanks to theSmallNothing. This command gets the text elements:

doc.xpath('//a/text()')

Unfortunately, that command returns a lot of whitespace and newlines ('\n') as values. I'll probably post another question later for that issue. If I do, I'll put a link to that question here in case anyone else with the same question ends up here.
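In case it helps anyone landing here with the same whitespace problem: the usual fix is to strip each result and drop the whitespace-only ones. A sketch, again using inline HTML:

```python
import lxml.html

# inline sample: one link whose text is padded with newlines,
# plus one whose text is whitespace only
html = ('<div>'
        '<a href="/watch?v=abc">\n  Zombie on Omegle!\n</a>'
        '<a href="/other">\n</a>'
        '</div>')
doc = lxml.html.fromstring(html)

# strip each text node and keep only the non-empty results
titles = [t.strip() for t in doc.xpath('//a/text()') if t.strip()]
# titles == ['Zombie on Omegle!']
```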

How to use lxml to pair 'url links' with the 'names' of the links (eg. {name: link})

-----BEST ANSWER-----

For your example you want to use the text() selector in your xpath query:

doc.xpath('//a/text()')

which returns the text element of all the a elements it can find.

To get the href and text of all the a elements, which I think is what you're trying to do, you can first extract all the a elements, then iterate and extract the href and text of each individually.

watch_els = []

els = doc.xpath('//a')
for el in els:
    #note the leading '.' in the queries below: without it, the xpath
    #   searches the whole document from the root on every iteration,
    #   not just inside this element
    text = el.xpath(".//text()")
    href = el.xpath("./@href")
    #check text and href arrays are not empty...
    if len(href) <= 0 or len(text) <= 0:
        #empty text/href, skip.
        continue

    text = text[0].strip()
    href = href[0]
    if "/watch?" in href:
        #do something with a youtube video link...
        watch_els.append((text, href))
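To get the {name: link} mapping asked about in the edit, one approach is to iterate the a elements and read `text_content()` together with the `href` attribute. A sketch with inline HTML (no network needed):

```python
import lxml.html

# inline sample standing in for the YouTube videos page
html = ('<div>'
        '<a href="/watch?v=abc">Zombie on Omegle!</a>'
        '<a href="/watch?v=def"><span>Another</span> video</a>'
        '<a href="/about">About</a>'
        '</div>')
doc = lxml.html.fromstring(html)

# text_content() gathers all text inside the element,
# even when it is split across child tags like <span>
name_to_link = {
    el.text_content().strip(): el.get('href')
    for el in doc.xpath('//a')
    if el.get('href') and '/watch?' in el.get('href')
}
# name_to_link == {'Zombie on Omegle!': '/watch?v=abc',
#                  'Another video': '/watch?v=def'}
```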