How to use lxml to pair URL links with the names of the links (e.g. {name: link})

For some background info, I posted a question leading up to this one: using lxml to find the literal text of url links

Also, I'm somewhat new to Python: more than a beginner, but not 100% comfortable with it yet.

I'm trying to use lxml to match up each link on a page with its name (the blue hyperlink text that appears in a web browser). I'm doing this for a YouTube page, and one problem is that YouTube doesn't put the link titles in HTML attributes.

I've almost got it, but I'm missing something. It might be something as simple as a syntax change. :/

Problem: when I fetch the literal text for the <a> tags in the page (into a Python list), it returns a TON of values, most of which contain only whitespace.

I'll explain what I'd LIKE to do at the end of this post. But first, I'll post my code:

import urllib.request
import lxml.html
from lxml import etree
import re

url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()

#Get all links.
urls = doc.xpath('//a/@href')
len(urls)
#^Out: 109

#Get link names (try, at least).
texts = doc.xpath('//a/text()')
len(texts)
#^Out: 263
#That's a little more than 109.
#A lot of the values in this list are whitespace or '\n'.

#Make copies of the list 'texts' from above,
#    and try to filter out the whitespace.
texts_test = []
#^This list will strip the values filled of whitespace,
#^    but the values themselves won't be deleted, just empty.
texts_test2 = []
#^This list will only hold the values of list 'texts'
#^    that contain something other than whitespace (\S and not \s)
texts_test3 = []
#^This list will only hold the values of list 'texts'
#^    that do not contain a newline
for t in texts:
    texts_test.append(t.strip())
    #^List of stripped values (same length as 'texts').
    if re.findall(r'\S', t):
        texts_test2.append(t)
    if not re.findall('\n', t):
        texts_test3.append(t)

#Now filter out the values in list 'urls'.
urls_test = []
#^This list will only contain the values of list 'urls'
#^    that begin with 'https://www.youtube.com/watch'.
#^    In other words, only the urls of YouTube videos.
urls = doc.xpath('//a/@href')
for u in urls:
    if u.startswith('https://www.youtube.com/watch'):
        urls_test.append(u)

len(texts)       #List holds all literal text under html tag <a>.
#263
len(texts_test)  #Copy of list above with 'junk' values emptied but not deleted.
#263
len(texts_test2) #List holds values with something other than whitespace.
#44
len(texts_test3) #List holds values that do not contain '\n'.
#43
len(urls)        #List holds all url links.
#109
len(urls_test)   #List holds only links of YouTube videos.
#60

For the lists that are closest in length (texts_test3 and urls_test), I checked their values. They mostly contain what I want. I also checked to see if urls_test just had some extra values at the beginning or the end, but unfortunately that's not the case. In other words, the differences are spread throughout the list. For example, urls_test[5:15] does not match up with any ten consecutive values of texts_test3.

I'm currently fetching the text of ALL the <a> tags with this command:

texts = doc.xpath('//a/text()')

What I'd LIKE to do is fetch the text for ONLY the <a> tags that contain href attributes. So something like this:

texts = doc.xpath('//a/@href/text()')

But this command outputs nothing. I've also tried this:

texts = doc.xpath('//a/[@href]/text()')

But I get this error:

XPathEvalError: Invalid expression

I'm out of ideas. Anyone else have some?

Best answer:

XPath requires the predicate (the condition a tag must satisfy, such as having a specific attribute) to come straight after the tag name, with no slash in between:

//title[@lang]  Selects all the title elements that have an attribute named lang

As taken from W3Schools.
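
To see the predicate rule in action, here is a minimal sketch run against a made-up HTML snippet (the markup and variable names are assumptions for illustration, not taken from the YouTube page). It counts all <a> tags, then only those with an href, and checks that putting a slash before the predicate is rejected:

```python
import lxml.html
from lxml import etree

# Hypothetical sample markup: two real links and one anchor without href.
sample = (
    '<div>'
    '<a href="/watch?v=1">First</a>'
    '<a name="top">Anchor</a>'
    '<a href="/watch?v=2">Second</a>'
    '</div>'
)
doc = lxml.html.fromstring(sample)

all_anchors = doc.xpath('//a')        # every <a> tag
with_href = doc.xpath('//a[@href]')   # only <a> tags that have an href

# A slash before the predicate is not valid XPath, so lxml raises an error.
# XPathError is the base class of the XPathEvalError shown in the question.
try:
    doc.xpath('//a/[@href]')
    slash_is_valid = True
except etree.XPathError:
    slash_is_valid = False

print(len(all_anchors), len(with_href), slash_is_valid)
```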

In your case, it's the extra forward slash that's the problem. Change

texts = doc.xpath('//a/[@href]/text()')

to

texts = doc.xpath('//a[@href]/text()')
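
Note that '//a/@href/text()' returns nothing because attribute nodes have no child text nodes. Also, even with the corrected expression, the text list and the href list can still differ in length, since one <a> tag may contain several text nodes or none. One way around that, sketched below under the assumption that you want a {name: link} dict, is to select the <a> elements themselves and read the name and the href from the same element. The sample markup is made up for illustration:

```python
import lxml.html

# Hypothetical sample markup standing in for the real page.
sample = (
    '<ul>'
    '<li><a href="/watch?v=abc">\n   First video\n </a></li>'
    '<li><a href="/watch?v=def"><span>Second</span> video</a></li>'
    '<li><a href="/about">   </a></li>'
    '<li><a name="no-href">Not a link</a></li>'
    '</ul>'
)
doc = lxml.html.fromstring(sample)

pairs = {}
for a in doc.xpath('//a[@href]'):
    # text_content() joins all text inside the tag, including nested tags;
    # split/join collapses newlines and runs of spaces into single spaces.
    name = ' '.join(a.text_content().split())
    if name:  # skip links whose text is empty or whitespace only
        pairs[name] = a.get('href')

print(pairs)
```

Because a dict keeps only one value per key, links that share the same name overwrite each other; collecting (name, href) tuples in a list instead would keep every link.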