I am working with Parsel. Unfortunately, I am not able to parse an <a> tag that is a child of another <a> tag (I know that <a> inside <a> is not valid HTML). How can I handle this situation with Parsel? I have already solved the problem using Beautiful Soup with html.parser as the backend (Beautiful Soup with lxml does not work either).
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</a>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the returned SelectorList is empty
If I put the <a> tags inside a <div>, everything works fine, as in the example below:
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div>
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</div>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the returned SelectorList is not empty
The lxml.html parser that Parsel uses "fixes" the HTML code and puts the inner <a> outside. Try specifying type="xml" when instantiating the Selector.