Structural pattern matching of lxml HtmlElement attributes

I want to use PEP 634 – Structural Pattern Matching to match an HtmlElement that has a particular attribute. The attributes are accessible through the .attrib attribute, which returns an instance of the _Attrib class, and IIUC that class has all the methods it needs to be a collections.abc.Mapping.

The PEP says this:

For a mapping pattern to succeed the subject must be a mapping, where being a mapping is defined as its class being one of the following:

  • a class that inherits from collections.abc.Mapping
  • a Python class that has been registered as a collections.abc.Mapping
  • ...

Here's what I'm trying to do, but it doesn't print the href:

from collections.abc import Mapping
from lxml.html import HtmlElement, fromstring
el = fromstring('<a href="https://stackoverflow.com/">StackOverflow</a>')
Mapping.register(type(el.attrib))  # lxml.etree._Attrib
assert isinstance(el.attrib, Mapping)  # It's True even before registering _Attrib.

match el:
    case HtmlElement(tag='a', attrib={'href': href}):
        print(href)

This matches and prints attrib:

match el:
    case HtmlElement(tag='a', attrib=Mapping() as attrib):
        print(attrib)

This does not match, as expected:

match el:
    case HtmlElement(tag='a', attrib=list() as attrib):
        print(attrib)

I also tried this and it works:

class Upperer:
    def __getitem__(self, key): return key.upper()
    def __len__(self): return 1
    def get(self, key, default): return self[key]
Mapping.register(Upperer)  # It doesn't work without this line.
match Upperer():
    case {'href': href}:
        print(href)  # Prints "HREF"

I understand that using XPath/CSS selectors would be easier, but at this point I just want to know what the problem is with the _Attrib class and my code.

Also, I don't want to unpack the element and convert the _Attrib instance to dict as follows:

match el.tag, dict(el.attrib):
    case 'a', {'href': href}:
        print(href)

or use guards:

match el:
    case HtmlElement(tag='a', attrib=attrs) if 'href' in attrs:
        print(attrs['href'])

Both of these work, but they don't look right. I'd like to find a solution that makes the original case HtmlElement(tag='a', attrib={'href': href}) work, or something very close to it.

I'm using Python 3.11.4.

Answer by Sepu Ling:

The problem seems to come from using match/case to compare two objects for equality. Pattern matching with match/case is typically used to dispatch on the shape or value of a subject, not to test whether two objects are equal. In Python, equality is normally checked with the == operator, so if you want to compare two objects for equality you should use == instead of match/case.

Write a class for comparison:

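from lxml.html import HtmlElement

class Element:
    """Comparable snapshot of an HtmlElement's tag, text, tail and attributes."""
    def __init__(self, data: HtmlElement | None = None, **kwargs):
        if kwargs:
            # Expected values were passed directly as keyword arguments.
            self.__dict__ = kwargs
            return
        fields = {'tag': data.tag, 'text': data.text, 'tail': data.tail, 'attrib': dict(data.attrib)}
        # Keep only non-empty fields so that a missing text or tail is ignored.
        self.__dict__ = {name: value for name, value in fields.items() if value}

    def __eq__(self, other):
        if not isinstance(other, Element):
            return NotImplemented
        return self.__dict__ == other.__dict__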

Then use an if statement to check whether they are equal:

if Element(el) == Element(tag='a', text='StackOverflow', attrib={'href': 'https://stackoverflow.com/'}):
    print('equal')