Parsing an ONIX xml with conditions using Python lxml


I am trying to extract some information from an ONIX XML file using the Python lxml parser.

Among other things, the part of the document I am interested in looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<ProductSupply>
       <SupplyDetail>
          <Supplier>
             <SupplierRole>03</SupplierRole>
             <SupplierName>EGEN</SupplierName>
          </Supplier>
          <ProductAvailability>40</ProductAvailability>
          <Price>
             <PriceType>01</PriceType>
             <PriceAmount>0.00</PriceAmount>
             <Tax>
                <TaxType>01</TaxType>
                <TaxRateCode>Z</TaxRateCode>
                <TaxRatePercent>0</TaxRatePercent>
                <TaxableAmount>0.00</TaxableAmount>
                <TaxAmount>0.00</TaxAmount>
             </Tax>
             <CurrencyCode>NOK</CurrencyCode>
          </Price>
          <Price>
             <PriceType>02</PriceType>
             <PriceQualifier>05</PriceQualifier>
             <PriceAmount>0.00</PriceAmount>
             <Tax>
                <TaxType>01</TaxType>
                <TaxRateCode>Z</TaxRateCode>
                <TaxRatePercent>0</TaxRatePercent>
                <TaxableAmount>0.00</TaxableAmount>
                <TaxAmount>0.00</TaxAmount>
             </Tax>
             <CurrencyCode>NOK</CurrencyCode>
          </Price>
       </SupplyDetail>
    </ProductSupply>

I need to pick up the price amount with the following conditions:

PriceType='02' and CurrencyCode='NOK' and PriceQualifier='05'

I tried:

price = p.find(
"ProductSupply/SupplyDetail[Supplier/SupplierRole='03']/Price[PriceType='02' \
and CurrencyCode='NOK' and PriceQualifier='05']/PriceAmount").text

For some reason my XPath with and operators does not work, and I get the following error:

File "<string>", line unknown
    SyntaxError: invalid predicate

Any idea how to approach it? Any assistance is highly appreciated!

Best answer

TL;DR: Use xpath() because boolean operators like and are not supported by find*() methods.


As Daniel suggested, you should use lxml's xpath() method, which is available on elements and element trees, for your (rather complex) XPath expression.

XPath

Your XPath expression contains node tests and predicates which use the boolean operator and (XPath 1.0):

ProductSupply/SupplyDetail[Supplier/SupplierRole='03']/Price[PriceType='02'
and CurrencyCode='NOK' and PriceQualifier='05']/PriceAmount

Tip: Test it online (see Xpather demo). This confirms that it finds a single element <PriceAmount>0.00</PriceAmount> as expected.

Using find() methods

According to the Python docs, you can use the following find methods, which accept a match expression (e.g. XPath) as an argument:

  1. find
  2. findall

Issue: limited XPath syntax support for find()

However, the XPath syntax they support is limited!

This limitation includes logical operators like your and. Karl Thornton explains this on his page XML parsing: Python ~ XPath ~ logical AND | Shiori.

On the other hand, a note in the lxml documentation recommends them:
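To make the difference concrete, here is a minimal sketch that runs the same expression through find() and xpath(). The inline XML is a trimmed-down sample modelled on the question (the amounts 1.00 and 2.00 are made up so the two prices can be told apart), and the path starts at SupplyDetail because the sample's root element is ProductSupply itself.

```python
from lxml import etree

# Trimmed-down sample modelled on the question's document (made-up amounts).
XML = b"""<ProductSupply>
  <SupplyDetail>
    <Price>
      <PriceType>01</PriceType>
      <PriceAmount>1.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
    <Price>
      <PriceType>02</PriceType>
      <PriceQualifier>05</PriceQualifier>
      <PriceAmount>2.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
  </SupplyDetail>
</ProductSupply>"""

root = etree.fromstring(XML)  # root is the <ProductSupply> element

# Relative to root, so the path starts at SupplyDetail.
expr = "SupplyDetail/Price[PriceType='02' and CurrencyCode='NOK']/PriceAmount"

# find() uses the limited ElementPath syntax and rejects the "and" predicate.
try:
    root.find(expr)
    find_error = None
except SyntaxError as exc:
    find_error = str(exc)

# xpath() evaluates full XPath 1.0 and returns a list of matching elements.
xpath_result = [el.text for el in root.xpath(expr)]

print("find() error:", find_error)
print("xpath() result:", xpath_result)
```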

The .find*() methods are usually faster than the full-blown XPath support. They also support incremental tree processing through the .iterfind() method, whereas XPath always collects all results before returning them. They are therefore recommended over XPath for both speed and memory reasons, whenever there is no need for highly selective XPath queries.

(emphasis mine)
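If the speed and memory argument matters for your files, one possible approach is to keep the find-style paths simple and check the conditions in Python instead. The following is only a sketch: the helper name find_price_amount is hypothetical, the inline XML is a shortened sample with made-up amounts, and it assumes you already hold the ProductSupply element.

```python
from lxml import etree

# Shortened sample modelled on the question's document (made-up amounts).
XML = b"""<ProductSupply>
  <SupplyDetail>
    <Supplier>
      <SupplierRole>03</SupplierRole>
      <SupplierName>EGEN</SupplierName>
    </Supplier>
    <Price>
      <PriceType>01</PriceType>
      <PriceAmount>1.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
    <Price>
      <PriceType>02</PriceType>
      <PriceQualifier>05</PriceQualifier>
      <PriceAmount>2.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
  </SupplyDetail>
</ProductSupply>"""


def find_price_amount(product_supply):
    """Hypothetical helper: return the text of the first matching PriceAmount, or None."""
    # iterfind() yields matches lazily; only plain child paths are used here.
    for detail in product_supply.iterfind("SupplyDetail"):
        if detail.findtext("Supplier/SupplierRole") != "03":
            continue
        for price in detail.iterfind("Price"):
            # The "and" conditions are expressed in Python rather than in the path.
            if (price.findtext("PriceType") == "02"
                    and price.findtext("CurrencyCode") == "NOK"
                    and price.findtext("PriceQualifier") == "05"):
                return price.findtext("PriceAmount")
    return None


root = etree.fromstring(XML)  # root is the <ProductSupply> element
print(find_price_amount(root))
```

This trades a compact expression for more Python code, so it is probably only worth it when xpath() turns out to be a measurable bottleneck.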

Using lxml's xpath()

So let's start with the safer and richer xpath() function (before premature optimization). For example:

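from lxml import etree

tree = etree.parse("onix.xml")  # hypothetical file name; use your own ONIX file

# the node predicates to apply within XPath
sd_predicate = "[Supplier/SupplierRole='03']"
p_predicate = "[PriceType='02' and CurrencyCode='NOK' and PriceQualifier='05']"

# build the XPath, including predicates, with an f-string
pa_xpath = f"ProductSupply/SupplyDetail{sd_predicate}/Price{p_predicate}/PriceAmount"
print("Using XPath:", pa_xpath)  # remove after debugging

root = tree.getroot()
# the path is relative to root, so ProductSupply must be a child of root;
# xpath() returns a list of matching elements (empty if nothing matches)
price_amount = root.xpath(pa_xpath)
print("XPath evaluated to:", price_amount)  # remove after debugging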
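As an alternative to building the expression with an f-string, lxml's xpath() accepts XPath variables as keyword arguments, which avoids quoting problems when the values come from elsewhere. A minimal sketch, assuming a shortened sample with made-up amounts whose root element is ProductSupply (so the path starts at SupplyDetail); the variable names role, ptype, currency and qualifier are chosen here for illustration only:

```python
from lxml import etree

# Shortened sample modelled on the question's document (made-up amounts).
XML = b"""<ProductSupply>
  <SupplyDetail>
    <Supplier>
      <SupplierRole>03</SupplierRole>
    </Supplier>
    <Price>
      <PriceType>01</PriceType>
      <PriceAmount>1.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
    <Price>
      <PriceType>02</PriceType>
      <PriceQualifier>05</PriceQualifier>
      <PriceAmount>2.00</PriceAmount>
      <CurrencyCode>NOK</CurrencyCode>
    </Price>
  </SupplyDetail>
</ProductSupply>"""

root = etree.fromstring(XML)  # root is the <ProductSupply> element

# $name placeholders are XPath variables, filled in from keyword arguments.
expr = (
    "SupplyDetail[Supplier/SupplierRole=$role]"
    "/Price[PriceType=$ptype and CurrencyCode=$currency"
    " and PriceQualifier=$qualifier]/PriceAmount"
)

matches = root.xpath(expr, role="03", ptype="02", currency="NOK", qualifier="05")

# xpath() returns a list, so guard against an empty result before reading .text.
price_amount = matches[0].text if matches else None
print("Price amount:", price_amount)
```

Because the result is a list, the guard on the last line matters: calling .text directly on an empty result would fail when no price satisfies all conditions.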

See also: