How do I represent an optional component in a grammar with pyparsing?


I am developing a parser that extracts the dose and name from expressions of medication dosages. For example, pulling "10 mg" and "aspirin" from "10mg of aspirin" and "10 mg aspirin".

My attempt in pyparsing.

import pyparsing as pp

doseWord = pp.Word(pp.alphas)
doseNum = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
preposition = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)

dosage_parser = doseNum + unit + pp.Optional(preposition) + chemical

print(dosage_parser.parseString('10mg of aspirin')) # ['10','mg','of','aspirin']
print(dosage_parser.parseString('10mg aspirin')) # Error, expected W(0123...) found end of text. 
#These two lines should output the same thing. 

What I've tried

  1. wrapping preposition in pp.Optional - Not working
  2. replacing preposition with pp.Combine(pp.Optional(pp.preposition) pp.Empty()) - Not working
  3. replacing preposition with pp.oneOrMore([pp.preposition,pp.Empty()]) - hangs indefinitely as somewhat expected
  4. wrapping preposition in pp.ZeroOrMore - Not working
  5. using (pp.Empty() | preposition) - parses incorrectly ['10', 'mg', 'of']

PaulMcG On

When writing a parser, it is easy to forget that we humans are good at looking at context, but computers (and especially parsers) need things to be spelled out more clearly.

(I'm going to write the rest of this assuming that you are just parsing the strings as given, and that you are not going to pull them out of longer texts like "Take 10 mg of aspirin twice a day.")

So in "10 mg of aspirin", what tells you that "of" is not the chemical? I would say that "of" is not the chemical because it is not the last word in the string. Our human eyes are looking ahead, past the word "of", to see that when it is present, it is followed by another word, which we read as the chemical.

In contrast, pyparsing parsers generally work left-to-right. They don't do any lookahead unless you include it in the parser definition. Your expressions are pretty unconstrained: for instance, your definition of preposition will match any word of alphabetic letters, not just prepositions. So it will match 'aspirin' as well as 'of', and in '10mg aspirin' the optional preposition consumes 'aspirin', leaving nothing for chemical to match. To have pyparsing do the necessary looking ahead, you could write either:

# not the last word in the string
preposition + ~pp.StringEnd()

or

# followed by a chemical
preposition + pp.FollowedBy(chemical)

Since this expression is not desired in the final results, we can suppress it by wrapping it in pp.Suppress(), or by just calling .suppress() on it.

Here is what I have for your parser:

dosage_parser = (
        doseNum
        + unit
        + pp.Optional(preposition + ~pp.StringEnd()).suppress()
        + chemical
)
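For reference, here is a self-contained sketch that puts the pieces together and runs both lookahead variants side by side. It assumes pyparsing 3.x (for the snake_case method names); the names not_last, followed, and original are just labels I picked for illustration, and I renamed doseNum to dose_num.

```python
import pyparsing as pp  # assumes pyparsing 3.x

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
preposition = pp.Word(pp.alphas)   # still matches ANY alphabetic word
chemical = pp.Word(pp.printables)

# The original grammar: Optional(preposition) greedily consumes 'aspirin'
original = dose_num + unit + pp.Optional(preposition) + chemical
try:
    original.parse_string("10 mg aspirin")
    original_failed = False
except pp.ParseException:
    original_failed = True

# Variant 1: the preposition must not be the last word in the string
not_last = (
    dose_num + unit
    + pp.Optional(preposition + ~pp.StringEnd()).suppress()
    + chemical
)

# Variant 2: the preposition must be followed by something that parses as a chemical
followed = (
    dose_num + unit
    + pp.Optional(preposition + pp.FollowedBy(chemical)).suppress()
    + chemical
)

samples = ["10mg of aspirin", "10 mg aspirin"]
results_not_last = [not_last.parse_string(s).as_list() for s in samples]
results_followed = [followed.parse_string(s).as_list() for s in samples]
print(original_failed)
print(results_not_last)
print(results_followed)
```

Both variants only peek ahead; neither consumes the chemical, so the final chemical term still has something to match.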

If you add these lines afterward, pyparsing will generate a parser railroad diagram that might help you visualize how the parser is working:

pp.autoname_elements()
dosage_parser.create_diagram("rx_parser.html")

[Image: railroad diagram of dosage_parser]

You'll also find it easier to run your parser against multiple test strings using run_tests:

dosage_parser.run_tests("""\
    10 mg aspirin
    10 mg of aspirin
""")

This prints each test string, followed by the parsed results:

10 mg aspirin
['10', 'mg', 'aspirin']

10 mg of aspirin
['10', 'mg', 'aspirin']

An added tip - it is easier to extract the values from the parsed results if you add results names in your parser. If you define your parser this way:

dosage_parser = (
        doseNum("quantity")
        + unit("unit")
        + pp.Optional(preposition + ~pp.StringEnd()).suppress()
        + chemical("medication")
)

then the named results will show up in your test output as:

10 mg aspirin
['10', 'mg', 'aspirin']
- medication: 'aspirin'
- quantity: '10'
- unit: 'mg'

10 mg of aspirin
['10', 'mg', 'aspirin']
- medication: 'aspirin'
- quantity: '10'
- unit: 'mg'

And you can access the fields like attributes on an object:

result = dosage_parser.parse_string("10 mg of aspirin")
# attribute names must match the results names defined in the parser
print(result.quantity, result.unit, result.medication)
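As a rough, self-contained sketch of the different ways to read named results (again assuming pyparsing 3.x, with dose_num as my renaming of doseNum): attribute access, dict-style access, and as_dict() all work. One thing to watch for is that attribute access on a name that was never defined returns an empty string rather than raising, so a misspelled name fails silently.

```python
import pyparsing as pp  # assumes pyparsing 3.x

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
preposition = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)

dosage_parser = (
    dose_num("quantity")
    + unit("unit")
    + pp.Optional(preposition + ~pp.StringEnd()).suppress()
    + chemical("medication")
)

result = dosage_parser.parse_string("10 mg of aspirin")

# Attribute-style access, using the results names defined above
print(result.quantity, result.unit, result.medication)

# Dict-style access to the same value
medication = result["medication"]

# Convert all named results to a plain dict
fields = result.as_dict()
print(fields)

# A name that was never defined gives '' as an attribute, not an exception
missing = result.qty
print(repr(missing))
```

If you would rather get an exception for an unknown name, use the dict-style form, result["qty"], which raises KeyError.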

As you continue to work on this parser, you may find that preposition will need to be much more narrowly defined, maybe even as specific as pp.CaselessKeyword("of") to match the word "of" (or use pp.one_of to list the prepositions you expect to see). If you do this, you will probably be able to remove the lookahead, since you would be looking specifically for "of", not for any alphabetic word, which could misread the medication as a preposition. Be sure to define this using keyword semantics, so that you don't accidentally match the leading "of" of "ofloxacin", for example.
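A minimal sketch of that narrower definition, assuming pyparsing 3.x and assuming "of" is the only preposition you need to handle (keyword_parser and the sample strings are my own illustrative choices):

```python
import pyparsing as pp  # assumes pyparsing 3.x

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
# Keyword semantics: matches "of" only as a whole word, in any letter case
preposition = pp.CaselessKeyword("of")
chemical = pp.Word(pp.printables)

# No lookahead needed now, because preposition can no longer match a medication
keyword_parser = (
    dose_num("quantity")
    + unit("unit")
    + pp.Optional(preposition).suppress()
    + chemical("medication")
)

samples = [
    "10 mg of aspirin",
    "10 mg OF aspirin",
    "10mg aspirin",
    "10 mg ofloxacin",   # the leading "of" must not be taken as the preposition
]
parsed = {s: keyword_parser.parse_string(s).as_list() for s in samples}
for text, tokens in parsed.items():
    print(text, "->", tokens)
```

With a plain pp.CaselessLiteral("of") instead of the keyword form, the last sample would lose its leading "of" and report "loxacin" as the medication, which is the failure the keyword semantics guard against.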