I am developing a parser that extracts the dose and name from expressions of medication dosages. For example, pulling "10 mg" and "aspirin" from "10mg of aspirin" and "10 mg aspirin".
My attempt in pyparsing.
import pyparsing as pp
doseWord = pp.Word(pp.alphas)
doseNum = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
preposition = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)
dosage_parser = doseNum + unit + pp.Optional(preposition) + chemical
print(dosage_parser.parseString('10mg of aspirin')) # ['10','mg','of','aspirin']
print(dosage_parser.parseString('10mg aspirin')) # Error, expected W(0123...) found end of text.
#These two lines should output the same thing.
What I've tried
- wrapping preposition in pp.Optional - Not working
- replacing preposition with pp.Combine(pp.Optional(pp.preposition) pp.Empty()) - Not working
- replacing preposition with pp.oneOrMore([pp.preposition,pp.Empty()]) - hangs indefinitely as somewhat expected
- wrapping preposition in pp.ZeroOrMore - Not working.
- (pp.Empty() | preposition) - parses incorrectly ['10', 'mg', 'of']
When writing a parser, it is easy to forget that we humans are good at looking at context, but computers (and especially parsers) need things to be spelled out more clearly.
(I'm going to write the rest of this assuming that you are just parsing the strings as given, and that you are not going to pull them out of longer texts like "Take 10 mg of aspirin twice a day.")
So in "10 mg of aspirin", what tells you that "of" is not the chemical? To me, I would say that "of" is not the chemical because it is not the last word in the string. Our human eyes are looking ahead, past the word "of", to see that when it is present, it is followed by another word, which we see as the chemical.
In contrast, pyparsing parsers generally work left-to-right. They don't do any lookahead unless you include that in the parser definition. Your expressions are pretty unconstrained; for instance, your definition of preposition will match any word of alphabetic letters, not just prepositions. So it will match 'aspirin' as well as 'of'. To have pyparsing do the necessary looking ahead, you could write the lookahead into the definition of preposition in either of two ways.
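As a minimal sketch of what those two lookahead forms could look like (assuming pyparsing 3.x; the names preposition_a and preposition_b are only for illustration), one version uses a positive lookahead with FollowedBy and the other a negative lookahead with ~StringEnd():

```python
import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)

# Positive lookahead: a preposition only counts if another word follows it.
preposition_a = pp.Word(pp.alphas) + pp.FollowedBy(chemical)

# Negative lookahead: a preposition must not be the last word in the string.
preposition_b = pp.Word(pp.alphas) + ~pp.StringEnd()

for preposition in (preposition_a, preposition_b):
    parser = dose_num + unit + pp.Optional(preposition) + chemical
    print(parser.parse_string("10mg of aspirin").as_list())
    print(parser.parse_string("10 mg aspirin").as_list())
```

With either form, 'aspirin' at the end of the string is no longer consumed as the preposition, so it is still available to match chemical.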
Since this expression is not desired in the final results, we can suppress it by wrapping it in Suppress(), or just call .suppress().
Here is what I have for your parser:
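A minimal sketch of such a parser, assuming pyparsing 3.x and the FollowedBy form of the lookahead (the snake_case variable names are my own choice, not something pyparsing requires):

```python
import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)

# Any alphabetic word is accepted as a preposition,
# but only when another word follows it.
preposition = pp.Word(pp.alphas) + pp.FollowedBy(chemical)

# The optional preposition is matched but left out of the results.
dosage_parser = dose_num + unit + pp.Optional(preposition).suppress() + chemical

print(dosage_parser.parse_string("10mg of aspirin").as_list())
print(dosage_parser.parse_string("10 mg aspirin").as_list())
```

Both strings should now give the same three tokens: the number, the unit, and the chemical.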
If you add these lines afterward, pyparsing will generate a parser railroad diagram that might help you visualize how the parser is working:
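A sketch of how that might look, assuming pyparsing 3.x with the optional diagram support installed (pip install pyparsing[diagrams]). The output path is a hypothetical choice, and the parser is redefined here only so the snippet runs on its own:

```python
import os
import tempfile

import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)
preposition = pp.Word(pp.alphas) + pp.FollowedBy(chemical)
dosage_parser = dose_num + unit + pp.Optional(preposition).suppress() + chemical

# Hypothetical output location; any writable path will do.
diagram_path = os.path.join(tempfile.gettempdir(), "dosage_parser.html")

try:
    # Writes an HTML file containing the railroad diagram.
    dosage_parser.create_diagram(diagram_path)
    print(f"diagram written to {diagram_path}")
except Exception as exc:
    # pyparsing raises a plain Exception if the diagram extras are missing.
    print(f"diagram not generated: {exc}")
```

Open the generated HTML file in a browser to see the diagram.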
You'll also find it easier to run your parser against multiple test strings using run_tests, which prints each test string, followed by the parsed results.
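For example, a small sketch assuming pyparsing 3.x (the parser is redefined so the snippet is self-contained, and the two test strings are the ones from your question):

```python
import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)
preposition = pp.Word(pp.alphas) + pp.FollowedBy(chemical)
dosage_parser = dose_num + unit + pp.Optional(preposition).suppress() + chemical

# One test string per line; blank lines are skipped.
# run_tests returns an overall success flag and a list of
# (test string, result) pairs, in addition to printing them.
success, report = dosage_parser.run_tests("""
    10mg of aspirin
    10 mg aspirin
""")
print("all tests passed:", success)
```

Note that run_tests parses each line with parse_all=True by default, so a test fails if the parser does not consume the whole line.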
An added tip: it is easier to extract the values from the parsed results if you add results names in your parser. The names then show up in your test output, and you can access the fields like attributes on an object.
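A minimal sketch of a parser with results names, assuming pyparsing 3.x. The names "dose", "unit" and "chemical" are my own picks for illustration:

```python
import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)
preposition = pp.Word(pp.alphas) + pp.FollowedBy(chemical)

# Calling an expression with a string returns a copy of it
# that carries that string as its results name.
dosage_parser = (
    dose_num("dose")
    + unit("unit")
    + pp.Optional(preposition).suppress()
    + chemical("chemical")
)

result = dosage_parser.parse_string("10mg of aspirin")

# Named fields are available as attributes...
print(result.dose, result.unit, result.chemical)
# ...or all at once as a dict.
print(result.as_dict())
```

This way the calling code does not have to rely on the position of each token in the results list.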
As you continue to work on this parser, you may find that preposition will need to be much more narrowly defined, maybe even as specific as CaselessKeyword("of") to match the word "of" (or use pp.one_of to list the prepositions you expect to see). If you do this, you will probably be able to remove the lookahead, since you would be specifically looking for "of", not just any alphabetic word that could cause the medication to be misread as a preposition. Be sure to define this using keyword semantics, so that you don't accidentally match the leading "of" of "ofloxacin", for example.
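A sketch of that narrower definition, assuming pyparsing 3.x and that "of" is the only preposition you need to handle for now:

```python
import pyparsing as pp

dose_num = pp.Word(pp.nums)
unit = pp.Word(pp.alphas)
chemical = pp.Word(pp.printables)

# Matches "of" in any letter case, but only as a whole word,
# so the "of" at the start of "ofloxacin" is left alone.
preposition = pp.CaselessKeyword("of")

# No lookahead needed any more.
dosage_parser = (
    dose_num("dose")
    + unit("unit")
    + pp.Optional(preposition).suppress()
    + chemical("chemical")
)

for text in ("10mg of aspirin", "10 mg OF aspirin", "10 mg ofloxacin"):
    print(text, "->", dosage_parser.parse_string(text).as_list())
```

Using a plain Literal("of") here instead would strip the "of" from "ofloxacin" and leave "loxacin" as the chemical, which is the kind of mismatch keyword semantics avoid.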