I have an Excel file with thousands of rows containing information about the fulfilment of a contract. The data is loaded into the system using a template. But sometimes the template is filled out incorrectly and the row with the fulfilment information is filled out with errors.
The correct template looks like this: Object_№_(number)_Serviced_(dd.mm.YYYY)_Fulfilment_of_obligations_under_agr._№_90/11/122_dated_20.10.2010,_VAT_exempt.
Now I need to check all the thousands of contract descriptions and write in an additional column what error is made and at what place (or after which word) it is contained.
For example,
| original information | text_verification |
|---|---|
| Object № 1001 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt. | OK |
| Object № 10023__Serviced 30.11.2023 Fulfilment of obligations under agr. №90/11/122 dated 20.10.2010, VAT exempt. | double space after 10023, no space after "№" sign |
| Object № 100221 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt | No full stop at the end |
So far, I've only been able to determine if the template is correctly populated to output information or not. Using regular expressions.
template = 'Object № \d+ Serviced (0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.(20\d\d) Fulfilment of obligations under agr. № 90\/11\/122 dated 20.10.2010, VAT exempt.'
new_col = []
for index, row in df.iterrows():
if re.match(template, row['original information']):
new_col.append('OK')
else:
new_col.append('not OK')
df = df.assign(text_verification=new_col)
| original information | text_verification |
|---|---|
| Object № 1001 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt. | OK |
| Object № 10023__Serviced 30.11.2023 Fulfilment of obligations under agr. №90/11/122 dated 20.10.2010, VAT exempt. | not OK |
| Object № 100221 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt | not OK |
Could you please tell me how to more correctly and concisely determine the type of error in the text, as well as its location?
Somewhat elaborating on my own comment: you could write yourself a parser and see those entries as a domain specific language (dsa). I very much like the parsimonious library:
This would yield
Obviously, you'd need to check for the most common errors beforehand.