I am looking for better ways of parsing structured text data in PHP and getting that data into a PHP object graph. I have seen many PHP parsers for a variety of text-based file formats, but nearly all of them seem to be a brittle chain of regular expressions. There must be a better way!
In this specific case I am looking to parse MT940 files (bank account transactions), but I have run into the same problem with other file formats as well. Invariably I end up with a big chain of regexes that becomes hard to maintain, especially when different formats need to be supported. MT940 has this problem too: it isn't a strictly defined format, and pretty much every bank uses a slightly different dialect.
So, how do you design parsers that are more robust and can be extended to deal with different dialects?
Here's an example MT940 statement, taken from this question:
{1:F01AHHBCH110XXX0000000000}{2:I940X N2}{3:{108:XBS/091502}}{4:
:20:XBS/091202/0001
:25:5887/507004-50
:28C:140/1
:60F:C0914CHF7789,
:61:0912021202D36,80NTRFNONREF//0887-1202-29-941
04392579-0 LUTHY + xxx, ZUR
:86:6034?60LUTHY + xxxx, ZUR vom 01.12.09 um 16:28 Karten-Nr. 2232
2579-0
:62F:C091202CHF52,2
:64:C091302CHF52,2
-}
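
To make the question concrete, here is a minimal sketch of the direction I have in mind, not a finished design. It first splits the statement into tagged fields and only then hands each field to a small per-tag handler, so a bank dialect would override individual handlers instead of touching one big regex chain. All names (`tokenizeMt940`, `Mt940Parser`, `withHandlers`, `parseBalance`) are hypothetical, and the balance layout (credit/debit mark, six-digit date, currency, amount) is an assumption based on the `:62F:` line in the sample above.

```php
<?php
declare(strict_types=1);

/**
 * Split the text block of a statement into tag/value pairs.
 * A field starts on a line beginning with ":<tag>:"; any other line is
 * treated as a continuation of the previous field. Lines before the
 * first tag (the header) and the closing "-}" are skipped.
 */
function tokenizeMt940(string $body): array
{
    $fields = [];
    foreach (preg_split('/\R/', trim($body)) as $line) {
        if (preg_match('/^:(\d{2}[A-Z]?):(.*)$/', $line, $m)) {
            $fields[] = ['tag' => $m[1], 'value' => $m[2]];
        } elseif ($fields !== [] && $line !== '-}') {
            $fields[count($fields) - 1]['value'] .= "\n" . $line;
        }
    }
    return $fields;
}

/** Hypothetical handler for balance fields such as :62F: (assumed layout). */
function parseBalance(string $value): ?array
{
    if (!preg_match('/^([CD])(\d{6})([A-Z]{3})([\d,]+)$/', $value, $m)) {
        return null; // does not match the assumed layout
    }
    return ['mark' => $m[1], 'date' => $m[2], 'currency' => $m[3], 'amount' => $m[4]];
}

final class Mt940Parser
{
    /** @var array<int|string, callable> handlers keyed by tag */
    private array $handlers;

    public function __construct(array $handlers)
    {
        $this->handlers = $handlers;
    }

    /** Return a copy in which a dialect's handlers take precedence. */
    public function withHandlers(array $overrides): self
    {
        // Array union instead of array_merge: numeric tags such as "20"
        // become integer keys, and array_merge would renumber them.
        return new self($overrides + $this->handlers);
    }

    public function parse(string $body): array
    {
        $result = [];
        foreach (tokenizeMt940($body) as $field) {
            $tag = $field['tag'];
            // Fields without a handler keep their raw string value.
            $value = isset($this->handlers[$tag])
                ? ($this->handlers[$tag])($field['value'])
                : $field['value'];
            $result[] = ['tag' => $tag, 'value' => $value];
        }
        return $result;
    }
}
```

A dialect would then be something like `$base->withHandlers(['86' => $bankSpecificHandler])`, where `$bankSpecificHandler` is again hypothetical. Is this kind of tokenizer plus handler table a reasonable approach, or is there a more established technique for this?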