I'm trying to use the npm SGML library here to parse OFX data. OFX versions 1 through 1.6 are based on SGML, and later versions on XML.
My plan thus far is to use SGML to convert all OFX files into fully normalised XML (proper end tags etc.) and then use the xml2json library to convert the XML into JSON objects I can use in JavaScript.
This is what I have so far, but it throws the error "content must start with document element when document type isn't specified".
var entitymanager = new sgml.NoopEntitymanager();
var errorhandler = new sgml.Errorhandler();
var resolver = new sgml.Resolver();
var parser = new sgml.Parser();
const readableStream = new Stream.Readable();
let outputhandler = new sgml.Outputhandler(readableStream, entitymanager);
outputhandler.output_format = "xml";
parser.documentHandler = outputhandler;
parser.dtdHandler = outputhandler;
parser.errorHandler = errorhandler;
parser.lexicalHandler = outputhandler;
parser.entityResolver = resolver ;
let recordmanager = new sgml.PlatformStringRecordmanager(errorhandler, parser);
let fileData = await fs.readFile("MY FILE PATH");
recordmanager.set_input(fileData);
parser.recordManager = recordmanager;
recordmanager.start_records(); // throws.
Following along in the debugger, I can see it starts to read the file and gets a few tags in before the error is thrown.
Here are some examples of the files I wish to process: https://github.com/actualbudget/actual/tree/master/packages/loot-core/src/mocks/files
Thanks!
You didn't include the particular data file to parse, so I just picked data.ofx from https://github.com/actualbudget/actual/blob/master/packages/loot-core/src/mocks/files/data.ofx. Looking at that file reveals the following two issues:

A document type declaration (DOCTYPE) is missing; without markup declarations, an SGML parser can't infer e.g. omitted end-element tags (which OFX seems to use a lot, more specifically on nearly every non-container element). Apparently, an official OFX DTD can be downloaded from https://financialdataexchange.org/common/Uploaded%20files/OFX%20files/OFX1.6.zip, so download and unzip that file, which will place ofx160.dtd in your directory. Now that ofx160.dtd file isn't actually a DTD as commonly understood, since it contains a <!DOCTYPE ... [ line at the beginning and a DOCTYPE-closing ]> line at the end, when a DTD is supposed to contain only the markup declarations between these two lines. So that file is intended as a fragment to manually prepend to an OFX data file, which is what we're going to do here. ofx160.dtd also seems to contain (in lines 1342 and 2032, respectively) invalid SGML comments with excess space characters in the comment close marker -->, and also contains Windows CR/LF sequences in unexpected places. Long story short, I've prepended the data file with a version of ofx160.dtd with all comments (everything between <!-- and -->) removed and all Windows CR/LF sequences converted to plain linefeed (LF) characters. Unfortunately, due to space restrictions I can't post the complete working DTD here.

Moreover, the data file starts with a couple of "file header" lines such as OFXHEADER:100, etc.; these aren't normal SGML (though there's a way to make those parse using the SGML SHORTREF feature, but for that to work, the DTD needs extra declarations), so we're going to remove those lines.

So this is how data.ofx should look at this point:

If you invoke sgmlproc (the command-line utility that comes with the sgml npm package), it will (rightly) complain about the data not matching the schema, but if you remove the lines containing <INTU.BID> and <INTU.USERID>, or adapt the content model declaration for the SONRS element to accept INTU.BID and INTU.USERID, then sgmlproc will be able to parse data.ofx successfully, and the equivalent setup of an SGML parsing pipeline via JavaScript sketched in your question should be able to as well.

You'll have to manage removing the extra "header" lines, prepending the (compact) DTD, and removing/accepting those INTU.... elements programmatically, though.
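
A minimal sketch of that programmatic preprocessing, using plain string operations only and not touching the sgml package itself. The function names (cleanDtd, prepareBody, buildSgmlDocument) are made up for illustration. It assumes the body of the data file starts at the first <OFX> tag, that each INTU.* element sits on a line of its own, and that comments end with -- optionally followed by whitespace and then >; the malformed comments mentioned above may still need fixing by hand if they don't fit that pattern.

```javascript
// Hypothetical helpers: plain text preprocessing, no SGML parsing involved.

// Normalise line endings and drop comments from the downloaded ofx160.dtd text.
function cleanDtd(dtdText) {
  return dtdText
    .replace(/\r\n/g, "\n")               // Windows CR/LF -> LF
    .replace(/<!--[\s\S]*?--\s*>/g, "");  // everything between <!-- and -->
}

// Drop the "KEY:VALUE" header lines and the INTU.* lines from the data file.
function prepareBody(ofxText) {
  const text = ofxText.replace(/\r\n/g, "\n");
  // Assumption: the SGML body begins at the first <OFX> start tag.
  const start = text.search(/<OFX>/i);
  if (start === -1) {
    throw new Error("no <OFX> start tag found");
  }
  // Assumption: every INTU.* element occupies exactly one line.
  return text.slice(start).replace(/^[ \t]*<INTU\.[^>]*>.*\n?/gim, "");
}

// Prepend the cleaned DOCTYPE fragment to the cleaned body.
function buildSgmlDocument(dtdText, ofxText) {
  return cleanDtd(dtdText).trimEnd() + "\n" + prepareBody(ofxText);
}
```

The string returned by buildSgmlDocument would take the place of fileData in the recordmanager.set_input(...) call from the question; whether that call expects a string or a Buffer is not something this sketch verifies. Dropping the INTU.* lines is the simpler of the two options; extending the SONRS content model in the DTD instead would keep that data.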