I am trying to write an XML scanner in C++. I would ideally like to use the regex library as it would be much easier.
However, I'm a little stumped as to how to do it. First, I need to create a regular expression for each token in the language. I could use a map to associate each token's name with its regex.
Next, I would open an input file and use an iterator to step through the strings in the file, matching each one against a regex. However, XML doesn't have whitespace separating its strings.
So my question is: will this method even work? And how exactly does the regex library fit my needs? Is regex_match robust enough that my scanner can't be tricked?
I'm just trying to create a skeleton of the process in my head so that I can start working on this. I wanted some input from others to see if I'm thinking about the problem correctly.
I'd appreciate any thoughts on this. Thanks so much!
Lexical analysis usually proceeds by sequentially matching tokens, where each token corresponds to the longest possible match from a set of possible regular expressions. Since each match is anchored where the previous token ended, no searching is performed.
Here, I use the word "token" slightly loosely; whitespace and comments are also matched as tokens, but in most programming languages they are simply ignored after being recognized. A conformant XML tokenizer would need to report them as tokens, though, so the usage is precise for your problem domain.
Rather than immersing yourself in a sea of annoying details, you might want to learn about (f)lex, which efficiently implements this algorithm given a collection of regular expressions. It also takes care of buffer handling and other details, letting you concentrate on understanding the nature of the lexical analysis process.