Using the regex library to create a lexical analyzer in C++?


I am trying to write an XML scanner in C++. I would ideally like to use the regex library as it would be much easier.

However, I'm a little stumped as to how to do it. First, I need to create a regular expression for each token in the language. I could use a map that pairs each regex with the name of its token.

Next, I would open an input file and use an iterator to step through the strings in the file, matching each one against a regex. However, in XML, you don't have whitespace to separate the strings.

So my question is: will this method even work? Also, how exactly will the regex library fit my needs? Is regex_match enough, in a foolproof way, so that my scanner isn't tricked?

I'm just trying to create a skeleton of the process in my head so that I can start working on this. I wanted some input from others to see if I'm thinking about the problem correctly.

I'd appreciate any thoughts on this. Thanks so much!


There are 2 best solutions below

rici On

Lexical analysis usually proceeds by sequentially matching tokens, where each token corresponds to the longest possible match from a set of possible regular expressions. Since each match is anchored where the previous token ended, no searching is performed.

Here, I use the word "token" slightly loosely; whitespace and comments are also matched as tokens, but in most programming languages they are simply ignored after being recognised. A conformant XML tokenizer would need to recognize them as tokens, though, so the usage would be precise for your problem domain.
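As a rough illustration of that loop with the standard regex library, here is a minimal sketch. The `Token` struct, the `tokenize` function, the rule names, and the patterns themselves are all hypothetical and deliberately incomplete (no attributes, entities, CDATA, or processing instructions). Since `std::regex` has no built-in way to pick the longest match across several patterns, the sketch simply tries every rule at the current position and keeps the longest one. The `match_continuous` flag makes `regex_search` accept only a match that begins exactly at the current position, so no searching takes place.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record: the name of the rule that matched, plus the text.
struct Token {
    std::string kind;
    std::string text;
};

// Split `input` into tokens by repeatedly taking the longest match among the
// rules, each attempt anchored where the previous token ended.
std::vector<Token> tokenize(const std::string& input) {
    // Illustrative rule set only; this is NOT a complete XML grammar.
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"COMMENT",   std::regex(R"(<!--([^-]|-[^-])*-->)")},
        {"CLOSE_TAG", std::regex(R"(</[A-Za-z_][-A-Za-z0-9_.:]*\s*>)")},
        {"OPEN_TAG",  std::regex(R"(<[A-Za-z_][-A-Za-z0-9_.:]*\s*>)")},
        {"SPACE",     std::regex(R"(\s+)")},
        {"TEXT",      std::regex(R"([^<]+)")},
    };

    std::vector<Token> tokens;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        const std::string* best_kind = nullptr;
        std::ptrdiff_t best_len = 0;
        for (const auto& rule : rules) {
            std::smatch m;
            // match_continuous: the match must start exactly at `pos`.
            if (std::regex_search(pos, input.cend(), m, rule.second,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > best_len) {
                // Strictly longer wins; on a tie the earlier rule is kept.
                best_len = m.length(0);
                best_kind = &rule.first;
            }
        }
        if (best_kind == nullptr) {
            throw std::runtime_error("no token matches at offset " +
                                     std::to_string(pos - input.cbegin()));
        }
        tokens.push_back({*best_kind, std::string(pos, pos + best_len)});
        pos += best_len;
    }
    return tokens;
}
```

With these rules, `tokenize("<a> hi</a>")` produces an OPEN_TAG, a TEXT, and a CLOSE_TAG token. Whitespace and comments come back as ordinary tokens, so the caller decides whether to keep or drop them. Trying every rule at every position is simple but slow, which is part of why a generated scanner is usually preferred.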

Rather than immersing yourself in a sea of annoying details, you might want to learn about (f)lex, which efficiently implements this algorithm given a collection of regular expressions. It also takes care of buffer handling and some other details which let you concentrate on understanding the nature of the lexical analysis process.

Dr. Alex RE On

There is a tool for this called RE/flex, which generates scanners:

https://sourceforge.net/projects/re-flex

The generated scanners use regex engines such as Boost.Regex. Boost.Regex is used via an API that handles different types of input, so there is some additional C++ code involved; these are not the bare-bones Boost.Regex API calls that you may be looking for.

The examples included with RE/flex include an XML scanner in C++ that may help you get started. RE/flex also supports UTF-8 encoding, which you will need to scan XML properly.