Using regex library to create lexical analyzer in C++?

1.1k Views Asked by Jane Doe At 12 October 2016 at 02:15

I am trying to write an XML scanner in C++. I would ideally like to use the regex library as it would be much easier.

However, I'm a little stumped as to how to do it. First, I need to create a regular expression for each token in the language. I could use a map that pairs each token name with its regex.

Next, I would open an input file and use an iterator to walk through the strings in it, matching each one against a regex. However, XML doesn't use whitespace to separate its tokens.

So my question is: will this method even work? Also, how exactly will the regex library fit my needs? Is regex_match enough to make the scanner foolproof, so that it can't be tricked?

I'm just trying to create a skeleton of the process in my head so that I can start working on this. I wanted some input from others to see if I'm thinking about the problem correctly.

I'd appreciate any thoughts on this. Thanks so much!


There are 2 best solutions below

rici On 12 October 2016 at 16:57

Lexical analysis usually proceeds by sequentially matching tokens, where each token corresponds to the longest possible match from a set of possible regular expressions. Since each match is anchored where the previous token ended, no searching is performed.

Here, I use the word "token" slightly loosely; whitespace and comments are also matched as tokens, but in most programming languages they are simply ignored after being recognised. A conformant XML tokenizer would need to recognize them as tokens, though, so the usage would be precise for your problem domain.
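As a rough illustration of the anchored, longest-match loop described above, here is a minimal sketch using std::regex. The names Rule, Token, make_rules and tokenize are hypothetical, and the patterns are deliberately simplified assumptions (no attributes, CDATA, entities or processing instructions), so this is not a conformant XML scanner. It uses std::regex_search with the match_continuous flag so that every attempt must start exactly where the previous token ended; std::regex_match is not suitable here because it only succeeds when the whole remaining input matches.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// One lexical rule: a token name plus the regex that recognises it.
struct Rule {
    std::string name;
    std::regex pattern;
};

// One recognised token: which rule matched and the text it covered.
struct Token {
    std::string name;
    std::string text;
};

// Simplified, illustrative rules only (assumption: tags carry no attributes).
// Order matters only for ties: the earlier rule wins when lengths are equal.
std::vector<Rule> make_rules() {
    return {
        {"COMMENT",    std::regex(R"(<!--([^-]|-[^-])*-->)")},
        {"CLOSE_TAG",  std::regex(R"(</[A-Za-z_:][A-Za-z0-9_.:-]*\s*>)")},
        {"OPEN_TAG",   std::regex(R"(<[A-Za-z_:][A-Za-z0-9_.:-]*\s*/?>)")},
        {"WHITESPACE", std::regex(R"(\s+)")},
        {"TEXT",       std::regex(R"([^<]+)")},
    };
}

// Repeatedly try every rule at the current position and keep the longest match.
std::vector<Token> tokenize(const std::string& input, const std::vector<Rule>& rules) {
    std::vector<Token> tokens;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        const Rule* best = nullptr;
        std::ptrdiff_t best_len = 0;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous: the match must begin at pos, so nothing is skipped.
            if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > best_len) {
                best = &rule;
                best_len = m.length(0);
            }
        }
        if (best == nullptr) {
            // No rule produced a non-empty match: report where scanning stopped.
            throw std::runtime_error("no rule matches at offset " +
                                     std::to_string(pos - input.cbegin()));
        }
        tokens.push_back({best->name, std::string(pos, pos + best_len)});
        pos += best_len;
    }
    return tokens;
}
```

Calling tokenize("<a>hi</a>", make_rules()) would yield an OPEN_TAG, a TEXT and a CLOSE_TAG token. Whitespace and comments come back as tokens too, so the caller decides whether to keep or drop them. Trying every regex at every position is simple but slow, which is part of why a generated scanner is usually preferred.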

Rather than immersing yourself in a sea of annoying details, you might want to learn about (f)lex, which efficiently implements this algorithm given a collection of regular expressions. It also takes care of buffer handling and some other details which let you concentrate on understanding the nature of the lexical analysis process.

Dr. Alex RE On 28 November 2016 at 03:00

There is a tool for this called RE/flex, which generates scanners:

https://sourceforge.net/projects/re-flex

The generated scanners use regex engines such as Boost.Regex. Boost.Regex is accessed through an API that handles different types of input, so there is some additional C++ code involved; these are not the bare-bones Boost.Regex API calls that you may be looking for.

The examples included with RE/flex include an XML scanner in C++ that may help you get started. RE/flex also supports UTF-8 encoding, which you will need to scan XML properly.
