I am attempting to write a simplistic C89 --> x86_64 compiler, based on this C89 standard draft, in C89, for learning's sake. So far, I am implementing translation phase 1. My understanding is that this consists of
- Reading the code into a string.
- Replacing the trigraph sequences.
I have tried to implement this with a program (please forgive any style errors I have made):
char *trigraph_replacement(char *code)
{
char *temp = calloc(1, strlen(code));
char *temp_1 = temp;
char *code_1 = code;
for (; *code_1; code_1++)
{
if (strncmp(code_1, "??", 2) == 0)
{
code_1 += 2;
switch (*code_1)
{
case '<':
*(temp_1) = '{';
break;
case '>':
*(temp_1) = '}';
break;
case '(':
*(temp_1) = '[';
break;
case ')':
*(temp_1) = ']';
break;
case '=':
*(temp_1) = '#';
break;
case '/':
*(temp_1) = '\\';
break;
case '\'':
*(temp_1) = '^';
break;
case '!':
*(temp_1) = '!';
break;
case '-':
*(temp_1) = '~';
break;
default:
break;
}
}
else
{
*temp_1 = *code_1;
}
temp_1++;
}
free(code);
return temp;
}
Now, intuitively, it seems that this should do what is is supposed to do, replace all the trigraphs. However, the gcc docs say that "Trigraphs are not popular and many compilers implement them incorrectly". It goes on to state that "portable code should not rely on trigraphs being either converted or ignored"
As a result, I am wondering
- Is my implementation sufficient or did I err in some way?
- Are trigraphs worth implementing in the first place, or are they not used, even in the most ancient legacy programs?
Leaving aside a number of technical issues and inefficiencies, which I'll get to later, the answer to your question:
is, "Yes, you did".
1. The bug in this code
Here's a highly abbreviated view of the loop in
trigraph_replacement
(you might hate my brace formatting; I do it this way so as to reduce vertical space wastage):If the
strncmp
returns a non-zero value, there's no problem; the effect is:If the
strncmp
returns 0 and the first (or any other specified)case
matches, again no problem:But what if the
strncmp
returns 0 but the next character was, say, a space:The
default
case involves the following sequence:So the code skips over the
??
and the following space, and fails to store anything into the corresponding position in the output buffer. The result will be:(The
Ø
represents an actual NUL character; sincecode
was allocated withcalloc
, the unset character position will be 0. That's not likely to play well with the next phase of the translation; it might prematurely terminate the source code, or produce an invalid source character error, or just get dropped on the floor, but none of those are going to be what the user intended.)Perhaps worse, though, is the following example (§5.2.1.1para3 in the standard):
The trigraph sequence in that example starts at the second
?
, which your code will never notice (and might not notice even if you fix the default case, depending on how you fix it). The intended translation isprintf("Eh?\n");
, after the trigraph replaces??/
with a single backslash, which clearly must happen before the backslash escape is interpreted.That is indeed one of the classic implementation errors which the GCC docs were alluding to.
2. Other implementation difficulties
Another common error, which your code won't suffer from if you perform the translation phases in order, is to fail to handle trigraphs in certain contexts. This typically happens in lexers which attempt to handle trigraphs (and line splicing) in parallel with the lexical decomposition, which I think was mentioned in a comment somewhere.
The argument, which I find reasonable, is that since neither trigraphs nor splices are very common at all [Note 1], even a very slow implementation in the lexer which only happens if necessary is probably a lot faster than wasting cycles on a translation pass which is practically never needed.
But some of the pathological cases, which no sensible programmer would ever actually employ, are very easy to get wrong. Consider the following:
In case it wasn't obvious, that's a comment because trigraph replacement is in phase 1; line splicing is in phase2; and comment recognition is in phase 3. After phase 1, the input is
and after phase 2, it's
That's easy to do if phase 1 and phase 2 are independent of the lexer. But the token regex which recognises
/??/<newline>*
as a comment starter and*??/<newline>/
as a comment terminator, along with all the other wacky variants, is not very pleasant to write [Note 2].Finally, I know that it's tempting to slurp the entire input file into memory in order to parse it as a single contiguous string, but it's not really a very nice pattern. The very common antipattern of "seek to the end of the file and call
tell
to see how long the file is" is both an invitation to buffer overflow (the file could be longer when you actually read it, because it's still being written to) and a way of making it impossible to use a Unix pipe as an input source. But even if you read the file more carefully, you'll end up using a lot more memory than necessary, and you won't be able to start processing the source until it's all available. You might not care, and these days I'd probably accept that argument, but it's something to think about.On the other hand, if you read the file buffer by buffer, you need to cope with the annoying case where the start of the trigraph is at the end of the buffer but the last character hasn't yet been read. Fortunately, both trigraphs and splices are short, so you can solve this issue with a small setaside buffer or even a very small state machine. But it's extra code which needs to be written and debugged.
3. Not exactly a code review
As written, the prototype
requires the argument to
trigraph_replacement
to be a mutable dynamically-allocated buffer whose ownership is transferred to the function, which ends up byfree
ing it and returning a new dynamically-allocated buffer.That kind of ownership transfer is a recipe for disaster (particularly when it is undocumented). Unless you have a really good reason, functions which take string arguments should either leave them alone, or they should modify them in place. In-place modification is an efficiency hack; it's not always possible --indeed, it is often not possible-- but in this particular case, it is possible so you might want to consider it. Since the actual use of trigraphs is probably limited to the test cases you write yourself, there's really not much point burdening production code with a pointless of double allocation and copy of the entire input.
In-place modification is possible because all trigraph replacements replace three bytes with a single byte [Note 3]. Line-splicing is another process which always makes the output shorter, and with a bit of care you can do both of them at the same time. So if you document the fact that you are planning on modifying (but not freeing!) the input, it's completely reasonable to write the function as something like this:
That loop is careful to only advance the input pointer over the
??
when it has verified that the??
starts a trigraph. I made no attempt to special case the various possibilities in which you could copy and advance one or two characters, on the grounds that they don't happen often enough to justify the additional code complexity.4. What if in-place modification isn't acceptable?
Not everyone like in-place modification (including me, sometimes), so it's worth thinking about an alternative. Whatever alternative you choose, it's important to think about (and document) the dynamic storage allocation strategy.
For example, it would be tempting to simply return the argument unmodified in the common case that no trigraphs were encountered. But that makes life difficult for the caller because they won't know whether the output string is an additional dynamically-allocated memory region, or the same one. So there is a good argument for unconditionally copying the input to a newly allocated output. (It's possible that the caller could take advantage of the fact that the copy is unconditional, for example by eliminating an unnecessary copy of the input buffer.)
However, the only mechanism C provides for computing the length of a copy is
strlen
, which requires a complete scan of the buffer (recalling that the buffer is the entire program, which could be quite large). It's actually highly likely that the caller knows how long the input it, since they must have acquired it from somewhere, and most input functions tell you how much data was read [Note 4]. So it's entirely reasonable to require that the caller tell you how long the input is, resulting in a prototype likeIf for whatever reason, the caller doesn't know how long
code
is, they can callstrlen
, but there's no need to penalize calls which do know. Regardless, it's vital that enough space be allocated for the entire string including the terminating NUL character. So you will want to usemalloc(codelen + 1)
orcalloc(1, codelen + 1)
. (If you're careful, you don't really needcalloc
.)Notes
The vast majority of line splices are surrounded by whitespace, and those can be trivially handled with a small modification to the whitespace regex pattern. The same goes for splices in string literals. Much more problematic are the splices which are in the middle of a token, particularly complicated tokens like pp-numbers, or the horrific case described below of a splice composed from a trigraph.
It's not impossible, and I know some readers will take it as a challenge. That's cool. But are you sure that you got all the cases covered? I expect to see a lot of test cases :-)
The characters into which the trigraphs are converted are all in the basic character set, which must be encoded into a single byte. See §5.2.1.2.
fgets
should not be used to read multiline input, since it involves an additional scan for newline characters. Aside from the fact that it doesn't tell you how many characters it read, leaving the necessity to do yet another scan for the null character.