Why are string literals parsed for trigraph sequences in Gnu gcc/g++?

527 Views Asked by At

Consider this innocuous C++ program:

#include <iostream>
int main() {
  std::cout << "(Is this a trigraph??)" << std::endl;
  return 0;
}

When I compile it using g++ version 5.4.0, I get the following diagnostic:

me@my-laptop:~/code/C++$ g++ -c test_trigraph.cpp
test_trigraph.cpp:4:36: warning: trigraph ??) ignored, use -trigraphs to enable [-Wtrigraphs]
   std::cout << "(Is this a trigraph??)" << std::endl;
                                     ^

The program runs, and its output is as expected:

(Is this a trigraph??)

Why are string literals parsed for trigraphs at all?

Do other compilers do this, too?

2

There are 2 best solutions below

1
On BEST ANSWER

Trigraphs were handled in translation phase 1 (they are removed in C++17, however). String literal related processing happens in subsequent phases. As the C++14 standard specifies (n4140) [lex.phases]/1.1:

The precedence among the syntax rules of translation is specified by the following phases.

  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences ([lex.trigraph]) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set ([lex.charset]) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

This happened first, because as you were told in comments, the characters that trigraphs stood for needed to be printable as well.

1
On

This behavious is inherited from C compilers and the old time when we used serial terminals where only 7 bits were used (the 8th being a parity bit). To allow non English languages with special characters (for example the accented àéèêîïôù in French or ñ in Spanish) the ISO/IEC 646 code pages used some ASCII (7bits) code to represent them. In particular, the codes 0x23, 0x24 (#$ in ASCII) 0x40 (@), 0x5B to 0x5E([\]^), 0x60 (`) and 0x7B to 0x7E ({|}~) could be replaced by national variants1.

As they have special meaning in C, they could be replaced in source code with trigraphs using only the invariant part of the ISO 646.

For compatibility reasons, this has been kept up to the C++14, when only dinosaurs still remember of the (not so good) days of ISO646 and 7 bits only code pages.


1 For example, the French variant used: 0x23 £, 0x40 à 0x5B-0x5D °ç§, 0x60 µ, 0x7B-0x7E éùè¨