Internationalize C program

I have a C program, written in English, for an embedded device. It contains code like:

SomeMethod("Please select menu");
OtherMethod("Choice 1");

Say I want to support other languages, but I don't know how much memory this device has. I don't want the strings to end up in a different memory area where there might be less space, which could crash the program. I want them stored in the same memory area and taking the same space as before. So I thought of this:

SomeMethod(SELECT_MENU);
OtherMethod(CHOICE_1);

And a separate header file:

English.h

#define SELECT_MENU "Please select menu"
#define CHOICE_1 "Choice 1"

For other languages:

French.h

#define SELECT_MENU "Text in french"
#define CHOICE_1 "same here"

Now, depending on which language I want, I would include only that header file.

Does this satisfy the requirement that, if I select the English version, the strings of my internationalized program will be stored in the same memory region and take the same amount of memory as before? (I know French might take more, but that is a separate issue: French letters can take more bytes.)

I thought that, since I use defines, the strings would be placed in the same place in memory as they were before.

There are 5 best solutions below

1
On

With your approach, if you compile the program as English, the French strings will not be stored in the English version of the program.

The compiler will not even see the French strings, so they cannot end up in the final executable.
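As a rough illustration of why the unselected language never reaches the binary, the two headers can be folded into one file behind a build switch. This is only a sketch: the LANG_FRENCH macro, the French wording, and the select_menu_text/choice_1_text helpers are assumptions made for the example, not part of the original program.

```c
/* Hypothetical build switch: pass -DLANG_FRENCH to the compiler to pick French.
   In a real project each branch would live in English.h or French.h. */
#ifdef LANG_FRENCH
#define SELECT_MENU "Veuillez choisir un menu"
#define CHOICE_1    "Choix 1"
#else
#define SELECT_MENU "Please select menu"
#define CHOICE_1    "Choice 1"
#endif

/* After preprocessing, only one pair of literals remains in this file,
   so the other language is never handed to the compiler at all. */
const char *select_menu_text(void) { return SELECT_MENU; }
const char *choice_1_text(void)    { return CHOICE_1; }
```

Built without the switch, only the English literals exist in the object file; built with -DLANG_FRENCH, only the French ones do.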

In some cases the compiler does see the data, but it discards that data because the program never uses it.

For example, consider this function:

#include <stdio.h>

void foo1(void) {
    printf("qwerty\n");
}

If you define this function but never call it, the function foo1 and the string "qwerty" will usually not end up in the final executable, provided the toolchain removes unused code (this depends on compiler and linker settings).

Using a macro doesn't make any difference. For example, foo1 and foo2 compile to the same code.

#define SOME_TEXT "qwerty\n"
void foo2(void) {
    printf(SOME_TEXT);
}

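String literals are not stored on the heap or on the stack. They have static storage duration and are normally placed in a read-only data section of the executable, which on many embedded targets stays in flash. Stack space is only consumed if the text is copied into a local array, for example char buf[] = SOME_TEXT;, and that only becomes a problem if the copy is larger than the available stack, which can be quite small on an embedded device.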

So basically you don't have anything to worry about except the final size of the program.

10
On

At least on Linux and many other POSIX systems, you should look at gettext(3) (and at positional arguments in printf(3), e.g. %3$d instead of %d in the format string).

You would then write

printf(gettext("here x is %d and y is %d"), x, y);

This is common enough that the usual habit is to

#define _(X) gettext(X)

and then write

printf(_("here x is %d and y is %d"), x, y);
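The positional arguments mentioned above matter because a translation may need the values in a different order than the source text. A minimal sketch, assuming a POSIX-like C library (the %n$ form is a POSIX extension, not ISO C); the two helper functions and the reordered wording are hypothetical and stand in for what a translated catalog entry might contain:

```c
#include <stdio.h>

/* Same order as the arguments: x first, then y. */
int format_source_order(char *buf, size_t size, int x, int y)
{
    return snprintf(buf, size, "here x is %1$d and y is %2$d", x, y);
}

/* A "translated" format that mentions y before x. The call site and the
   argument order stay unchanged; only the format string differs. */
int format_swapped_order(char *buf, size_t size, int x, int y)
{
    return snprintf(buf, size, "y is %2$d and here x is %1$d", x, y);
}
```

With x = 1 and y = 2, the first helper writes "here x is 1 and y is 2" and the second writes "y is 2 and here x is 1".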

You'll also want to compile message catalogs with msgfmt(1).

You'll find several documents on internationalization (i18n) and localization, e.g. Debian Introduction to i18n. Read also locale(7). And you probably should always use UTF-8 today.

The advantage of such message catalogs (all of this is available by default on Linux systems!) is that the internationalization happens at runtime. There is no reason to restrict it to compile time. Message catalogs can be (and often are) translated by people other than the developers. You'll have directories in your file system (e.g. in some cheap flash memory, like an SD card) containing these.

Note that internationalization and localization is a difficult subject (read more documentation to understand how difficult it can be once you want to handle non-European languages), and the Linux infrastructure handles it quite well (probably better, and more efficiently, than what you are suggesting with your macros). Qt and GTK also have extensive support for internationalization (GTK builds on gettext; Qt ships its own translation tools).

5
On

Let me get this straight: you want to know whether, if preprocessor-defined constants (in your case, related to i18n) were swapped in before compiling, the strings would (a) take the same amount of memory (between the macro and non-macro version) and (b) be stored in the same program segment?

The short answer is (a) yes and (b) yes-ish.

For the first part, this is easy. Preprocessor-defined constants are whole-text replaced with their #define'd values by the preprocessor before being passed into the compiler. So, to the compiler,

#define SELECT_MENU "Please select menu"
// ...
SomeMethod(SELECT_MENU);

is read in as

SomeMethod("Please select menu");

and therefore will be identical for all intents and purposes except for how it appears to the programmer.
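One way to convince yourself of this is to compare the two forms directly. The sketch below is illustrative only; the function and enum names are made up for the example.

```c
#define SELECT_MENU "Please select menu"

/* After preprocessing, both functions contain the same string literal. */
const char *menu_via_macro(void)   { return SELECT_MENU; }
const char *menu_via_literal(void) { return "Please select menu"; }

/* sizeof on a string literal counts its characters plus the terminating '\0',
   so the macro and the plain literal occupy the same number of bytes. */
enum {
    MENU_BYTES_MACRO   = sizeof(SELECT_MENU),
    MENU_BYTES_LITERAL = sizeof("Please select menu")
};
```

Both enum constants evaluate to 19 (18 characters plus the terminator), and the two functions return equal strings.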

For the second part, things are a bit more complex. A string literal in a C program normally ends up in the program's read-only data segment. The exception is a literal used as the initializer of a char array: the array is then a separate object, placed on the stack (for a local variable) or in the data segment (for a static or global one), and its contents are copied in from the initializer. The heap is not involved unless you allocate memory explicitly (this is discussed in the answers to this question). So the outcome depends on how the preprocessor-defined constant is used in the program.

Considering what I said in the first part, if you have char buffer[] = MY_CONSTANT; inside a function, the array lives on the stack and is initialized each time the function runs, which adds a little to the code or read-only data size. If you have someFunction(MY_CONSTANT);, or char* c_str = MY_CONSTANT;, then the text will likely be stored in the read-only data segment, and you will receive a pointer to that area at runtime. There are many ways this may manifest in your actual program; having the strings #define'd does not by itself determine how they will be stored in your compiled program, although if they are only used in certain ways, you can be reasonably certain where they will end up.
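The difference between the pointer form and the array form can be sketched as follows. MY_CONSTANT is given an arbitrary value here, and both function names are hypothetical.

```c
#include <stddef.h>

#define MY_CONSTANT "Choice 1"

/* Pointer form: c_str points at the literal itself, which has static
   storage duration. Only the pointer is a local variable. */
size_t pointer_form_size(void)
{
    const char *c_str = MY_CONSTANT;
    (void)c_str;                    /* silence unused-variable warnings */
    return sizeof c_str;            /* size of a pointer, not of the text */
}

/* Array form: buffer is a separate, writable copy of the text.
   As a local variable it occupies stack space for the whole string. */
size_t array_form_size(char *first_char)
{
    char buffer[] = MY_CONSTANT;
    buffer[0] = 'c';                /* legal: modifies the copy, not the literal */
    *first_char = buffer[0];
    return sizeof buffer;           /* 9: eight characters plus '\0' */
}
```

The array form costs stack space proportional to the string length on every call, while the pointer form costs only one pointer.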

EDIT Modified first half of answer to accurately address what is being asked, thanks to @esm's comment.

9
On

The pre-processor use here is simple substitution: there is no difference in the executable code between

SomeMethod("Please select menu");

and

#define SELECT_MENU "Please select menu"
...
SomeMethod(SELECT_MENU);

But the memory usage is unlikely to be exactly the same for each language.

In practice, messages are often more complicated than a simple translation. For example in the message

Input #4 is dangerous

Would you have

#define DANGER "Input #%d is dangerous"
...
printf(DANGER, inpnum); 

Or would you do

#define DANGER "Dangerous input #"
...
printf(DANGER); 
printf("%d", inpnum); 

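The second form bakes English word order into the code, which a translation may not be able to follow. I use these examples to show that you must plan for language versions from the outset, not bolt them on afterwards.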

Since you mention "a device" and are concerned with memory usage, I guess you are working on an embedded system. My own preferred method is to provide language modules containing an array of words or phrases, with #define to reference the array element to use to piece together a message. That could also be done with enum.

For example (in practice the English language source file would be included separately):

#include <stdio.h>

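/* Phrase table for one language; the strings themselves must not be modified. */
const char *message[] = { "Input",
                          "is dangerous" };

/* Indices into message[] */
#define M_INPUT     0
#define M_DANGER    1

int main(void)
{
    int input = 4;
    printf("%s #%d %s\n", message[M_INPUT], input, message[M_DANGER]);
    return 0;
}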

Program output:

Input #4 is dangerous
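The enum variant mentioned above could look roughly like this. It is a sketch under assumptions: the LANG_FRENCH build switch, the unaccented French wording, and the format_danger helper are all invented for the example.

```c
#include <stdio.h>

/* Message identifiers; M_COUNT sizes the table. */
enum message_id { M_INPUT, M_DANGER, M_COUNT };

/* Hypothetical build switch: -DLANG_FRENCH selects the table at compile
   time, so only one language is compiled into the program. */
#ifdef LANG_FRENCH
static const char *const message[M_COUNT] = {
    [M_INPUT]  = "Entree",
    [M_DANGER] = "est dangereuse",
};
#else
static const char *const message[M_COUNT] = {
    [M_INPUT]  = "Input",
    [M_DANGER] = "is dangerous",
};
#endif

/* Builds "<input word> #<n> <danger phrase>" into buf and returns
   whatever snprintf returns. */
int format_danger(char *buf, size_t size, int input)
{
    return snprintf(buf, size, "%s #%d %s",
                    message[M_INPUT], input, message[M_DANGER]);
}
```

Declaring the table as const char *const lets the toolchain place the pointer table itself in read-only memory as well, where the target supports that. Note that this still fixes the word order in the format string, so it shares the limitation discussed earlier.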

2
On

To answer the question: yes, the English version with macros takes the same amount of memory as the English version without macros, and its strings are placed in the same section of the program.

The C preprocessor (CPP) replaces every instance of a macro with the string for the selected language; after the preprocessor has run, it is as if the macros were never there. The strings are still placed in the read-only data section of the binary, assuming the target supports one, just as if you had not used macros.

So, to summarize, the English version with macros and the English version without macros are the same as far as the C compiler is concerned, see link