Explaining this passage in "About size_t and ptrdiff_t"


In his blog entry "About size_t and ptrdiff_t", Andrey Karpov concludes with:

As the reader can see, using ptrdiff_t and size_t types gives some advantages for 64-bit programs. However, it is not an all-out solution for replacement of all unsigned types with size_t ones. Firstly, it does not guarantee correct operation of a program on a 64-bit system. Secondly, it is most likely that due to this replacement, new errors will appear, data format compatibility will be violated, and so on. You should not forget that after this replacement, the memory size needed for the program will greatly increase as well. Increase of the necessary memory size will slow down the application's work, for the cache will store fewer objects being dealt with.

I don't understand the following claims, and I don't see them addressed in the article:

"it is most likely that due to this replacement, new errors will appear, data format compatibility will be violated, and so on."

How is that likely? How can there be no error before the migration, yet the type migration results in an error? It's not clear, given that the types (size_t and ptrdiff_t) seem to be more restrictive than what they're replacing.

You should not forget that after this replacement, the memory size needed for the program will greatly increase as well.

I'm unclear on how or why the memory size needed would "greatly" increase, or increase at all. I understand, though, that if it did, Andrey's conclusions would follow.


There are 3 best solutions below

4
On

The article contains very dubious claims.

First of all, size_t is the type returned by sizeof. uintptr_t is an integer type that can store any pointer to void.

The article claims that size_t and uintptr_t are synonymous. They're not. For example, on segmented MS-DOS with the large memory models, the maximum number of elements in an array would fit in a 16-bit size_t, but a pointer requires 32 bits. The two types do happen to have the same size on the flat memory models of today's common Windows and Linux platforms.
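A minimal sketch of the distinction, using hypothetical helper names that are not from the article: uintptr_t (an optional type in the C standard) is the integer type meant for holding a converted pointer, while whether size_t has the same width is a property of the platform that can only be checked, not assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts a pointer to uintptr_t and back. The C standard guarantees that
   the restored pointer compares equal to the original; it makes no such
   promise for size_t or ptrdiff_t. */
int roundtrip_preserves_pointer(void)
{
    int object = 0;
    void *original = &object;
    void *restored = (void *)(uintptr_t)original;
    return restored == original;
}

/* Returns 1 when size_t happens to be as wide as uintptr_t on this platform,
   0 otherwise. True on common flat-memory systems, not true in general. */
int size_t_matches_uintptr_width(void)
{
    return sizeof(size_t) == sizeof(uintptr_t);
}
```

Calling roundtrip_preserves_pointer() should return 1 on any implementation that provides uintptr_t; the second function merely reports what the current platform does.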

Even worse is the claim that you can store a pointer in ptrdiff_t, or that it would be synonymous with intptr_t:

The size of size_t and ptrdiff_t always coincide with the pointer's size. Because of this, it is these types which should be used as indexes for large arrays, for storage of pointers and, pointer arithmetic.

That's not true at all. ptrdiff_t is the type of the result of pointer subtraction, but pointer subtraction is defined only when both pointers point into the same array object or just past its end, not just anywhere in memory.

On the other hand, ptrdiff_t could be chosen to be larger than size_t. This is because if you have an array of more than SIZE_MAX / 2 elements, subtracting a pointer to the first element from a pointer to the last element or just beyond it has undefined behaviour if ptrdiff_t is of the same width as size_t, since the result is not representable. Indeed, the C11 standard says that size_t can be as narrow as 16 bits, but ptrdiff_t must be at least 17 bits wide (http://port70.net/~nsz/c/c11/n1570.html#7.20.3).

On Linux, ptrdiff_t and size_t are of the same size, and it is possible on 32-bit Linux to allocate an object that is larger than PTRDIFF_MAX bytes. And, as was pointed out in the comments, the standard doesn't even require ptrdiff_t to be of the same rank as size_t, though such an implementation would be pure evil.
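The following sketch illustrates both points under stated assumptions: element_distance, demo_distance and exceeds_ptrdiff_range are hypothetical names chosen for illustration. The first two show the one use ptrdiff_t is designed for (subtracting pointers into the same array); the third checks whether an object size is too large for such a difference to be representable.

```c
#include <stddef.h>
#include <stdint.h>

/* Distance in elements between two pointers into the SAME array.
   Subtracting pointers into different objects is undefined behaviour. */
ptrdiff_t element_distance(const int *first, const int *last)
{
    return last - first;
}

/* Demo: one-past-the-end minus the start of a 5-element array. */
ptrdiff_t demo_distance(void)
{
    static const int values[5] = { 1, 2, 3, 4, 5 };
    return element_distance(values, values + 5);
}

/* Returns 1 if an object of `bytes` bytes is so large that the difference
   between its first and one-past-last byte would not fit in ptrdiff_t. */
int exceeds_ptrdiff_range(size_t bytes)
{
    return (uintmax_t)bytes > (uintmax_t)PTRDIFF_MAX;
}
```

The comparison goes through uintmax_t so that it stays correct whether ptrdiff_t is narrower than, as wide as, or wider than size_t.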

If one is to follow the advice and use size_t and ptrdiff_t to store pointers, one certainly cannot go right.


As for the claim that

You should not forget that after this replacement, the memory size needed for the program will greatly increase as well.

I'd contest that claim. The increase in memory requirements would be rather modest compared to the increased consumption that is already inherent in moving to a 64-bit environment: general 64-bit alignment, stack alignment, and 64-bit pointers.

As for the claim that

"it is most likely that due to this replacement, new errors will appear, data format compatibility will be violated, and so on."

That certainly is true, but most probably if you're coding such buggy code, you'd accidentally "fix" old errors in the process, like the signed/unsigned int example:

int A = -2;
unsigned B = 1;
int array[5] = { 1, 2, 3, 4, 5 };
int *ptr = array + 3;
ptr = ptr + (A + B); // Error: A + B is evaluated as unsigned and wraps to UINT_MAX, not -1
printf("%i\n", *ptr);

where both the original and the new code have undefined behaviour (accessing array elements out of bounds), but the new code would appear to be "correct" on 64-bit platforms too.
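To make the arithmetic in that example concrete, here is a small sketch with hypothetical function names. It shows what A + B evaluates to when an int is mixed with an unsigned, and one way to get the intended offset of -1 by keeping both operands signed. The all-signed version is an assumption about what the code was meant to do, not something the article prescribes.

```c
#include <limits.h>
#include <stddef.h>

/* Mixing int and unsigned converts the int to unsigned first,
   so -2 + 1u wraps around to UINT_MAX instead of producing -1. */
unsigned mixed_sign_sum(void)
{
    int a = -2;
    unsigned b = 1;
    return a + b;
}

/* Keeping both operands signed yields the intended offset of -1. */
ptrdiff_t signed_offset(void)
{
    ptrdiff_t a = -2;
    ptrdiff_t b = 1;
    return a + b;
}

/* Applies the signed offset to a pointer at index 3, landing on index 2,
   which is inside the array, so the access is well defined. */
int read_with_signed_offset(void)
{
    static const int array[5] = { 1, 2, 3, 4, 5 };
    const int *ptr = array + 3;
    ptr = ptr + signed_offset();
    return *ptr;
}
```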

3
On

Well, any change will potentially introduce errors. Specifically, I can imagine that changing sizes could break code where less rigour with regard to types has been applied (e.g. assuming that ints or longs are the same size as pointers where they are not). Any binary structure written to a file would no longer be readable directly, and any RPC may well fail, depending on the protocol.
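One common way to keep a file or wire format stable when in-memory types change width is to serialize lengths at a fixed width and byte order rather than dumping a size_t directly. The sketch below assumes a 4-byte little-endian length field; the function names and the field width are illustrative choices, not taken from any particular format.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes `len` as a 4-byte little-endian value. Returns 0 on success,
   -1 if `len` does not fit in 32 bits. */
int encode_len_le32(size_t len, unsigned char out[4])
{
    uint32_t v;
    if ((uintmax_t)len > UINT32_MAX)
        return -1;
    v = (uint32_t)len;
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
    return 0;
}

/* Reads a 4-byte little-endian value back into a size_t. */
size_t decode_len_le32(const unsigned char in[4])
{
    uint32_t v = (uint32_t)in[0]
               | ((uint32_t)in[1] << 8)
               | ((uint32_t)in[2] << 16)
               | ((uint32_t)in[3] << 24);
    return (size_t)v;
}

/* Encodes and decodes in one step; returns SIZE_MAX if `len` is too large. */
size_t roundtrip_len_le32(size_t len)
{
    unsigned char bytes[4];
    if (encode_len_le32(len, bytes) != 0)
        return SIZE_MAX;
    return decode_len_le32(bytes);
}
```

With this approach the bytes on disk are the same whether the program is built with a 32-bit or a 64-bit size_t, at the cost of an explicit range check.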

Memory requirements will obviously increase as the size of most in-memory objects will increase. Most data will be aligned on 64 bit boundaries, meaning more "holes". Stack usage will increase, potentially resulting in more frequent cache misses.

While all generalisations may be true or false, the only way to find out is to do some proper analysis on the system at hand.

6
On

As a general proposition, using size_t and ptrdiff_t is vastly preferred over using, say, plain unsigned int and int. size_t and ptrdiff_t are pretty much the only way of writing a robust and widely portable program.

However: there is no such thing as a free lunch. Properly using size_t takes some work, too -- it's just that, if you know what you're doing, it takes less work than trying to achieve the same result without using size_t.

Also, size_t has the problem that you can't print it using %d or %u. Ideally you want to use %zu, but, tragically, not all implementations have supported it.
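As a rough sketch of the two options (helper names are hypothetical): the first function uses the C99 %zu conversion, and the second shows the usual fallback of casting to unsigned long and printing with %lu. The fallback assumes the value fits in unsigned long, which does not hold everywhere (for instance on 64-bit Windows, where unsigned long is 32 bits).

```c
#include <stddef.h>
#include <stdio.h>

/* Formats `value` with the C99 %zu conversion. The static buffer keeps the
   sketch short; it is overwritten on each call and is not thread-safe. */
const char *size_to_string(size_t value)
{
    static char buffer[32];
    snprintf(buffer, sizeof buffer, "%zu", value);
    return buffer;
}

/* Fallback for C libraries without %zu: convert to unsigned long first.
   Assumes `value` fits in unsigned long; larger values would be truncated. */
const char *size_to_string_fallback(size_t value)
{
    static char buffer[32];
    sprintf(buffer, "%lu", (unsigned long)value);
    return buffer;
}
```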

If you have a large and badly written program that doesn't use size_t, it's probably full of bugs. Some of those bugs will have been masked or worked around. If you try to change it to use size_t, a certain number of the program's workarounds will fail, perhaps uncovering once-hidden bugs. Eventually you'll work those out and achieve the more-robust and more-reliable and more-portable program you desire, but the process will be a rocky one. I suspect that's what the author means by "it is most likely that due to this replacement, new errors will appear".

Changing a program over to use size_t is sort of like trying to add const in all the right places. You make the changes you think you need to make, and recompile, and you get a bunch of errors and warnings, and you fix those and recompile, and you get a bunch more errors and warnings, etc. It's at least a nuisance, and sometimes a ton of work. But it's generally the only way to go if you want to make the code more robust and portable.

A big part of the problem is keeping the compiler happy. It's going to warn about a bunch of stuff, and you'll generally want to fix everything it complains about, even though some of what it complains about is ticky-tack and unlikely to cause a problem. But it's perilous to say, "Yeah, I can ignore this particular warning", so in the end, as I said, you'll generally want to fix everything.

The author's most eye-catching claim is

memory size needed for the program will greatly increase as well.

I suspect this is an exaggeration -- in most cases I doubt that memory will "greatly" increase -- but it's likely to increase at least a little bit. The issue is that on a 64-bit system, size_t and ptrdiff_t are likely to be 64-bit types. If for whatever reason you have large arrays of these, or large arrays of structures containing these, and if you had been using some 32-bit type (perhaps plain int or unsigned int) before, yes, you're going to see a memory increase.
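A small sketch of where such an increase would come from, using two hypothetical record types: one stores its fields as size_t, the other as uint32_t. On a typical 64-bit platform the first is likely to be 16 bytes per element and the second 8, so a large array of them doubles in size, although the exact numbers depend on the implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Record whose fields widen to 64 bits on a typical 64-bit platform. */
struct span_wide {
    size_t offset;
    size_t length;
};

/* Same record with fields fixed at 32 bits regardless of platform. */
struct span_compact {
    uint32_t offset;
    uint32_t length;
};

/* Bytes needed for an array of `count` wide records. */
size_t bytes_for_wide(size_t count)
{
    return count * sizeof(struct span_wide);
}

/* Bytes needed for an array of `count` compact records. */
size_t bytes_for_compact(size_t count)
{
    return count * sizeof(struct span_compact);
}
```

Comparing bytes_for_wide(n) with bytes_for_compact(n) for a realistic n gives a quick estimate of how much a particular data structure would grow.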

And then you're going to want to ask, Do I really need to be able to describe 64-bit sizes? 64-bit programming gives you two things: (a) the ability to address more than 4 GB of memory, and (b) the ability to have a single object greater than 4 GB. If you want to have a total data usage greater than 4 GB, but you don't ever need to have a single object bigger than 4 GB, and if you never want to read more than 4 GB of data at a time from a file (using a single read or fread call, that is), you don't really need 64-bit size variables everywhere.

So to avoid bloat, you might make an informed choice to use, say, unsigned int (or even unsigned short) instead of size_t in some places. As a trivial example, if you had

size_t x = sizeof(int);
printf("%zu\n", x);

you could change this to

unsigned int x = sizeof(int);
printf("%u\n", x);

without any loss in portability, because I can quite confidently guarantee your code is never going to find itself running on a machine with 34359738368-bit ints (or at least, not in our lifetimes :-) ).

But this last example, trivial as it is, also illustrates the other issues that tend to intrude. The similar code

unsigned int x = sizeof(y);
printf("%u\n", x);

is not so obviously safe, because whatever y is, there's a chance it could be so big that its size doesn't fit in an unsigned int. So if you or your compiler really care about type correctness, there may be warnings about possible data loss when assigning size_t to unsigned int. And to shut off those warnings, you may need explicit casts, as in

unsigned int x = (unsigned int)sizeof(int);

And this cast is, arguably, perfectly appropriate. The compiler is operating under the assumption that any object might be really big, that any attempt to jam a size_t into an unsigned int might lose data. The cast says you've thought about this case: you're saying, "Yes, I know that, but in this case, I know it won't overflow, so please don't warn me about this one any more, but please do warn me about any others, that might not be so safe."
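If you would rather have the "I know it won't overflow" assumption checked than merely asserted in a comment, one option is to route such narrowing through a small helper. This is only a sketch with hypothetical names; whether a debug-time assert is the right reaction to an oversized value depends on the program.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns 1 if `n` can be stored in an unsigned int without loss. */
int fits_in_uint(size_t n)
{
    return n <= UINT_MAX;
}

/* Narrows a size_t to unsigned int. The explicit cast silences the
   compiler warning; the assert documents and checks the assumption
   that the value actually fits. */
unsigned int size_to_uint(size_t n)
{
    assert(fits_in_uint(n));
    return (unsigned int)n;
}
```

A call such as size_to_uint(sizeof(int)) then reads as a deliberate, checked narrowing rather than a bare cast.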

P.S. I'm being downvoted, so in case I've given the wrong impression, let me make clear that (as I said in my opening paragraph) size_t and ptrdiff_t are vastly preferred. In general there's every reason to use them, no good reason not to use them. (Come to that, Karpov wasn't saying not to use them, either -- merely highlighting some of the issues that might come up along the way.)