Memory management is scary. It should be: A lot can go wrong—often very wrong. But a moderately experienced C or C++ programmer can learn and understand memory hazards completely. Once you have that knowledge, you should feel only confidence, not fear. Read the article here.
In the article, the author mentions the below as incorrect … what am I missing because it looks fine to me?
A final model for memory corruption is mistyped dereferencing. This is the category of the buffer overflows that so often yield exploits. Here’s an example:
————————————————
char *bad_implementation_of_strdup(char *string)
{
    char *ptr;

    /* Oh no! Do you see the missing "+ 1"? */
    ptr = (char *) malloc(strlen(string));
    strcpy(ptr, string);
    return ptr;
}
Think of ptr here as the address of a LENGTH-long character array, while string has the distinct type of a (LENGTH + 1)-long array.
———————————–
Thanks in advance
This should be no news for an average C programmer.
“This should be no news for an average C programmer.”
You'd be surprised at what some people are allowed to get away with in school. They (sinfully) didn't even teach us malloc() and free(), choosing to start memory management in C++ (with new and delete). Now, mind you, anyone with half a brain can learn malloc() and free() in an hour, but I've seen a lot of idiots out there.
-Erwos
Most of those who suggest that languages which operate within runtimes are somehow superior, on the grounds that languages which compile to native code without automatic garbage collection are prone to problems such as memory leaks and buffer overflows, don't have hands-on experience with today's modern (and, in the case of valgrind, free!) memory debuggers.
Again, here we find yet another use for the unit testing stage. Memory leaks and accesses to unallocated memory or to memory outside a stack buffer are programmatic errors. The next step in unit testing, after a particular set of tests succeeds, is to rerun those same tests under a memory debugger. A module passing its unit tests does not indicate that the code is correct for the given range of inputs… it's only after it has also been memory-debugged that you can truly say it's correct for that range.
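To make that concrete, here is a sketch of the kind of bug a functional test happily passes but a memory debugger flags immediately (the function and test are made up for illustration):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies the first n characters of s into a fresh buffer. */
char *copy_n(const char *s, size_t n)
{
    char *out = (char *) malloc(n); /* bug: no room for the '\0' */
    memcpy(out, s, n);
    out[n] = '\0';                  /* invalid write, one byte past the buffer */
    return out;
}

int main(void)
{
    char *p = copy_n("hello", 5);
    assert(strcmp(p, "hello") == 0); /* the unit test passes... */
    return 0;                        /* ...but the overrun (and the leak of p)
                                        show up only under a memory debugger
                                        such as valgrind */
}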
In reality, profile guided optimization places languages which compile to native code well beyond the reach of any optimizing JIT-powered runtime. As long as C++ programmers avoid overuse of virtual functions, especially within loops, optimize properly with templates, and utilize profile guided optimization, there is no way that any runtime-driven language can even theoretically come close to a native code language.
The goal of any runtime language is clear… to simplify the development process by adding complexity to the program's execution environment. Is this a worthwhile tradeoff? I think when compared to craftsmanlike native code applications, such as qmail, you will find craftsman-quality native code applications to be every bit as problem-free from a stability and security standpoint as programs which execute in a runtime environment, but the native code solution will be substantially less complex, and thus arguably more elegant.
string is a C string, which is terminated by a '\0' character. strlen, however, gives you the number of characters in the string but not the number of bytes you need to store it. You have to add the terminating '\0' to the result of strlen to get the number of bytes the string needs.
ptr points to a chunk of memory which is one byte smaller than the amount of memory strcpy uses to copy the string.
Assume
char *string = "A";
In effect string holds {'A', '\0'}, i.e. two bytes allocated,
but strlen(string) will return 1, because it takes into account only the 'A'.
A sample strlen:
int strlen (char *str)
{
    int len = 0;

    for (char *ptr = str; *ptr; ptr++, len++)
        ;
    return len;
}
So, as you can see, len increments only as long as ptr advances, which happens as long as *ptr != '\0'. Therefore len is always the number of bytes allocated minus the one byte for the terminating '\0' character.
So ptr in this case will be allocated only one byte.
Then strcpy walks the string in a similar way to copy the characters.
a sample strcpy:
void strcpy (char *dest, char *src)
{
    char *tmpsrc = src, *tmpdest = dest;

    while ((*tmpdest++ = *tmpsrc++) != '\0')
        ; /* copies every character, including the terminating '\0' */
}
As you can see, it only checks the contents of src. So inevitably two bytes ('A' and the '\0') are copied into dest, whereas only one byte was actually allocated, leading to a buffer overrun.
So the key is to allocate one byte more than strlen returns.
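Putting it together, a fixed version of the strdup from the article would look like this (the name good_strdup and the NULL check are my additions):

char *good_strdup(const char *string)
{
    char *ptr;

    ptr = (char *) malloc(strlen(string) + 1); /* + 1 for the '\0' */
    if (ptr != NULL)
        strcpy(ptr, string);                   /* copies the '\0' too */
    return ptr;
}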
"In the article, the author mentions the below as incorrect … what am I missing because it looks fine to me?"
Because a string in C is an array terminated with a '\0' character.
When you declare an array of characters (strings are character arrays in C):
char a[] = "hello";
"hello" occupies 6 characters, not 5:
a[0] = 'h', etc… and a[5] = '\0'
Always.
Ref: C FAQ
http://www.eskimo.com/~scs/C-faq/q6.2.html
You can see the differences claimed in the former posts by compiling and running the following code:
#include <string.h>
#include <stdio.h>

int main(void)
{
    char a[] = "hello";

    printf("Value returned by sizeof(a): %zu\n", sizeof(a));
    printf("Value returned by strlen(a): %zu\n", strlen(a));
    return 0;
}
Isn't your sizeof(a) the size of the pointer, 4 bytes on 32-bit platforms?
For me, the history of programming languages moves towards ever higher levels of abstraction ( Assembly -> Fortran -> C -> Java -> … )
If the computer can do something automatically, you shouldn’t have to do it by yourself, reducing code complexity and potential errors.
For example, if automatic garbage collection is doable in your environment ( plenty of RAM and horsepower ), it is desirable, as it removes some burden from you and, at the same time, allows more flexible data management that would not be doable without it. ( As an example, Lisp mandates GC. )
The runtime environment is also a tradeoff between hardware supported features and software compatibility.
Current stock computers are built around the C language programming model ( stacks, HW datatypes similar to the native C constructs … ); any significantly different language needs a big runtime environment overhead.
( There once existed Lisp machines with HW-assisted read-barrier GC and tagged datatypes … )
a is an array and not a pointer. Sometimes an array is implicitly converted to a pointer, but they are not the same.
Christoph, you’re mistaken. Given:
char a[];
a[] is an array. a is a char*.
int main(int argc, char** argv)
is perfectly legal.
The problem with not having GC is that you have to *understand* the language prior to using it – a prerequisite that has fallen into disgrace; hence Java, VB and others like them were invented so that you can start programming before you have grasped the concepts…
a is not a char *;
Try this:
#include <stdio.h>

int main()
{
    char a[] = "Hallo World";
    char *b = "Hallo World";

    printf("%zu\n", sizeof(a)); /* 1 */
    printf("%zu\n", sizeof(b)); /* 2 */
    a[0] = 'c'; /* 3 */
    b[0] = 'c'; /* 4 */
    a = b; /* 5 */
    b = a; /* 6 */
}
1. Gives you 12.
2. Gives you 4 or 8, depending on your architecture.
3. Is fine and alters a to “callo World”
4. Should give a Segmentation fault because b points to read only memory
5. Illegal and g++ gives you: error: incompatible types in assignment.
You wrote that a is a char *. How come this error occurs?
6. Legal. a is converted to char * and now b points to the first byte of a.
I stand corrected regarding the sizeof(char[]).
But that doesn't change anything about either the strlen() + 1 or my statement regarding GC. If you need garbage collection, do it. (Plainly possible with C++.) But I don't see GC as being the solution to the problem of people not being aware of what they're doing – on the contrary.
Thank you for explaining that stuff. I was wrong !
char arrays "[]" are const char * :
constant pointers to variable characters, or variable pointers to constant characters?
It reminds me of the difference in C++ between initializing constructors and assignment:
– Object A = "A"; ( initializing constructor, equivalent to Object A("A"); )
and:
– Object A; A = "A"; ( default constructor, then assignment )
[ Am I alone in thinking it somewhat complex? ]
> Constant pointers to variable characters or variable
> pointer to constant characters ?
If ever in doubt, there’s a two-step path to enlightenment. 😉
1) const type_t is equivalent to type_t const;
2) After doing 1), each “const” refers to the type standing in front of it.
const char * <=> char const *
…means, pointer to constant char.
char * const
…means, constant pointer to char.
char const * const
…means, constant pointer to constant char.
And because a reference is always const, there is type_t const &, but no type_t & const. 😉
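To make the three cases concrete, here is a small sketch ( the variable names are mine ):

int main()
{
    char c = 'x', d = 'y';

    char const *p1 = &c;        /* pointer to const char */
    p1 = &d;                    /* fine: the pointer may be reseated */
    /* *p1 = 'z'; */            /* error: the pointed-to char is const */

    char *const p2 = &c;        /* const pointer to char */
    *p2 = 'z';                  /* fine: the char may be modified */
    /* p2 = &d; */              /* error: the pointer itself is const */

    char const *const p3 = &c;  /* const pointer to const char */
    /* *p3 = 'z'; p3 = &d; */   /* both are errors */

    return 0;
}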
> Am I alone in thinking it somewhat complex?
C++ is a language that can be (ab)used in many different ways – procedural, OO, generic. Such flexibility brings complexity. It all depends on what you intend to do.
Regarding the “shifting around” of const type_t to type_t const… it is a bit unfortunate that the former is the more popular way to write it, because the latter is the more correct one. I think there was some place using templates where you are even forced to use the latter.
From now on, I'll try to use your latter syntax.
( Maybe the bad habit comes from the difference between C syntax and English syntax. )
Regards.
This is exactly why I think tutorial writers should stop using strings and char *'s in their examples. They break most of the C conventions and are used infrequently, yet tutorial writers keep on thinking it's a good idea to use them to demonstrate methods of writing typical code.
"In reality, profile guided optimization places languages which compile to native code well beyond the reach of any optimizing JIT-powered runtime. As long as C++ programmers avoid overuse of virtual functions, especially within loops, optimize properly with templates, and utilize profile guided optimization, there is no way that any runtime-driven language can even theoretically come close to a native code language."
There’s no theory to it, Bascule. I usually agree with most of what you have to say, but this time no.
Profile guided optimization is only an approximation of true dynamic optimization. The training data set used to generate the "profile" that is fed into a profile guided optimization pass must somehow be representative of all execution paths: i.e. an "average." But the whole point of true dynamic optimization is that the application is optimized for this particular run, regardless of any training data.
Secondly, even profile guided optimization is more conservative than true dynamic optimization, since the compiler cannot assume that the training profile is an exhaustive representation of true dynamic behavior (e.g. it cannot eliminate code that has not been executed in the training set); the optimizer can never be certain that the program will not exhibit behavior that was absent from the training set. The result is that profile guided optimizations are limited to things like better block scheduling, trace scheduling, guarded inlining, and the like. Data restructuring, devirtualization, deep inlining, dead code removal, etc., cannot be done safely.
However, in a true dynamic optimization setting, all such optimizations can be employed and "backed out" when the conditions that made them possible no longer hold in the running program. An example is devirtualization and inlining in Java virtual machines. Java has fully dynamic class loading. For a virtual method invocation, the VM can devirtualize it if only one possible receiver method exists within the loaded class hierarchy. For a program that behaves significantly differently from run to run and uses different portions of the hierarchy based on execution, this can lead to a multitude of behaviors that each have significant opportunity for optimization, but in their average yield no useful optimization for a static, profile-guided compiler.
A good example of this is one of my current projects, with about 620 classes, where fewer than 100 of these classes may be loaded on any particular run; yet they have such tight cohesion and related functionality, and the number of combinations is so immense, that factoring them into multiple tools is not feasible.
I find it surprising that people still think static compilers can ever approach dynamic compilers, once you amortize or factor out the runtime cost of compilation itself.
I suppose the real question in regards to static versus dynamic optimization is whether or not the added computational overhead of dynamic optimization can be reclaimed by optimizing the inner portions of loops more efficiently than a static compiler can.
Certainly the one striking example would be invoking a virtual method from within a loop. In this case the dynamic inlining of virtual methods would allow a runtime language to win out in the long run. But there are many ways to eliminate the use of virtual methods in C++ programs, most notably through the use of templates, and most C++ programmers are taught, from the first day they are introduced to virtual methods, that they are slow due to the cost of a vtable lookup.
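To illustrate the template alternative, here is a sketch of replacing a virtual call inside a loop with a template parameter, so the call can be inlined ( all names here are made up ):

#include <stdio.h>

struct Doubler {
    int apply(int x) const { return 2 * x; }
};

/* The operation is fixed at compile time, so the compiler can inline
   op.apply() into the loop; there is no vtable lookup per iteration. */
template <typename Op>
int sum_transformed(const int *data, int n, const Op &op)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += op.apply(data[i]);
    return sum;
}

int main()
{
    int data[] = { 1, 2, 3, 4 };
    printf("%d\n", sum_transformed(data, 4, Doubler())); /* prints 20 */
    return 0;
}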
In the end it's simply too much overhead for a JIT to provide the degree of optimization available in native code. While a JIT may be able to optimize inner loops, C++ templates allow you to eliminate loops altogether. By looping within the template itself, and expressing trig functions as templates which generate a Maclaurin series, it's possible to express a transformation as complicated as the FFT as a single expression. Redundancy can then be optimized away at compile time.
Certainly there's nothing preventing languages which execute in runtimes from implementing features like templates, but the advantages would be lost, as the JIT would have considerably more code to optimize if templates were used. Meanwhile most C++ compilers will optimize the Maclaurin series expansion down to a single machine instruction.
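As a small sketch of what "looping within the template itself" means ( this shows the general technique, not the poster's actual FFT code ):

#include <stdio.h>

/* Unroll<N>::sum(a) expands at compile time to a[0] + ... + a[N-1];
   the generated code contains no runtime loop at all. */
template <int N>
struct Unroll {
    static double sum(const double *a)
    {
        return Unroll<N - 1>::sum(a) + a[N - 1];
    }
};

template <>
struct Unroll<0> {
    static double sum(const double *) { return 0.0; }
};

int main()
{
    double data[] = { 1.0, 2.0, 3.0, 4.0 };
    /* Compiles down to data[0] + data[1] + data[2] + data[3]. */
    printf("%g\n", Unroll<4>::sum(data)); /* prints 10 */
    return 0;
}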
So was I wrong to say that statically compiled languages exceed the performance of those which execute in runtimes? I suppose so, but it's *highly* unlikely that we'll see runtime implementations of algorithms exceeding the performance of properly implemented C++ versions. Just for reference, implementing the FFT as a template provides over 4 times the performance of a Java version implemented using trig tables rather than the StrictMath classes. A StrictMath implementation was over 100 times slower than a C++ version using Maclaurin series expansions.
However, keep in mind that a runtime will still be unable to optimize for the special cases that the template-based implementation does, such as reducing c = a * cos(0) + b * sin(0) to c = a.
So, if you wish to quibble about the realm of theoretical possibility, that's fine, but in all practical terms C++ can provide the highest degree of efficiency available from any language today, especially when PGO is utilized.
I feel that this, again, is an issue of “I don’t know” or “I don’t care”.
I, too, think that carefully crafted code can beat any JIT hands down.
The thing is, just like garbage collection allows your average trial-and-error coder to write code that doesn't leak memory, JITs allow the same type of cretin to write code that performs anything but abysmally.
Is it bad to enable more people to write software? Not necessarily. But if you look at the "The Future of Computing, pt. 1" article a few articles up from this one, ask yourself: how many of those problems are caused by there being too many people in the industry who don't know exactly what they're doing after all?
I feel that the software industry has grown far beyond a healthy size. It's far too easy to hire a dozen programmers and roll your own – dozens of reimplementations, little reuse, little demand for reuse, and the same mistakes made over and over again. If programming manpower were scarce, I feel that the quality of software would be vastly higher, and we could focus on the next problem instead of reimplementing the old ones.
You are assuming, of course, that only cretins use JITs. There’s nothing stopping real programmers from writing well crafted code for the JIT.
You can’t stop morons from programming. While languages like VB and Java make it easier for the average person to program, they are useful tools for the ‘real’ programmers, those who know what they are doing.
Well, I for one prefer to know exactly *when* my destructors are being executed… that enables me to do meaningful things in them. 😉
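For example, a minimal sketch of that kind of deterministic cleanup ( the class name is made up ):

#include <stdio.h>

struct LogFile {
    FILE *f;
    LogFile(const char *path) : f(fopen(path, "a")) {}
    ~LogFile() { if (f) fclose(f); } /* runs exactly at the closing brace */
};

int main()
{
    {
        LogFile log("run.log");
        if (log.f)
            fprintf(log.f, "working...\n");
    } /* destructor runs here, deterministically; no GC pause, no finalizer */
    return 0;
}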