MFC Programmer's SourceBook : Thinking in C++

A small library usually starts out as a collection of functions, but those of you who have used third-party C libraries know that there’s usually more to it than that because there’s more to life than behavior, actions and functions. There are also characteristics (blue, pounds, texture, luminance), which are represented by data. And when you start to deal with a set of characteristics in C, it is very convenient to clump them together into a struct, especially if you want to represent more than one similar thing in your problem space. Then you can make a variable of this struct for each thing.

Thus, most C libraries have a set of structs and a set of functions that act on those structs. As an example of what such a system looks like, consider a programming tool that acts like an array, but whose size can be established at run-time, when it is created. I’ll call it a CStash. Although it’s written in C++, it has the style of what you’d write in C:

The tag name for the struct is generally used in case you need to reference the struct inside itself. For example, when creating a linked list, you need a pointer to the next struct. But almost universally in a C library you’ll see the typedef as shown above, on every struct in the library. This is done so you can treat the struct as if it were a new type and define variables of that struct like this:

Note that the function declarations use the Standard C style of function prototyping, which is much safer and clearer than the “old” C style. You aren’t just introducing a function name; you’re also telling the compiler what the argument list and return value look like.

The storage pointer is an unsigned char* . This is the smallest piece of storage a C compiler supports, although on some machines it can be the same size as the largest. It’s implementation dependent. You might think that because the CStash is designed to hold any type of variable, a void* would be more appropriate here. However, the purpose is not to treat this storage as a block of some unknown type, but rather as a block of contiguous bytes.

The source code for the implementation file (which you may not get if you buy a library commercially – you might get only a compiled OBJ or LIB or DLL, etc.) looks like this:

initialize( ) performs the necessary setup for struct CStash by setting the internal variables to appropriate values. Initially, the storage pointer is set to zero, and the size indicator is also zero – no initial storage is allocated.

The add( ) function inserts an element into the CStash at the next available location. First, it checks to see if there is any available space left. If not, it expands the storage using the inflate( ) function, described later.

Because the compiler doesn’t know the specific type of the variable being stored (all the function gets is a void*), you can’t just do an assignment, which would certainly be the convenient thing. Instead, you must use the Standard C library function memcpy( ) to copy the variable byte-by-byte. The first argument is the destination address where memcpy( ) is to start copying bytes. It is produced by the expression:

This indexes from the beginning of the block of storage to the next available piece. This number, which is simply a count of the number of pieces used plus one, must be multiplied by the number of bytes occupied by each piece to produce the offset in bytes. This doesn’t produce the address, but instead the byte at the address. To produce the address, you must use the address-of operator &.

The second and third arguments to memcpy( ) are the starting address of the variable to be copied and the number of bytes to copy, respectively. The next counter is incremented, and the index of the value stored is returned, so the programmer can use it later in a call to fetch( ) to select that element.

fetch( ) checks to see that the index isn’t out of bounds and then returns the address of the desired variable, calculated the same way as it was in add( ).

count( ) may look a bit strange at first to a seasoned C programmer. It seems like a lot of trouble to go through to do something that would probably be a lot easier to do by hand. If you have a struct CStash called intStash, for example, it would seem much more straightforward to find out how many elements it has by saying intStash.next instead of making a function call (which has overhead) like count(&intStash). However, if you wanted to change the internal representation of CStash and thus the way the count was calculated, the function call interface allows the necessary flexibility. But alas, most programmers won’t bother to find out about your “better” design for the library. They’ll look at the struct and grab the next value directly, and possibly even change next without your permission. If only there were some way for the library designer to have better control over things like this! (Yes, that’s foreshadowing.)

Dynamic storage allocation

You never know the maximum amount of storage you might need for a CStash, so the memory pointed to by storage is allocated from the heap. The heap is a big block of memory used for allocating smaller pieces at run-time. You use the heap when you don’t know the size of the memory you’ll need while you’re writing a program. That is, only at run-time will you find out that you need space to hold 200 airplane variables instead of 20. Dynamic-memory allocation functions are part of the Standard C library and include malloc( ), calloc( ), realloc( ), and free( ).

The inflate( ) function uses realloc( ) to get a bigger chunk of space for the CStash. realloc( ) takes as its first argument the address of the storage that’s already been allocated and that you want to resize. (If this argument is zero – which is the case just after initialize( ) has been called – it allocates a new chunk of memory.) The second argument is the new size that you want the chunk to be. If the size is smaller, there’s no chance the block will need to be copied, so the heap manager is simply told that the extra space is free. If the size is larger, as in inflate( ),there may not be enough contiguous space, so a new chunk might be allocated and the memory copied. The assert( ) checks to make sure that the operation was successful. ( malloc( ), calloc( ) and realloc( ) all return zero if the heap is exhausted.)

Note that the C heap manager is fairly primitive. It gives you chunks of memory and takes them back when you free( ) them. There’s no facility for heap compaction , which compresses the heap to provide bigger free chunks. If a program allocates and frees heap storage for a while, you can end up with a heap that has lots of memory free, just not anything big enough to allocate the size of chunk you’re looking for at the moment. However, a heap compactor moves memory chunks around, so your pointers won’t retain their proper values. Some operating environments have heap compaction built in, but they require you to use special memory handles (which can be temporarily converted to pointers, after locking the memory so the heap compactor can’t move it) instead of pointers.

assert( ) is a preprocessor macro in <cassert>. assert( ) takes a single argument, which can be any expression that evaluates to true or false. The macro says, “I assert this to be true, and if it’s not, the program will exit after printing an error message.” When you are no longer debugging, you can define a flag so asserts are ignored. In the meantime, it is a very clear and portable way to test for errors. Unfortunately, it’s a bit abrupt in its handling of error situations: “Sorry, mission control. Our C program failed an assertion and bailed out. We’ll have to land the shuttle on manual.” In Chapter 16, you’ll see how C++ provides a better solution to critical errors with exception handling.

When you create a variable on the stack at compile-time, the storage for that variable is automatically created and freed by the compiler. It knows exactly how much storage it needs, and it knows the lifetime of the variables because of scoping. With dynamic memory allocation, however, the compiler doesn’t know how much storage you’re going to need, and it doesn’t know the lifetime of that storage. It doesn’t get cleaned up automatically. Therefore, you’re responsible for releasing the storage using free( ), which tells the heap manager that storage can be used by the next call to malloc( ), calloc( ) or realloc( ). The logical place for this to happen in the library is in the cleanup( ) function because that is where all the closing-up housekeeping is done.

To test the library, two CStashes are created. The first holds ints and the second holds arrays of 80 chars. (You could almost think of this as a new data type. But that happens later.)

At the beginning of main( ), the variables are defined, including the two CStash structures. Of course, you must remember to initialize these later in the block. One of the problems with libraries is that you must carefully convey to the user the importance of the initialization and cleanup functions. If these functions aren’t called, there will be a lot of trouble. Unfortunately, the user doesn’t always wonder if initialization and cleanup are mandatory. They know what they want to accomplish, and they’re not as concerned about you jumping up and down saying, “Hey, wait, you have to do this first!” Some users have even been known to initialize the elements of the structure themselves. There’s certainly no mechanism to prevent it (more foreshadowing).

The intStash is filled up with integers, and the stringStash is filled with character arrays. These character arrays are produced by opening the source code file, Libtest.c, and reading the lines from it into the stringStash. Notice something interesting here: The Standard C library functions for opening and reading files use the same techniques as in the Stash library! fopen( ) returns a pointer to a FILE struct, which it creates on the heap, and this pointer is passed to any function that refers to that file ( fgets( ), in this case). One of the things fclose( ) does is release the FILE struct back to the heap. Once you start noticing this pattern of a C library consisting of structs and associated functions, you see it everywhere!

After the two Stashes are loaded, you can print them out. The intStash is printed using a for loop, which uses count( ) to establish its limit. The stringStash is printed with a while, which breaks out when fetch( ) returns zero to indicate it is out of bounds.

There are a number of other things you should understand before we look at the problems in creating a C library. (You may already know these because you’re a C programmer.) First, although header files are used here because it’s good practice, they aren’t essential. It’s possible in C to call a function that you haven’t declared. A good compiler will warn you that you probably ought to declare a function first, but it isn’t enforced. This is a dangerous practice, because the compiler can assume that a function that you call with an int argument has an argument list containing int, and it will treat it accordingly – a very difficult bug to find.

Note that the Lib.h header file must be included in any file that refers to CStash because the compiler can’t even guess at what that structure looks like. It can guess at functions, even though it probably shouldn’t, but that’s part of the history of C.

Each separate C file is a translation unit. That is, the compiler is run separately on each translation unit, and when it is running it is aware of only that unit. Thus, any information you provide by including header files is quite important because it provides the compiler’s understanding of the rest of your program. Declarations in header files are particularly important, because everywhere the header is included, the compiler will know exactly what to do. If, for example, you have a declaration in a header file that says void func(float); , the compiler knows that if you call it with an integer argument, it should promote the int to a float. Without the declaration, the compiler would simply assume that a function func(int) existed, and it wouldn’t do the promotion.

For each translation unit, the compiler creates an object file, with an extension of .o or .obj or something similar. These object files, along with the necessary start-up code, must be collected by the linker into the executable program. During linking, all the external references must be resolved. For example, in Libtest.c, functions like initialize( ) and fetch( ) are declared (that is, the compiler is told what they look like) and used, but not defined. They are defined elsewhere, in Lib.c. Thus, the calls in Libtest.c are external references. The linker must, when it puts all the object files together, take the unresolved external references and find the addresses they actually refer to. Those addresses are put in to replace the external references.

It’s important to realize that in C, the references are simply function names, generally with an underscore in front of them. So all the linker has to do is match up the function name where it is called and the function body in the object file, and it’s done. If you accidentally made a call that the compiler interpreted as func(int) and there’s a function body for func(float) in some other object file, the linker will see _func in one place and _func in another, and it will think everything’s OK. The func( ) at the calling location will push an int onto the stack, and the func( ) function body will expect a float to be on the stack. If the function only reads the value and doesn’t write to it, it won’t blow up the stack. In fact, the float value it reads off the stack might even make some kind of sense. That’s worse because it’s harder to find the bug.

A tiny C-like library

Dynamic storage allocation