This web page publishes the C++ source code for a reference counted String container. This String container is derived from a reference counted array template. This array template implements copy-on-write semantics. The array template is supported by a few classes and templates which are useful for creating reference counted objects (see, SubString.C for an example of a reference counted class which is not an array).
The closely related SubString class is also published here. The SubString class supports String section operations. Like the String class, the SubString class is a reference counted object.
This source base also includes a class called DoubleVec which demonstrates how a specific array class can be derived from the array template.
Regression tests are included for the String, SubString and DoubleVec classes.
If computer applications are divided into broad categories, one of the largest categories involves text processing. This includes document creation and processing, document retrieval and software translation (e.g., compiling C++ into machine code). Character streams are frequently divided into strings. This makes string containers some of the most frequently used classes in object oriented languages like C++ and Java.
The String container published here allows character strings to be treated as "first class" objects that support many of the basic operators (for example, +, +=).
// String construction String a("the quick brown fox"); String b = " jumps over the lazy"; // String concatenation String c = a + b; // c = "the quick brown fox jumps over the lazy" // c = "the quick brown fox jumps over the lazy golden retriever" c += " golden retriever"; c(16, 3) = "coy"; // replace "fox" with "coy" c(19, 0) = "ote"; // insert "ote" at character index 19
This string container uses reference counted strings and "copy-on-write" for efficiency. In the code below, for example, the assignment of "a" to "b" increments the reference count of "a" and points "b" at the shared data. The assignment of 'z' to b[3] results in a unique copy being made for "b", since b has been modified.
String a, b; a = "1234"; b = a; b[3] = 'z';
Click on the icon for a definition of "open source"
This source code may be used without restriction, in any way you choose, without a fee, as long as you maintain the following quote in the source code:
This software was written and is copyrighted by Ian Kaplan, Bear Products International, www.bearcave.com, 2003
The definition of open source used here means that you can use this software in proprietary software. Using my String container does not mean that your software must be open source.
The source code is provided "as is" without any warranty as to its usefulness or functionality. The source code comes with a fairly extensive regression test suite (e.g., StringTest.C, SubStrTest.C and DblVecTest.C). I recommend running compiling and running these tests. I also recommend running these tests with a memory verification tool like like Purify or Valgrind to help assure that there are no memory errors.
The presence of these regression tests in no way changes the "use at your own risk" caveat above.
After witnessing the Red Hat initial public stock offering (IPO), which made Red Hat CEO Bob Young and the other Red Hat executives vastly wealthy I rethought some of my ideas about publishing open source software. Those who got rich from the Red Hat IPO did nothing to create Linux and the GNU software that made Linux possible. They simply marketed GNU/Linux, taking advantage of the work of others, while giving them crumbs in return.
After the Red Hat IPO I started publishing my software with more restrictive use conditions. However, I have not applied these conidtions to the String container and related software published here. There are several reasons for this:
There are lots of string container classes around, including the string class in the Standard Template Library. So this software is not that unique.
Open source software works when it is software exchanged between engineers, not between engineers and end users. By modifying and improving open source software engineers "give back" to the open source community. Marketeers are not going to become rich off the efforts of software that is exchanged between engineers.
The String container class is only useful to software engineers (it is not end user software, like GNU/Linux). I hope that any use the String container finds will result in improvements and bug fixes. Since I use this the String container in my own software, this will pay back my effort in developing the String container.
The String container makes some software easier to create, but it will always be a small part of the software that uses it. So the software author must expend their effort, rather than simply reselling my work.
Version 3.0, September 2003
This is the third version of the String container class published here. The String and SubString classes were largely rewritten for this version. They now use a common set of classes and templates to support reference counted objects. The String class is derived from a char instance of a reference counted array (RCArray).
The source and associated test code for the String container can be downloaded in "tar" format here.
The tar program is part of UNIX and Linux. It should also be available on Apple's UNIX derived operating systems (e.g., OS X). If you do not have a version of the "tar" program for Windows NT/200/XP/WhatEverMicrosoftCallsIt you can download a copy here.
To unpack the "tar" archive, copy the tar file to the directory where you the string sub-directory to be created. The file name should end in ".tar" (e.g., string.tar). Change the name if necessary (e.g., from string_tar to string.tar). Use the command
tar xvf string.tar
to unpack the tar file. You should have the following files:
string/DblVecTest.C string/DoubleVec.h string/doxygenTemplate string/GrowableArray.h string/MainPage string/RCArray.h string/RCBase.h string/RCObject.h string/RCPtr.h string/String.C string/String.h string/StringTest.C string/SubString.C string/SubString.h string/SubStrTest.C
For notes on compiling this source using the Microsoft C++ compiler, see the comment in String.C
The String container class is extensively commented using Doxygen formatted comments. The Doxygen generated documentation for the String container can be found here. I have also documented the class "by hand" below.
I have been using the Rogue Wave Tools.h++ classes and templates for a number of years. I wanted to have a class like Rogue Wave's RWCString for the software that I develop for Bear Products International. But I did not want to pay Rogue Wave's license fees (which have become expensive). Nor do I want to force Bear Products International licensees to pay Rogue Wave fees.
Many implementations of the C++ Standard Template Library (STL) do not provide reference counted string objects like RWCString, so STL was not an alternative (however, I'm told the the Dinkumware STL library does provide a reference counted string class).
So I decided to write my one String container, influenced in part by the publicly available Rogue Wave documentation. I expected that it would take me no more than a few evenings or a couple of weekends. As happens so often with software engineers, I underestimated the effort. In fact, to say that I underestimated the effort is a gross underestimation. The String container is now on its third version, over a span of over two years. My original estimate of a few days work has been vastly exceeded.
This software has been time consuming to develop because the details of the String container have been more complicated that I expected. Library code like the String container is also more time consuming to develop because it serves as the foundation for other software. Library code must be as reliable as possible and I have spent a fair amount of time developing regression tests.
Version 3.0 is close to a complete rewrite of the String class and of the supporting SubString class. The String class is now derived from an instance of a generic reference counted array (RCArray). This reference counted array in turn uses a common set of classes and templates that support reference counted classes.
A reference counted object should only make copies when a shared object is modified. Much to my surprise, version 2.0 of the String container made copies on read. At the time I did not fully understand how C++ handled the operator [] function. This issue is discussed in a note I posted to comp.lang.c++.
To avoid copy-on-read the reference counted array template (RCArray) returns a proxy class as a result of the operator [] function. This proxy is based on Stroustrup's Cref class (see 11.12 of The C++ Programming Language, Third Edition).
One of the best choices I made in developing this software was to include a fairly extensive set of regression tests along with this code. The C++ code that implements the reference counted String container unavoidably uses some C++ features that, at least historically, have been inconsistently implemented by various C++ compilers. The regression tests allow some degree of verification relative to the local compiler. While I have done my best to assure the correctness of the source code, you should run the regression test and verify the software using a memory usage tool like Valgrind or Purify.
The String container code (and related code, like DoubleVec) has been compiled with the following compilers on the following platforms:
I have also attempted to check for memory leaks using Valgrind on Linux (I don't currently have access to Purify on Solaris as I have had in the past).
In terms of lines of code, the regression tests are larger than the source that implements the String, SubString, RCArray and other supporting classes and templates. In developing these regression tests I have tried to be fairly methodical. Developing these tests forced me to think about how difficult it is to assure absolute correctness, even in a relatively small software base. I have no doublt that these regression tests are incomplete.
After publishing Version 3.0 of the String container in September 2003, I have not heard that anyone is using it. Perhaps the reason is that this is yet another String container. If you are using the String container, I'd appreciate a brief e-mail (you don't have to say any more than "I'm using it"). Just click on the icon below:
Extensive Doxygen formatted documentation is provided in the source code. The documentation below is intended to highlight a few features of the String container.
The character data in the String container is stored along with a null character terminator ('\0'). In most cases String operations will preserve the null terminator. However, it can be overwritten using the [] operator (something that is not recommended).
When one String object is assigned to another, only a reference is copied and the reference count for the shared data is incremented. A unique copy of a shared string is only made when the string is changed. Copy-on-write takes place as a result of the following:
The String assignment operator can be used to assign a C-string, a String object, or a SubString to a String object. For example:
String a; String b; String c; a = "this is string data"; // assign a C-string to a String b = a; // assign String a to String b, incrementing the reference count. c = b(8, 11); // assign a SubString, "string data", to c
The SubString assignment operator can be used to assign a C-string, String object or SubString object to a SubString. For example:
String a("abcdefgh"); String b("wxyz"); a(4, 4) = "1234"; // change "abcdefgh" to "abcd1234" a(0,4) = b; // change "abcd1234" to "wxyz1234" a(0,4) = a(4,4); // change "wxyz1234" to "12341234"
The += concatenates a C-string, String object or character to the String object on the right hand side. For example:
String weather("'Twas brillig, "); String nextLine("Did gyre and gimble in the wabe"); weather += "and the slithy toves"; // concatenate a C-string weather += '\n'; // concatenate a character weather += nextLine; // concatenate a String object
The + operator concatenates a String object, a C-string, a String or a character to create a new String. The String operand in a + expression may be either the right or left operand. For example:
String a("All mimsy were the borogoves"); String b; String c("\nBeware the Jabberwock, my son"); String d; String poem; b = a + "\nAnd the mome raths outgrabe"; // concatenate String and a C-string d = b + c; // concatenate a String and a String poem = d + '!'; // concatenate a String and a character // note that the String may also be the right operand poem = "Lewis Carroll" + d; // concatenate a C-string and a String poem = '\n' + a; // concatenate a character and a String
The index operator allows a single character in a String to be indexed. The first index is 0. The index operation may be used on either the right or left hand size of an assignment. For example, after the for loop below is executed, both String objects will contain a unique copy of "xyz".
String a("123"), b("xyz"); for (size_t i = 0; i < a.strlen(); i++) a[i] = b[i];
Note that there is always a null character beyond the end of the character data. This null can be overwritten, which may cause problems if the String is assigned to a const char * pointer. For example:
a[ a.strlen() ] = 'X'; // overwrite the null terminator
In general overwriting the null terminator is not a good idea.
If an index beyond the end of the string (e.g., characters plus the null terminator) will throw an exception.
The substring operator is given two arguments: the starting index, and the length of the SubString. It may be used on either the right or left hand side of the assignment. For example:
String a("Gore is the school Principle"); String b("Bush is a vacuous frat boy"); String c("Gore wins!"); c(0, 4) = b(0, 4); b(10, 7) = "fatuous"; if (b(10,7) == "idiotic") cout << "must be a Bush";
OK, you've probably figured out that I'm a bit unhappy about the way Bush "won" the election through the graces of five conservative Supreme Court Justices, while still losing to Gore by over half a million votes.
Note that SubStrings supports general insertion. The inserted Strings do not have to be the same size as the SubString. For example:
String a = "inserthere"; a(6, 0) = " key "; // a now contains "insert key here" String b = "123 456 678"; b(4, 3) = "abcdefgh"; // b now contains "123 abcdefgh 678" String c = "this is a really long winded explaination"; c(10, 18) = "terse"; // c now contains "this is a terse explaination"
The const char * operator allows a String to be assigned to a const char * pointer. For example:
String a("This is it!"); const char *pCStr; pCStr = a; printf("%s\n", pCStr ); // this will print "This is it!"
The const char * operator will allow String expressions to be assigned to a const char * pointer. However, this is not a good idea since the pointer will point to memory that has been deallocated (see C++: the good and the bad, above). For example, the code below will result in a reference to deallocated memory:
String a("better red"); String b(" than ned"); const char *pCStr; pCStr = a + b; // pCStr points to deallocated memory
Code like this could produce strange errors, especially in a multi-threaded environment.
To use the an instance of the String container with in C++ stream output you will need to cast the String to const char *. For example:
String a("PDQ "); String b("Bach"); cout << (const char *)a << endl; cout << (const char *)(a + b) << endl;
The String container supports all of the relational operators ( ==, !=, <=, >=, <, >). Either the right or left hand operands may be a String object or a C-string. For example:
String a, b; [...] if (a <= b) if (b > "1234") if ("5678" != a) [...]
Resize will set the strlen() of the string to new_size. If new_size is larger than the current String, it will be padded with spaces. If new_size is smaller, it will truncate the String data. The resize() function preserves the null terminator, unless the string is resized to zero, in which case the String object will become empty.
The getRefCnt() function returns the reference count for an instance of the String container. If the String is empty, the reference count will be one (however, the String will not contain any characters). The reference count is incremented each time the String object is assigned to another String object. The reference count is decremented each time a String reference copy is modified. For example:
String original("The one, the only"); String a, b, c; // Before assignment, getRefCnt() == 1 a = original; // reference count is now 2 b = a; // reference count is now 3 c = b; // reference count is now 4 a += " (NOT!)"; // a is now unique, reference count is now 3
In general the user of the String container will not find much use for the getRefCnt() function. It is useful mainly for verifying the correct function of the String container (e.g., it is used extensively in the StringTest.C, SubStrTest.C and DblVecTest.C regression tests).
The strlen() function returns the number of characters contained in the String object. This is one less than the internal length, which includes the null terminator.
Writing string classes has great educational value...
Bjarne Stroustrup (in The C++ Programming Language, third Edition,
Chapter 20).
Writing the String container was indeed an educational experience. The String container is a good example of why I both love and hate C++. The String container allows character strings to be treated like "first class" objects in C++. Several of the basic operators are defined for the String type. The support provided by C++ for powerful (and complex) features like templates and operator overloading allowed the String container to be developed. In contrast, the String container that is part of the Java programming language requires support from the Java compiler (e.g., the Java String container is part of the language, not simply a library object).
The price exacted for the power and flexibility of the C++ language is complexity. I have tried to comment the String container and related code heavily, but it remains complex and difficult to understand. Reference counted objects like String, SubString or classes instantiated from the RCArray template share a common data object. The management of the shared data object is provided by a "smart pointer" (based on a smart pointer described in Scott Meyers book More Effective C++). To understand this code requires an indepth understanding of how C++ uses temporaries, constructors and destructors. Rather than concentrating on the algorithm being implemented, the developer must think hard about the C++ language and its implementation. In this case, the language simply gets in the way. The need for such a complicated container would be eliminated by a language/compiler that supported garbage collected memory (as Java does).
Some examples are probably worth while, both to illuminate the String container and to illustrate some of the complexities in its implementation.
Reference counted strings allow data to be shared between two String containers. When one String object is assigned to another, the reference count will be incremented and both objects will share the same data. The container supports "copy-on-write", so a change to one String object will not effect other objects that reference the shared data. Each time a class destructor is called the reference count will be decremented. This allows String objects to recover their own memory and reduces the effort that the programmer must put into memory management.
In general the String container just works and the programmer does not have to think about what the compiler does. For example, in the code below, the compiler generates a hidden temporary for the result of the String concatenation.
String a("1234"), b("5678"); String c = a + b;
The hidden temporary is shown below:
String a("1234"), b("5678"); tmp = a + b; String c = tmp;
The lifetime of the temporary spans the assignment statement. That is, the temporary is constructed before the assignment statement and its destructor is called after the assignment statement, but before the following statement.
The lifetime of the compiler generated temporaries cannot be entirely ignored. The String container implements a const char * operator, which allows String objects to be assigned to a const char * variable. The const char * variable points into the internal memory of the String. In some cases this memory may have been deallocated. For example, in the code below the overloaded + operator is used to concatenate two strings ("1234" and "5678"). The resulting String temporary (containing "12345678") is assigned to a const char * pointer.
String a("1234"), b("5678"); const char *pChar; pChar = a + b;
The problem here is that the destructor for the String temporary will be called after the assignment. When the reference count is decremented to zero the object will be deallocated. To paraphrase Stroustrup:
A temporary object of class String is created to hold a + b. Next, a pointer to a C-style string is extracted from the object (via the overloaded operator const char *()). Then - at the end of the expression - the temporary object is deleted. Now, where was the C-style string allocated? Probably as part of the temporary object holding a + b, and that storage is not guaranteed to exist after the temporary is destroyed. Consequently, pChar points to deallocated storage. If pChar is referenced it might work as expected, but that would be sheer luck.
The complexity of overloaded operators, their associated compiler created temporaries, constructors and destructors are only a fraction of the C++ language. The complexity of programming in C++ is compounded by the need to explicitly manage memory. Assuring that there are no memory errors is difficult simply by inspecting the code. In most cases, a tool like Valgrind or Purify is needed to gain some confidence that there are no memory errors.
The C++ programming language holds out the promise of allowing object oriented software to be created. But along with this promise is the considerable complexity of the language. One of the reasons that Java has become popular is that it preserves many of the core features of C++ in a smaller, easier to understand language. Java also supports garbage collection avoiding the burden of explicitly managed memory.
Sadly Java has its own problems. The language has not been released to an impartial standards organization and Sun has insisted that Java must be an interpreted language. As a result, Java is slow. Perhaps Microsoft's C# will provide a better answer. Or, perhaps, as a friend of mine says: life is hard (but compared to what) - Shimon Peres. We'll just have to live with problems that could have been avoided.
Many of these references discuss, in one way or another, the complexity of C++.
This is my web page that links to Ian Joyner's excellent essay C++??: A Critique of C++
This satire is supposedly an interview with Stroustrup who admits that C++ was his attempt to raise the salaries for programmers by introducting a really complex language. Like all satire, this one obviously has a kernel of truth (complexity, not the salary issue).
The C++ Style Sweet Spot A Conversation with Bjarne Stroustrup, Part I by Bill Venners October 13, 2003
This is a real interview with Stroustrup. I find it ironic that Stroustrup complains that most people do not use C++ properly. Perhaps this has something to do with the complexity of the language?
Generalizing Overloading for C++2000 by Bjarne Stroustrup (PDF)
This paper actually proposes generalizing C++ operator overloading so that it would be possible to overload whitespace. Stroustrup writes:
Here, I describe the more innovative and powerful mechanism for overloading whitespace. Consider x*y. In programming languages (e.g. Fortran, Pascal, and C++), this is the conventional notation for multiplying two values. However, mathematicians and physicists traditionally do not use the operator *. Instead they use simple juxtaposition to indicate multiplication. That is, for variables x and y of suitable types,
x ymeans multiply x by y.
This is simply achieved by overloading the space operator for double-precision floating-point values:
double operator (double d1, double d2) { return d1*d2; }Or - more explicitly - equivalently
double operator ' '(double d1, double d2) { return d1*d2; }
I used to joke that if the C++ Programming Language standards committee met in the Soviet Union during the Stalin period he would have had them all shot for the good of the engineering community. Stalin was a monster and I don't advocate his methods. However, I do think that it is time the standards committee retired unless they can concentrate on making C++ more usable, rather than more obscure. Stroustrup's proposal, outlined above, shows that he does not understand what a complex language he has a hand in creating.
Ian Kaplan, January, 2001
Revised: February 2004