Searching
in strings
The
find
family of
string
member functions allows you to locate a character or group of characters within
a given string. Here are the members of the
find
family and their general usage:
string
find member function
|
|
|
Searches
a string for a specified character or group of characters and returns the
starting position of the first occurrence found or
npos
(this member datum holds the current actual length of the string which is being
searched) if no match is found.
|
|
Searches
a target string and returns the position of the first match of
any
character in a specified group. If no match is found, it returns
npos.
|
|
Searches
a target string and returns the position of the last match of
any
character in a specified group. If no match is found, it returns
npos.
|
|
Searches
a target string and returns the position of the first element that
doesn’t
match of
any
character in a specified group. If no such element is found, it returns
npos.
|
|
Searches
a target string and returns the position of the element with the largest
subscript that
doesn’t
match of
any
character in a specified group. If no such element is found, it returns
npos.
|
|
Searches
a string from end to beginning for a specified character or group of characters
and returns the starting position of the match if one is found. If no match is
found, it returns
npos.
|
String
searching member functions and their general uses
The
simplest use of
find( )
searches
for one or more characters in a
string.
This overloaded version of
find( )
takes
a parameter that specifies the character(s) for which to search, and optionally
one that tells it where in the string to begin searching for the occurrence of
a substring. (The default position at which to begin searching is 0.) By
setting the call to
find
inside
a loop, you can easily move through a string, repeating a search in order to
find all of the occurrences of a given character or group of characters within
the string.
Notice
that we define the string object
sieveChars
using a constructor idiom which sets the initial size of the character array
and writes the value ‘P’ to each of its member.
//: C17:Sieve.cpp
#include <string>
#include <iostream>
using namespace std;
int main() {
// Create a 50 char string and set each
// element to 'P' for Prime
string sieveChars(50, 'P');
// By definition neither 0 nor 1 is prime.
// Change these elements to "N" for Not Prime
sieveChars.replace(0, 2, "NN");
// Walk through the array:
for(int i = 2;
i <= (sieveChars.size() / 2) - 1; i++)
// Find all the factors:
for(int factor = 2;
factor * i < sieveChars.size();factor++)
sieveChars[factor * i] = 'N';
cout << "Prime:" << endl;
// Return the index of the first 'P' element:
int i = sieveChars.find('P');
// While not at the end of the string:
while(i != sieveChars.npos) {
// If the element is P, the index is a prime
cout << i << " ";
// Move past the last prime
i++;
// Find the next prime
i = sieveChars.find('P', i);
}
cout << "\n Not prime:" << endl;
// Find the first element value not equal P:
i = sieveChars.find_first_not_of('P');
while(i != sieveChars.npos) {
cout << i << " ";
i++;
i = sieveChars.find_first_not_of('P', i);
}
} ///:~
The
output from
Sieve.cpp
looks like this:
Prime:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47
Not prime:
0 1 4 6 8 9 10 12 14 15 16 18 20 21 22
24 25 26 27 28 30 32 33 34 35 36 38 39
40 42 44 45 46 48 49
find( )
allows you to walk forward through a
string,
detecting multiple occurrences of a character or group of characters, while
find_first_not_of( )
allows you to test for the absence of a character or group.
The
find
member is also useful for detecting the occurrence of a sequence of characters
in a
string:
//: C17:Find.cpp
// Find a group of characters in a string
#include <string>
#include <iostream>
using namespace std;
int main() {
string chooseOne("Eenie, meenie, miney, mo");
int i = chooseOne.find("een");
while(i != string::npos) {
cout << i << endl;
i++;
i = chooseOne.find("een", i);
}
} ///:~
Find.cpp
produces a single line of output :
This
tells us that the first ‘e’ of the search group “een”
was found in the word “meenie,” and is the eighth element in the
string. Notice that
find
passed over the “Een” group of characters in the word
“Eenie”. The
find
member function performs a
case
sensitive
search.
There
are no functions in the
string
class
to change the case of a string, but these functions can be easily created using
the Standard C library functions
toupper( )
and
tolower( ),
which change the case of one character at a time. A few small changes will make
Find.cpp
perform a case insensitive search:
//: C17:NewFind.cpp
#include <string>
#include <iostream>
#include <cstdio>
using namespace std;
// Make an uppercase copy of s:
string upperCase(string& s) {
char* buf = new char[s.length()];
s.copy(buf, s.length());
for(int i = 0; i < s.length(); i++)
buf[i] = toupper(buf[i]);
string r(buf, s.length());
delete buf;
return r;
}
// Make a lowercase copy of s:
string lowerCase(string& s) {
char* buf = new char[s.length()];
s.copy(buf, s.length());
for(int i = 0; i < s.length(); i++)
buf[i] = tolower(buf[i]);
string r(buf, s.length());
delete buf;
return r;
}
int main() {
string chooseOne("Eenie, meenie, miney, mo");
cout << chooseOne << endl;
cout << upperCase(chooseOne) << endl;
cout << lowerCase(chooseOne) << endl;
// Case sensitive search
int i = chooseOne.find("een");
while(i != string::npos) {
cout << i << endl;
i++;
i = chooseOne.find("een", i);
}
// Search lowercase:
string lcase = lowerCase(chooseOne);
cout << lcase << endl;
i = lcase.find("een");
while(i != lcase.npos) {
cout << i << endl;
i++;
i = lcase.find("een", i);
}
// Search uppercase:
string ucase = upperCase(chooseOne);
cout << ucase << endl;
i = ucase.find("EEN");
while(i != ucase.npos) {
cout << i << endl;
i++;
i = ucase.find("EEN", i);
}
} ///:~
Both
the
upperCase( )
and
lowerCase( )
functions follow the same form: they allocate storage to hold the data in the
argument
string,
copy the data and change the case. Then they create a new
string
with the new data, release the buffer and return the result
string.
The
c_str( )
function cannot be used to produce a pointer to directly manipulate the data in
the
string
because
c_str( )
returns a pointer to
const.
That is, you’re not allowed to manipulate
string
data with a pointer, only with member functions. If you need to use the more
primitive
char
array manipulation, you should use the technique shown above.
The
output looks like this:
Eenie, meenie, miney, mo
eenie, meenie, miney, mo
EENIE, MEENIE, MINEY, MO
8
eenie, meenie, miney, mo
0
8
EENIE, MEENIE, MINEY, MO
0
8
The
case insensitive searches found both occurrences on the “een” group.
NewFind.cpp
isn’t the best solution to the case sensitivity problem, so we’ll
revisit it when we examine
string
comparisons.
Finding
in reverse
Sometimes
it’s necessary to search through a
string
from end to beginning, if you need to find the data in “last in / first
out “ order. The string member function
rfind( )
handles this job.
//: C17:Rparse.cpp
// Reverse the order of words in a string
#include <string>
#include <iostream>
#include <vector>
using namespace std;
int main() {
// The ';' characters will be delimiters
string s("now.;sense;make;to;going;is;This");
cout << s << endl;
// To store the words:
vector<string> strings;
// The last element of the string:
int last = s.size();
// The beginning of the current word:
int current = s.rfind(';');
// Walk backward through the string:
while(current != string::npos){
// Push each word into the vector.
// Current is incremented before copying to
// avoid copying the delimiter.
strings.push_back(
s.substr(++current,last - current));
// Back over the delimiter we just found,
// and set last to the end of the next word
current -= 2;
last = current;
// Find the next delimiter
current = s.rfind(';', current);
}
// Pick up the first word - it's not
// preceded by a delimiter
strings.push_back(s.substr(0, last - current));
// Print them in the new order:
for(int j = 0; j < strings.size(); j++)
cout << strings[j] << " ";
} ///:~
Here’s
how the output from
Rparse.cpp
looks:
now.;sense;make;to;going;is;This
This is going to make sense now.
rfind( )
backs through the string looking for tokens, reporting the array index of
matching characters or
string::npos
if it is unsuccessful.
Finding
first/last of a set
The
find_first_of( )
and
find_last_of( )
member functions can be conveniently put to work to create a little utility
that will strip whitespace characters off of both ends of a string. Notice it
doesn’t touch the original string, but instead returns a new string:
//: C17:trim.h
#ifndef TRIM_H
#define TRIM_H
#include <string>
// General tool to strip spaces from both ends:
inline std::string trim(const std::string& s) {
if(s.length() == 0)
return s;
int b = s.find_first_not_of(" \t");
int e = s.find_last_not_of(" \t");
if(b == -1) // No non-spaces
return "";
return std::string(s, b, e - b + 1);
}
#endif // TRIM_H ///:~
The
first test checks for an empty
string;
in that case no tests are made and a copy is returned. Notice that once the end
points are found, the
string
constructor is used to build a new
string
from the old one, giving the starting count and the length. This form also
utilizes the “return value optimization” (see the index for more
details).
Testing
such a general-purpose tool needs to be thorough:
//: C17:TrimTest.cpp
#include <iostream>
#include "trim.h"
using namespace std;
string s[] = {
" \t abcdefghijklmnop \t ",
"abcdefghijklmnop \t ",
" \t abcdefghijklmnop",
"a", "ab", "abc", "a b c",
" \t a b c \t ", " \t a \t b \t c \t ",
"", // Must also test the empty string
};
void test(string s) {
cout << "[" << trim(s) << "]" << endl;
}
int main() {
for(int i = 0; i < sizeof s / sizeof *s; i++)
test(s[i]);
} ///:~
In
the array of
string
s,
you can see that the character arrays are automatically converted to
string
objects. This array provides cases to check the removal of spaces and tabs from
both ends, as well as ensuring that spaces and tabs do not get removed from the
middle of a
string.
Removing
characters from strings
My
word processor/page layout program (Microsoft Word) will save a document in
HTML, but it doesn’t recognize that the code listings in this book should
be tagged with the HTML “preformatted” tag (<PRE>), and it
puts paragraph marks (<P> and </P>) around every listing line. This
means that all the indentation in the code listings is lost. In addition, Word
saves HTML with reduced font sizes for body text, which makes it hard to read.
To
convert the book to HTML form
[43],
then, the original output must be reprocessed, watching for the tags that mark
the start and end of code listings, inserting the <PRE> and </PRE>
tags at the appropriate places, removing all the <P> and </P> tags
within the listings, and adjusting the font sizes. Removal is accomplished with
the
erase( )
member function, but you must correctly determine the starting and ending
points of the substring you wish to erase. Here’s the program that
reprocesses the generated HTML file:
//: C17:ReprocessHTML.cpp
// Take Word's html output and fix up
// the code listings and html tags
#include <iostream>
#include <fstream>
#include <string>
#include "../require.h"
using namespace std;
// Produce a new string which is the original
// string with the html paragraph break marks
// stripped off:
string stripPBreaks(string s) {
int br;
while((br = s.find("<P>")) != string::npos)
s.erase(br, strlen("<P>"));
while((br = s.find("</P>")) != string::npos)
s.erase(br, strlen("</P>"));
return s;
}
// After the beginning of a code listing is
// detected, this function cleans up the listing
// until the end marker is found. The first line
// of the listing is passed in by the caller,
// which detects the start marker in the line.
void fixupCodeListing(istream& in,
ostream& out, string& line, int tag) {
out << line.substr(0, tag)
<< "<PRE>" // Means "preformatted" in html
<< stripPBreaks(line.substr(tag)) << endl;
string s;
while(getline(in, s)) {
int endtag = s.find("/""/""/"":~");
if(endtag != string::npos) {
endtag += strlen("/""/""/"":~");
string before = s.substr(0, endtag);
string after = s.substr(endtag);
out << stripPBreaks(before) << "</PRE>"
<< after << endl;
return;
}
out << stripPBreaks(s) << endl;
}
}
string removals[] = {
"<FONT SIZE=2>",
"<FONT SIZE=1>",
"<FONT FACE=\"Times\" SIZE=1>",
"<FONT FACE=\"Times\" SIZE=2>",
"<FONT FACE=\"Courier\" SIZE=1>",
"SIZE=1", // Eliminate all other '1' & '2' size
"SIZE=2",
};
const int rmsz =
sizeof(removals)/sizeof(*removals);
int main(int argc, char* argv[]) {
requireArgs(argc, 2);
ifstream in(argv[1]);
assure(in, argv[1]);
ofstream out(argv[2]);
string line;
while(getline(in, line)) {
// The "Body" tag only appears once:
if(line.find("<BODY") != string::npos) {
out << "<BODY BGCOLOR=\"#FFFFFF\" "
"TEXT=\"#000000\">" << endl;
continue; // Get next line
}
// Eliminate each of the removals strings:
for(int i = 0; i < rmsz; i++) {
int find = line.find(removals[i]);
if(find != string::npos)
line.erase(find, removals[i].size());
}
int tag1 = line.find("/""/"":");
int tag2 = line.find("/""*"":");
if(tag1 != string::npos)
fixupCodeListing(in, out, line, tag1);
else if(tag2 != string::npos)
fixupCodeListing(in, out, line, tag2);
else
out << line << endl;
}
} ///:~
Notice
the lines that detect the start and end listing tags by indicating them with
each character in quotes. These tags are treated in a special way by the logic
in the
Extractcode.cpp
tool
for extracting code listings. To present the code for the tool in the text of
the book, the tag sequence itself must not occur in the listing. This was
accomplished by taking advantage of a C++ preprocessor feature that causes text
strings delimited by adjacent pairs of double quotes to be merged into a single
string during the preprocessor pass of the build.
int
tag1 = line.find("/""/"":");
The
effect of the sequence of
char
arrays is to produce the starting tag for code listings.
Stripping
HTML tags
Sometimes
it’s useful to take an HTML file and strip its tags so you have something
approximating the text that would be displayed in the Web browser, only as an
ASCII text file. The
string
class once again comes in handy. The following has some variation on the theme
of the previous example:
//: C17:HTMLStripper.cpp
// Filter to remove html tags and markers
#include <fstream>
#include <iostream>
#include <string>
#include "../require.h"
using namespace std;
string replaceAll(string s, string f, string r) {
unsigned int found = s.find(f);
while(found != string::npos) {
s.replace(found, f.length(), r);
found = s.find(f);
}
return s;
}
string stripHTMLTags(string s) {
while(true) {
unsigned int left = s.find('<');
unsigned int right = s.find('>');
if(left==string::npos || right==string::npos)
break;
s = s.erase(left, right - left + 1);
}
s = replaceAll(s, "<", "<");
s = replaceAll(s, ">", ">");
s = replaceAll(s, "&", "&");
s = replaceAll(s, " ", " ");
// Etc...
return s;
}
int main(int argc, char* argv[]) {
requireArgs(argc, 1,
"usage: HTMLStripper InputFile");
ifstream in(argv[1]);
assure(in, argv[1]);
const int sz = 4096;
char buf[sz];
while(in.getline(buf, sz)) {
string s(buf);
cout << stripHTMLTags(s) << endl;
}
} ///:~
The
string
class can replace one string with another but there’s no facility for
replacing all the strings of one type with another, so the
replaceAll( )
function does this for you, inside a
while
loop that keeps finding the next instance of the find string
f.
That function is used inside
stripHTMLTags
after it uses
erase( )
to remove everything that appears inside angle braces (‘
<‘
and ‘
>‘).
Note that I probably haven’t gotten all the necessary replacement values,
but you can see what to do (you might even put all the find-replace pairs in a
table...). In
main( )
the arguments are checked, and the file is read and converted. It is sent to
standard output so you must redirect it with ‘
>‘
if you want to write it to a file.
Comparing
strings
Comparing
strings is inherently different than comparing numbers. Numbers have constant,
universally meaningful values. To evaluate the relationship between the
magnitude of two strings, you must make a
lexical
comparison
.
Lexical comparison means that when you test a character to see if it is
“greater than” or “less than” another character, you
are actually comparing the numeric representation of those characters as
specified in the collating sequence of the character set being used. Most
often, this will be the ASCII collating sequence, which assigns the printable
characters for the English language numbers in the range from 32 to 127
decimal. In the ASCII collating sequence, the first “character” in
the list is the space, followed by several common punctuation marks, and then
uppercase and lowercase letters. With respect to the alphabet, this means that
the letters nearer the front have lower ASCII values than those nearer the end.
With these details in mind, it becomes easier to remember that when a lexical
comparison that reports s1 is “greater than” s2, it simply means
that when the two were compared, the first differing character in s1 came later
in the alphabet than the character in that same position in s2.
C++
provides several ways to compare strings, and each has their advantages. The
simplest to use are the non member overloaded operator functions
operator
==, operator != operator >, operator <, operator >=,
and
operator <=.
//: C17:CompStr.cpp
#include <string>
#include <iostream>
using namespace std;
int main() {
// Strings to compare
string s1("This ");
string s2("That ");
for(int i = 0; i< s1.size() &&
i < s2.size(); i++)
// See if the string elements are the same:
if(s1[i] == s2[i])
cout << s1[i] << " " << i << endl;
// Use the string inequality operators
if(s1 != s2) {
cout << "Strings aren't the same:" << " ";
if(s1 > s2)
cout << "s1 is > s2" << endl;
else
cout << "s2 is > s1" << endl;
}
} ///:~
Here’s
the output from
CompStr.cpp:
T 0
h 1
4
Strings aren’t the same: s1 is > s2
The
overloaded comparison operators are useful for comparing both full strings and
individual string elements.
Notice
in the code fragment below the flexibility of argument types on both the left
and right hand side of the comparison operators. The overloaded operator set
allows the direct comparison of string objects, quoted literals, and pointers
to C style strings.
// The lvalue is a quoted literal and
// the rvalue is a string
if("That " == s2)
cout << "A match" << endl;
// The lvalue is a string and the rvalue is a
// pointer to a c style null terminated string
if(s1 != s2.c_str())
cout << "No match" << endl;
You
won’t find the logical not (!) or the logical comparison operators
(&& and ||) among operators for string. (Neither will you find
overloaded versions of the bitwise C operators &, |, ^, or ~.) The
overloaded non member comparison operators for the string class are limited to
the subset which has clear, unambiguous application to single characters or
groups of characters.
The
compare( )
member function offers you a great deal more sophisticated and precise
comparison than the non member operator set, because it returns a lexical
comparison value, and provides for comparisons that consider subsets of the
string data. It provides overloaded versions that allow you to compare two
complete strings, part of either string to a complete string, and subsets of
two strings. This example compares complete strings:
//: C17:Compare.cpp
// Demonstrates compare(), swap()
#include <string>
#include <iostream>
using namespace std;
int main() {
string first("This");
string second("That");
// Which is lexically greater?
switch(first.compare(second)) {
case 0: // The same
cout << first << " and " << second <<
" are lexically equal" << endl;
break;
case -1: // Less than
first.swap(second);
// Fall through this case...
case 1: // Greater than
cout << first <<
" is lexically greater than " <<
second << endl;
}
} ///:~
The
output from
Compare.cpp
looks like this:
This
is lexically greater than That
To
compare a subset of the characters in one or both strings, you add arguments
that define where to start the comparison and how many characters to consider.
For example, we can use the overloaded version of
compare( ): s1.compare(s1artPos,
s1NumberChars, s2, s2StartPos, s2NumberChars);
If
we substitute the above version of
compare( )
in the previous program so that it only looks at the first two characters of
each string, the program becomes:
//: C17:Compare2.cpp
// Overloaded compare()
#include <string>
#include <iostream>
using namespace std;
int main() {
string first("This");
string second("That");
// Compare first two characters of each string:
switch(first.compare(0, 2, second, 0, 2)) {
case 0: // The same
cout << first << " and " << second <<
" are lexically equal" << endl;
break;
case -1: // Less than
first.swap(second);
// Fall through this case...
case 1: // Greater than
cout << first <<
" is lexically greater than " <<
second << endl;
}
} ///:~
This
and That are lexically equal
which
is true, for the first two characters of “This” and
“That.”
Indexing
with [ ] vs. at( )
In
the examples so far, we have used C style array indexing syntax to refer to an
individual character in a string. C++ strings provide an alternative to the
s[n]
notation:
the
at( )
member. These two idioms produce the same result in C++ if all goes well:
//: C17:StringIndexing.cpp
#include <string>
#include <iostream>
using namespace std;
int main(){
string s("1234");
cout << s[1] << " ";
cout << s.at(1) << endl;
} ///:~
The
output from this code looks like this:
However,
there is one important difference between
[
]
and
at( ).
When you try to reference an array element that is out of bounds,
at( )
will
do you the kindness of throwing an exception, while ordinary
[
]
subscripting
syntax will leave you to your own devices:
//: C17:BadStringIndexing.cpp
#include <string>
#include <iostream>
using namespace std;
int main(){
string s("1234");
// Run-time problem: goes beyond array bounds:
cout << s[5] << endl;
// Saves you by throwing an exception:
cout << s.at(5) << endl;
} ///:~
Using
at( )
in
place of
[
]
will
give you a chance to gracefully recover from references to array elements that
don’t exist.
at( )
throws an object of class
out_of_range.
By
catching this object in an exception handler, you can take appropriate remedial
actions such as recalculating the offending subscript or growing the array.
(You can read more about Exception Handling in Chapter XX)
Using
iterators
In
the example program
NewFind.cpp,
we used a lot of messy and rather tedious C
char
array handling code to change the case of the characters in a string and then
search for the occurrence of matches to a substring. Sometimes the “quick
and dirty” method is justifiable, but in general, you won’t want to
sacrifice the advantages of having your string data safely and securely
encapsulated in the C++ object where it lives.
Here
is a better, safer way to handle case insensitive comparison of two C++ string
objects. Because no data is copied out of the objects and into C style strings,
you don’t have to use pointers and you don’t have to risk
overwriting the bounds of an ordinary character array. In this example, we use
the string
iterator.
Iterators are themselves objects which move through a collection or container
of other objects, selecting them one at a time, but never providing direct
access to the implementation of the container. Iterators are
not
pointers, but they are useful for many of the same jobs.
//: C17:CmpIter.cpp
// Find a group of characters in a string
#include <string>
#include <iostream>
using namespace std;
// Case insensitive compare function:
int
stringCmpi(const string& s1, const string& s2) {
// Select the first element of each string:
string::const_iterator
p1 = s1.begin(), p2 = s2.begin();
// Don’t run past the end:
while(p1 != s1.end() && p2 != s2.end()) {
// Compare upper-cased chars:
if(toupper(*p1) != toupper(*p2))
// Report which was lexically greater:
return (toupper(*p1)<toupper(*p2))? -1 : 1;
p1++;
p2++;
}
// If they match up to the detected eos, say
// which was longer. Return 0 if the same.
return(s2.size() - s1.size());
}
int main() {
string s1("Mozart");
string s2("Modigliani");
cout << stringCmpi(s1, s2) << endl;
} ///:~
Notice
that the iterators
p1
and
p2
use the same syntax as C pointers – the ‘
*’
operator makes the
value
of
element at the location given by the iterators available to the
toupper( )
function.
toupper( )
doesn’t
actually change the content of the element in the string. In fact, it
can’t. This definition of
p1
tells us that we can only use the elements
p1
points to as constants.
string::const_iterator
p1 = s1.begin();
The
way
toupper( )
and the iterators are used in this example is called a
case
preserving
case
insensitive comparison. This means that the string didn’t have to be
copied or rewritten to accommodate case insensitive comparison. Both of the
strings retain their original data, unmodified.
Iterating
in reverse
Just
as the standard C pointer gives us the increment (++) and decrement (--)
operators to make pointer arithmetic a bit more convenient, C++ string
iterators come in two basic varieties. You’ve seen
end( )
and
begin( ),
which are the tools for moving forward through a string one element at a time.
The reverse iterators
rend( )
and
rbegin( )
allow you to step backwards through a string. Here’s how they work:
//: C17:RevStr.cpp
// Print a string in reverse
#include <string>
#include <iostream>
using namespace std;
int main() {
string s("987654321");
// Use this iterator to walk backwards:
string::reverse_iterator rev;
// "Incrementing" the reverse iterator moves
// it to successively lower string elements:
for(rev = s.rbegin(); rev != s.rend(); rev++)
cout << *rev << " ";
} ///:~
The
output from
RevStr.cpp
looks like this:
Reverse
iterators act like pointers to elements of the string’s character array,
except
that when you apply the increment operator to them, they move backward rather
than forward
.
rbegin( )
and
rend( )
supply string locations that are consistent with this behavior, to wit,
rbegin( )
locates the position just beyond the end of the string, and
rend( )
locates the beginning. Aside from this, the main thing to remember about
reverse iterators is that they
aren’t
type equivalent to ordinary iterators. For example, if a member function
parameter list includes an iterator as an argument, you can’t substitute
a reverse iterator to get the function to perform it’s job walking
backward through the string. Here’s an illustration:
// The compiler won’t accept this
string sBackwards(s.rbegin(), s.rend());
The
string constructor won’t accept reverse iterators in place of forward
iterators in its parameter list. This is also true of string members such as
copy( ),
insert( ),
and
assign( ).
Strings
and character traits
We
seem to have worked our way around the margins of case insensitive string
comparisons using C++ string objects, so maybe it’s time to ask the
obvious question: “Why isn’t case-insensitive comparison part of
the standard
string
class ?” The answer provides interesting background on the true nature of
C++ string objects.
Consider
what it means for a character to have “case.” Written Hebrew,
Farsi, and Kanji don’t use the concept of upper and lower case, so for
those languages this idea has no meaning at all. This the first impediment to
built-in C++ support for case-insensitive character search and comparison: the
idea of case sensitivity is not universal, and therefore not portable.
It
would seem that if there were a way of designating that some languages were
“all uppercase” or “all lowercase” we could design a
generalized solution. However, some languages which employ the concept of
“case”
also
change the meaning of particular characters with diacritical marks: the cedilla
in Spanish, the circumflex in French, and the umlaut in German. For this
reason, any case-sensitive collating scheme that attempts to be comprehensive
will be nightmarishly complex to use.
Although
we usually treat the C++
string
as a class, this is really not the case.
string
is a
typedef
of a more general constituent, the
basic_string< >
template. Observe how
string
is declared in the standard C++ header file:
typedef
basic_string<char> string;
To
really understand the nature of strings, it’s helpful to delve a bit
deeper and look at the template on which it is based. Here’s the
declaration of the
basic_string< >
template:
template<class charT,
class traits = char_traits<charT>,
class allocator = allocator<charT> >
class basic_string;
Earlier
in this book, templates were examined in a great deal of detail. The main thing
to notice about the two declarations above are that the
string
type is created when the
basic_string
template is instantiated with
char.
Inside
the
basic_string< >
template
declaration, the line
class
traits = char_traits<charT>,
tells
us that the behavior of the class made from the
basic_string< >
template
is specified by a class based on the template
char_traits< >.
Thus, the
basic_string< >
template provides for cases where you need string oriented classes that
manipulate types other than
char
(wide characters or unicode, for example). To do this, the
char_traits< >
template
controls the content and collating behaviors of a variety of character sets
using the character comparison functions
eq( )
(equal),
ne( )
(not equal), and
lt( )
(less than) upon which the
basic_string< >
string
comparison functions rely.
This
is why the string class doesn’t include case insensitive member
functions: That’s not in its job description. To change the way the
string class treats character comparison, you must supply a different of
char_traits< >
template, because that defines the behavior of the individual character
comparison member functions.
This
information can be used to make a new type of
string
class
that ignores case. First, we’ll define a new case insensitive
char_traits< >
template that inherits the existing one. Next, we’ll override only the
members we need to change in order to make character-by-character comparison
case insensitive. (In addition to the three lexical character comparison
members mentioned above, we’ll also have to supply new implementation of
find( )
and
compare( ).)
Finally, we’ll
typedef
a new class based on
basic_string,
but using the case insensitive
ichar_traits
template for its second argument.
//: C17:ichar_traits.h
// Creating your own character traits
#ifndef ICHAR_TRAITS_H
#define ICHAR_TRAITS_H
#include <string>
#include <cctype>
struct ichar_traits : std::char_traits<char> {
// We'll only change character by
// character comparison functions
static bool eq(char c1st, char c2nd) {
return
std::toupper(c1st) == std::toupper(c2nd);
}
static bool ne(char c1st, char c2nd) {
return
std::toupper(c1st) != std::toupper(c2nd);
}
static bool lt(char c1st, char c2nd) {
return
std::toupper(c1st) < std::toupper(c2nd);
}
static int compare(const char* str1,
const char* str2, size_t n) {
for(int i = 0; i < n; i++) {
if(std::tolower(*str1)>std::tolower(*str2))
return 1;
if(std::tolower(*str1)<std::tolower(*str2))
return -1;
if(*str1 == 0 || *str2 == 0)
return 0;
}
return 0;
}
static const char* find(const char* s1,
int n, char c) {
while(n-- > 0 &&
std::toupper(*s1) != std::toupper(c))
s1++;
return s1;
}
};
#endif // ICHAR_TRAITS_H ///:~
If
we
typedef
an
istring
class like this:
typedef basic_string<char, ichar_traits,
allocator<char> > istring;
Then
this
istring
will act like an ordinary
string
in every way, except that it will make all comparisons without respect to case.
Here’s an example:
//: C17:ICompare.cpp
#include <string>
#include <iostream>
#include "ichar_traits.h"
using namespace std;
typedef basic_string<char, ichar_traits,
allocator<char> > istring;
int main() {
// The same letters except for case:
istring first = "tHis";
istring second = "ThIS";
cout << first.compare(second) << endl;
} ///:~
The
output from the program is “0”, indicating that the strings compare
as equal. This is just a simple example – in order to make
istring
fully equivalent to
string,
we’d have to create the other functions necessary to support the new
istring
type.
[43]
I subsequently found better tools to accomplish this task, but the program is
still interesting.
Go to CodeGuru.com
Contact: webmaster@codeguru.com
© Copyright 1997-1999 CodeGuru