set

The set produces a container that will accept only one of each thing you place in it; it also sorts the elements. (Sorting isn't intrinsic to the conceptual definition of a set, but the STL set stores its elements in a balanced binary tree to provide rapid lookups, thus producing sorted results when you traverse it.) The first two examples in this chapter used sets.

Consider the problem of creating an index for a book. You might like to start with all the words in the book, but you only want one instance of each word and you want them sorted. Of course, a set is perfect for this, and solves the problem effortlessly. However, there's also the problem of punctuation and other non-alpha characters, which must be stripped off to generate proper words. One solution to this problem is to use the Standard C library function strtok( ), which produces tokens (in our case, words) given a set of delimiters to strip out:
//: C20:WordList.cpp
// Display a list of words used in a document
#include <string>
#include <cstring>
#include <set>
#include <iostream>
#include <fstream>
#include <iterator>   // ostream_iterator
#include <algorithm>  // copy
#include "../require.h"
using namespace std;

const char* delimiters =
  " \t;()\"<>:{}[]+-=&*#.,/\\~";

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  set<string> wordlist;
  string line;
  while(getline(in, line)) {
    // Capture individual words:
    char* s = // Cast probably won't crash:
      strtok((char*)line.c_str(), delimiters);
    while(s) {
      // Automatic type conversion:
      wordlist.insert(s);
      s = strtok(0, delimiters);
    }
  }
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
       ostream_iterator<string>(cout, "\n"));
} ///:~
strtok( ) takes the starting address of a character buffer (the first argument) and looks for delimiters (the second argument). It replaces each delimiter with a zero and returns the address of the beginning of the token. If you call it on subsequent occasions with a first argument of zero, it continues extracting tokens from the rest of the string until it finds the end. In this case, the delimiters are those that delimit the keywords and identifiers of C++, so it extracts those keywords and identifiers. Each word is turned into a string and placed into the wordlist set, which eventually contains the whole file, broken up into words.
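As a quick standalone illustration of that calling sequence (this sketch and its buffer are invented for the example, not part of the book's program):

#include <cstring>
#include <iostream>
using namespace std;

int main() {
  // strtok() must be allowed to write zeros into the buffer, so a
  // modifiable array is used, not a pointer to a string literal:
  char buf[] = "one, two; three";
  for(char* s = strtok(buf, " ,;"); s != 0; s = strtok(0, " ,;"))
    cout << s << endl; // Prints "one", "two", "three", one per line
}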
You don't have to use a set just to get a sorted sequence. You can use the sort( ) function (along with a multitude of other functions in the STL) on different STL containers. However, it's likely that set will be faster.
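For instance, you could collect the words into a vector and sort it afterward; here is a minimal sketch (with made-up input data) of that alternative:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
  vector<string> words;
  words.push_back("banana");
  words.push_back("apple");
  words.push_back("apple");
  sort(words.begin(), words.end());
  // unique() packs duplicates to the end; erase() discards them,
  // reproducing the one-of-each behavior a set gives for free:
  words.erase(unique(words.begin(), words.end()), words.end());
  for(vector<string>::size_type i = 0; i < words.size(); i++)
    cout << words[i] << endl; // Prints: apple banana
}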
Eliminating strtok( )
Some programmers consider strtok( ) to be the poorest design in the Standard C library because it uses a static buffer to hold its data between function calls. This means:

- You can't use strtok( ) in two places at the same time (demonstrated in the sketch following this list)
- You can't use strtok( ) in a multithreaded program
- You can't use strtok( ) in a library that might be used in a multithreaded program
- strtok( ) modifies the input sequence, which can produce unexpected side effects
- strtok( ) depends on reading in "lines", which means you need a buffer big enough for the longest line. This produces both wastefully large buffers and the risk of encountering a line longer than the "longest" one you planned for, which can also introduce security holes. (Notice that the buffer-size problem was eliminated in WordList.cpp by using string input, but this required a cast so that strtok( ) could modify the data in the string – a dangerous approach for general-purpose programming.)
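The first point on that list is easy to demonstrate. In this hypothetical sketch, the inner tokenizing loop silently overwrites the hidden position the outer loop depends on, so the outer loop sees only its first token:

#include <cstring>
#include <iostream>
using namespace std;

int main() {
  char outer[] = "a b c";
  char inner[] = "1 2 3";
  for(char* s = strtok(outer, " "); s; s = strtok(0, " ")) {
    cout << "outer: " << s << endl;
    // This loop resets strtok()'s static state, so the next
    // strtok(0, " ") above no longer refers to 'outer':
    for(char* t = strtok(inner, " "); t; t = strtok(0, " "))
      cout << "  inner: " << t << endl;
  }
  // Prints "outer: a" and the three inner tokens;
  // "b" and "c" are never seen.
}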
For all these reasons it seems like a good idea to find an alternative for strtok( ). The following example uses an istreambuf_iterator (introduced earlier) to move characters from one place (which happens to be an istream) to another (which happens to be a string), depending on whether the Standard C library function isalpha( ) is true:
//: C20:WordList2.cpp
// Eliminating strtok() from WordList.cpp
#include <string>
#include <cctype>
#include <set>
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include "../require.h"
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  istreambuf_iterator<char> p(in), end;
  set<string> wordlist;
  while(p != end) {
    string word;
    insert_iterator<string>
      ii(word, word.begin());
    // Find the first alpha character
    // (test p != end before dereferencing):
    while(p != end && !isalpha(*p))
      p++;
    // Copy until the first non-alpha character:
    while(p != end && isalpha(*p))
      *ii++ = *p++;
    if(word.size() != 0)
      wordlist.insert(word);
  }
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
       ostream_iterator<string>(cout, "\n"));
} ///:~
This example was suggested by Nathan Myers, who invented the istreambuf_iterator and its relatives. This iterator extracts information character-by-character from a stream. Although the istreambuf_iterator template argument might suggest that you could extract, for example, ints instead of chars, that's not the case. The argument must be of some character type – a regular char or a wide character.
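To see the iterator by itself, here is a sketch (the file name is just an example) that drains an entire stream, unfiltered, into a single string:

#include <fstream>
#include <iterator>
#include <string>
using namespace std;

int main() {
  ifstream in("WordList2.cpp");
  istreambuf_iterator<char> begin(in), end; // Default ctor == end of stream
  string contents(begin, end); // Every character, including whitespace
}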
After the file is opened, an istreambuf_iterator called p is attached to the istream so characters can be extracted from it. The set<string> called wordlist will hold the resulting words.

The while loop reads words until the end of the input stream is found. The end is detected using the default constructor for istreambuf_iterator, which produces the past-the-end iterator object end. Thus, to test whether you're at the end of the stream, you simply say p != end.
The second type of iterator used here is the insert_iterator, which creates an iterator that knows how to insert objects into a container. Here, the "container" is the string called word which, for the purposes of insert_iterator, behaves like a container. The constructor for insert_iterator requires the container and an iterator indicating where it should start inserting the characters. You could also use a back_insert_iterator, which requires that the container have a push_back( ) (string does). There is also a front_insert_iterator, but that requires the container have a push_front( ), which string does not.
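A quick sketch (with made-up data) showing both usable inserter flavors aimed at a string:

#include <iostream>
#include <iterator>
#include <string>
using namespace std;

int main() {
  string s("XY");
  insert_iterator<string> ins(s, s.begin());
  *ins++ = 'a'; // Inserts at the position given to the constructor
  back_insert_iterator<string> bck(s);
  *bck++ = 'z'; // Equivalent to s.push_back('z')
  cout << s << endl; // Prints: aXYz
}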
After the setup, the while loop begins by looking for the first alpha character, incrementing p until that character is found. Then it copies characters from one iterator to the other, stopping when a non-alpha character is found. Each word, assuming it is non-empty, is added to wordlist.
StreamTokenizer: a more flexible solution
The above program parses its input into strings of words containing only alpha characters, but that's still a special case compared to the generality of strtok( ). What we'd like now is an actual replacement for strtok( ) so we're never tempted to use it. WordList2.cpp can be modified to create a class called StreamTokenizer that delivers a new token as a string whenever you call next( ), according to the delimiters you give it upon construction (very similar to strtok( )):
//: C20:StreamTokenizer.h
// C++ Replacement for Standard C strtok()
#ifndef STREAMTOKENIZER_H
#define STREAMTOKENIZER_H
#include <string>
#include <iostream>
#include <iterator>

class StreamTokenizer {
  typedef std::istreambuf_iterator<char> It;
  It p, end;
  std::string delimiters;
  bool isDelimiter(char c) {
    return
      delimiters.find(c) != std::string::npos;
  }
public:
  StreamTokenizer(std::istream& is,
    std::string delim = " \t\n;()\"<>:{}[]+-=&*#"
    ".,/\\~!0123456789") : p(is), end(It()),
    delimiters(delim) {}
  std::string next(); // Get next token
};
#endif // STREAMTOKENIZER_H ///:~
The default delimiters for the StreamTokenizer constructor extract words with only alpha characters, as before, but now you can choose different delimiters to parse different tokens. The implementation of next( ) looks similar to WordList2.cpp:
//: C20:StreamTokenizer.cpp {O}
#include "StreamTokenizer.h"
using namespace std;

string StreamTokenizer::next() {
  string result;
  if(p != end) {
    insert_iterator<string>
      ii(result, result.begin());
    // Skip delimiters (test p != end before dereferencing):
    while(p != end && isDelimiter(*p))
      p++;
    // Copy characters until the next delimiter:
    while(p != end && !isDelimiter(*p))
      *ii++ = *p++;
  }
  return result;
} ///:~
The first non-delimiter is found, then characters are copied until a delimiter is found, and the resulting string is returned. Here's a test:
//: C20:TokenizeTest.cpp
//{L} StreamTokenizer
// Test StreamTokenizer
#include <string>
#include <iostream>
#include <fstream>
#include <set>
#include <iterator>
#include <algorithm>
#include "../require.h"
#include "StreamTokenizer.h"
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  StreamTokenizer words(in);
  set<string> wordlist;
  string word;
  while((word = words.next()).size() != 0)
    wordlist.insert(word);
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
       ostream_iterator<string>(cout, "\n"));
} ///:~
Now the tool is more reusable than before, but it's still inflexible, because it can only work with an istream. This isn't as bad as it first seems, since a string can be turned into an istream via an istringstream. But in the next section we'll come up with the most general, reusable tokenizing tool, and this should give you a feeling for what "reusable" really means, and the effort necessary to create truly reusable code.
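For example, here is a sketch (the text is invented) that tokenizes a string by wrapping it in an istringstream, which is-an istream and so satisfies the constructor:

#include <iostream>
#include <sstream>
#include <string>
#include "StreamTokenizer.h"
using namespace std;

int main() {
  istringstream source("some text, to tokenize!");
  StreamTokenizer words(source);
  string word;
  while((word = words.next()).size() != 0)
    cout << word << endl; // some, text, to, tokenize
}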
A completely reusable tokenizer
Since the STL containers and algorithms all revolve around iterators, the most flexible solution will itself be an iterator. You could think of the TokenIterator as an iterator that wraps itself around any other iterator that can produce characters. Because it is designed as an input iterator (the most primitive type of iterator) it can be used with any STL algorithm. Not only is it a useful tool in itself, the TokenIterator is also a good example of how you can design your own iterators.[56]

The TokenIterator is doubly flexible: first, you can choose the type of iterator that will produce the char input. Second, instead of just saying what characters represent the delimiters, TokenIterator uses a predicate, which is a function object whose operator( ) takes a char and decides whether it should be in the token. Although the two examples given here have a static concept of what characters belong in a token, you could easily design your own function object to change its state as the characters are read, producing a more sophisticated parser.
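As a sketch of that idea (this predicate is hypothetical, not part of the header below), here is a function object that accepts digits, but only after a token has already started with a letter, roughly like a C++ identifier:

#include <cctype>

class IdentifierPred {
  bool inWord; // True while we're inside a token
public:
  IdentifierPred() : inWord(false) {}
  bool operator()(char c) {
    if(std::isalpha(c)) { inWord = true; return true; }
    if(inWord && std::isdigit(c)) return true;
    inWord = false; // Any other character ends the token
    return false;
  }
};

Handed to TokenIterator in place of Isalpha, this would turn "x2 3y" into the tokens "x2" and "y": a digit can continue a token but never start one.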
The following header file contains the two basic predicates Isalpha and Delimiters, along with the template for TokenIterator:
//: C20:TokenIterator.h
#ifndef TOKENITERATOR_H
#define TOKENITERATOR_H
#include <string>
#include <iterator>
#include <algorithm>
#include <cctype>
#include <cstddef>

struct Isalpha {
  bool operator()(char c) const {
    return std::isalpha(c) != 0;
  }
};

class Delimiters {
  std::string exclude;
public:
  Delimiters() {}
  Delimiters(const std::string& excl)
    : exclude(excl) {}
  bool operator()(char c) const {
    return exclude.find(c) == std::string::npos;
  }
};

template <class InputIter, class Pred = Isalpha>
class TokenIterator : public std::iterator<
  std::input_iterator_tag, std::string, std::ptrdiff_t> {
  InputIter first;
  InputIter last;
  std::string word;
  Pred predicate;
public:
  TokenIterator(InputIter begin, InputIter end,
    Pred pred = Pred())
    : first(begin), last(end), predicate(pred) {
    ++*this;
  }
  TokenIterator() {} // End sentinel
  // Prefix increment:
  TokenIterator& operator++() {
    word.resize(0);
    first = std::find_if(first, last, predicate);
    while(first != last && predicate(*first))
      word += *first++;
    return *this;
  }
  // Postfix increment:
  class Proxy {
    std::string word;
  public:
    Proxy(const std::string& w) : word(w) {}
    std::string operator*() { return word; }
  };
  Proxy operator++(int) {
    Proxy d(word);
    ++*this;
    return d;
  }
  // Produce the actual value:
  std::string operator*() const { return word; }
  const std::string* operator->() const {
    // operator*() returns by value, so point at the stored
    // word rather than taking the address of a temporary:
    return &word;
  }
  // Compare iterators:
  bool operator==(const TokenIterator&) const {
    return word.size() == 0 && first == last;
  }
  bool operator!=(const TokenIterator& rv) const {
    return !(*this == rv);
  }
};
#endif // TOKENITERATOR_H ///:~
TokenIterator is inherited from the std::iterator template. It might appear that some kind of functionality comes with std::iterator, but it is purely a way of tagging an iterator so that a container or algorithm that uses it knows what it's capable of. Here, you can see input_iterator_tag as a template argument – this tells anyone who asks that a TokenIterator has only the capabilities of an input iterator, and cannot be used with algorithms requiring more sophisticated iterators. Apart from the tagging, std::iterator doesn't do anything else, which means you must design all the other functionality in yourself.
TokenIterator may look a little strange at first, because the first constructor requires both a "begin" and "end" iterator as arguments, along with the predicate. Remember that this is a "wrapper" iterator that has no idea how to tell whether it's at the end of its input source, so the ending iterator is necessary in the first constructor. The reason for the second (default) constructor is that the STL algorithms (and any algorithms you write) need a TokenIterator sentinel to be the past-the-end value. Since all the information necessary to see if the TokenIterator has reached the end of its input is collected in the first constructor, this second constructor creates a TokenIterator that is merely used as a placeholder in algorithms.
The core of the behavior happens in operator++. This erases the current value of word using string::resize( ), then finds the first character that satisfies the predicate (thus discovering the beginning of the new token) using find_if( ) (from the STL algorithms, discussed in the following chapter). The resulting iterator is assigned to first, thus moving first forward to the beginning of the token. Then, as long as the end of the input is not reached and the predicate is satisfied, characters are copied into word from the input. Finally, the iterator, now holding the new token, is returned.
The postfix increment requires a proxy object to hold the value before the increment, so it can be returned (see the operator overloading chapter for more details of this). Producing the actual value is a straightforward operator*. The only other functions that must be defined for an input iterator are operator== and operator!=, to indicate whether the TokenIterator has reached the end of its input. You can see that the argument for operator== is ignored – it only cares about whether it has reached its internal last iterator. Notice that operator!= is defined in terms of operator==.

A good test of TokenIterator includes a number of different sources of input characters: an istreambuf_iterator, a const char*, and a deque<char>::iterator. Finally, the original WordList.cpp problem is solved:
//: C20:TokenIteratorTest.cpp
#include <string>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>
#include <deque>
#include <set>
#include <iterator>
#include <algorithm>
#include "TokenIterator.h"
#include "../require.h"
using namespace std;

int main() {
  ifstream in("TokenIteratorTest.cpp");
  assure(in, "TokenIteratorTest.cpp");
  ostream_iterator<string> out(cout, "\n");
  typedef istreambuf_iterator<char> IsbIt;
  IsbIt begin(in), isbEnd;
  Delimiters
    delimiters(" \t\n~;()\"<>:{}[]+-=&*#.,/\\");
  TokenIterator<IsbIt, Delimiters>
    wordIter(begin, isbEnd, delimiters),
    end;
  vector<string> wordlist;
  copy(wordIter, end, back_inserter(wordlist));
  // Output results:
  copy(wordlist.begin(), wordlist.end(), out);
  out = "--------------------------------------";
  // Use a character array as the source:
  const char* cp =
    "typedef std::istreambuf_iterator<char> It";
  TokenIterator<const char*, Delimiters>
    charIter(cp, cp + strlen(cp), delimiters),
    end2;
  vector<string> wordlist2;
  copy(charIter, end2, back_inserter(wordlist2));
  copy(wordlist2.begin(), wordlist2.end(), out);
  out = "--------------------------------------";
  // Use a deque<char> as the source:
  ifstream in2("TokenIteratorTest.cpp");
  deque<char> dc;
  copy(IsbIt(in2), IsbIt(), back_inserter(dc));
  TokenIterator<deque<char>::iterator, Delimiters>
    dcIter(dc.begin(), dc.end(), delimiters),
    end3;
  vector<string> wordlist3;
  copy(dcIter, end3, back_inserter(wordlist3));
  copy(wordlist3.begin(), wordlist3.end(), out);
  out = "--------------------------------------";
  // Reproduce the WordList.cpp example:
  ifstream in3("TokenIteratorTest.cpp");
  TokenIterator<IsbIt, Delimiters>
    wordIter2(IsbIt(in3), isbEnd, delimiters);
  set<string> wordlist4;
  while(wordIter2 != end)
    wordlist4.insert(*wordIter2++);
  copy(wordlist4.begin(), wordlist4.end(), out);
} ///:~
When using an istreambuf_iterator, you create one attached to the istream object and one with the default constructor, as the past-the-end marker. Both of these are used to create the TokenIterator that actually produces the tokens; the default constructor produces the faux TokenIterator past-the-end sentinel (this is just a placeholder and, as mentioned previously, is actually ignored). The TokenIterator produces strings that are inserted into a container which must, naturally, be a container of string – here a vector<string> is used in all cases except the last (you could also concatenate the results onto a string). Other than that, a TokenIterator works like any other input iterator.
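Because of that, any algorithm accepting input iterators will take a TokenIterator directly. A small sketch (the character array is invented) that counts tokens with the standard distance( ) algorithm:

#include <cstring>
#include <iostream>
#include <iterator>
#include "TokenIterator.h"
using namespace std;

int main() {
  const char* cp = "one two three";
  // Default predicate Isalpha; default-constructed end sentinel:
  TokenIterator<const char*> t(cp, cp + strlen(cp)), end;
  cout << distance(t, end) << endl; // Prints: 3
}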
[56] This is another example coached by Nathan Myers.