GRETA:
The GRETA Regular Expression Template Archive

 

Copyright Eric Niebler, 2002

 

 

 

The purpose of this document is to describe how to use the GRETA Regular Expression Template Archive.  It describes the objects in the library, the methods defined on the objects, and the ways to use the objects and methods to perform regular expression pattern matching on strings in C++.  It does not describe regular expression syntax.  It is enough to say that the full Perl 5 syntax is supported.  If you are not familiar with Perl’s regular expression syntax, I recommend reading Chapter 2 of Programming Perl, 2nd Ed. (a.k.a. The Camel Book), one of the many fine books put out by O’Reilly publishers. 


GRETA:  The GRETA Regular Expression Template Archive. 1

Overview.. 3

A Word about Speed. 4

Notice to Users of Version 1.x. 5

The rpattern Object 5

rpattern::string_type. 6

rpattern::rpattern. 6

rpattern::match. 7

rpattern::substitute. 8

rpattern::count 9

rpattern::split 9

rpattern::set_substitution. 9

rpattern::cgroups. 10

match_results, subst_results and split_results. 10

match_results::cbackrefs. 10

match_results::backref. 10

match_results::rstart 10

match_results::rlength. 11

match_results::all_backrefs. 11

subst_results::backref_str 11

split_results::strings. 11

The Syntax Module. 12

register_intrinsic_charset 14

Customizing Your Search. 14

NOCASE.. 15

GLOBAL.. 15

MULTILINE.. 15

SINGLELINE.. 15

EXTENDED.. 15

RIGHTMOST. 15

NOBACKREFS. 15

ALLBACKREFS. 16

FIRSTBACKREFS. 16

NORMALIZE.. 16

Matching Modes. 16

MODE_FAST. 17

MODE_SAFE.. 17

MODE_MIXED.. 17

Known Issues and Perl Incompatibilities. 17

Embedded Code in a Regular Expression. 17

Pattern Modifier Scope. 17

Comment Blocks Before Quantifiers. 18

Variable Width Look-Behind Assertions. 18

Recursive Patterns. 18

Compile-Time Switches. 19

REGEX_WIDE_AND_NARROW... 19

REGEX_POSIX.. 19

REGEX_NO_PERL.. 19

REGEX_DEBUG.. 19

REGEX_DEBUG_HEAP. 19

REGEX_STACK_ALIGNMENT. 19

REGEX_FOLD_INSTANTIATIONS. 20

REGEX_TO_INSTANTIATE.. 20

Miscellaneous. 20

Static Const Patterns. 20

Thread-safety. 21

Stack Usage. 21

DBCS. 22

STL.. 22

VC7 and Managed Code. 22

Template Instantiation. 22

Contact Information. 23

Appendix 1: History. 24

Appendix 2: Implementation Details. 27

 

Overview

The regular expression template library contains objects and functions that make it possible to perform pattern matching and substitution on strings in C++.  They are:

To perform a search or replace operation, you will typically first initialize an rpattern object by giving it a string representing the pattern against which to match.  You will then call a method on the rpattern object (match() or substitute(), for instance), passing it a string to match against and a match_results objects to receive the results of the match.  If the match()/substitute() fails, the method returns false.  If it succeeds, it returns true, and the match_results object stores the resulting array of backreferences internally.  (Here, the term backreference has the same meaning as it does in Perl.  Backreferences provide extra information about what parts of the pattern matched which parts of the string.)  There are methods on the match_results object to make the backreference information available.  For example:

 

#include <iostream>

#include <string>

#include “regexpr2.h”

using namespace std;

using namespace regex;

 

int main() {

    match_results results;

    string str( “The book cost $12.34” );

    rpattern pat( “\\$(\\d+)(\\.(\\d\\d))?” ); 

    // Match a dollar sign followed by one or more digits,

    // optionally followed by a period and two more digits.

    // The double-escapes are necessary to satisfy the compiler.

 

    match_results::backref_type br = pat.match( str, results );

    if( br.matched ) {

        cout << “match success!” << endl;

        cout << “price: ” << br << endl;

    } else {

        cout << “match failed!” << endl;

    }

    return 0;

}

 

The above program would print out the following:

 

match success!

price: $12.34

 

The following sections discuss the rpattern object in detail and how to customize your searches to be faster and more efficient.

 

Note: all declarations in the header file (regexpr2.h) are contained in the regex namespace.  To use any of the objects, methods or enumerations described in this document, you must prepend all declarations with “regex::” or you must have the “using namespace regex;” directive somewhere within the enclosing scope of your declarations.  For simplicity, I’ve left off the “regex::” prefixes in the rest of my code snippets.

 

A Word about Speed

Different regex engines are good on different types of patterns.  That said, I have found my regex engine to be pretty quick.  For a benchmark, I matched the pattern “^([0-9]+)(\-| |$)(.*)$” against the string “100- this is a line of ftp response which contains a message string”.  GRETA is about 7 times faster than the regex library in boost (http://www.boost.org), and about 10 times faster than the regular expression classes in ATL7.  For this input, GRETA is even faster than Perl, although Perl is faster for some other patterns.  Most regex engines I have seen build up an NFA (non-deterministic finite state automaton) and execute it iteratively, often with a big, slow switch statement.  I have a different approach: patterns are compiled into a directed, possibly cyclic graph, and matching happens by traversing this graph recursively.  In addition, the code makes heavy use of templates to freeze the state of the flags into the compiled pattern so that they don’t need to be checked at match time.  The result is a pretty lean blob of code that can match your pattern quickly.

 

Even the best algorithms have their weaknesses, though.  Matching regular expressions with backreferences is an NP-complete problem.  There are patterns that will make any backtracking regex engine take exponential time to finish.  (These usually involve nested quantifiers.)  If you have a performance critical app, you would be smart to test your patterns for speed, or profile your app to make sure you are not spending too much time thrashing around in the regex code.  You’ve been warned!

 

Also, see the section VC7 and Managed Code for some advice for compiling GRETA under VC7.

 

Notice to Users of Version 1.x

Many things have changed since version 1.x of the Regular Expression Template Archive.  If you have code which uses version 1.x, you will not be able to use version 2 without making changes to your code.  Sorry!  There were a number of unsafe, unintuitive interface features of version 1 that I felt were worth fixing for version 2.  If you need version 1, I have a copy and I’d be happy to give it to you.

 

Most notably, the regexpr object has gone away.  It was a subclass of std::string, with match() and substitute() methods, and it stored the results of the match/substitute internally.  Subclassing std::string is dangerous because std::string doesn’t have a virtual destructor.  Also, matching is conceptually a const operation, and it seemed wrong that it should change internal state.

 

The match/count/substitute methods have moved to the rpattern object.  The state that used to be stored in the regexpr object is now put in a match_results/subst_results container, which is passed as an out parameter to the match/substitute methods.

 

Also, the CSTRINGS flag has gone away.  It is no longer necessary to optimize a pattern for use with C-style NULL-terminated strings.  When you pass a C-style string to the rpattern::match method, the same optimization is used automatically.  (In early 2.X versions of the library, there was a basic_rpattern_c object for performing this optimization, but it is no longer necessary and has been deprecated.)

 

Another minor change involves the register_intrinsic_charset() method.  It used to be a part of rpattern’s interface, but it has moved to the syntax module.

 

Despite the sweeping interface changes, the majority of the back-end code is unchanged.  You should expect patterns that worked in version 1.x to continue to work in version 2.

The rpattern Object

The rpattern object contains the regular expression pattern against which to match.  It also exposes the match(), substitute(), and count() methods you will use to perform regular expression matches.  When you instantiate an rpattern object, the pattern is “compiled” into a structure that speeds up pattern matching.  Once compiled, you may reuse the same pattern for multiple match operations.

 

Here is how rpattern is declared:

 

template< typename CI,

          typename SY = perl_syntax<std::iterator_traits<CI>::value_type> >

class basic_rpattern {

};

typedef basic_rpattern<std::basic_string<TCHAR>::const_iterator> rpattern;

typedef basic_rpattern<TCHAR const *> rpattern_c;

 

The rpattern class is a template on iterator type.  It is also a template on the syntax module.  By default, the Perl syntax module is used, but you are free to write your own syntax and specify it as a template parameter.  See the section on the Syntax Module.

 

The following sections describe the methods available on the rpattern object.

rpattern::string_type

rpattern::string_type is a typedef that is used in many of the following function prototypes.  It is defined as follows:

 

typedef CI const_iterator;

typedef std::iterator_traits<const_iterator>::value_type char_type;

typedef std::basic_string<char_type> string_type;

 

The typedef is a little complicated, but its effect is what you would expect.  If the result of dereferencing a const_iterator is a char, then string_type is the same as std::string.  If dereferencing a const_iterator results in a wchar_t, then string_type is the same as std::wstring.

rpattern::rpattern

There are two constructors for instantiating an rpattern object.  Here are their prototypes:

 

rpattern::rpattern(

const string_type & pat,

REGEX_FLAGS flags=NOFLAGS,

REGEX_MODE mode=MODE_DEFAULT ); // throw(bad_alloc,bad_regexpr);

 

rpattern::rpattern(

const string_type & pat,

const string_type & sub,

REGEX_FLAGS flags=NOFLAGS,

REGEX_MODE mode=MODE_DEFAULT ); // throw(bad_alloc,bad_regexpr);

 

Both methods require you to specify a string containing the pattern to match.  Both methods allow you to specify some flags that customize your search and the mode with which the pattern should be matched (see Matching Modes).  The first method does not require you to specify a substitution string; the second one does.  If you do not specify a substitution string, it is assumed to be the null string.

 

Notice the (commented out) exception specification, “throw(bad_alloc, bad_regexpr),” on the constructors.  This means that the constructor of an rpattern object can throw an exception of type bad_regexpr or bad_alloc.  This can happen when the specified pattern contains illegal regular expression syntax, such as unbalanced parentheses.  It can also occur when the substitution string refers to a backref that does not exist (for instance, if the substitution string contains “$6”, and there are not 6 groups in the pattern). The bad_regexpr exception inherits from the std::invalid_argument exception.  The bad_regexpr::what() method returns a pointer to a C-style string that contains a description of the problem.

rpattern::match

The match method is used to find patterns in strings.  There are a couple of different versions of the match method.  The first takes a std::basic_string as a parameter.  The second version takes two iterators, the beginning and end of the string to match.  Finally, there is a version of the match() method that takes a pointer to a NULL-terminated string.

 

template< typename CH, typename TR, typename AL >

rpattern::backref_type rpattern::match(

const std::basic_string<CH,TR,AL> & str,

match_results & results,

    size_type pos = 0,            // offset to the start of the substring to match

size_type len = npos ) const; // length of the substring to match

 

rpattern::backref_type rpattern::match(

const_iterator ibegin,        // start of the string to match

const_iterator iend,          // one past end of string to match

match_results & results ) const;

 

rpattern::backref_type rpattern::match(

const char_type * sz,

match_results_c & results ) const;

 

Notice that the match() method that takes a std::basic_string is a template on character, character traits and allocator.  This is necessary given that basic_rpattern is a template only on iterator, but it gives us a little more flexibility then we really want.  If you accidentally pass in the wrong string type (say, a std::wstring when the match method is really expecting a std::string) you will get a compile-time error informing you of your mistake.

 

The match() method returns an object of type rpattern::backref_type.   If the match fails, an “empty backref” will be returned.  An empty backref contains “false” in the matched member.  If the match succeeds, the backref contains information about where the pattern matched the string, and contains “true” in the matched member.  The following code demonstrates the usage of the match() method:

 

const char * sz = “Here is a string to match.”;

rpattern_c pat( “str.ng” ); // matches “string”, “strong”, etc.

match_results_c results;

rpattern_c::backref_type br = pat.match( sz, results );

if( br.matched ) {

    printf( “Backref: %.*s”, br.second – br.first, br.first );

} else {

    printf( “No match found.” );

}

 

In the above code, br.first is a pointer to the first character in the string that matched the pattern and br.second is a pointer to the last+1 character that matched the pattern.  Therefore, the length of the matched pattern is br.second – br.first.  (You’ll notice that this value is used as the precision of the string in the printf() format statement.)  The above code will generate the following output: “Backref: string”. 

 

Notice the use of rpattern_c and match_results_c in the above example.  These are typedefs for basic_rpattern<const TCHAR *> and basic_match_results<const TCHAR *>, respectively.

rpattern::substitute

The substitute() method finds patterns and replaces the found substrings with insertion strings you specify.  It takes a std::string object as a parameter, and a subst_results container.  It also takes two optional integer parameters, which you can use to specify the offset of the first character to search and the length of the search string.  Here is the prototype:

 

template< typename CH, typename TR, typename AL >

size_t rpattern::substitute( std::basic_string<CH,TR,AL> & str,

                             subst_results & results,

                             size_type pos = 0,

                             size_type len = npos );

 

The return type is an unsigned integer representing the number of substitutions made.  If the pattern is not found, no substitutions are made and the return value is 0.  The following example finds all instances of \n not preceded by \r and replaces them with \r\n:

 

size_t insert_crlf( string & str ) {

static const rpattern s_crlf( “(?<!\r)\n”, “\r\n”, GLOBAL );

subst_results results;

    return s_crlf.substitute( str, results );

}

 

Later sections will describe the significance of “static const” and “GLOBAL” in the above example.  Note the use of the look-behind assertion, (?<!\r), in the example.  It evaluates to true when the preceding character is not a return character, but it doesn’t include the preceding character in the match.

 

When doing a global substitution, the next match operation begins where the last substitution left off.  Thus, if the search string contains the word “foo”, pattern is “o”, and the substitution is “oo”, the resulting string will be, “foooo”.

rpattern::count

Use the count() method when you want to know the number of times a given pattern matches a string, but you don’t care about where the matches occur or what they are exactly.  The count() method has three variants, which are analogous to the three match() methods.  Here’s one if its prototypes:

 

size_t rpattern::count(

const_iterator ibegin,        // start of the string to match

const_iterator iend ) const;  // one past end of string to match

 

Like the substitute() method, the next match begins where the previous one ended.  Thus, if you want to know how many times the pattern “oo” shows up in the string “fooooo”, the answer is 2, even though this pattern could conceivably match the string in 4 unique places.  Overlapping matches are not counted.

rpattern::split

Use the split() method when you have a string that you want to split into substrings, using a regular expression as a delimiter.  Here’s one if its prototypes:

 

size_t rpattern::split(

const_iterator ibegin,    // start of the string to match

const_iterator iend,      // one past end of string to match

split_results & results,

int limit = 0 ) const;

 

The split() function takes an optional “int limit” parameter which defaults to 0.  If it is greater than 0, it is the upper limit on the number of fields to split the string into.  If it is 0, then there is no limit, except that trailing empty fields are dropped.  If it is less that 0, then there is no limit, and empty trailing fields are kept.  As in perl, an empty leading field is always dropped.

 

If your regular expression has capturing groups, these back-references become fields in the split_results.

 

The split_results struct behaves like an STL container of std:strings.

rpattern::set_substitution

Sometimes you may want to set the substitution string after the rpattern object has been instantiated.  For this, you would use the set_substitution() method. 

 

void rpattern::set_substitution( const string_type & sub );

// throw(bad_alloc,bad_regexpr);

 

The set_substitution() method can be called as many times as you like.  Each time it is called, it replaces the old substitution string with the new one.  Notice that the set_substitution() method can also throw a bad_regexpr exception for ill-formed substitution strings.

rpattern::cgroups

This method returns the count of groups in the pattern string.  For a successfully parsed pattern, this number is always at least one, because the entire pattern is considered group number 0.

 

size_t rpattern::cgroups() const;

match_results, subst_results and split_results

The results of a match(), substitute(), or split() operation are stored in the match_results, subst_results, and split_results containers, respectively.  After a successful match(), substitute() or split(), they contain useful information that can be queried for using the following methods.  All the match_results methods are also available on subst_results.

match_results::cbackrefs

This method returns the count of the internally saved backrefs.  It takes no parameters.  Here is the prototype:

 

size_t match_results::cbackrefs() const;

 

After a successful match() operation, cbackrefs() will always return at least one.  That is because the part of the string that matched the entire pattern is always saved in backref number 0 (like $& in Perl).

match_results::backref

This method returns the requested backref.  It takes one parameter, which is the integer representing the backref to return.  The prototype for this method looks like:

 

const backref_type & match_results::backref( size_t cbackref ) const;

 

As in Perl, backreferences are created as a side effect of grouping.  The substring that matched the entire pattern is saved in backref number 0; the substring that matched the pattern enclosed in the first set of parentheses is backref number 1, the second set of parentheses is backref number 2, and so on.  Sets of parentheses are numbered by counting opening parentheses from left to right. 

 

If cbackref is greater than the number of saved backrefs, match_results::backref() throws a std::out_of_range exception.

match_results::rstart

The rstart() method returns the offset to the first character in the specified backref.  Here is the prototype:

 

size_t match_results::rstart( size_t cbackref = 0 ) const;

 

After a global substitution operation, rstart() returns the offset to the point in the string that would be the start of the backref if the string had not been modified.  That’s because it is really an index into the backref string. See subst_results::backref_str below.

 

If cbackref is greater than the number of saved backrefs, match_results::rstart() throws a std::out_of_range exception. 

match_results::rlength

The rlength() method returns the length of the specified backref.  Here is the prototype:

 

size_t match_results::rlength( size_t cbackref = 0 ) const;

 

If cbackref is greater than the number of saved backrefs, match_results::rlength() throws a std::out_of_range exception.

match_results::all_backrefs

This method returns a const reference to the internal vector of backrefs.  The return value is const to prevent writing to this vector.   It is intended for read-only.

 

const match_results::backref_vector & match_results::all_backrefs() const;

 

subst_results::backref_str

This method is only defined on the subst_results container.  It returns a const reference to the string to which the backreferences point.  Here is the prototype:

 

const string_type & subst_results::backref_str() const;

 

As I mentioned earlier, backrefs are really just a pair of iterators into a string.  But a substitution modifies the string, and the backreferences need to refer to the string as it was before it was modified.  For this to work, a copy of the string must be made before it is changed.  The backreferences are pointers into the copy of the string.  The backref_str() method returns a const reference to the copy.

 

Note: this implies that doing substitutions on really big strings is costly, because the regex engine must first make a copy of the string before performing the substitution.  Often, you don’t need backreferences after a substitution, and a copy does not need to be made.  There is a flag to turn off this expensive feature for these cases.  This is discussed in the section entitled NOBACKREFS.

split_results::strings

Use the split_results::strings() method to retrieve a read/write STL container of the strings representing the results of a split() operation.

The Syntax Module

The regular expression syntax supported by default is the syntax of Perl 5.6.  The syntax is defined in a separate module and is incorporated into the rpattern class by means of a template parameter.  You can define your own syntax module if you like.  Your module must implement the following interface:

 

template< typename CH >

class my_syntax {

 

public:

 

    typedef CH char_type;

    typedef std::basic_string<CH>::iterator iterator;

    typedef std::basic_string<CH>::const_iterator const_iterator;

 

my_syntax( REGEX_FLAGS flags );

 

REGEX_FLAGS get_flags() const;

 

void set_flags( REGEX_FLAGS flags );

 

    TOKEN reg_token( iterator & icur,

                     const_iterator iend );

 

    TOKEN quant_token( iterator & icur,

                       const_iterator iend );

 

    TOKEN charset_token( iterator & icur,

                         const_iterator iend );

 

    TOKEN subst_token( iterator & icur,

                       const_iterator iend );

 

    TOKEN ext_token( iterator & icur,

                     const_iterator iend );

 

void register_intrinsic_charset(

    char_type ch,

    const std::basic_string<CH> & str ); // throw(bad_regexpr,std::bad_alloc);

};

 

The TOKEN type is an enumeration, defined as follows:

 

enum TOKEN {

    // Can be returned by any syntax method

    NO_TOKEN = 0,

 

    // Token that can be returned by the reg_token method

    BEGIN_GROUP,

    END_GROUP,

    ALTERNATION,

    BEGIN_LINE,

    END_LINE,

    BEGIN_CHARSET,

    MATCH_ANY,

    ESCAPE,

 

    // Tokens that can be returned by the quant_token method

    ONE_OR_MORE,

    ZERO_OR_MORE,

    ZERO_OR_ONE,

    ONE_OR_MORE_MIN,

    ZERO_OR_MORE_MIN,

    ZERO_OR_ONE_MIN,

    BEGIN_RANGE,

    RANGE_SEPARATOR,

    END_RANGE,

    END_RANGE_MIN,

 

    // These may also be returned by the reg_token method

    ESC_DIGIT,

    ESC_NOT_DIGIT,

    ESC_SPACE,

    ESC_NOT_SPACE,

    ESC_WORD,

    ESC_NOT_WORD,

    ESC_BEGIN_STRING,

    ESC_END_STRING,

    ESC_END_STRING_z,

    ESC_WORD_BOUNDARY,

    ESC_NOT_WORD_BOUNDARY,

    ESC_WORD_START,

    ESC_WORD_STOP,

    ESC_QUOTE_META_ON,

    ESC_QUOTE_META_OFF,

 

    // Tokens that can be returned by the subst_token method

    SUBST_BACKREF,

    SUBST_PREMATCH,

    SUBST_POSTMATCH,

    SUBST_MATCH,

    SUBST_ESCAPE,

    SUBST_QUOTE_META_ON,

    SUBST_UPPER_ON,

    SUBST_UPPER_NEXT,

    SUBST_LOWER_ON,

    SUBST_LOWER_NEXT,

    SUBST_ALL_OFF,

 

    // Tokens that can be returned by the charset_token method

    CHARSET_NEGATE,

    CHARSET_ESCAPE,

    CHARSET_RANGE,

    CHARSET_BACKSPACE,

    CHARSET_END,

    CHARSET_ALNUM,

    CHARSET_NOT_ALNUM,

    CHARSET_ALPHA,

    CHARSET_NOT_ALPHA,

    CHARSET_BLANK,

    CHARSET_NOT_BLANK,

    CHARSET_CNTRL,

    CHARSET_ NOT_CNTRL,

    CHARSET_DIGIT,

    CHARSET_ NOT_DIGIT,

    CHARSET_GRAPH,

    CHARSET_NOT_GRAPH,

    CHARSET_LOWER,

    CHARSET_NOT_LOWER,

    CHARSET_PRINT,

    CHARSET_NOT_PRINT,

    CHARSET_PUNCT,

    CHARSET_NOT_PUNCT,

    CHARSET_SPACE,

    CHARSET_NOT_SPACE,

    CHARSET_UPPER,

    CHARSET_NOT_UPPER,

    CHARSET_XDIGIT,

    CHARSET_NOT_XDIGIT,

 

    // Tokens that can be returned by the ext_token method

    EXT_NOBACKREF,

    EXT_POS_LOOKAHEAD,

    EXT_NEG_LOOKAHEAD,

    EXT_POS_LOOKBEHIND,

    EXT_NEG_LOOKBEHIND,

    EXT_INDEPENDENT,

    EXT_UNKNOWN

};

 

If any syntax method returns a token other than NO_TOKEN, the character pointer (icur) must be advanced past the token.  icur must not be advanced past the end of the string, iend.  If NO_TOKEN is returned, icur should not be modified, except to move past any characters that should be ignored.

 

Once you have defined your syntax module, you need to instantiate an rpattern template that uses it.  See the section on Template Instantiation for help.

register_intrinsic_charset

The intrinsic character sets are pretty handy.  For instance, “\s” matches the same thing as “[\t\f\r\n ]”, and “\w” is (roughly) a shorthand for “[a-zA-Z_0-9]”.  With the register_intrinsic_charset() method, you can create your own.  Here’s the prototype:

 

static void my_syntax::register_intrinsic_charset(

    char_type ch, const std::basic_string<char_type> & str );

    // throw(bad_regexpr,std::bad_alloc);

 

You would use it as follows:

 

my_syntax<char>::register_intrinsic_charset( ‘p’, “[aeiouAEIOU]” );

 

This maps “\p” to “[aeiouAEIOU]”.  In implementation, the character set is “compiled” when you call register_intrinsic_charset(), and when “\p” appears in any new patterns, this compiled form gets linked into your pattern.  This speeds up pattern compilation, and reduces memory usage, since the compiled character set is shared across all your patterns.  Note that with the above call to register_intrinsic_charset(), no mapping of “\P” to “[^aeiouAEIOU]” is created.  If you want that mapping, you need to create it yourself with another call to register_intrinsic_charset().

 

The case-sensitivity of a user-defined intrinsic character set is taken from the context in which it is used.  For instance, if you define “\y” to be “[a-z]”, then in a case-sensitive pattern, it matches only a-z, but in a case-insensitive pattern, it matches a-z and A-Z.

 

If you redefine an intrinsic character set, any compiled patterns that refer to that character set are now invalid.  You must reinitialize them.

 

Customizing Your Search

It is possible to customize your searches by specifying some flags in the rpattern constructor.  Here is an example:

 

// Find all instances of ‘foo’, replace with ‘bar’, ignore case

rpattern pat( “foo”, “bar”, NOCASE | GLOBAL );

 

NOCASE

Do a case-insensitive search.  Searches are by default case-sensitive.

GLOBAL

This flag causes the rpattern::substitute() method to replace all substrings that match the pattern.  By default, only the first substring that matches the pattern is replaced.  When used with the rpattern::match() method, it causes all substrings that match the pattern to be found.  See the section on NOBACKREFS, FIRSTBACKREFS and ALLBACKREFS to find out how to control what information gets saved in the backref vector.  This flag is ignored by the rpattern::count() method.

MULTILINE

By default, ‘^’ matches only the beginning of the string, and ‘$’ matches only the end of the string (or at the newline preceding the end of the string).  When you use the MULTILINE flag, ‘^’ will also match the beginning of a line (immediately after a newline character) and ‘$’ will also match before the end of the line (immediately before a carriage return or newline character).  The assertions ‘\A’ and ‘\Z’ are used to match only the beginning and end of the string respectively, regardless of whether the MULTILINE flag has been specified.

SINGLELINE

By default, ‘.’ matches any character besides the newline character.  When you specify SINGLELINE, ‘.’ will match the newline character, also.  This can be used together with the MULTILINE option.  The two only sound mutually exclusive.

EXTENDED

This is the equivalent of the /x pattern modifier in Perl.  When you specify EXTENDED, any white space in you pattern that is not escaped or in a character class is ignored.  In addition, any text following a # up to and including a newline character is considered a comment and is ignored.  The idea is to make patterns more readable.

RIGHTMOST

Find the rightmost, longest match.  The default is to find the leftmost, longest match.

NOBACKREFS

This flag is used to optimize substitution operations.  By default, backreferences are saved during substitution operations.  To make the backreferences available even after the substitution operation has changed the string, it is necessary to save a copy of the string before modifying it.  This can be an expensive operation, particularly if the string is very long.  Often, there is no need to know the backreferences after a substitution operation completes, in which case the copy operation was wasteful.  The NOBACKREFS flag is a hint to the regular expression engine that you will not be using the backreferences after the substitution operation completes.  This eliminates the need to make a copy of the string.

 

If your substitution string contains backreferences, such as ‘$2’, then a copy of the string needs to be made anyway, and the NOBACKREFS flag is ignored.

 

This flags is ignored by the match() and count() methods.

ALLBACKREFS

By default, when doing a global match or substitute operation, the backref vector only contains the backrefs from the last successful match.  Sometimes, it is helpful to know the backref information for all the matches.  If you specify ALLBACKREFS, then the backref information for each successive match will be appended to the backref vector, instead of replacing it.

FIRSTBACKREFS

This flag is like ALLBACKREFS, except that only information from the complete match (backref 0) is saved for each successive match.

NORMALIZE

If the NORMALIZE flag is specified, the regular expression parser performs the following normalizations:

 

The string “\\n” is turned into the newline character, ‘\n’.

The string “\\r” is turned into the return character, ‘\r’.

The string “\\t” is turned into the tab character, ‘\t’.

The string “\\f” is turned into the form-feed character, ‘\f’.

The string “\\v” is turned into the vertical-feed character, ‘\v’.

The string “\\a” is turned into the bell character, ‘\a’.

The string “\\\\” is turned into the backslash character, ‘\\’.

 

This can be especially useful if you are accepting the pattern and substitution strings from a user as input.

Matching Modes

By default, the pattern matching algorithm is a depth-first, recursive search.  It is fast and effective, but it has a limitation: the size of your process’s stack.  If you run out of stack space during a match operation, one of two things will happen.  If you are on a Microsoft platform, the exception is handled and execution continues as if the pattern failed to match (which may or may not be the right answer).  If you are not on a Microsoft platform, the exception cannot be handled, and your process will dump core and die.  It’s not very nice to have a process die when given legal input, so GRETA gives you the option to perform matches in a completely safe manner, using an iterative algorithm that is stack-conservative.  The different matching-modes are described below.  You can specify these modes to the rpattern constructor.

MODE_FAST

On Microsoft platforms, this is the default mode.  It uses the recursive algorithm.  As the name suggests, it is fast, but it can tear through stack space pretty quick for certain types of patterns.  The worst offender is a group quantifier.  This pattern eats lots of stack: “(.)+” whereas this one does not: “.+”.  You should use fast mode when your patterns do not quantify groups or use the (?R) extension (see Recursive Patterns), or when you are not accepting the string and the pattern from users.

MODE_SAFE

This mode uses the slower, safer iterative algorithm.  It will not overflow the stack.  Use this mode when you cannot make assumptions about the pattern and the string (eg., you are accepting them as input from users).  Also, you can use this mode when your stack space is small, or if you are in a DLL and you cannot make assumptions about the size of your process’s stack space. 

 

You should be aware that safe mode will occasionally need to make heap allocations, which could fail in low-memory situations and generate an exception.  You should wrap calls to match(), count() and substitute() in a try/catch block and handle any std::bad_alloc exceptions that may get thrown.

MODE_MIXED

This is the default mode on non-Microsoft platforms.  When operating in mixed mode, GRETA uses a simple heuristic to determine on a per-pattern basis which algorithm is the best to use.  Patterns that contain quantified groups, like “(.)+”, are matched using safe mode, as are patterns that use the (?R) extension.  All other patterns are matched using the fast mode.

Known Issues and Perl Incompatibilities

There are a few things my code does a bit differently than Perl.  These are things that could trip you up, and you should be aware of them.

Embedded Code in a Regular Expression

Perl lets you put Perl code directly into a regex with the (?{ code }) extension.  You can’t embed Perl code in a GRETA regular expression.  Duh!  But see Recursive Patterns for a useful hack that might get you what you’re looking for.

Pattern Modifier Scope

In the pattern “a(?i)b”, (?i) turns on case-insensitive pattern matching.  In both Perl and GRETA, pattern modifiers such as (?i) have the scope of the nearest enclosing group.  Outside of that group, they have no effect.  However, in GRETA, the scope of the pattern modifier extends only to the right, whereas in Perl it extends both to the right and left.  For example, in Perl, the above example would match the string “AB”, but in GRETA it would not because the “a” appears to the left of the (?i) modifier.

Comment Blocks Before Quantifiers

Perl lets you write patterns like /X(?#Match X 3 times){3}/, where the bit in (?#...) is a comment, and {3} is a quantifier that applies to X.  (I don’t know who would write such a pattern, but Perl programmers are a pretty sick bunch.)  In GRETA, comment groups logically separate quantifiers from the things quantified, so this will not match XXX like it does in Perl.

Variable Width Look-Behind Assertions

Perl doesn’t allow variable-width look-behind assertions, probably because it can be really expensive.  GRETA allows it, and lets you shoot yourself if you’re not careful.  For variable-width look-behind assertions, every evaluation of the look-behind is like a rightmost pattern match unto itself, complete with backtracking.  This can result in very severe exponential behavior, especially if you’re using your pattern to match very long strings.  I have done my best to protect you, though.  GRETA does static width analysis of your look-behind assertions.  This analysis is fairly robust; for example, it recognizes that the pattern “(\d{3}-){2}\d{4}” must match exactly 12 characters (a hyphen-separated telephone number, for instance).  If the static analysis reveals that the look-behind assertion must match between N and M characters, it only tries matching in those places where a match is possible.  In the previous example, the pattern matcher would back up exactly 12 characters and look for a phone number.  If it fails, it gives up without looking anyplace else.

 

By the way, you are allowed to create back-references in your positive look-behind assertions and use them later in your pattern and access then after a successful match.

Recursive Patterns

Perl lets you embed code directly into your patters with the (?{ code }) extension, which relies on the Perl interpreter to interpolate the code and execute it at pattern match time.  One very nice application of this is to write recursive patterns for matching things like balanced, nested parenthesis, XML tags, etc.  Obviously, I can’t call out to an interpreter in C++.  But I have included experimental support for recursive patterns with the (?R) extension.  For example, consider the following pattern (assume EXTENDED is used so white space is ignored):

\( ( (?>[^()]+) | (?R) )* \)

 

First, you match an opening paren.  Then you match a bunch of stuff that is not an opening or closing paren.  Then, if you can match a closing paren you are done.  Otherwise, you recurse by matching an opening paren again, followed by a bunch of stuff that is not a paren, etc, etc.  You finish when you find a matching paren for the first one you found.

 

This extension was inspired by PCRE (Perl-compatible regular expressions) at http://www.pcre.org

Compile-Time Switches

Here are some #define’s to give you control of some aspects of GRETA’s behavior.

REGEX_WIDE_AND_NARROW

Define REGEX_WIDE_AND_NARROW if you plan to match wide and narrow strings in the same executable.  This forces the instantiation of both the wide and narrow versions of the pattern matching routines.  Note that this will make your executable bigger.

REGEX_POSIX

Define REGEX_POSIX if you want to use the POSIX syntax module.  It causes the appropriate templates to get instantiated.

REGEX_NO_PERL

Define REGEX_NO_PERL if you are not planning on using the Perl syntax module.  It prevents the Perl templates from getting instantiated.  This saves some space.

REGEX_DEBUG

In debug mode, the GRETA performs sanity checks at run-time.  Among other things, it turns off the arena allocator and turns on run-time type checking in the type-unsafe parts of the code.  This makes the code run very slow, but it’s a good idea to try your patterns a few times with REGEX_DEBUG on (-DREGEX_DEBUG=1), just to make sure everything is kosher.  If you define any of: DEBUG, _DEBUG or DBG, then REGEX_DEBUG gets turned on automatically.  To override this, explicitly turn it off with –DREGEX_DEBUG=0.

REGEX_DEBUG_HEAP

When parsing a pattern, GRETA does a number of allocations.  To speed up the parsing phase, the allocations come from an arena.  This is a significant performance enhancement, but it can interfere with debug tools like AppVerifier and PageHeap which track allocations.  By turning REGEX_DEBUG_HEAP on (-DREGEX_DEBUG_HEAP=1), you can effectively turn off arena allocation.  By default, REGEX_DEBUG_HEAP is turned on when REGEX_DEBUG is on.

REGEX_STACK_ALIGNMENT

Anyone who has written a low-level memory routine and tried to make it portable has run into alignment problems.  I hit one such problem when writing GRETA’s stack class.  It constructs objects of unknown type in place from a block of pre-allocated memory, and it tries to be clever about alignment.  But it’s not perfect.  (Blame the C++ standardization committee or Bjarne Stroustrup, but don’t blame me.)  If you’re getting a compiler error in restack.h on a line that looks like this:

 

static_assert< (ALIGNMENT >= alignof<T>::value) > const align_test;

 

then you have hit an alignment problem.  Relax, it is easy to fix.  Just compile with: –DREGEX_STACK_ALIGNMENT=8.  If that doesn’t work, try an alignment of 16 or 32.

REGEX_FOLD_INSTANTIATIONS

On many implementations of the STL, string::iterator is not a typedef for char*. Rather, it is a wrapper class. As a result, the regex code gets instantiated twice, once for bare pointers (rpattern_c) and once for the wrapped pointers (rpattern). But if there is a conversion from the bare pointer to the wrapped pointer, then we only need to instantiate the template for the wrapped pointers, and the code will work for the bare pointers, too. This can be a significant space savings.  The REGEX_FOLD_INSTANTIONS macro controls this optimization. The default is "off" for backwards compatibility. To turn the optimization on, compile with: -DREGEX_FOLD_INSTANTIATIONS=1.  When instantiation-folding is enabled, rpattern_c and match_results_c are typedefs for rpattern and match_results, and attempts to use basic_rpattern<const TCHAR *> directly will result in linker errors.

REGEX_TO_INSTANTIATE

To allow separate compilation of the basic_rpattern template, I instantiate it on the iterator types that I think would be most useful. By default, I instantiate basic_rpattern on std::basic_string<TCHAR>::const_iterator and const TCHAR*.  Your requirements may be different, however.  If you would like to create additional instantiations, or if you don’t need one of the standard instantiations, you should use the REGEX_TO_INSTANTIATE macro.  Just set it equal to a comma-separated list of iterator types for which you would like to create instantiations.  For instance, if you need to match std::wstring and char*, then you could –DREGEX_TO_INSTANTIATE=std::wstring::const_iterator,char*.  With this macro definition, you will be able to use basic_rpattern<std::wstring:: const_iterator> and basic_rpattern<char*> in your code.  As a side-note, if you use a non-const iterator type with basic_rpattern (like char*), then the resulting backreferences will also be non-const.  This gives you the nice benefit of being able to modify the underlying string by writing through your backreferences.

Miscellaneous

Here are a couple of random things to be aware of.

Static Const Patterns

Make your patterns “static const” if you can.  That way, they are compiled once when the thread of execution passes over their declarations, not every time.  Consider the two following functions:

 

void good_function( string & str ) {

    static const rpattern static_pattern( “your pattern here” );

    static_pattern.substitute( str );

}

 

void bad_function( string & str ) {

    rpattern auto_pattern( “your pattern here” );

    auto_pattern.substitute( str );

}

 

In good_function(), the static_pattern object is initialized once the first time good_function() is called.  Thereafter, good_function() can be called as many times as you like, and you won’t incur the penalty of recompiling the pattern.  The static_pattern object is cleaned up only when the program terminates.  In bad_function(), the auto_pattern object is initialized every time bad_function() is called, and it is cleaned up every time bad_function() returns.  This is unnecessary overhead since the pattern you are looking for is not changing each time.  But read the Thread-safety section below for one small caveat.

Thread-safety

A const rpattern object is completely thread-safe.  For instance, if you declare an rpattern object to be “static const”, you can use it in multiple threads to perform simultaneous match() or substitute() operations. 

 

However, VC does no synchronization while constructing function-local, static objects.  (Sad, but true.)  It’s possible for two threads to try to initialize a static const pattern simultaneously.  But fear not! I’ve provided a work-around.  You can use the STATIC_RPATTERN macro to declare all your static const patterns.  If _MT is defined, then STATIC_RPATTERN will guard all your rpattern constructions with a critical section.

Stack Usage

By default, GRETA uses recursion in a depth-first search for matches.  If you are matching against a very long string (several thousands of characters long), then there is a chance you could overflow the stack.  If you use the VC compiler on a Win32 platform, then GRETA traps the stack overflow exception and resets the stack guard page so that any additional stack overflows that may occur can also be handled.  In that case, execution will continue as if the pattern had failed to match.

 

An ounce of prevention goes a long way, however.  There are some simple things you can do to prevent stack overflows.  The first is to not match against strings that are 1000’s of characters long.  J  Next, you can avoid constructs that require deep recursion.  Quantifying a group is one such construct.  So for instance, this pattern recurses deeply: “(.)+”, whereas this one does not: “.+”.  Also, you should compile GRETA with full optimization, making sure to turn on inline expansion (/Ob1 or /Ob2).  This drastically reduces GRETA’s stack usage.  Finally, you can use the /STACK switch to the linker to increase your stack reserve.  By default, the VC linker gives executables 1 Mb of stack reserve.  By contrast, the version of perl.exe that I have (from ActiveState) has a whopping 16 Mb of stack reserve.  For most situations, 1 Mb is plenty, but if you allow deeply recursive patterns to match long strings, then you might want to bump your stack reserve up to 2 or 4 Mb.  With 4 Mb of stack, you can match a deeply recursive pattern against a string of 35,000 characters without overflowing your stack.

 

Finally, if you are still concerned about stack overflows, you can use the “safe” mode to match patterns.  When in safe mode, pattern matching is done using an iterative algorithm, eliminating the possibility of stack overflows.  Safe mode is slower, however.  See Matching Modes for more information.

DBCS

The regular expression engine is not DBCS-aware.  You can use only fixed-width character sets (that is, ANSI and UNICODE).  Sorry.

STL

This regular expression package makes heavy use of STL (The Standard Template Library).  STL is available in VC5 and higher.  If you are not currently using STL in your projects (why not?!) you may have to make some modifications to your build settings to use this package.  If you are using the NT build environment, you should add the line “USE_STL=1” to your sources file.  If you are building in VC, you may need to explicitly link to one of the STL libraries (msvcprt.lib or libcp.lib, or their debug equivalents).

VC7 and Managed Code

If you are compiling GRETA under VC7, and you care about speed, you should not compile the code with managed extensions turned on.  I have found that the performance of the regex engine degrades badly with managed code.  You should compile GRETA as unmanaged code into a separate library, and then link the lib to your managed app.

Template Instantiation

I’ve needed to instantiate the basic_rpattern classes I thought would be most useful.  By default, only the classes necessary to perform Perl-style pattern matches on strings of type TCHAR are instantiated.  If you would like to do pattern matching on both char and wchar_t strings, then compile with REGEX_WIDE_AND_NARROW defined.  Note that this will make your code twice as big.  See the section REGEX_TO_INSTANTIATE to find out how to have finer control over the template instantiation. 

 

By default, the Perl syntax module is compiled, and the POSIX syntax is not.  If you want to use the POSIX syntax, then define REGEX_POSIX.  If you don’t plan on using Perl syntax, you can turn off Perl template instantiation by defining REGEX_NO_PERL.

 

If you have defined your own syntax module that you want to use with the rpattern class, then you need to do something fancier.  You can set the REGEX_TO_INSTANTIATE macro to null (-DREGEX_TO_INSTANTIATE=) and provide your own explicit instantiations at the bottom of regexpr2.cpp, using the template parameters of your choosing.  Or you can #include “regexpr2.cpp” and use implicit instantiation.  If you do this, you should also set REGEX_TO_INSTANTIATE to null.  But be aware of how many template instantiations you have, because each additional instantiation will greatly increase the size of your compiled code.

Contact Information

You should send bug reports and feature requests to me, Eric Niebler.  I’ll get to them as soon as I can.


Appendix 1: History

7/18/03

Folded VC6 code into main code-base for easier maintenance.
Minor code clean-up.

7/10/03

Fixed off-by-one in lookbehind assertions that could result in an out-of-range iterator access.
Use _isctype on VC6.
Miscellaneous code cleanup.

6/09/03

Patterns that begin with a string literal now use the Boyer-Moore algorithm to limit the search space, resulting in huge perf wins for some patterns.

6/03/03

GRETA now builds on VC6 again <sigh>.
Added 1-char look-ahead to groups and alternates, greatly speeding up some patterns.

5/10/03

Modified GRETA so it builds in the NT / CoReXT build environment (that is, with the VC6 STL headers).

5/5/03

Added split().
10% perf win across the board.
Fixed bug with VC8’s debug iterators.
Reduced memory footprint of charsets by 20%.
Made match_results/subst_results/split_results template on allocator.
Miscellaneous code clean-up.

4/29/03

Make GRETA work with debug iterators.
basic_rpattern::swap wasn’t swapping m_invisible_groups; now it does.

2/2/03

Use pseudo-vtables in hetero_stack to increase exception safety when pushing/popping objects with non-trivial destructors.
Use Instantiator to move more implementation out of the headers.

11/3/02

Use Instantiator idiom for separate compilation of templates.
Minor performance tweaks.
Miscellaneous code clean-up.

8/11/02

Remove implementation-dependent code that relied on STL containters accepting allocators with per-instance state.  (Hence, the REGEX_NO_ALLOCATOR macro is no longer needed.)
Use of singly-linked lists reduces memory footprint of compiled patterns.
Miscellaneous code clean-up, resulting in almost 200 fewer lines of code.

6/11/02

Move implementation details into internal namespace “detail".
Use REGEX_CHAR macro for more correct handling of character literals.
Miscellaneous code clean-up.

5/30/02

Fixed bad interaction between NORMALIZE and EXTENDED flags that was causing “\n” escape sequences to be ignored as white-space.
Fix linking problem when compiling with gcc 3.1 and –O3.
Included ostream output operator for backrefs in namespace regex.
Minor performance tweaks.

5/22/02

Fixed some compile problems.
Add REGEX_TO_INSTANTIATE macro, which controls exactly which instantiations of basic_rpattern you want to create.

5/17/02

rpattern can now be instantiated with non-const iterators.
Added typename in all places where the standard requires it.

5/16/02

rpattern::set_flags() method has gone away.
Bring back implicit conversion of backref to bool using “safe bool” idiom.
rpattern object now has copy constructor, assignment operator and non-throwing swap(), all of which are exception-safe.
init() and set_substitution() are now exception-safe.
hetero_stack now has much faster runtime type checking.

5/02/02

bad_regexpr now inherits from std::invalid_argument, as it should.
Fix alignof struct to not use offsetof().
Fix potential alignment fault in hetero_stack.

4/24/02

Fix NOBACKREFS flag, which has been broken since 3/15. Many thanks to Michael Nelte for the bug report.

4/22/02

Make heterogeneous stack exception safe and more generally useful.

4/10/02

Fix linker error caused by 4/8 change.
Compiles and runs on VC7.1.

4/8/02

Safer alignment handling in unsafe_stack.
Relaxed the iterator requirements to the match/count/substitute methods. Any iterator will now be accepted as long as it is convertible to the rpattern’s const_iterator type.
Implemented template instantiation folding to save space in the executable image (disabled by default for backward compatibility). See REGEX_FOLD_INSTANTIATIONS for details.

3/24/02

Fix potential crashing bug in unsafe_stack. Only safe mode is affected.

3/15/02

Retarget the code for VC7.
basic_rpattern_c and basic_match_results_c are no longer needed, and are marked deprecated.
The implicit conversion from a backref to a bool is deprecated.
subst_results now inherits directly from match_results.

3/6/02

Faster implementation of unsafe_stack resulting in an overall 10% perf improvement in “safe” mode.

1/28/02

Some fixes for gcc 3.0.3.

1/25/02

20% performance improvement in character set matching.

1/23/02

Fix VC7 build problem with the forward declaration of _resetstkoflw.  Thanks to Steve West for the suggested fix.

1/17/02

Various portability fixes. Compiles with gcc-2.96 with STLPort.

12/5/01

Compiles cleanly on VC6 with /W4 and on VC7 with /W4 and /Wp64.

11/29/01

GLOBAL match/substitute operations with patterns that match zero-length substrings are now handled as they are in Perl.
Performance work on safe mode.

11/26/01

Finally, a fix for the infamous Repeated Pattern Matching Zero-Length Substring bug.

11/16/01

Big perf improvement for quantified literals and character sets.
Add “safe mode,” a stack-conservative, iterative (slow) way to match patterns.

10/30/01

Statically initialize the Perl syntax look-up tables to prevent global rpattern objects from using the tables before they are initialized.
Do away with pointless exception specifications.

10/5/01

Use group extent information to make lookahead and lookbehind assertions and independent subexpressions more efficient in both time and (stack) space.

9/27/01

Fix a bug when matching a zero-width pattern (e.g. /^\s*$/) to an empty std::string.

9/20/01

Add nathann’s debug heap support.
Trap EXCEPTION_STACK_OVERFLOW and reset the guard page.

9/12/01

Fix AV when trying to match against an uninitialized std::string.
Fix compile problem on IA64.
Fix alignment fault on IA64 (thanks to nathann for debugging this).
Backreferences to nonexistent groups now generates an error.
Code clean up / small reduction in stack usage.

9/5/01

Add conditional subexpressions a-la Perl.
Fix potential memory leak when parsing invalid character sets with REGEX_NO_ALLOCATOR defined.

8/31/01

Fix backtracking bug in non-greedy group quantifiers.
Null character is allowed as lower bound of range in charset.
Major improvements in Perl compliance:
- Add (?#...) comment group.
- Add Perl /x modifier for non-significant white space in patterns.
- Add Perl escape sequences \e, \777, \xFF, \cC.
- backrefs in patterns can refer to enclosing groups
Add experimental (?R) extension for recursive patterns.

8/21/01

Fix bug in nested POSIX-style character sets.

8/20/01

Use small-object allocator to speed up pattern compilation.

8/15/01

Complete overhaul of the interface.
Remove regexpr class; add match_results/subst_results.
Intrinsic charsets (\w, \d, \s) respect locale settings.
CSTRINGS flag functionally replaced by rpattern_c object.
Custom intrinsic charsets are managed by syntax module.
Compiles and runs on Linux (STLport-4.0, gcc 2.96).

7/31/01

Big performance improvements.
Fix for STATIC_RPATTERN macro.

7/9/01

Add negative posix charsets like [:^alpha:].
Make posix charsets play nice with user-defined intrinsic charsets.

4/17/01

Remove vertical tab character (‘\v’) from \s intrinsic character set.

3/10/01

Explicitly zero Perl syntax look-up tables.

1/31/01

Fix assert when parsing certain invalid patterns.

1/25/01

Fix AV when reinitializing an rpattern.

1/5/01

Remove platform-dependent assumption that string iterators are char*.

10/13/00

Compiles and runs with VC7.
Add new Perl \z assertion.

8/14/00

Use pattern width analysis to optimize matching.

8/10/00

Add lookbehind assertions, which can be arbitrarily complicated (Perl only allows fixed-width lookbehinds).
Allow posix-style nested charsets, like [:alpha:] (Perl 5.6 negative posix charsets like [:^alpha:] not supported).

4/18/00

Include basic POSIX syntax module.
Don't allow uninitialized backrefs to match the empty string.
] doesn't close a character class if it is the first character in the class.

3/27/00

Fix leak when parsing invalid substitution strings.

3/22/00

Support quotemeta (\Q and \E) in pattern and substitution strings.
Support upper- and lower-case escape sequences (\L \l \U \u) in substitution strings.

2/9/00

regexpr is now template on character type, traits and allocator!
rpattern is also template on syntax module.
Fixed bug with ^ when doing global multiline match.
Fixed problem with $' when doing global substitution.
Made $ more Perl-like.

9/20/99

Fixed bug in Unicode charset matching.
Performance improvements.
Allow user-defined intrinsic character sets.

6/23/99

Expanded the global match and substitute functionality
Improved performance of character set matching
Open source'd it!

6/4/99

Just fixed a memory leak! Get the latest version.

4/1/99

Submitted to http://toolbox/

 


Appendix 2: Implementation Details

So, you want to know how GRETA works, huh?  Well, it is a hybrid of a standard state machine and the Interpreter design pattern, as described in Design Patterns, by Gamma, et al.  A state machine has states and transitions in a directed, possibly cyclic, graph.  It also has a clear flow from the initial state to the terminal state.  It is generally implemented with a p-code interpreter or with jump tables.  The Interpreter pattern, on the other hand, doesn't deal with states -- it deals with grammar rules.  Each rule in your language (e.g. regular expressions) gets its own polymorphic class, and you parse sentences in your language into trees of these polymorphic nodes according to your grammar rules.  Recognizing sentences in your language (e.g., matching a pattern) happens by recursively traversing this tree.

 

GRETA is somewhere in between these two.  It is basically a state machine, but the states are polymorphic nodes in a graph, and transitions are virtual function calls between nodes, much like Interpreter.  There is one start node, and (essentially) one end node.  In between, there are branches and loops.  The goal is to find a way to the end state using only legal transitions.  Each node either matches or it doesn't.  If it doesn't, it returns false.  If it does, it passes execution to the next node in the graph recursively.  If there are branches, it picks one, and if its first guess fails to match (returns false), it tries the next one.  The entire execution state during pattern matching is maintained in local variables on the stack.  Since there is no dynamic allocation, it is very fast.

 

This approach can actually be as fast/faster that a traditional p-code state machine.  With p-code, to move between states, you have an op-code and you execute a big switch on your op-code.  Big switches are inefficient -- they introduce branches and blow locality out of the water.  Compare that to a virtual function call.  A couple of indirections and you're executing the code you want to be executing, with no branches.  This is similar to the jump table school of state machines.  But jump tables are hard to write, hard to understand and hard to maintain.  With GRETA, the v-tables are the jump tables, and I let the compiler worry about writing them and maintaining them.

 

Finally, I use a lot of template tricks to choose branches at compile time and to turn virtual function calls into inline function calls.  (Try that with a jump table!)  This speeds things up a lot, at the expense of some code bloat.  I have an article about this trick in the January 2003 issue of the C/C++ User's Journal.  You can read online it here: http://www.cuj.com/articles/2003/0301/0301d/0301d.htm?topic=articles.