std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm
Regular expressions are used by many familiar Unix utilities, including egrep, grep, lex, and sed 1 , and are supported directly in many programming languages including awk 1 , perl 2 , C 3 , and ECMAScript 4 also known as JavaScript, and indeed are often used ubiquitously wherever text processing occurs. Until recently most regular expression libraries were implemented in C, no doubt helped by the fact that the POSIX standard provides a standardized regular expression API 1 . Well known C implementations include the GNU regex library 5 , the RX library 6 , Henry Spencers regex library 7 which forms part of the BSD distribution, and PCRE 8 for perl like regular expressions. Most early C implementations were simple wrappers around the C APIs 9 10 , although these add a measure of safety by using the Resource Acquisition is Initialization principle, as well an object oriented interface. However wrapper classes and C APIs alike suffer from a key defect when interfacing with the C standard library: they are limited to searching narrow character strings held in a const char. Native C implementations can overcome the limitations suffered by C libraries, and there are two main candidates for standardization: the Boost regex library 11 formerly regex, and GRETA 12 . Both these libraries support narrow and wide character regular expressions, and iterator-based searches. GRETA implements the regular expression algorithms as member functions of the regular expression class, while the Boost library make these separate non-member functions. GRETA is based on perl regular expressions, where as the Boost library uses POSIX regular expression matching rules, albeit with some perl compatible extensions. Target audience The are essentially three possible target audiences for a regular expression library: End Users: regular expression may be used by the end user of a program - typically this would be in a text editor for search and replace operations. These users would require that the expression is localized to use their native character set, and never exhibit the kind of exponential search times that some regular expression engines can produce under worst case conditions. Typically this means that the regular expression recognized by the engine has to be restricted to some kind of safe subset of the syntax recognized by programming tools such as perl. Programmers: this section includes all users of scripting languages like perl 2 , awk 1 , and ECMAScript 4 . In these cases the regular expressions are authored by the programmer, and these are converted to a state machine at run time. Typically these users require an expressive regular expression syntax, at the expense of worst case performance. Localization of the syntax is not normally required, although the state machine itself should usually honor the run time locale for example when matching an alphanumeric character. Programmers code generators: this section of users include those using tools like lex 1 , which convert one or more regular expressions into a C or C source file. Here the expressions are written by the programmer, and then converted to a state machine at compile time. There is no localization, either of the syntax or of the way in which the state machine behaves at runtime.
However, thought has been given to the other possible user types, and the design by no means rule these out completely. Impact on the Standard This is a pure library proposal, it does not add any new language features, or alter any existing standard library headers. Design Overview This proposal consists of a pair of classes, and a set of algorithms that operate on instances of those classes. Regular expressions are represented by instances of the basic_regex class template, matches against a regular expression are represented by instances of the match_results class template. The algorithms regex_match and regex_search take as input a regular expression and a character string, and fill in a match_results structure if a match is found. For search and replace operations the algorithm regex_replace performs formatted search and replace on a character string, producing a new string as output. Finally, for repeated find operations, there are two iterator types;
Character types : both narrow and wide character strings are supported. Representation of regular expression matches : a container of sub-expression matches, what matched each marked sub-expression in the regular expression is reported back to the user discussion . Regular expression algorithms : iterator based non-member functions for searching for a match, and for finding an exact match discussion . Search and replace : ECMAScript-like search and replace via a non-member function to transform one string to another discussion . Repeated searches : an iterator type to enumerate all of the regular expression matches within another iterator sequence discussion . Splitting a string : an iterator type that enumerates one or more strings for each regular expression match within another iterator sequence discussion . Summary of key limitations imposed by the design Regular expression syntax : whatever syntax is chosen will severely limit the choice of state machine used along with any complexity guarantees given. The choice of ECMAScript regular expressions implies that some regular expressions can only be matched by a backtracking Non-deterministic Finite Automata NFA, other algorithms may be used for some subsets of ECMAScript regular expressions discussion . Regular expression object type : the fact that the regular expression object is templated on the character type, and not on an iterator type, may make some implementation choices more difficult. In practice this choice seems to have no impact on performance discussion . Regular expression traits class : the traits class design will be a major constraining factor on the kinds of internal representations that a regular expression state machine may use.
Support for wide characters : much regular expression theoretical work is based on small finite character sets, and can therefore not be applied to wide character strings, however wide character string support is considered essential. This constraint may be mitigated to some degree by providing a specialization for basic_regex<char> and then overloading its associated non-member algorithms discussion . Sub-expression matching : User feedback suggests that support for matching marked sub-expressions is essential. However, this can have a negative impact on the kinds of algorithms that may be employed. This is mitigated to some degree in this proposal by providing overloaded forms of the key algorithms that do not report what matched, allowing alternative algorithms to be used in these cases for some expressions discussion . Iterator type : support for forward iterators would allow for multibyte character support via iterator adapters, however this would also preclude some perl-like features, and may have negative performance implications. This proposal requires support for bidirectional iterators, however this choice needs more investigation and may need to be changed at a later date discussion . Regular expression syntax and complexity The complexity of converting a character string to a state machine, and the complexity of searching for a match to that machine, are deeply dependent upon the syntax used to represent the regular expression. There are three ways in which the regular expression syntax can be specified: Leave the manner in which the string representation is interpreted as implementation defined. This makes it impossible for programmers to include portable regular expressions in their code. Define completely within the standard how the regular expression finite state machine is interpreted. This option incurs a lot of work, and possibly duplicates information that has already been standardized elsewhere.
Primary equivalence classes, for example a would match any character whose primary collation key is the same as the primary collation key for a, ie , , , , , , A, , , , , , or . ECMAScript ECMAScript 4 is based on the perl regular expression syntax 2 ,...
|