Berkeley CSUA MOTD:Entry 13276
Berkeley CSUA MOTD
 
WIKI | FAQ | Tech FAQ
http://csua.com/feed/
2024/11/27 [General] UID:1000 Activity:popular
11/27   

2004/4/20 [Computer/SW/Languages/Perl] UID:13276 Activity:nil
4/19    yay!  Next version of boost::regex will support more Perl-like (or
        Javascript-like) pattern matches:
        http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm
        \_ what?  boost?
           \_ http://www.boost.org
              It's a popular set of C++ template libraries.
              \_ Popular doesn't exactly describe it.  It's the short list for
                 inclusion into the next C++ standard.
Cache (8192 bytes)
std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm
Regular expressions are used by many familiar Unix utilities, including egrep, grep, lex, and sed 1 , and are supported directly in many programming languages including awk 1 , perl 2 , C 3 , and ECMAScript 4 also known as JavaScript, and indeed are often used ubiquitously wherever text processing occurs. Until recently most regular expression libraries were implemented in C, no doubt helped by the fact that the POSIX standard provides a standardized regular expression API 1 . Well known C implementations include the GNU regex library 5 , the RX library 6 , Henry Spencers regex library 7 which forms part of the BSD distribution, and PCRE 8 for perl like regular expressions. Most early C implementations were simple wrappers around the C APIs 9 10 , although these add a measure of safety by using the Resource Acquisition is Initialization principle, as well an object oriented interface. However wrapper classes and C APIs alike suffer from a key defect when interfacing with the C standard library: they are limited to searching narrow character strings held in a const char. Native C implementations can overcome the limitations suffered by C libraries, and there are two main candidates for standardization: the Boost regex library 11 formerly regex, and GRETA 12 . Both these libraries support narrow and wide character regular expressions, and iterator-based searches. GRETA implements the regular expression algorithms as member functions of the regular expression class, while the Boost library make these separate non-member functions. GRETA is based on perl regular expressions, where as the Boost library uses POSIX regular expression matching rules, albeit with some perl compatible extensions. Target audience The are essentially three possible target audiences for a regular expression library: End Users: regular expression may be used by the end user of a program - typically this would be in a text editor for search and replace operations. These users would require that the expression is localized to use their native character set, and never exhibit the kind of exponential search times that some regular expression engines can produce under worst case conditions. Typically this means that the regular expression recognized by the engine has to be restricted to some kind of safe subset of the syntax recognized by programming tools such as perl. Programmers: this section includes all users of scripting languages like perl 2 , awk 1 , and ECMAScript 4 . In these cases the regular expressions are authored by the programmer, and these are converted to a state machine at run time. Typically these users require an expressive regular expression syntax, at the expense of worst case performance. Localization of the syntax is not normally required, although the state machine itself should usually honor the run time locale for example when matching an alphanumeric character. Programmers code generators: this section of users include those using tools like lex 1 , which convert one or more regular expressions into a C or C source file. Here the expressions are written by the programmer, and then converted to a state machine at compile time. There is no localization, either of the syntax or of the way in which the state machine behaves at runtime. However, thought has been given to the other possible user types, and the design by no means rule these out completely. Impact on the Standard This is a pure library proposal, it does not add any new language features, or alter any existing standard library headers. Design Overview This proposal consists of a pair of classes, and a set of algorithms that operate on instances of those classes. Regular expressions are represented by instances of the basic_regex class template, matches against a regular expression are represented by instances of the match_results class template. The algorithms regex_match and regex_search take as input a regular expression and a character string, and fill in a match_results structure if a match is found. For search and replace operations the algorithm regex_replace performs formatted search and replace on a character string, producing a new string as output. Finally, for repeated find operations, there are two iterator types; Character types : both narrow and wide character strings are supported. Representation of regular expression matches : a container of sub-expression matches, what matched each marked sub-expression in the regular expression is reported back to the user discussion . Regular expression algorithms : iterator based non-member functions for searching for a match, and for finding an exact match discussion . Search and replace : ECMAScript-like search and replace via a non-member function to transform one string to another discussion . Repeated searches : an iterator type to enumerate all of the regular expression matches within another iterator sequence discussion . Splitting a string : an iterator type that enumerates one or more strings for each regular expression match within another iterator sequence discussion . Summary of key limitations imposed by the design Regular expression syntax : whatever syntax is chosen will severely limit the choice of state machine used along with any complexity guarantees given. The choice of ECMAScript regular expressions implies that some regular expressions can only be matched by a backtracking Non-deterministic Finite Automata NFA, other algorithms may be used for some subsets of ECMAScript regular expressions discussion . Regular expression object type : the fact that the regular expression object is templated on the character type, and not on an iterator type, may make some implementation choices more difficult. In practice this choice seems to have no impact on performance discussion . Regular expression traits class : the traits class design will be a major constraining factor on the kinds of internal representations that a regular expression state machine may use. Support for wide characters : much regular expression theoretical work is based on small finite character sets, and can therefore not be applied to wide character strings, however wide character string support is considered essential. This constraint may be mitigated to some degree by providing a specialization for basic_regex<char> and then overloading its associated non-member algorithms discussion . Sub-expression matching : User feedback suggests that support for matching marked sub-expressions is essential. However, this can have a negative impact on the kinds of algorithms that may be employed. This is mitigated to some degree in this proposal by providing overloaded forms of the key algorithms that do not report what matched, allowing alternative algorithms to be used in these cases for some expressions discussion . Iterator type : support for forward iterators would allow for multibyte character support via iterator adapters, however this would also preclude some perl-like features, and may have negative performance implications. This proposal requires support for bidirectional iterators, however this choice needs more investigation and may need to be changed at a later date discussion . Regular expression syntax and complexity The complexity of converting a character string to a state machine, and the complexity of searching for a match to that machine, are deeply dependent upon the syntax used to represent the regular expression. There are three ways in which the regular expression syntax can be specified: Leave the manner in which the string representation is interpreted as implementation defined. This makes it impossible for programmers to include portable regular expressions in their code. Define completely within the standard how the regular expression finite state machine is interpreted. This option incurs a lot of work, and possibly duplicates information that has already been standardized elsewhere. Primary equivalence classes, for example a would match any character whose primary collation key is the same as the primary collation key for a, ie , , , , , , A, , , , , , or . ECMAScript ECMAScript 4 is based on the perl regular expression syntax 2 ,...
Cache (6013 bytes)
www.boost.org -> www.boost.org/
The Boost web site provides free peer-reviewed portable C source libraries. The emphasis is on libraries which work well with the C Standard Library. The libraries are intended to be widely useful, and are in regular use by thousands of programmers across a broad spectrum of applications. A further goal is to establish existing practice and provide reference implementations so that Boost libraries are suitable for eventual standardization. Ten Boost libraries will be included in the C Standards Committee s upcoming C Standard Library Technical Report as a step toward becoming part of a future C Standard. Although Boost was begun by members of the C Standards Committee Library Working Group, participation has expanded to include thousands programmers from the C community at large. Participation If you are interested in participating in Boost, please join our main developers mailing list . Discussions are highly technical, and list members are encouraged to participate in formal reviews of proposed libraries. There is also a users mailing list , and several project specific lists . Both the main Boost developers list and the users list are also accessible as newsgroups . The new license offers better legal protection for both users and developers, and should speed users legal reviews of Boost libraries. The legal team was led by Diane Cabell , Director, Clinical Programs, Berkman Center for Internet & Society , Harvard Law School. Devin Smith , attorney, Nixon Peabody LLP , wrote the Boost License. Eva Chan, Harvard Law School, contributed analysis of issues and drafts of various legal documents. Please Note: Many of the Boost libraries are still using earlier licenses, though all conform to the Boost License Requirements . After this release we will begin an effort to move toward uniform use of the new license. Build and Installation New Getting Started procedures ease download and installation, from Rene Rivera and others. Improved support for libraries requiring separate compilation , from John Maddock and others. New Libraries enable_if : Selective inclusion of function template overloads, from Jaakko Jrvi, Jeremiah Willcock, and Andrew Lumsdaine. This is an important new technique which exploits the SFINAE substitution-failure-is-not-an-error principle. Variant Library : Safe, generic, stack-based discriminated union container, from Eric Friedman and Itay Maman. Updated Libraries Compose : This library has been deprecated and will be removed in a future release. Date Time Library: A whole host of bug fixes, new features, and documentation improvements. Filesystem Library : Several added functions, including improved checking for directory and file name portability. Iterator Library : Major version upgrade, with interface as proposed for the C library TR, including an improved iterator_adaptor design plus several new components, from David Abrahams, Jeremy Siek, and Thomas Witt. MultiArray : The multi_array class template now provides an element-preserving resize operation as well as default construction see the reference manual for more information. Python Library : Support for Python 23 and Intel C on Linux Container Indexing Suite added. Added support for custom users predicate returning both boolean result code and possibly error message Documentation structure rework and update For complete list of changes see Test Library release notes Miscellaneous Expanded testing and fixes for non-conforming compilers. August 19, 2003 - Version 1302 bugfix release Boost Consulting is now hosting Boost CVS mirrors - see our download page . Backported changes to the config system , to better handle new compiler releases. Tests are now run in the context of the users PATH environment settings msvc-stlport and intel-win32-stlport toolsets now build static libraries with multithreading enabled, to be compatible with the STLPort builds. Optional Library added - A discriminated-union wrapper for optional values, from Fernando Cacciola. Interval Library added - Extends the usual arithmetic functions to mathematical intervals, from Guillaume Melquiond, Herv Brnnimann and Sylvain Pion. MPL added - Template metaprogramming framework of compile-time algorithms, sequences and metafunction classes, from Aleksey Gurtovoy. Spirit Library added - An LL unlimited lookahead parser framework that represents parsers directly as EBNF grammars in inlined C source code, complete with semantic actions, ASTs and much more, from Joel de Guzman and team. Smart Pointers Library - cast functions are now spelled static_pointer_cast / dynamic_pointer_cast ; Date-Time Library - several fixes and small additions including an interface change to partial_date. Function Library - added support for assignment to zero to clear and comparison against zero to check if empty. Operators Library - now takes advantage of named return value optimization NRVO when available, from Daniel Frey. Regression Tests - Much expanded, plus a very nice summary page from Rene Rivera. Test Library - introduced following new facilities: Automatic registration of unit tests XML log format XML report format BOOST_CHECK_NO_THROW test tool BOOST_BITWISE_CHECK test tool For complete list of changes see Test Library release notes Many fixes and enhancements to other libraries. Dynamic Bitset added - A runtime sized version of the std::bitset class from Jeremy Siek and Chuck Allison. Format Library added - Type-safe printf-like format operations, from Samuel Krempp. Some old syntax and little-used features have been deprecated and will be removed shortly, and the syntax for the boost::function class template has been greatly improved on conforming compilers. Multi-array Library added - Multidimensional containers and adaptors for arrays of contiguous data, from Ron Garcia. Python Library - Version 2 is released, from Dave Abrahams and others. This is a major rewrite which works on many more compilers and platforms, with a completely new interface and lots of new features.