Berkeley CSUA MOTD:Entry 54012

2011/1/19-2/19 [Computer/SW/Languages/Perl] UID:54012 Activity:nil
1/19    Perl god, please go to http://perldoc.perl.org/perlre.html
        Go to "Quantifiers" and look at the non-greedy forms, such as
        +?, *?, ??, {n,}?, {n,m}?
        So I understand that the non-greedy forms pick the shortest
        match when there are different choices (instead of the default
        maximal munch). What about "{n}?" ?  What are some
        examples of using "{n}?" ?
        \_ With /g, s{2,} will match "ssss" once.  s{2,}? will match it
           twice.
           \_ that is clear. s{n,} will match at least n occurrences of
              s, and s{n,m} between n and m. Traditionally regex is
              maximal munch. However the "?" minimal munch modifier, as
              in s{n,m}?, will perform minimal munch.
                My question is, if you have just s{n} and you match
              exactly n occurrences of s, then what is the point of the
              minimal match, given that there is no choice?
              \_ It's just for consistency.  s{n} and s{n}? are identical.
                 \_ ^consistency^redundancy
                    ^consistency^stupidity
                    Perl programmers are like Republicans-- righteous,
                    unapologetic, and incapable of seeing the point
                    of view from other languages.
                    \_ How would you design it?  If s{2,3}? is legal but
                       s{2,2}? is an error, then s{$min,$max}? will fail
                       if $min and $max happen to be equal.  That seems
                       suboptimal.
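                       \_ A minimal sketch of that point, assuming a
                          hypothetical helper lazy_run() (made up here,
                          not from perlre) that interpolates its bounds
                          into the quantifier. When $min == $max the "?"
                          has nothing to choose between, but the pattern
                          still compiles and matches:

```perl
use strict;
use warnings;
# Assumption: some newer perls warn that "?" is useless on an exact
# count; silence that category so the demo output stays clean.
no warnings 'regexp';

# Hypothetical helper: shortest run of "s" whose length is in [$min, $max].
sub lazy_run {
    my ($str, $min, $max) = @_;
    return $str =~ /(s{$min,$max}?)/ ? $1 : undef;
}

print lazy_run("ssss", 2, 3), "\n";   # bounds differ: minimal match, "ss"
print lazy_run("ssss", 2, 2), "\n";   # bounds equal: still legal, "ss"
```

                          Both calls print "ss". If s{2,2}? were a
                          syntax error, the second call would die at
                          run time instead.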
        \_ "$_Wha???" lol
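        \_ For anyone who wants to try it, a small sketch (the helper
           name matches() is made up for illustration) that lists the
           /g matches for the greedy, non-greedy, and exact-count forms
           against "ssss":

```perl
use strict;
use warnings;
# Assumption: some newer perls warn that "?" is useless after {n};
# silence that category so only the results are printed.
no warnings 'regexp';

# Hypothetical helper: return every non-overlapping match of $re in $str.
# Call it in list context; m//g with no capture groups returns the
# matched substrings.
sub matches {
    my ($re, $str) = @_;
    return $str =~ /$re/g;
}

for my $re ('s{2,}', 's{2,}?', 's{2}', 's{2}?') {
    my @m = matches($re, "ssss");
    printf "%-7s -> %d match(es): %s\n", $re, scalar @m, join(",", @m);
}
```

           This should print one match ("ssss") for s{2,} and two
           matches ("ss,ss") for each of s{2,}?, s{2} and s{2}?, so
           s{2} and s{2}? behave the same here.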
Cache (8192 bytes)
perldoc.perl.org/perlre.html
That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

to match any character whatsoever, even a newline, which normally it would not match.

match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

Using regular expressions in Perl in perlretut for further explanation of the g and c modifiers.

These are usually written as "the /x modifier", even though the delimiter in question might not really be a slash. It tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. Taken together, these features go a long way towards making Perl's regular expressions more readable. Note that you have to be careful not to include the pattern delimiter in the comment--perl has no way of knowing you did not intend to close the pattern early.

Character class

By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator.

The "." character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't.

Quantifiers

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated as a regular character. n and m are limited to non-negative integral values less than a preset limit defined when perl is built. The actual limit can be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 ..

Note that the meanings don't change, just the "greediness":

    *?      Match 0 or more times, not greedily
    +?      Match 1 or more times, not greedily
    ??      Match 0 or 1 time, not greedily
    {n}?    Match exactly n times, not greedily
    {n,}?   Match at least n times, not greedily
    {n,m}?  Match at least n but not more than m times, not greedily

By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. Thus Perl provides the "possessive" quantifier form as well. This feature can be extremely useful to give perl hints about where it shouldn't backtrack.

You cannot include a literal $ or @ within a \Q sequence. An unescaped $ or @ interpolates the corresponding variable, while escaping will cause the literal string \$ to be matched.

Character Classes and other Special Escapes

In addition, Perl defines the following:

    \w   Match a "word" character (alphanumeric plus "_")
    \W   Match a non-"word" character
    \s   Match a whitespace character
    \S   Match a non-whitespace character
    \d   Match a digit character
    \D   Match a non-digit character
    \pP  Match P, named property.

optionally be wrapped in curly brackets for safer parsing. When of the form \N{NAME}, it matches the character whose name is NAME; and similarly when of the form \N{U+wide hex char}, it matches the character whose Unicode ordinal is wide hex char.
Assertions

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. To match the actual end of the string and not ignore an optional trailing newline, use \z.

It is also useful when writing lex-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of \G is not counted when determining the length of the match. Thus the following will not match forever:

    $str = 'ABC';

It is worth noting that \G improperly used can result in an infinite loop. Take care when using patterns that include \G in an alternation.

To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. There is no limit to the number of captured substrings that you may use. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. If the bracketing group did not match, the associated backreference won't match either.

The curly brackets are optional, however omitting them is less safe as the meaning of the pattern can be changed by text (such as digits) following it. When N is a positive integer the \g{N} notation is exactly equivalent to using normal backreferences. When N is a negative integer then it is a relative backreference referring to the previous N'th capturing group.
When the bracket form is used and N is not an integer, it is treated as a reference to a named buffer. Thus \g{-1} refers to the last buffer, \g{-2} refers to the buffer before that. For example:

    /
                # buffer 1
     (          # buffer 2
                # buffer 3
        \g{-1}  # backref to buffer 3
        \g{-3}  # backref to buffer 1
     )
    /x

and would match the same as / ( \3 \1 )/x .

You may also use apostrophes instead of angle brackets to delimit the name; and you may use the bracketed \g{name} backreference syntax. It's possible to refer to a named capture buffer by absolute and relative number as well. Outside the pattern, a named capture buffer is available via the %+ hash. When different buffers within the same pattern have the same name, $+{name} and \k<name> refer to the leftmost defined group.

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price.

The use of these variables incurs no global performance penalty, unlike their punctuation char equivalents, however at the trade-off that you have to tell perl when you want to use them.

Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern.
Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

Extended Patterns

Perl also defines a consistent extension syntax for features not found in standard tools like awk and lex. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses. The character after the question mark indicates the extension. Some have been part of the core language for many years. Others are experimental and may change without warning or be completely removed. Check the documentation on an individ...