Learning Ruby 4, Regular Expressions

Regular expressions are objects of type Regexp.

create
a = Regexp.new('^\s*[a-z]')   /^\s*[a-z]/
b = /^\s*[a-z]/               /^\s*[a-z]/
c = %r{^\s*[a-z]}             /^\s*[a-z]/
options:
/i         case insensitive
/o         only interpolate #{} blocks once
/m         multiline mode - '.' will match newline
/x         extended mode - whitespace is ignored
/[neus]    encoding: none, EUC, UTF-8, SJIS, respectively
e.g. b = /^\s*[a-z]/i
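
As a quick check of the three literal forms and a few of the options above, a minimal sketch follows. The sample strings and the variable name time are made up for illustration.

```ruby
# Three equivalent ways to build the same pattern.
a = Regexp.new('^\s*[a-z]')
b = /^\s*[a-z]/
c = %r{^\s*[a-z]}
puts a == b && b == c   # Regexp#== compares source and options

# /i ignores case.
puts "HELLO" =~ /^\s*[a-z]/i ? "matched" : "no match"

# /m lets '.' match a newline.
puts "a\nb" =~ /a.b/  ? "matched" : "no match"
puts "a\nb" =~ /a.b/m ? "matched" : "no match"

# /x ignores whitespace and allows comments inside the pattern.
time = /
  (\d\d) :   # hour, then a literal colon
  (\d\d)     # minute
/x
puts "12:50" =~ time ? "matched" : "no match"
```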

match operators
 =~ (positive match)
 !~ (negative match)

name = "Fats Waller"
name =~ /a/   → 1
name =~ /z/   → nil
/a/ =~ name   → 1

=~ returns the character position at which the match occurred, or nil if there is no match.
$& receives the part of the string that was matched by the pattern,
$` receives the part of the string that preceded the match,
$' receives the string after the match.
The match also sets the thread-local variables $~ and $1 through $9.
$~ is a MatchData object.
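
The variables above can be inspected directly after a match. This is only a small sketch using the same sample name; the labels printed are arbitrary.

```ruby
name = "Fats Waller"

if name =~ /a/
  puts "position: #{$~.begin(0)}"   # same value that =~ returned
  puts "before:   #{$`.inspect}"
  puts "matched:  #{$&.inspect}"
  puts "after:    #{$'.inspect}"
end

# !~ is true when the pattern does not match.
puts "no z in name" if name !~ /z/
```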

To illustrate how matching works, define a method:
def show_regexp(a, re)
  if a =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end
show_regexp('very interesting', /t/)   very in<<t>>eresting
show_regexp('Fats Waller', /a/)        F<<a>>ts Waller
show_regexp('Fats Waller', /ll/)       Fats Wa<<ll>>er
show_regexp('Fats Waller', /z/)        no match

Patterns
All characters except ., |, (, ), [, ], {, }, +, \, ^, $, *, and ? match themselves.
To match one of these characters literally, precede it with a backslash (\).
A regular expression may contain #{...} expression substitutions.
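
A short sketch of escaping and substitution inside a pattern. The sample text is invented, and Regexp.escape is a standard library method that these notes do not otherwise mention; it is used here as one way to protect interpolated text.

```ruby
text = 'Total: $12.50 (approx.)'

# Escape metacharacters with a backslash to match them literally.
puts text =~ /\$12\.50/ ? "found price" : "no price"

# Unescaped, '.' matches any character, so this also matches "12x50".
puts '12x50' =~ /12.50/ ? "matched" : "no match"

# #{...} is substituted into the pattern; Regexp.escape protects
# any metacharacters in the substituted text.
word = 'approx.'
re = /\(#{Regexp.escape(word)}\)/
puts re.source
puts text =~ re ? "found note" : "no note"
```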

Anchors

By default, a regular expression will try to find the first match for the pattern in a string.

The patterns ^ and $ match the beginning and end of a line.
\A matches the beginning of a string,
\z and \Z match the end of a string. (Actually, \Z matches the end of a string unless the string ends with a \n, in which case it matches just before the \n.)

show_regexp("this is\nthe time", /^the/)     this is\n<<the>> time
show_regexp("this is\nthe time", /is$/)      this <<is>>\nthe time
show_regexp("this is\nthe time", /\Athis/)   <<this>> is\nthe time
show_regexp("this is\nthe time", /\Athe/)    no match

\b and \B match word boundaries and nonword boundaries.
Word characters are letters, numbers, and underscores.
show_regexp("this is\nthe time", /\bis/)   this <<is>>\nthe time
show_regexp("this is\nthe time", /\Bis/)   th<<is>> is\nthe time
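
The difference between line anchors and string anchors is easiest to see side by side. A minimal sketch, with the second sample string ("done\n") made up for the purpose:

```ruby
text = "this is\nthe time"

p(text =~ /^the/)     # ^ matches at the start of any line
p(text =~ /\Athe/)    # \A matches only at the start of the string

line = "done\n"
p(line =~ /done$/)    # $ matches before the newline
p(line =~ /done\Z/)   # \Z tolerates one trailing newline
p(line =~ /done\z/)   # \z demands the very end of the string

# \b and \B pick different occurrences of "is".
puts "this is".sub(/\bis/, "IS")
puts "this is".sub(/\Bis/, "IS")
```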

Character Classes
[aeiou] will match a vowel
[,.:;!?] matches punctuation

show_regexp('Price $12.', /[aeiou]/)            Pr<<i>>ce $12.
show_regexp('Price $12.', /[\s]/)               Price<< >>$12.
show_regexp('Price $12.', /[[:digit:]]/)        Price $<<1>>2.
show_regexp('Price $12.', /[[:space:]]/)        Price<< >>$12.
show_regexp('Price $12.', /[[:punct:]aeiou]/)   Pr<<i>>ce $12.

           POSIX Character Classes
            [:alnum:]    Alphanumeric
            [:alpha:]    Uppercase or lowercase letter
            [:blank:]    Blank and tab
            [:cntrl:]    Control characters (at least 0x00–0x1f, 0x7f)
            [:digit:]    Digit
            [:graph:]    Printable character excluding space
            [:lower:]    Lowercase letter
            [:print:]    Any printable character (including space)
            [:punct:]    Printable character excluding space and alphanumeric
            [:space:]    Whitespace (same as \s)
            [:upper:]    Uppercase letter
            [:xdigit:]   Hex digit (0–9, a–f, A–F)

The sequence c1-c2 represents all the characters between c1 and c2, inclusive.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[A-F]/)         see [<<D>>esign Patterns-page 123]
show_regexp(a, /[A-Fa-f]/)      s<<e>>e [Design Patterns-page 123]
show_regexp(a, /[0-9]/)         see [Design Patterns-page <<1>>23]
show_regexp(a, /[0-9][0-9]/)    see [Design Patterns-page <<12>>3]

If you want to include the literal characters ] and - within a character class, they must appear at the start (or be escaped with a backslash).
Put a ^ immediately after the opening bracket to negate a character class.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[]]/)        → see [Design Patterns-page 123<<]>>
show_regexp(a, /[-]/)        → see [Design Patterns<<->>page 123]
show_regexp(a, /[^a-z]/)     → see<< >>[Design Patterns-page 123]
show_regexp(a, /[^a-z\s]/)   → see <<[>>Design Patterns-page 123]
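
To see every character a class accepts rather than only the first, String#scan can collect all matches. scan is not used elsewhere in these notes; this is just an illustrative sketch on the same sample string.

```ruby
a = 'see [Design Patterns-page 123]'

p a.scan(/[0-9]/)         # every digit
p a.scan(/[^a-z\s]/)      # everything that is not a lowercase letter or whitespace
p a.scan(/[\]\-]/)        # ] and - escaped, so their position in the class does not matter
p a.scan(/[[:upper:]]/)   # POSIX class inside a bracket expression
```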

                      Table 5.1. Character class abbreviations
Sequence   As [ . . . ]      Meaning
\d         [0-9]             Digit character
\D         [^0-9]            Any character except a digit
\s         [ \t\r\n\f]       Whitespace character
\S         [^ \t\r\n\f]      Any character except whitespace
\w         [A-Za-z0-9_]      Word character
\W         [^A-Za-z0-9_]     Any character except a word character

show_regexp('It costs $12.', /\s/)   It<< >>costs $12.
show_regexp('It costs $12.', /\d/)   It costs $<<1>>2.

A period ( . ) appearing outside brackets represents any character except a newline.
a = 'It costs $12.'
show_regexp(a, /c.s/)   It <<cos>>ts $12.
show_regexp(a, /./)     <<I>>t costs $12.
show_regexp(a, /\./)    It costs $12<<.>>

Repetition * + ? {m,n}
       r*       matches zero or more occurrences of r.
       r+       matches one or more occurrences of r.
       r?       matches zero or one occurrence of r.
       r{m,n}   matches at least "m" and at most "n" occurrences of r.
       r{m,}    matches at least "m" occurrences of r.
       r{m}     matches exactly "m" occurrences of r.
       *?       matches zero or more occurrences of the previous regular expression (non-greedy).
       +?       matches one or more occurrences of the previous regular expression (non-greedy).

a = "The moon is made of cheese"
show_regexp(a, /\w+/)              <<The>> moon is made of cheese
show_regexp(a, /\s.*\s/)           The<< moon is made of >>cheese
show_regexp(a, /\s.*?\s/)          The<< moon >>is made of cheese
show_regexp(a, /[aeiou]{2,99}/)    The m<<oo>>n is made of cheese
show_regexp(a, /mo?o/)             The <<moo>>n is made of cheese
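
The greedy and non-greedy forms can be compared by pulling out just the matched text. A small sketch on the same sentence, using String#[] with a pattern (an assumption of convenience; show_regexp would work as well):

```ruby
a = "The moon is made of cheese"

p a[/\s.*\s/]            # greedy: runs to the last whitespace it can reach
p a[/\s.*?\s/]           # non-greedy: stops at the first whitespace it can reach
p a.scan(/[aeiou]{2}/)   # exactly two vowels in a row
p a[/mo?o/]              # the optional "o" is taken when available
```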

Alternation |
a = "red ball blue sky"
show_regexp(a, /d|e/)                  r<<e>>d ball blue sky
show_regexp(a, /al|lu/)                red b<<al>>l blue sky
show_regexp(a, /red ball|angry sky/)   <<red ball>> blue sky

Grouping ()
Everything within the group is treated as a single regular expression.

show_regexp('banana', /an*/)         b<<an>>ana
show_regexp('banana', /(an)*/)       <<>>banana
show_regexp('banana', /(an)+/)       b<<anan>>a
a = 'red ball blue sky'
show_regexp(a, /blue|red/)             <<red>> ball blue sky
show_regexp(a, /(blue|red) \w+/)       <<red ball>> blue sky
show_regexp(a, /(red|blue) \w+/)       <<red ball>> blue sky
show_regexp(a, /red|blue \w+/)         <<red>> ball blue sky
show_regexp(a, /red (ball|angry) sky/)       no match
a = 'the red angry sky'
show_regexp(a, /red (ball|angry) sky/)       the <<red angry sky>>
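
Groups also capture what they match. The sketch below contrasts a capturing group with the non-capturing (?: ) form from the character list at the end of these notes; the variable names m1 and m2 are arbitrary.

```ruby
a = 'red ball blue sky'

# Without a group, | splits the whole pattern: "red" OR "blue <word>".
p a[/red|blue \w+/]

# A group limits the alternation to the color, and captures it.
m1 = /(red|blue) (\w+)/.match(a)
p m1[0], m1.captures

# (?: ) groups without capturing, so only the noun is numbered.
m2 = /(?:red|blue) (\w+)/.match(a)
p m2[0], m2.captures
```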

Within the pattern, the sequence \1 refers to the match of the first group, \2 the second group, and so on. Outside the pattern, the special variables $1, $2, and so on, serve the same purpose.
"12:50am" =~ /(\d\d):(\d\d)(..)/     0
"Hour is #$1, minute #$2"            "Hour is 12, minute 50"
"12:50am" =~ /((\d\d):(\d\d))(..)/   0
"Time is #$1"                        "Time is 12:50"
"Hour is #$2, minute #$3"            "Hour is 12, minute 50"
"AM/PM is #$4"                       "AM/PM is am"

Backreferences can also be used to look for various forms of repetition.
# match duplicated letter
show_regexp('He said "Hello"', /(\w)\1/)   He said "He<<ll>>o"
# match duplicated substrings
show_regexp('Mississippi', /(\w+)\1/)      M<<ississ>>ippi

match delimiters
show_regexp('He said "Hello"', /(["']).*?\1/)   He said <<"Hello">>
show_regexp("He said 'Hello'", /(["']).*?\1/)   He said <<'Hello'>>
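
One common use of \1 is spotting an accidentally repeated word. A minimal sketch; the sample sentence and the name doubled are invented for illustration.

```ruby
text = "Paris in the the spring"

# \1 inside the pattern repeats whatever group 1 matched.
doubled = /\b(\w+) \1\b/

if text =~ doubled
  puts "repeated word: #{$1}"
  puts "matched text:  #{$&}"
  puts "at position:   #{$~.begin(0)}"
end
```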

Pattern-Based Substitution
String#sub performs one replacement
String#gsub replaces every occurrence of the match

a = "the quick brown fox"
a.sub(/[aeiou]/, '*')      "th* quick brown fox"
a.gsub(/[aeiou]/, '*')     "th* q**ck br*wn f*x"
a.sub(/\s\S+/, '')         "the brown fox"
a.gsub(/\s\S+/, '')        "the"

block: the value returned by the block replaces each match
a = "the quick brown fox"
a.sub(/^./) {|match| match.upcase }         "The quick brown fox"
a.gsub(/[aeiou]/) {|vowel| vowel.upcase }   "thE qUIck brOwn fOx"

def mixed_case(name)
  name.gsub(/\b\w/) {|first| first.upcase }
end
mixed_case("fats waller")               "Fats Waller"
mixed_case("louis armstrong")           "Louis Armstrong"
mixed_case("strength in numbers")       "Strength In Numbers"
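
The block form is handy when the replacement has to be computed from the match. A short sketch with made-up sample strings:

```ruby
order = "3 apples and 10 pears"

# The block receives each match and returns its replacement.
doubled = order.gsub(/\d+/) { |n| (n.to_i * 2).to_s }
puts doubled

# $1, $2, ... are also set inside the block.
dates = "2024-01-15 and 2023-12-31"
swapped = dates.gsub(/(\d+)-(\d+)-(\d+)/) { "#{$3}/#{$2}/#{$1}" }
puts swapped
```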

Backslash Sequences in the Substitution
"fred:smith".sub(/(\w+):(\w+)/, '\2, \1')   "smith, fred"
"nercpyitno".gsub(/(.)(.)/, '\2\1')         "encryption"
Besides \1, \2, and so on, the replacement string may contain:
\& (last match),
\+ (last matched group),
\` (string prior to match),
\' (string after match),
\\ (a literal backslash)

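To double every backslash in a string (results show the characters in the string, not inspect output):
str = 'a\b\c'                  "a\b\c"
str.gsub(/\\/, '\\\\\\\\')     "a\\b\\c"

or
str = 'a\b\c'              "a\b\c"
str.gsub(/\\/, '\&\&')     "a\\b\\c"

or
str = 'a\b\c'                "a\b\c"
str.gsub(/\\/) { '\\\\' }    "a\\b\\c"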
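
Because the layers of escaping are easy to miscount, it may help to compare lengths instead of reading backslashes by eye. A sketch; the variable names are arbitrary.

```ruby
str = 'a\b\c'   # single quotes: 5 characters, two of them backslashes

with_string = str.gsub(/\\/, '\\\\\\\\')   # replacement string is escaped twice
with_amp    = str.gsub(/\\/, '\&\&')       # \& stands for the matched backslash
with_block  = str.gsub(/\\/) { '\\\\' }    # block result is used as-is

puts str, with_string
puts str.length, with_string.length
puts with_string == with_amp && with_amp == with_block
```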

example:
n modifier (Japanese)
    def unescapeHTML(string)
      str = string.dup
      str.gsub!(/&(.*?);/n) {
         match = $1.dup
         case match
         when /\Aamp\z/ni           then '&'
         when /\Aquot\z/ni          then '"'
         when /\Agt\z/ni            then '>'
         when /\Alt\z/ni            then '<'
         when /\A#(\d+)\z/n         then Integer($1).chr
         when /\A#x([0-9a-f]+)\z/ni then $1.hex.chr
         end
      }
      str
    end
    puts unescapeHTML("1&lt;2 &amp;&amp; 4&gt;3")
    puts unescapeHTML("&quot;A&quot; = &#65; = &#x41;")
produces:
    1<2 && 4>3
    "A" = A = A

Object-Oriented Regular Expressions
re = /(\d+):(\d+)/     # match a time hh:mm
md = re.match("Time: 12:34am")
md.class                → MatchData
md[0]         # == $&   → "12:34"
md[1]         # == $1   → "12"
md[2]         # == $2   → "34"
md.pre_match  # == $`   → "Time: "
md.post_match # == $'   → "am"

re = /(\d+):(\d+)/     # match a time hh:mm
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
md1[1, 2]   → ["12", "34"]
md2[1, 2]   → ["10", "30"]

re = /(\d+):(\d+)/
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
[ $1, $2 ]   # last successful match       ["10", "30"]
$~ = md1
[ $1, $2 ]   # previous successful match   ["12", "34"]
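
Working with the MatchData object directly avoids depending on the $ variables at all. A minimal sketch, assuming a hypothetical helper named parse_time that is not part of the original notes:

```ruby
# Hypothetical helper: returns a hash describing the first hh:mm in text,
# or nil when the pattern does not match.
def parse_time(re, text)
  md = re.match(text)
  return nil unless md            # Regexp#match returns nil on failure
  hour, minute = md.captures.map(&:to_i)
  { hour: hour, minute: minute, offset: md.begin(0), rest: md.post_match }
end

re = /(\d+):(\d+)/
p parse_time(re, "Time: 12:34am")
p parse_time(re, "no time here")
```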


Regex Characters List:
.          any character except newline
[ ]        any single character of set
[^ ]       any single character NOT of set
*          0 or more of the previous regular expression
*?         0 or more of the previous regular expression (non-greedy)
+          1 or more of the previous regular expression
+?         1 or more of the previous regular expression (non-greedy)
?          0 or 1 of the previous regular expression
|          alternation
( )        grouping regular expressions
^          beginning of a line or string
$          end of a line or string
{m,n}      at least m but at most n of the previous regular expression
{m,n}?     at least m but at most n of the previous regular expression (non-greedy)
\A         beginning of a string
\b         backspace (0x08) (inside [] only)
\b         word boundary (outside [] only)
\B         non-word boundary
\d         digit, same as [0-9]
\D         non-digit
\S         non-whitespace character
\s         whitespace character [ \t\n\r\f]
\W         non-word character
\w         word character [0-9A-Za-z_]
\z         end of a string
\Z         end of a string, or before newline at the end
(?# )      comment
(?: )      grouping without backreferences
(?= )      zero-width positive look-ahead assertion
(?! )      zero-width negative look-ahead assertion
(?ix-ix)   turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.
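
The extended (? ) constructs in the list are not demonstrated elsewhere in these notes, so here is a brief sketch of a few of them. All sample strings are invented.

```ruby
# (?= ) look-ahead: digits only when followed by "USD"; "USD" itself is not consumed.
p "100USD 200EUR"[/\d+(?=USD)/]

# (?! ) negative look-ahead: "foo" words that do not continue with "bar".
p "foobar foobaz".scan(/foo(?!bar)\w+/)

# (?i: ) switches on case-insensitivity for part of the pattern only.
p "Ruby" =~ /R(?i:UBY)/
p "ruby" =~ /R(?i:UBY)/

# (?# ) embeds a comment in the pattern.
p "12:34" =~ /\d+(?# hours):\d+(?# minutes)/
```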

Special Character Classes:
[:alnum:]   alpha-numeric characters
[:alpha:]   alphabetic characters
[:blank:]   space and tab only; does not include newlines, carriage returns, etc
[:cntrl:]   control characters
[:digit:]   decimal digits
[:graph:]   graph characters
[:lower:]   lower case characters
[:print:]   printable characters
[:punct:]   punctuation characters
[:space:]   whitespace, including tabs, carriage returns, etc
[:upper:]   upper case characters
[:xdigit:]  hexadecimal digits

Please credit the source when reposting: from 船长日志 (Captain's Log), original link: http://www.cslog.cn/Content/ruby_regexp/
