Learning Ruby 4, Regular Expressions

Posted by Captain Zhan Fri, 13 Apr 2007 19:58:08 GMT


Regular expressions are objects of type Regexp.

create
a = Regexp.new('^\s*[a-z]')   /^\s*[a-z]/
b = /^\s*[a-z]/               /^\s*[a-z]/
c = %r{^\s*[a-z]}             /^\s*[a-z]/
options:
/i         case insensitive
/o         only interpolate #{} blocks once
/m         multiline mode - '.' will match newline
/x         extended mode - whitespace is ignored
/[neus]    encoding: none, EUC, UTF-8, SJIS, respectively
e.g. b = /^\s*[a-z]/i
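
A small sketch of how the i, m, and x options change matching. The sample strings and variable names are made up for illustration.

```ruby
# Illustrative only: the strings below are made-up sample data.
insensitive = "RUBY" =~ /ruby/i     # /i ignores case
dot_default = "a\nb" =~ /a.b/       # '.' does not match "\n" by default
dot_multi   = "a\nb" =~ /a.b/m      # /m lets '.' match "\n"

# /x ignores whitespace in the pattern and allows comments
time_re = /
  (\d\d)   # hour
  :
  (\d\d)   # minute
/x
time_pos = "at 12:50" =~ time_re

p insensitive   # 0
p dot_default   # nil
p dot_multi     # 0
p time_pos      # 3
```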

match operators
 =~ (positive match)
 !~ (negative match)

name = "Fats Waller"
name =~ /a/   → 1
name =~ /z/   → nil
/a/ =~ name   → 1


=~ returns the character position at which the match occurred, or nil if there is no match.
$& receives the part of the string that was matched by the pattern,
$` receives the part of the string that preceded the match,
$' receives the string after the match.
The match also sets the variables $~ and $1 through $9, which are local to the current thread and method.
$~ is a MatchData object.
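
As a quick check, the special variables can be compared with the MatchData object in $~. This is only a sketch; the sample string and the variable names are chosen for illustration.

```ruby
# Sample string chosen for illustration.
if "Fats Waller" =~ /(W)al/
  md     = $~                      # MatchData for the match just made
  whole  = [$&, md[0]]             # the matched text, two ways
  before = [$`, md.pre_match]      # text before the match
  after  = [$', md.post_match]     # text after the match
  first  = [$1, md[1]]             # first group
end

p whole    # ["Wal", "Wal"]
p before   # ["Fats ", "Fats "]
p after    # ["ler", "ler"]
p first    # ["W", "W"]
```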

To illustrate how matching works, define a method:
def show_regexp(a, re)
  if a =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end
show_regexp('very interesting', /t/)   very in<<t>>eresting
show_regexp('Fats Waller', /a/)        F<<a>>ts Waller
show_regexp('Fats Waller', /ll/)       Fats Wa<<ll>>er
show_regexp('Fats Waller', /z/)        no match


Patterns
All characters except ., |, (, ), [, ], {, }, +, \, ^, $, *, and ? match themselves.
Put a '\' before any of these characters to match it literally.
A regular expression may contain #{...} expression substitutions.
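
A minimal sketch of escaping and interpolation. Regexp.escape is not mentioned in these notes; it is a standard Ruby method shown here as a convenient way to escape arbitrary text. All sample strings are made up.

```ruby
# Hypothetical inputs for illustration.
price_pos = "Price $12." =~ /\$12\./            # escaped $ and . match literally

word = "cheese"
interp_pos = "made of cheese" =~ /of #{word}/   # #{...} is expanded first

# Regexp.escape backslash-escapes the special characters in a string
literal = Regexp.escape("1+1=2?")
escaped_pos = "is 1+1=2? yes" =~ /#{literal}/

p price_pos     # 6
p interp_pos    # 5
puts literal    # 1\+1=2\?
p escaped_pos   # 3
```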



Anchors

By default, a regular expression will try to find the first match for the pattern in a string.

The patterns ^ and $ match the beginning and end of a line.
\A matches the beginning of a string.
\z and \Z match the end of a string. (Actually, \Z matches the end of a string unless the string ends with a \n, in which case it matches just before the \n.)

show_regexp("this is\nthe time", /^the/)     this is\n<<the>> time
show_regexp("this is\nthe time", /is$/)      this <<is>>\nthe time
show_regexp("this is\nthe time", /\Athis/)   <<this>> is\nthe time
show_regexp("this is\nthe time", /\Athe/)    no match
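
The difference between \z, \Z, and $ only shows up when the string ends with a newline. A short sketch, using a made-up sample line:

```ruby
line = "the time\n"         # made-up sample ending in a newline

at_big_z   = line =~ /time\Z/   # \Z may match just before a final "\n"
at_small_z = line =~ /time\z/   # \z requires the very end of the string
at_dollar  = line =~ /time$/    # $ matches at the end of a line
newline_z  = line =~ /\n\z/     # the newline itself is the last character

p at_big_z     # 4
p at_small_z   # nil
p at_dollar    # 4
p newline_z    # 8
```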

\b and \B match word boundaries and nonword boundaries, respectively.
Word characters are letters, numbers, and underscores.
show_regexp("this is\nthe time", /\bis/)   this <<is>>\nthe time
show_regexp("this is\nthe time", /\Bis/)   th<<is>> is\nthe time


Character Classes
[aeiou] will match a vowel
[,.:;!?] matches punctuation

show_regexp('Price $12.', /[aeiou]/)            Pr<<i>>ce $12.
show_regexp('Price $12.', /[\s]/)               Price<< >>$12.
show_regexp('Price $12.', /[[:digit:]]/)        Price $<<1>>2.
show_regexp('Price $12.', /[[:space:]]/)        Price<< >>$12.
show_regexp('Price $12.', /[[:punct:]aeiou]/)   Pr<<i>>ce $12.

           POSIX Character Classes
            [:alnum:]    Alphanumeric
            [:alpha:]    Uppercase or lowercase letter
            [:blank:]    Blank and tab
            [:cntrl:]    Control characters (at least 0x00–0x1f, 0x7f)
            [:digit:]    Digit
            [:graph:]    Printable character excluding space
            [:lower:]    Lowercase letter
            [:print:]    Any printable character (including space)
            [:punct:]    Printable character excluding space and alphanumeric
            [:space:]    Whitespace (same as \s)
            [:upper:]    Uppercase letter
            [:xdigit:]   Hex digit (0–9, a–f, A–F)


Within the brackets, the sequence c1-c2 represents all the characters between c1 and c2, inclusive.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[A-F]/)        see [<<D>>esign Patterns-page 123]
show_regexp(a, /[A-Fa-f]/)     s<<e>>e [Design Patterns-page 123]
show_regexp(a, /[0-9]/)        see [Design Patterns-page <<1>>23]
show_regexp(a, /[0-9][0-9]/)   see [Design Patterns-page <<12>>3]


If you want to include the literal characters ] and - within a character class, they must appear at the start (or be escaped with a backslash).
Put a ^ immediately after the opening bracket to negate a character class.
a = 'see [Design Patterns-page 123]'
show_regexp(a, /[]]/)        → see [Design Patterns-page 123<<]>>
show_regexp(a, /[-]/)        → see [Design Patterns<<->>page 123]
show_regexp(a, /[^a-z]/)     → see<< >>[Design Patterns-page 123]
show_regexp(a, /[^a-z\s]/)   → see <<[>>Design Patterns-page 123]

                      Table 5.1. Character class abbreviations
Sequence   As [ . . . ]      Meaning
\d         [0-9]             Digit character
\D         [^0-9]            Any character except a digit
\s         [ \t\r\n\f]       Whitespace character
\S         [^ \t\r\n\f]      Any character except whitespace
\w         [A-Za-z0-9_]      Word character
\W         [^A-Za-z0-9_]     Any character except a word character


show_regexp('It costs $12.', /\s/)   It<< >>costs $12.
show_regexp('It costs $12.', /\d/)   It costs $<<1>>2.
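
The abbreviations combine well with repetition. The sketch below uses String#scan and String#split, which these notes do not cover (both are standard String methods); the sample text is made up.

```ruby
text = "It costs $12, or 3 for $30."   # made-up sample

numbers = text.scan(/\d+/)       # every run of digits
words   = text.scan(/\w+/)       # every run of word characters
digits  = text.gsub(/\D/, '')    # delete everything that is not a digit
fields  = "a  b\tc".split(/\s+/) # split on any run of whitespace

p numbers   # ["12", "3", "30"]
p words     # ["It", "costs", "12", "or", "3", "for", "30"]
p digits    # "12330"
p fields    # ["a", "b", "c"]
```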

A period ( . ) appearing outside brackets represents any character except a newline (in multiline mode it matches a newline too).
a = 'It costs $12.'
show_regexp(a, /c.s/)   It <<cos>>ts $12.
show_regexp(a, /./)     <<I>>t costs $12.
show_regexp(a, /\./)    It costs $12<<.>>


Repetition * + ? {m,n}
       r*       matches zero or more occurrences of r.
       r+       matches one or more occurrences of r.
       r?       matches zero or one occurrence of r.
       r{m,n}   matches at least “m” and at most “n” occurrences of r.
       r{m,}    matches at least “m” occurrences of r.
       r{m}     matches exactly “m” occurrences of r.
       r*?      matches zero or more occurrences of r (non-greedy).
       r+?      matches one or more occurrences of r (non-greedy).

a = "The moon is made of cheese"
show_regexp(a, /\w+/)             <<The>> moon is made of cheese  
show_regexp(a, /\s.*\s/)           The<< moon is made of >>cheese
show_regexp(a, /\s.*?\s/)          The<< moon >>is made of cheese
show_regexp(a, /[aeiou]{2,99}/)    The m<<oo>>n is made of cheese
show_regexp(a, /mo?o/)             The <<moo>>n is made of cheese
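
Greedy and non-greedy repetition differ most visibly on text with repeated delimiters. A rough sketch with made-up sample data; String#[] with a regexp argument (which returns the matched text or nil) is used here for brevity and is not covered in these notes.

```ruby
html = "<b>bold</b> and <i>italic</i>"   # made-up sample

greedy = html[/<.+>/]     # + takes as much as it can
lazy   = html[/<.+?>/]    # +? takes as little as it can

digits   = "1234567"
bounded  = digits[/\d{2,4}/]    # up to 4 digits, greedy
shortest = digits[/\d{2,4}?/]   # at least 2 digits, non-greedy
at_least = digits[/\d{5,}/]     # 5 or more digits
exactly  = digits[/\d{3}/]      # exactly 3 digits

p greedy     # "<b>bold</b> and <i>italic</i>"
p lazy       # "<b>"
p bounded    # "1234"
p shortest   # "12"
p at_least   # "1234567"
p exactly    # "123"
```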


Alternation |
a = "red ball blue sky"
show_regexp(a, /d|e/)                  r<<e>>d ball blue sky
show_regexp(a, /al|lu/)                red b<<al>>l blue sky
show_regexp(a, /red ball|angry sky/)   <<red ball>> blue sky


Grouping ()
Everything within the group is treated as a single regular expression.

show_regexp('banana', /an*/)         b<<an>>ana
show_regexp('banana', /(an)*/)       <<>>banana
show_regexp('banana', /(an)+/)       b<<anan>>a
a = 'red ball blue sky'
show_regexp(a, /blue|red/)             <<red>> ball blue sky
show_regexp(a, /(blue|red) \w+/)       <<red ball>> blue sky
show_regexp(a, /(red|blue) \w+/)       <<red ball>> blue sky
show_regexp(a, /red|blue \w+/)         <<red>> ball blue sky
show_regexp(a, /red (ball|angry) sky/)       no match
a = 'the red angry sky'
show_regexp(a, /red (ball|angry) sky/)       the <<red angry sky>>


Within the pattern, the sequence \1 refers to the match of the first group, \2 the second group, and so on. Outside the pattern, the special variables $1, $2, and so on, serve the same purpose.
"12:50am" =~ /(\d\d):(\d\d)(..)/     0
"Hour is #$1, minute #$2"            "Hour is 12, minute 50"
"12:50am" =~ /((\d\d):(\d\d))(..)/   0
"Time is #$1"                        "Time is 12:50"
"Hour is #$2, minute #$3"            "Hour is 12, minute 50"
"AM/PM is #$4"                       "AM/PM is am"

Backreferences let you look for various forms of repetition.
# match duplicated letter
show_regexp('He said "Hello"', /(\w)\1/)   He said "He<<ll>>o"
# match duplicated substrings
show_regexp('Mississippi', /(\w+)\1/)      M<<ississ>>ippi

Backreferences can also match paired delimiters.
show_regexp('He said "Hello"', /(["']).*?\1/)   He said <<"Hello">>
show_regexp("He said 'Hello'", /(["']).*?\1/)   He said <<'Hello'>>
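
Two more backreference patterns, as a sketch. The pattern names and sample strings are made up for illustration.

```ruby
# Sample strings are made up for illustration.
doubled = /\b(\w+) \1\b/    # the same word twice in a row
quoted  = /(["']).*?\1/     # text between matching quote characters

repeat = "Paris in the the spring"[doubled]
phrase = %q{say "it's" now}[quoted]   # the inner ' does not close the "
none   = %q{say "oops' now}[quoted]   # mismatched quotes do not match

p repeat   # "the the"
p phrase   # "\"it's\""
p none     # nil
```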


Pattern-Based Substitution
String#sub performs one replacement
String#gsub replaces every occurrence of the match

a = "the quick brown fox"
a.sub(/[aeiou]/, '*')      "th* quick brown fox"
a.gsub(/[aeiou]/, '*')     "th* q**ck br*wn f*x"
a.sub(/\s\S+/, '')         "the brown fox"
a.gsub(/\s\S+/, '')        "the"

The replacement can also be computed by a block, which receives the matched text:
a = "the quick brown fox"
a.sub(/^./) {|match| match.upcase }         "The quick brown fox"
a.gsub(/[aeiou]/) {|vowel| vowel.upcase }   "thE qUIck brOwn fOx"

def mixed_case(name)
  name.gsub(/\b\w/) {|first| first.upcase }
end
mixed_case("fats waller")               "Fats Waller"
mixed_case("louis armstrong")           "Louis Armstrong"
mixed_case("strength in numbers")       "Strength In Numbers"
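
Inside the block, $1, $2, and so on refer to the groups of the current match. A minimal sketch; swap_names is a hypothetical helper, not part of any library.

```ruby
# swap_names is a hypothetical helper written for this example.
def swap_names(list)
  # $1 and $2 are set for each match before the block runs
  list.gsub(/(\w+), (\w+)/) { "#{$2} #{$1}" }
end

swapped = swap_names("waller, fats; armstrong, louis")
p swapped   # "fats waller; louis armstrong"
```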


Backslash Sequences in the Substitution
"fred:smith".sub(/(\w+):(\w+)/, '\2, \1')   "smith, fred"
"nercpyitno".gsub(/(.)(.)/, '\2\1')         "encryption"
Besides \1, \2, and so on, the replacement string accepts:
\& (last match),
\+ (last matched group),
\` (string prior to match),
\' (string after match),
\\ (a literal backslash).
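
A short sketch of these sequences in use, with a made-up sample string. Note that \' has to be written in a double-quoted string here, because inside single quotes \' is just an escaped quote.

```ruby
s = "quick fox"   # made-up sample

whole  = s.sub(/fox/, '[\&]')   # \& inserts the matched text
before = s.sub(/ /, '<\`>')     # \` inserts the text before the match
after  = s.sub(/ /, "<\\'>")    # \' inserts the text after the match
slash  = s.sub(/ /, '\\\\')     # \\ inserts one literal backslash

p whole     # "quick [fox]"
p before    # "quick<quick>fox"
p after     # "quick<fox>fox"
puts slash  # quick\fox
```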

str = 'a\b\c'                "a\b\c"
str.gsub(/\\/, '\\\\\\\\')   "a\\b\\c"

or
str = 'a\b\c'            "a\b\c"
str.gsub(/\\/, '\&\&')   "a\\b\\c"

or
str = 'a\b\c'               "a\b\c"
str.gsub(/\\/) { '\\\\' }   "a\\b\\c"


example:
n modifier (Japanese)
    def unescapeHTML(string)
      str = string.dup
      str.gsub!(/&(.*?);/n) {
         match = $1.dup
         case match
         when /\Aamp\z/ni           then '&'
         when /\Aquot\z/ni          then '"'
         when /\Agt\z/ni            then '>'
         when /\Alt\z/ni            then '<'
         when /\A#(\d+)\z/n         then Integer($1).chr
         when /\A#x([0-9a-f]+)\z/ni then $1.hex.chr
         end
      }
      str
    end
    puts unescapeHTML("1&lt;2 &amp;&amp; 4&gt;3")
    puts unescapeHTML("&quot;A&quot; = &#65; = &#x41;")
produces:
    1<2 && 4>3
    "A" = A = A

Object-Oriented Regular Expressions
re = /(\d+):(\d+)/     # match a time hh:mm
md = re.match("Time: 12:34am")
md.class                 → MatchData
md[0]          # == $&   → "12:34"
md[1]          # == $1   → "12"
md[2]          # == $2   → "34"
md.pre_match   # == $`   → "Time: "
md.post_match  # == $'   → "am"

re = /(\d+):(\d+)/     # match a time hh:mm
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
md1[1, 2]   → ["12", "34"]
md2[1, 2]   → ["10", "30"]

re = /(\d+):(\d+)/
md1 = re.match("Time: 12:34am")
md2 = re.match("Time: 10:30pm")
[ $1, $2 ]   # last successful match       ["10", "30"]
$~ = md1
[ $1, $2 ]   # previous successful match   ["12", "34"]
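
Since Regexp#match returns nil when nothing matches, the result can be tested directly. A sketch using MatchData#captures and MatchData#begin, which are standard methods not shown above; the variable names are illustrative.

```ruby
re = /(\d+):(\d+)/

md = re.match("Time: 12:34am")
hour, minute = md.captures        # all groups as an array
start = md.begin(0)               # offset where the whole match starts

miss = re.match("no time here")   # nil when nothing matches

p [hour, minute]   # ["12", "34"]
p start            # 6
p miss             # nil
```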

Regex Characters List:
.          any character except newline
[ ]        any single character of set
[^ ]       any single character NOT of set
*          0 or more occurrences of the previous regular expression
*?         0 or more occurrences of the previous regular expression (non-greedy)
+          1 or more occurrences of the previous regular expression
+?         1 or more occurrences of the previous regular expression (non-greedy)
?          0 or 1 occurrence of the previous regular expression
|          alternation
( )        grouping regular expressions
^          beginning of a line or string
$          end of a line or string
{m,n}      at least m but at most n occurrences of the previous regular expression
{m,n}?     at least m but at most n occurrences of the previous regular expression (non-greedy)
\A         beginning of a string
\b         backspace (0x08) (inside [] only)
\b         word boundary (outside [] only)
\B         non-word boundary
\d         digit, same as [0-9]
\D         non-digit
\S         non-whitespace character
\s         whitespace character [ \t\n\r\f]
\W         non-word character
\w         word character [0-9A-Za-z_]
\z         end of a string
\Z         end of a string, or before newline at the end
(?# )      comment
(?: )      grouping without backreferences
(?= )      zero-width positive look-ahead assertion
(?! )      zero-width negative look-ahead assertion
(?ix-ix)   turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.
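
The extension syntax is easier to follow with examples. A rough sketch with made-up sample strings and illustrative variable names:

```ruby
# (?: ) groups without capturing, so $1 is the first capturing group
"red ball" =~ /(?:red|blue) (\w+)/
noun = $1

# (?= ) look-ahead: the text must follow, but is not part of the match
amount = "price: 12 dollars"[/\d+(?= dollars)/]

# (?! ) negative look-ahead: the text must not follow
not_bar = "foobar foobaz"[/foo(?!bar)\w+/]

# (?i: ) turns on case-insensitivity for only part of the pattern
mixed  = "Ruby RUBY" =~ /(?i:ruby) RUBY/
strict = "Ruby ruby" =~ /(?i:ruby) RUBY/

p noun      # "ball"
p amount    # "12"
p not_bar   # "foobaz"
p mixed     # 0
p strict    # nil
```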

Special Character Classes:
[:alnum:]   alpha-numeric characters
[:alpha:]   alphabetic characters
[:blank:]   space and tab only - does not include newlines, carriage returns, etc
[:cntrl:]   control characters
[:digit:]   decimal digits
[:graph:]   graph characters
[:lower:]   lower case characters
[:print:]   printable characters
[:punct:]   punctuation characters
[:space:]   whitespace, including tabs, carriage returns, etc
[:upper:]   upper case characters
[:xdigit:]  hexadecimal digits
