Ruby 1.8 has a rather buggy regexp engine. Some constructs using alternation followed by a Kleene star operator will not behave as expected, or as they do in Python or Perl. The bug is related to backtracking, and it is most easily exposed using the non-greedy version of the Kleene star. There are however expressions that involve the greedy version of Kleene star that are affected by the bug. The bug has been fixed in Ruby 1.9, but a release is not scheduled until October. In the mean time, this page provides a workaround. One of the simplest way to expose the bug is:
irb(main):036:0> /a(ba|.)*?a/.match('axba')
=> nil
irb(main):049:0> /a(.|ba)*?a/.match('axba')[0]
=> "axba"
In order for the bug to occur a few things need to happen:
irb(main):067:0> /aw(x|b|bw|bwa)*?wam/.match('awxbwawam')
=> nil
irb(main):068:0> /aw(x|bwa|bw|b)*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):069:0> /aw(x|bwa|bw|b)*?wam/.match('awxbwam')
=> nil
In the above case, for any permutation of the alternatives one can find an
input string that must be matched, but it is not. The real workaround is more
involved. One needs to "spell out" the backtracking. Common prefixes must be
factored out, and the remaining suffixes must be sorted in descending length
order. Of course, this must be done recursively. Thankfully, this is the same
rule used for ordering alternation parts even when not followed by a
star. Here is the solution for the above regexp:
irb(main):033:0> /aw(x|b(w(a|)|))*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):035:0> /aw(x|b(w(a|)|))*?wam/.match('awxbwam')[0]
=> "awxbwam"
Note however that the following will fail, due alternatives being listed in
the reverse order:
irb(main):046:0> /aw(x|b(|w(|a)))*?wam/.match('awxbwam')
=> nil
Which brings us to an example that fails with the greedy version of
star. Simply replace the non-greedy star above with a greedy one, and it will
still fail:
irb(main):046:0> /aw(x|b(|w(|a)))*wam/.match('awxbwam')
=> nil
irb(main):049:0> /aw(x|b(|w))*wam/.match('awxbwam')
=> nil
Finally, an easier way to remember the workaround correctly is to use the
question mark operator instead of empty alternatives:
irb(main):036:0> /aw(x|b(wa?)?)*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):037:0> /aw(x|b(wa?)?)*?wam/.match('awxbwam')
=> "awxbwam"