Ruby 1.8 regexp bug and workaround

Ruby 1.8 has a rather buggy regexp engine. Some constructs using alternation followed by a Kleene star operator will not behave as expected, or as they do in Python or Perl. The bug is related to backtracking, and it is most easily exposed using the non-greedy version of the Kleene star. There are however expressions that involve the greedy version of Kleene star that are affected by the bug. The bug has been fixed in Ruby 1.9, but a release is not scheduled until October. In the mean time, this page provides a workaround. One of the simplest way to expose the bug is:

irb(main):036:0>  /a(ba|.)*?a/.match('axba')
=> nil
irb(main):049:0>  /a(.|ba)*?a/.match('axba')[0]
=> "axba"
In order for the bug to occur a few things need to happen:
  1. the suffix of an alternative A1 (ba) needs to match a prefix of whatever comes after the alternation (the final a), and attempting to use A1 must cause the entire match to fail.
  2. Another alternative A2 (.) must exist, and must match some prefixes of A1.
  3. For a given string (axba), choosing A2 instead of A1 should normally cause the match to succeed (a..a).
Based on the above example, you may think the workaround is to simply reorder the alternatives, but that's not enough, as the following example shows:
irb(main):067:0> /aw(x|b|bw|bwa)*?wam/.match('awxbwawam')
=> nil
irb(main):068:0> /aw(x|bwa|bw|b)*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):069:0> /aw(x|bwa|bw|b)*?wam/.match('awxbwam')
=> nil
In the above case, for any permutation of the alternatives one can find an input string that must be matched, but it is not. The real workaround is more involved. One needs to "spell out" the backtracking. Common prefixes must be factored out, and the remaining suffixes must be sorted in descending length order. Of course, this must be done recursively. Thankfully, this is the same rule used for ordering alternation parts even when not followed by a star. Here is the solution for the above regexp:
irb(main):033:0> /aw(x|b(w(a|)|))*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):035:0> /aw(x|b(w(a|)|))*?wam/.match('awxbwam')[0]
=> "awxbwam"
Note however that the following will fail, due alternatives being listed in the reverse order:
irb(main):046:0> /aw(x|b(|w(|a)))*?wam/.match('awxbwam')
=> nil
Which brings us to an example that fails with the greedy version of star. Simply replace the non-greedy star above with a greedy one, and it will still fail:
irb(main):046:0> /aw(x|b(|w(|a)))*wam/.match('awxbwam')
=> nil
irb(main):049:0> /aw(x|b(|w))*wam/.match('awxbwam')
=> nil
Finally, an easier way to remember the workaround correctly is to use the question mark operator instead of empty alternatives:
irb(main):036:0> /aw(x|b(wa?)?)*?wam/.match('awxbwawam')[0]
=> "awxbwawam"
irb(main):037:0> /aw(x|b(wa?)?)*?wam/.match('awxbwam')
=> "awxbwam"

Credits

A Spring 2007 CMSC330 student (who shall remain anonymous for now) unknowingly brought the bug to my attention. Nikolas Coukouma has verified that the bug still exist in the current version of Ruby 1.8 (1.8.5p12), and that it does not occur in the new regexp engine that ships with Ruby 1.9.
Vasile Gaburici
Last modified: Wed Mar 7 17:37:34 EST 2007