Regex Complexity
Today I got a shocker.
I tried the not-too-complicated regular expression:
href=['"](?<link>[^?'">]*\??[^'" >]*)[^>]*(?<displayed>[^>]*)</a>
I was working in the excellent RegexBuddy, and ran the regex above on a normal-size HTML page (the regex aims to find all the links in a page). The regex hung, and I got the following message:
The match attempt was aborted early because the regular expression is too complex. The regex engine you plan to use it with may not be able to handle it at all and crash. Look up "catastrophic backtracking" in the help file to learn how to avoid this situation.
I looked up "catastrophic backtracking", and learned that regexes such as "(x+x+)+y" are evil. Sure – but my regex does not contain nested repetition operators!
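To see what "evil" means in practice, here is a small timing sketch of the classic pattern using Python's `re` module (a backtracking engine, like the .NET one discussed here; the timings and input sizes are my own choices, not from the original post):

```python
import re
import time

# The classic "evil" pattern: nested repetition means a run of 'x' characters
# can be split between the inner and outer quantifiers in exponentially many
# ways, and a failing match forces the engine to try them all.
EVIL = re.compile(r'(x+x+)+y')

for n in (10, 14, 18):
    start = time.perf_counter()
    result = EVIL.match('x' * n)   # no trailing 'y', so every split is tried
    elapsed = time.perf_counter() - start
    print(f'n={n}: match={result}, took {elapsed:.4f}s')
```

Each extra `x` roughly doubles the work, which is why the pattern feels fine in quick tests and then explodes on slightly longer input.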
I then tried the regex on a short page, and it worked. This was another surprise, as I had always thought most regex implementations compile the pattern and then run in O(n) (I never took the time to learn the different regex flavors; I just assumed that what I learned at university was the general rule).
It turns out that a common way to implement regex matching uses backtracking, so a regex might work on a short string yet blow up on a longer one. Even simple expressions such as "(a|aa)*b" can take exponential time under such an implementation when the input does not match.
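The root cause is ambiguity: "(a|aa)*" offers many ways to carve up a run of a's, and a backtracking engine tries them all before giving up. Since "(a|aa)*" describes the same strings as "a*", the pattern can be rewritten without the ambiguity. A sketch comparing the two (my own illustration):

```python
import re
import time

AMBIGUOUS = re.compile(r'(a|aa)*b')  # exponentially many ways to split the a's
LINEAR    = re.compile(r'a*b')       # same language, only one way to match

def time_match(pattern, text):
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

text = 'a' * 26  # no trailing 'b', so the ambiguous pattern backtracks heavily
for pat in (AMBIGUOUS, LINEAR):
    result, elapsed = time_match(pat, text)
    print(f'{pat.pattern!r}: match={result}, took {elapsed:.4f}s')
```

Both patterns accept and reject exactly the same strings; only the cost of rejecting differs.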
I looked around a bit, but failed to find a good description of the internal implementation of .NET’s regular expression engine.
BTW, the work-around I used here was to modify the regex. It's not exactly what I aimed for, but it's close enough:
href=['"](?<link>[^'">]*)[^>]*>(?<displayed>[^>]*)</a>
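For illustration, here is that reworked regex running in Python (which spells named groups `(?P<name>...)` rather than .NET's `(?<name>...)`); the sample HTML snippet is my own:

```python
import re

# 'link' stops at a quote or '>', then we skip any remaining attributes,
# match the closing '>', and capture the anchor text up to '</a>'.
LINK_RE = re.compile(r'''href=['"](?P<link>[^'">]*)[^>]*>(?P<displayed>[^>]*)</a>''')

html = '<p>See <a href="http://example.com/page?x=1" class="ext">Example</a></p>'
for m in LINK_RE.finditer(html):
    print(m.group('link'), '->', m.group('displayed'))
# prints: http://example.com/page?x=1 -> Example
```

Note that unlike the first attempt, this version no longer tries to split off the query string – the whole URL lands in `link` – which is the "close enough" compromise.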