Comments on: Regex Complexity http://v1.ripper234.com/p/regex-complexity/ Stuff Ron Gross Finds Interesting Sun, 02 Aug 2015 11:03:35 +0000 hourly 1 https://wordpress.org/?v=4.5.3 By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-389 Wed, 13 Aug 2008 11:31:00 +0000 http://localhost/p/regex-complexity/#comment-389 This thread is going a bit off topic, but as lorg pointed out in a private conversation, his version of the regex works better for cases like “text”. I still need to do a bit of work to find the optimal regex for this purpose.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-397 Wed, 13 Aug 2008 09:00:00 +0000 http://localhost/p/regex-complexity/#comment-397 I’m not 100% sure, but I think it doesn’t have exactly the same semantics as my regex. It could be “good enough”, just like my own modified version.

]]>
By: lorg http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-396 Wed, 13 Aug 2008 08:25:00 +0000 http://localhost/p/regex-complexity/#comment-396 I did a little experiment.
I tried out your original regexp in Python (tweaked to conform to the Python re syntax), on http://qimmortal.blogspot.com/

Indeed, it got stuck.
Then, I rewrote your regexp like so:

r = 'href=[\'"](?P<link>.+?)[\'"](.*?)>(?P<displayed>.*?)</a>'

On the same input, my regexp runs fast, and re.finditer found all links without a problem.
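A self-contained version of this experiment might look like the following (the sample HTML below is invented for illustration; the original run was against the blog page linked above):

```python
import re

# lorg's rewritten pattern: every quantifier is non-greedy, so the
# engine doesn't have to backtrack across large swaths of the page.
pattern = re.compile(
    r'href=[\'"](?P<link>.+?)[\'"](.*?)>(?P<displayed>.*?)</a>'
)

# Made-up sample input; the real test ran against the blog page above.
html = ('<a href="http://example.com" class="ext">Example</a> '
        "and <a href='/about'>About</a>")

for m in pattern.finditer(html):
    print(m.group('link'), '->', m.group('displayed'))
```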

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-395 Tue, 12 Aug 2008 20:22:00 +0000 http://localhost/p/regex-complexity/#comment-395 Thanks, Eli, for the pointer to the interesting and rather long discussion. My viewpoint: Perl and other relevant languages should provide a Thompson NFA implementation, because this problem is not limited to so-called pathological cases, but does come up (albeit rarely) in practice.

Of course it’s a task to write an efficient implementation, and it has to be prioritized, tested, integrated… but in a perfect world, my regexes would not get stuck, and I would not have to learn the intricacies of which cases are pathological (and if the class of pathological cases can be described so simply, it should be simple to test for them and fall back to the NFA only in those cases).
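To illustrate what “pathological” means here, a standard textbook example (not the actual regex from the post): nested quantifiers like (a+)+ let a backtracking engine split a run of characters in exponentially many ways before it concludes there is no match.

```python
import re
import time

# Illustrative pathological pattern for backtracking engines: the
# nested quantifiers in (a+)+ allow exponentially many ways to
# partition a run of 'a's, all of which are tried before failing.
pathological = re.compile(r'(a+)+$')

for n in (10, 14, 18):
    s = 'a' * n + 'b'  # the trailing 'b' forces a failed match
    t0 = time.perf_counter()
    assert pathological.match(s) is None
    print(n, round(time.perf_counter() - t0, 5))
```

The runtime roughly doubles with each extra ‘a’; a Thompson-NFA engine (as in awk and grep) answers the same question in linear time.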

]]>
By: Eli http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-394 Tue, 12 Aug 2008 18:06:00 +0000 http://localhost/p/regex-complexity/#comment-394 Good stuff. Regex implementations are a fascinating topic, and I spent a few pleasant hours exploring the subject a few years ago.

The link Tomer Gabel posted is excellent and has its place, but it is important to see the whole picture, which is why it is very educational to read the ensuing discussion on PerlMonks, which explains *why* Perl (and others) chose what they chose, and how to avoid such problems.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-393 Mon, 11 Aug 2008 12:25:00 +0000 http://localhost/p/regex-complexity/#comment-393 It appears Java and Perl also suffer from this. It's beyond me why these languages do not implement the efficient NFA approach (the same O(n) algorithm that is taught in basic automata classes, and was implemented in awk and grep years ago).

Backtracking should definitely not be used for regular expressions that do not contain back references.

From the link you posted:

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

]]>
By: Tomer Gabel http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-392 Mon, 11 Aug 2008 12:02:00 +0000 http://localhost/p/regex-complexity/#comment-392 There’s a classic article on the effect of backtracking support on most modern regex implementations. I’m surprised you didn’t know that.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-391 Sun, 10 Aug 2008 20:36:00 +0000 http://localhost/p/regex-complexity/#comment-391 It’s not always what you want. But in any case, the regex I composed initially had non-greedy operators – it was still too slow (i.e. it never finished). I tried some tweaking to make it faster, including replacing every * with {0,200}, but it didn’t do much for the speed.

]]>
By: lorg http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-390 Sun, 10 Aug 2008 19:32:00 +0000 http://localhost/p/regex-complexity/#comment-390 Along with other approaches, I suggest constructing your regex with non-greedy operators (in Python these are “+?” and “*?”).
I also found that this is what I actually need in most cases. For example,
“x.*?x” will match from the first x to the next x, with no other x in between.
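A minimal sketch of the difference (the sample string is made up):

```python
import re

s = 'x123x456x'

# Greedy: .* grabs as much as possible, so the match spans from
# the first x all the way to the LAST x.
assert re.search(r'x.*x', s).group() == 'x123x456x'

# Non-greedy: .*? stops as early as possible, so the match spans
# from the first x only to the NEXT x.
assert re.search(r'x.*?x', s).group() == 'x123x'
```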

]]>