Comments on: Regex Complexity http://v1.ripper234.com/p/regex-complexity/ Stuff Ron Gross Finds Interesting Sun, 02 Aug 2015 11:03:35 +0000 hourly 1 https://wordpress.org/?v=4.5.3 By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-389 Wed, 13 Aug 2008 11:31:00 +0000 http://localhost/p/regex-complexity/#comment-389 This thread is going a bit off topic, but as lorg pointed out in a private conversation, his version of the regex works better for cases like “text”. I still need to do a bit of work to find the optimal regex for this purpose.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-397 Wed, 13 Aug 2008 09:00:00 +0000 http://localhost/p/regex-complexity/#comment-397 I’m not 100% sure, but I think it doesn’t have exactly the same semantics as my regex. It could be “good enough”, just like my own modified version.

]]>
By: lorg http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-396 Wed, 13 Aug 2008 08:25:00 +0000 http://localhost/p/regex-complexity/#comment-396 I did a little experiment.
I tried out your original regexp in Python (tweaked to conform to the Python re syntax), on http://qimmortal.blogspot.com/

Indeed, it got stuck.
Then, I rewrote your regexp like so:

r = 'href=[\'"](?P<link>.+?)[\'"](.*?)>(?P<displayed>.*?)</a>'

On the same input, my regexp runs fast, and re.finditer found all links without a problem.
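A self-contained version of this experiment might look like the following (the sample HTML below is invented for illustration; the original run was against the blog page linked above):

```python
import re

# lorg's rewritten pattern: every quantifier is non-greedy, so the
# engine doesn't have to backtrack across large swaths of the page.
pattern = re.compile(
    r'href=[\'"](?P<link>.+?)[\'"](.*?)>(?P<displayed>.*?)</a>'
)

# Made-up sample input; the real test ran against the blog page above.
html = ('<a href="http://example.com" class="ext">Example</a> '
        "and <a href='/about'>About</a>")

for m in pattern.finditer(html):
    print(m.group('link'), '->', m.group('displayed'))
```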

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-395 Tue, 12 Aug 2008 20:22:00 +0000 http://localhost/p/regex-complexity/#comment-395 Thanks, Eli, for the pointer to the interesting and rather long discussion. My viewpoint: Perl and other relevant languages should provide a Thompson NFA implementation, because this problem is not limited to so-called pathological cases, but does come up (albeit rarely) in practice.

Of course it’s a task to write an efficient implementation, and it has to be prioritized, tested, integrated… but in a perfect world, my regexes would not get stuck, and I would not have to learn the intricacies of which cases are pathological (and if the class of pathological cases can be described so simply, it should be simple to test for them and fall back to the NFA only in those cases).
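To illustrate what “pathological” means here, a standard textbook example (not the actual regex from the post): nested quantifiers like (a+)+ let a backtracking engine split a run of characters in exponentially many ways before it concludes there is no match.

```python
import re
import time

# Illustrative pathological pattern for backtracking engines: the
# nested quantifiers in (a+)+ allow exponentially many ways to
# partition a run of 'a's, all of which are tried before failing.
pathological = re.compile(r'(a+)+$')

for n in (10, 14, 18):
    s = 'a' * n + 'b'  # the trailing 'b' forces a failed match
    t0 = time.perf_counter()
    assert pathological.match(s) is None
    print(n, round(time.perf_counter() - t0, 5))
```

The runtime roughly doubles with each extra ‘a’; a Thompson-NFA engine (as in awk and grep) answers the same question in linear time.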

]]>
By: Eli http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-394 Tue, 12 Aug 2008 18:06:00 +0000 http://localhost/p/regex-complexity/#comment-394 Good stuff. Regex implementations are a fascinating topic, and I spent a few pleasant hours exploring the subject a few years ago.

The link Tomer Gabel posted is excellent and has its place, but it is important to see the whole picture, which is why it is very educational to read the ensuing discussion on PerlMonks, which explains *why* Perl (and others) chose what they chose, and how to avoid such problems.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-393 Mon, 11 Aug 2008 12:25:00 +0000 http://localhost/p/regex-complexity/#comment-393 It appears Java and Perl also suffer from this. It's beyond me why these languages do not implement the efficient NFA approach (the same O(n) algorithm that is taught in basic automata classes, and was implemented in awk and grep years ago).

Backtracking should definitely not be used for regular expressions that do not contain back references.

From the link you posted:

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

]]>
By: Tomer Gabel http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-392 Mon, 11 Aug 2008 12:02:00 +0000 http://localhost/p/regex-complexity/#comment-392 There’s a classic article on the effect of backtracking support on most modern regex implementations. I’m surprised you didn’t know that.

]]>
By: ripper234 http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-391 Sun, 10 Aug 2008 20:36:00 +0000 http://localhost/p/regex-complexity/#comment-391 It’s not always what you want. But in any case, the regex I composed initially had non-greedy operators – it was still too slow (i.e. it never finished). I tried some tweaking to make it faster, including replacing every * with {0,200}, but it didn’t do much for the speed.

]]>
By: lorg http://v1.ripper234.com/p/regex-complexity/comment-page-1/#comment-390 Sun, 10 Aug 2008 19:32:00 +0000 http://localhost/p/regex-complexity/#comment-390 Along with other approaches, I suggest constructing your regex with non-greedy operators (in Python these are “+?” and “*?”).
I also found that this is what I actually need in most cases. For example,
“x.*?x” will match from the first x to the next x, with no other x in between.
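A minimal sketch of the difference (the sample string is made up):

```python
import re

s = 'x123x456x'

# Greedy: .* grabs as much as possible, so the match spans from
# the first x all the way to the LAST x.
assert re.search(r'x.*x', s).group() == 'x123x456x'

# Non-greedy: .*? stops as early as possible, so the match spans
# from the first x only to the NEXT x.
assert re.search(r'x.*?x', s).group() == 'x123x'
```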

]]>