Thread: BUG #6381: Incorrect greediness behavior in certain regular expressions
From: code@phaedrusdeinus.org
The following bug has been logged on the website:

Bug reference:      6381
Logged by:          john melesky
Email address:      code@phaedrusdeinus.org
PostgreSQL version: 9.1.1
Operating system:   x86_64-pc-linux-gnu
Description:

This simple regexp returns correctly (that is, (.*?) matches
'blahblah.com'):

=# select regexp_matches('http://blahblah.com/asdf', 'http://(.*?)(/|%2f|$)');
  regexp_matches
------------------
 {blahblah.com,/}

This, more complex/complete version, matches greedily, which is incorrect:

=# select regexp_matches('http://blahblah.com/asdf',
'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f|$)');
         regexp_matches
--------------------------------
 {"",:,//,blahblah.com/asdf,""}

(That is, (.*?) matches 'blahblah.com/asdf')

The problem appears to be the inclusion of '$' in the final paren group. So,
this works:

select regexp_matches('http://blahblah.com/asdf',
'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f)');
      regexp_matches
--------------------------
 {"",:,//,blahblah.com,/}
code@phaedrusdeinus.org writes:
> This, more complex/complete version, matches greedily, which is incorrect:

> =# select regexp_matches('http://blahblah.com/asdf',
> 'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f|$)');
>          regexp_matches
> --------------------------------
>  {"",:,//,blahblah.com/asdf,""}

I do not believe this is a bug; the RE code appears to me to be
following its specification, as per the detailed rules in section
9.7.3.5:
http://www.postgresql.org/docs/9.1/static/functions-matching.html#POSIX-MATCHING-RULES

Specifically, the regex as a whole is considered greedy because the
first subpart with a greediness attribute is the (s?) piece, and the ?
quantifier is greedy. Therefore, the regex as a whole matches the
longest possible string, which in this case will be up to the end of the
input. It's true that the (.*?) subpart is non-greedy, but that only
affects how much of the overall match length can get assigned to that
subpart relative to other subparts, and in this case there is no
flexibility to assign it more or less of the match once the total match
length is determined.

The reason your second case "works" as you expect is that there's no way
for it to match anything beyond the last slash:

> select regexp_matches('http://blahblah.com/asdf',
> 'http(s?)(:|%3a)(//|%2f%2f)(.*?)(/|%2f)');
>       regexp_matches
> --------------------------
>  {"",:,//,blahblah.com,/}

There's only one possible match here, independently of whether the (.*?)
portion is greedy or not.

With this example, you could get the results you're after by using the
non-greedy ?? operator for the first subpart:

regression=# select regexp_matches('http://blahblah.com/asdf',
'http(s??)(:|%3a)(//|%2f%2f)(.*?)(/|%2f|$)');
      regexp_matches
--------------------------
 {"",:,//,blahblah.com,/}

although I cannot tell whether that generalizes to your real problem.

			regards, tom lane
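
To illustrate the rule cited in the reply above, here is a minimal sketch using the two example queries that appear in the same matching-rules section of the PostgreSQL documentation. The input string and patterns come from those docs, not from this bug report, and the results in the comments are the ones the documentation states. In both queries only the first quantified atom differs, yet it changes how much the later group captures.

```sql
-- Greedy first quantifier (Y*): the RE as a whole is greedy, so the overall
-- match is the longest possible one starting at the Y ("Y123"), and the
-- parenthesized part captures three digits.
SELECT substring('XY1234Z', 'Y*([0-9]{1,3})');   -- documented result: 123

-- Non-greedy first quantifier (Y*?): the RE as a whole is non-greedy, so the
-- overall match is the shortest possible one starting at the Y ("Y1"), and
-- the parenthesized part captures a single digit.
SELECT substring('XY1234Z', 'Y*?([0-9]{1,3})');  -- documented result: 1
```

The same mechanism is at work in the report: (s?) plays the role of Y*, and changing it to (s??) corresponds to the switch to Y*?.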