Thread: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
christian_maechler@hotmail.com
Date:
The following bug has been logged on the website: Bug reference: 13538 Logged by: Chris Mächler Email address: christian_maechler@hotmail.com PostgreSQL version: 9.3.0 Operating system: ? Description: Here is an example to verify and reproduce the error (extract a number and the things before and after it with 3 groups): '(.*)([+-]?[0-9]*\.[0-9]+)(.*)' Using regexü_matches this will produce an undesirable result (only one digit in group 2), but everything behaves correctly, the third group matches until the end. '(.*?)([+-]?[0-9]*\.[0-9]+)(.*)' If we change the first group to non-greedy to fix this, then the bug appears: the third group becomes non-greedy too (it shouldn't!) and therefore it is always empty instead of matching until the end of the line. Also the first group is empty (should match from start!), it should find a match at start position, whether it is non-greedy or not and not look ahead if the non-greedy group can be reduced if starting to match at the next index. Both are wrong behaviors. (the workaround is anchoring, but the behavior of the regex is still wrong) link: http://sqlfiddle.com/#!15/f0f14/14
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
"David G. Johnston"
Date:
On Monday, August 3, 2015, <christian_maechler@hotmail.com> wrote: > The following bug has been logged on the website: > > Bug reference: 13538 > Logged by: Chris M=C3=A4chler > Email address: christian_maechler@hotmail.com <javascript:;> > PostgreSQL version: 9.3.0 > Operating system: ? > Description: > > Here is an example to verify and reproduce the error (extract a number an= d > the things before and after it with 3 groups): > > > '(.*)([+-]?[0-9]*\.[0-9]+)(.*)' > > Using regex=C3=BC_matches this will produce an undesirable result (only o= ne > digit > in group 2), but everything behaves correctly, the third group matches > until > the end. > > '(.*?)([+-]?[0-9]*\.[0-9]+)(.*)' > > If we change the first group to non-greedy to fix this, then the bug > appears: the third group becomes non-greedy too (it shouldn't!) and > therefore it is always empty instead of matching until the end of the lin= e. > Also the first group is empty (should match from start!), it should find = a > match at start position, whether it is non-greedy or not and not look ahe= ad > if the non-greedy group can be reduced if starting to match at the next > index. Both are wrong behaviors. > > (the workaround is anchoring, but the behavior of the regex is still wron= g) > > link: http://sqlfiddle.com/#!15/f0f14/14 > > > Reading the documentation this seems to be working as intended. http://www.postgresql.org/docs/9.3/static/functions-matching.html#POSIX-MAT= CHING-RULES On what are you basing your concept of correctness? Specifically, what language implementation do you consider "right"? The TCL implementation used by PostgreSQL has some differences compared to Java and Perl, the two I am most familiar with. David J.
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
Christian Mächler
Date:
<div dir="ltr">Sorry probably sent my mail to the wrong address before.<br /><br />Here some more detailed examples to showwhy the behavior of the 3rd group is clearly wrong also according to the specification:<br /><br />abc0.123def applyingthe first regex (with optional dot)--> abc0.12 , 3, def<br /><br />abc0.123def applying the 2nd regex (with optionaldot) --> , 3 ,<br /><br />Although the 3rd group wasn't touched it changed it's behavior from geedy to non-greedy.abc, 0.123, def would be correct.<br /><br />The behavior of the non-greedy group isn't correct either, becausea match should return the FIRST match found, greedyness is only about the FOLLOWING characters, but because it willalready find a match at start position it should return that and not keep looking to find another match with reducedgroup size.<br /><br />I was just trying to help, because I think this is a pretty big issue, although not using regexatm.<br /><br />Chris<br /></div>
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
Tom Lane
Date:
Christian Mächler <christian_maechler@hotmail.com> writes: > Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification: What specification are you reading? The POSIX standard for regular expressions doesn't mention non-greedy quantifiers at all. As David says, these examples appear to be following what's stated in http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES The Spencer regex engine we use has a notion of greediness or non-greediness of the entire regex, and further that that takes precedence for determining the overall match length over greediness of individual subexpressions. That behavior might be inconvenient for this particular use-case, but that doesn't make it a bug. regards, tom lane
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
Tom Lane
Date:
I wrote: > As David says, these examples appear to be following what's stated in > http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES > The Spencer regex engine we use has a notion of greediness or > non-greediness of the entire regex, and further that that takes precedence > for determining the overall match length over greediness of individual > subexpressions. That behavior might be inconvenient for this particular > use-case, but that doesn't make it a bug. BTW, perhaps it would be worth adding an example to that section that shows how to control this behavior. The trick is obvious once you've seen it, but not so much otherwise: you add something to the start of the regex that establishes the overall greediness you want, but can never actually match any characters. "\0*" or "\0*?" will work fine in Postgres use-cases since there can never be a NUL character in the data. regression=# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)'); regexp_matches ----------------- {abc0123,4,xyz} (1 row) regression=# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)'); regexp_matches ---------------- {abc,0,""} (1 row) regression=# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)'); regexp_matches ----------------- {abc,01234,xyz} (1 row) regards, tom lane
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
"David G. Johnston"
Date:
On Tue, Aug 4, 2015 at 8:39 AM, Christian M=C3=A4chler < christian_maechler@hotmail.com> wrote: > You say it is okay that a greedy group suddenly becomes non-greedy if > ANOTHER group is made non-greedy? > > I've chosen a simple example, but I'm pretty sure I could construct > several use-cases which can be solved easily if the regex behaves like in > java, javaScript, perl etc. but not with how it is done here. It's clear= ly > not a feature. Already simple things like ending a match with any amount = of > numbers will become difficult if non-greedy groups are present, e.g. > instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9]) makes > things easier... > > Seriously I didn't want to start a debate whether this is right or wrong, > because I honestly can't understand how anyone could defend the behavior > mentioned in the first sentence of this message. As I said, I just wanted > to point out that there is a bug to help improve, but if you prefer it li= ke > this it is fine with me, I just think then you probably haven't used rege= x > that much. > > =E2=80=8BI use RegEx quite a bit and while I will agree with the sentiment = it would be nearly impossible to replace the existing implementation. Fortunately, PostgreSQL is quite extensible and so if you need a more flexible RegEx implementation you can write a function in pl/perl or pl/v8 and use their implementations. About the only thing that would make sense would be to get the TCL implementation to accept the "possessive" modifier (+ in Java: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html) and let it override the "overall greediness" aspect of the matching region while leaving the unadorned case to use the overall aspect. There is probable a bit more consideration than the brief amount I've done here - though I think the point has been made. Right or wrong it is a design choice that has been made and in use for many years. It works well enough, and options/hacks exist, that finding someone who wants to dedicate resources to improving the situation is likely to be difficult. David J. =E2=80=8B
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
"David G. Johnston"
Date:
On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: > > As David says, these examples appear to be following what's stated in > > > http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-M= ATCHING-RULES > > The Spencer regex engine we use has a notion of greediness or > > non-greediness of the entire regex, and further that that takes > precedence > > for determining the overall match length over greediness of individual > > subexpressions. That behavior might be inconvenient for this particula= r > > use-case, but that doesn't make it a bug. > > BTW, perhaps it would be worth adding an example to that section that > shows how to control this behavior. The trick is obvious once you've see= n > it, but not so much otherwise: you add something to the start of the rege= x > that establishes the overall greediness you want, but can never actually > match any characters. "\0*" or "\0*?" will work fine in Postgres > use-cases since there can never be a NUL character in the data. > > regression=3D# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)'); > regexp_matches > ----------------- > {abc0123,4,xyz} > (1 row) > > regression=3D# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)'); > regexp_matches > ---------------- > {abc,0,""} > (1 row) > > regression=3D# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)'); > regexp_matches > ----------------- > {abc,01234,xyz} > (1 row) > > =E2=80=8B+1 David J.=E2=80=8B
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
Christian Mächler
Date:
You say it is okay that a greedy group suddenly becomes non-greedy if ANOTHER group is made non-greedy?
I've chosen a simple example, but I'm pretty sure I could construct several use-cases which can be solved easily if the regex behaves like in java, javaScript, perl etc. but not with how it is done here. It's clearly not a feature. Already simple things like ending a match with any amount of numbers will become difficult if non-greedy groups are present, e.g. instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9]) makes things easier...
Seriously I didn't want to start a debate whether this is right or wrong, because I honestly can't understand how anyone could defend the behavior mentioned in the first sentence of this message. As I said, I just wanted to point out that there is a bug to help improve, but if you prefer it like this it is fine with me, I just think then you probably haven't used regex that much.
Chris
I've chosen a simple example, but I'm pretty sure I could construct several use-cases which can be solved easily if the regex behaves like in java, javaScript, perl etc. but not with how it is done here. It's clearly not a feature. Already simple things like ending a match with any amount of numbers will become difficult if non-greedy groups are present, e.g. instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9]) makes things easier...
Seriously I didn't want to start a debate whether this is right or wrong, because I honestly can't understand how anyone could defend the behavior mentioned in the first sentence of this message. As I said, I just wanted to point out that there is a bug to help improve, but if you prefer it like this it is fine with me, I just think then you probably haven't used regex that much.
Chris
> From: tgl@sss.pgh.pa.us
> To: christian_maechler@hotmail.com
> CC: david.g.johnston@gmail.com; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
> Date: Tue, 4 Aug 2015 10:58:57 -0400
>
> Christian Mächler <christian_maechler@hotmail.com> writes:
> > Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:
>
> What specification are you reading? The POSIX standard for regular
> expressions doesn't mention non-greedy quantifiers at all.
>
> As David says, these examples appear to be following what's stated in
> http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
> The Spencer regex engine we use has a notion of greediness or
> non-greediness of the entire regex, and further that that takes precedence
> for determining the overall match length over greediness of individual
> subexpressions. That behavior might be inconvenient for this particular
> use-case, but that doesn't make it a bug.
>
> regards, tom lane
> To: christian_maechler@hotmail.com
> CC: david.g.johnston@gmail.com; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
> Date: Tue, 4 Aug 2015 10:58:57 -0400
>
> Christian Mächler <christian_maechler@hotmail.com> writes:
> > Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:
>
> What specification are you reading? The POSIX standard for regular
> expressions doesn't mention non-greedy quantifiers at all.
>
> As David says, these examples appear to be following what's stated in
> http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
> The Spencer regex engine we use has a notion of greediness or
> non-greediness of the entire regex, and further that that takes precedence
> for determining the overall match length over greediness of individual
> subexpressions. That behavior might be inconvenient for this particular
> use-case, but that doesn't make it a bug.
>
> regards, tom lane
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, perhaps it would be worth adding an example to that section that >> shows how to control this behavior. > +1 When I looked closer, I noticed that the docs already had a recommendation about how to force overall greediness or non-greediness, and it was cleaner than the \0* hack I'd come up with on the spur of the moment. So I extended that with an example. For-the-archives: it strikes me that if we ever did want to break backwards compatibility here in order to make it act a bit more like Perl's regexps, we could try making the concatenation rule be that the overall RE inherits the greediness of its last quantified atom rather than its first one. But I'm not sure how close an approximation that would produce to Perl's engine's behavior; there would probably still be some discrepancies. I doubt it's worth breaking backwards compatibility for, if we'd still get complaints from Perl users that our regex engine is broken because it's not bug-compatible with Perl's. regards, tom lane
Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
From
John R Pierce
Date:
On 8/4/2015 6:20 PM, Tom Lane wrote: > if we'd still get complaints from Perl users that our regex engine is > broken because it's not bug-compatible with Perl's. Some wag once said... "Some folks, when faced with a problem, go, 'I know, I'll use Regular Expressions'. Now they have TWO problems!" -- john r pierce, recycling bits in santa cruz