Thread: Why this regexp matches?!

From
hubert depesz lubaczewski
Date:
select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';

what's worse:
$ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1' );
 regexp_replace
────────────────
 depesz
(1 row)

I know that Pg regexps are limited, but even grep's regexps match this
correctly:

=$ printf 'depesz depesz depesz\ndepesz depeszx depesz\n' | grep -E '^(.*)( \1)+$';
depesz depesz depesz

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Why this regexp matches?!

From
Szymon Guz
Date:


On 4 February 2012 09:46, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
>
> what's worse:
> $ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1' );
>  regexp_replace
> ────────────────
>  depesz
> (1 row)
>
> I know that Pg regexps are limited, but even grep's regexps match this
> correctly:
>
> =$ printf 'depesz depesz depesz\ndepesz depeszx depesz\n' | grep -E '^(.*)( \1)+$';
> depesz depesz depesz
>
> Best regards,
>
> depesz


Hi,
some time ago I hit the same problem, however the solution was a little bit tricky. I didn't have time to investigate it, but this works:

postgres@postgres:5840=#  select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\\\1)+$', E'\\\\1' );
    regexp_replace     
-----------------------
 depesz depeszx depesz
(1 row)


regards
Szymon
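
A note on the doubled escaping above, sketched in Python rather than PostgreSQL (the variable names are illustrative, and Python's re engine is only a stand-in here). Inside an E'' string, \\\\1 reaches the regexp engine as \\1, which reads as an escaped backslash followed by the digit 1 instead of a back-reference. Under that reading the pattern does not match the input at all, and a replace call hands its input back unchanged, which would account for the output shown.

```python
import re

text = "depesz depeszx depesz"

# One level of escaping: \1 is a back-reference to group 1.
backref = re.compile(r"^(.*)( \1)+$")

# One level too many: \\ is a literal backslash, followed by the digit 1.
literal = re.compile(r"^(.*)( \\1)+$")

print(backref.search(text))        # None: the words are not all the same
print(literal.search(text))        # None: the text contains no backslash
print(literal.sub(r"\\1", text))   # depesz depeszx depesz (returned unchanged)

# The over-escaped pattern only matches text containing a literal "\1".
print(literal.search("x \\1").group(1))  # x
```

So the doubled version looks like it gives the right answer only because it never matches anything in this input.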

Re: Why this regexp matches?!

From
hubert depesz lubaczewski
Date:
On Sat, Feb 04, 2012 at 09:54:34AM +0100, Szymon Guz wrote:
> On 4 February 2012 09:46, hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> > select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
> >
> > what's worse:
> > $ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1'
> > );
> >  regexp_replace
> > ────────────────
> >  depesz
> > (1 row)
> >
> > I know that Pg regexps are limited, but even grep's regexps match this
> > correctly:
> >
> > =$ printf 'depesz depesz depesz\ndepesz depeszx depesz\n' | grep -E
> > '^(.*)( \1)+$';
> > depesz depesz depesz
> >
> > Best regards,
> >
> > depesz
> >
> >
> Hi,
> some time ago I hit the same problem, however the solution was a little bit
> tricky. I didn't have time to investigate it, but this works:
>
> postgres@postgres:5840=#  select regexp_replace( 'depesz depeszx depesz',
> E'^(.*)( \\\\1)+$', E'\\\\1' );
>     regexp_replace
> -----------------------
>  depesz depeszx depesz
> (1 row)

not sure if I understand your point.

This regexp was meant to find repeated substrings.

Like this one does in perl:

/^(.*)( \1)+$/

We can see how it works with:
=$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depesz depesz'
is repeat of [depesz]

=$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depeszx depesz'
is not repeated

reason why your regexp matches is also a mystery for me.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Why this regexp matches?!

From
David Johnston
Date:
On Feb 4, 2012, at 3:58, hubert depesz lubaczewski <depesz@depesz.com> wrote:

> On Sat, Feb 04, 2012 at 09:54:34AM +0100, Szymon Guz wrote:
>> On 4 February 2012 09:46, hubert depesz lubaczewski <depesz@depesz.com> wrote:
>>
>>> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
>>>
>>> what's worse:
>>> $ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1'
>>> );
>>> regexp_replace
>>> ────────────────
>>> depesz
>>> (1 row)
>>>
>>> I know that Pg regexps are limited, but even grep's regexps match this
>>> correctly:
>>>
>>> =$ printf 'depesz depesz depesz\ndepesz depeszx depesz\n' | grep -E
>>> '^(.*)( \1)+$';
>>> depesz depesz depesz
>>>
>>> Best regards,
>>>
>>> depesz
>>>
>>>
>> Hi,
>> some time ago I hit the same problem, however the solution was a little bit
>> tricky. I didn't have time to investigate it, but this works:
>>
>> postgres@postgres:5840=#  select regexp_replace( 'depesz depeszx depesz',
>> E'^(.*)( \\\\1)+$', E'\\\\1' );
>>    regexp_replace
>> -----------------------
>> depesz depeszx depesz
>> (1 row)
>
> not sure if I understand your point.
>
> This regexp was meant to find repeated substrings.
>
> Like this one does in perl:
>
> /^(.*)( \1)+$/
>
> We can see how it works with:
> =$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depesz depesz'
> is repeat of [depesz]
>
> =$ perl -e 'if ( shift =~ m/^(.*)( \1)+$/ ) { print "is repeat of [$1]\n" } else {print "is not repeated\n"}' 'depesz depeszx depesz'
> is not repeated
>
> reason why your regexp matches is also a mystery for me.
>
> Best regards,
>
> depesz
>
>

Don't know the answer (if there is one other than 'it's a bug'), but as a workaround you can split the string on
whitespace, then perform grouping and see if more than one record results...

David J.
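
A minimal sketch of that split-and-group workaround, written in Python to show the logic only; the function name is hypothetical and this is not a PostgreSQL query. It assumes the repeated unit is a single word and the separator is exactly one space.

```python
def is_repeated_word(text: str) -> bool:
    """True when text is two or more copies of one word joined by single spaces."""
    words = text.split(" ")
    # Grouping the words: exactly one distinct group means every word is the same.
    return len(words) >= 2 and len(set(words)) == 1


print(is_repeated_word("depesz depesz depesz"))   # True
print(is_repeated_word("depesz depeszx depesz"))  # False
```

This is narrower than the regexp: a string such as 'a b a b' repeats the unit 'a b', which ^(.*)( \1)+$ is meant to accept, but the word-level grouping reports it as not repeated.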

Re: Why this regexp matches?!

From
Alban Hertroys
Date:
On 4 Feb 2012, at 9:46, hubert depesz lubaczewski wrote:

> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';

Peculiar.

It's probably no use to you, but a version where the repetition is expanded (for that particular string) works:

select 'depesz depeszx depesz' ~ E'^(.*)( \\1)( \\1)$';

And this works too:

select 'depesz depeszx depesz' ~ E'^(depesz)( \\1)+$';

Apparently something odd is going on between the wildcard, the repetitive part and the back-reference. That could be
just us not seeing what's wrong with the expression or be an actual bug.

> I know that Pg regexps are limited, but even grep's regexps match this

Limited? They're really not. According to the docs they are beyond POSIX compliant, even including several extensions
as they appear in, among others, Perl. That said, the docs do mention a known limitation with braces and
forward-references - maybe this is related.

Alban Hertroys
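
As a cross-check of what the three pattern variants in this message are expected to return, here is a small sketch using Python's backtracking re engine as a reference point. The helper name is illustrative, and Python's engine is a different implementation from PostgreSQL's, so this only shows the intended semantics.

```python
import re

# The original pattern, the expanded repetition, and the fixed-word variant.
PATTERNS = [
    r"^(.*)( \1)+$",
    r"^(.*)( \1)( \1)$",
    r"^(depesz)( \1)+$",
]


def check(text: str) -> list[bool]:
    """Report, per pattern, whether the text matches."""
    return [re.search(p, text) is not None for p in PATTERNS]


print(check("depesz depeszx depesz"))  # [False, False, False]
print(check("depesz depesz depesz"))   # [True, True, True]
```

Under these semantics all three variants agree on both inputs, so a case where only the first one disagrees points at how the engine handles the wildcard together with the repeated back-reference.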

--
The scale of a problem often equals the size of an ego.



Re: Why this regexp matches?!

From
hubert depesz lubaczewski
Date:
On Sat, Feb 04, 2012 at 07:31:25PM +0100, Alban Hertroys wrote:
> > I know that Pg regexps are limited, but even grep's regexps match this
>
> Limited? They're really not. According to the docs they are beyond
> POSIX compliant, even including several extensions as they appear in,
> among others, Perl. That said, the docs do mention a known limitation
> with braces and forward-references - maybe this is related.

Limited - because (for example) Pg regexps, are the only regexp flavour
that I know that you can't have both greedy and non-greedy operators in
the same expression.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Why this regexp matches?!

From
Tom Lane
Date:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Sat, Feb 04, 2012 at 07:31:25PM +0100, Alban Hertroys wrote:
>> Limited? They're really not.

> Limited - because (for example) Pg regexps, are the only regexp flavour
> that I know that you can't have both greedy and non-greedy operators in
> the same expression.

Huh?  Sure you can.

The engine's rules for combining greedy and non-greedy behavior might be
a bit different from Perl's, but that doesn't make it "limited".  It
just means it has different idiosyncrasies from Perl's engine.  I do not
accept the proposition that Perl's regexps are perfect and everybody
else's are wrong to the extent that they act differently from Perl's.

As for the specific behavior at hand, it does look like a bug from here,
but I don't have time to poke at it right now.

            regards, tom lane

Re: Why this regexp matches?!

From
hubert depesz lubaczewski
Date:
On Sat, Feb 04, 2012 at 03:27:53PM -0500, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > that I know that you can't have both greedy and non-greedy operators in
> > the same expression.
>
> Huh?  Sure you can.

wrote about it year ago:

http://archives.postgresql.org/pgsql-general/2010-01/msg00067.php

Just tested, and it behaves the same way in 9.2devel.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Why this regexp matches?!

From
Jasen Betts
Date:
On 2012-02-04, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
>
> what's worse:
> $ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1' );
>  regexp_replace
> ────────────────
>  depesz
> (1 row)
>
> I know that Pg regexps are limited, but even grep's regexps match this
> correctly:

whose grep?

Postgres is BSD licence and that means they can't use the latest and
greatest GPL libraries.

--
⚂⚃ 100% natural

Re: Why this regexp matches?!

From
hubert depesz lubaczewski
Date:
On Mon, Feb 06, 2012 at 11:29:23AM +0000, Jasen Betts wrote:
> On 2012-02-04, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> > select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';
> >
> > what's worse:
> > $ select regexp_replace( 'depesz depeszx depesz', E'^(.*)( \\1)+$', E'\\1' );
> >  regexp_replace
> > ────────────────
> >  depesz
> > (1 row)
> >
> > I know that Pg regexps are limited, but even grep's regexps match this
> > correctly:
>
> whose grep?
>
> Postgres is BSD licence and that means they can't use the latest and
> greatest GPL libraries.

yes, I did use gnu grep. but it's hardly "latest and greatest" - there
is nothing very special about this regexp, aside from the fact, that
according to pg docs (how I read them) - it shouldn't match, but it
does.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: Why this regexp matches?!

From
Tom Lane
Date:
Alban Hertroys <haramrae@gmail.com> writes:
> On 4 Feb 2012, at 9:46, hubert depesz lubaczewski wrote:
>> select 'depesz depeszx depesz' ~ E'^(.*)( \\1)+$';

> Apparently something odd is going on between the wildcard, the repetitive part and the back-reference. That could be
> just us not seeing what's wrong with the expression or be an actual bug.

FYI, I've made some progress on characterizing the cause of this bug,
as per comments at the upstream bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
There are actually two distinct bugs involved, and I don't yet have a
patch for the case depesz illustrates.

            regards, tom lane