Thread: Uninterruptable regexp_replace in 9.3.1 ?

Uninterruptable regexp_replace in 9.3.1 ?

From
Sandro Santilli
Date:
The following snippet reveals that 9.3.1 has a bug 
in regexp_matches, which uninterruptably keeps CPU
spinning for minutes:

-----8<---------------------------------------------------

\timing
SET statement_timeout = 2;
-- this is only to show statement_timeout is effective here
SELECT count(*) from generate_series(1, 100000);
-- this one is uninterruptable!
SELECT regexp_matches($INPUT$







   /a   $b$   $c$d   ;
$INPUT$,
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*?\2)+)$REG$, 'g' );

-----8<---------------------------------------------------

The above has been tested to be harmless with PostgreSQL 9.1.11
in that the regexp_matches call is interrupted, but it is NOT
with PostgreSQL 9.3.1.

Is it a known bug ?

Please include my address in replies as I don't get notified
of list activity. Thanks.

--strk; 
()  ASCII ribbon campaign  --  Keep it simple !/\  http://strk.keybit.net/rants/ascii_mails.txt  



Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Sandro Santilli
Date:
I've just tested 9.3.3 and it is _also_ affected.
Should I report the regression somewhere else ?

--strk;

On Fri, Feb 21, 2014 at 10:17:59AM +0100, Sandro Santilli wrote:
> The following snippet reveals that 9.3.1 has a bug 
> in regexp_matches, which uninterruptably keeps CPU
> spinning for minutes:
> 
> -----8<---------------------------------------------------
> 
> \timing
> SET statement_timeout = 2;
> -- this is only to show statement_timeout is effective here
> SELECT count(*) from generate_series(1, 100000);
> -- this one is uninterruptable!
> SELECT regexp_matches($INPUT$
> 
> 
> 
> 
> 
> 
> 
> 
>     /a
>     $b$
>     $c$d
>     ;
> $INPUT$,
> $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*?\2)+)$REG$, 'g' );
> 
> -----8<---------------------------------------------------
> 
> The above has been tested to be harmless with PostgreSQL 9.1.11
> in that the regexp_matches call is interrupted, but it is NOT
> with PostgreSQL 9.3.1.
> 
> Is it a known bug ?
> 
> Please include my address in replies as I don't get notified
> of list activity. Thanks.
> 
> --strk; 
> 
>  ()  ASCII ribbon campaign  --  Keep it simple !
>  /\  http://strk.keybit.net/rants/ascii_mails.txt  



Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Craig Ringer
Date:
On 02/21/2014 05:17 PM, Sandro Santilli wrote:
> The following snippet reveals that 9.3.1 has a bug
> in regexp_matches, which uninterruptably keeps CPU
> spinning for minutes:

Huh. So it does. That's interesting.

(You should generally report things to pgsql-bugs@postgresql.org btw,
not -hackers)

Looks like it's busily looping within the regex.c code, never hitting a
CHECK_FOR_INTERRUPTS.

The real question IMO is why it's taking so long. It looks like
cfindloop(...) is being called multiple times, with each call taking a
couple of seconds.

A profile of the run is attached. I don't expect to have a chance to dig
into this right away, as I haven't touched the regexp code before and
would need to spend a bit of time studying it to achieve anything.
Hopefully the test, confirmation, and profile is useful.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Florian Pflug
Date:
On Feb21, 2014, at 16:46 , Craig Ringer <craig@2ndquadrant.com> wrote:
> The real question IMO is why it's taking so long. It looks like
> cfindloop(...) is being called multiple times, with each call taking a
> couple of seconds.

Yeah, I wondered about this too. I've shortened the example a bit - here
are a few observations

postgres=# select regexp_matches('             $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 33.026 ms

postgres=# select regexp_matches('              $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 60.594 ms

postgres=# select regexp_matches('               $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 114.410 ms

postgres=# select regexp_matches('                $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 227.467 ms

postgres=# select regexp_matches('                 $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 452.739 ms

postgres=# select regexp_matches('                  $a$b$c$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 943.098 ms

postgres=# select regexp_matches('             $a$b$c$d$e$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 200.795 ms

postgres=# select regexp_matches('             $a$b$c$d$e$f$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 298.264 ms

postgres=# select regexp_matches('             $a$b$c$d$e$f$g$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 444.219 ms

postgres=# select regexp_matches('             $a$b$c$d$e$f$g$h$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 696.137 ms

postgres=# select regexp_matches('             $a$b$c$d$e$f$g$h$i$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 974.502 ms

postgres=# select regexp_matches('             $a$b$c$d$e$f$g$h$i$j$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 1369.703 ms

postgres=# select regexp_matches('              $a$b$c$d$e$f$g$h$i$j$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');
Time: 2747.766 ms

In other words, the observes runtime is roughly 2^i * 1.5^j for inputs
consiting of i leading spaces (any character will probably do) followed
by j substring of the form $X$ (X is an arbitrary character).

best regards,
Florian Pflug




Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Craig Ringer
Date:
On 02/22/2014 12:04 AM, Florian Pflug wrote:
> On Feb21, 2014, at 16:46 , Craig Ringer <craig@2ndquadrant.com> wrote:
>> The real question IMO is why it's taking so long. It looks like
>> cfindloop(...) is being called multiple times, with each call taking a
>> couple of seconds.
> 
> Yeah, I wondered about this too. I've shortened the example a bit - here
> are a few observations
> 
> [snip]
>
> In other words, the observes runtime is roughly 2^i * 1.5^j for inputs
> consiting of i leading spaces (any character will probably do) followed
> by j substring of the form $X$ (X is an arbitrary character).

The biggest change in regexp support in the introduction of proper
unicode support, but that was before 9.1.

The problem report claims that the issue does not occur on 9.1, but yet:

git diff REL9_1_STABLE master  -- ./src/backend/utils/adt/regexp.c

is utterly trivial; a copyright date line change, and 1609797c which
just tweaks the includes. 9.0 has a much bigger diff.

So I'd like to confirm that this issue doesn't affect 9.1. I can
reproduce the issue againts 9.2. I don't have 9.1 or older lying around
to test against right this second.

Sandro, can you please provide the output of "SELECT version()" from a
PostgreSQL version that is not slow with this query?



(BTW, I'm highly amused by how the development style has changed around
here. From "git blame", this is from 1997:

"I haven't actually measured the speed improvement, but it `looks' a lot
quicker visually when watching regression test output."

Ha.)


-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Florian Pflug
Date:
On Feb21, 2014, at 17:29 , Craig Ringer <craig@2ndQuadrant.com> wrote:
> The problem report claims that the issue does not occur on 9.1, but yet:
>
> git diff REL9_1_STABLE master  -- ./src/backend/utils/adt/regexp.c
>
> is utterly trivial; a copyright date line change, and 1609797c which
> just tweaks the includes. 9.0 has a much bigger diff.

On 9.1.12:

postgres=# select regexp_matches('              $a$b$c$d$e$f$g$h$i$j$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');              regexp_matches
---------------------------------------------{"              $a$b$c$d$e$f$g$h$i$j",NULL}
(1 row)
Time: 1.048 ms

On HEAD

postgres=# select regexp_matches('              $a$b$c$d$e$f$g$h$i$j$',
$REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g');    regexp_matches
-------------------------{"
",NULL}{a,NULL}{b,NULL}{c,NULL}{d,NULL}{e,NULL}{f,NULL}{g,NULL}{h,NULL}{i,NULL}{j,NULL}
(11 rows)
Time: 4787.239 ms

Aha! Since we go rid of regex_flavor pre-9.1, I don't have an immediate suspect...

best regards,
Florian Pflug




Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Tom Lane
Date:
Craig Ringer <craig@2ndquadrant.com> writes:
> So I'd like to confirm that this issue doesn't affect 9.1.

It doesn't.  I suspect it has something to do with 173e29aa5 or one
of the nearby commits in backend/regex/.
        regards, tom lane



Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Tom Lane
Date:
I wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> So I'd like to confirm that this issue doesn't affect 9.1.

> It doesn't.  I suspect it has something to do with 173e29aa5 or one
> of the nearby commits in backend/regex/.

Indeed, git bisect fingers that commit as introducing the problem.

What seems to be happening is that citerdissect() is trying some
combinatorially large number of ways to split the input string into
substrings that can satisfy the argument of the outer "+" iterator.
It keeps failing on the substring starting with the first '$', and
then vainly looking for other splits that dodge the problem.

I'm not entirely sure how come the previous coding didn't fall into
the same problem.  It's quite possible Henry Spencer is smarter than
I am, but there was certainly nothing there before that was obviously
avoiding this hazard.

Worthy of note is that I think pre-9.2 is actually giving the wrong
answer --- it's claiming the whole string matches the regex,
which it does not if I'm reading it right.  This may be related to
the problem that commit 173e29aa5 was trying to fix, ie failure to
enforce backref matches in some cases.  So one possible theory is
that by failing to notice that it *didn't* have a valid match,
the old code accidentally failed to go down the rabbit hole of trying
zillions of other ways to match.
        regards, tom lane



Re: Uninterruptable regexp_replace in 9.3.1 ?

From
Sandro Santilli
Date:
On Fri, Feb 21, 2014 at 08:17:36PM -0500, Tom Lane wrote:
> I wrote:
> > Craig Ringer <craig@2ndquadrant.com> writes:
> >> So I'd like to confirm that this issue doesn't affect 9.1.
> 
> > It doesn't.  I suspect it has something to do with 173e29aa5 or one
> > of the nearby commits in backend/regex/.
> 
> Indeed, git bisect fingers that commit as introducing the problem.

Is there anything I can do for this to not fade to oblivion ?
Will a mail to pgsql-bugs help ?

--strk;
()  ASCII ribbon campaign  --  Keep it simple !/\  http://strk.keybit.net/rants/ascii_mails.txt