Thread: uninterruptable regexp_replace in 9.2 and 9.3

uninterruptable regexp_replace in 9.2 and 9.3

From
Sandro Santilli
Date:
Starting with commit 173e29aa5 the regexp_replace function
became uninterruptable during operations, and takes a lot
of time and RAM to process some patterns.

Full story in this thread:
http://www.postgresql.org/message-id/646.1393031856@sss.pgh.pa.us

I'm hoping a mail to pgsql-bugs would assign this issue a ticket
number for me to follow. Let me know if there's a better way.
Thank you !

--strk;

 ()  ASCII ribbon campaign  --  Keep it simple !
 /\  http://strk.keybit.net/rants/ascii_mails.txt

Re: uninterruptable regexp_replace in 9.2 and 9.3

From
Pedro Gimeno
Date:
Sandro Santilli wrote, On 2014-02-28 12:28:
> Starting with commit 173e29aa5 the regexp_replace function
> became uninterruptable during operations, and takes a lot
> of time and RAM to process some patterns.
>
> Full story in this thread:
> http://www.postgresql.org/message-id/646.1393031856@sss.pgh.pa.us
>
> I'm hoping a mail to pgsql-bugs would assign this issue a ticket
> number for me to follow. Let me know if there's a better way.
> Thank you !

This may be relevant:
https://gist.github.com/johnbartholomew/8379265

I've added these lines:

printf 'Testing psql:\n'
time psql -c "SELECT regexp_matches('$pattern','$input');"

The results in my machine are (with PostgreSQL 9.1.9):

Pattern:
"a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa"
Input: "aaaaaaaaaaaaaaaaaaaaaaaaaaa"
Testing grep:
aaaaaaaaaaaaaaaaaaaaaaaaaaa

real    0m0.003s
user    0m0.000s
sys    0m0.000s
Testing perl:
match

real    0m6.103s
user    0m6.100s
sys    0m0.000s
Testing python:
<_sre.SRE_Match object at 0xb70e2250>

real    0m10.207s
user    0m10.141s
sys    0m0.060s
Testing psql:
        regexp_matches
-------------------------------
 {aaaaaaaaaaaaaaaaaaaaaaaaaaa}
(1 row)


real    0m0.039s
user    0m0.024s
sys    0m0.004s

Interestingly, as noted in the comments, PHP reports an error that the
backtracking limit was reached.

I'm adding this note in case it helps anyone get a bigger picture as to
what other implementations do about this problem.

Re: uninterruptable regexp_replace in 9.2 and 9.3

From
Tom Lane
Date:
Pedro Gimeno <pgsql-004@personal.formauri.es> writes:
> I'm adding this note in case it helps anyone get a bigger picture as to
> what other implementations do about this problem.

I spent some time looking at pcre to try to understand what it was doing
backtracking-wise, and eventually realized that it's not actually trying
very hard.  Consider this problem:

# select regexp_matches('123456789z', '(([0-9]+|9z)+)');
 regexp_matches
-----------------
 {123456789z,9z}
(1 row)

To obtain the longest possible match, it's necessary to decide that the
first iteration of the + operator matches '12345678', leaving '9z' to be
matched by the second iteration.  Postgres and Tcl get this right.
Perl and pcre, not so much: they report the match as '123456789'.

$ perl -e "if ('123456789z' =~ m/(([0-9]+|9z)+)/) {print \"\$1\\n\";}"
123456789

In fact, Perl doesn't even get this simplified case right, though no
backtracking is required:

$ perl -e "if ('123456789z' =~ m/(([0-9]|9z)+)/) {print \"\$1\\n\";}"
123456789

It seems to just fail to notice that on the 9th iteration, a longer match
is available from the second OR-alternative than the first.  (No doubt
this is documented behavior somewhere, but it sure flies in the face of
what I'd consider to be expected regex behavior.)

The performance problem we're looking at comes directly from the
backtracking that's done to ensure that we detect a match in case the
pattern has this sort of pathological behavior.  The test case doesn't
actually need any such backtracking, because there aren't multiple ways
to match any particular substring; but I'm not sure if there's any easy
way to recognize that.

            regards, tom lane

Re: uninterruptable regexp_replace in 9.2 and 9.3

From
Sandro Santilli
Date:
On Sun, Mar 02, 2014 at 02:07:29PM +0100, Pedro Gimeno wrote:
> Tom Lane wrote, On 2014-03-02 05:38:
>
> > The performance problem we're looking at comes directly from the
> > backtracking that's done to ensure that we detect a match in case the
> > pattern has this sort of pathological behavior.
>
> It is my understanding that the bug reported by the OP is not a
> performance problem, but PostgreSQL's failure to interrupt the
> processing if it takes too long, when statement_timeout is set.

Yes, this is my main concern.
Of course it'd be nice to get a faster response, even if ill, but
as long as statement_timeout is effective I'm happy.

--strk;

 ()  ASCII ribbon campaign  --  Keep it simple !
 /\  http://strk.keybit.net/rants/ascii_mails.txt

Re: uninterruptable regexp_replace in 9.2 and 9.3

From
Pedro Gimeno
Date:
Tom Lane wrote, On 2014-03-02 05:38:

> The performance problem we're looking at comes directly from the
> backtracking that's done to ensure that we detect a match in case the
> pattern has this sort of pathological behavior.

It is my understanding that the bug reported by the OP is not a
performance problem, but PostgreSQL's failure to interrupt the
processing if it takes too long, when statement_timeout is set.

If it's not possible to interrupt it, then maybe an approach similar to
PHP's backtracking limit could be implemented.

I'm not familiar at all with PostgreSQL's code, but I wonder if adding a
CHECK_FOR_INTERRUPTS every sensible number of backtracks could solve the
original bug.