Thread: Uninterruptable regexp_replace in 9.3.1 ?
The following snippet reveals that 9.3.1 has a bug in regexp_matches, which uninterruptably keeps CPU spinning for minutes: -----8<--------------------------------------------------- \timing SET statement_timeout = 2; -- this is only to show statement_timeout is effective here SELECT count(*) from generate_series(1, 100000); -- this one is uninterruptable! SELECT regexp_matches($INPUT$ /a $b$ $c$d ; $INPUT$, $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*?\2)+)$REG$, 'g' ); -----8<--------------------------------------------------- The above has been tested to be harmless with PostgreSQL 9.1.11 in that the regexp_matches call is interrupted, but it is NOT with PostgreSQL 9.3.1. Is it a known bug ? Please include my address in replies as I don't get notified of list activity. Thanks. --strk; () ASCII ribbon campaign -- Keep it simple !/\ http://strk.keybit.net/rants/ascii_mails.txt
I've just tested 9.3.3 and it is _also_ affected. Should I report the regression somewhere else ? --strk; On Fri, Feb 21, 2014 at 10:17:59AM +0100, Sandro Santilli wrote: > The following snippet reveals that 9.3.1 has a bug > in regexp_matches, which uninterruptably keeps CPU > spinning for minutes: > > -----8<--------------------------------------------------- > > \timing > SET statement_timeout = 2; > -- this is only to show statement_timeout is effective here > SELECT count(*) from generate_series(1, 100000); > -- this one is uninterruptable! > SELECT regexp_matches($INPUT$ > > > > > > > > > /a > $b$ > $c$d > ; > $INPUT$, > $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*?\2)+)$REG$, 'g' ); > > -----8<--------------------------------------------------- > > The above has been tested to be harmless with PostgreSQL 9.1.11 > in that the regexp_matches call is interrupted, but it is NOT > with PostgreSQL 9.3.1. > > Is it a known bug ? > > Please include my address in replies as I don't get notified > of list activity. Thanks. > > --strk; > > () ASCII ribbon campaign -- Keep it simple ! > /\ http://strk.keybit.net/rants/ascii_mails.txt
On 02/21/2014 05:17 PM, Sandro Santilli wrote: > The following snippet reveals that 9.3.1 has a bug > in regexp_matches, which uninterruptably keeps CPU > spinning for minutes: Huh. So it does. That's interesting. (You should generally report things to pgsql-bugs@postgresql.org btw, not -hackers) Looks like it's busily looping within the regex.c code, never hitting a CHECK_FOR_INTERRUPTS. The real question IMO is why it's taking so long. It looks like cfindloop(...) is being called multiple times, with each call taking a couple of seconds. A profile of the run is attached. I don't expect to have a chance to dig into this right away, as I haven't touched the regexp code before and would need to spend a bit of time studying it to achieve anything. Hopefully the test, confirmation, and profile is useful. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Feb21, 2014, at 16:46 , Craig Ringer <craig@2ndquadrant.com> wrote: > The real question IMO is why it's taking so long. It looks like > cfindloop(...) is being called multiple times, with each call taking a > couple of seconds. Yeah, I wondered about this too. I've shortened the example a bit - here are a few observations postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 33.026 ms postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 60.594 ms postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 114.410 ms postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 227.467 ms postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 452.739 ms postgres=# select regexp_matches(' $a$b$c$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 943.098 ms postgres=# select regexp_matches(' $a$b$c$d$e$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 200.795 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 298.264 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$g$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 444.219 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 696.137 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$i$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 974.502 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$i$j$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 1369.703 ms postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$i$j$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); Time: 2747.766 ms In other words, the observes runtime is roughly 2^i * 1.5^j for inputs consiting of i leading spaces (any character will probably do) followed by j substring of the form $X$ (X is an arbitrary character). best regards, Florian Pflug
On 02/22/2014 12:04 AM, Florian Pflug wrote: > On Feb21, 2014, at 16:46 , Craig Ringer <craig@2ndquadrant.com> wrote: >> The real question IMO is why it's taking so long. It looks like >> cfindloop(...) is being called multiple times, with each call taking a >> couple of seconds. > > Yeah, I wondered about this too. I've shortened the example a bit - here > are a few observations > > [snip] > > In other words, the observes runtime is roughly 2^i * 1.5^j for inputs > consiting of i leading spaces (any character will probably do) followed > by j substring of the form $X$ (X is an arbitrary character). The biggest change in regexp support in the introduction of proper unicode support, but that was before 9.1. The problem report claims that the issue does not occur on 9.1, but yet: git diff REL9_1_STABLE master -- ./src/backend/utils/adt/regexp.c is utterly trivial; a copyright date line change, and 1609797c which just tweaks the includes. 9.0 has a much bigger diff. So I'd like to confirm that this issue doesn't affect 9.1. I can reproduce the issue againts 9.2. I don't have 9.1 or older lying around to test against right this second. Sandro, can you please provide the output of "SELECT version()" from a PostgreSQL version that is not slow with this query? (BTW, I'm highly amused by how the development style has changed around here. From "git blame", this is from 1997: "I haven't actually measured the speed improvement, but it `looks' a lot quicker visually when watching regression test output." Ha.) -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Feb21, 2014, at 17:29 , Craig Ringer <craig@2ndQuadrant.com> wrote: > The problem report claims that the issue does not occur on 9.1, but yet: > > git diff REL9_1_STABLE master -- ./src/backend/utils/adt/regexp.c > > is utterly trivial; a copyright date line change, and 1609797c which > just tweaks the includes. 9.0 has a much bigger diff. On 9.1.12: postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$i$j$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); regexp_matches ---------------------------------------------{" $a$b$c$d$e$f$g$h$i$j",NULL} (1 row) Time: 1.048 ms On HEAD postgres=# select regexp_matches(' $a$b$c$d$e$f$g$h$i$j$', $REG$((?:[^'"$;]+|"[^"]*"|'(?:[^']*|'')*'|(\$[^$]*\$).*\2)+)$REG$, 'g'); regexp_matches -------------------------{" ",NULL}{a,NULL}{b,NULL}{c,NULL}{d,NULL}{e,NULL}{f,NULL}{g,NULL}{h,NULL}{i,NULL}{j,NULL} (11 rows) Time: 4787.239 ms Aha! Since we go rid of regex_flavor pre-9.1, I don't have an immediate suspect... best regards, Florian Pflug
Craig Ringer <craig@2ndquadrant.com> writes: > So I'd like to confirm that this issue doesn't affect 9.1. It doesn't. I suspect it has something to do with 173e29aa5 or one of the nearby commits in backend/regex/. regards, tom lane
I wrote: > Craig Ringer <craig@2ndquadrant.com> writes: >> So I'd like to confirm that this issue doesn't affect 9.1. > It doesn't. I suspect it has something to do with 173e29aa5 or one > of the nearby commits in backend/regex/. Indeed, git bisect fingers that commit as introducing the problem. What seems to be happening is that citerdissect() is trying some combinatorially large number of ways to split the input string into substrings that can satisfy the argument of the outer "+" iterator. It keeps failing on the substring starting with the first '$', and then vainly looking for other splits that dodge the problem. I'm not entirely sure how come the previous coding didn't fall into the same problem. It's quite possible Henry Spencer is smarter than I am, but there was certainly nothing there before that was obviously avoiding this hazard. Worthy of note is that I think pre-9.2 is actually giving the wrong answer --- it's claiming the whole string matches the regex, which it does not if I'm reading it right. This may be related to the problem that commit 173e29aa5 was trying to fix, ie failure to enforce backref matches in some cases. So one possible theory is that by failing to notice that it *didn't* have a valid match, the old code accidentally failed to go down the rabbit hole of trying zillions of other ways to match. regards, tom lane
On Fri, Feb 21, 2014 at 08:17:36PM -0500, Tom Lane wrote: > I wrote: > > Craig Ringer <craig@2ndquadrant.com> writes: > >> So I'd like to confirm that this issue doesn't affect 9.1. > > > It doesn't. I suspect it has something to do with 173e29aa5 or one > > of the nearby commits in backend/regex/. > > Indeed, git bisect fingers that commit as introducing the problem. Is there anything I can do for this to not fade to oblivion ? Will a mail to pgsql-bugs help ? --strk; () ASCII ribbon campaign -- Keep it simple !/\ http://strk.keybit.net/rants/ascii_mails.txt