Thread: Pathological regexp match
We came across a regexp that takes very much longer than expected. PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity ?column? ----------t (1 row) Time: 90525.148 ms The full query is below. The same match in perl takes less than 0.01 seconds on the same hardware. #!/bin/env perl use warnings; use strict; my $sample = 'ooo...'; # omitted for email brevity if ($sample =~ /Z(Q)[^Q]*A.*?(\1)/) { print 'matches'; } else { print 'does not match'; } This is a simplified version of a match that finally finished after 18 hours. Given the nearly 4 orders of magnitude difference between the Perl script and the Postgres version, is there something thatcould be improved in the Postgres regex engine? Cheers, Michael Glaesemann michael.glaesemann@myyearbook.com SELECT 'ooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooooooooooooooooooQoooooooooooooooooooooooooooooooooooQoooooooQooooooQooooZQoooooooooooAooooQoooooooQooooooooooooooooooooooQoooooooQooooooooooooooooooooooQoooooQoooooooooooAooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooooooooooooQoQooooooQoooooooQoooooooQoooooooQooooooooooooooooooooooooooooooooooooooooooZQoooooooooooooooooQooQoQoooooooQooooooooooooooooooooooooooooooooooooooooooQoooooQoooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQoQooooooooooooooooZQoooooooooooooooooQooQoQooooooooooooQoQoooooooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooQoooooooooooooooooooooooooooooooooooooooQoooooQooooooooooooooooooooooooooooooooooooooooooooooQooooooooooooooooZQooooooooooooooooooooooooooQooQoQooooooooooooQoQooooooooooooooooooooooooooooooZQoooooooooooooQooQoQooooooooooQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooQoooooooooooooooooooooooooQoooooooQooooooooooooooooooooooooooooooooooooooQooooooooQoQoooooooooooooooooooooooooooooZQooooooooooooQooQoQooooooooooooooooooooooooooooooooooooooooooooooooooooooooZQooooooooooooooooooooooooooooooQooQoQoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooQooooZQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooooooQooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooQooooooQooooooooooQoooooooQoooooooooooQoooooooooooooo' ~$r$Z(Q)[^Q]*A.*?(\1)$r$;
Hi Michael, Michael Glaesemann wrote: > We came across a regexp that takes very much longer than expected. > > PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit > > SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity The ? after .* is pointless. If you remove it, the query returns immediately. (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote: > Hi Michael, > > Michael Glaesemann wrote: >> We came across a regexp that takes very much longer than expected. >> >> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc >> (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit >> >> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email >> brevity > > The ? after .* is pointless. Interesting. I would expect that *? would be the non-greedy version of *, meaning match up to the first \1 (in this case the first Q following A), rather than as much as possible. For example, in Perl: $ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*(\1)/) { print \$&; } else { print 'NO'; }" && echo ZQoooAoooQooQooQ $ perl -e " if ('oooZQoooAoooQooQooQooo' =~ /Z(Q)[^Q]*A.*?(\1)/) { print \$&; } else { print 'NO'; }" && echo ZQoooAoooQ If I'm reading the docs right, Postgres does support non-greedy * as *?: <http://www.postgresql.org/docs/8.4/interactive/functions-matching.html#POSIX-QUANTIFIERS-TABLE > However, as you point out, Postgres doesn't appear to take this into account: postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) [^Q]*A.*(\2))$r$, $s$X$s$); regexp_replace ---------------- oooXooo (1 row) postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) [^Q]*A.*?(\2))$r$, $s$X$s$); regexp_replace ---------------- oooXooo (1 row) Michael Glaesemann michael.glaesemann@myyearbook.com
Michael Glaesemann wrote: > > On Jan 28, 2010, at 21:59 , Alvaro Herrera wrote: > > >Hi Michael, > > > >Michael Glaesemann wrote: > >>We came across a regexp that takes very much longer than expected. > >> > >>PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC > >>gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit > >> > >>SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email > >>brevity > > > >The ? after .* is pointless. > > Interesting. I would expect that *? would be the non-greedy version > of *, meaning match up to the first \1 (in this case the first Q > following A), rather than as much as possible. Huh, you are right, *? is the non-greedy version. I keep forgetting those. Note that they only work if you have regex_flavor set to advanced, though (which is the default). > However, as you point out, Postgres doesn't appear to take this into > account: > > postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) > [^Q]*A.*(\2))$r$, $s$X$s$); > regexp_replace > ---------------- > oooXooo > (1 row) > > postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) > [^Q]*A.*?(\2))$r$, $s$X$s$); > regexp_replace > ---------------- > oooXooo > (1 row) Hmm, that's strange ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Michael Glaesemann wrote: > However, as you point out, Postgres doesn't appear to take this into > account: > > postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) > [^Q]*A.*(\2))$r$, $s$X$s$); > regexp_replace > ---------------- > oooXooo > (1 row) > > postgres=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q) > [^Q]*A.*?(\2))$r$, $s$X$s$); > regexp_replace > ---------------- > oooXooo > (1 row) I think the reason for this is that the first * is greedy and thus the entire expression is considered greedy. The fact that you've made the second * non-greedy does not ungreedify the RE ... Note the docs say: The above rules associate greediness attributes not only withindividual quantified atoms, but with branches and entire REsthatcontain quantified atoms. What that means is that thematching is done in such a way that the branch, or whole RE,matchesthe longest or shortest possible substring as a whole. It's late here so I'm not sure if this is what you're looking for: alvherre=# select regexp_replace('oooZQoooAoooQooQooQooo', $r$(Z(Q)[^Q]*?A.*(\2))$r$, $s$X$s$);regexp_replace ----------------oooXooQooQooo (1 fila) (Obviously the non-greediness has moved somewhere else) :-( -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jan 28, 2010, at 23:21 , Alvaro Herrera wrote: > I think the reason for this is that the first * is greedy and thus the > entire expression is considered greedy. The fact that you've made the > second * non-greedy does not ungreedify the RE ... Note the docs say: > > The above rules associate greediness attributes not only with > individual quantified atoms, but with branches and entire REs > that contain quantified atoms. What that means is that the > matching is done in such a way that the branch, or whole RE, > matches the longest or shortest possible substring as a whole. Interesting. Thanks for pointing out this section of the docs. I wasn't aware of this twist. > It's late here so I'm not sure if this is what you're looking for: I'm not actually looking for a regexp that works: I was able to accomplish the task I had at hand with a different regexp. I'm just reporting the particular unexpected nastiness we ran into. :) Michael Glaesemann michael.glaesemann@myyearbook.com
2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>: > Hi Michael, > > Michael Glaesemann wrote: >> We came across a regexp that takes very much longer than expected. >> >> PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44), 64-bit >> >> SELECT 'ooo...' ~ $r$Z(Q)[^Q]*A.*?(\1)$r$; -- omitted for email brevity > > The ? after .* is pointless. If you remove it, the query returns > immediately. > > (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW) Incidentally, I ran across the exact same issue with a non-greedy regexp with a client earlier this week, and put on my TODO to figure out a good place to stick a check for interrupts. Does this mean I don't have to, because you're on it? ;) -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
Magnus Hagander wrote: > 2010/1/29 Alvaro Herrera <alvherre@commandprompt.com>: > > (There's a badly needed CHECK_FOR_INTERRUPTS in this code BTW) > > Incidentally, I ran across the exact same issue with a non-greedy > regexp with a client earlier this week, and put on my TODO to figure > out a good place to stick a check for interrupts. Does this mean I > don't have to, because you're on it? ;) No, sorry :-( -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes: > We came across a regexp that takes very much longer than expected. I poked into this a little bit. What seems to be happening is that the use of non-greedy quantifiers plus backreferences turns off most of the optimization that the regexp engine usually does, leaving the RE as a tree of possibilities that is explored in a fairly dumb fashion. In particular, there is a loop in crevdissect() that tries to locate a feasible division point between two concatenated sub-patterns, and for each try, a recursive call to crevdissect() tries to locate a feasible division point between *its* two sub-patterns, resulting in O(N^2) runtime. With a not very pleasant constant factor, too, because of repeated startup/shutdown costs for the DFA matching engine. I found a possible patch that seems to improve matters: AFAICS the DFA matching is independent of the checking that cdissect() and friends do, so we can apply that check first instead of second. This brings the runtime down from minutes to sub-second on my machine. However I don't have a whole lot of faith either in the correctness of this change, or that it might not make some other cases slower instead of faster. Has anyone got a reasonably messy collection of regexps they'd like to try this patch on? BTW, I filed the problem upstream with the Tcl project https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894 but I'm not going to hold my breath waiting for useful advice from them. Since Henry Spencer dropped off the net, there doesn't seem to be anybody there who understands that code much better than we do. Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls in there ... regards, tom lane Index: src/backend/regex/regexec.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/regex/regexec.c,v retrieving revision 1.27 diff -c -r1.27 regexec.c *** src/backend/regex/regexec.c 15 Oct 2005 02:49:24 -0000 1.27 --- src/backend/regex/regexec.c 30 Jan 2010 06:53:13 -0000 *************** *** 804,820 **** for (;;) { /* try this midpoint on for size */ ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! longest(v, d2, mid, end, (int *) NULL) == end && ! (er = cdissect(v, t->right, mid, end)) == ! REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) { ! freedfa(d); ! freedfa(d2); ! return er; } /* that midpoint didn't work, find a new one */ --- 804,821 ---- for (;;) { /* try this midpoint on for size */ ! if (longest(v, d2, mid, end, (int *) NULL) == end) { ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! (er = cdissect(v, t->right, mid, end)) == REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) ! { ! freedfa(d); ! freedfa(d2); ! return er; ! } } /* that midpoint didn't work, find a new one */ *************** *** 904,920 **** for (;;) { /* try this midpoint on for size */ ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! longest(v, d2, mid, end, (int *) NULL) == end && ! (er = cdissect(v, t->right, mid, end)) == ! REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) { ! freedfa(d); ! freedfa(d2); ! return er; } /* that midpoint didn't work, find a new one */ --- 905,922 ---- for (;;) { /* try this midpoint on for size */ ! if (longest(v, d2, mid, end, (int *) NULL) == end) { ! er = cdissect(v, t->left, begin, mid); ! if (er == REG_OKAY && ! (er = cdissect(v, t->right, mid, end)) == REG_OKAY) ! break; /* NOTE BREAK OUT */ ! if (er != REG_OKAY && er != REG_NOMATCH) ! { ! freedfa(d); ! freedfa(d2); ! return er; ! } } /* that midpoint didn't work, find a new one */
On Sat, Jan 30, 2010 at 08:07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Glaesemann <michael.glaesemann@myyearbook.com> writes: >> We came across a regexp that takes very much longer than expected. > > I poked into this a little bit. What seems to be happening is that the > use of non-greedy quantifiers plus backreferences turns off most of the > optimization that the regexp engine usually does, leaving the RE as a > tree of possibilities that is explored in a fairly dumb fashion. In > particular, there is a loop in crevdissect() that tries to locate a > feasible division point between two concatenated sub-patterns, and > for each try, a recursive call to crevdissect() tries to locate a > feasible division point between *its* two sub-patterns, resulting in > O(N^2) runtime. With a not very pleasant constant factor, too, because > of repeated startup/shutdown costs for the DFA matching engine. > > I found a possible patch that seems to improve matters: AFAICS the DFA > matching is independent of the checking that cdissect() and friends do, > so we can apply that check first instead of second. This brings the > runtime down from minutes to sub-second on my machine. However I don't > have a whole lot of faith either in the correctness of this change, or > that it might not make some other cases slower instead of faster. > Has anyone got a reasonably messy collection of regexps they'd like > to try this patch on? I only have the one, so I don't think I can help all that much with that. > BTW, I filed the problem upstream with the Tcl project > https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894 > but I'm not going to hold my breath waiting for useful advice from > them. Since Henry Spencer dropped off the net, there doesn't seem > to be anybody there who understands that code much better than we do. > > Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls > in there ... Yeah. I have zero experience around that code, so if oyu have at least some, I'd much appreciate it if you (or someone who does) could look at it. Likely to cause a lot less breakage than me :D -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
I wrote: > I found a possible patch that seems to improve matters: AFAICS the DFA > matching is independent of the checking that cdissect() and friends do, > so we can apply that check first instead of second. This brings the > runtime down from minutes to sub-second on my machine. However I don't > have a whole lot of faith either in the correctness of this change, or > that it might not make some other cases slower instead of faster. > Has anyone got a reasonably messy collection of regexps they'd like > to try this patch on? The Tcl folk accepted that patch, so I went ahead and applied it to our code. It would still be a good idea for us to do any testing we can on it, though. > Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls > in there ... I looked at this a bit and decided that it would be messier than seems justified in the absence of known problems. The regex code uses malloc not palloc for transient data structures, so we can't just stick a CHECK_FOR_INTERRUPTS() into some inner loop hotspot --- throwing a longjmp would mean a permanent memory leak. I looked at integrating into the error mechanism it does have, but it turns out that that's not particularly designed to force immediate exit on failure either; they just set a flag bit that will be tested whenever control does return from the match. I think that a good fix would require first changing the mallocs to pallocs, then add CHECK_FOR_INTERRUPTS. But that's not something I have time to mess with at the moment. regards, tom lane
On Jan 31, 2010, at 22:14 , Tom Lane wrote: > The Tcl folk accepted that patch, so I went ahead and applied it to > our code. It would still be a good idea for us to do any testing we > can on it, though. I applied the patch and ran both the test query I submitted as well as original problematic query that triggered the report, and it runs much faster. Thanks for the fix! Michael Glaesemann michael.glaesemann@myyearbook.com
2010/2/1 Michael Glaesemann <michael.glaesemann@myyearbook.com>: > > On Jan 31, 2010, at 22:14 , Tom Lane wrote: > >> The Tcl folk accepted that patch, so I went ahead and applied it to >> our code. It would still be a good idea for us to do any testing we >> can on it, though. > > I applied the patch and ran both the test query I submitted as well as original problematic query that triggered the report,and it runs much faster. Thanks for the fix! I did the same, and it does not help in my case. FWIW, the regexp I'm matching is: <pre .*?>(.*?)</pre> (yes, the production system has already been fixed to use a smarter regexp that solves the same problem) The text is about 180Kb. PostgreSQL takes ~40 seconds without the patch, ~36 seconds with it, to extract the match from it. Perl takes 0.016 seconds. -- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/
On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote: > The text is about 180Kb. PostgreSQL takes ~40 seconds without the > patch, ~36 seconds with it, to extract the match from it. Perl takes > 0.016 seconds. Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P Best, David
On Mon, Feb 8, 2010 at 6:07 PM, David E. Wheeler <david@kineticode.com> wrote: > On Feb 8, 2010, at 5:15 AM, Magnus Hagander wrote: > >> The text is about 180Kb. PostgreSQL takes ~40 seconds without the >> patch, ~36 seconds with it, to extract the match from it. Perl takes >> 0.016 seconds. > > Obviously we need to support Perl regular expressions in core. Not PCRE, but Perl. ;-P FWIW, PCRE is BSD licensed, so we could actually include it... Would be interesting to see how it performs on the pattern at hand. Joachim