Re: Regexp_replace bug / does not terminate on long strings - Mailing list pgsql-general

From Tom Lane
Subject Re: Regexp_replace bug / does not terminate on long strings
Msg-id 1809528.1629412955@sss.pgh.pa.us
In response to Regexp_replace bug / does not terminate on long strings  ("Markhof, Ingolf" <ingolf.markhof@de.verizon.com>)
Responses Re: Regexp_replace bug / does not terminate on long strings
Re: [E] Re: Regexp_replace bug / does not terminate on long strings
List pgsql-general
"Markhof, Ingolf" <ingolf.markhof@de.verizon.com> writes:
> BRIEF:
> regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> complete or does not complete at all (?!) for big input strings (a few k
> characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses.  I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern.  In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word.  Or it could be N-1 repetitions followed by one other
word, and so on.  The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest.  Usually the actual match (if any) will be pretty
short, so you have O(N) wasted work per word; with N words in the
string, that makes the runtime at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time.  Get rid of one dup at a time, say by
     str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.
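
As a rough sketch of that loop, something like the following PL/pgSQL function might do.  The function name dedup_adjacent, its signature, and the STRICT/IMMUTABLE markings are illustrative assumptions, not something from the original report; only the regexp_replace call is the one shown above.

```sql
-- Hypothetical helper: name and signature are illustrative only.
-- STRICT makes the function return NULL for NULL input, which also keeps
-- the loop below from spinning on a NULL length comparison.
CREATE OR REPLACE FUNCTION dedup_adjacent(str text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
    result   text := str;
    prev_len integer;
BEGIN
    LOOP
        prev_len := length(result);
        -- One pass: each word absorbs at most one adjacent duplicate.
        result := regexp_replace(result, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
        -- Stop as soon as a pass fails to shorten the string.
        EXIT WHEN length(result) >= prev_len;
    END LOOP;
    RETURN result;
END;
$$;

-- Example call:
SELECT dedup_adjacent('a,a,a,b,b,c');
```

Each pass removes at most one duplicate per run of identical words, so a run of k copies needs k-1 shortening passes plus one final pass that changes nothing.  That is more passes than a single (,\1)* replacement, but each pass only has to consider one optional backref per word.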

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

            regards, tom lane

[1] https://www.postgresql.org/message-id/1808998.1629412269%40sss.pgh.pa.us


