Re: Another regexp performance improvement: skip useless paren-captures - Mailing list pgsql-hackers

From	Mark Dilger
Subject	Re: Another regexp performance improvement: skip useless paren-captures
Date	August 8, 2021 18:22:24
Msg-id	BDB634FD-2EE5-4697-91A0-1F53E1363D3B@enterprisedb.com
In response to	Re: Another regexp performance improvement: skip useless paren-captures (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Another regexp performance improvement: skip useless paren-captures
List	pgsql-hackers

> On Aug 8, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I've also rebased over the bug fixes from the other thread,
> and added a couple more test cases.
>
>             regards, tom lane

Hmm.  This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):

 select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
  regexp_split_to_array
 -----------------------
- {"",zkodphfbfbfb}
+ {uuuzkodphfbfbfb}
 (1 row)

The string starts with three "u" characters.  The first of them is doubly-matched, meaning \1 and \2 refer to the first
"u" character.  The (\1\2) that follows matches the next two "u" characters.  When the extra "useless" capture group is
skipped, apparently this doesn't work anymore.  I haven't looked at your patch, so I'm not sure why, but I'm guessing
that \2 doesn't refer to anything.
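
As a cross-check of the expected captures, here is a minimal sketch using Python's re module instead of the PostgreSQL regex engine.  It assumes the two engines agree on this simple pattern, and it does not reproduce the 'ntw' flags.  It prints the span of the first match and the contents of the three capture groups:

```python
import re

text = "uuuzkodphfbfbfb"

# Group 1 is the outer paren pair, group 2 the inner one; both capture
# the same single character.  Group 3 must then repeat that character twice.
m = re.search(r"((.))(\1\2)", text)
assert m is not None

print(m.span())    # start and end offsets of the whole match
print(m.groups())  # contents of groups 1, 2 and 3
```

If \2 were left unset, as guessed above, the third group could not match and the search would fail here.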

That analysis is consistent with the next change:

 select regexp_split_to_array('snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe', '((((?:.))))\3');
-                        regexp_split_to_array
----------------------------------------------------------------------
- {snfwbvx,snzqabixqbixqiumpgxde,xvnsemjxgqoqknrqe,mcqmfslfspskqpqxe}
+                         regexp_split_to_array
+------------------------------------------------------------------------
+ {snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe}
 (1 row)

The pattern matches any double character.  I would expect it to match the "ee", the "mm" and the "ss" in the text.
With the patched code, it matches nothing.
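
The expected split can be sketched outside PostgreSQL as well.  The following uses Python's re module, on the assumption that it treats this pattern the same way; split_on_matches is a hypothetical helper written for this illustration, not a PostgreSQL or Python API.  It is used in place of re.split because re.split would also return the captured groups:

```python
import re

def split_on_matches(pattern, text):
    """Split text at each non-overlapping match, discarding the matched text."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

text = "snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe"

# \3 refers to the third paren pair, which wraps the non-capturing (?:.)
pieces = split_on_matches(r"((((?:.))))\3", text)
print(pieces)
```

Under that assumption the string splits at "ee", "mm" and "ss", giving the same four pieces as the unpatched output above.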

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
