Thread: Mixing greediness in regexp_matches

Mixing greediness in regexp_matches

From

"Daniel Verite"

Date:

23 December 2019, 14:56:47

  Hi,

When looking into how to implement a global replace of multiple
substrings (each with their own replacement) in sql
or plpgsql, I'm wondering if/how an RE with an alternation
can be used.

The basic idea is to iterate on the rows produced by

   regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g')

to break down the string into pairs of (non-matching segment,
matching segment) so that a final result can be assembled
from that (setting aside the last non-matching segment, that
can be retrieved in a final step).

The difficulty is that the longest strings in the alternation
should be prioritized, but the starting (.*?) makes the RE
non-greedy so "foo" is choosen over "foobar".

The doc at [1] leaves me unoptimistic when it mentions that:

 "...when an RE contains both greedy and non-greedy
 subexpressions, the total match length is either as long as possible
 or as short as possible, according to the attribute assigned to the
 whole RE. The attributes assigned to the subexpressions only affect
 how much of that match they are allowed to “eat” relative to each
 other."

Also it gives this example of forcing the RE as a whole to
be greedy despite it having a non-greedy sub-RE:
 regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}')
but it doesn't seem to be able to produce the desired result
in the case of the RE in the middle being an alternation
with strings of different lengths.

The ideal RE with a (foo|foobar|bar) alternation, when applied
globally to a string like 'the string has foo and foobar and bar and
more' would produce something like:

 {"the string has ","foo"}
 {" and ","foobar"}
 {" and ","bar"}

Is that possible with regexp_matches?


[1] https://www.postgresql.org/docs/current/functions-matching.html

Re: Mixing greediness in regexp_matches

From

Tom Lane

Date:

23 December 2019, 15:15:40

"Daniel Verite" <daniel@manitou-mail.org> writes:
> The basic idea is to iterate on the rows produced by
>    regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g')
> to break down the string into pairs of (non-matching segment,
> matching segment) so that a final result can be assembled
> from that (setting aside the last non-matching segment, that
> can be retrieved in a final step).
> The difficulty is that the longest strings in the alternation
> should be prioritized, but the starting (.*?) makes the RE
> non-greedy so "foo" is choosen over "foobar". 

I'd try forcing the match to be the whole string, ie

    ^(.*?)(foo|bar|foobar)(.*)$

which would also save some work for restarting the iteration,
since you'd have already captured the all-the-rest substring.

With the endpoints forced, it doesn't really matter whether
the engine deems the RE-as-a-whole to be greedy or not.
I think this would work without needing any explicit greediness
marking for the second and third parts, but I might be wrong
about that detail.

            regards, tom lane

Re: Mixing greediness in regexp_matches

From

Tom Lane

Date:

23 December 2019, 15:26:15

"Daniel Verite" <daniel@manitou-mail.org> writes:
>> The basic idea is to iterate on the rows produced by
>> regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g')
>> to break down the string into pairs of (non-matching segment,
>> matching segment) so that a final result can be assembled
>> from that (setting aside the last non-matching segment, that
>> can be retrieved in a final step).

BTW, just to think outside the box a bit, I wonder whether you
couldn't build this out of regexp_split_to_array:

regression=# select regexp_split_to_array('junkfoolbarfoolishfoobarmore', 'foo|bar|foobar');
 regexp_split_to_array
-----------------------
 {junk,l,"",lish,more}
(1 row)

The idea would be to iterate over the array elements, tracking the
corresponding position in the source string, and re-discovering at
each break which of the original alternatives must've matched.

It's sort of annoying that we don't have a simple "regexp_location"
function that would give you back the starting position of the
first match.

            regards, tom lane

Re: Mixing greediness in regexp_matches

From

"Daniel Verite"

Date:

23 December 2019, 15:58:47

    Tom Lane wrote:

> I'd try forcing the match to be the whole string, ie
>
>         ^(.*?)(foo|bar|foobar)(.*)$
>
> which would also save some work for restarting the iteration,
> since you'd have already captured the all-the-rest substring.

In that case regexp_matches will return 0 or 1 row. In the
above-mentioned example, that would be:

=> select regexp_matches('the string has foo and foobar and bar and more',
 '^(.*?)(foo|foobar|bar)(.*)$', 'g');
             regexp_matches
--------------------------------------------------------
 {"the string has ",foo," and foobar and bar and more"}

So the next iteration would consist of calling regexp_matches()
on result[3], and so on until no match is found.
I think it would work as desired, but probably much less efficiently on
large strings/large number of matches than if a single call of
regexp_matches() could return all matches.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Mixing greediness in regexp_matches

From

"Daniel Verite"

Date:

23 December 2019, 16:10:41

    Tom Lane wrote:

> regression=# select regexp_split_to_array('junkfoolbarfoolishfoobarmore',
> 'foo|bar|foobar');
> regexp_split_to_array
> -----------------------
> {junk,l,"",lish,more}
> (1 row)
>
> The idea would be to iterate over the array elements, tracking the
> corresponding position in the source string, and re-discovering at
> each break which of the original alternatives must've matched.
>
> It's sort of annoying that we don't have a simple "regexp_location"
> function that would give you back the starting position of the
> first match.

It occurred to me too that regexp_split_to_table or array would make
this problem really easy if only it had a mode to capture and return the
matched parts too.

FWIW, in plperl, there's a simple solution:

 $string =~ s/(foobar|foo|...)/$replace{$1}/g

when %replace is a hash of the substitutions %(foo=>baz,...).
The strings in the alternation are tested in their order of
appearance, so you can choose to be greedy or not by just sorting
them by length.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Mixing greediness in regexp_matches

From

Tom Lane

Date:

23 December 2019, 16:16:58

"Daniel Verite" <daniel@manitou-mail.org> writes:
> FWIW, in plperl, there's a simple solution:
>  $string =~ s/(foobar|foo|...)/$replace{$1}/g

Well, our manual does suggest using plperl (or pltcl) when the
built-in pattern match functions aren't adequate ;-)

            regards, tom lane