Thread: Mixing greediness in regexp_matches
Hi, When looking into how to implement a global replace of multiple substrings (each with their own replacement) in sql or plpgsql, I'm wondering if/how an RE with an alternation can be used. The basic idea is to iterate on the rows produced by regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g') to break down the string into pairs of (non-matching segment, matching segment) so that a final result can be assembled from that (setting aside the last non-matching segment, that can be retrieved in a final step). The difficulty is that the longest strings in the alternation should be prioritized, but the starting (.*?) makes the RE non-greedy so "foo" is choosen over "foobar". The doc at [1] leaves me unoptimistic when it mentions that: "...when an RE contains both greedy and non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to “eat” relative to each other." Also it gives this example of forcing the RE as a whole to be greedy despite it having a non-greedy sub-RE: regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}') but it doesn't seem to be able to produce the desired result in the case of the RE in the middle being an alternation with strings of different lengths. The ideal RE with a (foo|foobar|bar) alternation, when applied globally to a string like 'the string has foo and foobar and bar and more' would produce something like: {"the string has ","foo"} {" and ","foobar"} {" and ","bar"} Is that possible with regexp_matches? [1] https://www.postgresql.org/docs/current/functions-matching.html
"Daniel Verite" <daniel@manitou-mail.org> writes: > The basic idea is to iterate on the rows produced by > regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g') > to break down the string into pairs of (non-matching segment, > matching segment) so that a final result can be assembled > from that (setting aside the last non-matching segment, that > can be retrieved in a final step). > The difficulty is that the longest strings in the alternation > should be prioritized, but the starting (.*?) makes the RE > non-greedy so "foo" is choosen over "foobar". I'd try forcing the match to be the whole string, ie ^(.*?)(foo|bar|foobar)(.*)$ which would also save some work for restarting the iteration, since you'd have already captured the all-the-rest substring. With the endpoints forced, it doesn't really matter whether the engine deems the RE-as-a-whole to be greedy or not. I think this would work without needing any explicit greediness marking for the second and third parts, but I might be wrong about that detail. regards, tom lane
"Daniel Verite" <daniel@manitou-mail.org> writes: >> The basic idea is to iterate on the rows produced by >> regexp_matches(string, '(.*?)(foo|bar|foobar)', 'g') >> to break down the string into pairs of (non-matching segment, >> matching segment) so that a final result can be assembled >> from that (setting aside the last non-matching segment, that >> can be retrieved in a final step). BTW, just to think outside the box a bit, I wonder whether you couldn't build this out of regexp_split_to_array: regression=# select regexp_split_to_array('junkfoolbarfoolishfoobarmore', 'foo|bar|foobar'); regexp_split_to_array ----------------------- {junk,l,"",lish,more} (1 row) The idea would be to iterate over the array elements, tracking the corresponding position in the source string, and re-discovering at each break which of the original alternatives must've matched. It's sort of annoying that we don't have a simple "regexp_location" function that would give you back the starting position of the first match. regards, tom lane
Tom Lane wrote: > I'd try forcing the match to be the whole string, ie > > ^(.*?)(foo|bar|foobar)(.*)$ > > which would also save some work for restarting the iteration, > since you'd have already captured the all-the-rest substring. In that case regexp_matches will return 0 or 1 row. In the above-mentioned example, that would be: => select regexp_matches('the string has foo and foobar and bar and more', '^(.*?)(foo|foobar|bar)(.*)$', 'g'); regexp_matches -------------------------------------------------------- {"the string has ",foo," and foobar and bar and more"} So the next iteration would consist of calling regexp_matches() on result[3], and so on until no match is found. I think it would work as desired, but probably much less efficiently on large strings/large number of matches than if a single call of regexp_matches() could return all matches. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Tom Lane wrote: > regression=# select regexp_split_to_array('junkfoolbarfoolishfoobarmore', > 'foo|bar|foobar'); > regexp_split_to_array > ----------------------- > {junk,l,"",lish,more} > (1 row) > > The idea would be to iterate over the array elements, tracking the > corresponding position in the source string, and re-discovering at > each break which of the original alternatives must've matched. > > It's sort of annoying that we don't have a simple "regexp_location" > function that would give you back the starting position of the > first match. It occurred to me too that regexp_split_to_table or array would make this problem really easy if only it had a mode to capture and return the matched parts too. FWIW, in plperl, there's a simple solution: $string =~ s/(foobar|foo|...)/$replace{$1}/g when %replace is a hash of the substitutions %(foo=>baz,...). The strings in the alternation are tested in their order of appearance, so you can choose to be greedy or not by just sorting them by length. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > FWIW, in plperl, there's a simple solution: > $string =~ s/(foobar|foo|...)/$replace{$1}/g Well, our manual does suggest using plperl (or pltcl) when the built-in pattern match functions aren't adequate ;-) regards, tom lane