Thread: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

> On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:

> Attached is a patch implementing it this way.

This patch no longer applies, can you please submit a rebased version?

--
Daniel Gustafsson        https://vmware.com/




Daniel Gustafsson <daniel@yesql.se> writes:
>> On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:
>> Attached is a patch implementing it this way.

> This patch no longer applies, can you please submit a rebased version?

Also, since 642433707 ("This patch adds new functions regexp_count(),
regexp_instr(), regexp_like(), and regexp_substr(), and extends
regexp_replace() with some new optional arguments") is already in,
we need to think about how this interacts with that.  Do we even
still need any more functionality in this area?  Should we try to
align the APIs?

Those new function APIs have some Oracle-isms that I don't especially
care for, like use of int for what should be a boolean.  Still, users
aren't going to give us a pass for wildly inconsistent APIs just because
some functions came from Oracle and some didn't.

            regards, tom lane



> On 1 Sep 2021, at 16:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:
>>> Attached is a patch implementing it this way.
>
>> This patch no longer applies, can you please submit a rebased version?

On a brief skim, this patch includes the doc stanza for regexp_replace which I
assume is a copy/pasteo.

+        TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starts”,
While “start_positions” is awfully verbose, just “starts” doesn’t really roll
off the tongue.  Perhaps “positions” would be more explanatory?

> Also, since 642433707 ("This patch adds new functions regexp_count(),
> regexp_instr(), regexp_like(), and regexp_substr(), and extends
> regexp_replace() with some new optional arguments") is already in,
> we need to think about how this interacts with that.  Do we even
> still need any more functionality in this area?  Should we try to
> align the APIs?

I can see value in a function like this one, and the API is AFAICT fairly
aligned with what I as a user would expect it to be given what we already have.

--
Daniel Gustafsson        https://vmware.com/




On Thu, Sep 2, 2021, at 00:03, Daniel Gustafsson wrote:
> I can see value in a function like this one, and the API is AFAICT fairly
> aligned with what I as a user would expect it to be given what we already have.

Good to hear and thanks for looking at this patch.

I've fixed the problem due to the recent change in setup_regexp_matches(),
which added a new int parameter "start_search".
I pass 0 as start_search, which I think should give the same behaviour as before.

I also changed the assigned oid values in pg_proc.dat for the two new regexp_positions() catalog functions.

$ make check

=======================
All 209 tests passed.
=======================

/Joel




Attachment
"Joel Jacobson" <joel@compiler.org> writes:
> [ 0005-regexp-positions.patch ]

I took a quick look at this patch.  I am on board now with the general
idea of returning match positions, but I think there are still
definitional issues to be discussed.

1. The main idea we might perhaps want to adopt from the Oracle-ish
regexp functions is a search-start-position argument.  I'm not sure
that this is exciting --- if we intend this function to be a close
analog of regexp_matches(), it'd be better to leave that off.  But
it deserves explicit consideration.

2. It looks like you modeled this on regexp_matches() to the extent
of returning a set and using the 'g' flag to control whether the
set can actually contain more than one row.  That's pretty ancient
precedent ... but we have revisited that design more than once
because regexp_matches() is just too much of a pain to use.  I think
if we're going to do this, we should learn from history, and provide an
analog to regexp_match() as well as regexp_matches() right off the bat.

3. The API convention you chose (separate start and length arrays)
is perhaps still confusing.  When I first looked at the test case

+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+-------------------
+ ("{4,7}","{3,5}")
+(1 row)

I thought it was all wrong because it seemed to be identifying
the substrings 'barbequ' and 'obarb'.  If there'd been a different
number of matches than capture groups, maybe I'd not have been
confused, but still...  I wonder if we'd be better advised to make
N capture groups produce N two-element arrays, or else mash it all
into one array of N*2 elements.  But this probably depends on which
way is the easiest to work with in SQL.

4. Not sure about the handling of sub-matches.
There are various plausible definitions we could use:

* We return the position/length of the overall match, never mind
about any parenthesized subexpressions.  This is simple but I think
it loses significant functionality.  As an example, you might have
a pattern like 'ab*(c*)d+' where what you actually want to know
is where the 'c's are, but they have to be in the context described
by the rest of the regexp.  Without subexpression-match capability
that's painful to do.

* If there's a capturing subexpression, return the position/length
of the first such subexpression, else return the overall match.
This matches the behavior of substring().

* If there are capturing subexpression(s), return the
positions/lengths of those, otherwise return the overall match.
This agrees with the behavior of regexp_match(es), so I'd tend
to lean to this option, but perhaps it's the hardest to use.

* Return the position/length of the overall match *and* those of
each capturing subexpression.  This is the most flexible choice,
but I give it low marks since it matches no existing behaviors.


As for comments on the patch itself:

* The documentation includes an extraneous entry for regexp_replace,
and it fails to add the promised paragraph to functions-posix-regexp.

* This bit is evidently copied from regexp_matches:

+        /* be sure to copy the input string into the multi-call ctx */
+        matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,

regexp_matches needs to save the input string so that
build_regexp_match_result can copy parts of that.  But
regexp_positions has no such need AFAICS, so I think we
don't need a long-lived copy of the string.

* I wouldn't use these names in build_regexp_positions_result:

+    ArrayType  *so_ary;
+    ArrayType  *eo_ary;

"so_ary" isn't awful, but it invites confusion with regex's "so"
field, which hasn't got the same semantics (off by one).
"eo_ary" is pretty bad because it isn't an "end offset" at all, but
a length.  I'd go for "start_ary" and "length_ary" or some such.

* Test cases seem a bit thin, notably there's no coverage of the
null-subexpression code path.

            regards, tom lane



This review has gone unanswered for two months, so I'm marking this patch
Returned with Feedback.  Please feel free to resubmit when a new version of the
patch is available.

--
Daniel Gustafsson        https://vmware.com/




On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
> > [ 0005-regexp-positions.patch ]
>
> I took a quick look at this patch.  I am on board now with the general
> idea of returning match positions, but I think there are still
> definitional issues to be discussed.
>
> 1. The main idea we might perhaps want to adopt from the Oracle-ish
> regexp functions is a search-start-position argument.  I'm not sure
> that this is exciting --- if we intend this function to be a close
> analog of regexp_matches(), it'd be better to leave that off.  But
> it deserves explicit consideration.

I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.

> 2. It looks like you modeled this on regexp_matches() to the extent
> of returning a set and using the 'g' flag to control whether the
> set can actually contain more than one row.  That's pretty ancient
> precedent ... but we have revisited that design more than once
> because regexp_matches() is just too much of a pain to use.  I think
> if we're going to do this, we should learn from history, and provide an
> analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

> 3. The API convention you chose (separate start and length arrays)
> is perhaps still confusing.  When I first looked at the test case
>
> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
> + regexp_positions  
> +-------------------
> + ("{4,7}","{3,5}")
> +(1 row)
>
> I thought it was all wrong because it seemed to be identifying
> the substrings 'barbequ' and 'obarb'.  If there'd been a different
> number of matches than capture groups, maybe I'd not have been
> confused, but still...  I wonder if we'd be better advised to make
> N capture groups produce N two-element arrays, or else mash it all
> into one array of N*2 elements.  But this probably depends on which
> way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we stick to it.

> 4. Not sure about the handling of sub-matches.
> There are various plausible definitions we could use:
>
> * We return the position/length of the overall match, never mind
> about any parenthesized subexpressions.  This is simple but I think
> it loses significant functionality.  As an example, you might have
> a pattern like 'ab*(c*)d+' where what you actually want to know
> is where the 'c's are, but they have to be in the context described
> by the rest of the regexp.  Without subexpression-match capability
> that's painful to do.
>
> * If there's a capturing subexpression, return the position/length
> of the first such subexpression, else return the overall match.
> This matches the behavior of substring().
>
> * If there are capturing subexpression(s), return the
> positions/lengths of those, otherwise return the overall match.
> This agrees with the behavior of regexp_match(es), so I'd tend
> to lean to this option, but perhaps it's the hardest to use.
>
> * Return the position/length of the overall match *and* those of
> each capturing subexpression.  This is the most flexible choice,
> but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.

> As for comments on the patch itself:
>
> * The documentation includes an extraneous entry for regexp_replace,
> and it fails to add the promised paragraph to functions-posix-regexp.
>
> * This bit is evidently copied from regexp_matches:
>
> +        /* be sure to copy the input string into the multi-call ctx */
> +        matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
>
> regexp_matches needs to save the input string so that
> build_regexp_match_result can copy parts of that.  But
> regexp_positions has no such need AFAICS, so I think we
> don't need a long-lived copy of the string.
>
> * I wouldn't use these names in build_regexp_positions_result:
>
> +    ArrayType  *so_ary;
> +    ArrayType  *eo_ary;
>
> "so_ary" isn't awful, but it invites confusion with regex's "so"
> field, which hasn't got the same semantics (off by one).
> "eo_ary" is pretty bad because it isn't an "end offset" at all, but
> a length.  I'd go for "start_ary" and "length_ary" or some such.
>
> * Test cases seem a bit thin, notably there's no coverage of the
> null-subexpression code path.

Thanks for the comments on the patch itself.
I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour.

/Joel