Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] |
Date | |
Msg-id | d292520f-56d7-4f7e-a99c-d267f85129f2@www.fastmail.com Whole thread Raw |
In response to | Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
> > [ 0005-regexp-positions.patch ]
>
> I took a quick look at this patch. I am on board now with the general
> idea of returning match positions, but I think there are still
> definitional issues to be discussed.
>
> 1. The main idea we might perhaps want to adopt from the Oracle-ish
> regexp functions is a search-start-position argument. I'm not sure
> that this is exciting --- if we intend this function to be a close
> analog of regexp_matches(), it'd be better to leave that off. But
> it deserves explicit consideration.
I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.
> 2. It looks like you modeled this on regexp_matches() to the extent
> of returning a set and using the 'g' flag to control whether the
> set can actually contain more than one row. That's pretty ancient
> precedent ... but we have revisited that design more than once
> because regexp_matches() is just too much of a pain to use. I think
> if we're going to do this, we should learn from history, and provide an
> analog to regexp_match() as well as regexp_matches() right off the bat.
Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.
That sounds like a good idea to me.
> 3. The API convention you chose (separate start and length arrays)
> is perhaps still confusing. When I first looked at the test case
>
> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
> + regexp_positions
> +-------------------
> + ("{4,7}","{3,5}")
> +(1 row)
>
> I thought it was all wrong because it seemed to be identifying
> the substrings 'barbequ' and 'obarb'. If there'd been a different
> number of matches than capture groups, maybe I'd not have been
> confused, but still... I wonder if we'd be better advised to make
> N capture groups produce N two-element arrays, or else mash it all
> into one array of N*2 elements. But this probably depends on which
> way is the easiest to work with in SQL.
The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we stick to it.
> 4. Not sure about the handling of sub-matches.
> There are various plausible definitions we could use:
>
> * We return the position/length of the overall match, never mind
> about any parenthesized subexpressions. This is simple but I think
> it loses significant functionality. As an example, you might have
> a pattern like 'ab*(c*)d+' where what you actually want to know
> is where the 'c's are, but they have to be in the context described
> by the rest of the regexp. Without subexpression-match capability
> that's painful to do.
>
> * If there's a capturing subexpression, return the position/length
> of the first such subexpression, else return the overall match.
> This matches the behavior of substring().
>
> * If there are capturing subexpression(s), return the
> positions/lengths of those, otherwise return the overall match.
> This agrees with the behavior of regexp_match(es), so I'd tend
> to lean to this option, but perhaps it's the hardest to use.
>
> * Return the position/length of the overall match *and* those of
> each capturing subexpression. This is the most flexible choice,
> but I give it low marks since it matches no existing behaviors.
The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.
I think it's best to let regexp_position(s) work the same way as regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.
> As for comments on the patch itself:
>
> * The documentation includes an extraneous entry for regexp_replace,
> and it fails to add the promised paragraph to functions-posix-regexp.
>
> * This bit is evidently copied from regexp_matches:
>
> + /* be sure to copy the input string into the multi-call ctx */
> + matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
>
> regexp_matches needs to save the input string so that
> build_regexp_match_result can copy parts of that. But
> regexp_positions has no such need AFAICS, so I think we
> don't need a long-lived copy of the string.
>
> * I wouldn't use these names in build_regexp_positions_result:
>
> + ArrayType *so_ary;
> + ArrayType *eo_ary;
>
> "so_ary" isn't awful, but it invites confusion with regex's "so"
> field, which hasn't got the same semantics (off by one).
> "eo_ary" is pretty bad because it isn't an "end offset" at all, but
> a length. I'd go for "start_ary" and "length_ary" or some such.
>
> * Test cases seem a bit thin, notably there's no coverage of the
> null-subexpression code path.
Thanks for the comments on the patch itself.
I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour.
/Joel
pgsql-hackers by date: