Home > mailing lists

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] - Mailing list pgsql-hackers

From	Joel Jacobson
Subject	Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date	December 25, 2021 14:02:29
Msg-id	d292520f-56d7-4f7e-a99c-d267f85129f2@www.fastmail.com Whole thread Raw
In response to	Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Tree view

On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:

> "Joel Jacobson" <joel@compiler.org> writes:

> > [ 0005-regexp-positions.patch ]

> I took a quick look at this patch. I am on board now with the general

> idea of returning match positions, but I think there are still

> definitional issues to be discussed.

> 1. The main idea we might perhaps want to adopt from the Oracle-ish

> regexp functions is a search-start-position argument. I'm not sure

> that this is exciting --- if we intend this function to be a close

> analog of regexp_matches(), it'd be better to leave that off. But

> it deserves explicit consideration.

I intentionally made it a close analog of regexp_matches(),

to make it easy for existing users of regexp_matches() to

understand how regexp_positions() works.

> 2. It looks like you modeled this on regexp_matches() to the extent

> of returning a set and using the 'g' flag to control whether the

> set can actually contain more than one row. That's pretty ancient

> precedent ... but we have revisited that design more than once

> because regexp_matches() is just too much of a pain to use. I think

> if we're going to do this, we should learn from history, and provide an

> analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().

OK, so what you are suggesting is we should add both a regexp_position() function,

that would work like regexp_match() but return the position instead,

in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

> 3. The API convention you chose (separate start and length arrays)

> is perhaps still confusing. When I first looked at the test case

> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);

> + regexp_positions

> +-------------------

> + ("{4,7}","{3,5}")

> +(1 row)

> I thought it was all wrong because it seemed to be identifying

> the substrings 'barbequ' and 'obarb'. If there'd been a different

> number of matches than capture groups, maybe I'd not have been

> confused, but still... I wonder if we'd be better advised to make

> N capture groups produce N two-element arrays, or else mash it all

> into one array of N*2 elements. But this probably depends on which

> way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.

Personally, I prefer the version suggested in the latest patch, and suggest we stick to it.

> 4. Not sure about the handling of sub-matches.

> There are various plausible definitions we could use:

> * We return the position/length of the overall match, never mind

> about any parenthesized subexpressions. This is simple but I think

> it loses significant functionality. As an example, you might have

> a pattern like 'ab*(c*)d+' where what you actually want to know

> is where the 'c's are, but they have to be in the context described

> by the rest of the regexp. Without subexpression-match capability

> that's painful to do.

> * If there's a capturing subexpression, return the position/length

> of the first such subexpression, else return the overall match.

> This matches the behavior of substring().

> * If there are capturing subexpression(s), return the

> positions/lengths of those, otherwise return the overall match.

> This agrees with the behavior of regexp_match(es), so I'd tend

> to lean to this option, but perhaps it's the hardest to use.

> * Return the position/length of the overall match *and* those of

> each capturing subexpression. This is the most flexible choice,

> but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users,

but I think one quickly learn the semantics by trying out a few examples,

and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as regexp_match(es),

since otherwise users would have to learn and remember two different behaviours.

> As for comments on the patch itself:

> * The documentation includes an extraneous entry for regexp_replace,

> and it fails to add the promised paragraph to functions-posix-regexp.

> * This bit is evidently copied from regexp_matches:

> + /* be sure to copy the input string into the multi-call ctx */

> + matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,

> regexp_matches needs to save the input string so that

> build_regexp_match_result can copy parts of that. But

> regexp_positions has no such need AFAICS, so I think we

> don't need a long-lived copy of the string.

> * I wouldn't use these names in build_regexp_positions_result:

> + ArrayType *so_ary;

> + ArrayType *eo_ary;

> "so_ary" isn't awful, but it invites confusion with regex's "so"

> field, which hasn't got the same semantics (off by one).

> "eo_ary" is pretty bad because it isn't an "end offset" at all, but

> a length. I'd go for "start_ary" and "length_ary" or some such.

> * Test cases seem a bit thin, notably there's no coverage of the

> null-subexpression code path.

Thanks for the comments on the patch itself.

I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour.

/Joel

pgsql-hackers by date:

From: Amit Kapila
Date: 25 December 2021, 07:20:44
Subject: Re: row filtering for logical replication

From: Pavel Stehule
Date: 25 December 2021, 20:20:36
Subject: Re: Schema variables - new implementation for Postgres 15

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] - Mailing list pgsql-hackers

Previous

Next