Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] |
Date | |
Msg-id | 2038707.1632246924@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] ("Joel Jacobson" <joel@compiler.org>) |
Responses |
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] |
List | pgsql-hackers |
"Joel Jacobson" <joel@compiler.org> writes: > [ 0005-regexp-positions.patch ] I took a quick look at this patch. I am on board now with the general idea of returning match positions, but I think there are still definitional issues to be discussed. 1. The main idea we might perhaps want to adopt from the Oracle-ish regexp functions is a search-start-position argument. I'm not sure that this is exciting --- if we intend this function to be a close analog of regexp_matches(), it'd be better to leave that off. But it deserves explicit consideration. 2. It looks like you modeled this on regexp_matches() to the extent of returning a set and using the 'g' flag to control whether the set can actually contain more than one row. That's pretty ancient precedent ... but we have revisited that design more than once because regexp_matches() is just too much of a pain to use. I think if we're going to do this, we should learn from history, and provide an analog to regexp_match() as well as regexp_matches() right off the bat. 3. The API convention you chose (separate start and length arrays) is perhaps still confusing. When I first looked at the test case +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$); + regexp_positions +------------------- + ("{4,7}","{3,5}") +(1 row) I thought it was all wrong because it seemed to be identifying the substrings 'barbequ' and 'obarb'. If there'd been a different number of matches than capture groups, maybe I'd not have been confused, but still... I wonder if we'd be better advised to make N capture groups produce N two-element arrays, or else mash it all into one array of N*2 elements. But this probably depends on which way is the easiest to work with in SQL. 4. Not sure about the handling of sub-matches. There are various plausible definitions we could use: * We return the position/length of the overall match, never mind about any parenthesized subexpressions. This is simple but I think it loses significant functionality. As an example, you might have a pattern like 'ab*(c*)d+' where what you actually want to know is where the 'c's are, but they have to be in the context described by the rest of the regexp. Without subexpression-match capability that's painful to do. * If there's a capturing subexpression, return the position/length of the first such subexpression, else return the overall match. This matches the behavior of substring(). * If there are capturing subexpression(s), return the positions/lengths of those, otherwise return the overall match. This agrees with the behavior of regexp_match(es), so I'd tend to lean to this option, but perhaps it's the hardest to use. * Return the position/length of the overall match *and* those of each capturing subexpression. This is the most flexible choice, but I give it low marks since it matches no existing behaviors. As for comments on the patch itself: * The documentation includes an extraneous entry for regexp_replace, and it fails to add the promised paragraph to functions-posix-regexp. * This bit is evidently copied from regexp_matches: + /* be sure to copy the input string into the multi-call ctx */ + matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern, regexp_matches needs to save the input string so that build_regexp_match_result can copy parts of that. But regexp_positions has no such need AFAICS, so I think we don't need a long-lived copy of the string. * I wouldn't use these names in build_regexp_positions_result: + ArrayType *so_ary; + ArrayType *eo_ary; "so_ary" isn't awful, but it invites confusion with regex's "so" field, which hasn't got the same semantics (off by one). "eo_ary" is pretty bad because it isn't an "end offset" at all, but a length. I'd go for "start_ary" and "length_ary" or some such. * Test cases seem a bit thin, notably there's no coverage of the null-subexpression code path. regards, tom lane
pgsql-hackers by date: