Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace - Mailing list pgsql-hackers

From Gilles Darold
Subject Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date
Msg-id d042905c-f0b3-be60-44b8-bd3229cca66e@darold.net
Whole thread Raw
In response to Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace  (Gilles Darold <gilles@darold.net>)
Responses Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace  (Gilles Darold <gilles@darold.net>)
List pgsql-hackers
Le 02/08/2021 à 23:22, Gilles Darold a écrit :
> Le 02/08/2021 à 01:21, Tom Lane a écrit :
>> Gilles Darold <gilles@darold.net> writes:
>>> [ v5-0001-regexp-foo-functions.patch ]
>> I've gone through this whole patch now, and found quite a lot that I did
>> not like.  In no particular order:
>>
>> * Wrapping parentheses around the user's regexp doesn't work.  It can
>> turn an invalid regexp into a valid one: for example 'a)(b' should draw
>> a syntax error.  With this patch, no error would be thrown, but the
>> "outer" parens wouldn't do what you expected.  Worse, it can turn a
>> valid regexp into an invalid one: the metasyntax options described in
>> 9.7.3.4 only work at the start of the regexp.  So we have to handle
>> whole-regexp cases honestly rather than trying to turn them into an
>> instance of the parenthesized-subexpression case.
>>
>> * You did a lot of things quite inefficiently, apparently to avoid
>> touching any existing code.  I think it's better to extend
>> setup_regexp_matches() and replace_text_regexp() a little bit so that
>> they can support the behaviors these new functions need.  In both of
>> them, it's absolutely trivial to allow a search start position to be
>> passed in; and it doesn't take much to teach replace_text_regexp()
>> to replace only the N'th match.
>>
>> * Speaking of N'th, there is not much of anything that I like
>> about Oracle's terminology for the function arguments, and I don't
>> think we ought to adopt it.  If we're documenting the functions as
>> processing the "N'th match", it seems to me to be natural to call
>> the parameter "N" not "occurrence".  Speaking of the "occurrence'th
>> occurrence" is just silly, not to mention long and easy to misspell.
>> Likewise, "position" is a horribly vague term for the search start
>> position; it could be interpreted to mean several other things.
>> "start" seems much better.  "return_opt" is likewise awfully unclear.
>> I went with "endoption" below, though I could be talked into something
>> else.  The only one of Oracle's choices that I like is "subexpr" for
>> subexpression number ... but you went with DB2's rather vague "group"
>> instead.  I don't want to use their "capture group" terminology,
>> because that appears nowhere else in our documentation.  Our existing
>> terminology is "parenthesized subexpression", which seems fine to me
>> (and also agrees with Oracle's docs).
>>
>> * I spent a lot of time on the docs too.  A lot of the syntax specs
>> were wrong (where you put the brackets matters), many of the examples
>> seemed confusingly overcomplicated, and the text explanations needed
>> copy-editing.
>>
>> * Also, the regression tests seemed misguided.  This patch is not
>> responsible for testing the regexp engine as such; we have tests
>> elsewhere that do that.  So I don't think we need complex regexps
>> here.  We just need to verify that the parameters of these functions
>> act properly, and check their error cases.  That can be done much
>> more quickly and straightforwardly than what you had.
>>
>>
>> So here's a revised version that I like better.  I think this
>> is pretty nearly committable, aside from the question of whether
>> a too-large subexpression number should be an error or not.
>
> Thanks a lot for the patch improvement and the guidance. I have read the
> patch and I agree with your choices I think I was too much trying to
> mimic the oraclisms. I don't think we should take care of the too-large
> subexpression number, the regexp writer should always test its regular
> expression and also this will not prevent him to chose the wrong capture
> group number but just a non existing one.


Actually I just found that the regexp_like() function doesn't support 
the start parameter which is something we should support. I saw that 
Oracle do not support it but DB2 does and I think we should also support 
it. I will post a new version of the patch once it is done.


Best regards,

-- 
Gilles Darold




pgsql-hackers by date:

Previous
From: "Andrey V. Lepikhov"
Date:
Subject: Extra code in commit_ts.h
Next
From: vignesh C
Date:
Subject: Re: [HACKERS] logical decoding of two-phase transactions