Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace - Mailing list pgsql-hackers

From Gilles Darold
Subject Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Date
Msg-id 366af4de-b78e-ef76-5c7a-3d8a16567168@darold.net
Whole thread Raw
In response to Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Le 01/08/2021 à 19:23, Tom Lane a écrit :
I've been working through this patch, and trying to verify
compatibility against Oracle and DB2, and I see some points that need
discussion or at least recording for the archives.

* In Oracle, while the documentation for regexp_instr says that
return_option should only be 0 or 1, experimentation with sqlfiddle
shows that any nonzero value is silently treated as 1.  The patch
raises an error for other values, which I think is a good idea.
(IBM's docs say that DB2 raises an error too, though I can't test
that.)  We don't need to be bug-compatible to that extent.

* What should happen when the subexpression/capture group number of
regexp_instr or regexp_substr exceeds the number of parenthesized
subexpressions of the regexp?  Oracle silently returns a no-match
result (0 or NULL), as does this patch.  However, IBM's docs say
that DB2 raises an error.  I'm inclined to think that this is
likewise taking bug-compatibility too far, and that we should
raise an error like DB2.  There are clearly cases where throwing
an error would help debug a faulty call, while I'm less clear on
a use-case where not throwing an error would be useful.

* IBM's docs say that both regexp_count and regexp_like have
arguments "string, pattern [, start] [, flags]" --- that is,
each of start and flags can be independently specified or omitted.
The patch follows Oracle, which has no start option for 
regexp_like, and where you can't write flags for regexp_count
without writing start.  This is fine by me, because doing these
like DB2 would introduce the same which-argument-is-this issues
as we're being forced to cope with for regexp_replace.  I don't
think we need to accept ambiguity in these cases too.  But it's
worth memorializing this decision in the thread.

* The patch has most of these functions silently ignoring the 'g'
flag, but I think they should raise errors instead.  Oracle doesn't
accept a 'g' flag for these, so why should we?  The only case where
that logic doesn't hold is regexp_replace, because depending on which
syntax you use the 'g' flag might or might not be meaningful.  So
for regexp_replace, I'd vote for silently ignoring 'g' if the
occurrence-number parameter is given, while honoring it if not.

I've already made changes in my local copy per the last item,
but I've not done anything about throwing errors for out-of-range
subexpression numbers.  Anybody have an opinion about that one?


I thought about this while I was implementing the functions and chose to not throw an error because of the Oracle behavior and also with others regular expression implementation. For example in Perl there is no error:


$ perl -e '$str="hello world"; $str =~ s/(l)/$20/; print "$str\n";'
helo world


Usually a regular expression is always tested by its creator to be sure that this the right one and that it does what is expected. But I agree that it could help the writer to debug its RE.


Also if I recall well Oracle and DB2 limit the number of capture groups back references from \1 to \9 for Oracle and \0 to \9 for DB2. I have chosen to not apply this limit, I don't see the interest of such a limitation.



-- 
Gilles Darold
http://www.darold.net/

pgsql-hackers by date:

Previous
From: Gilles Darold
Date:
Subject: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Next
From: Andres Freund
Date:
Subject: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead