Thread: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

20 March 2021, 18:48:48

Hi,

This is a new version of the patch that now implements all the XQUERY regexp functions as described in the standard, minus the differences of PostgerSQL regular expression explain in [1].

The standard SQL describe functions like_regex(), occurrences_regex(), position_regex(), substring_regex() and translate_regex() which correspond to the commonly named functions regexp_like(), regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as reported by Chapman Flack in [2]. All these function are implemented in the patch. Syntax of the functions are:

- regexp_like(string, pattern [, flags ])

- regexp_count( string, pattern [, position ] [, flags ])

- regexp_instr( string, pattern [, position ] [, occurrence ] [, returnopt ] [, flags ] [, group ])

- regexp_substr( string, pattern [, position ] [, occurrence ] [, flags ] [, group ])

- regexp_replace(source, pattern, replacement [, position ] [, occurrence ] [, flags ])

In addition to previous patch version I have added the regexp()_like function and extended the existsing regex_replace() function. The patch documents these functions and adds regression tests for all functions. I will add it to the commitfest.

An other regexp functions regexp_positions() that returns all occurrences that matched a POSIX regular expression is also developped by Joel Jacobson, see [2]. This function expands the list of regexp functions described in XQUERY.

[1] https://www.postgresql.org/docs/13/functions-matching.html#FUNCTIONS-POSIX-REGEXP

[2] https://www.postgresql.org/message-id/flat/bf2222d5-909d-408b-8531-95b32f18d4ab%40www.fastmail.com#3ec8ba658eeabcae2ac6ccca33bd1aed

-- 
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/

Attachment

v2-0001-xquery-regexp-functions.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

21 March 2021, 09:21:17

Le 20/03/2021 à 19:48, Gilles Darold a écrit :

Hi,

This is a new version of the patch that now implements all the XQUERY regexp functions as described in the standard, minus the differences of PostgerSQL regular expression explain in [1].

The standard SQL describe functions like_regex(), occurrences_regex(), position_regex(), substring_regex() and translate_regex() which correspond to the commonly named functions regexp_like(), regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as reported by Chapman Flack in [2]. All these function are implemented in the patch. Syntax of the functions are:

- regexp_like(string, pattern [, flags ])
- regexp_count( string, pattern [, position ] [, flags ])
- regexp_instr( string, pattern [, position ] [, occurrence ] [, returnopt ] [, flags ] [, group ])
- regexp_substr( string, pattern [, position ] [, occurrence ] [, flags ] [, group ])
- regexp_replace(source, pattern, replacement [, position ] [, occurrence ] [, flags ])

In addition to previous patch version I have added the regexp()_like function and extended the existsing regex_replace() function. The patch documents these functions and adds regression tests for all functions. I will add it to the commitfest.

An other regexp functions regexp_positions() that returns all occurrences that matched a POSIX regular expression is also developped by Joel Jacobson, see [2]. This function expands the list of regexp functions described in XQUERY.

[1] https://www.postgresql.org/docs/13/functions-matching.html#FUNCTIONS-POSIX-REGEXP
[2] https://www.postgresql.org/message-id/flat/bf2222d5-909d-408b-8531-95b32f18d4ab%40www.fastmail.com#3ec8ba658eeabcae2ac6ccca33bd1aed

I would like to see these functions in PG 14 but it is a bit too late, added to commitfest 2021-07.

--
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

er@xs4all.nl

Date:

21 March 2021, 11:07:37

> On 2021.03.20. 19:48 Gilles Darold <gilles@darold.net> wrote:
>  
> This is a new version of the patch that now implements all the XQUERY
> regexp functions as described in the standard, minus the differences of
> PostgerSQL regular expression explain in [1].
> 
> The standard SQL describe functions like_regex(), occurrences_regex(),
> position_regex(), substring_regex() and translate_regex() which
> correspond to the commonly named functions regexp_like(),
> regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as
> reported by Chapman Flack in [2]. All these function are implemented in

> [v2-0001-xquery-regexp-functions.patch]

Hi,

Apply, compile and (world)check are fine. I haven't found errors in functionality.

I went through the docs, and came up with these changes in func.sgml, and pg_proc.dat.

Useful functions - thanks!

Erik Rijkers

Attachment

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

21 March 2021, 13:19:13

Le 21/03/2021 à 12:07, er@xs4all.nl a écrit :
>> On 2021.03.20. 19:48 Gilles Darold <gilles@darold.net> wrote:
>>
>> This is a new version of the patch that now implements all the XQUERY
>> regexp functions as described in the standard, minus the differences of
>> PostgerSQL regular expression explain in [1].
>>
>> The standard SQL describe functions like_regex(), occurrences_regex(),
>> position_regex(), substring_regex() and translate_regex() which
>> correspond to the commonly named functions regexp_like(),
>> regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as
>> reported by Chapman Flack in [2]. All these function are implemented in
>> [v2-0001-xquery-regexp-functions.patch]
> Hi,
>
> Apply, compile and (world)check are fine. I haven't found errors in functionality.
>
> I went through the docs, and came up with these changes in func.sgml, and pg_proc.dat.
>
> Useful functions - thanks!
>
> Erik Rijkers


Thanks a lot Erik, here is a version of the patch with your corrections.


--
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/

Attachment

v3-0001-xquery-regexp-functions.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Chapman Flack

Date:

21 March 2021, 14:42:34

On 03/21/21 09:19, Gilles Darold wrote:
>>> On 2021.03.20. 19:48 Gilles Darold <gilles@darold.net> wrote:
>>>  
>>> This is a new version of the patch that now implements all the XQUERY
>>> regexp functions as described in the standard, minus the differences of
>>> PostgerSQL regular expression explain in [1].
>>>
>>> The standard SQL describe functions like_regex(), occurrences_regex(),
>>> position_regex(), substring_regex() and translate_regex() which
>>> correspond to the commonly named functions regexp_like(),
>>> regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as
>>> reported by Chapman Flack in [2]. All these function are implemented in
>>> [v2-0001-xquery-regexp-functions.patch]

I quickly looked over this patch preparing to object if it actually
purported to implement the ISO foo_regex() named functions without
the ISO semantics, but a quick grep reassured me that it doesn't
implement any of those functions. It only supplies functions in
the alternative, apparently common de facto naming scheme regexp_foo().

To be clear, I think that's the right call. I do not think it would be
a good idea to supply functions that have the ISO names but not the
specified regex dialect.

A set of functions analogous to the ISO ones but differently named and
with a different regex dialect seems fine to me, especially if these
different names are de facto common, and as far as I can tell, that is
what this patch provides. So I have no objection to that. :)

It might then be fair to say that the /description/ of the patch as
implementing the XQuery-based foo_regex functions isn't quite right,
or at least carries a risk of jarring some readers into hasty
double-takes on Sunday mornings before coffee.

It might be clearer to just mention the close correspondence between
the functions in this differently-named set and the corresponding ISO ones.

If this turns out to be a case of "attached the wrong patch, here's
the one that does implement foo_regex functions!" then I reserve an
objection to that. :)

Regards,
-Chap

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

21 March 2021, 14:53:17

Chapman Flack <chap@anastigmatix.net> writes:
> If this turns out to be a case of "attached the wrong patch, here's
> the one that does implement foo_regex functions!" then I reserve an
> objection to that. :)

+1 to that.  Just to add a note, I do have some ideas about extending
our regex parser so that it could duplicate the XQuery syntax --- none
of the points we mention in 9.7.3.8 seem insurmountable.  I'm not
planning to work on that in the near future, mind you, but I definitely
think that we don't want to paint ourselves into a corner where we've
already implemented the XQuery regex functions with the wrong behavior.

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

21 March 2021, 16:40:45

Le 21/03/2021 à 15:53, Tom Lane a écrit :

Chapman Flack <chap@anastigmatix.net> writes:

If this turns out to be a case of "attached the wrong patch, here's
the one that does implement foo_regex functions!" then I reserve an
objection to that. :)

+1 to that.  Just to add a note, I do have some ideas about extending
our regex parser so that it could duplicate the XQuery syntax --- none
of the points we mention in 9.7.3.8 seem insurmountable.  I'm not
planning to work on that in the near future, mind you, but I definitely
think that we don't want to paint ourselves into a corner where we've
already implemented the XQuery regex functions with the wrong behavior.
			regards, tom lane

I apologize for confusing with the words and phrases I have used. This patch implements the regexp_foo () functions which are available in most RDBMS with the behavior described in the documentation. I have modified the title of the patch in the commitfest to removed wrong use of XQUERY.

I don't know too if the other RDBMS respect the XQUERY behavior but for what I've seen for Oracle they are using limited regexp modifiers with sometime not the same letter than PostgreSQL for the same behavior. I have implemented these functions with the Oracle behavior in Orafce [1] with a function that checks the modifiers used. This patch doesn't mimic the Oracle behavior, it use the PostgreSQL behavior with regexp, the one used by regex_replace() and regex_matches(). All regexp modifiers can be used.

[1] https://github.com/orafce/orafce/blob/master/orafce--3.14--3.15.sql

-- 
Gilles Darold
http://www.darold.net/

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

21 March 2021, 16:46:47

Le 21/03/2021 à 15:53, Tom Lane a écrit :

Chapman Flack <chap@anastigmatix.net> writes:

If this turns out to be a case of "attached the wrong patch, here's
the one that does implement foo_regex functions!" then I reserve an
objection to that. :)

And the patch renamed.

Attachment

v4-0001-regexp-foo-functions.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

26 July 2021, 19:56:01

Gilles Darold <gillesdarold@gmail.com> writes:
> [ v4-0001-regexp-foo-functions.patch ]

I started to work through this and was distressed to realize that
it's trying to redefine regexp_replace() in an incompatible way.
We already have

regression=# \df regexp_replace
                               List of functions
   Schema   |      Name      | Result data type |  Argument data types   | Type
------------+----------------+------------------+------------------------+------
 pg_catalog | regexp_replace | text             | text, text, text       | func
 pg_catalog | regexp_replace | text             | text, text, text, text | func
(2 rows)

The patch proposes to add (among other alternatives)

+{ oid => '9608', descr => 'replace text using regexp',
+  proname => 'regexp_replace', prorettype => 'text',
+  proargtypes => 'text text text int4', prosrc => 'textregexreplace_extended_no_occurrence' },

which is going to be impossibly confusing for both humans and machines.
I don't think we should go there.  Even if you managed to construct
examples that didn't result in "ambiguous function" failures, that
doesn't mean that ordinary mortals won't get bit that way.

I'm inclined to just drop the regexp_replace additions.  I don't think
that the extra parameters Oracle provides here are especially useful.
They're definitely not useful enough to justify creating compatibility
hazards for.

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

27 July 2021, 09:38:36

Le 26/07/2021 à 21:56, Tom Lane a écrit :
> Gilles Darold <gillesdarold@gmail.com> writes:
>> [ v4-0001-regexp-foo-functions.patch ]
> I started to work through this and was distressed to realize that
> it's trying to redefine regexp_replace() in an incompatible way.
> We already have
>
> regression=# \df regexp_replace
>                                List of functions
>    Schema   |      Name      | Result data type |  Argument data types   | Type
> ------------+----------------+------------------+------------------------+------
>  pg_catalog | regexp_replace | text             | text, text, text       | func
>  pg_catalog | regexp_replace | text             | text, text, text, text | func
> (2 rows)
>
> The patch proposes to add (among other alternatives)
>
> +{ oid => '9608', descr => 'replace text using regexp',
> +  proname => 'regexp_replace', prorettype => 'text',
> +  proargtypes => 'text text text int4', prosrc => 'textregexreplace_extended_no_occurrence' },
>
> which is going to be impossibly confusing for both humans and machines.
> I don't think we should go there.  Even if you managed to construct
> examples that didn't result in "ambiguous function" failures, that
> doesn't mean that ordinary mortals won't get bit that way.
>
> I'm inclined to just drop the regexp_replace additions.  I don't think
> that the extra parameters Oracle provides here are especially useful.
> They're definitely not useful enough to justify creating compatibility
> hazards for.


I would not say that being able to replace the Nth occurrence of a
pattern matching is not useful but i agree that this is not a common
case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form
and I have though that we can not have compatibility issues because of
the different data type at the 4th parameter. Anyway, maybe we can just
rename the function even if I would prefer that regexp_replace() be
extended. For example:


    regexp_replace(source, pattern, replacement [, flags ]);

    regexp_substitute(source, pattern, replacement [, position ] [,
occurrence ] [, flags ]);


of course with only 3 parameters the two functions are the same.


What do you think about the renaming proposal instead of simply drop the
extended form of the function?


Best regards,


[1] https://docs.oracle.com/database/121/SQLRF/functions163.htm#SQLRF06302

[2] https://www.ibm.com/docs/en/db2oc?topic=functions-regexp-replace


--
Gilles Darold
http://www.darold.net/

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

30 July 2021, 21:38:24

Gilles Darold <gilles@darold.net> writes:
> Le 26/07/2021 à 21:56, Tom Lane a écrit :
>> I'm inclined to just drop the regexp_replace additions.  I don't think
>> that the extra parameters Oracle provides here are especially useful.
>> They're definitely not useful enough to justify creating compatibility
>> hazards for.

> I would not say that being able to replace the Nth occurrence of a
> pattern matching is not useful but i agree that this is not a common
> case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form
> and I have though that we can not have compatibility issues because of
> the different data type at the 4th parameter.

Well, here's an example of the potential issues:

regression=# create function rr(text,text,text,text) returns text
regression-# language sql as $$select 'text'$$;
CREATE FUNCTION
regression=# create function rr(text,text,text,int4) returns text
language sql as $$select 'int4'$$;
CREATE FUNCTION
regression=# select rr('a','b','c','d');
  rr  
------
 text
(1 row)

regression=# select rr('a','b','c',42);
  rr  
------
 int4
(1 row)

So far so good, but:

regression=# prepare rr as select rr('a','b','c',$1);
PREPARE
regression=# execute rr(12);  
  rr  
------
 text
(1 row)

So somebody trying to use the 4-parameter Oracle form from, say, JDBC
would get bit if they were sloppy about specifying parameter types.

The one saving grace is that digits aren't valid regexp flags,
so the outcome would be something like

regression=# select regexp_replace('a','b','c','12');
ERROR:  invalid regular expression option: "1"

which'd be less difficult to debug than silent misbehavior.
Conversely, if you thought you were passing flags but it somehow
got interpreted as a start position, that would fail too:

regression=# prepare rri as select rr('a','b','c', $1::int);
PREPARE
regression=# execute rri('gi');
ERROR:  invalid input syntax for type integer: "gi"
LINE 1: execute rri('gi');
                    ^

Still, I bet a lot that we'd see periodic bug reports complaining
that it doesn't work.

> Anyway, maybe we can just
> rename the function even if I would prefer that regexp_replace() be
> extended. For example:
>     regexp_replace(source, pattern, replacement [, flags ]);
>     regexp_substitute(source, pattern, replacement [, position ] [,
> occurrence ] [, flags ]);

Hmm.  Of course the entire selling point of this patch seems to be
bug-compatibility with Oracle, so using different names is largely
defeating the point :-(

Maybe we should just hold our noses and do it.  The point that
you'd get a recognizable failure if the wrong function were chosen
reassures me a little bit.  We've seen a lot of cases where this
sort of ambiguity results in the system just silently doing something
different from what you expected, and I was afraid that that could
happen here.

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

01 August 2021, 17:23:13

I've been working through this patch, and trying to verify
compatibility against Oracle and DB2, and I see some points that need
discussion or at least recording for the archives.

* In Oracle, while the documentation for regexp_instr says that
return_option should only be 0 or 1, experimentation with sqlfiddle
shows that any nonzero value is silently treated as 1.  The patch
raises an error for other values, which I think is a good idea.
(IBM's docs say that DB2 raises an error too, though I can't test
that.)  We don't need to be bug-compatible to that extent.

* What should happen when the subexpression/capture group number of
regexp_instr or regexp_substr exceeds the number of parenthesized
subexpressions of the regexp?  Oracle silently returns a no-match
result (0 or NULL), as does this patch.  However, IBM's docs say
that DB2 raises an error.  I'm inclined to think that this is
likewise taking bug-compatibility too far, and that we should
raise an error like DB2.  There are clearly cases where throwing
an error would help debug a faulty call, while I'm less clear on
a use-case where not throwing an error would be useful.

* IBM's docs say that both regexp_count and regexp_like have
arguments "string, pattern [, start] [, flags]" --- that is,
each of start and flags can be independently specified or omitted.
The patch follows Oracle, which has no start option for 
regexp_like, and where you can't write flags for regexp_count
without writing start.  This is fine by me, because doing these
like DB2 would introduce the same which-argument-is-this issues
as we're being forced to cope with for regexp_replace.  I don't
think we need to accept ambiguity in these cases too.  But it's
worth memorializing this decision in the thread.

* The patch has most of these functions silently ignoring the 'g'
flag, but I think they should raise errors instead.  Oracle doesn't
accept a 'g' flag for these, so why should we?  The only case where
that logic doesn't hold is regexp_replace, because depending on which
syntax you use the 'g' flag might or might not be meaningful.  So
for regexp_replace, I'd vote for silently ignoring 'g' if the
occurrence-number parameter is given, while honoring it if not.

I've already made changes in my local copy per the last item,
but I've not done anything about throwing errors for out-of-range
subexpression numbers.  Anybody have an opinion about that one?

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

01 August 2021, 19:22:41

Le 30/07/2021 à 23:38, Tom Lane a écrit :

Gilles Darold <gilles@darold.net> writes:

Le 26/07/2021 à 21:56, Tom Lane a écrit :

I'm inclined to just drop the regexp_replace additions.  I don't think
that the extra parameters Oracle provides here are especially useful.
They're definitely not useful enough to justify creating compatibility
hazards for.

I would not say that being able to replace the Nth occurrence of a
pattern matching is not useful but i agree that this is not a common
case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form
and I have though that we can not have compatibility issues because of
the different data type at the 4th parameter.

Well, here's an example of the potential issues:

[...]

Thanks for pointing me this case, I did not think that the prepared statement could lead to this confusion.

Anyway, maybe we can just
rename the function even if I would prefer that regexp_replace() be
extended. For example:
    regexp_replace(source, pattern, replacement [, flags ]);
    regexp_substitute(source, pattern, replacement [, position ] [,
occurrence ] [, flags ]);

Hmm.  Of course the entire selling point of this patch seems to be
bug-compatibility with Oracle, so using different names is largely
defeating the point :-(

Maybe we should just hold our noses and do it.  The point that
you'd get a recognizable failure if the wrong function were chosen
reassures me a little bit.  We've seen a lot of cases where this
sort of ambiguity results in the system just silently doing something
different from what you expected, and I was afraid that that could
happen here.

I join a new version of the patch that include a check of the option parameter in the basic form of regexp_replace() and return an error in ambiguous cases.

PREPARE rr AS SELECT regexp_replace('healthy, wealthy, and wise','(\w+)thy', '\1ish', $1);
EXECUTE rr(1);
ERROR: ambiguous use of the option parameter in regex_replace(), value: 1
HINT: you might set the occurrence parameter to force the use of the extended form of regex_replace()

This is done by checking if the option parameter value is an integer and throw the error in this case. I don't think of anything better.

Best regards,

-- 
Gilles Darold

Attachment

v5-0001-regexp-foo-functions.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

01 August 2021, 19:48:00

Le 01/08/2021 à 19:23, Tom Lane a écrit :

I've been working through this patch, and trying to verify
compatibility against Oracle and DB2, and I see some points that need
discussion or at least recording for the archives.

* In Oracle, while the documentation for regexp_instr says that
return_option should only be 0 or 1, experimentation with sqlfiddle
shows that any nonzero value is silently treated as 1.  The patch
raises an error for other values, which I think is a good idea.
(IBM's docs say that DB2 raises an error too, though I can't test
that.)  We don't need to be bug-compatible to that extent.

* What should happen when the subexpression/capture group number of
regexp_instr or regexp_substr exceeds the number of parenthesized
subexpressions of the regexp?  Oracle silently returns a no-match
result (0 or NULL), as does this patch.  However, IBM's docs say
that DB2 raises an error.  I'm inclined to think that this is
likewise taking bug-compatibility too far, and that we should
raise an error like DB2.  There are clearly cases where throwing
an error would help debug a faulty call, while I'm less clear on
a use-case where not throwing an error would be useful.

* IBM's docs say that both regexp_count and regexp_like have
arguments "string, pattern [, start] [, flags]" --- that is,
each of start and flags can be independently specified or omitted.
The patch follows Oracle, which has no start option for 
regexp_like, and where you can't write flags for regexp_count
without writing start.  This is fine by me, because doing these
like DB2 would introduce the same which-argument-is-this issues
as we're being forced to cope with for regexp_replace.  I don't
think we need to accept ambiguity in these cases too.  But it's
worth memorializing this decision in the thread.

* The patch has most of these functions silently ignoring the 'g'
flag, but I think they should raise errors instead.  Oracle doesn't
accept a 'g' flag for these, so why should we?  The only case where
that logic doesn't hold is regexp_replace, because depending on which
syntax you use the 'g' flag might or might not be meaningful.  So
for regexp_replace, I'd vote for silently ignoring 'g' if the
occurrence-number parameter is given, while honoring it if not.

I've already made changes in my local copy per the last item,
but I've not done anything about throwing errors for out-of-range
subexpression numbers.  Anybody have an opinion about that one?

I thought about this while I was implementing the functions and chose to not throw an error because of the Oracle behavior and also with others regular expression implementation. For example in Perl there is no error:

$ perl -e '$str="hello world"; $str =~ s/(l)/$20/; print "$str\n";'
helo world

Usually a regular expression is always tested by its creator to be sure that this the right one and that it does what is expected. But I agree that it could help the writer to debug its RE.

Also if I recall well Oracle and DB2 limit the number of capture groups back references from \1 to \9 for Oracle and \0 to \9 for DB2. I have chosen to not apply this limit, I don't see the interest of such a limitation.

-- 
Gilles Darold
http://www.darold.net/

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

01 August 2021, 23:21:55

Gilles Darold <gilles@darold.net> writes:
> [ v5-0001-regexp-foo-functions.patch ]

I've gone through this whole patch now, and found quite a lot that I did
not like.  In no particular order:

* Wrapping parentheses around the user's regexp doesn't work.  It can
turn an invalid regexp into a valid one: for example 'a)(b' should draw
a syntax error.  With this patch, no error would be thrown, but the
"outer" parens wouldn't do what you expected.  Worse, it can turn a
valid regexp into an invalid one: the metasyntax options described in
9.7.3.4 only work at the start of the regexp.  So we have to handle
whole-regexp cases honestly rather than trying to turn them into an
instance of the parenthesized-subexpression case.

* You did a lot of things quite inefficiently, apparently to avoid
touching any existing code.  I think it's better to extend
setup_regexp_matches() and replace_text_regexp() a little bit so that
they can support the behaviors these new functions need.  In both of
them, it's absolutely trivial to allow a search start position to be
passed in; and it doesn't take much to teach replace_text_regexp()
to replace only the N'th match.

* Speaking of N'th, there is not much of anything that I like
about Oracle's terminology for the function arguments, and I don't
think we ought to adopt it.  If we're documenting the functions as
processing the "N'th match", it seems to me to be natural to call
the parameter "N" not "occurrence".  Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell.
Likewise, "position" is a horribly vague term for the search start
position; it could be interpreted to mean several other things.
"start" seems much better.  "return_opt" is likewise awfully unclear.
I went with "endoption" below, though I could be talked into something
else.  The only one of Oracle's choices that I like is "subexpr" for
subexpression number ... but you went with DB2's rather vague "group"
instead.  I don't want to use their "capture group" terminology,
because that appears nowhere else in our documentation.  Our existing
terminology is "parenthesized subexpression", which seems fine to me
(and also agrees with Oracle's docs).

* I spent a lot of time on the docs too.  A lot of the syntax specs
were wrong (where you put the brackets matters), many of the examples
seemed confusingly overcomplicated, and the text explanations needed
copy-editing.

* Also, the regression tests seemed misguided.  This patch is not
responsible for testing the regexp engine as such; we have tests
elsewhere that do that.  So I don't think we need complex regexps
here.  We just need to verify that the parameters of these functions
act properly, and check their error cases.  That can be done much
more quickly and straightforwardly than what you had.


So here's a revised version that I like better.  I think this
is pretty nearly committable, aside from the question of whether
a too-large subexpression number should be an error or not.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a5b6adc4bb..80aac4965e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3108,6 +3108,78 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>

+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_count</primary>
+        </indexterm>
+        <function>regexp_count</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type> 
+         [, <parameter>start</parameter> <type>integer</type>
+         [, <parameter>flags</parameter> <type>text</type> ] ] )
+        <returnvalue>integer</returnvalue>
+       </para>
+       <para>
+        Returns the number of times the POSIX regular
+        expression <parameter>pattern</parameter> matches in
+        the <parameter>string</parameter>; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_count('123456789012', '\d\d\d', 2)</literal>
+        <returnvalue>3</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_instr</primary>
+        </indexterm>
+        <function>regexp_instr</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type> 
+         [, <parameter>start</parameter> <type>integer</type>
+         [, <parameter>N</parameter> <type>integer</type>
+         [, <parameter>endoption</parameter> <type>integer</type>
+         [, <parameter>flags</parameter> <type>text</type>
+         [, <parameter>subexpr</parameter> <type>integer</type> ] ] ] ] ] )
+        <returnvalue>integer</returnvalue>
+       </para>
+       <para>
+        Returns the position within <parameter>string</parameter> where
+        the <parameter>N</parameter>'th match of the POSIX regular
+        expression <parameter>pattern</parameter> occurs, or zero if there is
+        no such match; see <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i')</literal>
+        <returnvalue>3</returnvalue>
+       </para>
+       <para>
+        <literal>regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i', 2)</literal>
+        <returnvalue>5</returnvalue>
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_like</primary>
+        </indexterm>
+        <function>regexp_like</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type> 
+         [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Checks whether a match of the POSIX regular
+        expression <parameter>pattern</parameter> occurs
+        within <parameter>string</parameter>; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_like('Hello World', 'world$', 'i')</literal>
+        <returnvalue>t</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
@@ -3117,8 +3189,9 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <returnvalue>text[]</returnvalue>
        </para>
        <para>
-        Returns captured substrings resulting from the first match of a POSIX
-        regular expression to the <parameter>string</parameter>; see
+        Returns captured substrings resulting from the first match of the
+        POSIX regular expression <parameter>pattern</parameter> to
+        the <parameter>string</parameter>; see
         <xref linkend="functions-posix-regexp"/>.
        </para>
        <para>
@@ -3136,10 +3209,11 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <returnvalue>setof text[]</returnvalue>
        </para>
        <para>
-        Returns captured substrings resulting from the first match of a
-        POSIX regular expression to the <parameter>string</parameter>,
-        or multiple matches if the <literal>g</literal> flag is used;
-        see <xref linkend="functions-posix-regexp"/>.
+        Returns captured substrings resulting from the first match of the
+        POSIX regular expression <parameter>pattern</parameter> to
+        the <parameter>string</parameter>, or all matches if
+        the <literal>g</literal> flag is used; see
+        <xref linkend="functions-posix-regexp"/>.
        </para>
        <para>
         <literal>regexp_matches('foobarbequebaz', 'ba.', 'g')</literal>
@@ -3156,14 +3230,16 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
         <indexterm>
          <primary>regexp_replace</primary>
         </indexterm>
-        <function>regexp_replace</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type>, <parameter>replacement</parameter> <type>text</type> [,
<parameter>flags</parameter><type>text</type> ] ) 
+        <function>regexp_replace</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type>, <parameter>replacement</parameter> <type>text</type> 
+         [, <parameter>start</parameter> <type>integer</type> ]
+         [, <parameter>flags</parameter> <type>text</type> ] )
         <returnvalue>text</returnvalue>
        </para>
        <para>
-        Replaces substrings resulting from the first match of a
-        POSIX regular expression, or multiple substring matches
-        if the <literal>g</literal> flag is used; see <xref
-        linkend="functions-posix-regexp"/>.
+        Replaces the substring that is the first match to the POSIX
+        regular expression <parameter>pattern</parameter>, or all matches
+        if the <literal>g</literal> flag is used; see
+        <xref linkend="functions-posix-regexp"/>.
        </para>
        <para>
         <literal>regexp_replace('Thomas', '.[mN]a.', 'M')</literal>
@@ -3171,6 +3247,26 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>

+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <function>regexp_replace</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type>, <parameter>replacement</parameter> <type>text</type>, 
+         <parameter>start</parameter> <type>integer</type>,
+         <parameter>N</parameter> <type>integer</type>
+         [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Replaces the substring that is the <parameter>N</parameter>'th
+        match to the POSIX regular expression <parameter>pattern</parameter>,
+        or all matches if <parameter>N</parameter> is zero; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_replace('Thomas', '.', 'X', 3, 2)</literal>
+        <returnvalue>ThoXas</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
@@ -3213,6 +3309,35 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>

+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_substr</primary>
+        </indexterm>
+        <function>regexp_substr</function> ( <parameter>string</parameter> <type>text</type>,
<parameter>pattern</parameter><type>text</type> 
+         [, <parameter>start</parameter> <type>integer</type>
+         [, <parameter>N</parameter> <type>integer</type>
+         [, <parameter>flags</parameter> <type>text</type>
+         [, <parameter>subexpr</parameter> <type>integer</type> ] ] ] ] )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Returns the substring within <parameter>string</parameter> that
+        matches the <parameter>N</parameter>'th occurrence of the POSIX
+        regular expression <parameter>pattern</parameter>,
+        or <literal>NULL</literal> if there is no such match; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_substr('ABCDEF', 'c(.)(..)', 1, 1, 'i')</literal>
+        <returnvalue>CDEF</returnvalue>
+       </para>
+       <para>
+        <literal>regexp_substr('ABCDEF', 'c(.)(..)', 1, 1, 'i', 2)</literal>
+        <returnvalue>EF</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
@@ -5377,6 +5502,15 @@ substring('foobar' similar '#"o_b#"%' escape '#')    <lineannotation>NULL</linea
    <indexterm>
     <primary>substring</primary>
    </indexterm>
+   <indexterm>
+    <primary>regexp_count</primary>
+   </indexterm>
+   <indexterm>
+    <primary>regexp_instr</primary>
+   </indexterm>
+   <indexterm>
+    <primary>regexp_like</primary>
+   </indexterm>
    <indexterm>
     <primary>regexp_match</primary>
    </indexterm>
@@ -5392,6 +5526,9 @@ substring('foobar' similar '#"o_b#"%' escape '#')    <lineannotation>NULL</linea
    <indexterm>
     <primary>regexp_split_to_array</primary>
    </indexterm>
+   <indexterm>
+    <primary>regexp_substr</primary>
+   </indexterm>

    <para>
     <xref linkend="functions-posix-table"/> lists the available
@@ -5542,6 +5679,109 @@ substring('foobar' from 'o(.)b')   <lineannotation>o</lineannotation>
 </programlisting>
    </para>

+    <para>
+     The <function>regexp_count</function> function counts the number of
+     places where a POSIX regular expression pattern matches a string.
+     It has the syntax
+     <function>regexp_count</function>(<replaceable>string</replaceable>,
+     <replaceable>pattern</replaceable>
+     <optional>, <replaceable>start</replaceable>
+     <optional>, <replaceable>flags</replaceable>
+     </optional></optional>).
+     <replaceable>pattern</replaceable> is searched for
+     in <replaceable>string</replaceable>, normally from the beginning of
+     the string, but if the <replaceable>start</replaceable> parameter is
+     provided then beginning from that character index.
+     The <replaceable>flags</replaceable> parameter is an optional text
+     string containing zero or more single-letter flags that change the
+     function's behavior.  For example, including <literal>i</literal> in
+     <replaceable>flags</replaceable> specifies case-insensitive matching.
+     Supported flags are described in
+     <xref linkend="posix-embedded-options-table"/>.
+    </para>
+
+    <para>
+     Some examples:
+<programlisting>
+regexp_count('ABCABCAXYaxy', 'A.')          <lineannotation>3</lineannotation>
+regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  <lineannotation>4</lineannotation>
+</programlisting>
+    </para>
+
+    <para>
+     The <function>regexp_instr</function> function returns the starting or
+     ending position of the <replaceable>N</replaceable>'th match of a
+     POSIX regular expression pattern to a string, or zero if there is no
+     such match.  It has the syntax
+     <function>regexp_instr</function>(<replaceable>string</replaceable>,
+     <replaceable>pattern</replaceable>
+     <optional>, <replaceable>start</replaceable>
+     <optional>, <replaceable>N</replaceable>
+     <optional>, <replaceable>endoption</replaceable>
+     <optional>, <replaceable>flags</replaceable>
+     <optional>, <replaceable>subexpr</replaceable>
+     </optional></optional></optional></optional></optional>).
+     <replaceable>pattern</replaceable> is searched for
+     in <replaceable>string</replaceable>, normally from the beginning of
+     the string, but if the <replaceable>start</replaceable> parameter is
+     provided then beginning from that character index.
+     If <replaceable>N</replaceable> is specified
+     then the <replaceable>N</replaceable>'th match of the pattern
+     is located, otherwise the first match is located.
+     If the <replaceable>endoption</replaceable> parameter is omitted or
+     specified as zero, the function returns the position of the first
+     character of the match.  Otherwise, <replaceable>endoption</replaceable>
+     must be one, and the function returns the position of the character
+     following the match.
+     The <replaceable>flags</replaceable> parameter is an optional text
+     string containing zero or more single-letter flags that change the
+     function's behavior.  Supported flags are described
+     in <xref linkend="posix-embedded-options-table"/>.
+     For a pattern containing parenthesized
+     subexpressions, <replaceable>subexpr</replaceable> is an integer
+     indicating which subexpression is of interest: the result identifies
+     the position of the substring matching that subexpression.
+     Subexpressions are numbered in the order of their leading parentheses.
+     When <replaceable>subexpr</replaceable> is omitted or zero, the result
+     identifies the position of the whole match regardless of
+     parenthesized subexpressions.
+    </para>
+
+    <para>
+     Some examples:
+<programlisting>
+regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2)
+                                   <lineannotation>23</lineannotation>
+regexp_instr('ABCDEFGHI', '(c..)(...)', 1, 1, 0, 'i', 2)
+                                   <lineannotation>6</lineannotation>
+</programlisting>
+    </para>
+
+    <para>
+     The <function>regexp_like</function> function checks whether a match
+     of a POSIX regular expression pattern occurs within a string,
+     returning boolean true or false.  It has the syntax
+     <function>regexp_like</function>(<replaceable>string</replaceable>,
+     <replaceable>pattern</replaceable>
+     <optional>, <replaceable>flags</replaceable> </optional>).
+     The <replaceable>flags</replaceable> parameter is an optional text
+     string containing zero or more single-letter flags that change the
+     function's behavior.  Supported flags are described
+     in <xref linkend="posix-embedded-options-table"/>.
+     This function has the same results as the <literal>~</literal>
+     operator if no flags are specified.  If only the <literal>i</literal>
+     flag is specified, it has the same results as
+     the <literal>~*</literal> operator.
+    </para>
+
+    <para>
+     Some examples:
+<programlisting>
+regexp_like('Hello World', 'world')       <lineannotation>false</lineannotation>
+regexp_like('Hello World', 'world', 'i')  <lineannotation>true</lineannotation>
+</programlisting>
+    </para>
+
     <para>
      The <function>regexp_match</function> function returns a text array of
      captured substring(s) resulting from the first match of a POSIX
@@ -5579,8 +5819,17 @@ SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
  {bar,beque}
 (1 row)
 </programlisting>
-    In the common case where you just want the whole matching substring
-    or <literal>NULL</literal> for no match, write something like
+   </para>
+
+    <tip>
+     <para>
+      In the common case where you just want the whole matching substring
+      or <literal>NULL</literal> for no match, the best solution is to
+      use <function>regexp_substr()</function>.
+      However, <function>regexp_substr()</function> only exists
+      in <productname>PostgreSQL</productname> version 15 and up.  When
+      working in older versions, you can extract the first element
+      of <function>regexp_match()</function>'s result, for example:
 <programlisting>
 SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
  regexp_match
@@ -5588,7 +5837,8 @@ SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
  barbeque
 (1 row)
 </programlisting>
-   </para>
+     </para>
+    </tip>

     <para>
      The <function>regexp_matches</function> function returns a set of text arrays
@@ -5650,7 +5900,13 @@ SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
      It has the syntax
      <function>regexp_replace</function>(<replaceable>source</replaceable>,
      <replaceable>pattern</replaceable>, <replaceable>replacement</replaceable>
+     <optional>, <replaceable>start</replaceable>
+     <optional>, <replaceable>N</replaceable>
+     </optional></optional>
      <optional>, <replaceable>flags</replaceable> </optional>).
+     (Notice that <replaceable>N</replaceable> cannot be specified
+     unless <replaceable>start</replaceable> is,
+     but <replaceable>flags</replaceable> can be given in any case.)
      The <replaceable>source</replaceable> string is returned unchanged if
      there is no match to the <replaceable>pattern</replaceable>.  If there is a
      match, the <replaceable>source</replaceable> string is returned with the
@@ -5663,11 +5919,22 @@ SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
      substring matching the entire pattern should be inserted.  Write
      <literal>\\</literal> if you need to put a literal backslash in the replacement
      text.
+     <replaceable>pattern</replaceable> is searched for
+     in <replaceable>string</replaceable>, normally from the beginning of
+     the string, but if the <replaceable>start</replaceable> parameter is
+     provided then beginning from that character index.
+     By default, only the first match of the pattern is replaced.
+     If <replaceable>N</replaceable> is specified and is greater than zero,
+     then the <replaceable>N</replaceable>'th match of the pattern
+     is replaced.
+     If the <literal>g</literal> flag is given, or
+     if <replaceable>N</replaceable> is specified and is zero, then all
+     matches at or after the <replaceable>start</replaceable> position are
+     replaced.  (The <literal>g</literal> flag is ignored
+     when <replaceable>N</replaceable> is specified.)
      The <replaceable>flags</replaceable> parameter is an optional text
      string containing zero or more single-letter flags that change the
-     function's behavior.  Flag <literal>i</literal> specifies case-insensitive
-     matching, while flag <literal>g</literal> specifies replacement of each matching
-     substring rather than only the first one.  Supported flags (though
+     function's behavior.  Supported flags (though
      not <literal>g</literal>) are
      described in <xref linkend="posix-embedded-options-table"/>.
     </para>
@@ -5681,6 +5948,10 @@ regexp_replace('foobarbaz', 'b..', 'X', 'g')
                                    <lineannotation>fooXX</lineannotation>
 regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
                                    <lineannotation>fooXarYXazY</lineannotation>
+regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i')
+                                   <lineannotation>X PXstgrXSQL fXnctXXn</lineannotation>
+regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 3, 'i')
+                                   <lineannotation>A PostgrXSQL function</lineannotation>
 </programlisting>
    </para>

@@ -5712,7 +5983,6 @@ regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
    <para>
     Some examples:
 <programlisting>
-
 SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo;
   foo
 -------
@@ -5761,11 +6031,51 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     zero-length matches that occur at the start or end of the string
     or immediately after a previous match.  This is contrary to the strict
     definition of regexp matching that is implemented by
-    <function>regexp_match</function> and
-    <function>regexp_matches</function>, but is usually the most convenient behavior
+    the other regexp functions, but is usually the most convenient behavior
     in practice.  Other software systems such as Perl use similar definitions.
    </para>

+    <para>
+     The <function>regexp_substr</function> function returns the substring
+     that matches a POSIX regular expression pattern,
+     or <literal>NULL</literal> if there is no match.  It has the syntax
+     <function>regexp_substr</function>(<replaceable>string</replaceable>,
+     <replaceable>pattern</replaceable>
+     <optional>, <replaceable>start</replaceable>
+     <optional>, <replaceable>N</replaceable>
+     <optional>, <replaceable>flags</replaceable>
+     <optional>, <replaceable>subexpr</replaceable>
+     </optional></optional></optional></optional>).
+     <replaceable>pattern</replaceable> is searched for
+     in <replaceable>string</replaceable>, normally from the beginning of
+     the string, but if the <replaceable>start</replaceable> parameter is
+     provided then beginning from that character index.
+     If <replaceable>N</replaceable> is specified
+     then the <replaceable>N</replaceable>'th match of the pattern
+     is returned, otherwise the first match is returned.
+     The <replaceable>flags</replaceable> parameter is an optional text
+     string containing zero or more single-letter flags that change the
+     function's behavior.  Supported flags are described
+     in <xref linkend="posix-embedded-options-table"/>.
+     For a pattern containing parenthesized
+     subexpressions, <replaceable>subexpr</replaceable> is an integer
+     indicating which subexpression is of interest: the result is the
+     substring matching that subexpression.
+     Subexpressions are numbered in the order of their leading parentheses.
+     When <replaceable>subexpr</replaceable> is omitted or zero, the result
+     is the whole match regardless of parenthesized subexpressions.
+    </para>
+
+    <para>
+     Some examples:
+<programlisting>
+regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2)
+                                   <lineannotation> town zip</lineannotation>
+regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2)
+                                   <lineannotation>FGH</lineannotation>
+</programlisting>
+    </para>
+
 <!-- derived from the re_syntax.n man page -->

    <sect3 id="posix-syntax-details">
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index a32c5c82ab..484d4265fd 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -113,6 +113,7 @@ static cached_re_str re_array[MAX_CACHED_RES];    /* cached re's */
 /* Local functions */
 static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
                                                 pg_re_flags *flags,
+                                                int start_search,
                                                 Oid collation,
                                                 bool use_subpatterns,
                                                 bool ignore_degenerate,
@@ -629,7 +630,7 @@ textregexreplace_noopt(PG_FUNCTION_ARGS)

     re = RE_compile_and_cache(p, REG_ADVANCED, PG_GET_COLLATION());

-    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, false));
+    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0, 1));
 }

 /*
@@ -646,11 +647,97 @@ textregexreplace(PG_FUNCTION_ARGS)
     regex_t    *re;
     pg_re_flags flags;

+    /*
+     * regexp_replace() with four arguments will be preferentially resolved as
+     * this form when the fourth argument is of type UNKNOWN.  However, the
+     * user might have intended to call textregexreplace_extended_no_n.  If we
+     * see flags that look like an integer, emit the same error that
+     * parse_re_flags would, but add a HINT about how to fix it.
+     */
+    if (VARSIZE_ANY_EXHDR(opt) > 0)
+    {
+        char       *opt_p = VARDATA_ANY(opt);
+
+        if (*opt_p >= '0' && *opt_p <= '9')
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid regular expression option: \"%.*s\"",
+                            pg_mblen(opt_p), opt_p),
+                     errhint("If you meant to use regexp_replace() with a start parameter, cast the fourth argument to
integerexplicitly."))); 
+    }
+
     parse_re_flags(&flags, opt);

     re = RE_compile_and_cache(p, flags.cflags, PG_GET_COLLATION());

-    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, flags.glob));
+    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, 0,
+                                         flags.glob ? 0 : 1));
+}
+
+/*
+ * textregexreplace_extended()
+ *        Return a string matched by a regular expression, with replacement.
+ *        Extends textregexreplace by allowing a start position and the
+ *        choice of the occurrence to replace (0 means all occurrences).
+ */
+Datum
+textregexreplace_extended(PG_FUNCTION_ARGS)
+{
+    text       *s = PG_GETARG_TEXT_PP(0);
+    text       *p = PG_GETARG_TEXT_PP(1);
+    text       *r = PG_GETARG_TEXT_PP(2);
+    int            start = 1;
+    int            n = 1;
+    text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(5);
+    pg_re_flags re_flags;
+    regex_t    *re;
+
+    /* Collect optional parameters */
+    if (PG_NARGS() > 3)
+    {
+        start = PG_GETARG_INT32(3);
+        if (start <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "start", start)));
+    }
+    if (PG_NARGS() > 4)
+    {
+        n = PG_GETARG_INT32(4);
+        if (n < 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "n", n)));
+    }
+
+    /* Determine options */
+    parse_re_flags(&re_flags, flags);
+
+    /* If N was not specified, deduce it from the 'g' flag */
+    if (PG_NARGS() <= 4)
+        n = re_flags.glob ? 0 : 1;
+
+    /* Compile the regular expression */
+    re = RE_compile_and_cache(p, re_flags.cflags, PG_GET_COLLATION());
+
+    /* Do the replacement(s) */
+    PG_RETURN_TEXT_P(replace_text_regexp(s, (void *) re, r, start - 1, n));
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+textregexreplace_extended_no_n(PG_FUNCTION_ARGS)
+{
+    return textregexreplace_extended(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+textregexreplace_extended_no_flags(PG_FUNCTION_ARGS)
+{
+    return textregexreplace_extended(fcinfo);
 }

 /*
@@ -958,6 +1045,235 @@ similar_escape(PG_FUNCTION_ARGS)
     PG_RETURN_TEXT_P(result);
 }

+/*
+ * regexp_count()
+ *        Return the number of matches of a pattern within a string.
+ */
+Datum
+regexp_count(PG_FUNCTION_ARGS)
+{
+    text       *str = PG_GETARG_TEXT_PP(0);
+    text       *pattern = PG_GETARG_TEXT_PP(1);
+    int            start = 1;
+    text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(3);
+    pg_re_flags re_flags;
+    regexp_matches_ctx *matchctx;
+
+    /* Collect optional parameters */
+    if (PG_NARGS() > 2)
+    {
+        start = PG_GETARG_INT32(2);
+        if (start <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "start", start)));
+    }
+
+    /* Determine options */
+    parse_re_flags(&re_flags, flags);
+    /* User mustn't specify 'g' */
+    if (re_flags.glob)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+        /* translator: %s is a SQL function name */
+                 errmsg("%s does not support the \"global\" option",
+                        "regexp_count()")));
+    /* But we find all the matches anyway */
+    re_flags.glob = true;
+
+    /* Do the matching */
+    matchctx = setup_regexp_matches(str, pattern, &re_flags, start - 1,
+                                    PG_GET_COLLATION(),
+                                    false,    /* can ignore subexprs */
+                                    false, false);
+
+    PG_RETURN_INT32(matchctx->nmatches);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_count_no_start(PG_FUNCTION_ARGS)
+{
+    return regexp_count(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_count_no_flags(PG_FUNCTION_ARGS)
+{
+    return regexp_count(fcinfo);
+}
+
+/*
+ * regexp_instr()
+ *        Return the match's position within the string
+ */
+Datum
+regexp_instr(PG_FUNCTION_ARGS)
+{
+    text       *str = PG_GETARG_TEXT_PP(0);
+    text       *pattern = PG_GETARG_TEXT_PP(1);
+    int            start = 1;
+    int            n = 1;
+    int            endoption = 0;
+    text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(5);
+    int            subexpr = 0;
+    int            pos;
+    pg_re_flags re_flags;
+    regexp_matches_ctx *matchctx;
+
+    /* Collect optional parameters */
+    if (PG_NARGS() > 2)
+    {
+        start = PG_GETARG_INT32(2);
+        if (start <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "start", start)));
+    }
+    if (PG_NARGS() > 3)
+    {
+        n = PG_GETARG_INT32(3);
+        if (n <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "n", n)));
+    }
+    if (PG_NARGS() > 4)
+    {
+        endoption = PG_GETARG_INT32(4);
+        if (endoption != 0 && endoption != 1)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "endoption", endoption)));
+    }
+    if (PG_NARGS() > 6)
+    {
+        subexpr = PG_GETARG_INT32(6);
+        if (subexpr < 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "subexpr", subexpr)));
+    }
+
+    /* Determine options */
+    parse_re_flags(&re_flags, flags);
+    /* User mustn't specify 'g' */
+    if (re_flags.glob)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+        /* translator: %s is a SQL function name */
+                 errmsg("%s does not support the \"global\" option",
+                        "regexp_instr()")));
+    /* But we find all the matches anyway */
+    re_flags.glob = true;
+
+    /* Do the matching */
+    matchctx = setup_regexp_matches(str, pattern, &re_flags, start - 1,
+                                    PG_GET_COLLATION(),
+                                    (subexpr > 0),    /* need submatches? */
+                                    false, false);
+
+    /* When n exceeds matches return 0 (includes case of no matches) */
+    if (n > matchctx->nmatches)
+        PG_RETURN_INT32(0);
+
+    /* When subexpr exceeds number of subexpressions return 0 */
+    if (subexpr > matchctx->npatterns)
+        PG_RETURN_INT32(0);
+
+    /* Select the appropriate match position to return */
+    pos = (n - 1) * matchctx->npatterns;
+    if (subexpr > 0)
+        pos += subexpr - 1;
+    pos *= 2;
+    if (endoption == 1)
+        pos += 1;
+
+    if (matchctx->match_locs[pos] >= 0)
+        PG_RETURN_INT32(matchctx->match_locs[pos] + 1);
+    else
+        PG_RETURN_INT32(0);        /* position not identifiable */
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_instr_no_start(PG_FUNCTION_ARGS)
+{
+    return regexp_instr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_instr_no_n(PG_FUNCTION_ARGS)
+{
+    return regexp_instr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_instr_no_endoption(PG_FUNCTION_ARGS)
+{
+    return regexp_instr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_instr_no_flags(PG_FUNCTION_ARGS)
+{
+    return regexp_instr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_instr_no_subexpr(PG_FUNCTION_ARGS)
+{
+    return regexp_instr(fcinfo);
+}
+
+/*
+ * regexp_like()
+ *        Test for a pattern match within a string.
+ */
+Datum
+regexp_like(PG_FUNCTION_ARGS)
+{
+    text       *str = PG_GETARG_TEXT_PP(0);
+    text       *pattern = PG_GETARG_TEXT_PP(1);
+    text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+    pg_re_flags re_flags;
+
+    /* Determine options */
+    parse_re_flags(&re_flags, flags);
+    /* User mustn't specify 'g' */
+    if (re_flags.glob)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+        /* translator: %s is a SQL function name */
+                 errmsg("%s does not support the \"global\" option",
+                        "regexp_like()")));
+
+    /* Otherwise it's like textregexeq/texticregexeq */
+    PG_RETURN_BOOL(RE_compile_and_execute(pattern,
+                                          VARDATA_ANY(str),
+                                          VARSIZE_ANY_EXHDR(str),
+                                          re_flags.cflags,
+                                          PG_GET_COLLATION(),
+                                          0, NULL));
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_like_no_flags(PG_FUNCTION_ARGS)
+{
+    return regexp_like(fcinfo);
+}
+
 /*
  * regexp_match()
  *        Return the first substring(s) matching a pattern within a string.
@@ -982,7 +1298,7 @@ regexp_match(PG_FUNCTION_ARGS)
                         "regexp_match()"),
                  errhint("Use the regexp_matches function instead.")));

-    matchctx = setup_regexp_matches(orig_str, pattern, &re_flags,
+    matchctx = setup_regexp_matches(orig_str, pattern, &re_flags, 0,
                                     PG_GET_COLLATION(), true, false, false);

     if (matchctx->nmatches == 0)
@@ -1029,7 +1345,7 @@ regexp_matches(PG_FUNCTION_ARGS)

         /* be sure to copy the input string into the multi-call ctx */
         matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
-                                        &re_flags,
+                                        &re_flags, 0,
                                         PG_GET_COLLATION(),
                                         true, false, false);

@@ -1064,24 +1380,28 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 }

 /*
- * setup_regexp_matches --- do the initial matching for regexp_match
- *        and regexp_split functions
+ * setup_regexp_matches --- do the initial matching for regexp_match,
+ *        regexp_split, and related functions
  *
  * To avoid having to re-find the compiled pattern on each call, we do
  * all the matching in one swoop.  The returned regexp_matches_ctx contains
  * the locations of all the substrings matching the pattern.
  *
- * The three bool parameters have only two patterns (one for matching, one for
- * splitting) but it seems clearer to distinguish the functionality this way
- * than to key it all off one "is_split" flag. We don't currently assume that
- * fetching_unmatched is exclusive of fetching the matched text too; if it's
- * set, the conversion buffer is large enough to fetch any single matched or
- * unmatched string, but not any larger substring. (In practice, when splitting
- * the matches are usually small anyway, and it didn't seem worth complicating
- * the code further.)
+ * start_search: the character (not byte) offset in orig_str at which to
+ * begin the search.  Returned positions are relative to orig_str anyway.
+ * use_subpatterns: collect data about matches to parenthesized subexpressions.
+ * ignore_degenerate: ignore zero-length matches.
+ * fetching_unmatched: caller wants to fetch unmatched substrings.
+ *
+ * We don't currently assume that fetching_unmatched is exclusive of fetching
+ * the matched text too; if it's set, the conversion buffer is large enough to
+ * fetch any single matched or unmatched string, but not any larger
+ * substring.  (In practice, when splitting the matches are usually small
+ * anyway, and it didn't seem worth complicating the code further.)
  */
 static regexp_matches_ctx *
 setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
+                     int start_search,
                      Oid collation,
                      bool use_subpatterns,
                      bool ignore_degenerate,
@@ -1099,7 +1419,6 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     int            array_idx;
     int            prev_match_end;
     int            prev_valid_match_end;
-    int            start_search;
     int            maxlen = 0;        /* largest fetch length in characters */

     /* save original string --- we'll extract result substrings from it */
@@ -1142,7 +1461,6 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
     /* search for the pattern, perhaps repeatedly */
     prev_match_end = 0;
     prev_valid_match_end = 0;
-    start_search = 0;
     while (RE_wchar_execute(cpattern, wide_str, wide_len, start_search,
                             pmatch_len, pmatch))
     {
@@ -1367,7 +1685,7 @@ regexp_split_to_table(PG_FUNCTION_ARGS)

         /* be sure to copy the input string into the multi-call ctx */
         splitctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
-                                        &re_flags,
+                                        &re_flags, 0,
                                         PG_GET_COLLATION(),
                                         false, true, true);

@@ -1422,7 +1740,7 @@ regexp_split_to_array(PG_FUNCTION_ARGS)

     splitctx = setup_regexp_matches(PG_GETARG_TEXT_PP(0),
                                     PG_GETARG_TEXT_PP(1),
-                                    &re_flags,
+                                    &re_flags, 0,
                                     PG_GET_COLLATION(),
                                     false, true, true);

@@ -1489,6 +1807,125 @@ build_regexp_split_result(regexp_matches_ctx *splitctx)
     }
 }

+/*
+ * regexp_substr()
+ *        Return the substring that matches a regular expression pattern
+ */
+Datum
+regexp_substr(PG_FUNCTION_ARGS)
+{
+    text       *str = PG_GETARG_TEXT_PP(0);
+    text       *pattern = PG_GETARG_TEXT_PP(1);
+    int            start = 1;
+    int            n = 1;
+    text       *flags = PG_GETARG_TEXT_PP_IF_EXISTS(4);
+    int            subexpr = 0;
+    int            so,
+                eo,
+                pos;
+    pg_re_flags re_flags;
+    regexp_matches_ctx *matchctx;
+
+    /* Collect optional parameters */
+    if (PG_NARGS() > 2)
+    {
+        start = PG_GETARG_INT32(2);
+        if (start <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "start", start)));
+    }
+    if (PG_NARGS() > 3)
+    {
+        n = PG_GETARG_INT32(3);
+        if (n <= 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "n", n)));
+    }
+    if (PG_NARGS() > 5)
+    {
+        subexpr = PG_GETARG_INT32(5);
+        if (subexpr < 0)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for parameter \"%s\": %d",
+                            "subexpr", subexpr)));
+    }
+
+    /* Determine options */
+    parse_re_flags(&re_flags, flags);
+    /* User mustn't specify 'g' */
+    if (re_flags.glob)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+        /* translator: %s is a SQL function name */
+                 errmsg("%s does not support the \"global\" option",
+                        "regexp_substr()")));
+    /* But we find all the matches anyway */
+    re_flags.glob = true;
+
+    /* Do the matching */
+    matchctx = setup_regexp_matches(str, pattern, &re_flags, start - 1,
+                                    PG_GET_COLLATION(),
+                                    (subexpr > 0),    /* need submatches? */
+                                    false, false);
+
+    /* When n exceeds matches return NULL (includes case of no matches) */
+    if (n > matchctx->nmatches)
+        PG_RETURN_NULL();
+
+    /* When subexpr exceeds number of subexpressions return NULL */
+    if (subexpr > matchctx->npatterns)
+        PG_RETURN_NULL();
+
+    /* Select the appropriate match position to return */
+    pos = (n - 1) * matchctx->npatterns;
+    if (subexpr > 0)
+        pos += subexpr - 1;
+    pos *= 2;
+    so = matchctx->match_locs[pos];
+    eo = matchctx->match_locs[pos + 1];
+
+    if (so < 0 || eo < 0)
+        PG_RETURN_NULL();        /* unidentifiable location */
+
+    PG_RETURN_DATUM(DirectFunctionCall3(text_substr,
+                                        PointerGetDatum(matchctx->orig_str),
+                                        Int32GetDatum(so + 1),
+                                        Int32GetDatum(eo - so)));
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_substr_no_start(PG_FUNCTION_ARGS)
+{
+    return regexp_substr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_substr_no_n(PG_FUNCTION_ARGS)
+{
+    return regexp_substr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_substr_no_flags(PG_FUNCTION_ARGS)
+{
+    return regexp_substr(fcinfo);
+}
+
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_substr_no_subexpr(PG_FUNCTION_ARGS)
+{
+    return regexp_substr(fcinfo);
+}
+
 /*
  * regexp_fixed_prefix - extract fixed prefix, if any, for a regexp
  *
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d2a11b1b5d..a0bde4e352 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -4496,23 +4496,28 @@ appendStringInfoRegexpSubstr(StringInfo str, text *replace_text,
 /*
  * replace_text_regexp
  *
- * replace text that matches to regexp in src_text to replace_text.
+ * replace text that matches to regexp in src_text with replace_text.
+ *
+ * search_start: the character (not byte) offset in src_text at which to
+ * begin searching.
+ * n: if 0, replace all matches; if > 0, replace only the N'th match.
  *
  * Note: to avoid having to include regex.h in builtins.h, we declare
  * the regexp argument as void *, but really it's regex_t *.
  */
 text *
 replace_text_regexp(text *src_text, void *regexp,
-                    text *replace_text, bool glob)
+                    text *replace_text,
+                    int search_start, int n)
 {
     text       *ret_text;
     regex_t    *re = (regex_t *) regexp;
     int            src_text_len = VARSIZE_ANY_EXHDR(src_text);
+    int            nmatches = 0;
     StringInfoData buf;
     regmatch_t    pmatch[REGEXP_REPLACE_BACKREF_CNT];
     pg_wchar   *data;
     size_t        data_len;
-    int            search_start;
     int            data_pos;
     char       *start_ptr;
     bool        have_escape;
@@ -4530,7 +4535,6 @@ replace_text_regexp(text *src_text, void *regexp,
     start_ptr = (char *) VARDATA_ANY(src_text);
     data_pos = 0;

-    search_start = 0;
     while (search_start <= data_len)
     {
         int            regexec_result;
@@ -4560,6 +4564,23 @@ replace_text_regexp(text *src_text, void *regexp,
                      errmsg("regular expression failed: %s", errMsg)));
         }

+        /*
+         * Count matches, and decide whether to replace this match.
+         */
+        nmatches++;
+        if (n > 0 && nmatches != n)
+        {
+            /*
+             * No, so advance search_start, but not start_ptr/data_pos. (Thus,
+             * we treat the matched text as if it weren't matched, and copy it
+             * to the output later.)
+             */
+            search_start = pmatch[0].rm_eo;
+            if (pmatch[0].rm_so == pmatch[0].rm_eo)
+                search_start++;
+            continue;
+        }
+
         /*
          * Copy the text to the left of the match position.  Note we are given
          * character not byte indexes.
@@ -4596,9 +4617,9 @@ replace_text_regexp(text *src_text, void *regexp,
         data_pos = pmatch[0].rm_eo;

         /*
-         * When global option is off, replace the first instance only.
+         * If we only want to replace one occurrence, we're done.
          */
-        if (!glob)
+        if (n > 0)
             break;

         /*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8cd0252082..b603700ed9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3565,6 +3565,18 @@
 { oid => '2285', descr => 'replace text using regexp',
   proname => 'regexp_replace', prorettype => 'text',
   proargtypes => 'text text text text', prosrc => 'textregexreplace' },
+{ oid => '9611', descr => 'replace text using regexp',
+  proname => 'regexp_replace', prorettype => 'text',
+  proargtypes => 'text text text int4 int4 text',
+  prosrc => 'textregexreplace_extended' },
+{ oid => '9612', descr => 'replace text using regexp',
+  proname => 'regexp_replace', prorettype => 'text',
+  proargtypes => 'text text text int4 int4',
+  prosrc => 'textregexreplace_extended_no_flags' },
+{ oid => '9613', descr => 'replace text using regexp',
+  proname => 'regexp_replace', prorettype => 'text',
+  proargtypes => 'text text text int4',
+  prosrc => 'textregexreplace_extended_no_n' },
 { oid => '3396', descr => 'find first match for regexp',
   proname => 'regexp_match', prorettype => '_text', proargtypes => 'text text',
   prosrc => 'regexp_match_no_flags' },
@@ -3579,6 +3591,58 @@
   proname => 'regexp_matches', prorows => '10', proretset => 't',
   prorettype => '_text', proargtypes => 'text text text',
   prosrc => 'regexp_matches' },
+{ oid => '9614', descr => 'count regexp matches',
+  proname => 'regexp_count', prorettype => 'int4', proargtypes => 'text text',
+  prosrc => 'regexp_count_no_start' },
+{ oid => '9615', descr => 'count regexp matches',
+  proname => 'regexp_count', prorettype => 'int4',
+  proargtypes => 'text text int4', prosrc => 'regexp_count_no_flags' },
+{ oid => '9616', descr => 'count regexp matches',
+  proname => 'regexp_count', prorettype => 'int4',
+  proargtypes => 'text text int4 text', prosrc => 'regexp_count' },
+{ oid => '9617', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4', proargtypes => 'text text',
+  prosrc => 'regexp_instr_no_start' },
+{ oid => '9618', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4',
+  proargtypes => 'text text int4', prosrc => 'regexp_instr_no_n' },
+{ oid => '9619', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4',
+  proargtypes => 'text text int4 int4', prosrc => 'regexp_instr_no_endoption' },
+{ oid => '9620', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4',
+  proargtypes => 'text text int4 int4 int4',
+  prosrc => 'regexp_instr_no_flags' },
+{ oid => '9621', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4',
+  proargtypes => 'text text int4 int4 int4 text',
+  prosrc => 'regexp_instr_no_subexpr' },
+{ oid => '9622', descr => 'position of regexp match',
+  proname => 'regexp_instr', prorettype => 'int4',
+  proargtypes => 'text text int4 int4 int4 text int4',
+  prosrc => 'regexp_instr' },
+{ oid => '9623', descr => 'test for regexp match',
+  proname => 'regexp_like', prorettype => 'bool', proargtypes => 'text text',
+  prosrc => 'regexp_like_no_flags' },
+{ oid => '9624', descr => 'test for regexp match',
+  proname => 'regexp_like', prorettype => 'bool',
+  proargtypes => 'text text text', prosrc => 'regexp_like' },
+{ oid => '9625', descr => 'extract substring that matches regexp',
+  proname => 'regexp_substr', prorettype => 'text', proargtypes => 'text text',
+  prosrc => 'regexp_substr_no_start' },
+{ oid => '9626', descr => 'extract substring that matches regexp',
+  proname => 'regexp_substr', prorettype => 'text',
+  proargtypes => 'text text int4', prosrc => 'regexp_substr_no_n' },
+{ oid => '9627', descr => 'extract substring that matches regexp',
+  proname => 'regexp_substr', prorettype => 'text',
+  proargtypes => 'text text int4 int4', prosrc => 'regexp_substr_no_flags' },
+{ oid => '9628', descr => 'extract substring that matches regexp',
+  proname => 'regexp_substr', prorettype => 'text',
+  proargtypes => 'text text int4 int4 text',
+  prosrc => 'regexp_substr_no_subexpr' },
+{ oid => '9629', descr => 'extract substring that matches regexp',
+  proname => 'regexp_substr', prorettype => 'text',
+  proargtypes => 'text text int4 int4 text int4', prosrc => 'regexp_substr' },
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/include/utils/varlena.h b/src/include/utils/varlena.h
index 5c39723332..6645e2af13 100644
--- a/src/include/utils/varlena.h
+++ b/src/include/utils/varlena.h
@@ -34,6 +34,7 @@ extern bool SplitDirectoriesString(char *rawstring, char separator,
 extern bool SplitGUCList(char *rawstring, char separator,
                          List **namelist);
 extern text *replace_text_regexp(text *src_text, void *regexp,
-                                 text *replace_text, bool glob);
+                                 text *replace_text,
+                                 int search_start, int n);

 #endif
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 91aa819804..a9efd74c7b 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -515,6 +515,13 @@ SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
  cde
 (1 row)

+-- Check case where we have a match, but not a subexpression match
+SELECT SUBSTRING('foo' FROM 'foo(bar)?') IS NULL AS t;
+ t
+---
+ t
+(1 row)
+
 -- Check behavior of SIMILAR TO, which uses largely the same regexp variant
 SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
  true
@@ -592,6 +599,370 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
 -- invalid regexp option
 SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
 ERROR:  invalid regular expression option: "z"
+-- extended regexp_replace tests
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 1);
+    regexp_replace
+-----------------------
+ X PostgreSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 1, 2);
+    regexp_replace
+-----------------------
+ A PXstgreSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i');
+    regexp_replace
+-----------------------
+ X PXstgrXSQL fXnctXXn
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 1, 'i');
+    regexp_replace
+-----------------------
+ X PostgreSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 2, 'i');
+    regexp_replace
+-----------------------
+ A PXstgreSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 3, 'i');
+    regexp_replace
+-----------------------
+ A PostgrXSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 9, 'i');
+    regexp_replace
+-----------------------
+ A PostgreSQL function
+(1 row)
+
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 7, 0, 'i');
+    regexp_replace
+-----------------------
+ A PostgrXSQL fXnctXXn
+(1 row)
+
+-- 'g' flag should be ignored when N is specified
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 1, 'g');
+    regexp_replace
+-----------------------
+ A PXstgreSQL function
+(1 row)
+
+-- errors
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', -1, 0, 'i');
+ERROR:  invalid value for parameter "start": -1
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, -1, 'i');
+ERROR:  invalid value for parameter "n": -1
+-- erroneous invocation of non-extended form
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', '1');
+ERROR:  invalid regular expression option: "1"
+HINT:  If you meant to use regexp_replace() with a start parameter, cast the fourth argument to integer explicitly.
+--  regexp_count tests
+SELECT regexp_count('123123123123123', '(12)3');
+ regexp_count
+--------------
+            5
+(1 row)
+
+SELECT regexp_count('123123123123', '123', 1);
+ regexp_count
+--------------
+            4
+(1 row)
+
+SELECT regexp_count('123123123123', '123', 3);
+ regexp_count
+--------------
+            3
+(1 row)
+
+SELECT regexp_count('123123123123', '123', 33);
+ regexp_count
+--------------
+            0
+(1 row)
+
+SELECT regexp_count('ABCABCABCABC', 'Abc', 1, '');
+ regexp_count
+--------------
+            0
+(1 row)
+
+SELECT regexp_count('ABCABCABCABC', 'Abc', 1, 'i');
+ regexp_count
+--------------
+            4
+(1 row)
+
+-- errors
+SELECT regexp_count('123123123123', '123', 0);
+ERROR:  invalid value for parameter "start": 0
+SELECT regexp_count('123123123123', '123', -3);
+ERROR:  invalid value for parameter "start": -3
+-- regexp_like tests
+SELECT regexp_like('Steven', '^Ste(v|ph)en$');
+ regexp_like
+-------------
+ t
+(1 row)
+
+SELECT regexp_like('a'||CHR(10)||'d', 'a.d', 'n');
+ regexp_like
+-------------
+ f
+(1 row)
+
+SELECT regexp_like('a'||CHR(10)||'d', 'a.d', 's');
+ regexp_like
+-------------
+ t
+(1 row)
+
+SELECT regexp_like('abc', ' a . c ', 'x');
+ regexp_like
+-------------
+ t
+(1 row)
+
+SELECT regexp_like('abc', 'a.c', 'g');  -- error
+ERROR:  regexp_like() does not support the "global" option
+-- regexp_instr tests
+SELECT regexp_instr('abcdefghi', 'd.f');
+ regexp_instr
+--------------
+            4
+(1 row)
+
+SELECT regexp_instr('abcdefghi', 'd.q');
+ regexp_instr
+--------------
+            0
+(1 row)
+
+SELECT regexp_instr('abcabcabc', 'a.c');
+ regexp_instr
+--------------
+            1
+(1 row)
+
+SELECT regexp_instr('abcabcabc', 'a.c', 2);
+ regexp_instr
+--------------
+            4
+(1 row)
+
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 3);
+ regexp_instr
+--------------
+            7
+(1 row)
+
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 4);
+ regexp_instr
+--------------
+            0
+(1 row)
+
+SELECT regexp_instr('abcabcabc', 'A.C', 1, 2, 0, 'i');
+ regexp_instr
+--------------
+            4
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 0);
+ regexp_instr
+--------------
+            1
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 1);
+ regexp_instr
+--------------
+            1
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 2);
+ regexp_instr
+--------------
+            4
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 3);
+ regexp_instr
+--------------
+            5
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4);
+ regexp_instr
+--------------
+            7
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 5);
+ regexp_instr
+--------------
+            0
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 0);
+ regexp_instr
+--------------
+            9
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 1);
+ regexp_instr
+--------------
+            4
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 2);
+ regexp_instr
+--------------
+            9
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 3);
+ regexp_instr
+--------------
+            7
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 4);
+ regexp_instr
+--------------
+            9
+(1 row)
+
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 5);
+ regexp_instr
+--------------
+            0
+(1 row)
+
+-- Check case where we have a match, but not a subexpression match
+SELECT regexp_instr('foo', 'foo(bar)?', 1, 1, 0, '', 1);
+ regexp_instr
+--------------
+            0
+(1 row)
+
+-- errors
+SELECT regexp_instr('abcabcabc', 'a.c', 0, 1);
+ERROR:  invalid value for parameter "start": 0
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 0);
+ERROR:  invalid value for parameter "n": 0
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, -1);
+ERROR:  invalid value for parameter "endoption": -1
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 2);
+ERROR:  invalid value for parameter "endoption": 2
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, 'g');
+ERROR:  regexp_instr() does not support the "global" option
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, '', -1);
+ERROR:  invalid value for parameter "subexpr": -1
+-- regexp_substr tests
+SELECT regexp_substr('abcdefghi', 'd.f');
+ regexp_substr
+---------------
+ def
+(1 row)
+
+SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT regexp_substr('abcabcabc', 'a.c');
+ regexp_substr
+---------------
+ abc
+(1 row)
+
+SELECT regexp_substr('abcabcabc', 'a.c', 2);
+ regexp_substr
+---------------
+ abc
+(1 row)
+
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+ regexp_substr
+---------------
+ abc
+(1 row)
+
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+ regexp_substr
+---------------
+ abc
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
+ regexp_substr
+---------------
+ 12345678
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 1);
+ regexp_substr
+---------------
+ 123
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 2);
+ regexp_substr
+---------------
+ 45678
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 3);
+ regexp_substr
+---------------
+ 56
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 4);
+ regexp_substr
+---------------
+ 78
+(1 row)
+
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 5) IS NULL AS t;
+ t
+---
+ t
+(1 row)
+
+-- Check case where we have a match, but not a subexpression match
+SELECT regexp_substr('foo', 'foo(bar)?', 1, 1, '', 1) IS NULL AS t;
+ t
+---
+ t
+(1 row)
+
+-- errors
+SELECT regexp_substr('abcabcabc', 'a.c', 0, 1);
+ERROR:  invalid value for parameter "start": 0
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 0);
+ERROR:  invalid value for parameter "n": 0
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 1, 'g');
+ERROR:  regexp_substr() does not support the "global" option
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 1, '', -1);
+ERROR:  invalid value for parameter "subexpr": -1
 -- set so we can tell NULL from empty string
 \pset null '\\N'
 -- return all matches from regexp
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 2c502534c2..6a029cc369 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -171,6 +171,8 @@ SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde";

 -- With a parenthesized subexpression, return only what matches the subexpr
 SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde";
+-- Check case where we have a match, but not a subexpression match
+SELECT SUBSTRING('foo' FROM 'foo(bar)?') IS NULL AS t;

 -- Check behavior of SIMILAR TO, which uses largely the same regexp variant
 SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true;
@@ -193,6 +195,93 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'gi');
 -- invalid regexp option
 SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');

+-- extended regexp_replace tests
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 1);
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 1, 2);
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 1, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 2, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 3, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 9, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'A|e|i|o|u', 'X', 7, 0, 'i');
+-- 'g' flag should be ignored when N is specified
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 1, 'g');
+-- errors
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', -1, 0, 'i');
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, -1, 'i');
+-- erroneous invocation of non-extended form
+SELECT regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', '1');
+
+--  regexp_count tests
+SELECT regexp_count('123123123123123', '(12)3');
+SELECT regexp_count('123123123123', '123', 1);
+SELECT regexp_count('123123123123', '123', 3);
+SELECT regexp_count('123123123123', '123', 33);
+SELECT regexp_count('ABCABCABCABC', 'Abc', 1, '');
+SELECT regexp_count('ABCABCABCABC', 'Abc', 1, 'i');
+-- errors
+SELECT regexp_count('123123123123', '123', 0);
+SELECT regexp_count('123123123123', '123', -3);
+
+-- regexp_like tests
+SELECT regexp_like('Steven', '^Ste(v|ph)en$');
+SELECT regexp_like('a'||CHR(10)||'d', 'a.d', 'n');
+SELECT regexp_like('a'||CHR(10)||'d', 'a.d', 's');
+SELECT regexp_like('abc', ' a . c ', 'x');
+SELECT regexp_like('abc', 'a.c', 'g');  -- error
+
+-- regexp_instr tests
+SELECT regexp_instr('abcdefghi', 'd.f');
+SELECT regexp_instr('abcdefghi', 'd.q');
+SELECT regexp_instr('abcabcabc', 'a.c');
+SELECT regexp_instr('abcabcabc', 'a.c', 2);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 4);
+SELECT regexp_instr('abcabcabc', 'A.C', 1, 2, 0, 'i');
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 0);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 1);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 2);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 3);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 5);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 0);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 1);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 2);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 3);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 4);
+SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 1, 'i', 5);
+-- Check case where we have a match, but not a subexpression match
+SELECT regexp_instr('foo', 'foo(bar)?', 1, 1, 0, '', 1);
+-- errors
+SELECT regexp_instr('abcabcabc', 'a.c', 0, 1);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 0);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, -1);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 2);
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, 'g');
+SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, '', -1);
+
+-- regexp_substr tests
+SELECT regexp_substr('abcdefghi', 'd.f');
+SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
+SELECT regexp_substr('abcabcabc', 'a.c');
+SELECT regexp_substr('abcabcabc', 'a.c', 2);
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
+SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 1);
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 2);
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 3);
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 4);
+SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 5) IS NULL AS t;
+-- Check case where we have a match, but not a subexpression match
+SELECT regexp_substr('foo', 'foo(bar)?', 1, 1, '', 1) IS NULL AS t;
+-- errors
+SELECT regexp_substr('abcabcabc', 'a.c', 0, 1);
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 0);
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 1, 'g');
+SELECT regexp_substr('abcabcabc', 'a.c', 1, 1, '', -1);
+
 -- set so we can tell NULL from empty string
 \pset null '\\N'

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

02 August 2021, 01:02:20

I wrote:
> ... aside from the question of whether
> a too-large subexpression number should be an error or not.

Oh ... poking around some more, I noticed a very nearby precedent.
regexp_replace's replacement string can include \1 to \9 to insert
the substring matching the N'th parenthesized subexpression.  But
if there is no such subexpression, you don't get an error, just
an empty insertion.  So that seems like an argument for not
throwing an error for an out-of-range subexpr parameter.

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

02 August 2021, 21:22:04

Le 02/08/2021 à 01:21, Tom Lane a écrit :
> Gilles Darold <gilles@darold.net> writes:
>> [ v5-0001-regexp-foo-functions.patch ]
> I've gone through this whole patch now, and found quite a lot that I did
> not like.  In no particular order:
>
> * Wrapping parentheses around the user's regexp doesn't work.  It can
> turn an invalid regexp into a valid one: for example 'a)(b' should draw
> a syntax error.  With this patch, no error would be thrown, but the
> "outer" parens wouldn't do what you expected.  Worse, it can turn a
> valid regexp into an invalid one: the metasyntax options described in
> 9.7.3.4 only work at the start of the regexp.  So we have to handle
> whole-regexp cases honestly rather than trying to turn them into an
> instance of the parenthesized-subexpression case.
>
> * You did a lot of things quite inefficiently, apparently to avoid
> touching any existing code.  I think it's better to extend
> setup_regexp_matches() and replace_text_regexp() a little bit so that
> they can support the behaviors these new functions need.  In both of
> them, it's absolutely trivial to allow a search start position to be
> passed in; and it doesn't take much to teach replace_text_regexp()
> to replace only the N'th match.
>
> * Speaking of N'th, there is not much of anything that I like
> about Oracle's terminology for the function arguments, and I don't
> think we ought to adopt it.  If we're documenting the functions as
> processing the "N'th match", it seems to me to be natural to call
> the parameter "N" not "occurrence".  Speaking of the "occurrence'th
> occurrence" is just silly, not to mention long and easy to misspell.
> Likewise, "position" is a horribly vague term for the search start
> position; it could be interpreted to mean several other things.
> "start" seems much better.  "return_opt" is likewise awfully unclear.
> I went with "endoption" below, though I could be talked into something
> else.  The only one of Oracle's choices that I like is "subexpr" for
> subexpression number ... but you went with DB2's rather vague "group"
> instead.  I don't want to use their "capture group" terminology,
> because that appears nowhere else in our documentation.  Our existing
> terminology is "parenthesized subexpression", which seems fine to me
> (and also agrees with Oracle's docs).
>
> * I spent a lot of time on the docs too.  A lot of the syntax specs
> were wrong (where you put the brackets matters), many of the examples
> seemed confusingly overcomplicated, and the text explanations needed
> copy-editing.
>
> * Also, the regression tests seemed misguided.  This patch is not
> responsible for testing the regexp engine as such; we have tests
> elsewhere that do that.  So I don't think we need complex regexps
> here.  We just need to verify that the parameters of these functions
> act properly, and check their error cases.  That can be done much
> more quickly and straightforwardly than what you had.
>
>
> So here's a revised version that I like better.  I think this
> is pretty nearly committable, aside from the question of whether
> a too-large subexpression number should be an error or not.


Thanks a lot for the patch improvement and the guidance. I have read the
patch and I agree with your choices I think I was too much trying to
mimic the oraclisms. I don't think we should take care of the too-large
subexpression number, the regexp writer should always test its regular
expression and also this will not prevent him to chose the wrong capture
group number but just a non existing one.


Best regards,

--
Gilles Darold

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

03 August 2021, 09:45:09

Le 02/08/2021 à 23:22, Gilles Darold a écrit :
> Le 02/08/2021 à 01:21, Tom Lane a écrit :
>> Gilles Darold <gilles@darold.net> writes:
>>> [ v5-0001-regexp-foo-functions.patch ]
>> I've gone through this whole patch now, and found quite a lot that I did
>> not like.  In no particular order:
>>
>> * Wrapping parentheses around the user's regexp doesn't work.  It can
>> turn an invalid regexp into a valid one: for example 'a)(b' should draw
>> a syntax error.  With this patch, no error would be thrown, but the
>> "outer" parens wouldn't do what you expected.  Worse, it can turn a
>> valid regexp into an invalid one: the metasyntax options described in
>> 9.7.3.4 only work at the start of the regexp.  So we have to handle
>> whole-regexp cases honestly rather than trying to turn them into an
>> instance of the parenthesized-subexpression case.
>>
>> * You did a lot of things quite inefficiently, apparently to avoid
>> touching any existing code.  I think it's better to extend
>> setup_regexp_matches() and replace_text_regexp() a little bit so that
>> they can support the behaviors these new functions need.  In both of
>> them, it's absolutely trivial to allow a search start position to be
>> passed in; and it doesn't take much to teach replace_text_regexp()
>> to replace only the N'th match.
>>
>> * Speaking of N'th, there is not much of anything that I like
>> about Oracle's terminology for the function arguments, and I don't
>> think we ought to adopt it.  If we're documenting the functions as
>> processing the "N'th match", it seems to me to be natural to call
>> the parameter "N" not "occurrence".  Speaking of the "occurrence'th
>> occurrence" is just silly, not to mention long and easy to misspell.
>> Likewise, "position" is a horribly vague term for the search start
>> position; it could be interpreted to mean several other things.
>> "start" seems much better.  "return_opt" is likewise awfully unclear.
>> I went with "endoption" below, though I could be talked into something
>> else.  The only one of Oracle's choices that I like is "subexpr" for
>> subexpression number ... but you went with DB2's rather vague "group"
>> instead.  I don't want to use their "capture group" terminology,
>> because that appears nowhere else in our documentation.  Our existing
>> terminology is "parenthesized subexpression", which seems fine to me
>> (and also agrees with Oracle's docs).
>>
>> * I spent a lot of time on the docs too.  A lot of the syntax specs
>> were wrong (where you put the brackets matters), many of the examples
>> seemed confusingly overcomplicated, and the text explanations needed
>> copy-editing.
>>
>> * Also, the regression tests seemed misguided.  This patch is not
>> responsible for testing the regexp engine as such; we have tests
>> elsewhere that do that.  So I don't think we need complex regexps
>> here.  We just need to verify that the parameters of these functions
>> act properly, and check their error cases.  That can be done much
>> more quickly and straightforwardly than what you had.
>>
>>
>> So here's a revised version that I like better.  I think this
>> is pretty nearly committable, aside from the question of whether
>> a too-large subexpression number should be an error or not.
>
> Thanks a lot for the patch improvement and the guidance. I have read the
> patch and I agree with your choices I think I was too much trying to
> mimic the oraclisms. I don't think we should take care of the too-large
> subexpression number, the regexp writer should always test its regular
> expression and also this will not prevent him to chose the wrong capture
> group number but just a non existing one.


Actually I just found that the regexp_like() function doesn't support 
the start parameter which is something we should support. I saw that 
Oracle do not support it but DB2 does and I think we should also support 
it. I will post a new version of the patch once it is done.


Best regards,

-- 
Gilles Darold

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

03 August 2021, 11:26:48

Le 03/08/2021 à 11:45, Gilles Darold a écrit :
> Actually I just found that the regexp_like() function doesn't support
> the start parameter which is something we should support. I saw that
> Oracle do not support it but DB2 does and I think we should also
> support it. I will post a new version of the patch once it is done.


Here is a new version of the patch that adds the start parameter to
regexp_like() function but while I'm adding support to this parameter it
become less obvious for me that we should implement it. However feel
free to not use this version if you think that adding the start
parameter has no real interest.


Best regards,

--
Gilles Darold

Attachment

v7-0001-regexp-foo-functions.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Erik Rijkers

Date:

03 August 2021, 13:02:43

On 8/3/21 1:26 PM, Gilles Darold wrote:
> Le 03/08/2021 à 11:45, Gilles Darold a écrit :
>> Actually I just found that the regexp_like() function doesn't support 
>> the start parameter which is something we should support. I saw that 
>> Oracle do not support it but DB2 does and I think we should also 
>> support it. I will post a new version of the patch once it is done.
> 

+1

I for one am in favor of this 'start'-argument addition.  Slightly 
harder usage, but more precise manipulation.


Erik Rijkers


> 
> Here is a new version of the patch that adds the start parameter to 
> regexp_like() function but while I'm adding support to this parameter it 
> become less obvious for me that we should implement it. However feel 
> free to not use this version if you think that adding the start 
> parameter has no real interest.
> 
> 
> Best regards,
>

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

03 August 2021, 13:39:50

Erik Rijkers <er@xs4all.nl> writes:
> On 8/3/21 1:26 PM, Gilles Darold wrote:
>> Le 03/08/2021 à 11:45, Gilles Darold a écrit :
>>> Actually I just found that the regexp_like() function doesn't support 
>>> the start parameter which is something we should support. I saw that 
>>> Oracle do not support it but DB2 does and I think we should also 
>>> support it. I will post a new version of the patch once it is done.

> +1

> I for one am in favor of this 'start'-argument addition.  Slightly 
> harder usage, but more precise manipulation.

As I said upthread, I am *not* in favor of making those DB2 additions.
We do not need to create ambiguities around those functions like the
one we have for regexp_replace.  If Oracle doesn't have those options,
why do we need them?

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

03 August 2021, 14:11:24

Le 03/08/2021 à 15:39, Tom Lane a écrit :
> Erik Rijkers <er@xs4all.nl> writes:
>> On 8/3/21 1:26 PM, Gilles Darold wrote:
>>> Le 03/08/2021 à 11:45, Gilles Darold a écrit :
>>>> Actually I just found that the regexp_like() function doesn't support
>>>> the start parameter which is something we should support. I saw that
>>>> Oracle do not support it but DB2 does and I think we should also
>>>> support it. I will post a new version of the patch once it is done.
>> +1
>> I for one am in favor of this 'start'-argument addition.  Slightly
>> harder usage, but more precise manipulation.
> As I said upthread, I am *not* in favor of making those DB2 additions.
> We do not need to create ambiguities around those functions like the
> one we have for regexp_replace.  If Oracle doesn't have those options,
> why do we need them?


Sorry I have missed that, but I'm fine with this implemenation so let's 
keep the v6 version of the patch and drop this one.

-- 
Gilles Darold

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Tom Lane

Date:

03 August 2021, 17:10:24

Gilles Darold <gilles@darold.net> writes:
> Sorry I have missed that, but I'm fine with this implemenation so let's 
> keep the v6 version of the patch and drop this one.

Pushed, then.  There's still lots of time to tweak the behavior of course.

            regards, tom lane

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Peter Eisentraut

Date:

15 December 2021, 12:41:16

On 03.08.21 19:10, Tom Lane wrote:
> Gilles Darold <gilles@darold.net> writes:
>> Sorry I have missed that, but I'm fine with this implemenation so let's
>> keep the v6 version of the patch and drop this one.
> 
> Pushed, then.  There's still lots of time to tweak the behavior of course.

I have a documentation follow-up to this.  It seems that these new 
functions are almost a de facto standard, whereas the SQL-standard 
functions are not implemented anywhere.  I propose the attached patch to 
update the subsection in the pattern-matching section to give more 
detail on this and suggest equivalent functions among these newly added 
ones.  What do you think?

Attachment

0001-doc-More-documentation-on-regular-expressions-and-SQ.patch

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Gilles Darold

Date:

15 December 2021, 13:15:04

Le 15/12/2021 à 13:41, Peter Eisentraut a écrit :
> On 03.08.21 19:10, Tom Lane wrote:
>> Gilles Darold <gilles@darold.net> writes:
>>> Sorry I have missed that, but I'm fine with this implemenation so let's
>>> keep the v6 version of the patch and drop this one.
>>
>> Pushed, then.  There's still lots of time to tweak the behavior of
>> course.
>
> I have a documentation follow-up to this.  It seems that these new
> functions are almost a de facto standard, whereas the SQL-standard
> functions are not implemented anywhere.  I propose the attached patch
> to update the subsection in the pattern-matching section to give more
> detail on this and suggest equivalent functions among these newly
> added ones.  What do you think?


I'm in favor to apply your changes to documentation. It is a good thing
to precise the relation between this implementation of the regex_*
functions and the SQL stardard.

--
Gilles Darold

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

From

Peter Eisentraut

Date:

20 December 2021, 09:43:20

On 15.12.21 14:15, Gilles Darold wrote:
> Le 15/12/2021 à 13:41, Peter Eisentraut a écrit :
>> On 03.08.21 19:10, Tom Lane wrote:
>>> Gilles Darold <gilles@darold.net> writes:
>>>> Sorry I have missed that, but I'm fine with this implemenation so let's
>>>> keep the v6 version of the patch and drop this one.
>>>
>>> Pushed, then.  There's still lots of time to tweak the behavior of 
>>> course.
>>
>> I have a documentation follow-up to this.  It seems that these new 
>> functions are almost a de facto standard, whereas the SQL-standard 
>> functions are not implemented anywhere.  I propose the attached patch 
>> to update the subsection in the pattern-matching section to give more 
>> detail on this and suggest equivalent functions among these newly 
>> added ones.  What do you think?
> 
> 
> I'm in favor to apply your changes to documentation. It is a good thing 
> to precise the relation between this implementation of the regex_* 
> functions and the SQL stardard.

ok, done