Thread: regexp_match() returning text

regexp_match() returning text

From
Emre Hasegeli
Date:
Attached patch adds regexp_match() function which is a simple variant of
regexp_matches() that doesn't return a set.  It is based on Tom Lane's
comment to bug #10889 [1].

[1] https://www.postgresql.org/message-id/23769.1404747766@sss.pgh.pa.us

Attachment

Re: regexp_match() returning text

From
"David G. Johnston"
Date:
On Sun, May 29, 2016 at 1:28 PM, Emre Hasegeli <emre@hasegeli.com> wrote:
Attached patch adds regexp_match() function which is a simple variant of
regexp_matches() that doesn't return a set.  It is based on Tom Lane's
comment to bug #10889 [1].

[1] https://www.postgresql.org/message-id/23769.1404747766@sss.pgh.pa.us

Not sure if we need two functions or what but I dislike that:

"It ignores the parenthesized subexpressions in the pattern."

​The main problem being solved is the use of a SETOF result.  I'm inclined to prefer that the final, single, result is still an array.

Even if we return a single text value instead of an array it would be nice to return the first (only...) parenthesized subexpression so that the input pattern isn't constrained.

​I've got a style issue with the information_schema - I like to call it useless-use-of-E'' -  but that was there long before this patch...

/* user mustn't specify 'g' for regexp_split */ - do we add "or regexp_match" or just removed the extraneous detail?

There seems to be scope creep regarding "regexp_split_to_table" that I'm surprised to find.  Related to that is the unexpected removal of the "force_glob" parameter to setup_regexp_matches.  You took what was a single block of code and now duplicated it without any explanation in the commit message (a code comment wouldn't work for this kind of change).  The change to flags from passing a pointer to text to passing in a pointer to a previously derived pg_re_flags makes more sense on its face, and it is apparently a non-public API, but again constitutes a refactoring that at least would ideally be a separate commit from the one the introduces the new behavior.

Also, I see an assert for the "no subexpressions" policy but I have seemed to have overlooked where the non-assert-enabled user is informed of the prohibition.

Tom's opinion on all this will be needed - I am not a C code writer nor do I follow the patch writing process that closely generally, but I have a keen interest in this topic - but there seems to be a bit more here than simply having a function identical to the existing regexp_matches function that prohibits the global flag and only ever returns a single array per the existing API.

I'm good with the name, regexp_match - even with the behavior being changed to returning an array instead of text.  If there was any active plans to redo the API so that we'd return a composite type instead of an array I'd be more inclined to reserve "match" for that change and make this one a bit more verbose.

David J.

Re: regexp_match() returning text

From
Andrew Gierth
Date:
>>>>> "Emre" == Emre Hasegeli <emre@hasegeli.com> writes:
Emre> Attached patch adds regexp_match() function which is a simpleEmre> variant of regexp_matches() that doesn't
returna set.
 

We already have a function that takes a string and a regexp and returns
a single text result: substring().

Regexp flags other than 'g' can also be embedded in the regexp:

postgres=# select substring('foo bar' from '(?i)BA+');substring 
-----------ba

Returning an array containing the values of all capture groups might be
more useful (substring returns the value of the first capture group if
any, otherwise the matched string).

-- 
Andrew (irc:RhodiumToad)



Re: regexp_match() returning text

From
Jim Nasby
Date:
On 5/30/16 1:01 PM, Andrew Gierth wrote:
> Returning an array containing the values of all capture groups might be
> more useful (substring returns the value of the first capture group if
> any, otherwise the matched string).

+1.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: regexp_match() returning text

From
Emre Hasegeli
Date:
> The main problem being solved is the use of a SETOF result.  I'm inclined to
> prefer that the final, single, result is still an array.

I have changed it like that.  New patch attached.

> I've got a style issue with the information_schema - I like to call it
> useless-use-of-E'' -  but that was there long before this patch...

We better not touch it.

> /* user mustn't specify 'g' for regexp_split */ - do we add "or
> regexp_match" or just removed the extraneous detail?

I don't think it would be a nice error message.

> There seems to be scope creep regarding "regexp_split_to_table" that I'm
> surprised to find.  Related to that is the unexpected removal of the
> "force_glob" parameter to setup_regexp_matches.  You took what was a single
> block of code and now duplicated it without any explanation in the commit
> message (a code comment wouldn't work for this kind of change).  The change
> to flags from passing a pointer to text to passing in a pointer to a
> previously derived pg_re_flags makes more sense on its face, and it is
> apparently a non-public API, but again constitutes a refactoring that at
> least would ideally be a separate commit from the one the introduces the new
> behavior.

That check doesn't belong to setup_regexp_matches() in the first place.
The arguments of the function are organised to be caller agnostic,
and then it gives an error on behalf of regexp_split().  The check
fits better to the regexp_split() functions even with duplication.

I can split it to another patch, but I think these kind of changes most
often go together.

Would you mind adding yourself to the reviewers on the Commitfest app?
I think you have already read though it.

Thanks for all the feedback.

Attachment

Re: regexp_match() returning text

From
"David G. Johnston"
Date:
On Saturday, June 4, 2016, Emre Hasegeli <emre@hasegeli.com> wrote:
> The main problem being solved is the use of a SETOF result.  I'm inclined to
> prefer that the final, single, result is still an array.

I have changed it like that.  New patch attached.

Good
 

> I've got a style issue with the information_schema - I like to call it
> useless-use-of-E'' -  but that was there long before this patch...

We better not touch it.

Agreed
 

> /* user mustn't specify 'g' for regexp_split */ - do we add "or
> regexp_match" or just removed the extraneous detail?

I don't think it would be a nice error message.

Just meant the comment.  I think they look good now though the part about force global in the split area just says what the code does and not why.  Namely that when splitting we find all matches so the input is completely split.  We disallow a user specification of global for some arbitrary reason; though I don't have any reason to change that and the present behavior doesn't cause people to complain.
 

> There seems to be scope creep regarding "regexp_split_to_table" that I'm
> surprised to find.  Related to that is the unexpected removal of the
> "force_glob" parameter to setup_regexp_matches.  You took what was a single
> block of code and now duplicated it without any explanation in the commit
> message (a code comment wouldn't work for this kind of change).  The change
> to flags from passing a pointer to text to passing in a pointer to a
> previously derived pg_re_flags makes more sense on its face, and it is
> apparently a non-public API, but again constitutes a refactoring that at
> least would ideally be a separate commit from the one the introduces the new
> behavior.

That check doesn't belong to setup_regexp_matches() in the first place.
The arguments of the function are organised to be caller agnostic,
and then it gives an error on behalf of regexp_split().  The check
fits better to the regexp_split() functions even with duplication.

I can split it to another patch, but I think these kind of changes most
often go together.

After reading this again I understand better what is going on and agree with the changes as written.
 

Would you mind adding yourself to the reviewers on the Commitfest app?
I think you have already read though it.


Done.

I didn't compile either patch but given the scope and complexity I'd say it is ready for committer without that confirmed.  Tom usually touches the regexp code and I'm pretty sure he'll look at this with an eye no one else has.  Though I wouldn't expect anything until work on 10 begins in earnest.

David J.

Re: regexp_match() returning text

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> I didn't compile either patch but given the scope and complexity I'd say it
> is ready for committer without that confirmed.  Tom usually touches the
> regexp code and I'm pretty sure he'll look at this with an eye no one else
> has.  Though I wouldn't expect anything until work on 10 begins in earnest.

Pushed with some cosmetic adjustments to the code and rather more
extensive work on the documentation.

I did *not* push the hunk in citext.sgml, since that was alleging support
that doesn't actually exist in this patch.  To make this work for citext,
we need to add wrapper functions similar to citext's wrappers for
regexp_matches.  And that in turn means a citext extension version bump,
which makes it more notationally tedious than I felt like dealing with
today.  I'm bouncing that requirement back to you to produce a separate
patch for.
        regards, tom lane



Re: regexp_match() returning text

From
Emre Hasegeli
Date:
> I did *not* push the hunk in citext.sgml, since that was alleging support
> that doesn't actually exist in this patch.  To make this work for citext,
> we need to add wrapper functions similar to citext's wrappers for
> regexp_matches.  And that in turn means a citext extension version bump,
> which makes it more notationally tedious than I felt like dealing with
> today.  I'm bouncing that requirement back to you to produce a separate
> patch for.

It is attached.

Attachment

Re: regexp_match() returning text

From
Tom Lane
Date:
Emre Hasegeli <emre@hasegeli.com> writes:
>> I did *not* push the hunk in citext.sgml, since that was alleging support
>> that doesn't actually exist in this patch.  To make this work for citext,
>> we need to add wrapper functions similar to citext's wrappers for
>> regexp_matches.  And that in turn means a citext extension version bump,
>> which makes it more notationally tedious than I felt like dealing with
>> today.  I'm bouncing that requirement back to you to produce a separate
>> patch for.

> It is attached.

You missed updating citext_1.out, but otherwise looks good.  Pushed.
        regards, tom lane