Thread: Re: [HACKERS] writing new regexp functions

Re: [HACKERS] writing new regexp functions

From
Jeremy Drake
Date:
On Fri, 2 Feb 2007, David Fetter wrote:

> On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote:
> >
> > All of SQL's pattern match operators have the pattern on the right,
> > so my advice is to stick with that and try not to think about Perl
> > ;-)
>
> Perl provides inspiration, but here, consistency would help more than
> orderly imitation of how it does what it does.   And besides, when
> people really need Perl, they can pull it in as a PL :)

Alright, here is my code to date.  I have put the pattern after the string
in the split function, as discussed above.  The .tar.gz file expects to be
untarred in contrib/.  I have made some regression tests that can be run
using 'make installcheck' as normal for contrib.  I think they exercise
the corner cases in the code, but I may very well have missed some.  It
requires the (previously submitted) attached patch to core to compile, as
it takes advantage of new exported functions from
src/backend/utils/adt/regexp.c.

Let me know if you see any bugs or issues with this code, and I am open to
suggestions for further regression tests ;)


Things that I still want to look into:
* regexp flags (a la regexp_replace).

* maybe make regexp_matches return setof whatever, if given a 'g' flag
  return all matches in string.

* maybe a join function that works as an aggregate
   SELECT join(',', col) FROM tbl
  currently can be written as
   SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')




--
It was a virgin forest, a place where the Hand of Man had never set
foot.

Attachment

Re: [HACKERS] writing new regexp functions

From
David Fetter
Date:
On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:

> Let me know if you see any bugs or issues with this code, and I am
> open to suggestions for further regression tests ;)

> Things that I still want to look into:
> * regexp flags (a la regexp_replace).

One more text field at the end is how the regexp_replace() one does
it.

> * maybe make regexp_matches return setof whatever, if given a 'g' flag
>   return all matches in string.

This is doable with current machinery, albeit a little clumsily.

> * maybe a join function that works as an aggregate
>    SELECT join(',', col) FROM tbl
>   currently can be written as
>    SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')

The array_accum() aggregate in the docs works OK for this purpose.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: [HACKERS] writing new regexp functions

From
Jeremy Drake
Date:
On Sun, 4 Feb 2007, David Fetter wrote:

> On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
>
> > Let me know if you see any bugs or issues with this code, and I am
> > open to suggestions for further regression tests ;)
>
> > Things that I still want to look into:
> > * regexp flags (a la regexp_replace).
>
> One more text field at the end is how the regexp_replace() one does
> it.

That's how I did it.

> > * maybe make regexp_matches return setof whatever, if given a 'g' flag
> >   return all matches in string.
>
> This is doable with current machinery, albeit a little clumsily.

I have implemented this too.

> > * maybe a join function that works as an aggregate
> >    SELECT join(',', col) FROM tbl
> >   currently can be written as
> >    SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
>
> The array_accum() aggregate in the docs works OK for this purpose.

I have not tackled this yet, I think it may be better to stick with the
ARRAY() construct for now.


So, here is the new version of the code, and also a new version of the
patch to core, which fixes some compile warnings that I did not see at
first because I was using ICC rather than GCC.

Here is the README.regexp_ext from the tar file:


This package contains regexp functions beyond those currently provided
in core PostgreSQL, utilizing the regexp engine built into core.  This
is still a work-in-progress.

The most recent version of this code can be found at
 http://www.jdrake.com/postgresql/regexp/regexp_ext.tar.gz
and the prerequisite patch to PostgreSQL core, which has been submitted
for review, can be found at
 http://www.jdrake.com/postgresql/regexp/regexp-export.patch

The .tar.gz file expects to be untarred in contrib/.  I have made some
regression tests that can be run using 'make installcheck' as normal for
contrib.  I think they exercise the corner cases in the code, but I may
very well have missed some.  It requires the above mentioned patch to
core to compile, as it takes advantage of new exported functions from
src/backend/utils/adt/regexp.c.

Let me know if you see any bugs or issues with this code, and I am open to
suggestions for further regression tests ;)

Functions implemented in this module:
* regexp_split(str text, pattern text) RETURNS SETOF text
  regexp_split(str text, pattern text, flags text) RETURNS SETOF text
   returns each section of the string delimited by the pattern.
* regexp_matches(str text, pattern text) RETURNS text[]
   returns all capture groups when matching pattern against string in an array
* regexp_matches(str text, pattern text, flags text) RETURNS SETOF
    (prematch text, fullmatch text, matches text[], postmatch text)
   returns all capture groups when matching pattern against string in an array.
   also returns the entire match in fullmatch.  if the 'g' option is given,
   returns all matches in the string.  if the 'r' option is given, also return
   the text before and after the match in prematch and postmatch respectively.

See the regression tests for more details about usage and return values.

Recent changes:
* I have put the pattern after the string in all of the functions, as
  discussed on the pgsql-hackers mailing list.

* regexp flags (a la regexp_replace).

* make regexp_matches return setof whatever, if given a 'g' flag return
  all matches in string.

Things that I still want to look into:
* maybe a join function that works as an aggregate
   SELECT join(',', col) FROM tbl
  currently can be written as
   SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')


--
Philogeny recapitulates erogeny; erogeny recapitulates philogeny.

Attachment

Re: WIP patch adding new regexp functions

From
Jeremy Drake
Date:
On Wed, 7 Feb 2007, David Fetter wrote:

> On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote:
> > Jeremy Drake <pgsql@jdrake.com> writes:
> > > * Put together a patch to add these functions to core.  I could put them
> > >   directly in regexp.c, so the support functions could stay static.  My
> > >   concern here is that I don't know if there are any functions currently
> > >   in core with OUT parameters.
> >
> > As of 8.2 there are.
> >
> > If we are going to include these I would vote for core not contrib
> > status, exactly to avoid having to export those functions.
>
> +1 for core. :)

Here is a patch to add these functions to core.  Please take a look and
let me know what you think.  I had to jump through a bit of a hoop to
avoid opr_sanity test from breaking, may not have done that right.

Also, this patch allows regexp_replace to take more flags, since my two
new functions needed flags also, I split flag parsing into a seperate
function and made regexp_replace use it too.  It also results that the
error message for an invalid flag in regexp_replace changes, since it is
now more generic as it is called from multiple functions.

I still need to write documentation for the new functions before I
consider the patch completed, but please take a look at the code and see
if it is acceptable as I do not have any further changes planned for it.

--
Chicken Soup, n.:
    An ancient miracle drug containing equal parts of aureomycin,
    cocaine, interferon, and TLC.  The only ailment chicken soup
    can't cure is neurotic dependence on one's mother.
        -- Arthur Naiman, "Every Goy's Guide to Yiddish"

Attachment

patch adding new regexp functions

From
Jeremy Drake
Date:
On Wed, 7 Feb 2007, Jeremy Drake wrote:

> Here is a patch to add these functions to core.  Please take a look and
> let me know what you think.  I had to jump through a bit of a hoop to
> avoid opr_sanity test from breaking, may not have done that right.
>
> Also, this patch allows regexp_replace to take more flags, since my two
> new functions needed flags also, I split flag parsing into a seperate
> function and made regexp_replace use it too.  It also results that the
> error message for an invalid flag in regexp_replace changes, since it is
> now more generic as it is called from multiple functions.
>
> I still need to write documentation for the new functions before I
> consider the patch completed, but please take a look at the code and see
> if it is acceptable as I do not have any further changes planned for it.

I have added documentation for the functions in this patch.  At this point
I am ready to submit this patch for the review and application process.
Please let me know if you find any issues with it.

Thank you

--
The trouble with being punctual is that nobody's there to appreciate
it.
        -- Franklin P. Jones

Attachment

Re: patch adding new regexp functions

From
Neil Conway
Date:
On Thu, 2007-02-08 at 19:22 -0800, Jeremy Drake wrote:
> I have added documentation for the functions in this patch.  At this point
> I am ready to submit this patch for the review and application process.
> Please let me know if you find any issues with it.

Barring any objections, I'll review and apply this patch in the next few
days.

BTW, unless the patch is truly large, leaving the patch uncompressed
makes it easier to eyeball it without leaving my mail client. That's
just personal preference, though...

-Neil



Re: patch adding new regexp functions

From
Alvaro Herrera
Date:
Neil Conway wrote:

> BTW, unless the patch is truly large, leaving the patch uncompressed
> makes it easier to eyeball it without leaving my mail client. That's
> just personal preference, though...

FWIW, I just do a "| zcat | less", which shows it uncompressed.  No big
deal.  I don't know if Evolution can do that ... (it surprised me that
Thunderbird doesn't seem to be capable of searching a message based on
Message-Id).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Neil Conway wrote:
>
> > BTW, unless the patch is truly large, leaving the patch uncompressed
> > makes it easier to eyeball it without leaving my mail client. That's
> > just personal preference, though...
>
> FWIW, I just do a "| zcat | less", which shows it uncompressed.  No big
> deal.  I don't know if Evolution can do that ... (it surprised me that
> Thunderbird doesn't seem to be capable of searching a message based on
> Message-Id).

My .mailcap calls a script runs 'file' and tries to display it properly
automatically, and it worked fine for his attachment.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 8 Feb 2007, Neil Conway wrote:

> On Thu, 2007-02-08 at 19:22 -0800, Jeremy Drake wrote:
> > I have added documentation for the functions in this patch.  At this point
> > I am ready to submit this patch for the review and application process.
> > Please let me know if you find any issues with it.
>
> Barring any objections, I'll review and apply this patch in the next few
> days.
>
> BTW, unless the patch is truly large, leaving the patch uncompressed
> makes it easier to eyeball it without leaving my mail client. That's
> just personal preference, though...

Yeah, I try to do that, but this one just barely passed my personal
compression threshold.  Guess I should raise my threshold :)

Here is a new version of the patch which fixes up the documentation a
little (should have read it over again before posting).  I also added
mention of the additional flags that I added support for, and linked to
the embedded options table for details, since the flags supported
correspond to those.  This is my first foray into the world of Jade, so
please forgive any SGML snafus.

Also, it is uncompressed this time ;)

--
Q:  How many existentialists does it take to screw in a lightbulb?
A:  Two.  One to screw it in and one to observe how the lightbulb
    itself symbolizes a single incandescent beacon of subjective
    reality in a netherworld of endless absurdity reaching out toward a
    maudlin cosmos of nothingness.

Attachment

Re: patch adding new regexp functions

From
Neil Conway
Date:
On Fri, 2007-02-09 at 01:08 -0800, Jeremy Drake wrote:
> Yeah, I try to do that, but this one just barely passed my personal
> compression threshold.  Guess I should raise my threshold :)

No, don't pay any attention to me, I'm just lazy :)

> Here is a new version of the patch which fixes up the documentation a
> little (should have read it over again before posting).

The "doing_srf" business is rather tortured, and seems an invitation for
bugs. ISTM there should be a cleaner way to implement this. For example,
would it be possible to put all the common logic into one or more
additional functions, and then have SRF vs. non-SRF cases that call into
those functions after doing the appropriate initialization?

-Neil



Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Fri, 9 Feb 2007, Neil Conway wrote:

> The "doing_srf" business is rather tortured, and seems an invitation for
> bugs. ISTM there should be a cleaner way to implement this. For example,
> would it be possible to put all the common logic into one or more
> additional functions, and then have SRF vs. non-SRF cases that call into
> those functions after doing the appropriate initialization?

Here is a new version of the patch which eliminates the doing_srf stuff.
It does seem a lot cleaner this way...

--
Fortune's Real-Life Courtroom Quote #18:

Q:  Are you married?
A:  No, I'm divorced.
Q:  And what did your husband do before you divorced him?
A:  A lot of things I didn't know about.

Attachment

Re: patch adding new regexp functions

From
Neil Conway
Date:
On Fri, 2007-02-09 at 16:33 -0800, Jeremy Drake wrote:
> Here is a new version of the patch which eliminates the doing_srf stuff.

* C89 require constant-sized stack allocated arrays, so the coding in
perform_regexp_matches() is non-portable.

* I'm not clear about the control flow in regexp_matches() and
regexp_split(). Presumably it's not possible for the call_cntr to
actually exceed max_calls, so the error message in these cases should be
elog(ERROR), not ereport (the former is for "shouldn't happen" bug
scenarios, the latter is for user-facing errors). Can you describe the
logic here (e.g. via comments) a bit more?

* The logic in regexp_split (incremented_offset, first_match, etc.) is
pretty ugly and hard to follow. The "if" condition on line 1037 is
particularly objectionable. Again, ISTM there should be a cleaner way to
do this.

* Try to keep lines to 80 characters or fewer. If the code is wandering
off the right side of the screen all the time, that might be a hint that
it needs simplification.

Attached is a cleaned up version of your patch -- various improvements
throughout, but mostly cosmetic stuff. Do you want to take a look at the
above?

-Neil


Attachment

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sat, 10 Feb 2007, Neil Conway wrote:

> On Fri, 2007-02-09 at 16:33 -0800, Jeremy Drake wrote:
> > Here is a new version of the patch which eliminates the doing_srf stuff.
>
> * C89 require constant-sized stack allocated arrays, so the coding in
> perform_regexp_matches() is non-portable.

I thought that was the case, but I had seen some other code doing this.
Turns out it was in the fulldisjunctions pgfoundry project.  I replaced
them with palloc'd memory.

> * I'm not clear about the control flow in regexp_matches() and
> regexp_split(). Presumably it's not possible for the call_cntr to
> actually exceed max_calls, so the error message in these cases should be
> elog(ERROR), not ereport (the former is for "shouldn't happen" bug
> scenarios, the latter is for user-facing errors). Can you describe the
> logic here (e.g. via comments) a bit more?

I added some comments, and changed to using elog instead of ereport.

> * The logic in regexp_split (incremented_offset, first_match, etc.) is
> pretty ugly and hard to follow. The "if" condition on line 1037 is
> particularly objectionable. Again, ISTM there should be a cleaner way to
> do this.

That if condition was very difficult to get right ;)  I added a bunch of
comments, and switched the logic around with a continue so it is much more
obvious what is happening there.  Incidentally, that if condition being
incorrect is what results in call_cntr exceeding max_calls :)

>
> * Try to keep lines to 80 characters or fewer. If the code is wandering
> off the right side of the screen all the time, that might be a hint that
> it needs simplification.
>
> Attached is a cleaned up version of your patch -- various improvements
> throughout, but mostly cosmetic stuff. Do you want to take a look at the
> above?
>
> -Neil
>
>

--
If God had meant for us to be naked, we would have been born that way.

Attachment

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sat, 10 Feb 2007, Jeremy Drake wrote:

> On Sat, 10 Feb 2007, Neil Conway wrote:
>
> > * I'm not clear about the control flow in regexp_matches() and
> > regexp_split(). Presumably it's not possible for the call_cntr to
> > actually exceed max_calls, so the error message in these cases should be
> > elog(ERROR), not ereport (the former is for "shouldn't happen" bug
> > scenarios, the latter is for user-facing errors). Can you describe the
> > logic here (e.g. via comments) a bit more?
>
> I added some comments, and changed to using elog instead of ereport.

I fixed a couple more things in this patch.  I changed the max calls limit
to the real limit, rather than the arbitrarily high limit that was
previously set (three times the length of the string in bytes).  Also, I
changed the checks for offset to compare against wide_len rather than
orig_len, since in multibyte character sets orig_len is the length in
bytes of the string in whatever encoding it is in, while wide_len is the
length in characters, which is what everything else in these functions
deal with.

The calls to text_substr have me somewhat concerned now, also.  I think
performance starts to look like O(n^2) in multibyte character sets.  But I
think doing anything about it would require this code to know more about
the internals of text than it has any right to.  I guess settle for the
correctness now, and if performance is a problem this can be addressed.
Would hate to make this code even more ugly due to premature
optimization...

--
When does summertime come to Minnesota, you ask?
Well, last year, I think it was a Tuesday.

Attachment

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
I noticed in the documentation that you added indexterms for the
functions, but that they (along with the existing ones) are in the SIMILAR
TO regular expression section.  These functions do not take SIMILAR TO
regular expressions, but POSIX regular expressions, so I moved them to
that section in this patch, and also moved the regexp_replace indexterm
there also, since it also uses POSIX regular expressions.

--
10.0 times 0.1 is hardly ever 1.0.

Attachment

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
What was the status of this?  Was there anything else I needed to do with
this patch, or is it ready to be applied?  Let me know if there is
anything else I need to do on this...

--
What this country needs is a good five cent nickel.

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Jeremy Drake wrote:
> What was the status of this?  Was there anything else I needed to do with
> this patch, or is it ready to be applied?  Let me know if there is
> anything else I need to do on this...

I assume it is ready for application.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Neil Conway
Date:
On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote:
> What was the status of this?  Was there anything else I needed to do with
> this patch, or is it ready to be applied?  Let me know if there is
> anything else I need to do on this...

Will do -- I'm planning to apply this as soon as I have the free cycles
to do so, likely tomorrow or Friday.

-Neil



Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Neil Conway wrote:
> On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote:
> > What was the status of this?  Was there anything else I needed to
> > do with this patch, or is it ready to be applied?  Let me know if
> > there is anything else I need to do on this...
>
> Will do -- I'm planning to apply this as soon as I have the free
> cycles to do so, likely tomorrow or Friday.

I don't know which patch is actually being proposed now.  It would be
good to make this more explicit and maybe include a synopsis of the
functions in the email, so we know what's going on.

What confuses me about some of the functions I've seen in earlier
patches in this thread is that they return setof something.  But in my
mind, regular expression matches or string splits are inherently
ordered, so an array would be the correct return type.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 15 Feb 2007, Peter Eisentraut wrote:

> Neil Conway wrote:
> > On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote:
> > > What was the status of this?  Was there anything else I needed to
> > > do with this patch, or is it ready to be applied?  Let me know if
> > > there is anything else I need to do on this...
> >
> > Will do -- I'm planning to apply this as soon as I have the free
> > cycles to do so, likely tomorrow or Friday.
>
> I don't know which patch is actually being proposed now.  It would be
> good to make this more explicit and maybe include a synopsis of the
> functions in the email, so we know what's going on.

Sorry, my intent was just to check to see if I had gotten the patch
sufficiently fixed for Neil to apply and he just hadn't gotten to it yet
(which seems to be the case), or if there was something else he still
expected me to fix that I had missed in the prior discussions.  I suppose
I should have emailed him privately.

The patch in question can be seen in the archives here:
http://archives.postgresql.org/pgsql-patches/2007-02/msg00214.php

The functions added are:
* regexp_split(str text, pattern text) RETURNS SETOF text
  regexp_split(str text, pattern text, flags text) RETURNS SETOF text
   returns each section of the string delimited by the pattern.
* regexp_matches(str text, pattern text) RETURNS text[]
   returns all capture groups when matching pattern against string in an
   array
* regexp_matches(str text, pattern text, flags text) RETURNS SETOF
    (prematch text, fullmatch text, matches text[], postmatch text)
   returns all capture groups when matching pattern against string in an
   array.  also returns the entire match in fullmatch.  if the 'g' option
   is given, returns all matches in the string.  if the 'r' option is
   given, also return the text before and after the match in prematch and
   postmatch respectively.

> What confuses me about some of the functions I've seen in earlier
> patches in this thread is that they return setof something.  But in my
> mind, regular expression matches or string splits are inherently
> ordered, so an array would be the correct return type.

They do return SETOF.  Addressing them separately:

regexp_matches uses a text[] for the match groups.  If you specify the
global flag, it could return multiple matches.  Couple this with the
requested feature of pre- and postmatch returns (with its own flag) and
the return would turn into some sort of nasty array of tuples, or multiple
arrays.  It seems much cleaner to me to return a set of the matches found,
and I find which order the matches occur in to be much less important than
the fact that they did occur and their contents.


regexp_split returns setof text.  This has, in my opinion, a much greater
case to return an array.  However, there are several issues with this
approach:

# My experience with the array code leads me to believe that building up
an array is an expensive proposition.  I know I could code it smarter so
that the array is only constructed in the end.

# With a set-returning function, it is possible to add a LIMIT clause, to
prevent splitting up more of the string than is necessary.  It is also
immediately possible to insert them into a table, or do grouping on them,
or call a function on each value.  Most of the time when I do a split, I
intend to do something like this with each split value.

# You can still get an array if you really want it:
#* SELECT ARRAY(SELECT * FROM regexp_split('first, second, third', E',\\s*'))



--
No problem is so formidable that you can't just walk away from it.
        -- C. Schulz

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Jeremy Drake wrote:
> regexp_matches uses a text[] for the match groups.  If you specify
> the global flag, it could return multiple matches.  Couple this with
> the requested feature of pre- and postmatch returns (with its own
> flag) and the return would turn into some sort of nasty array of
> tuples, or multiple arrays.  It seems much cleaner to me to return a
> set of the matches found, and I find which order the matches occur in
> to be much less important than the fact that they did occur and their
> contents.

Then the fact that the flag-less matches function returns an array would
be a mistake.  They have to return the same category of object.

> regexp_split returns setof text.  This has, in my opinion, a much
> greater case to return an array.  However, there are several issues
> with this approach:

Any programming language I have ever seen returns the result of a
regular expression split as a structure with order.  That in turn
implies that there are use cases for having the order, which your
proposed function could not address.

> # My experience with the array code leads me to believe that building
> up an array is an expensive proposition.  I know I could code it
> smarter so that the array is only constructed in the end.

You can make any code arbitrarily fast if it doesn't have to give the
right answer.

> # With a set-returning function, it is possible to add a LIMIT
> clause, to prevent splitting up more of the string than is necessary.

You can also add such functionality to a function in form of a
parameter.  In fact, relying on a LIMIT clause to do this seems pretty
fragile.  We argue elsewhere that LIMIT without ORDER BY is not
well-defined, and while it's hard to imagine in the current
implementation why the result of a set returning function would come
back in arbitrary order, it is certainly possible in theory, so you
still need to order the result set if you want reliable limits, but
that is not possible of the order is lost in the result.

>  It is also immediately possible to insert them into a table, or do
> grouping on them, or call a function on each value.  Most of the time
> when I do a split, I intend to do something like this with each split
> value.

These sort of arguments remind me of the contrib/xml2 module, which also
has a very, uh, pragmatic API.  Sure, these operations may seem useful
to you.  But when we offer a function as part of the core API, it is
also important that we offer a clean design that allows other users to
combine reasonably orthogonal functionality into tools that are useful
to them.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Alvaro Herrera
Date:
Jeremy Drake wrote:

> The functions added are:
> * regexp_split(str text, pattern text) RETURNS SETOF text
>   regexp_split(str text, pattern text, flags text) RETURNS SETOF text
>    returns each section of the string delimited by the pattern.
> * regexp_matches(str text, pattern text) RETURNS text[]
>    returns all capture groups when matching pattern against string in an
>    array
> * regexp_matches(str text, pattern text, flags text) RETURNS SETOF
>     (prematch text, fullmatch text, matches text[], postmatch text)
>    returns all capture groups when matching pattern against string in an
>    array.  also returns the entire match in fullmatch.  if the 'g' option
>    is given, returns all matches in the string.  if the 'r' option is
>    given, also return the text before and after the match in prematch and
>    postmatch respectively.

I think the position the match is in could be important.  I'm wondering
if you could define them like

create type re_match(match text, position int)
regexp_split(str text, pattern text) returns setof re_match

or maybe
regexp_split(str text, pattern text, OUT match text, OUT position int);
(not sure of the exact syntax for this one)

so that you would have the position for each match, automatically.  Is
this information available in the regex code?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: patch adding new regexp functions

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Jeremy Drake wrote:
>> # My experience with the array code leads me to believe that building
>> up an array is an expensive proposition.  I know I could code it
>> smarter so that the array is only constructed in the end.

> You can make any code arbitrarily fast if it doesn't have to give the
> right answer.

Even more to the point, it's folly to suppose that the overhead of
processing a SETOF result is less than that of array construction.

I tend to agree with Peter's concern that returning a set is going to
make it hard to track which result is which.

            regards, tom lane

Re: patch adding new regexp functions

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> so that you would have the position for each match, automatically.  Is
> this information available in the regex code?

Certainly, that's where we got the text snippets from to begin with.
However, I'm not sure that this is important enough to justify a special
type --- for one thing, since we don't have arrays of composites, that
would foreclose responding to Peter's concern that SETOF is the wrong
thing.  If you look at the Perl and Tcl APIs for regexes, they return
just the strings, not the numerical positions; and I've not heard anyone
complaining about that.

            regards, tom lane

Re: patch adding new regexp functions

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > so that you would have the position for each match, automatically.  Is
> > this information available in the regex code?
>
> Certainly, that's where we got the text snippets from to begin with.
> However, I'm not sure that this is important enough to justify a special
> type --- for one thing, since we don't have arrays of composites, that
> would foreclose responding to Peter's concern that SETOF is the wrong
> thing.

My point is that if you want to have the order in which the matches were
found, you can do that easily by looking at the positions; no need to
create an ordered array.  Which does respond to Peter's concern, since
the point was to keep the ordering of matches, which an array does; but
if we provide the positions, the SETOF way does as well.

On the other hand, I don't think it's impossible to have matches that
start earlier than others in the string, but are actually found later
(say, because they are a parentized expression that ends later).  So
giving the starting positions allows one to know where are they located,
rather than where were they reported.  (I don't really know if the
matches are sorted before reporting though.)

> If you look at the Perl and Tcl APIs for regexes, they return
> just the strings, not the numerical positions; and I've not heard anyone
> complaining about that.

I know, but that may be just because it would be too much extra
complexity for them (in terms of user API) to be returning the positions
along the text.  I know I'd be fairly annoyed if =~ in Perl returned an
array of hashes { text => 'foo', position => 42} instead of array of
text.  We don't have that problem.

In fact, I would claim that's much easier to deal with a SETOF function
than is to deal with text[].

Regarding the "nobody complains" argument, I don't find that
particularly compelling; witness how people gets used to working around
limitations in MySQL ... ;-)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Alvaro Herrera wrote:
> On the other hand, I don't think it's impossible to have matches that
> start earlier than others in the string, but are actually found later
> (say, because they are a parentized expression that ends later).  So
> giving the starting positions allows one to know where are they
> located, rather than where were they reported.  (I don't really know
> if the matches are sorted before reporting though.)

I have no strong opinion about how matches are returned.  Seeing the
definitional difficulties that you point out, it may be fine to return
them unordered.  But then all "matches" functions should do that.

For the "split" functions, however, providing the order is clearly
important.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
David Fetter
Date:
On Thu, Feb 15, 2007 at 10:37:26AM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > so that you would have the position for each match, automatically.  Is
> > this information available in the regex code?
>
> Certainly, that's where we got the text snippets from to begin with.
> However, I'm not sure that this is important enough to justify a
> special type --- for one thing, since we don't have arrays of
> composites,

This is a TODO :)

I've obviously misunderstood the scope of the TODO because it appears
that an INSERT into pg_type at creation time for compound types that
looks something like the below would do it.  What have I missed?

INSERT INTO pg_type
VALUES (
    '_foo',         /* Generated by makeArrayTypeName */
    16744,          /* OID of schema */
    10,             /* OID of owner of the base type */
    -1,             /* typlen indicates varlena */
    'f',            /* not passed by value */
    'c',            /* typtype is composite */
    't',            /* type is already defined */
    ',',            /* typdelim */
    0,              /* should this actually refer to the type? */
    'foo'::regtype, /* typelem */
    'array_in',     /* typinput */
    'array_out',    /* typoutput */
    'array_recv',   /* typreceive */
    'array_send',   /* typsend */
    0,              /* typanalyze */
    'i',            /* typalign.  Should this be 'd'? */
    'x',            /* typstorage */
    'f',            /* not a DOMAIN, but while we're at it, why not arrays of DOMAIN? */
    0,              /* base type. should this be different? */
    -1,             /* no typmod */
    0               /* dims not specified */
);


> that would foreclose responding to Peter's concern that SETOF is the
> wrong thing.  If you look at the Perl and Tcl APIs for regexes, they
> return just the strings, not the numerical positions; and I've not
> heard anyone complaining about that.

They do return them in the order in which they appear, though, which,
as far as I can tell, Jeremy's functions also do.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
David Fetter
Date:
On Thu, Feb 15, 2007 at 10:57:45AM +0100, Peter Eisentraut wrote:
> Jeremy Drake wrote:

> > # With a set-returning function, it is possible to add a LIMIT
> > clause, to prevent splitting up more of the string than is
> > necessary.
>
> You can also add such functionality to a function in form of a
> parameter.

That's what things like Perl's split do :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 15 Feb 2007, Alvaro Herrera wrote:

> Jeremy Drake wrote:
>
> > The functions added are:
> > * regexp_split(str text, pattern text) RETURNS SETOF text
> >   regexp_split(str text, pattern text, flags text) RETURNS SETOF text
> >    returns each section of the string delimited by the pattern.
> > * regexp_matches(str text, pattern text) RETURNS text[]
> >    returns all capture groups when matching pattern against string in an
> >    array
> > * regexp_matches(str text, pattern text, flags text) RETURNS SETOF
> >     (prematch text, fullmatch text, matches text[], postmatch text)
> >    returns all capture groups when matching pattern against string in an
> >    array.  also returns the entire match in fullmatch.  if the 'g' option
> >    is given, returns all matches in the string.  if the 'r' option is
> >    given, also return the text before and after the match in prematch and
> >    postmatch respectively.
>
> I think the position the match is in could be important.  I'm wondering
> if you could define them like
>
> create type re_match(match text, position int)
> regexp_split(str text, pattern text) returns setof re_match

So it looks like the issues are:
1. regexp_matches without flags has a different return type than
   regexp_matches with flags.  I can make them return the same OUT
   parameters, but should I declare it as returning SETOF when I know
   for a fact that the no-flags version will never return more than one
   row?

2. regexp_split does not represent the order of the results.  I can do
   something like:

 regexp_split(str text, pattern text[, flags text], OUT result text, OUT
startpos int) RETURNS SETOF record;

It could also have the int being a simple counter to represent the
relative order, rather than the position.


Thoughts?  Do these changes address the issues recently expressed?




--
I have yet to see any problem, however complicated, which, when looked
at in the right way, did not become still more complicated.
        -- Poul Anderson

Re: patch adding new regexp functions

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> I've obviously misunderstood the scope of the TODO because it appears
> that an INSERT into pg_type at creation time for compound types that
> looks something like the below would do it.  What have I missed?

There are a couple of issues.  One is that we probably don't want two
pg_type entries for every single table.  Will you be satisfied if only
CREATE TYPE AS ... makes an array type?  The other is that, at least at
the time they were written, the array support routines couldn't handle
composite array values.  Things might or might not be easier today;
I don't think we had record_in and record_out in their current form
then.

            regards, tom lane

Re: patch adding new regexp functions

From
David Fetter
Date:
On Thu, Feb 15, 2007 at 07:35:46PM -0500, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > I've obviously misunderstood the scope of the TODO because it appears
> > that an INSERT into pg_type at creation time for compound types that
> > looks something like the below would do it.  What have I missed?
>
> There are a couple of issues.  One is that we probably don't want
> two pg_type entries for every single table.

Now that you mention it, I would want that if that's what it takes to
get arrays for them.  The long-term goal here is to make all of
PostgreSQL's types play nicely together.  I'm guessing that SETOF
will eventually be a way to describe a collection because MULTISET is
in SQL:2003.

> Will you be satisfied if only CREATE TYPE AS ... makes an array
> type?  The other is that, at least at the time they were written,
> the array support routines couldn't handle composite array values.
> Things might or might not be easier today; I don't think we had
> record_in and record_out in their current form then.

OK.  What about pg_depend?

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Andrew Dunstan
Date:

Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
>
>> I've obviously misunderstood the scope of the TODO because it appears
>> that an INSERT into pg_type at creation time for compound types that
>> looks something like the below would do it.  What have I missed?
>>
>
> There are a couple of issues.  One is that we probably don't want two
> pg_type entries for every single table.  Will you be satisfied if only
> CREATE TYPE AS ... makes an array type?
>

There should be some better way to create the array type for tables than
directly mangling pg_type, though. Maybe a builtin function that took an
oid?

cheers

andrew

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake:
> On Thu, 15 Feb 2007, Peter Eisentraut wrote:
> > I have no strong opinion about how matches are returned.  Seeing the
> > definitional difficulties that you point out, it may be fine to return
> > them unordered.  But then all "matches" functions should do that.
> >
> > For the "split" functions, however, providing the order is clearly
> > important.
>
> Does this version sufficiently address your concerns?

I don't think anyone asked for the start position and length in the result of
regexp_split().  The result should be an array of text.  That is what Perl et
al. do.

As for the regexp_matches() function, it seems to me that it returns too much
information at once.  What is the use case for getting all of prematch,
fullmatch, matches, and postmatch in one call?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 15 Feb 2007, Peter Eisentraut wrote:

> I have no strong opinion about how matches are returned.  Seeing the
> definitional difficulties that you point out, it may be fine to return
> them unordered.  But then all "matches" functions should do that.
>
> For the "split" functions, however, providing the order is clearly
> important.

Does this version sufficiently address your concerns?

--
Finagle's Creed:
    Science is true.  Don't be misled by facts.

Attachment

Re: patch adding new regexp functions

From
David Fetter
Date:
On Fri, Feb 16, 2007 at 01:19:55PM +0100, Peter Eisentraut wrote:
> Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake:
> > On Thu, 15 Feb 2007, Peter Eisentraut wrote:
> > > I have no strong opinion about how matches are returned.  Seeing
> > > the definitional difficulties that you point out, it may be fine
> > > to return them unordered.  But then all "matches" functions
> > > should do that.
> > >
> > > For the "split" functions, however, providing the order is
> > > clearly important.
> >
> > Does this version sufficiently address your concerns?
>
> I don't think anyone asked for the start position and length in the
> result of regexp_split().  The result should be an array of text.
> That is what Perl et al. do.
>
> As for the regexp_matches() function, it seems to me that it returns
> too much information at once.  What is the use case for getting all
> of prematch, fullmatch, matches, and postmatch in one call?

If not in one call, how would you get it?  Perl, for example, makes
these available to any regex match in the form of variables it sets.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Am Freitag, 16. Februar 2007 17:11 schrieb David Fetter:
> > As for the regexp_matches() function, it seems to me that it returns
> > too much information at once.  What is the use case for getting all
> > of prematch, fullmatch, matches, and postmatch in one call?
>
> If not in one call, how would you get it?  Perl, for example, makes
> these available to any regex match in the form of variables it sets.

The question is, what is the use case?  If there is one in Perl, can this
proposed function API support it?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
David Fetter
Date:
On Fri, Feb 16, 2007 at 05:54:47PM +0100, Peter Eisentraut wrote:
> Am Freitag, 16. Februar 2007 17:11 schrieb David Fetter:
> > > As for the regexp_matches() function, it seems to me that it
> > > returns too much information at once.  What is the use case for
> > > getting all of prematch, fullmatch, matches, and postmatch in
> > > one call?
> >
> > If not in one call, how would you get it?  Perl, for example,
> > makes these available to any regex match in the form of variables
> > it sets.
>
> The question is, what is the use case?  If there is one in Perl, can
> this proposed function API support it?

Perl makes the following variables available in any regex match,
although it optimizes some cases for when they're not there:

$1, ... $n (captured matches in parentheses)
$`         (pre-match)
$'         (post-match)
$&         (whole match)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Andrew Dunstan
Date:
David Fetter wrote:
>>
>> The question is, what is the use case?  If there is one in Perl, can
>> this proposed function API support it?
>>
>
> Perl makes the following variables available in any regex match,
> although it optimizes some cases for when they're not there:
>
> $1, ... $n (captured matches in parentheses)
> $`         (pre-match)
> $'         (post-match)
> $&         (whole match)
>
>

Use of any of these is notoriously costly, BTW, especially pre-match and
post-match. See perlre man page.

cheers

andrew

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
David Fetter wrote:
> > The question is, what is the use case?  If there is one in Perl,
> > can this proposed function API support it?
>
> Perl makes the following variables available in any regex match,

That is not a use case.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
David Fetter
Date:
On Fri, Feb 16, 2007 at 01:03:32PM -0500, Andrew Dunstan wrote:
> David Fetter wrote:
> >>
> >>The question is, what is the use case?  If there is one in Perl, can
> >>this proposed function API support it?
> >>
> >
> >Perl makes the following variables available in any regex match,
> >although it optimizes some cases for when they're not there:
> >
> >$1, ... $n (captured matches in parentheses)
> >$`         (pre-match)
> >$'         (post-match)
> >$&         (whole match)
> >
>
> Use of any of these is notoriously costly, BTW, especially pre-match
> and post-match. See perlre man page.

This is why they only appear when called for :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Fri, 16 Feb 2007, Peter Eisentraut wrote:

> Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake:
> > Does this version sufficiently address your concerns?
>
> I don't think anyone asked for the start position and length in the result of
> regexp_split().  The result should be an array of text.  That is what Perl et
> al. do.

The length is not returned, I simply call length() on the string result to
make sure that no whitespace gets in.

The start position was suggested in these two messages from Alvaro Herrera:
http://archives.postgresql.org/pgsql-patches/2007-02/msg00276.php
http://archives.postgresql.org/pgsql-patches/2007-02/msg00281.php

This was meant to address your concern about the order not necessarily
being preserved of the split results when being returned as SETOF.  This
gives the benefits of returning SETOF while still allowing the order to be
preserved if desired (simply add ORDER BY startpos to guarantee the
correct order).

In case you haven't noticed, I am rather averse to making this return
text[] because it is much easier in my experience to use the results when
returned in SETOF rather than text[], and in all of the code that I have
experience with where this would be useful I would end up using
information_schema._pg_expandarray (a function that, AFAIK, is
documented nowhere) to convert it into SETOF text.  While, if you really
really wanted a text[], you could use the (fully documented) ARRAY(select
resultstr from regexp_split(...) order by startpos) construct.

> As for the regexp_matches() function, it seems to me that it returns too much
> information at once.  What is the use case for getting all of prematch,
> fullmatch, matches, and postmatch in one call?

It was requested by David Fetter:
http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php

It was not horribly difficult to provide, and it seemed reasonable to me.
I have no need for them personally.

--
Some people in this department wouldn't recognize subtlety if it hit
them on the head.

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Jeremy Drake wrote:
> In case you haven't noticed, I am rather averse to making this return
> text[] because it is much easier in my experience to use the results
> when returned in SETOF rather than text[],

The primary use case I know for string splitting is parsing
comma/pipe/whatever separated fields into a row structure, and the way
I see it your API proposal makes that exceptionally difficult.

I don't know what your use case is, though.  All of this is missing
actual use cases.

> While, if you
> really really wanted a text[], you could use the (fully documented)
> ARRAY(select resultstr from regexp_split(...) order by startpos)
> construct.

I think, however, that we should be providing simple primitives that can
be combined into complex expressions rather than complex primitives
that have to be dissected apart to get simple results.

> > As for the regexp_matches() function, it seems to me that it
> > returns too much information at once.  What is the use case for
> > getting all of prematch, fullmatch, matches, and postmatch in one
> > call?
>
> It was requested by David Fetter:
> http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php
>
> It was not horribly difficult to provide, and it seemed reasonable to
> me. I have no need for them personally.

David Fetter has also repeated failed to offer a use case for this, so I
hesitate to accept this.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
David Fetter
Date:
On Sat, Feb 17, 2007 at 09:02:24AM +0100, Peter Eisentraut wrote:
> Jeremy Drake wrote:
> > In case you haven't noticed, I am rather averse to making this return
> > text[] because it is much easier in my experience to use the results
> > when returned in SETOF rather than text[],
>
> The primary use case I know for string splitting is parsing
> comma/pipe/whatever separated fields into a row structure, and the way
> I see it your API proposal makes that exceptionally difficult.
>
> I don't know what your use case is, though.  All of this is missing
> actual use cases.
>
> > While, if you
> > really really wanted a text[], you could use the (fully documented)
> > ARRAY(select resultstr from regexp_split(...) order by startpos)
> > construct.
>
> I think, however, that we should be providing simple primitives that can
> be combined into complex expressions rather than complex primitives
> that have to be dissected apart to get simple results.
>
> > > As for the regexp_matches() function, it seems to me that it
> > > returns too much information at once.  What is the use case for
> > > getting all of prematch, fullmatch, matches, and postmatch in one
> > > call?
> >
> > It was requested by David Fetter:
> > http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php
> >
> > It was not horribly difficult to provide, and it seemed reasonable
> > to me. I have no need for them personally.
>
> David Fetter has also repeated failed to offer a use case for this,
> so I hesitate to accept this.

What is it about having the whole match, pre-match and post-match
available that you're objecting to?  Are you saying there aren't
common uses for any or all of these?  Regular expression users use
them all over the place, and adding this capability to SQL seems like
a reasonable next step :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sat, 17 Feb 2007, Peter Eisentraut wrote:

> Jeremy Drake wrote:
> > In case you haven't noticed, I am rather averse to making this return
> > text[] because it is much easier in my experience to use the results
> > when returned in SETOF rather than text[],
>
> The primary use case I know for string splitting is parsing
> comma/pipe/whatever separated fields into a row structure, and the way
> I see it your API proposal makes that exceptionally difficult.

For this case see string_to_array:
http://developer.postgresql.org/pgdocs/postgres/functions-array.html
select string_to_array('a|b|c', '|');
 string_to_array
-----------------
 {a,b,c}
(1 row)


> I don't know what your use case is, though.  All of this is missing
> actual use cases.

The particular use case I had for this function was at a previous
employer, and I am not sure exactly how much detail is appropriate to
divulge.  Basically, the project was doing some text processing inside of
postgres, and getting all of the words from a string into a table with
some processing (excluding stopwords and so forth) as efficiently as
possible was a big concern.

The regexp_split function code was based on some code that a friend of
mine wrote which used PCRE rather than postgres' internal regexp support.
I don't know exactly what his use-case was, but he probably had
one because he wrote the function and had it returning SETOF text ;)
Perhaps he can share a general idea of what it was (nudge nudge)?

> > While, if you
> > really really wanted a text[], you could use the (fully documented)
> > ARRAY(select resultstr from regexp_split(...) order by startpos)
> > construct.
>
> I think, however, that we should be providing simple primitives that can
> be combined into complex expressions rather than complex primitives
> that have to be dissected apart to get simple results.

The most simple primitive is string_to_array(text, text) returns text[],
but it was not sufficient for our needs.

> > > As for the regexp_matches() function, it seems to me that it
> > > returns too much information at once.  What is the use case for
> > > getting all of prematch, fullmatch, matches, and postmatch in one
> > > call?
> >
> > It was requested by David Fetter:
> > http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php
> >
> > It was not horribly difficult to provide, and it seemed reasonable to
> > me. I have no need for them personally.
>
> David Fetter has also repeated failed to offer a use case for this, so I
> hesitate to accept this.

I have no strong opinion either way, so I will let those who do argue it
out and wait for the dust to settle ;)

--
The Law, in its majestic equality, forbids the rich, as well as the
poor, to sleep under the bridges, to beg in the streets, and to steal
bread.
        -- Anatole France

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
David Fetter wrote:
> What is it about having the whole match, pre-match and post-match
> available that you're objecting to?  Are you saying there aren't
> common uses for any or all of these?  Regular expression users use
> them all over the place,

You keep saying that, and I keep saying please show a use case.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> David Fetter wrote:
>> What is it about having the whole match, pre-match and post-match
>> available that you're objecting to?  Are you saying there aren't
>> common uses for any or all of these?  Regular expression users use
>> them all over the place,

> You keep saying that, and I keep saying please show a use case.

Maybe I'm missing something, but ISTM that given the ability to return
multiple match substrings there is no need for such features.  Given
an arbitrary regex 'xyz', write '^(.*)(xyz)(.*)$' and you'll get back
the pre-match, whole match, and post-match in addition to any
parenthesized substrings that 'xyz' contains.  If you only need some
of them, you can leave out the corresponding parts of this pattern.

So I'd vote against complicating the API in order to make special
provision for these results.  I claim that all we need is a function
taking (string text, pattern text, flags text) and returning either
array of text or setof text containing the matched substrings in
whatever order is standard (order by left-parenthesis position,
I think).  In the degenerate case where there are no parenthesized
subpatterns, just return the whole match as a single element.

As for the argument about array vs setof, I could see doing both to
end the argument of which one is really superior for any particular
problem.

            regards, tom lane

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Jeremy Drake wrote:
> For this case see string_to_array:

That only works with fixed delimiters, and I was hoping that
regexp_split would be an alternative with regexp delimiters.  It's
certainly another argument that all the split functions in PostgreSQL
should offer analogous interfaces.  There is no reason being offered
why splitting on a fixed string should result in an array and splitting
on a regexp should result in a table.

> The particular use case I had for this function was at a previous
> employer, and I am not sure exactly how much detail is appropriate to
> divulge.  Basically, the project was doing some text processing
> inside of postgres, and getting all of the words from a string into a
> table with some processing (excluding stopwords and so forth) as
> efficiently as possible was a big concern.

We already have tsearch for that.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
Finally the voice of reason :)
On Sat, 17 Feb 2007, Tom Lane wrote:

> So I'd vote against complicating the API in order to make special
> provision for these results.  I claim that all we need is a function
> taking (string text, pattern text, flags text) and returning either
> array of text or setof text

For this function, it would be setof array of text, as the capture groups
would definitely go in an array, but if you asked for global in the flags,
there could be more than one match in the string.

> containing the matched substrings in
> whatever order is standard (order by left-parenthesis position,
> I think).  In the degenerate case where there are no parenthesized
> subpatterns, just return the whole match as a single element.

Good idea, that did not occur to me.  I was planning to throw an error in
that case.

> As for the argument about array vs setof, I could see doing both to
> end the argument of which one is really superior for any particular
> problem.

The array vs setof argument was on the split function.  I will work on
doing both.  Any idea how the name and/or arguments should differ?  I
think that an optional limit argument for the array version like perl has
would be reasonable, but a name difference in the functions is probably
necessary to avoid confusion.


--
Speaking of Godzilla and other things that convey horror:

    With a purposeful grimace and a Mongo-like flair
    He throws the spinning disk drives in the air!
    And he picks up a Vax and he throws it back down
    As he wades through the lab making terrible sounds!
    Helpless users with projects due
    Scream "My God!" as he stomps on the tape drives, too!

    Oh, no!  He says Unix runs too slow!  Go, go, DECzilla!
    Oh, yes!  He's gonna bring up VMS!  Go, go, DECzilla!"

* VMS is a trademark of Digital Equipment Corporation
* DECzilla is a trademark of Hollow Chocolate Bunnies of Death, Inc.
        -- Curtis Jackson

Re: patch adding new regexp functions

From
Mark Dilger
Date:
Jeremy Drake wrote:
> The regexp_split function code was based on some code that a friend of
> mine wrote which used PCRE rather than postgres' internal regexp support.
> I don't know exactly what his use-case was, but he probably had
> one because he wrote the function and had it returning SETOF text ;)
> Perhaps he can share a general idea of what it was (nudge nudge)?

db=# CREATE OR REPLACE FUNCTION split(p TEXT, t TEXT) RETURNS SETOF TEXT AS $$
db$#     my ($p, $t) = @_;
db$#     return [ split(/$p/,$t) ];
db$# $$ LANGUAGE plperl;
CREATE FUNCTION
Time: 1.254 ms
db=# select distinct word from (select * from split('\\W+','mary had a little
lamb, whose fleece was black as soot') as word) as ss;
   word
--------
  a
  as
  black
  fleece
  had
  lamb
  little
  mary
  soot
  was
  whose
(11 rows)

Time: 30.517 ms



As you can see, this can easily be done with a plperl function.  Some people may
not want to install plperl, or may not want to allow arbitrary patterns to be
handed to perl in this fashion.  That was not my concern.  I was simply trying
to see if I could make it faster in a C-language coded function.

In the end I dropped the project because the plperl function works fast enough
for me and I don't have any objection to plperl from a security standpoint, etc.

mark

Re: patch adding new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> On Sat, 17 Feb 2007, Tom Lane wrote:
>> So I'd vote against complicating the API in order to make special
>> provision for these results.  I claim that all we need is a function
>> taking (string text, pattern text, flags text) and returning either
>> array of text or setof text

> For this function, it would be setof array of text, as the capture groups
> would definitely go in an array, but if you asked for global in the flags,
> there could be more than one match in the string.

Oh, right.  And you could do a 2-D array if you wanted it all in one
blob (or a guarantee of order).  So no need for record-returning functions?

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sat, 17 Feb 2007, Tom Lane wrote:

> Jeremy Drake <pgsql@jdrake.com> writes:
> > On Sat, 17 Feb 2007, Tom Lane wrote:
> >> So I'd vote against complicating the API in order to make special
> >> provision for these results.  I claim that all we need is a function
> >> taking (string text, pattern text, flags text) and returning either
> >> array of text or setof text
>
> > For this function, it would be setof array of text, as the capture groups
> > would definitely go in an array, but if you asked for global in the flags,
> > there could be more than one match in the string.
>
> Oh, right.  And you could do a 2-D array if you wanted it all in one
> blob (or a guarantee of order).

I don't think that there is much of an argument for having this one return
a 2d array, in this particular case, even perl requires you to build a
loop, IIRC.

my $str = 'foobarbecuebazilbarf';
while($str=~/(b[^b]+)(b[^b]+)/g) {
    print $1, "\t", length($1), "\n";
    print $2, "\t", length($2), "\n";
    print "---\n";
}

bar     3
becue   5
---
bazil   5
barf    4
---

> So no need for record-returning functions?
No.


--
Go placidly amid the noise and waste, and remember what value there may
be in owning a piece thereof.
        -- National Lampoon, "Deteriorata"

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
Attached, please find a new version of the patch which follows Tom's
advice (except regexp_matches returing a 2-D array, which I am not sure if
it is required/desired).  Hopefully this version makes everyone happy ;)

On Sat, 17 Feb 2007, Tom Lane wrote:

> So I'd vote against complicating the API in order to make special
> provision for these results.  I claim that all we need is a function
> taking (string text, pattern text, flags text) and returning either
> array of text or setof text containing the matched substrings in
> whatever order is standard (order by left-parenthesis position,
> I think).  In the degenerate case where there are no parenthesized
> subpatterns, just return the whole match as a single element.

The signature is

regexp_matches(string text, pattern text[, flags text]) returns setof
text[]

and behaves as described.

>
> As for the argument about array vs setof, I could see doing both to
> end the argument of which one is really superior for any particular
> problem.

regexp_split(string text, pattern text[, flags text]) returns setof text

regexp_split_array(string text, pattern text[. flags text[, limit int]])
returns text[]


--
When I was in school, I cheated on my metaphysics exam:
I looked into the soul of the boy sitting next to me.
        -- Woody Allen

Attachment

Re: patch adding new regexp functions

From
Peter Eisentraut
Date:
Jeremy Drake wrote:
> > As for the argument about array vs setof, I could see doing both to
> > end the argument of which one is really superior for any particular
> > problem.
>
> regexp_split(string text, pattern text[, flags text]) returns setof
> text
>
> regexp_split_array(string text, pattern text[. flags text[, limit
> int]]) returns text[]

Since you are not splitting an array but returning an array, I would
think that "regexp_split_to_array" would be better, and the other
should then be "regexp_split_to_table".

But why does the second one have a limit and the first one doesn't?  Is
this because you rely on the LIMIT clause to do the same?  Is there a
guarantee that LIMIT on a table function makes a consistent order?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sun, 18 Feb 2007, Peter Eisentraut wrote:

> Jeremy Drake wrote:
> > > As for the argument about array vs setof, I could see doing both to
> > > end the argument of which one is really superior for any particular
> > > problem.
> >
> > regexp_split(string text, pattern text[, flags text]) returns setof
> > text
> >
> > regexp_split_array(string text, pattern text[. flags text[, limit
> > int]]) returns text[]
>
> Since you are not splitting an array but returning an array, I would
> think that "regexp_split_to_array" would be better, and the other
> should then be "regexp_split_to_table".

OK

>
> But why does the second one have a limit and the first one doesn't?  Is
> this because you rely on the LIMIT clause to do the same?

Yes

> Is there a
> guarantee that LIMIT on a table function makes a consistent order?

Why wouldn't it?

--
When you are in it up to your ears, keep your mouth shut.

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sun, 18 Feb 2007, Jeremy Drake wrote:

> On Sun, 18 Feb 2007, Peter Eisentraut wrote:
>
> > > regexp_split(string text, pattern text[, flags text]) returns setof
> > > text
> > >
> > > regexp_split_array(string text, pattern text[. flags text[, limit
> > > int]]) returns text[]
> >
> > Since you are not splitting an array but returning an array, I would
> > think that "regexp_split_to_array" would be better, and the other
> > should then be "regexp_split_to_table".
>
> OK
>
> > But why does the second one have a limit and the first one doesn't?

I will rename the functions regexp_split_to_(table|array) and I will add
an optional limit parameter to the regexp_split_to_table function, for
consistency and to avoid ordering concerns with LIMIT.

--
Sometimes I worry about being a success in a mediocre world.
        -- Lily Tomlin

Re: patch adding new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> I will rename the functions regexp_split_to_(table|array) and I will add
> an optional limit parameter to the regexp_split_to_table function, for
> consistency and to avoid ordering concerns with LIMIT.

I'd go the other way: get rid of the limit option all 'round.  It seems
like fairly useless complexity.

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sun, 18 Feb 2007, Tom Lane wrote:

> Jeremy Drake <pgsql@jdrake.com> writes:
> > I will rename the functions regexp_split_to_(table|array) and I will add
> > an optional limit parameter to the regexp_split_to_table function, for
> > consistency and to avoid ordering concerns with LIMIT.
>
> I'd go the other way: get rid of the limit option all 'round.  It seems
> like fairly useless complexity.

OK, here is what is hopefully the final version of this patch.  I have
renamed the functions as above, and removed the optional limit parameter.
If there are no further complaints, this should be ready to apply.

--
It is the business of little minds to shrink.
        -- Carl Sandburg

Attachment

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Jeremy Drake wrote:
> On Sun, 18 Feb 2007, Tom Lane wrote:
>
> > Jeremy Drake <pgsql@jdrake.com> writes:
> > > I will rename the functions regexp_split_to_(table|array) and I will add
> > > an optional limit parameter to the regexp_split_to_table function, for
> > > consistency and to avoid ordering concerns with LIMIT.
> >
> > I'd go the other way: get rid of the limit option all 'round.  It seems
> > like fairly useless complexity.
>
> OK, here is what is hopefully the final version of this patch.  I have
> renamed the functions as above, and removed the optional limit parameter.
> If there are no further complaints, this should be ready to apply.
>
> --
> It is the business of little minds to shrink.
>         -- Carl Sandburg
Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Sun, 18 Feb 2007, Jeremy Drake wrote:

> OK, here is what is hopefully the final version of this patch.  I have
> renamed the functions as above, and removed the optional limit parameter.
> If there are no further complaints, this should be ready to apply.

The patch has been sitting in the unapplied patches queue for a while, and
the inevitable bitrot finally caught up with it.  This version of the
patch is exactly the same as the one in the patches queue, except for
using new, non-conflicting oids for the functions.


--
The gods gave man fire and he invented fire engines.  They gave him
love and he invented marriage.

Attachment

Re: patch adding new regexp functions

From
Neil Conway
Date:
Jeremy Drake wrote:
> The patch has been sitting in the unapplied patches queue for a while
Barring any objections, I'll apply this tomorrow.

-Neil


Re: patch adding new regexp functions

From
Neil Conway
Date:
Jeremy Drake wrote:
> The patch has been sitting in the unapplied patches queue for a while, and
> the inevitable bitrot finally caught up with it.  This version of the
> patch is exactly the same as the one in the patches queue, except for
> using new, non-conflicting oids for the functions.
>
Applied -- thanks for the patch. BTW, a few cosmetic comments:

* #includes should be sorted alphabetically (unless another #include
sort order rules take precedence, like including "postgres.h" first).

* patch included some trailing CR line endings

* IMHO using "++var" in the increment component of a for loop is bad
style in C (if only because it is needlessly inconsistent; good C++
style is not necessarily good C style)

* Comments like "/* get text type oid, too lazy to do it some other way
*/" do not exactly inspire confidence :)

-Neil


Re: patch adding new regexp functions

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> * Comments like "/* get text type oid, too lazy to do it some other way
> */" do not exactly inspire confidence :)

Surely the code was just using the TEXTOID macro?  If so, it does not
require a comment.  If not, it should be fixed ...

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Wed, 21 Mar 2007, Tom Lane wrote:

> Neil Conway <neilc@samurai.com> writes:
> > * Comments like "/* get text type oid, too lazy to do it some other way
> > */" do not exactly inspire confidence :)
>
> Surely the code was just using the TEXTOID macro?  If so, it does not
> require a comment.  If not, it should be fixed ...

Nope, it does get_fn_expr_argtype(fcinfo->flinfo, 0);  The too lazy remark
was that I thought there may be a better way (like the macro you
mentioned) but did not go looking for it because I already had something
that worked that I found in the manual.  If you like, I can put together
another patch that removes the param_type members from the structs and
hardcodes TEXTOID in the proper places.  BTW, should I be calling
get_typlenbyvalalign on TEXTOID or are there macros for those also?

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
>

--
We're only in it for the volume.
        -- Black Sabbath

Re: patch adding new regexp functions

From
Gregory Stark
Date:
"Jeremy Drake" <pgsql@jdrake.com> writes:

> BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros
> for those also?

Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic
constants used throughout the source code. I'm surprised there isn't a macro
for it though.

Do you need the alignment? If so I want to check the code against the packed
varlena patch. Just in case.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: patch adding new regexp functions

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> "Jeremy Drake" <pgsql@jdrake.com> writes:
>
>> BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros
>> for those also?
>
> Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic
> constants used throughout the source code. I'm surprised there isn't a macro
> for it though.
>
> Do you need the alignment? If so I want to check the code against the packed
> varlena patch. Just in case.

Ah, it's just to construct an array, that's not a concern at all. And you're
detoasting the text data types before using or storing them so that's fine.


The only thing I would say is that this should maybe be a TextGetDatum() just
for code hygiene. It should be exactly equivalent though:

+     PointerGetDatum(matchctx->orig_str),

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: patch adding new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> BTW, should I be calling
> get_typlenbyvalalign on TEXTOID or are there macros for those also?

By and large we tend to hard-wire those properties too, eg there are
plenty of places that do things like this:

    /* XXX: this hardcodes assumptions about the regtype type */
    result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i');

Some are better commented than others ...

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Wed, 21 Mar 2007, Tom Lane wrote:

> Jeremy Drake <pgsql@jdrake.com> writes:
> > BTW, should I be calling
> > get_typlenbyvalalign on TEXTOID or are there macros for those also?
>
> By and large we tend to hard-wire those properties too, eg there are
> plenty of places that do things like this:
>
>     /* XXX: this hardcodes assumptions about the regtype type */
>     result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i');
>
> Some are better commented than others ...

So, what action (if any) should be taken?  Should all (or some) of these
values be hardcoded, or should the current calls to determine them be left
in place?


>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
>

--
Fortune's Real-Life Courtroom Quote #19:

Q:  Doctor, how many autopsies have you performed on dead people?
A:  All my autopsies have been performed on dead people.

Re: patch adding new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> On Wed, 21 Mar 2007, Tom Lane wrote:
>> By and large we tend to hard-wire those properties too, eg there are
>> plenty of places that do things like this:

> So, what action (if any) should be taken?  Should all (or some) of these
> values be hardcoded, or should the current calls to determine them be left
> in place?

I'd vote for making this new code look like the rest of it, to wit
hardwire the values.  As-is, you're expending code space and cycles
on flexibility that's entirely illusory --- if we ever decided to
change these properties of TEXT, we'd have way more work to do than
just fixing the new regexp functions.

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Wed, 21 Mar 2007, Gregory Stark wrote:

> The only thing I would say is that this should maybe be a TextGetDatum() just
> for code hygiene. It should be exactly equivalent though:

I could not find such a thing using ctags, nor TextPGetDatum().  I looked
at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses
PointerGetDatum.  If there is such a thing, it is well camoflaged...

--
If you think the problem is bad now, just wait until we've solved it.
        -- Arthur Kasspe

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 22 Mar 2007, Tom Lane wrote:

> I'd vote for making this new code look like the rest of it, to wit
> hardwire the values.

Attached please find a patch which does this.



--
Although written many years ago, Lady Chatterley's Lover has just been
reissued by the Grove Press, and this pictorial account of the
day-to-day life of an English gamekeeper is full of considerable
interest to outdoor minded readers, as it contains many passages on
pheasant-raising, the apprehending of poachers, ways to control vermin,
and other chores and duties of the professional gamekeeper.
Unfortunately, one is obliged to wade through many pages of extraneous
material in order to discover and savour those sidelights on the
management of a midland shooting estate, and in this reviewer's opinion
the book cannot take the place of J. R. Miller's "Practical
Gamekeeping."
        -- Ed Zern, "Field and Stream" (Nov. 1959)

Attachment

Re: patch adding new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> I could not find such a thing using ctags, nor TextPGetDatum().  I looked
> at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses
> PointerGetDatum.  If there is such a thing, it is well camoflaged...

AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other
datatypes) is lack of obvious usefulness.  A function dealing with a
"text *" doesn't normally have reason to convert that to a Datum until
it returns --- and at that point PG_RETURN_TEXT_P is the thing to use.
Do you have a counterexample, or does this just suggest that the regexp
function patch needs some refactoring?

            regards, tom lane

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Thu, 22 Mar 2007, Tom Lane wrote:

> AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other
> datatypes) is lack of obvious usefulness.  A function dealing with a
> "text *" doesn't normally have reason to convert that to a Datum until
> it returns --- and at that point PG_RETURN_TEXT_P is the thing to use.
> Do you have a counterexample, or does this just suggest that the regexp
> function patch needs some refactoring?

If you are asking why I have reason to convert text * to a Datum in cases
other than PG_RETURN_TEXT_P, it is used for calling text_substr functions
using DirectFunctionCallN.  BTW, this usage of text_substr using
PointerGetDatum was copied from the pre-existing textregexsubstr function.


--
Malek's Law:
    Any simple idea will be worded in the most complicated way.

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Jeremy Drake wrote:
> On Thu, 22 Mar 2007, Tom Lane wrote:
>
> > AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other
> > datatypes) is lack of obvious usefulness.  A function dealing with a
> > "text *" doesn't normally have reason to convert that to a Datum until
> > it returns --- and at that point PG_RETURN_TEXT_P is the thing to use.
> > Do you have a counterexample, or does this just suggest that the regexp
> > function patch needs some refactoring?
>
> If you are asking why I have reason to convert text * to a Datum in cases
> other than PG_RETURN_TEXT_P, it is used for calling text_substr functions
> using DirectFunctionCallN.  BTW, this usage of text_substr using
> PointerGetDatum was copied from the pre-existing textregexsubstr function.

Is there a follup patch based on this discussion?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Jeremy Drake wrote:
>> On Thu, 22 Mar 2007, Tom Lane wrote:
>>> AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other
>>> datatypes) is lack of obvious usefulness.

>> If you are asking why I have reason to convert text * to a Datum in cases
>> other than PG_RETURN_TEXT_P, it is used for calling text_substr functions
>> using DirectFunctionCallN.  BTW, this usage of text_substr using
>> PointerGetDatum was copied from the pre-existing textregexsubstr function.

> Is there a follup patch based on this discussion?

Not at the moment.  I suppose someone could run around and replace
PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems
like mostly make-work.  I definitely don't want to spend time on such
a project for 8.3.

Or were you speaking to the question of whether to adjust the regexp
patch to conform more nearly to the coding practices found elsewhere?
I agree with that, but I thought there was already a submitted patch
for it.

            regards, tom lane

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Jeremy Drake wrote:
> >> On Thu, 22 Mar 2007, Tom Lane wrote:
> >>> AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other
> >>> datatypes) is lack of obvious usefulness.
>
> >> If you are asking why I have reason to convert text * to a Datum in cases
> >> other than PG_RETURN_TEXT_P, it is used for calling text_substr functions
> >> using DirectFunctionCallN.  BTW, this usage of text_substr using
> >> PointerGetDatum was copied from the pre-existing textregexsubstr function.
>
> > Is there a follup patch based on this discussion?
>
> Not at the moment.  I suppose someone could run around and replace
> PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems
> like mostly make-work.  I definitely don't want to spend time on such
> a project for 8.3.
>
> Or were you speaking to the question of whether to adjust the regexp
> patch to conform more nearly to the coding practices found elsewhere?
> I agree with that, but I thought there was already a submitted patch
> for it.

Yes, regex patch adjustment, and I have not seen a patch which makes
such adjustments.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
On Mon, 26 Mar 2007, Bruce Momjian wrote:

> Tom Lane wrote:
> > Or were you speaking to the question of whether to adjust the regexp
> > patch to conform more nearly to the coding practices found elsewhere?
> > I agree with that, but I thought there was already a submitted patch
> > for it.
>
> Yes, regex patch adjustment, and I have not seen a patch which makes
> such adjustments.

http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php


--
Pope Goestheveezl was the shortest reigning pope in the history of the
Church, reigning for two hours and six minutes on 1 April 1866.  The
white smoke had hardly faded into the blue of the Vatican skies before
it dawned on the assembled multitudes in St. Peter's Square that his
name had hilarious possibilities.  The crowds fell about, helpless with
laughter, singing

    Half a pound of tuppenny rice
    Half a pound of treacle
    That's the way the chimney smokes
    Pope Goestheveezl

The square was finally cleared by armed carabineri with tears of
laughter streaming down their faces.  The event set a record for
hilarious civic functions, smashing the previous record set when Baron
Hans Neizant B"ompzidaize was elected Landburgher of K"oln in 1653.
        -- Mike Harding, "The Armchair Anarchist's Almanac"

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Jeremy Drake wrote:
> On Mon, 26 Mar 2007, Bruce Momjian wrote:
>
> > Tom Lane wrote:
> > > Or were you speaking to the question of whether to adjust the regexp
> > > patch to conform more nearly to the coding practices found elsewhere?
> > > I agree with that, but I thought there was already a submitted patch
> > > for it.
> >
> > Yes, regex patch adjustment, and I have not seen a patch which makes
> > such adjustments.
>
> http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php

Ah, yes, I still have that in my mailbox, but didn't think it was
related because there was discussion _after_ the patch was posted.  Will
add the patch to the queue now.  Thanks.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Jeremy Drake wrote:
> On Thu, 22 Mar 2007, Tom Lane wrote:
>
> > I'd vote for making this new code look like the rest of it, to wit
> > hardwire the values.
>
> Attached please find a patch which does this.
>
>
>
> --
> Although written many years ago, Lady Chatterley's Lover has just been
> reissued by the Grove Press, and this pictorial account of the
> day-to-day life of an English gamekeeper is full of considerable
> interest to outdoor minded readers, as it contains many passages on
> pheasant-raising, the apprehending of poachers, ways to control vermin,
> and other chores and duties of the professional gamekeeper.
> Unfortunately, one is obliged to wade through many pages of extraneous
> material in order to discover and savour those sidelights on the
> management of a midland shooting estate, and in this reviewer's opinion
> the book cannot take the place of J. R. Miller's "Practical
> Gamekeeping."
>         -- Ed Zern, "Field and Stream" (Nov. 1959)
Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Jeremy Drake
Date:
> Jeremy Drake wrote:
> > On Thu, 22 Mar 2007, Tom Lane wrote:
> >
> > > I'd vote for making this new code look like the rest of it, to wit
> > > hardwire the values.
> >
> > Attached please find a patch which does this.

I just realized that the last patch removed all usage of fcinfo in the
setup_regexp_matches function, so this version of the patch also removes
it as a parameter to that function.

--
Think of it!  With VLSI we can pack 100 ENIACs in 1 sq. cm.!

Attachment

Re: patch adding new regexp functions

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Jeremy Drake wrote:
> > Jeremy Drake wrote:
> > > On Thu, 22 Mar 2007, Tom Lane wrote:
> > >
> > > > I'd vote for making this new code look like the rest of it, to wit
> > > > hardwire the values.
> > >
> > > Attached please find a patch which does this.
>
> I just realized that the last patch removed all usage of fcinfo in the
> setup_regexp_matches function, so this version of the patch also removes
> it as a parameter to that function.
>
> --
> Think of it!  With VLSI we can pack 100 ENIACs in 1 sq. cm.!
Content-Description:

[ Attachment, skipping... ]

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: patch adding new regexp functions

From
Neil Conway
Date:
Jeremy Drake wrote:
> I just realized that the last patch removed all usage of fcinfo in the
> setup_regexp_matches function, so this version of the patch also removes
> it as a parameter to that function.
Applied, thanks.


-Neil