Thread: writing new regexp functions

writing new regexp functions

From
Jeremy Drake
Date:
I am wanting to write some new C functions which leverage postgresql's
existing regexp code in an extension module.  I notice that the functions
RE_compile_and_cache and RE_compile_and_execute in
src/backend/util/regexp.c contain the code necessary to connect the regexp
code in src/backend/regex with the postgresql string conversion, error
reporting, and memory management infrastructure, as well as providing
caching of regexes which would probably be a win to any regex function in
postgres.  It would seem that these functions would be useful to any
C function dealing with regexp matching in postgresql, but they are static
functions, so they cannot be used outside of
src/backend/utils/adt/regexp.c.  Since all of the core regexp functions
are in this file, this has been ok, but it is my opinion that these
functions should be made visible and added to a header file so that
extensions can make use of them, because any add-on functions that want to
use the regex code in postgres in some new way would need to basically
duplicate that same code in order to do so.

Is there some specific reason that these functions are static, or would it
be ok to make them non-static and add them to a header (say,
src/include/utils/regexp.h) so that extensions could use them as well?  I
could put together a patch for this if desired, or it seems simple enough
that someone could just do it...

-- 
I can't decide whether to commit suicide or go bowling.    -- Florence Henderson


Re: writing new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> Is there some specific reason that these functions are static,

Yeah: not cluttering the global namespace.  I'm not excited about
exporting everything that anybody could possibly want access to;
that just makes it harder to maintain the code.  When you see a
static function, you know that you don't have to look further than
the current file to understand how it's used.  When you see a global
function, the difficulty of knowing how it's used is an order of
magnitude higher, maybe more.  What's more, if you want to change it
then you have to worry about the possible side-effects on unknown
non-core code that might be calling it.

Is there a reason for not putting your new code itself into regexp.c?
        regards, tom lane


Re: writing new regexp functions

From
Jeremy Drake
Date:
On Thu, 1 Feb 2007, Tom Lane wrote:

> Jeremy Drake <pgsql@jdrake.com> writes:
> > Is there some specific reason that these functions are static,
>
> Yeah: not cluttering the global namespace.

> Is there a reason for not putting your new code itself into regexp.c?

Not really, I just figured it would be cleaner/easier to write it as an
extension.  I also figure that it is unlikely that every regexp function
that anyone could possibly want will be implemented in core in that one
file.  If anyone writes an extension like this, they would need to
duplicate a good amount of code in order to do so, that would make more
difficulty in maintaining the code if it should need to change.  It also
makes developing a new function a lot easier, no need to re-initdb to add
the function, no need to relink the postmaster and restart it every time
the function changes.

Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the first
match group, would return an array of text containing all of the match
groups.  I exported the functions in my sandbox, and wrote a module with a
function that does this.

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
>

-- 
Calling J-Man Kink.  Calling J-Man Kink.  Hash missile sighted, target
Los Angeles.  Disregard personal feelings about city and intercept.


Re: writing new regexp functions

From
David Fetter
Date:
On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
> On Thu, 1 Feb 2007, Tom Lane wrote:
> 
> > Jeremy Drake <pgsql@jdrake.com> writes:
> > > Is there some specific reason that these functions are static,
> >
> > Yeah: not cluttering the global namespace.
> 
> > Is there a reason for not putting your new code itself into regexp.c?
> 
> Not really, I just figured it would be cleaner/easier to write it as an
> extension.  I also figure that it is unlikely that every regexp function
> that anyone could possibly want will be implemented in core in that one
> file.  If anyone writes an extension like this, they would need to
> duplicate a good amount of code in order to do so, that would make more
> difficulty in maintaining the code if it should need to change.  It also
> makes developing a new function a lot easier, no need to re-initdb to add
> the function, no need to relink the postmaster and restart it every time
> the function changes.
> 
> Anyway, the particular thing I was writing was a function like
> substring(str FROM pattern) which instead of returning just the
> first match group, would return an array of text containing all of
> the match groups.

That'd be great!  People who use dynamic languages like Perl would
feel much more at home having access to all the matches.  While you're
at it, could you could make pre-match and post-match (optionally--I
know it's expensive) available?

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: writing new regexp functions

From
Jeremy Drake
Date:
On Thu, 1 Feb 2007, David Fetter wrote:

> On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
> > Anyway, the particular thing I was writing was a function like
> > substring(str FROM pattern) which instead of returning just the
> > first match group, would return an array of text containing all of
> > the match groups.

If you are subscribed to -patches, I sent my code to date there earlier
this evening.  I also said that I wanted to make a function that split on
a pattern (like perl split) and returned setof text.

> That'd be great!  People who use dynamic languages like Perl would
> feel much more at home having access to all the matches.  While you're
> at it, could you could make pre-match and post-match (optionally--I
> know it's expensive) available?

I could, but I'm not sure how someone would go about accessing such a
thing.  What I just wrote would be most like this perl:
@foo = ($str=~/pattern/);

Where would pre and post match fit into this?  Are you talking about a
different function?  Or sticking prematch at the beginning of the array
and postmatch at the end?  I could also put the whole match somewhere
also, but I did not in this version.

The code I wrote returns a text[] which is one-dimensional, has a lower
bound of 1 (as most postgres arrays do), where if there are n capture
groups, ra[1] has the first capture group and ra[n] has the last one.
Since postgres has an option to make different lower bounds, I suppose I
could have an option to put the prematch in [-1], the entire match in [0],
and the postmatch in [n+1].  This seems to be odd to me though.

I guess I'm saying, I agree that the entire match, prematch, and postmatch
would be helpful, but how would you propose to present these to the user?

>
> Cheers,
> D
>

-- 
To err is human, to forgive, beyond the scope of the Operating System.


Re: writing new regexp functions

From
David Fetter
Date:
On Thu, Feb 01, 2007 at 10:16:54PM -0800, Jeremy Drake wrote:
> On Thu, 1 Feb 2007, David Fetter wrote:
> 
> > On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
> > > Anyway, the particular thing I was writing was a function like
> > > substring(str FROM pattern) which instead of returning just the
> > > first match group, would return an array of text containing all
> > > of the match groups.
> 
> If you are subscribed to -patches, I sent my code to date there
> earlier this evening.  I also said that I wanted to make a function
> that split on a pattern (like perl split) and returned setof text.
> 
> > That'd be great!  People who use dynamic languages like Perl would
> > feel much more at home having access to all the matches.  While
> > you're at it, could you could make pre-match and post-match
> > (optionally--I know it's expensive) available?
> 
> I could, but I'm not sure how someone would go about accessing such
> a thing.  What I just wrote would be most like this perl: @foo =
> ($str=~/pattern/);

> Where would pre and post match fit into this?  Are you talking about a
> different function?

Yes, although it might have the same name, as in regex_match(pattern
TEXT, string TEXT, return_pre_and_post BOOL).

> Or sticking prematch at the beginning of the array and postmatch at
> the end?  I could also put the whole match somewhere also, but I did
> not in this version.

The data structure could be something like

TYPE matches (   prematch TEXT,    match    TEXT[],    postmatch TEXT
)

> The code I wrote returns a text[] which is one-dimensional, has a lower
> bound of 1 (as most postgres arrays do), where if there are n capture
> groups, ra[1] has the first capture group and ra[n] has the last one.
> Since postgres has an option to make different lower bounds, I suppose I
> could have an option to put the prematch in [-1], the entire match in [0],
> and the postmatch in [n+1].  This seems to be odd to me though.

Odd == bad.  I think the pre- and post-matches should be different in
essence, not just in index :)

> I guess I'm saying, I agree that the entire match, prematch, and postmatch
> would be helpful, but how would you propose to present these to the user?

See above :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: writing new regexp functions

From
Jeremy Drake
Date:
On Thu, 1 Feb 2007, David Fetter wrote:

> Yes, although it might have the same name, as in regex_match(pattern
> TEXT, string TEXT, return_pre_and_post BOOL).
>
> The data structure could be something like
>
> TYPE matches (
>     prematch TEXT,
>      match    TEXT[],
>      postmatch TEXT
> )

I just coded up for this:

CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
text[]   AS 'MODULE_PATHNAME', 'regexp_matches'   LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION regexp_matches(       IN str text, IN pattern text, IN return_pre_and_post bool,       OUT prematch
text,OUT fullmatch text, OUT matches text[], OUT
 
postmatch text) RETURNS record   AS 'MODULE_PATHNAME', 'regexp_matches'   LANGUAGE C IMMUTABLE STRICT;


Which works like this:

jeremyd=# \pset null '\\N'
Null display is "\N".
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$);regexp_matches
----------------{bar,beque}
(1 row)

jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, false);prematch | fullmatch |   matches   | postmatch
----------+-----------+-------------+-----------\N       | \N        | {bar,beque} | \N
(1 row)

jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, true);prematch | fullmatch |   matches   | postmatch
----------+-----------+-------------+-----------foo      | barbeque  | {bar,beque} | baz
(1 row)


And then you also have this behavior in the matches array:

jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.*)(beque)$re$);regexp_matches
----------------{bar,"",beque}
(1 row)

jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.+)(beque)$re$);regexp_matches
----------------\N
(1 row)

jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.+)?(beque)$re$); regexp_matches
------------------{bar,NULL,beque}
(1 row)

Reasonable?

-- 
A.A.A.A.A.:An organization for drunks who drive


Re: writing new regexp functions

From
Jeremy Drake
Date:
On Fri, 2 Feb 2007, Jeremy Drake wrote:

> jeremyd=# select * from regexp_matches('foobarbequebaz',
> $re$(bar)(beque)$re$, false);
>  prematch | fullmatch |   matches   | postmatch
> ----------+-----------+-------------+-----------
>  \N       | \N        | {bar,beque} | \N
> (1 row)

I just changed this to fill in fullmatch when the bool is false, so this
one would look like:prematch | fullmatch |   matches   | postmatch
----------+-----------+-------------+-----------\N       | barbeque  | {bar,beque} | \N
(1 row)

I also removed my check for capture groups, since in this setup you could
get useful output without any.  I am still trying to decide whether or not
to add back an error if you called the no-bool version which just returns
the array, and you do not have any capture groups.  ISTM this is likely an
oversight on the query author's part, and it would be helpful to alert him
to this.

If you have no capture groups, the matches array is empty (not null).  If
the match happened at the start of the string, the prematch is an empty
string, and if the match happened at the end of the string, the postmatch
is an empty string.


> Reasonable?

-- 
It's odd, and a little unsettling, to reflect upon the fact that
English is the only major language in which "I" is capitalized; in many
other languages "You" is capitalized and the "i" is lower case.    -- Sydney J. Harris


Re: writing new regexp functions

From
David Fetter
Date:
On Fri, Feb 02, 2007 at 12:54:30AM -0800, Jeremy Drake wrote:
> On Fri, 2 Feb 2007, Jeremy Drake wrote:
> 
> > jeremyd=# select * from regexp_matches('foobarbequebaz',
> > $re$(bar)(beque)$re$, false);
> >  prematch | fullmatch |   matches   | postmatch
> > ----------+-----------+-------------+-----------
> >  \N       | \N        | {bar,beque} | \N
> > (1 row)
> 
> I just changed this to fill in fullmatch when the bool is false, so this
> one would look like:
>  prematch | fullmatch |   matches   | postmatch
> ----------+-----------+-------------+-----------
>  \N       | barbeque  | {bar,beque} | \N
> (1 row)
> 
> I also removed my check for capture groups, since in this setup you could
> get useful output without any.  I am still trying to decide whether or not
> to add back an error if you called the no-bool version which just returns
> the array, and you do not have any capture groups.  ISTM this is likely an
> oversight on the query author's part, and it would be helpful to alert him
> to this.
> 
> If you have no capture groups, the matches array is empty (not null).  If
> the match happened at the start of the string, the prematch is an empty
> string, and if the match happened at the end of the string, the postmatch
> is an empty string.
> 
> > Reasonable?

This is great :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: writing new regexp functions

From
Jeremy Drake
Date:
On Fri, 2 Feb 2007, Jeremy Drake wrote:

> I just coded up for this:
>
> CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
> text[]
>     AS 'MODULE_PATHNAME', 'regexp_matches'
>     LANGUAGE C IMMUTABLE STRICT;
>
> CREATE FUNCTION regexp_matches(
>         IN str text, IN pattern text, IN return_pre_and_post bool,
>         OUT prematch text, OUT fullmatch text, OUT matches text[], OUT
> postmatch text) RETURNS record
>     AS 'MODULE_PATHNAME', 'regexp_matches'
>     LANGUAGE C IMMUTABLE STRICT;
>

I wanted to put out there the question of what order the parameters to
these regex functions should go.  ISTM most people expect them to go
(pattern, string), but I made these functions consistant with
substring(text,text) which takes (string, pattern).  Now I have been
working on a regexp_split function, which takes (pattern, string), which
is what someone familiar with the function from perl would expect, but is
not consistant with substring or now with my regexp_matches function.

I want to ask, should I break with following substring's precedent, and
put the pattern first (as most people probably would expect), or should I
break with perl's precedent and put the pattern second (to behave like
substring)?


-- 
We cannot put the face of a person on a stamp unless said person is
deceased.  My suggestion, therefore, is that you drop dead.    -- James E. Day, Postmaster General


Re: writing new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> I want to ask, should I break with following substring's precedent, and
> put the pattern first (as most people probably would expect), or should I
> break with perl's precedent and put the pattern second (to behave like
> substring)?

All of SQL's pattern match operators have the pattern on the right, so
my advice is to stick with that and try not to think about Perl ;-)
        regards, tom lane


Re: writing new regexp functions

From
David Fetter
Date:
On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote:
> Jeremy Drake <pgsql@jdrake.com> writes:
> > I want to ask, should I break with following substring's
> > precedent, and put the pattern first (as most people probably
> > would expect), or should I break with perl's precedent and put the
> > pattern second (to behave like substring)?
> 
> All of SQL's pattern match operators have the pattern on the right,
> so my advice is to stick with that and try not to think about Perl
> ;-)

Perl provides inspiration, but here, consistency would help more than
orderly imitation of how it does what it does.   And besides, when
people really need Perl, they can pull it in as a PL :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: [PATCHES] writing new regexp functions

From
Jeremy Drake
Date:
On Sun, 4 Feb 2007, David Fetter wrote:

> On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
>
> > Let me know if you see any bugs or issues with this code, and I am
> > open to suggestions for further regression tests ;)

I have not heard anything, so I guess at this point I should figure out
where to go next with this.  I see a couple options:

* Set this up as a pgfoundry project or contrib.  This would require merging the patch to expose some functions from
regexp.coutside that file, which has raised some concerns about maintainability.
 

* Put together a patch to add these functions to core.  I could put them directly in regexp.c, so the support functions
couldstay static.  My concern here is that I don't know if there are any functions currently in core with OUT
parameters. I don't know the acceptable style for handling this: OUT parameters, a named composite type, ...?
 

Does anyone have any opinions either way, as to how I should proceed
from here?



> > * maybe a join function that works as an aggregate
> >    SELECT join(',', col) FROM tbl
> >   currently can be written as
> >    SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
>
> The array_accum() aggregate in the docs works OK for this purpose.

I have decided not to pursue this function, I think the array construct,
or the array_accum option, is about the best possible currently.  If it
should become possible in the future to write aggregates with a non-sql
state type (structs with pointers) it may be worthwhile to re-evaluate
this.


-- 
The cost of living hasn't affected its popularity.


Re: [PATCHES] writing new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> * Put together a patch to add these functions to core.  I could put them
>   directly in regexp.c, so the support functions could stay static.  My
>   concern here is that I don't know if there are any functions currently
>   in core with OUT parameters.

As of 8.2 there are.

If we are going to include these I would vote for core not contrib
status, exactly to avoid having to export those functions.
        regards, tom lane


Re: [PATCHES] writing new regexp functions

From
David Fetter
Date:
On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote:
> Jeremy Drake <pgsql@jdrake.com> writes:
> > * Put together a patch to add these functions to core.  I could put them
> >   directly in regexp.c, so the support functions could stay static.  My
> >   concern here is that I don't know if there are any functions currently
> >   in core with OUT parameters.
> 
> As of 8.2 there are.
> 
> If we are going to include these I would vote for core not contrib
> status, exactly to avoid having to export those functions.

+1 for core. :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!


Re: [PATCHES] writing new regexp functions

From
Jeremy Drake
Date:
On Wed, 7 Feb 2007, Tom Lane wrote:

> Jeremy Drake <pgsql@jdrake.com> writes:
> > * Put together a patch to add these functions to core.  I could put them
> >   directly in regexp.c, so the support functions could stay static.  My
> >   concern here is that I don't know if there are any functions currently
> >   in core with OUT parameters.
>
> As of 8.2 there are.

Could you give me the name of one in pg_proc.h so I can see how I should
go about adding one there?

> If we are going to include these I would vote for core not contrib
> status, exactly to avoid having to export those functions.

OK, this patch will be my next project.


-- 
History is curious stuffYou'd think by now we had enough
Yet the fact remains I fearThey make more of it every year.


Re: [PATCHES] writing new regexp functions

From
Tom Lane
Date:
Jeremy Drake <pgsql@jdrake.com> writes:
> On Wed, 7 Feb 2007, Tom Lane wrote:
>> As of 8.2 there are.

> Could you give me the name of one in pg_proc.h so I can see how I should
> go about adding one there?

select * from pg_proc where proargmodes is not null;
        regards, tom lane