Thread: writing new regexp functions
I am wanting to write some new C functions which leverage postgresql's existing regexp code in an extension module. I notice that the functions RE_compile_and_cache and RE_compile_and_execute in src/backend/util/regexp.c contain the code necessary to connect the regexp code in src/backend/regex with the postgresql string conversion, error reporting, and memory management infrastructure, as well as providing caching of regexes which would probably be a win to any regex function in postgres. It would seem that these functions would be useful to any C function dealing with regexp matching in postgresql, but they are static functions, so they cannot be used outside of src/backend/utils/adt/regexp.c. Since all of the core regexp functions are in this file, this has been ok, but it is my opinion that these functions should be made visible and added to a header file so that extensions can make use of them, because any add-on functions that want to use the regex code in postgres in some new way would need to basically duplicate that same code in order to do so. Is there some specific reason that these functions are static, or would it be ok to make them non-static and add them to a header (say, src/include/utils/regexp.h) so that extensions could use them as well? I could put together a patch for this if desired, or it seems simple enough that someone could just do it... -- I can't decide whether to commit suicide or go bowling. -- Florence Henderson
Jeremy Drake <pgsql@jdrake.com> writes: > Is there some specific reason that these functions are static, Yeah: not cluttering the global namespace. I'm not excited about exporting everything that anybody could possibly want access to; that just makes it harder to maintain the code. When you see a static function, you know that you don't have to look further than the current file to understand how it's used. When you see a global function, the difficulty of knowing how it's used is an order of magnitude higher, maybe more. What's more, if you want to change it then you have to worry about the possible side-effects on unknown non-core code that might be calling it. Is there a reason for not putting your new code itself into regexp.c? regards, tom lane
On Thu, 1 Feb 2007, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > Is there some specific reason that these functions are static, > > Yeah: not cluttering the global namespace. > Is there a reason for not putting your new code itself into regexp.c? Not really, I just figured it would be cleaner/easier to write it as an extension. I also figure that it is unlikely that every regexp function that anyone could possibly want will be implemented in core in that one file. If anyone writes an extension like this, they would need to duplicate a good amount of code in order to do so, that would make more difficulty in maintaining the code if it should need to change. It also makes developing a new function a lot easier, no need to re-initdb to add the function, no need to relink the postmaster and restart it every time the function changes. Anyway, the particular thing I was writing was a function like substring(str FROM pattern) which instead of returning just the first match group, would return an array of text containing all of the match groups. I exported the functions in my sandbox, and wrote a module with a function that does this. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > > -- Calling J-Man Kink. Calling J-Man Kink. Hash missile sighted, target Los Angeles. Disregard personal feelings about city and intercept.
On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote: > On Thu, 1 Feb 2007, Tom Lane wrote: > > > Jeremy Drake <pgsql@jdrake.com> writes: > > > Is there some specific reason that these functions are static, > > > > Yeah: not cluttering the global namespace. > > > Is there a reason for not putting your new code itself into regexp.c? > > Not really, I just figured it would be cleaner/easier to write it as an > extension. I also figure that it is unlikely that every regexp function > that anyone could possibly want will be implemented in core in that one > file. If anyone writes an extension like this, they would need to > duplicate a good amount of code in order to do so, that would make more > difficulty in maintaining the code if it should need to change. It also > makes developing a new function a lot easier, no need to re-initdb to add > the function, no need to relink the postmaster and restart it every time > the function changes. > > Anyway, the particular thing I was writing was a function like > substring(str FROM pattern) which instead of returning just the > first match group, would return an array of text containing all of > the match groups. That'd be great! People who use dynamic languages like Perl would feel much more at home having access to all the matches. While you're at it, could you could make pre-match and post-match (optionally--I know it's expensive) available? Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Thu, 1 Feb 2007, David Fetter wrote: > On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote: > > Anyway, the particular thing I was writing was a function like > > substring(str FROM pattern) which instead of returning just the > > first match group, would return an array of text containing all of > > the match groups. If you are subscribed to -patches, I sent my code to date there earlier this evening. I also said that I wanted to make a function that split on a pattern (like perl split) and returned setof text. > That'd be great! People who use dynamic languages like Perl would > feel much more at home having access to all the matches. While you're > at it, could you could make pre-match and post-match (optionally--I > know it's expensive) available? I could, but I'm not sure how someone would go about accessing such a thing. What I just wrote would be most like this perl: @foo = ($str=~/pattern/); Where would pre and post match fit into this? Are you talking about a different function? Or sticking prematch at the beginning of the array and postmatch at the end? I could also put the whole match somewhere also, but I did not in this version. The code I wrote returns a text[] which is one-dimensional, has a lower bound of 1 (as most postgres arrays do), where if there are n capture groups, ra[1] has the first capture group and ra[n] has the last one. Since postgres has an option to make different lower bounds, I suppose I could have an option to put the prematch in [-1], the entire match in [0], and the postmatch in [n+1]. This seems to be odd to me though. I guess I'm saying, I agree that the entire match, prematch, and postmatch would be helpful, but how would you propose to present these to the user? > > Cheers, > D > -- To err is human, to forgive, beyond the scope of the Operating System.
On Thu, Feb 01, 2007 at 10:16:54PM -0800, Jeremy Drake wrote: > On Thu, 1 Feb 2007, David Fetter wrote: > > > On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote: > > > Anyway, the particular thing I was writing was a function like > > > substring(str FROM pattern) which instead of returning just the > > > first match group, would return an array of text containing all > > > of the match groups. > > If you are subscribed to -patches, I sent my code to date there > earlier this evening. I also said that I wanted to make a function > that split on a pattern (like perl split) and returned setof text. > > > That'd be great! People who use dynamic languages like Perl would > > feel much more at home having access to all the matches. While > > you're at it, could you could make pre-match and post-match > > (optionally--I know it's expensive) available? > > I could, but I'm not sure how someone would go about accessing such > a thing. What I just wrote would be most like this perl: @foo = > ($str=~/pattern/); > Where would pre and post match fit into this? Are you talking about a > different function? Yes, although it might have the same name, as in regex_match(pattern TEXT, string TEXT, return_pre_and_post BOOL). > Or sticking prematch at the beginning of the array and postmatch at > the end? I could also put the whole match somewhere also, but I did > not in this version. The data structure could be something like TYPE matches ( prematch TEXT, match TEXT[], postmatch TEXT ) > The code I wrote returns a text[] which is one-dimensional, has a lower > bound of 1 (as most postgres arrays do), where if there are n capture > groups, ra[1] has the first capture group and ra[n] has the last one. > Since postgres has an option to make different lower bounds, I suppose I > could have an option to put the prematch in [-1], the entire match in [0], > and the postmatch in [n+1]. This seems to be odd to me though. Odd == bad. I think the pre- and post-matches should be different in essence, not just in index :) > I guess I'm saying, I agree that the entire match, prematch, and postmatch > would be helpful, but how would you propose to present these to the user? See above :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Thu, 1 Feb 2007, David Fetter wrote: > Yes, although it might have the same name, as in regex_match(pattern > TEXT, string TEXT, return_pre_and_post BOOL). > > The data structure could be something like > > TYPE matches ( > prematch TEXT, > match TEXT[], > postmatch TEXT > ) I just coded up for this: CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS text[] AS 'MODULE_PATHNAME', 'regexp_matches' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION regexp_matches( IN str text, IN pattern text, IN return_pre_and_post bool, OUT prematch text,OUT fullmatch text, OUT matches text[], OUT postmatch text) RETURNS record AS 'MODULE_PATHNAME', 'regexp_matches' LANGUAGE C IMMUTABLE STRICT; Which works like this: jeremyd=# \pset null '\\N' Null display is "\N". jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);regexp_matches ----------------{bar,beque} (1 row) jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, false);prematch | fullmatch | matches | postmatch ----------+-----------+-------------+-----------\N | \N | {bar,beque} | \N (1 row) jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$, true);prematch | fullmatch | matches | postmatch ----------+-----------+-------------+-----------foo | barbeque | {bar,beque} | baz (1 row) And then you also have this behavior in the matches array: jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);regexp_matches ----------------{bar,"",beque} (1 row) jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(.+)(beque)$re$);regexp_matches ----------------\N (1 row) jeremyd=# select * from regexp_matches('foobarbequebaz', $re$(bar)(.+)?(beque)$re$); regexp_matches ------------------{bar,NULL,beque} (1 row) Reasonable? -- A.A.A.A.A.:An organization for drunks who drive
On Fri, 2 Feb 2007, Jeremy Drake wrote: > jeremyd=# select * from regexp_matches('foobarbequebaz', > $re$(bar)(beque)$re$, false); > prematch | fullmatch | matches | postmatch > ----------+-----------+-------------+----------- > \N | \N | {bar,beque} | \N > (1 row) I just changed this to fill in fullmatch when the bool is false, so this one would look like:prematch | fullmatch | matches | postmatch ----------+-----------+-------------+-----------\N | barbeque | {bar,beque} | \N (1 row) I also removed my check for capture groups, since in this setup you could get useful output without any. I am still trying to decide whether or not to add back an error if you called the no-bool version which just returns the array, and you do not have any capture groups. ISTM this is likely an oversight on the query author's part, and it would be helpful to alert him to this. If you have no capture groups, the matches array is empty (not null). If the match happened at the start of the string, the prematch is an empty string, and if the match happened at the end of the string, the postmatch is an empty string. > Reasonable? -- It's odd, and a little unsettling, to reflect upon the fact that English is the only major language in which "I" is capitalized; in many other languages "You" is capitalized and the "i" is lower case. -- Sydney J. Harris
On Fri, Feb 02, 2007 at 12:54:30AM -0800, Jeremy Drake wrote: > On Fri, 2 Feb 2007, Jeremy Drake wrote: > > > jeremyd=# select * from regexp_matches('foobarbequebaz', > > $re$(bar)(beque)$re$, false); > > prematch | fullmatch | matches | postmatch > > ----------+-----------+-------------+----------- > > \N | \N | {bar,beque} | \N > > (1 row) > > I just changed this to fill in fullmatch when the bool is false, so this > one would look like: > prematch | fullmatch | matches | postmatch > ----------+-----------+-------------+----------- > \N | barbeque | {bar,beque} | \N > (1 row) > > I also removed my check for capture groups, since in this setup you could > get useful output without any. I am still trying to decide whether or not > to add back an error if you called the no-bool version which just returns > the array, and you do not have any capture groups. ISTM this is likely an > oversight on the query author's part, and it would be helpful to alert him > to this. > > If you have no capture groups, the matches array is empty (not null). If > the match happened at the start of the string, the prematch is an empty > string, and if the match happened at the end of the string, the postmatch > is an empty string. > > > Reasonable? This is great :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Fri, 2 Feb 2007, Jeremy Drake wrote: > I just coded up for this: > > CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS > text[] > AS 'MODULE_PATHNAME', 'regexp_matches' > LANGUAGE C IMMUTABLE STRICT; > > CREATE FUNCTION regexp_matches( > IN str text, IN pattern text, IN return_pre_and_post bool, > OUT prematch text, OUT fullmatch text, OUT matches text[], OUT > postmatch text) RETURNS record > AS 'MODULE_PATHNAME', 'regexp_matches' > LANGUAGE C IMMUTABLE STRICT; > I wanted to put out there the question of what order the parameters to these regex functions should go. ISTM most people expect them to go (pattern, string), but I made these functions consistant with substring(text,text) which takes (string, pattern). Now I have been working on a regexp_split function, which takes (pattern, string), which is what someone familiar with the function from perl would expect, but is not consistant with substring or now with my regexp_matches function. I want to ask, should I break with following substring's precedent, and put the pattern first (as most people probably would expect), or should I break with perl's precedent and put the pattern second (to behave like substring)? -- We cannot put the face of a person on a stamp unless said person is deceased. My suggestion, therefore, is that you drop dead. -- James E. Day, Postmaster General
Jeremy Drake <pgsql@jdrake.com> writes: > I want to ask, should I break with following substring's precedent, and > put the pattern first (as most people probably would expect), or should I > break with perl's precedent and put the pattern second (to behave like > substring)? All of SQL's pattern match operators have the pattern on the right, so my advice is to stick with that and try not to think about Perl ;-) regards, tom lane
On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > I want to ask, should I break with following substring's > > precedent, and put the pattern first (as most people probably > > would expect), or should I break with perl's precedent and put the > > pattern second (to behave like substring)? > > All of SQL's pattern match operators have the pattern on the right, > so my advice is to stick with that and try not to think about Perl > ;-) Perl provides inspiration, but here, consistency would help more than orderly imitation of how it does what it does. And besides, when people really need Perl, they can pull it in as a PL :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Sun, 4 Feb 2007, David Fetter wrote: > On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote: > > > Let me know if you see any bugs or issues with this code, and I am > > open to suggestions for further regression tests ;) I have not heard anything, so I guess at this point I should figure out where to go next with this. I see a couple options: * Set this up as a pgfoundry project or contrib. This would require merging the patch to expose some functions from regexp.coutside that file, which has raised some concerns about maintainability. * Put together a patch to add these functions to core. I could put them directly in regexp.c, so the support functions couldstay static. My concern here is that I don't know if there are any functions currently in core with OUT parameters. I don't know the acceptable style for handling this: OUT parameters, a named composite type, ...? Does anyone have any opinions either way, as to how I should proceed from here? > > * maybe a join function that works as an aggregate > > SELECT join(',', col) FROM tbl > > currently can be written as > > SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',') > > The array_accum() aggregate in the docs works OK for this purpose. I have decided not to pursue this function, I think the array construct, or the array_accum option, is about the best possible currently. If it should become possible in the future to write aggregates with a non-sql state type (structs with pointers) it may be worthwhile to re-evaluate this. -- The cost of living hasn't affected its popularity.
Jeremy Drake <pgsql@jdrake.com> writes: > * Put together a patch to add these functions to core. I could put them > directly in regexp.c, so the support functions could stay static. My > concern here is that I don't know if there are any functions currently > in core with OUT parameters. As of 8.2 there are. If we are going to include these I would vote for core not contrib status, exactly to avoid having to export those functions. regards, tom lane
On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > * Put together a patch to add these functions to core. I could put them > > directly in regexp.c, so the support functions could stay static. My > > concern here is that I don't know if there are any functions currently > > in core with OUT parameters. > > As of 8.2 there are. > > If we are going to include these I would vote for core not contrib > status, exactly to avoid having to export those functions. +1 for core. :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Wed, 7 Feb 2007, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > * Put together a patch to add these functions to core. I could put them > > directly in regexp.c, so the support functions could stay static. My > > concern here is that I don't know if there are any functions currently > > in core with OUT parameters. > > As of 8.2 there are. Could you give me the name of one in pg_proc.h so I can see how I should go about adding one there? > If we are going to include these I would vote for core not contrib > status, exactly to avoid having to export those functions. OK, this patch will be my next project. -- History is curious stuffYou'd think by now we had enough Yet the fact remains I fearThey make more of it every year.
Jeremy Drake <pgsql@jdrake.com> writes: > On Wed, 7 Feb 2007, Tom Lane wrote: >> As of 8.2 there are. > Could you give me the name of one in pg_proc.h so I can see how I should > go about adding one there? select * from pg_proc where proargmodes is not null; regards, tom lane