Thread: Re: [HACKERS] writing new regexp functions
On Fri, 2 Feb 2007, David Fetter wrote: > On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote: > > > > All of SQL's pattern match operators have the pattern on the right, > > so my advice is to stick with that and try not to think about Perl > > ;-) > > Perl provides inspiration, but here, consistency would help more than > orderly imitation of how it does what it does. And besides, when > people really need Perl, they can pull it in as a PL :) Alright, here is my code to date. I have put the pattern after the string in the split function, as discussed above. The .tar.gz file expects to be untarred in contrib/. I have made some regression tests that can be run using 'make installcheck' as normal for contrib. I think they exercise the corner cases in the code, but I may very well have missed some. It requires the (previously submitted) attached patch to core to compile, as it takes advantage of new exported functions from src/backend/utils/adt/regexp.c. Let me know if you see any bugs or issues with this code, and I am open to suggestions for further regression tests ;) Things that I still want to look into: * regexp flags (a la regexp_replace). * maybe make regexp_matches return setof whatever, if given a 'g' flag return all matches in string. * maybe a join function that works as an aggregate SELECT join(',', col) FROM tbl currently can be written as SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',') -- It was a virgin forest, a place where the Hand of Man had never set foot.
Attachment
On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote: > Let me know if you see any bugs or issues with this code, and I am > open to suggestions for further regression tests ;) > Things that I still want to look into: > * regexp flags (a la regexp_replace). One more text field at the end is how the regexp_replace() one does it. > * maybe make regexp_matches return setof whatever, if given a 'g' flag > return all matches in string. This is doable with current machinery, albeit a little clumsily. > * maybe a join function that works as an aggregate > SELECT join(',', col) FROM tbl > currently can be written as > SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',') The array_accum() aggregate in the docs works OK for this purpose. Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Sun, 4 Feb 2007, David Fetter wrote: > On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote: > > > Let me know if you see any bugs or issues with this code, and I am > > open to suggestions for further regression tests ;) > > > Things that I still want to look into: > > * regexp flags (a la regexp_replace). > > One more text field at the end is how the regexp_replace() one does > it. That's how I did it. > > * maybe make regexp_matches return setof whatever, if given a 'g' flag > > return all matches in string. > > This is doable with current machinery, albeit a little clumsily. I have implemented this too. > > * maybe a join function that works as an aggregate > > SELECT join(',', col) FROM tbl > > currently can be written as > > SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',') > > The array_accum() aggregate in the docs works OK for this purpose. I have not tackled this yet, I think it may be better to stick with the ARRAY() construct for now. So, here is the new version of the code, and also a new version of the patch to core, which fixes some compile warnings that I did not see at first because I was using ICC rather than GCC. Here is the README.regexp_ext from the tar file: This package contains regexp functions beyond those currently provided in core PostgreSQL, utilizing the regexp engine built into core. This is still a work-in-progress. The most recent version of this code can be found at http://www.jdrake.com/postgresql/regexp/regexp_ext.tar.gz and the prerequisite patch to PostgreSQL core, which has been submitted for review, can be found at http://www.jdrake.com/postgresql/regexp/regexp-export.patch The .tar.gz file expects to be untarred in contrib/. I have made some regression tests that can be run using 'make installcheck' as normal for contrib. I think they exercise the corner cases in the code, but I may very well have missed some. It requires the above mentioned patch to core to compile, as it takes advantage of new exported functions from src/backend/utils/adt/regexp.c. Let me know if you see any bugs or issues with this code, and I am open to suggestions for further regression tests ;) Functions implemented in this module: * regexp_split(str text, pattern text) RETURNS SETOF text regexp_split(str text, pattern text, flags text) RETURNS SETOF text returns each section of the string delimited by the pattern. * regexp_matches(str text, pattern text) RETURNS text[] returns all capture groups when matching pattern against string in an array * regexp_matches(str text, pattern text, flags text) RETURNS SETOF (prematch text, fullmatch text, matches text[], postmatch text) returns all capture groups when matching pattern against string in an array. also returns the entire match in fullmatch. if the 'g' option is given, returns all matches in the string. if the 'r' option is given, also return the text before and after the match in prematch and postmatch respectively. See the regression tests for more details about usage and return values. Recent changes: * I have put the pattern after the string in all of the functions, as discussed on the pgsql-hackers mailing list. * regexp flags (a la regexp_replace). * make regexp_matches return setof whatever, if given a 'g' flag return all matches in string. Things that I still want to look into: * maybe a join function that works as an aggregate SELECT join(',', col) FROM tbl currently can be written as SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',') -- Philogeny recapitulates erogeny; erogeny recapitulates philogeny.
Attachment
On Wed, 7 Feb 2007, David Fetter wrote: > On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote: > > Jeremy Drake <pgsql@jdrake.com> writes: > > > * Put together a patch to add these functions to core. I could put them > > > directly in regexp.c, so the support functions could stay static. My > > > concern here is that I don't know if there are any functions currently > > > in core with OUT parameters. > > > > As of 8.2 there are. > > > > If we are going to include these I would vote for core not contrib > > status, exactly to avoid having to export those functions. > > +1 for core. :) Here is a patch to add these functions to core. Please take a look and let me know what you think. I had to jump through a bit of a hoop to avoid opr_sanity test from breaking, may not have done that right. Also, this patch allows regexp_replace to take more flags, since my two new functions needed flags also, I split flag parsing into a seperate function and made regexp_replace use it too. It also results that the error message for an invalid flag in regexp_replace changes, since it is now more generic as it is called from multiple functions. I still need to write documentation for the new functions before I consider the patch completed, but please take a look at the code and see if it is acceptable as I do not have any further changes planned for it. -- Chicken Soup, n.: An ancient miracle drug containing equal parts of aureomycin, cocaine, interferon, and TLC. The only ailment chicken soup can't cure is neurotic dependence on one's mother. -- Arthur Naiman, "Every Goy's Guide to Yiddish"
Attachment
On Wed, 7 Feb 2007, Jeremy Drake wrote: > Here is a patch to add these functions to core. Please take a look and > let me know what you think. I had to jump through a bit of a hoop to > avoid opr_sanity test from breaking, may not have done that right. > > Also, this patch allows regexp_replace to take more flags, since my two > new functions needed flags also, I split flag parsing into a seperate > function and made regexp_replace use it too. It also results that the > error message for an invalid flag in regexp_replace changes, since it is > now more generic as it is called from multiple functions. > > I still need to write documentation for the new functions before I > consider the patch completed, but please take a look at the code and see > if it is acceptable as I do not have any further changes planned for it. I have added documentation for the functions in this patch. At this point I am ready to submit this patch for the review and application process. Please let me know if you find any issues with it. Thank you -- The trouble with being punctual is that nobody's there to appreciate it. -- Franklin P. Jones
Attachment
On Thu, 2007-02-08 at 19:22 -0800, Jeremy Drake wrote: > I have added documentation for the functions in this patch. At this point > I am ready to submit this patch for the review and application process. > Please let me know if you find any issues with it. Barring any objections, I'll review and apply this patch in the next few days. BTW, unless the patch is truly large, leaving the patch uncompressed makes it easier to eyeball it without leaving my mail client. That's just personal preference, though... -Neil
Neil Conway wrote: > BTW, unless the patch is truly large, leaving the patch uncompressed > makes it easier to eyeball it without leaving my mail client. That's > just personal preference, though... FWIW, I just do a "| zcat | less", which shows it uncompressed. No big deal. I don't know if Evolution can do that ... (it surprised me that Thunderbird doesn't seem to be capable of searching a message based on Message-Id). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Neil Conway wrote: > > > BTW, unless the patch is truly large, leaving the patch uncompressed > > makes it easier to eyeball it without leaving my mail client. That's > > just personal preference, though... > > FWIW, I just do a "| zcat | less", which shows it uncompressed. No big > deal. I don't know if Evolution can do that ... (it surprised me that > Thunderbird doesn't seem to be capable of searching a message based on > Message-Id). My .mailcap calls a script runs 'file' and tries to display it properly automatically, and it worked fine for his attachment. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, 8 Feb 2007, Neil Conway wrote: > On Thu, 2007-02-08 at 19:22 -0800, Jeremy Drake wrote: > > I have added documentation for the functions in this patch. At this point > > I am ready to submit this patch for the review and application process. > > Please let me know if you find any issues with it. > > Barring any objections, I'll review and apply this patch in the next few > days. > > BTW, unless the patch is truly large, leaving the patch uncompressed > makes it easier to eyeball it without leaving my mail client. That's > just personal preference, though... Yeah, I try to do that, but this one just barely passed my personal compression threshold. Guess I should raise my threshold :) Here is a new version of the patch which fixes up the documentation a little (should have read it over again before posting). I also added mention of the additional flags that I added support for, and linked to the embedded options table for details, since the flags supported correspond to those. This is my first foray into the world of Jade, so please forgive any SGML snafus. Also, it is uncompressed this time ;) -- Q: How many existentialists does it take to screw in a lightbulb? A: Two. One to screw it in and one to observe how the lightbulb itself symbolizes a single incandescent beacon of subjective reality in a netherworld of endless absurdity reaching out toward a maudlin cosmos of nothingness.
Attachment
On Fri, 2007-02-09 at 01:08 -0800, Jeremy Drake wrote: > Yeah, I try to do that, but this one just barely passed my personal > compression threshold. Guess I should raise my threshold :) No, don't pay any attention to me, I'm just lazy :) > Here is a new version of the patch which fixes up the documentation a > little (should have read it over again before posting). The "doing_srf" business is rather tortured, and seems an invitation for bugs. ISTM there should be a cleaner way to implement this. For example, would it be possible to put all the common logic into one or more additional functions, and then have SRF vs. non-SRF cases that call into those functions after doing the appropriate initialization? -Neil
On Fri, 9 Feb 2007, Neil Conway wrote: > The "doing_srf" business is rather tortured, and seems an invitation for > bugs. ISTM there should be a cleaner way to implement this. For example, > would it be possible to put all the common logic into one or more > additional functions, and then have SRF vs. non-SRF cases that call into > those functions after doing the appropriate initialization? Here is a new version of the patch which eliminates the doing_srf stuff. It does seem a lot cleaner this way... -- Fortune's Real-Life Courtroom Quote #18: Q: Are you married? A: No, I'm divorced. Q: And what did your husband do before you divorced him? A: A lot of things I didn't know about.
Attachment
On Fri, 2007-02-09 at 16:33 -0800, Jeremy Drake wrote: > Here is a new version of the patch which eliminates the doing_srf stuff. * C89 require constant-sized stack allocated arrays, so the coding in perform_regexp_matches() is non-portable. * I'm not clear about the control flow in regexp_matches() and regexp_split(). Presumably it's not possible for the call_cntr to actually exceed max_calls, so the error message in these cases should be elog(ERROR), not ereport (the former is for "shouldn't happen" bug scenarios, the latter is for user-facing errors). Can you describe the logic here (e.g. via comments) a bit more? * The logic in regexp_split (incremented_offset, first_match, etc.) is pretty ugly and hard to follow. The "if" condition on line 1037 is particularly objectionable. Again, ISTM there should be a cleaner way to do this. * Try to keep lines to 80 characters or fewer. If the code is wandering off the right side of the screen all the time, that might be a hint that it needs simplification. Attached is a cleaned up version of your patch -- various improvements throughout, but mostly cosmetic stuff. Do you want to take a look at the above? -Neil
Attachment
On Sat, 10 Feb 2007, Neil Conway wrote: > On Fri, 2007-02-09 at 16:33 -0800, Jeremy Drake wrote: > > Here is a new version of the patch which eliminates the doing_srf stuff. > > * C89 require constant-sized stack allocated arrays, so the coding in > perform_regexp_matches() is non-portable. I thought that was the case, but I had seen some other code doing this. Turns out it was in the fulldisjunctions pgfoundry project. I replaced them with palloc'd memory. > * I'm not clear about the control flow in regexp_matches() and > regexp_split(). Presumably it's not possible for the call_cntr to > actually exceed max_calls, so the error message in these cases should be > elog(ERROR), not ereport (the former is for "shouldn't happen" bug > scenarios, the latter is for user-facing errors). Can you describe the > logic here (e.g. via comments) a bit more? I added some comments, and changed to using elog instead of ereport. > * The logic in regexp_split (incremented_offset, first_match, etc.) is > pretty ugly and hard to follow. The "if" condition on line 1037 is > particularly objectionable. Again, ISTM there should be a cleaner way to > do this. That if condition was very difficult to get right ;) I added a bunch of comments, and switched the logic around with a continue so it is much more obvious what is happening there. Incidentally, that if condition being incorrect is what results in call_cntr exceeding max_calls :) > > * Try to keep lines to 80 characters or fewer. If the code is wandering > off the right side of the screen all the time, that might be a hint that > it needs simplification. > > Attached is a cleaned up version of your patch -- various improvements > throughout, but mostly cosmetic stuff. Do you want to take a look at the > above? > > -Neil > > -- If God had meant for us to be naked, we would have been born that way.
Attachment
On Sat, 10 Feb 2007, Jeremy Drake wrote: > On Sat, 10 Feb 2007, Neil Conway wrote: > > > * I'm not clear about the control flow in regexp_matches() and > > regexp_split(). Presumably it's not possible for the call_cntr to > > actually exceed max_calls, so the error message in these cases should be > > elog(ERROR), not ereport (the former is for "shouldn't happen" bug > > scenarios, the latter is for user-facing errors). Can you describe the > > logic here (e.g. via comments) a bit more? > > I added some comments, and changed to using elog instead of ereport. I fixed a couple more things in this patch. I changed the max calls limit to the real limit, rather than the arbitrarily high limit that was previously set (three times the length of the string in bytes). Also, I changed the checks for offset to compare against wide_len rather than orig_len, since in multibyte character sets orig_len is the length in bytes of the string in whatever encoding it is in, while wide_len is the length in characters, which is what everything else in these functions deal with. The calls to text_substr have me somewhat concerned now, also. I think performance starts to look like O(n^2) in multibyte character sets. But I think doing anything about it would require this code to know more about the internals of text than it has any right to. I guess settle for the correctness now, and if performance is a problem this can be addressed. Would hate to make this code even more ugly due to premature optimization... -- When does summertime come to Minnesota, you ask? Well, last year, I think it was a Tuesday.
Attachment
I noticed in the documentation that you added indexterms for the functions, but that they (along with the existing ones) are in the SIMILAR TO regular expression section. These functions do not take SIMILAR TO regular expressions, but POSIX regular expressions, so I moved them to that section in this patch, and also moved the regexp_replace indexterm there also, since it also uses POSIX regular expressions. -- 10.0 times 0.1 is hardly ever 1.0.
Attachment
What was the status of this? Was there anything else I needed to do with this patch, or is it ready to be applied? Let me know if there is anything else I need to do on this... -- What this country needs is a good five cent nickel.
Jeremy Drake wrote: > What was the status of this? Was there anything else I needed to do with > this patch, or is it ready to be applied? Let me know if there is > anything else I need to do on this... I assume it is ready for application. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: > What was the status of this? Was there anything else I needed to do with > this patch, or is it ready to be applied? Let me know if there is > anything else I need to do on this... Will do -- I'm planning to apply this as soon as I have the free cycles to do so, likely tomorrow or Friday. -Neil
Neil Conway wrote: > On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: > > What was the status of this? Was there anything else I needed to > > do with this patch, or is it ready to be applied? Let me know if > > there is anything else I need to do on this... > > Will do -- I'm planning to apply this as soon as I have the free > cycles to do so, likely tomorrow or Friday. I don't know which patch is actually being proposed now. It would be good to make this more explicit and maybe include a synopsis of the functions in the email, so we know what's going on. What confuses me about some of the functions I've seen in earlier patches in this thread is that they return setof something. But in my mind, regular expression matches or string splits are inherently ordered, so an array would be the correct return type. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Thu, 15 Feb 2007, Peter Eisentraut wrote: > Neil Conway wrote: > > On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: > > > What was the status of this? Was there anything else I needed to > > > do with this patch, or is it ready to be applied? Let me know if > > > there is anything else I need to do on this... > > > > Will do -- I'm planning to apply this as soon as I have the free > > cycles to do so, likely tomorrow or Friday. > > I don't know which patch is actually being proposed now. It would be > good to make this more explicit and maybe include a synopsis of the > functions in the email, so we know what's going on. Sorry, my intent was just to check to see if I had gotten the patch sufficiently fixed for Neil to apply and he just hadn't gotten to it yet (which seems to be the case), or if there was something else he still expected me to fix that I had missed in the prior discussions. I suppose I should have emailed him privately. The patch in question can be seen in the archives here: http://archives.postgresql.org/pgsql-patches/2007-02/msg00214.php The functions added are: * regexp_split(str text, pattern text) RETURNS SETOF text regexp_split(str text, pattern text, flags text) RETURNS SETOF text returns each section of the string delimited by the pattern. * regexp_matches(str text, pattern text) RETURNS text[] returns all capture groups when matching pattern against string in an array * regexp_matches(str text, pattern text, flags text) RETURNS SETOF (prematch text, fullmatch text, matches text[], postmatch text) returns all capture groups when matching pattern against string in an array. also returns the entire match in fullmatch. if the 'g' option is given, returns all matches in the string. if the 'r' option is given, also return the text before and after the match in prematch and postmatch respectively. > What confuses me about some of the functions I've seen in earlier > patches in this thread is that they return setof something. But in my > mind, regular expression matches or string splits are inherently > ordered, so an array would be the correct return type. They do return SETOF. Addressing them separately: regexp_matches uses a text[] for the match groups. If you specify the global flag, it could return multiple matches. Couple this with the requested feature of pre- and postmatch returns (with its own flag) and the return would turn into some sort of nasty array of tuples, or multiple arrays. It seems much cleaner to me to return a set of the matches found, and I find which order the matches occur in to be much less important than the fact that they did occur and their contents. regexp_split returns setof text. This has, in my opinion, a much greater case to return an array. However, there are several issues with this approach: # My experience with the array code leads me to believe that building up an array is an expensive proposition. I know I could code it smarter so that the array is only constructed in the end. # With a set-returning function, it is possible to add a LIMIT clause, to prevent splitting up more of the string than is necessary. It is also immediately possible to insert them into a table, or do grouping on them, or call a function on each value. Most of the time when I do a split, I intend to do something like this with each split value. # You can still get an array if you really want it: #* SELECT ARRAY(SELECT * FROM regexp_split('first, second, third', E',\\s*')) -- No problem is so formidable that you can't just walk away from it. -- C. Schulz
Jeremy Drake wrote: > regexp_matches uses a text[] for the match groups. If you specify > the global flag, it could return multiple matches. Couple this with > the requested feature of pre- and postmatch returns (with its own > flag) and the return would turn into some sort of nasty array of > tuples, or multiple arrays. It seems much cleaner to me to return a > set of the matches found, and I find which order the matches occur in > to be much less important than the fact that they did occur and their > contents. Then the fact that the flag-less matches function returns an array would be a mistake. They have to return the same category of object. > regexp_split returns setof text. This has, in my opinion, a much > greater case to return an array. However, there are several issues > with this approach: Any programming language I have ever seen returns the result of a regular expression split as a structure with order. That in turn implies that there are use cases for having the order, which your proposed function could not address. > # My experience with the array code leads me to believe that building > up an array is an expensive proposition. I know I could code it > smarter so that the array is only constructed in the end. You can make any code arbitrarily fast if it doesn't have to give the right answer. > # With a set-returning function, it is possible to add a LIMIT > clause, to prevent splitting up more of the string than is necessary. You can also add such functionality to a function in form of a parameter. In fact, relying on a LIMIT clause to do this seems pretty fragile. We argue elsewhere that LIMIT without ORDER BY is not well-defined, and while it's hard to imagine in the current implementation why the result of a set returning function would come back in arbitrary order, it is certainly possible in theory, so you still need to order the result set if you want reliable limits, but that is not possible of the order is lost in the result. > It is also immediately possible to insert them into a table, or do > grouping on them, or call a function on each value. Most of the time > when I do a split, I intend to do something like this with each split > value. These sort of arguments remind me of the contrib/xml2 module, which also has a very, uh, pragmatic API. Sure, these operations may seem useful to you. But when we offer a function as part of the core API, it is also important that we offer a clean design that allows other users to combine reasonably orthogonal functionality into tools that are useful to them. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Jeremy Drake wrote: > The functions added are: > * regexp_split(str text, pattern text) RETURNS SETOF text > regexp_split(str text, pattern text, flags text) RETURNS SETOF text > returns each section of the string delimited by the pattern. > * regexp_matches(str text, pattern text) RETURNS text[] > returns all capture groups when matching pattern against string in an > array > * regexp_matches(str text, pattern text, flags text) RETURNS SETOF > (prematch text, fullmatch text, matches text[], postmatch text) > returns all capture groups when matching pattern against string in an > array. also returns the entire match in fullmatch. if the 'g' option > is given, returns all matches in the string. if the 'r' option is > given, also return the text before and after the match in prematch and > postmatch respectively. I think the position the match is in could be important. I'm wondering if you could define them like create type re_match(match text, position int) regexp_split(str text, pattern text) returns setof re_match or maybe regexp_split(str text, pattern text, OUT match text, OUT position int); (not sure of the exact syntax for this one) so that you would have the position for each match, automatically. Is this information available in the regex code? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Peter Eisentraut <peter_e@gmx.net> writes: > Jeremy Drake wrote: >> # My experience with the array code leads me to believe that building >> up an array is an expensive proposition. I know I could code it >> smarter so that the array is only constructed in the end. > You can make any code arbitrarily fast if it doesn't have to give the > right answer. Even more to the point, it's folly to suppose that the overhead of processing a SETOF result is less than that of array construction. I tend to agree with Peter's concern that returning a set is going to make it hard to track which result is which. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > so that you would have the position for each match, automatically. Is > this information available in the regex code? Certainly, that's where we got the text snippets from to begin with. However, I'm not sure that this is important enough to justify a special type --- for one thing, since we don't have arrays of composites, that would foreclose responding to Peter's concern that SETOF is the wrong thing. If you look at the Perl and Tcl APIs for regexes, they return just the strings, not the numerical positions; and I've not heard anyone complaining about that. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > so that you would have the position for each match, automatically. Is > > this information available in the regex code? > > Certainly, that's where we got the text snippets from to begin with. > However, I'm not sure that this is important enough to justify a special > type --- for one thing, since we don't have arrays of composites, that > would foreclose responding to Peter's concern that SETOF is the wrong > thing. My point is that if you want to have the order in which the matches were found, you can do that easily by looking at the positions; no need to create an ordered array. Which does respond to Peter's concern, since the point was to keep the ordering of matches, which an array does; but if we provide the positions, the SETOF way does as well. On the other hand, I don't think it's impossible to have matches that start earlier than others in the string, but are actually found later (say, because they are a parentized expression that ends later). So giving the starting positions allows one to know where are they located, rather than where were they reported. (I don't really know if the matches are sorted before reporting though.) > If you look at the Perl and Tcl APIs for regexes, they return > just the strings, not the numerical positions; and I've not heard anyone > complaining about that. I know, but that may be just because it would be too much extra complexity for them (in terms of user API) to be returning the positions along the text. I know I'd be fairly annoyed if =~ in Perl returned an array of hashes { text => 'foo', position => 42} instead of array of text. We don't have that problem. In fact, I would claim that's much easier to deal with a SETOF function than is to deal with text[]. Regarding the "nobody complains" argument, I don't find that particularly compelling; witness how people gets used to working around limitations in MySQL ... ;-) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > On the other hand, I don't think it's impossible to have matches that > start earlier than others in the string, but are actually found later > (say, because they are a parentized expression that ends later). So > giving the starting positions allows one to know where are they > located, rather than where were they reported. (I don't really know > if the matches are sorted before reporting though.) I have no strong opinion about how matches are returned. Seeing the definitional difficulties that you point out, it may be fine to return them unordered. But then all "matches" functions should do that. For the "split" functions, however, providing the order is clearly important. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Thu, Feb 15, 2007 at 10:37:26AM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > so that you would have the position for each match, automatically. Is > > this information available in the regex code? > > Certainly, that's where we got the text snippets from to begin with. > However, I'm not sure that this is important enough to justify a > special type --- for one thing, since we don't have arrays of > composites, This is a TODO :) I've obviously misunderstood the scope of the TODO because it appears that an INSERT into pg_type at creation time for compound types that looks something like the below would do it. What have I missed? INSERT INTO pg_type VALUES ( '_foo', /* Generated by makeArrayTypeName */ 16744, /* OID of schema */ 10, /* OID of owner of the base type */ -1, /* typlen indicates varlena */ 'f', /* not passed by value */ 'c', /* typtype is composite */ 't', /* type is already defined */ ',', /* typdelim */ 0, /* should this actually refer to the type? */ 'foo'::regtype, /* typelem */ 'array_in', /* typinput */ 'array_out', /* typoutput */ 'array_recv', /* typreceive */ 'array_send', /* typsend */ 0, /* typanalyze */ 'i', /* typalign. Should this be 'd'? */ 'x', /* typstorage */ 'f', /* not a DOMAIN, but while we're at it, why not arrays of DOMAIN? */ 0, /* base type. should this be different? */ -1, /* no typmod */ 0 /* dims not specified */ ); > that would foreclose responding to Peter's concern that SETOF is the > wrong thing. If you look at the Perl and Tcl APIs for regexes, they > return just the strings, not the numerical positions; and I've not > heard anyone complaining about that. They do return them in the order in which they appear, though, which, as far as I can tell, Jeremy's functions also do. Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Thu, Feb 15, 2007 at 10:57:45AM +0100, Peter Eisentraut wrote: > Jeremy Drake wrote: > > # With a set-returning function, it is possible to add a LIMIT > > clause, to prevent splitting up more of the string than is > > necessary. > > You can also add such functionality to a function in form of a > parameter. That's what things like Perl's split do :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Thu, 15 Feb 2007, Alvaro Herrera wrote: > Jeremy Drake wrote: > > > The functions added are: > > * regexp_split(str text, pattern text) RETURNS SETOF text > > regexp_split(str text, pattern text, flags text) RETURNS SETOF text > > returns each section of the string delimited by the pattern. > > * regexp_matches(str text, pattern text) RETURNS text[] > > returns all capture groups when matching pattern against string in an > > array > > * regexp_matches(str text, pattern text, flags text) RETURNS SETOF > > (prematch text, fullmatch text, matches text[], postmatch text) > > returns all capture groups when matching pattern against string in an > > array. also returns the entire match in fullmatch. if the 'g' option > > is given, returns all matches in the string. if the 'r' option is > > given, also return the text before and after the match in prematch and > > postmatch respectively. > > I think the position the match is in could be important. I'm wondering > if you could define them like > > create type re_match(match text, position int) > regexp_split(str text, pattern text) returns setof re_match So it looks like the issues are: 1. regexp_matches without flags has a different return type than regexp_matches with flags. I can make them return the same OUT parameters, but should I declare it as returning SETOF when I know for a fact that the no-flags version will never return more than one row? 2. regexp_split does not represent the order of the results. I can do something like: regexp_split(str text, pattern text[, flags text], OUT result text, OUT startpos int) RETURNS SETOF record; It could also have the int being a simple counter to represent the relative order, rather than the position. Thoughts? Do these changes address the issues recently expressed? -- I have yet to see any problem, however complicated, which, when looked at in the right way, did not become still more complicated. -- Poul Anderson
David Fetter <david@fetter.org> writes: > I've obviously misunderstood the scope of the TODO because it appears > that an INSERT into pg_type at creation time for compound types that > looks something like the below would do it. What have I missed? There are a couple of issues. One is that we probably don't want two pg_type entries for every single table. Will you be satisfied if only CREATE TYPE AS ... makes an array type? The other is that, at least at the time they were written, the array support routines couldn't handle composite array values. Things might or might not be easier today; I don't think we had record_in and record_out in their current form then. regards, tom lane
On Thu, Feb 15, 2007 at 07:35:46PM -0500, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > I've obviously misunderstood the scope of the TODO because it appears > > that an INSERT into pg_type at creation time for compound types that > > looks something like the below would do it. What have I missed? > > There are a couple of issues. One is that we probably don't want > two pg_type entries for every single table. Now that you mention it, I would want that if that's what it takes to get arrays for them. The long-term goal here is to make all of PostgreSQL's types play nicely together. I'm guessing that SETOF will eventually be a way to describe a collection because MULTISET is in SQL:2003. > Will you be satisfied if only CREATE TYPE AS ... makes an array > type? The other is that, at least at the time they were written, > the array support routines couldn't handle composite array values. > Things might or might not be easier today; I don't think we had > record_in and record_out in their current form then. OK. What about pg_depend? Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
Tom Lane wrote: > David Fetter <david@fetter.org> writes: > >> I've obviously misunderstood the scope of the TODO because it appears >> that an INSERT into pg_type at creation time for compound types that >> looks something like the below would do it. What have I missed? >> > > There are a couple of issues. One is that we probably don't want two > pg_type entries for every single table. Will you be satisfied if only > CREATE TYPE AS ... makes an array type? > There should be some better way to create the array type for tables than directly mangling pg_type, though. Maybe a builtin function that took an oid? cheers andrew
Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake: > On Thu, 15 Feb 2007, Peter Eisentraut wrote: > > I have no strong opinion about how matches are returned. Seeing the > > definitional difficulties that you point out, it may be fine to return > > them unordered. But then all "matches" functions should do that. > > > > For the "split" functions, however, providing the order is clearly > > important. > > Does this version sufficiently address your concerns? I don't think anyone asked for the start position and length in the result of regexp_split(). The result should be an array of text. That is what Perl et al. do. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Thu, 15 Feb 2007, Peter Eisentraut wrote: > I have no strong opinion about how matches are returned. Seeing the > definitional difficulties that you point out, it may be fine to return > them unordered. But then all "matches" functions should do that. > > For the "split" functions, however, providing the order is clearly > important. Does this version sufficiently address your concerns? -- Finagle's Creed: Science is true. Don't be misled by facts.
Attachment
On Fri, Feb 16, 2007 at 01:19:55PM +0100, Peter Eisentraut wrote: > Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake: > > On Thu, 15 Feb 2007, Peter Eisentraut wrote: > > > I have no strong opinion about how matches are returned. Seeing > > > the definitional difficulties that you point out, it may be fine > > > to return them unordered. But then all "matches" functions > > > should do that. > > > > > > For the "split" functions, however, providing the order is > > > clearly important. > > > > Does this version sufficiently address your concerns? > > I don't think anyone asked for the start position and length in the > result of regexp_split(). The result should be an array of text. > That is what Perl et al. do. > > As for the regexp_matches() function, it seems to me that it returns > too much information at once. What is the use case for getting all > of prematch, fullmatch, matches, and postmatch in one call? If not in one call, how would you get it? Perl, for example, makes these available to any regex match in the form of variables it sets. Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
Am Freitag, 16. Februar 2007 17:11 schrieb David Fetter: > > As for the regexp_matches() function, it seems to me that it returns > > too much information at once. What is the use case for getting all > > of prematch, fullmatch, matches, and postmatch in one call? > > If not in one call, how would you get it? Perl, for example, makes > these available to any regex match in the form of variables it sets. The question is, what is the use case? If there is one in Perl, can this proposed function API support it? -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Fri, Feb 16, 2007 at 05:54:47PM +0100, Peter Eisentraut wrote: > Am Freitag, 16. Februar 2007 17:11 schrieb David Fetter: > > > As for the regexp_matches() function, it seems to me that it > > > returns too much information at once. What is the use case for > > > getting all of prematch, fullmatch, matches, and postmatch in > > > one call? > > > > If not in one call, how would you get it? Perl, for example, > > makes these available to any regex match in the form of variables > > it sets. > > The question is, what is the use case? If there is one in Perl, can > this proposed function API support it? Perl makes the following variables available in any regex match, although it optimizes some cases for when they're not there: $1, ... $n (captured matches in parentheses) $` (pre-match) $' (post-match) $& (whole match) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
David Fetter wrote: >> >> The question is, what is the use case? If there is one in Perl, can >> this proposed function API support it? >> > > Perl makes the following variables available in any regex match, > although it optimizes some cases for when they're not there: > > $1, ... $n (captured matches in parentheses) > $` (pre-match) > $' (post-match) > $& (whole match) > > Use of any of these is notoriously costly, BTW, especially pre-match and post-match. See perlre man page. cheers andrew
David Fetter wrote: > > The question is, what is the use case? If there is one in Perl, > > can this proposed function API support it? > > Perl makes the following variables available in any regex match, That is not a use case. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Fri, Feb 16, 2007 at 01:03:32PM -0500, Andrew Dunstan wrote: > David Fetter wrote: > >> > >>The question is, what is the use case? If there is one in Perl, can > >>this proposed function API support it? > >> > > > >Perl makes the following variables available in any regex match, > >although it optimizes some cases for when they're not there: > > > >$1, ... $n (captured matches in parentheses) > >$` (pre-match) > >$' (post-match) > >$& (whole match) > > > > Use of any of these is notoriously costly, BTW, especially pre-match > and post-match. See perlre man page. This is why they only appear when called for :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Fri, 16 Feb 2007, Peter Eisentraut wrote: > Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake: > > Does this version sufficiently address your concerns? > > I don't think anyone asked for the start position and length in the result of > regexp_split(). The result should be an array of text. That is what Perl et > al. do. The length is not returned, I simply call length() on the string result to make sure that no whitespace gets in. The start position was suggested in these two messages from Alvaro Herrera: http://archives.postgresql.org/pgsql-patches/2007-02/msg00276.php http://archives.postgresql.org/pgsql-patches/2007-02/msg00281.php This was meant to address your concern about the order not necessarily being preserved of the split results when being returned as SETOF. This gives the benefits of returning SETOF while still allowing the order to be preserved if desired (simply add ORDER BY startpos to guarantee the correct order). In case you haven't noticed, I am rather averse to making this return text[] because it is much easier in my experience to use the results when returned in SETOF rather than text[], and in all of the code that I have experience with where this would be useful I would end up using information_schema._pg_expandarray (a function that, AFAIK, is documented nowhere) to convert it into SETOF text. While, if you really really wanted a text[], you could use the (fully documented) ARRAY(select resultstr from regexp_split(...) order by startpos) construct. > As for the regexp_matches() function, it seems to me that it returns too much > information at once. What is the use case for getting all of prematch, > fullmatch, matches, and postmatch in one call? It was requested by David Fetter: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php It was not horribly difficult to provide, and it seemed reasonable to me. I have no need for them personally. -- Some people in this department wouldn't recognize subtlety if it hit them on the head.
Jeremy Drake wrote: > In case you haven't noticed, I am rather averse to making this return > text[] because it is much easier in my experience to use the results > when returned in SETOF rather than text[], The primary use case I know for string splitting is parsing comma/pipe/whatever separated fields into a row structure, and the way I see it your API proposal makes that exceptionally difficult. I don't know what your use case is, though. All of this is missing actual use cases. > While, if you > really really wanted a text[], you could use the (fully documented) > ARRAY(select resultstr from regexp_split(...) order by startpos) > construct. I think, however, that we should be providing simple primitives that can be combined into complex expressions rather than complex primitives that have to be dissected apart to get simple results. > > As for the regexp_matches() function, it seems to me that it > > returns too much information at once. What is the use case for > > getting all of prematch, fullmatch, matches, and postmatch in one > > call? > > It was requested by David Fetter: > http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php > > It was not horribly difficult to provide, and it seemed reasonable to > me. I have no need for them personally. David Fetter has also repeated failed to offer a use case for this, so I hesitate to accept this. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Sat, Feb 17, 2007 at 09:02:24AM +0100, Peter Eisentraut wrote: > Jeremy Drake wrote: > > In case you haven't noticed, I am rather averse to making this return > > text[] because it is much easier in my experience to use the results > > when returned in SETOF rather than text[], > > The primary use case I know for string splitting is parsing > comma/pipe/whatever separated fields into a row structure, and the way > I see it your API proposal makes that exceptionally difficult. > > I don't know what your use case is, though. All of this is missing > actual use cases. > > > While, if you > > really really wanted a text[], you could use the (fully documented) > > ARRAY(select resultstr from regexp_split(...) order by startpos) > > construct. > > I think, however, that we should be providing simple primitives that can > be combined into complex expressions rather than complex primitives > that have to be dissected apart to get simple results. > > > > As for the regexp_matches() function, it seems to me that it > > > returns too much information at once. What is the use case for > > > getting all of prematch, fullmatch, matches, and postmatch in one > > > call? > > > > It was requested by David Fetter: > > http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php > > > > It was not horribly difficult to provide, and it seemed reasonable > > to me. I have no need for them personally. > > David Fetter has also repeated failed to offer a use case for this, > so I hesitate to accept this. What is it about having the whole match, pre-match and post-match available that you're objecting to? Are you saying there aren't common uses for any or all of these? Regular expression users use them all over the place, and adding this capability to SQL seems like a reasonable next step :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Sat, 17 Feb 2007, Peter Eisentraut wrote: > Jeremy Drake wrote: > > In case you haven't noticed, I am rather averse to making this return > > text[] because it is much easier in my experience to use the results > > when returned in SETOF rather than text[], > > The primary use case I know for string splitting is parsing > comma/pipe/whatever separated fields into a row structure, and the way > I see it your API proposal makes that exceptionally difficult. For this case see string_to_array: http://developer.postgresql.org/pgdocs/postgres/functions-array.html select string_to_array('a|b|c', '|'); string_to_array ----------------- {a,b,c} (1 row) > I don't know what your use case is, though. All of this is missing > actual use cases. The particular use case I had for this function was at a previous employer, and I am not sure exactly how much detail is appropriate to divulge. Basically, the project was doing some text processing inside of postgres, and getting all of the words from a string into a table with some processing (excluding stopwords and so forth) as efficiently as possible was a big concern. The regexp_split function code was based on some code that a friend of mine wrote which used PCRE rather than postgres' internal regexp support. I don't know exactly what his use-case was, but he probably had one because he wrote the function and had it returning SETOF text ;) Perhaps he can share a general idea of what it was (nudge nudge)? > > While, if you > > really really wanted a text[], you could use the (fully documented) > > ARRAY(select resultstr from regexp_split(...) order by startpos) > > construct. > > I think, however, that we should be providing simple primitives that can > be combined into complex expressions rather than complex primitives > that have to be dissected apart to get simple results. The most simple primitive is string_to_array(text, text) returns text[], but it was not sufficient for our needs. > > > As for the regexp_matches() function, it seems to me that it > > > returns too much information at once. What is the use case for > > > getting all of prematch, fullmatch, matches, and postmatch in one > > > call? > > > > It was requested by David Fetter: > > http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php > > > > It was not horribly difficult to provide, and it seemed reasonable to > > me. I have no need for them personally. > > David Fetter has also repeated failed to offer a use case for this, so I > hesitate to accept this. I have no strong opinion either way, so I will let those who do argue it out and wait for the dust to settle ;) -- The Law, in its majestic equality, forbids the rich, as well as the poor, to sleep under the bridges, to beg in the streets, and to steal bread. -- Anatole France
David Fetter wrote: > What is it about having the whole match, pre-match and post-match > available that you're objecting to? Are you saying there aren't > common uses for any or all of these? Regular expression users use > them all over the place, You keep saying that, and I keep saying please show a use case. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes: > David Fetter wrote: >> What is it about having the whole match, pre-match and post-match >> available that you're objecting to? Are you saying there aren't >> common uses for any or all of these? Regular expression users use >> them all over the place, > You keep saying that, and I keep saying please show a use case. Maybe I'm missing something, but ISTM that given the ability to return multiple match substrings there is no need for such features. Given an arbitrary regex 'xyz', write '^(.*)(xyz)(.*)$' and you'll get back the pre-match, whole match, and post-match in addition to any parenthesized substrings that 'xyz' contains. If you only need some of them, you can leave out the corresponding parts of this pattern. So I'd vote against complicating the API in order to make special provision for these results. I claim that all we need is a function taking (string text, pattern text, flags text) and returning either array of text or setof text containing the matched substrings in whatever order is standard (order by left-parenthesis position, I think). In the degenerate case where there are no parenthesized subpatterns, just return the whole match as a single element. As for the argument about array vs setof, I could see doing both to end the argument of which one is really superior for any particular problem. regards, tom lane
Jeremy Drake wrote: > For this case see string_to_array: That only works with fixed delimiters, and I was hoping that regexp_split would be an alternative with regexp delimiters. It's certainly another argument that all the split functions in PostgreSQL should offer analogous interfaces. There is no reason being offered why splitting on a fixed string should result in an array and splitting on a regexp should result in a table. > The particular use case I had for this function was at a previous > employer, and I am not sure exactly how much detail is appropriate to > divulge. Basically, the project was doing some text processing > inside of postgres, and getting all of the words from a string into a > table with some processing (excluding stopwords and so forth) as > efficiently as possible was a big concern. We already have tsearch for that. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Finally the voice of reason :) On Sat, 17 Feb 2007, Tom Lane wrote: > So I'd vote against complicating the API in order to make special > provision for these results. I claim that all we need is a function > taking (string text, pattern text, flags text) and returning either > array of text or setof text For this function, it would be setof array of text, as the capture groups would definitely go in an array, but if you asked for global in the flags, there could be more than one match in the string. > containing the matched substrings in > whatever order is standard (order by left-parenthesis position, > I think). In the degenerate case where there are no parenthesized > subpatterns, just return the whole match as a single element. Good idea, that did not occur to me. I was planning to throw an error in that case. > As for the argument about array vs setof, I could see doing both to > end the argument of which one is really superior for any particular > problem. The array vs setof argument was on the split function. I will work on doing both. Any idea how the name and/or arguments should differ? I think that an optional limit argument for the array version like perl has would be reasonable, but a name difference in the functions is probably necessary to avoid confusion. -- Speaking of Godzilla and other things that convey horror: With a purposeful grimace and a Mongo-like flair He throws the spinning disk drives in the air! And he picks up a Vax and he throws it back down As he wades through the lab making terrible sounds! Helpless users with projects due Scream "My God!" as he stomps on the tape drives, too! Oh, no! He says Unix runs too slow! Go, go, DECzilla! Oh, yes! He's gonna bring up VMS! Go, go, DECzilla!" * VMS is a trademark of Digital Equipment Corporation * DECzilla is a trademark of Hollow Chocolate Bunnies of Death, Inc. -- Curtis Jackson
Jeremy Drake wrote: > The regexp_split function code was based on some code that a friend of > mine wrote which used PCRE rather than postgres' internal regexp support. > I don't know exactly what his use-case was, but he probably had > one because he wrote the function and had it returning SETOF text ;) > Perhaps he can share a general idea of what it was (nudge nudge)? db=# CREATE OR REPLACE FUNCTION split(p TEXT, t TEXT) RETURNS SETOF TEXT AS $$ db$# my ($p, $t) = @_; db$# return [ split(/$p/,$t) ]; db$# $$ LANGUAGE plperl; CREATE FUNCTION Time: 1.254 ms db=# select distinct word from (select * from split('\\W+','mary had a little lamb, whose fleece was black as soot') as word) as ss; word -------- a as black fleece had lamb little mary soot was whose (11 rows) Time: 30.517 ms As you can see, this can easily be done with a plperl function. Some people may not want to install plperl, or may not want to allow arbitrary patterns to be handed to perl in this fashion. That was not my concern. I was simply trying to see if I could make it faster in a C-language coded function. In the end I dropped the project because the plperl function works fast enough for me and I don't have any objection to plperl from a security standpoint, etc. mark
Jeremy Drake <pgsql@jdrake.com> writes: > On Sat, 17 Feb 2007, Tom Lane wrote: >> So I'd vote against complicating the API in order to make special >> provision for these results. I claim that all we need is a function >> taking (string text, pattern text, flags text) and returning either >> array of text or setof text > For this function, it would be setof array of text, as the capture groups > would definitely go in an array, but if you asked for global in the flags, > there could be more than one match in the string. Oh, right. And you could do a 2-D array if you wanted it all in one blob (or a guarantee of order). So no need for record-returning functions? regards, tom lane
On Sat, 17 Feb 2007, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > On Sat, 17 Feb 2007, Tom Lane wrote: > >> So I'd vote against complicating the API in order to make special > >> provision for these results. I claim that all we need is a function > >> taking (string text, pattern text, flags text) and returning either > >> array of text or setof text > > > For this function, it would be setof array of text, as the capture groups > > would definitely go in an array, but if you asked for global in the flags, > > there could be more than one match in the string. > > Oh, right. And you could do a 2-D array if you wanted it all in one > blob (or a guarantee of order). I don't think that there is much of an argument for having this one return a 2d array, in this particular case, even perl requires you to build a loop, IIRC. my $str = 'foobarbecuebazilbarf'; while($str=~/(b[^b]+)(b[^b]+)/g) { print $1, "\t", length($1), "\n"; print $2, "\t", length($2), "\n"; print "---\n"; } bar 3 becue 5 --- bazil 5 barf 4 --- > So no need for record-returning functions? No. -- Go placidly amid the noise and waste, and remember what value there may be in owning a piece thereof. -- National Lampoon, "Deteriorata"
Attached, please find a new version of the patch which follows Tom's advice (except regexp_matches returing a 2-D array, which I am not sure if it is required/desired). Hopefully this version makes everyone happy ;) On Sat, 17 Feb 2007, Tom Lane wrote: > So I'd vote against complicating the API in order to make special > provision for these results. I claim that all we need is a function > taking (string text, pattern text, flags text) and returning either > array of text or setof text containing the matched substrings in > whatever order is standard (order by left-parenthesis position, > I think). In the degenerate case where there are no parenthesized > subpatterns, just return the whole match as a single element. The signature is regexp_matches(string text, pattern text[, flags text]) returns setof text[] and behaves as described. > > As for the argument about array vs setof, I could see doing both to > end the argument of which one is really superior for any particular > problem. regexp_split(string text, pattern text[, flags text]) returns setof text regexp_split_array(string text, pattern text[. flags text[, limit int]]) returns text[] -- When I was in school, I cheated on my metaphysics exam: I looked into the soul of the boy sitting next to me. -- Woody Allen
Attachment
Jeremy Drake wrote: > > As for the argument about array vs setof, I could see doing both to > > end the argument of which one is really superior for any particular > > problem. > > regexp_split(string text, pattern text[, flags text]) returns setof > text > > regexp_split_array(string text, pattern text[. flags text[, limit > int]]) returns text[] Since you are not splitting an array but returning an array, I would think that "regexp_split_to_array" would be better, and the other should then be "regexp_split_to_table". But why does the second one have a limit and the first one doesn't? Is this because you rely on the LIMIT clause to do the same? Is there a guarantee that LIMIT on a table function makes a consistent order? -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Sun, 18 Feb 2007, Peter Eisentraut wrote: > Jeremy Drake wrote: > > > As for the argument about array vs setof, I could see doing both to > > > end the argument of which one is really superior for any particular > > > problem. > > > > regexp_split(string text, pattern text[, flags text]) returns setof > > text > > > > regexp_split_array(string text, pattern text[. flags text[, limit > > int]]) returns text[] > > Since you are not splitting an array but returning an array, I would > think that "regexp_split_to_array" would be better, and the other > should then be "regexp_split_to_table". OK > > But why does the second one have a limit and the first one doesn't? Is > this because you rely on the LIMIT clause to do the same? Yes > Is there a > guarantee that LIMIT on a table function makes a consistent order? Why wouldn't it? -- When you are in it up to your ears, keep your mouth shut.
On Sun, 18 Feb 2007, Jeremy Drake wrote: > On Sun, 18 Feb 2007, Peter Eisentraut wrote: > > > > regexp_split(string text, pattern text[, flags text]) returns setof > > > text > > > > > > regexp_split_array(string text, pattern text[. flags text[, limit > > > int]]) returns text[] > > > > Since you are not splitting an array but returning an array, I would > > think that "regexp_split_to_array" would be better, and the other > > should then be "regexp_split_to_table". > > OK > > > But why does the second one have a limit and the first one doesn't? I will rename the functions regexp_split_to_(table|array) and I will add an optional limit parameter to the regexp_split_to_table function, for consistency and to avoid ordering concerns with LIMIT. -- Sometimes I worry about being a success in a mediocre world. -- Lily Tomlin
Jeremy Drake <pgsql@jdrake.com> writes: > I will rename the functions regexp_split_to_(table|array) and I will add > an optional limit parameter to the regexp_split_to_table function, for > consistency and to avoid ordering concerns with LIMIT. I'd go the other way: get rid of the limit option all 'round. It seems like fairly useless complexity. regards, tom lane
On Sun, 18 Feb 2007, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > I will rename the functions regexp_split_to_(table|array) and I will add > > an optional limit parameter to the regexp_split_to_table function, for > > consistency and to avoid ordering concerns with LIMIT. > > I'd go the other way: get rid of the limit option all 'round. It seems > like fairly useless complexity. OK, here is what is hopefully the final version of this patch. I have renamed the functions as above, and removed the optional limit parameter. If there are no further complaints, this should be ready to apply. -- It is the business of little minds to shrink. -- Carl Sandburg
Attachment
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Jeremy Drake wrote: > On Sun, 18 Feb 2007, Tom Lane wrote: > > > Jeremy Drake <pgsql@jdrake.com> writes: > > > I will rename the functions regexp_split_to_(table|array) and I will add > > > an optional limit parameter to the regexp_split_to_table function, for > > > consistency and to avoid ordering concerns with LIMIT. > > > > I'd go the other way: get rid of the limit option all 'round. It seems > > like fairly useless complexity. > > OK, here is what is hopefully the final version of this patch. I have > renamed the functions as above, and removed the optional limit parameter. > If there are no further complaints, this should be ready to apply. > > -- > It is the business of little minds to shrink. > -- Carl Sandburg Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sun, 18 Feb 2007, Jeremy Drake wrote: > OK, here is what is hopefully the final version of this patch. I have > renamed the functions as above, and removed the optional limit parameter. > If there are no further complaints, this should be ready to apply. The patch has been sitting in the unapplied patches queue for a while, and the inevitable bitrot finally caught up with it. This version of the patch is exactly the same as the one in the patches queue, except for using new, non-conflicting oids for the functions. -- The gods gave man fire and he invented fire engines. They gave him love and he invented marriage.
Attachment
Jeremy Drake wrote: > The patch has been sitting in the unapplied patches queue for a while Barring any objections, I'll apply this tomorrow. -Neil
Jeremy Drake wrote: > The patch has been sitting in the unapplied patches queue for a while, and > the inevitable bitrot finally caught up with it. This version of the > patch is exactly the same as the one in the patches queue, except for > using new, non-conflicting oids for the functions. > Applied -- thanks for the patch. BTW, a few cosmetic comments: * #includes should be sorted alphabetically (unless another #include sort order rules take precedence, like including "postgres.h" first). * patch included some trailing CR line endings * IMHO using "++var" in the increment component of a for loop is bad style in C (if only because it is needlessly inconsistent; good C++ style is not necessarily good C style) * Comments like "/* get text type oid, too lazy to do it some other way */" do not exactly inspire confidence :) -Neil
Neil Conway <neilc@samurai.com> writes: > * Comments like "/* get text type oid, too lazy to do it some other way > */" do not exactly inspire confidence :) Surely the code was just using the TEXTOID macro? If so, it does not require a comment. If not, it should be fixed ... regards, tom lane
On Wed, 21 Mar 2007, Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: > > * Comments like "/* get text type oid, too lazy to do it some other way > > */" do not exactly inspire confidence :) > > Surely the code was just using the TEXTOID macro? If so, it does not > require a comment. If not, it should be fixed ... Nope, it does get_fn_expr_argtype(fcinfo->flinfo, 0); The too lazy remark was that I thought there may be a better way (like the macro you mentioned) but did not go looking for it because I already had something that worked that I found in the manual. If you like, I can put together another patch that removes the param_type members from the structs and hardcodes TEXTOID in the proper places. BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > > -- We're only in it for the volume. -- Black Sabbath
"Jeremy Drake" <pgsql@jdrake.com> writes: > BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros > for those also? Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic constants used throughout the source code. I'm surprised there isn't a macro for it though. Do you need the alignment? If so I want to check the code against the packed varlena patch. Just in case. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Gregory Stark" <stark@enterprisedb.com> writes: > "Jeremy Drake" <pgsql@jdrake.com> writes: > >> BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros >> for those also? > > Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic > constants used throughout the source code. I'm surprised there isn't a macro > for it though. > > Do you need the alignment? If so I want to check the code against the packed > varlena patch. Just in case. Ah, it's just to construct an array, that's not a concern at all. And you're detoasting the text data types before using or storing them so that's fine. The only thing I would say is that this should maybe be a TextGetDatum() just for code hygiene. It should be exactly equivalent though: + PointerGetDatum(matchctx->orig_str), -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Jeremy Drake <pgsql@jdrake.com> writes: > BTW, should I be calling > get_typlenbyvalalign on TEXTOID or are there macros for those also? By and large we tend to hard-wire those properties too, eg there are plenty of places that do things like this: /* XXX: this hardcodes assumptions about the regtype type */ result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i'); Some are better commented than others ... regards, tom lane
On Wed, 21 Mar 2007, Tom Lane wrote: > Jeremy Drake <pgsql@jdrake.com> writes: > > BTW, should I be calling > > get_typlenbyvalalign on TEXTOID or are there macros for those also? > > By and large we tend to hard-wire those properties too, eg there are > plenty of places that do things like this: > > /* XXX: this hardcodes assumptions about the regtype type */ > result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i'); > > Some are better commented than others ... So, what action (if any) should be taken? Should all (or some) of these values be hardcoded, or should the current calls to determine them be left in place? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > > -- Fortune's Real-Life Courtroom Quote #19: Q: Doctor, how many autopsies have you performed on dead people? A: All my autopsies have been performed on dead people.
Jeremy Drake <pgsql@jdrake.com> writes: > On Wed, 21 Mar 2007, Tom Lane wrote: >> By and large we tend to hard-wire those properties too, eg there are >> plenty of places that do things like this: > So, what action (if any) should be taken? Should all (or some) of these > values be hardcoded, or should the current calls to determine them be left > in place? I'd vote for making this new code look like the rest of it, to wit hardwire the values. As-is, you're expending code space and cycles on flexibility that's entirely illusory --- if we ever decided to change these properties of TEXT, we'd have way more work to do than just fixing the new regexp functions. regards, tom lane
On Wed, 21 Mar 2007, Gregory Stark wrote: > The only thing I would say is that this should maybe be a TextGetDatum() just > for code hygiene. It should be exactly equivalent though: I could not find such a thing using ctags, nor TextPGetDatum(). I looked at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses PointerGetDatum. If there is such a thing, it is well camoflaged... -- If you think the problem is bad now, just wait until we've solved it. -- Arthur Kasspe
On Thu, 22 Mar 2007, Tom Lane wrote: > I'd vote for making this new code look like the rest of it, to wit > hardwire the values. Attached please find a patch which does this. -- Although written many years ago, Lady Chatterley's Lover has just been reissued by the Grove Press, and this pictorial account of the day-to-day life of an English gamekeeper is full of considerable interest to outdoor minded readers, as it contains many passages on pheasant-raising, the apprehending of poachers, ways to control vermin, and other chores and duties of the professional gamekeeper. Unfortunately, one is obliged to wade through many pages of extraneous material in order to discover and savour those sidelights on the management of a midland shooting estate, and in this reviewer's opinion the book cannot take the place of J. R. Miller's "Practical Gamekeeping." -- Ed Zern, "Field and Stream" (Nov. 1959)
Attachment
Jeremy Drake <pgsql@jdrake.com> writes: > I could not find such a thing using ctags, nor TextPGetDatum(). I looked > at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses > PointerGetDatum. If there is such a thing, it is well camoflaged... AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. A function dealing with a "text *" doesn't normally have reason to convert that to a Datum until it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. Do you have a counterexample, or does this just suggest that the regexp function patch needs some refactoring? regards, tom lane
On Thu, 22 Mar 2007, Tom Lane wrote: > AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other > datatypes) is lack of obvious usefulness. A function dealing with a > "text *" doesn't normally have reason to convert that to a Datum until > it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. > Do you have a counterexample, or does this just suggest that the regexp > function patch needs some refactoring? If you are asking why I have reason to convert text * to a Datum in cases other than PG_RETURN_TEXT_P, it is used for calling text_substr functions using DirectFunctionCallN. BTW, this usage of text_substr using PointerGetDatum was copied from the pre-existing textregexsubstr function. -- Malek's Law: Any simple idea will be worded in the most complicated way.
Jeremy Drake wrote: > On Thu, 22 Mar 2007, Tom Lane wrote: > > > AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other > > datatypes) is lack of obvious usefulness. A function dealing with a > > "text *" doesn't normally have reason to convert that to a Datum until > > it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. > > Do you have a counterexample, or does this just suggest that the regexp > > function patch needs some refactoring? > > If you are asking why I have reason to convert text * to a Datum in cases > other than PG_RETURN_TEXT_P, it is used for calling text_substr functions > using DirectFunctionCallN. BTW, this usage of text_substr using > PointerGetDatum was copied from the pre-existing textregexsubstr function. Is there a follup patch based on this discussion? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > Jeremy Drake wrote: >> On Thu, 22 Mar 2007, Tom Lane wrote: >>> AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other >>> datatypes) is lack of obvious usefulness. >> If you are asking why I have reason to convert text * to a Datum in cases >> other than PG_RETURN_TEXT_P, it is used for calling text_substr functions >> using DirectFunctionCallN. BTW, this usage of text_substr using >> PointerGetDatum was copied from the pre-existing textregexsubstr function. > Is there a follup patch based on this discussion? Not at the moment. I suppose someone could run around and replace PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems like mostly make-work. I definitely don't want to spend time on such a project for 8.3. Or were you speaking to the question of whether to adjust the regexp patch to conform more nearly to the coding practices found elsewhere? I agree with that, but I thought there was already a submitted patch for it. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Jeremy Drake wrote: > >> On Thu, 22 Mar 2007, Tom Lane wrote: > >>> AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other > >>> datatypes) is lack of obvious usefulness. > > >> If you are asking why I have reason to convert text * to a Datum in cases > >> other than PG_RETURN_TEXT_P, it is used for calling text_substr functions > >> using DirectFunctionCallN. BTW, this usage of text_substr using > >> PointerGetDatum was copied from the pre-existing textregexsubstr function. > > > Is there a follup patch based on this discussion? > > Not at the moment. I suppose someone could run around and replace > PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems > like mostly make-work. I definitely don't want to spend time on such > a project for 8.3. > > Or were you speaking to the question of whether to adjust the regexp > patch to conform more nearly to the coding practices found elsewhere? > I agree with that, but I thought there was already a submitted patch > for it. Yes, regex patch adjustment, and I have not seen a patch which makes such adjustments. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, 26 Mar 2007, Bruce Momjian wrote: > Tom Lane wrote: > > Or were you speaking to the question of whether to adjust the regexp > > patch to conform more nearly to the coding practices found elsewhere? > > I agree with that, but I thought there was already a submitted patch > > for it. > > Yes, regex patch adjustment, and I have not seen a patch which makes > such adjustments. http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php -- Pope Goestheveezl was the shortest reigning pope in the history of the Church, reigning for two hours and six minutes on 1 April 1866. The white smoke had hardly faded into the blue of the Vatican skies before it dawned on the assembled multitudes in St. Peter's Square that his name had hilarious possibilities. The crowds fell about, helpless with laughter, singing Half a pound of tuppenny rice Half a pound of treacle That's the way the chimney smokes Pope Goestheveezl The square was finally cleared by armed carabineri with tears of laughter streaming down their faces. The event set a record for hilarious civic functions, smashing the previous record set when Baron Hans Neizant B"ompzidaize was elected Landburgher of K"oln in 1653. -- Mike Harding, "The Armchair Anarchist's Almanac"
Jeremy Drake wrote: > On Mon, 26 Mar 2007, Bruce Momjian wrote: > > > Tom Lane wrote: > > > Or were you speaking to the question of whether to adjust the regexp > > > patch to conform more nearly to the coding practices found elsewhere? > > > I agree with that, but I thought there was already a submitted patch > > > for it. > > > > Yes, regex patch adjustment, and I have not seen a patch which makes > > such adjustments. > > http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php Ah, yes, I still have that in my mailbox, but didn't think it was related because there was discussion _after_ the patch was posted. Will add the patch to the queue now. Thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Jeremy Drake wrote: > On Thu, 22 Mar 2007, Tom Lane wrote: > > > I'd vote for making this new code look like the rest of it, to wit > > hardwire the values. > > Attached please find a patch which does this. > > > > -- > Although written many years ago, Lady Chatterley's Lover has just been > reissued by the Grove Press, and this pictorial account of the > day-to-day life of an English gamekeeper is full of considerable > interest to outdoor minded readers, as it contains many passages on > pheasant-raising, the apprehending of poachers, ways to control vermin, > and other chores and duties of the professional gamekeeper. > Unfortunately, one is obliged to wade through many pages of extraneous > material in order to discover and savour those sidelights on the > management of a midland shooting estate, and in this reviewer's opinion > the book cannot take the place of J. R. Miller's "Practical > Gamekeeping." > -- Ed Zern, "Field and Stream" (Nov. 1959) Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> Jeremy Drake wrote: > > On Thu, 22 Mar 2007, Tom Lane wrote: > > > > > I'd vote for making this new code look like the rest of it, to wit > > > hardwire the values. > > > > Attached please find a patch which does this. I just realized that the last patch removed all usage of fcinfo in the setup_regexp_matches function, so this version of the patch also removes it as a parameter to that function. -- Think of it! With VLSI we can pack 100 ENIACs in 1 sq. cm.!
Attachment
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Jeremy Drake wrote: > > Jeremy Drake wrote: > > > On Thu, 22 Mar 2007, Tom Lane wrote: > > > > > > > I'd vote for making this new code look like the rest of it, to wit > > > > hardwire the values. > > > > > > Attached please find a patch which does this. > > I just realized that the last patch removed all usage of fcinfo in the > setup_regexp_matches function, so this version of the patch also removes > it as a parameter to that function. > > -- > Think of it! With VLSI we can pack 100 ENIACs in 1 sq. cm.! Content-Description: [ Attachment, skipping... ] -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Jeremy Drake wrote: > I just realized that the last patch removed all usage of fcinfo in the > setup_regexp_matches function, so this version of the patch also removes > it as a parameter to that function. Applied, thanks. -Neil