Thread: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
Hi It looks like we've failed to reach an agreement on how to proceed on the issue with missing commutators for the various text matching operators ("~", "~~", and their case-insensitive variants). We do seem to have agreed, however, that adding commutators for the non-deprecated operators which lack them is generally a Good Idea. Amidst the discussion, Alvaro suggested that we resolve the issue by adding a distinct type for patterns as opposed to text. That'd allow us to make "~" it's own commutator by defining both text ~ pattern and pattern ~ text. We'd of course need to keep the operator text ~ text and make it behave like text ~ pattern. Thus, if someone wrote 'a_pattern' ~ 'some_text' (i.e. forgot to cast 'a_pattern' to type "pattern"), he wouldn't get an error but instead unintended behaviour. If we want to avoid that too, we'd have to name the new operators something other than "~". There's also the question of how we deal with "~~" (the operator behind LIKE). We could either re-use the type "pattern" for that, meaning that values of type "pattern" would represent any kind of text pattern, not necessarily a regular expression. Alternatively, we could represent LIKE pattern by a type distinct from "pattern", say "likepattern". Finally, we could handle LIKE like we handle SIMILAR TO, i.e. define a function that transforms a LIKE pattern into a regular expression, and deprecate the "~~" operator and friends. The last option looks appealing from a code complexity point of view, but might severely harm performance of LIKE and ILIKE comparisons. Comments? Opinions? best regards, Florian Pflug Someone
On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug <fgp@phlo.org> wrote: > Amidst the discussion, Alvaro suggested that we resolve the issue > by adding a distinct type for patterns as opposed to text. That'd > allow us to make "~" it's own commutator by defining both > text ~ pattern > and > pattern ~ text. That's kind of a neat idea. There might be an efficiency benefit to having a regex type that is precompiled by the input function. > There's also the question of how we deal with "~~" (the operator > behind LIKE). We could either re-use the type "pattern" for that, > meaning that values of type "pattern" would represent any kind of > text pattern, not necessarily a regular expression. Alternatively, > we could represent LIKE pattern by a type distinct from "pattern", > say "likepattern". Finally, we could handle LIKE like we handle > SIMILAR TO, i.e. define a function that transforms a LIKE pattern > into a regular expression, and deprecate the "~~" operator and friends. > > The last option looks appealing from a code complexity point of view, > but might severely harm performance of LIKE and ILIKE comparisons. I don't believe it would be a very good idea to try to shoehorn multiple kinds of patterns into a single pattern type. I do think this may be the long route to solving this problem, though.Is it really this hard to agree on a commutator name? I mean, I'm not in love with anything that's been suggested so far, but I could live with any of them. An unintuitive operator name for matches-with-the-arguments-reversed is not going to be the worst wart we have, by a long shot... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Florian Pflug
Date:
On Jun19, 2011, at 20:56 , Robert Haas wrote: > On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug <fgp@phlo.org> wrote: >> Amidst the discussion, Alvaro suggested that we resolve the issue >> by adding a distinct type for patterns as opposed to text. That'd >> allow us to make "~" it's own commutator by defining both >> text ~ pattern >> and >> pattern ~ text. > > That's kind of a neat idea. There might be an efficiency benefit to > having a regex type that is precompiled by the input function. Hm, yeah, that though crossed my mind too. A distinct type is only a first step in that direction though - we'd also need a way to attach a parsed representation of a value to a varlena. If you have an idea how to accomplish that, by all means, out with it! ;-) The XML would also benefit greatly... >> There's also the question of how we deal with "~~" (the operator >> behind LIKE). We could either re-use the type "pattern" for that, >> meaning that values of type "pattern" would represent any kind of >> text pattern, not necessarily a regular expression. Alternatively, >> we could represent LIKE pattern by a type distinct from "pattern", >> say "likepattern". Finally, we could handle LIKE like we handle >> SIMILAR TO, i.e. define a function that transforms a LIKE pattern >> into a regular expression, and deprecate the "~~" operator and friends. >> >> The last option looks appealing from a code complexity point of view, >> but might severely harm performance of LIKE and ILIKE comparisons. > > I don't believe it would be a very good idea to try to shoehorn > multiple kinds of patterns into a single pattern type. That depends on whether we expect to eventually make LIKE use the regex matching machinery. If we do, then it's not really shoehorning. If we don't, then yeah, using a single type seems unwise, especially in the light of your idea of keeping a parsed representation of regexp's around. > I do think this may be the long route to solving this problem, though. Yeah - but maybe also the one with the largest benefit in the long run. We're also just at the beginning of a release cycle, so I think we have time enough to figure this out... > Is it really this hard to agree on a commutator name? So far, every suggestion has been met with fierce opposition, so, um, yeah it is I'd say... > I mean, I'm > not in love with anything that's been suggested so far, but I could > live with any of them. An unintuitive operator name for > matches-with-the-arguments-reversed is not going to be the worst wart > we have, by a long shot... Maybe not. But then, if the name is unintuitive enough to impair readability anyway, then people might just as well define a custom operator in their database. Since we're capable of inlining SQL functions, there won't even be a difference in performance. The only real benefit of having this is core is that you don't have to go search the catalog to find the meaning of such an operator if you encounter it in an SQL statement. best regards, Florian Pflug
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Andrew Dunstan
Date:
On 06/19/2011 02:56 PM, Robert Haas wrote: > On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org> wrote: >> Amidst the discussion, Alvaro suggested that we resolve the issue >> by adding a distinct type for patterns as opposed to text. That'd >> allow us to make "~" it's own commutator by defining both >> text ~ pattern >> and >> pattern ~ text. > That's kind of a neat idea. There might be an efficiency benefit to > having a regex type that is precompiled by the input function. What do we do when we get text or unknown in place of pattern? How are we going to know if the pattern is supposed to be the left or right operand? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > On 06/19/2011 02:56 PM, Robert Haas wrote: >> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org> wrote: >>> Amidst the discussion, Alvaro suggested that we resolve the issue >>> by adding a distinct type for patterns as opposed to text. That'd >>> allow us to make "~" it's own commutator by defining both >>> text ~ pattern >>> and >>> pattern ~ text. >> That's kind of a neat idea. There might be an efficiency benefit to >> having a regex type that is precompiled by the input function. > What do we do when we get text or unknown in place of pattern? How are > we going to know if the pattern is supposed to be the left or right operand? Yeah, this would result inSELECT 'something' ~ 'something'; failing outright. I don't think it's a good substitute for biting the bullet and choosing distinct operator names. (I do think a distinct regex datatype might be a good idea, but it doesn't eliminate this particular problem.) regards, tom lane
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Florian Pflug
Date:
On Jun19, 2011, at 22:10 , Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 06/19/2011 02:56 PM, Robert Haas wrote: >>> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org> wrote: >>>> Amidst the discussion, Alvaro suggested that we resolve the issue >>>> by adding a distinct type for patterns as opposed to text. That'd >>>> allow us to make "~" it's own commutator by defining both >>>> text ~ pattern >>>> and >>>> pattern ~ text. > >>> That's kind of a neat idea. There might be an efficiency benefit to >>> having a regex type that is precompiled by the input function. > >> What do we do when we get text or unknown in place of pattern? How are >> we going to know if the pattern is supposed to be the left or right operand? > > Yeah, this would result in > SELECT 'something' ~ 'something'; > failing outright. I don't think it's a good substitute for biting > the bullet and choosing distinct operator names. Yeah, well, the complaint (put forward mainly by Alvaro) that lead to this approach in the first place was precisely that 'something' ~ 'anything' *doesn't* give any indication of what constitutes the pattern and what the text. So I consider that to be a feature, not a bug. BTW, arithmetical operators currently show exactly the same behaviour postgres# select '1' + '1' ERROR: operator is not unique: unknown + unknown at character 12 The only argument against that I can see is that it poses a compatibility problem if "~" remains the pattern matching operator. I do believe, however, that the chance of unknown ~ unknown appearing in actual applications is rather small - that'd only happen if people used postgresql's regexp engine together with purely external data. best regards, Florian Pflug
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Andrew Dunstan
Date:
On 06/19/2011 05:02 PM, Florian Pflug wrote: > On Jun19, 2011, at 22:10 , Tom Lane wrote: >> Andrew Dunstan<andrew@dunslane.net> writes: >>> On 06/19/2011 02:56 PM, Robert Haas wrote: >>>> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org> wrote: >>>>> Amidst the discussion, Alvaro suggested that we resolve the issue >>>>> by adding a distinct type for patterns as opposed to text. That'd >>>>> allow us to make "~" it's own commutator by defining both >>>>> text ~ pattern >>>>> and >>>>> pattern ~ text. >>>> That's kind of a neat idea. There might be an efficiency benefit to >>>> having a regex type that is precompiled by the input function. >>> What do we do when we get text or unknown in place of pattern? How are >>> we going to know if the pattern is supposed to be the left or right operand? >> Yeah, this would result in >> SELECT 'something' ~ 'something'; >> failing outright. I don't think it's a good substitute for biting >> the bullet and choosing distinct operator names. > Yeah, well, the complaint (put forward mainly by Alvaro) that lead to > this approach in the first place was precisely that > 'something' ~ 'anything' > *doesn't* give any indication of what constitutes the pattern and > what the text. > > So I consider that to be a feature, not a bug. > > BTW, arithmetical operators currently show exactly the same behaviour > postgres# select '1' + '1' > ERROR: operator is not unique: unknown + unknown at character 12 > > The only argument against that I can see is that it poses > a compatibility problem if "~" remains the pattern matching > operator. I do believe, however, that the chance of > unknown ~ unknown > appearing in actual applications is rather small - that'd only > happen if people used postgresql's regexp engine together with > purely external data. > > People can store regular expressions in text fields now, and do, let me assure you. So the chances you'll encounter text ~ unknown or unknown ~ text or text ~ text are 100% cheers andrew
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Florian Pflug
Date:
On Jun20, 2011, at 00:56 , Andrew Dunstan wrote: > On 06/19/2011 05:02 PM, Florian Pflug wrote: >> The only argument against that I can see is that it poses >> a compatibility problem if "~" remains the pattern matching >> operator. I do believe, however, that the chance of >> unknown ~ unknown >> appearing in actual applications is rather small - that'd only >> happen if people used postgresql's regexp engine together with >> purely external data. > > People can store regular expressions in text fields now, and do, > let me assure you. So the chances you'll encounter text ~ unknown > or unknown ~ text or text ~ text are 100% Hm, it seems we either all have different idea about how such a pattern type would be be defined, or have grown so accustomed to pg's type system that we've forgotten how powerful it really is ;-) (For me, the latter is surely true...). I've now created a primitive prototype that uses a composite type for "pattern". That changes the input syntax for patterns (you need to enclose them in brackets), but should model all the implicit and explicit casting rules and operator selection correctly. It also uses "~~~" in place of "~", for obvious reasons and again without changing the casting and overloading rules. The prototype defines text ~~~ text text ~~~ pattern pattern ~~~ text and can be found at end of this mail. With that prototype, *all* the cases (even unknown ~~~ unknown) work as today, i.e. all of the statements below return true postgres=# select 'abc' ~~~ '^ab+c$'; postgres=# select 'abc'::text ~~~ '^ab+c$'; postgres=# select 'abc' ~~~ '^ab+c$'::text; postgres=# select 'abc' ~~~ '(^ab+c$)'::pattern; postgres=# select '(^ab+c$)'::pattern ~~~ 'abc'; (The same happens with and without setting pattern's typcategory to 'S'. Not really sure if the category has any effect here at all). That's not exactly what I had in mind - I'd have preferred unknown ~~~ unknown to return an error but text ~~~ unknown and unknown ~~~ text to work, but it looks that that's not easily done. Still, I believe the behaviour of the prototype is acceptable. BTW, The reason that 'unknown ~~~ unknown' works is, I believe the following comment func_select_candidate, together with the fact that 'text' is the preferred type in the string category. If any candidate has an input datatype of STRING category, use STRING category (this bias towards STRING is appropriatesince unknown-type literals look like strings). best regards, Florian Pflug create type pattern as (p text); create function match_right(l text, r text) returns boolean as $$ select $1 ~ $2 $$ language sql strict immutable; create operator ~~~ ( procedure = match_right, leftarg = text, rightarg = text ); create function match_right(l text, r pattern) returns boolean as $$ select $1 ~ $2.p $$ language sql strict immutable; create operator ~~~ ( procedure = match_right, commutator = '~~~', leftarg = text, rightarg = pattern ); create function match_left(l pattern, r text) returns boolean as $$ select $2 ~ $1.p $$ language sql strict immutable; create operator ~~~ ( procedure = match_left, commutator = '~~~', leftarg = pattern, rightarg = text ); update pg_type set typcategory = 'S' where oid = 'pattern'::regtype;
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
"David E. Wheeler"
Date:
On Jun 19, 2011, at 4:56 PM, Florian Pflug wrote: > Hm, it seems we either all have different idea about how such > a pattern type would be be defined, or have grown so accustomed to > pg's type system that we've forgotten how powerful it really > is ;-) (For me, the latter is surely true...). > > I've now created a primitive prototype that uses a composite > type for "pattern". That changes the input syntax for patterns > (you need to enclose them in brackets), but should model all > the implicit and explicit casting rules and operator selection > correctly. It also uses "~~~" in place of "~", for obvious > reasons and again without changing the casting and overloading > rules. Ew. > The prototype defines > text ~~~ text > text ~~~ pattern > pattern ~~~ text > and can be found at end of this mail. > > With that prototype, *all* the cases (even unknown ~~~ unknown) > work as today, i.e. all of the statements below return true Florian++ Very nice, thanks! I don't suppose there's a special quoting to be had for patterns? Perhaps one of these (modulo SQL parsing issues); /pattern/ {pattern} qr/pattern/ qr'pattern' R/pattern/ R'pattern' Mike bikeshed is scarlet, David
Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate
From
Florian Pflug
Date:
On Jun20, 2011, at 18:28 , David E. Wheeler wrote: > I don't suppose there's a special quoting to be had for patterns? Perhaps one of these (modulo SQL parsing issues); > > /pattern/ > {pattern} > qr/pattern/ > qr'pattern' > R/pattern/ > R'pattern' Pretty daring suggestion, I must say ;-) I think regexp's are nearly prominent enough in SQL to warrant this. Also, the main reason why this is such a huge deal for most programming languages is that it avoids having to double-escape backslashes. At least with standard_conforming_strings=on, however, that isn't a problem in SQL because backslashes in literals aren't treated specially. For example writing 'test' ~ '^\w+$' Just Works (TM) if standard_conforming_strings=on, whereas in C you'd have to write regexp_match("test", "^\\w+$") to give the regexp engine a chance to even see the "\". best regards, Florian Pflug