Thread: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Florian Pflug

Date:

19 June 2011, 10:54:08

Hi

It looks like we've failed to reach an agreement on how to
proceed on the issue with missing commutators for the various
text matching operators ("~", "~~", and their case-insensitive
variants). We do seem to have agreed, however, that adding
commutators for the non-deprecated operators which lack them
is generally a Good Idea.

Amidst the discussion, Alvaro suggested that we resolve the issue
by adding a distinct type for patterns as opposed to text. That'd
allow us to make "~" it's own commutator by defining both text ~ pattern
and pattern ~ text.

We'd of course need to keep the operator text ~ text
and make it behave like text ~ pattern.
Thus, if someone wrote 'a_pattern' ~ 'some_text'
(i.e. forgot to cast 'a_pattern' to type "pattern"), he wouldn't
get an error but instead unintended behaviour. If we want to avoid
that too, we'd have to name the new operators something other than
"~".

There's also the question of how we deal with "~~" (the operator
behind LIKE). We could either re-use the type "pattern" for that,
meaning that values of type "pattern" would represent any kind of
text pattern, not necessarily a regular expression. Alternatively,
we could represent LIKE pattern by a type distinct from "pattern",
say "likepattern". Finally, we could handle LIKE like we handle
SIMILAR TO, i.e. define a function that transforms a LIKE pattern
into a regular expression, and deprecate the "~~" operator and friends.

The last option looks appealing from a code complexity point of view,
but might severely harm performance of LIKE and ILIKE comparisons.

Comments? Opinions?

best regards,
Florian Pflug



Someone

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Robert Haas

Date:

19 June 2011, 15:56:51

On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug <fgp@phlo.org> wrote:
> Amidst the discussion, Alvaro suggested that we resolve the issue
> by adding a distinct type for patterns as opposed to text. That'd
> allow us to make "~" it's own commutator by defining both
>  text ~ pattern
> and
>  pattern ~ text.

That's kind of a neat idea.  There might be an efficiency benefit to
having a regex type that is precompiled by the input function.

> There's also the question of how we deal with "~~" (the operator
> behind LIKE). We could either re-use the type "pattern" for that,
> meaning that values of type "pattern" would represent any kind of
> text pattern, not necessarily a regular expression. Alternatively,
> we could represent LIKE pattern by a type distinct from "pattern",
> say "likepattern". Finally, we could handle LIKE like we handle
> SIMILAR TO, i.e. define a function that transforms a LIKE pattern
> into a regular expression, and deprecate the "~~" operator and friends.
>
> The last option looks appealing from a code complexity point of view,
> but might severely harm performance of LIKE and ILIKE comparisons.

I don't believe it would be a very good idea to try to shoehorn
multiple kinds of patterns into a single pattern type.

I do think this may be the long route to solving this problem, though.Is it really this hard to agree on a commutator
name? I mean, I'm 
not in love with anything that's been suggested so far, but I could
live with any of them.  An unintuitive operator name for
matches-with-the-arguments-reversed is not going to be the worst wart
we have, by a long shot...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Florian Pflug

Date:

19 June 2011, 16:26:59

On Jun19, 2011, at 20:56 , Robert Haas wrote:
> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug <fgp@phlo.org> wrote:
>> Amidst the discussion, Alvaro suggested that we resolve the issue
>> by adding a distinct type for patterns as opposed to text. That'd
>> allow us to make "~" it's own commutator by defining both
>>  text ~ pattern
>> and
>>  pattern ~ text.
> 
> That's kind of a neat idea.  There might be an efficiency benefit to
> having a regex type that is precompiled by the input function.

Hm, yeah, that though crossed my mind too. A distinct type is only
a first step in that direction though - we'd also need a way to 
attach a parsed representation of a value to a varlena. If you have
an idea how to accomplish that, by all means, out with it! ;-)
The XML would also benefit greatly...

>> There's also the question of how we deal with "~~" (the operator
>> behind LIKE). We could either re-use the type "pattern" for that,
>> meaning that values of type "pattern" would represent any kind of
>> text pattern, not necessarily a regular expression. Alternatively,
>> we could represent LIKE pattern by a type distinct from "pattern",
>> say "likepattern". Finally, we could handle LIKE like we handle
>> SIMILAR TO, i.e. define a function that transforms a LIKE pattern
>> into a regular expression, and deprecate the "~~" operator and friends.
>> 
>> The last option looks appealing from a code complexity point of view,
>> but might severely harm performance of LIKE and ILIKE comparisons.
> 
> I don't believe it would be a very good idea to try to shoehorn
> multiple kinds of patterns into a single pattern type.

That depends on whether we expect to eventually make LIKE
use the regex matching machinery. If we do, then it's not really
shoehorning. If we don't, then yeah, using a single type seems
unwise, especially in the light of your idea of keeping a parsed
representation of regexp's around.

> I do think this may be the long route to solving this problem, though.

Yeah - but maybe also the one with the largest benefit in the long run.
We're also just at the beginning of a release cycle, so I think we
have time enough to figure this out...

> Is it really this hard to agree on a commutator name?

So far, every suggestion has been met with fierce opposition, so, um,
yeah it is I'd say...

> I mean, I'm
> not in love with anything that's been suggested so far, but I could
> live with any of them.  An unintuitive operator name for
> matches-with-the-arguments-reversed is not going to be the worst wart
> we have, by a long shot...

Maybe not. But then, if the name is unintuitive enough to impair
readability anyway, then people might just as well define a custom
operator in their database. Since we're capable of inlining SQL
functions, there won't even be a difference in performance. The only
real benefit of having this is core is that you don't have to
go search the catalog to find the meaning of such an operator if
you encounter it in an SQL statement.

best regards,
Florian Pflug

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Andrew Dunstan

Date:

19 June 2011, 16:38:40


On 06/19/2011 02:56 PM, Robert Haas wrote:
> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org>  wrote:
>> Amidst the discussion, Alvaro suggested that we resolve the issue
>> by adding a distinct type for patterns as opposed to text. That'd
>> allow us to make "~" it's own commutator by defining both
>>   text ~ pattern
>> and
>>   pattern ~ text.
> That's kind of a neat idea.  There might be an efficiency benefit to
> having a regex type that is precompiled by the input function.


What do we do when we get text or unknown in place of pattern? How are 
we going to know if the pattern is supposed to be the left or right operand?

cheers

andrew

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Tom Lane

Date:

19 June 2011, 17:10:32

Andrew Dunstan <andrew@dunslane.net> writes:
> On 06/19/2011 02:56 PM, Robert Haas wrote:
>> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org>  wrote:
>>> Amidst the discussion, Alvaro suggested that we resolve the issue
>>> by adding a distinct type for patterns as opposed to text. That'd
>>> allow us to make "~" it's own commutator by defining both
>>> text ~ pattern
>>> and
>>> pattern ~ text.

>> That's kind of a neat idea.  There might be an efficiency benefit to
>> having a regex type that is precompiled by the input function.

> What do we do when we get text or unknown in place of pattern? How are 
> we going to know if the pattern is supposed to be the left or right operand?

Yeah, this would result inSELECT 'something' ~ 'something';
failing outright.  I don't think it's a good substitute for biting
the bullet and choosing distinct operator names.

(I do think a distinct regex datatype might be a good idea, but it
doesn't eliminate this particular problem.)
        regards, tom lane

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Florian Pflug

Date:

19 June 2011, 18:02:52

On Jun19, 2011, at 22:10 , Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 06/19/2011 02:56 PM, Robert Haas wrote:
>>> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org>  wrote:
>>>> Amidst the discussion, Alvaro suggested that we resolve the issue
>>>> by adding a distinct type for patterns as opposed to text. That'd
>>>> allow us to make "~" it's own commutator by defining both
>>>> text ~ pattern
>>>> and
>>>> pattern ~ text.
>
>>> That's kind of a neat idea.  There might be an efficiency benefit to
>>> having a regex type that is precompiled by the input function.
>
>> What do we do when we get text or unknown in place of pattern? How are
>> we going to know if the pattern is supposed to be the left or right operand?
>
> Yeah, this would result in
>     SELECT 'something' ~ 'something';
> failing outright.  I don't think it's a good substitute for biting
> the bullet and choosing distinct operator names.

Yeah, well, the complaint (put forward mainly by Alvaro) that lead to
this approach in the first place was precisely that 'something' ~ 'anything'
*doesn't* give any indication of what constitutes the pattern and
what the text.

So I consider that to be a feature, not a bug.

BTW, arithmetical operators currently show exactly the same behaviour
postgres# select '1' + '1'
ERROR:  operator is not unique: unknown + unknown at character 12

The only argument against that I can see is that it poses
a compatibility problem if "~" remains the pattern matching
operator. I do believe, however, that the chance of unknown ~ unknown
appearing in actual applications is rather small - that'd only
happen if people used postgresql's regexp engine together with
purely external data.

best regards,
Florian Pflug

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Andrew Dunstan

Date:

19 June 2011, 19:56:21


On 06/19/2011 05:02 PM, Florian Pflug wrote:
> On Jun19, 2011, at 22:10 , Tom Lane wrote:
>> Andrew Dunstan<andrew@dunslane.net>  writes:
>>> On 06/19/2011 02:56 PM, Robert Haas wrote:
>>>> On Sun, Jun 19, 2011 at 9:53 AM, Florian Pflug<fgp@phlo.org>   wrote:
>>>>> Amidst the discussion, Alvaro suggested that we resolve the issue
>>>>> by adding a distinct type for patterns as opposed to text. That'd
>>>>> allow us to make "~" it's own commutator by defining both
>>>>> text ~ pattern
>>>>> and
>>>>> pattern ~ text.
>>>> That's kind of a neat idea.  There might be an efficiency benefit to
>>>> having a regex type that is precompiled by the input function.
>>> What do we do when we get text or unknown in place of pattern? How are
>>> we going to know if the pattern is supposed to be the left or right operand?
>> Yeah, this would result in
>>     SELECT 'something' ~ 'something';
>> failing outright.  I don't think it's a good substitute for biting
>> the bullet and choosing distinct operator names.
> Yeah, well, the complaint (put forward mainly by Alvaro) that lead to
> this approach in the first place was precisely that
>    'something' ~ 'anything'
> *doesn't* give any indication of what constitutes the pattern and
> what the text.
>
> So I consider that to be a feature, not a bug.
>
> BTW, arithmetical operators currently show exactly the same behaviour
> postgres# select '1' + '1'
> ERROR:  operator is not unique: unknown + unknown at character 12
>
> The only argument against that I can see is that it poses
> a compatibility problem if "~" remains the pattern matching
> operator. I do believe, however, that the chance of
>    unknown ~ unknown
> appearing in actual applications is rather small - that'd only
> happen if people used postgresql's regexp engine together with
> purely external data.
>
>

People can store regular expressions in text fields now, and do, let me 
assure you. So the chances you'll encounter text ~ unknown or unknown ~ 
text  or text ~ text are 100%

cheers

andrew

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Florian Pflug

Date:

19 June 2011, 20:56:54

On Jun20, 2011, at 00:56 , Andrew Dunstan wrote:
> On 06/19/2011 05:02 PM, Florian Pflug wrote:
>> The only argument against that I can see is that it poses
>> a compatibility problem if "~" remains the pattern matching
>> operator. I do believe, however, that the chance of
>>   unknown ~ unknown
>> appearing in actual applications is rather small - that'd only
>> happen if people used postgresql's regexp engine together with
>> purely external data.
> 
> People can store regular expressions in text fields now, and do,
> let me assure you. So the chances you'll encounter text ~ unknown
> or unknown ~ text  or text ~ text are 100%

Hm, it seems we either all have different idea about how such
a pattern type would be be defined, or have grown so accustomed to
pg's type system that we've forgotten how powerful it really
is ;-) (For me, the latter is surely true...).

I've now created a primitive prototype that uses a composite
type for "pattern". That changes the input syntax for patterns
(you need to enclose them in brackets), but should model all
the implicit and explicit casting rules and operator selection
correctly. It also uses "~~~" in place of "~", for obvious
reasons and again without changing the casting and overloading
rules.

The prototype defines text ~~~ text text ~~~ pattern pattern ~~~ text
and can be found at end of this mail.

With that prototype, *all* the cases (even unknown ~~~ unknown)
work as today, i.e. all of the statements below return true

postgres=# select 'abc' ~~~ '^ab+c$';
postgres=# select 'abc'::text ~~~ '^ab+c$';
postgres=# select 'abc' ~~~ '^ab+c$'::text;
postgres=# select 'abc' ~~~ '(^ab+c$)'::pattern;
postgres=# select '(^ab+c$)'::pattern ~~~ 'abc';

(The same happens with and without setting pattern's typcategory
to 'S'. Not really sure if the category has any effect here
at all).

That's not exactly what I had in mind - I'd have preferred unknown ~~~ unknown
to return an error but text ~~~ unknown
and unknown ~~~ text
to work, but it looks that that's not easily done.

Still, I believe the behaviour of the prototype is acceptable.

BTW, The reason that 'unknown ~~~ unknown' works is, I believe
the following comment func_select_candidate, together with the
fact that 'text' is the preferred type in the string category.
 If any candidate has an input datatype of STRING category, use STRING category (this bias towards STRING is
appropriatesince unknown-type literals look like strings).

best regards,
Florian Pflug

create type pattern as (p text);

create function match_right(l text, r text) returns boolean as $$ select $1 ~ $2
$$ language sql strict immutable;

create operator ~~~ ( procedure = match_right, leftarg = text, rightarg = text
);

create function match_right(l text, r pattern) returns boolean as $$ select $1 ~ $2.p
$$ language sql strict immutable;

create operator ~~~ ( procedure = match_right, commutator = '~~~', leftarg = text, rightarg = pattern
);

create function match_left(l pattern, r text) returns boolean as $$ select $2 ~ $1.p
$$ language sql strict immutable;

create operator ~~~ ( procedure = match_left, commutator = '~~~', leftarg = pattern, rightarg = text
);

update pg_type set typcategory = 'S' where oid = 'pattern'::regtype;

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

"David E. Wheeler"

Date:

20 June 2011, 13:29:08

On Jun 19, 2011, at 4:56 PM, Florian Pflug wrote:

> Hm, it seems we either all have different idea about how such
> a pattern type would be be defined, or have grown so accustomed to
> pg's type system that we've forgotten how powerful it really
> is ;-) (For me, the latter is surely true...).
>
> I've now created a primitive prototype that uses a composite
> type for "pattern". That changes the input syntax for patterns
> (you need to enclose them in brackets), but should model all
> the implicit and explicit casting rules and operator selection
> correctly. It also uses "~~~" in place of "~", for obvious
> reasons and again without changing the casting and overloading
> rules.

Ew.

> The prototype defines
>  text ~~~ text
>  text ~~~ pattern
>  pattern ~~~ text
> and can be found at end of this mail.
>
> With that prototype, *all* the cases (even unknown ~~~ unknown)
> work as today, i.e. all of the statements below return true

Florian++ Very nice, thanks!

I don't suppose there's a special quoting to be had for patterns? Perhaps one of these (modulo SQL parsing issues);
   /pattern/   {pattern}   qr/pattern/   qr'pattern'   R/pattern/   R'pattern'

Mike bikeshed is scarlet,

David

Re: Adding a distinct "pattern" type to resolve the "~" commutator stalemate

From

Florian Pflug

Date:

20 June 2011, 13:38:42

On Jun20, 2011, at 18:28 , David E. Wheeler wrote:
> I don't suppose there's a special quoting to be had for patterns? Perhaps one of these (modulo SQL parsing issues);
>
>    /pattern/
>    {pattern}
>    qr/pattern/
>    qr'pattern'
>    R/pattern/
>    R'pattern'

Pretty daring suggestion, I must say ;-)

I think regexp's are nearly prominent enough in SQL to warrant this.
Also, the main reason why this is such a huge deal for most programming
languages is that it avoids having to double-escape backslashes.

At least with standard_conforming_strings=on, however, that isn't a problem
in SQL because backslashes in literals aren't treated specially. For example
writing 'test' ~ '^\w+$'
Just Works (TM) if standard_conforming_strings=on, whereas in C you'd
have to write regexp_match("test", "^\\w+$")
to give the regexp engine a chance to even see the "\".

best regards,
Florian Pflug