Re: insensitive collations - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: insensitive collations
Date
Msg-id CAH2-Wzm2DhYvHYzOqD49rdkQ_MJ-Vhqvv3_noow6ui+R35goZw@mail.gmail.com
Whole thread Raw
In response to Re: insensitive collations  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses Re: insensitive collations  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Another patch to address merge conflicts.

Some remarks on this:

* Your draft commit message says:

> This patch makes changes in three areas:
>
> - CREATE COLLATION DDL changes and system catalog changes to support
>   this new flag.
>
> - Many executor nodes and auxiliary code are extended to track
>   collations.  Previously, this code would just throw away collation
>   information, because the eventually-called user-defined functions
>   didn't use it since they only cared about equality, which didn't
>   need collation information.
>
> - String data type functions that do equality comparisons and hashing
>   are changed to take the (non-)deterministic flag into account.  For
>   comparison, this just means skipping various shortcuts and tie
>   breakers that use byte-wise comparison.  For hashing, we first need
>   to convert the input string to a canonical "sort key" using the ICU
>   analogue of strxfrm().

I wonder if it would be better to break this into distinct commits?

* Why is support for non-deterministic comparisons limited to the ICU
provider? If that's the only case that could possibly be affected,
then why did we ever add the varstrcmp() tie-breaker call to strcmp()
in the first place? The behavior described in the commit message of
bugfix commit 656beff5 describes a case where Hungarian text caused
index corruption by being strcoll()-wise equal but not bitwise equal.
Besides, I think that you can vendor your own case insensitive
collation with glibc, since it's based on UCA.

This restriction feels like it's actually a kludge to work around the
fact that database-wide collations have this foreign key related
restriction in your patch:

> @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid,
>     }
>
>     /*
> -    * Apply the comparison operator.  We assume it doesn't care about
> -    * collations.
> -    */
> -   return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo,
> -                                     oldvalue, newvalue));
> +    * Apply the comparison operator.
> +    *
> +    * Note: This function is part of a call stack that determines whether an
> +    * update to a row is significant enough that it needs checking or action
> +    * on the other side of a foreign-key constraint.  Therefore, the
> +    * comparison here would need to be done with the collation of the *other*
> +    * table.  For simplicity (e.g., we might not even have the other table
> +    * open), we'll just use the default collation here, which could lead to
> +    * some false negatives.  All this would break if we ever allow
> +    * database-wide collations to be nondeterministic.
> +    */
> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
> +                                         DEFAULT_COLLATION_OID,
> +                                         oldvalue, newvalue));
>  }

The existing restriction on ICU collations (that they cannot be
database-wide) side-steps the issue.

* Can said restriction somehow be lifted? That seems like it would be
a lot cleaner.

* Why have you disable this optimization?:

>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> +   if ((!sss->locale || sss->locale->deterministic) &&
> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)

I don't see why this is necessary. A non-deterministic collation
cannot indicate that bitwise identical strings are non-equal.

* Perhaps you should add a "Tip" referencing the feature to the
contrib/citext documentation.
--
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: NOT IN subquery optimization
Next
From: Chapman Flack
Date:
Subject: Re: proposal: variadic argument support for least, greatest function