Re: insensitive collations - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: insensitive collations |
Date | |
Msg-id | CAH2-Wzm2DhYvHYzOqD49rdkQ_MJ-Vhqvv3_noow6ui+R35goZw@mail.gmail.com Whole thread Raw |
In response to | Re: insensitive collations (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>) |
Responses |
Re: insensitive collations
|
List | pgsql-hackers |
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > Another patch to address merge conflicts. Some remarks on this: * Your draft commit message says: > This patch makes changes in three areas: > > - CREATE COLLATION DDL changes and system catalog changes to support > this new flag. > > - Many executor nodes and auxiliary code are extended to track > collations. Previously, this code would just throw away collation > information, because the eventually-called user-defined functions > didn't use it since they only cared about equality, which didn't > need collation information. > > - String data type functions that do equality comparisons and hashing > are changed to take the (non-)deterministic flag into account. For > comparison, this just means skipping various shortcuts and tie > breakers that use byte-wise comparison. For hashing, we first need > to convert the input string to a canonical "sort key" using the ICU > analogue of strxfrm(). I wonder if it would be better to break this into distinct commits? * Why is support for non-deterministic comparisons limited to the ICU provider? If that's the only case that could possibly be affected, then why did we ever add the varstrcmp() tie-breaker call to strcmp() in the first place? The behavior described in the commit message of bugfix commit 656beff5 describes a case where Hungarian text caused index corruption by being strcoll()-wise equal but not bitwise equal. Besides, I think that you can vendor your own case insensitive collation with glibc, since it's based on UCA. This restriction feels like it's actually a kludge to work around the fact that database-wide collations have this foreign key related restriction in your patch: > @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid, > } > > /* > - * Apply the comparison operator. We assume it doesn't care about > - * collations. > - */ > - return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo, > - oldvalue, newvalue)); > + * Apply the comparison operator. > + * > + * Note: This function is part of a call stack that determines whether an > + * update to a row is significant enough that it needs checking or action > + * on the other side of a foreign-key constraint. Therefore, the > + * comparison here would need to be done with the collation of the *other* > + * table. For simplicity (e.g., we might not even have the other table > + * open), we'll just use the default collation here, which could lead to > + * some false negatives. All this would break if we ever allow > + * database-wide collations to be nondeterministic. > + */ > + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo, > + DEFAULT_COLLATION_OID, > + oldvalue, newvalue)); > } The existing restriction on ICU collations (that they cannot be database-wide) side-steps the issue. * Can said restriction somehow be lifted? That seems like it would be a lot cleaner. * Why have you disable this optimization?: > /* Fast pre-check for equality, as discussed in varstr_cmp() */ > - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0) > + if ((!sss->locale || sss->locale->deterministic) && > + len1 == len2 && memcmp(a1p, a2p, len1) == 0) I don't see why this is necessary. A non-deterministic collation cannot indicate that bitwise identical strings are non-equal. * Perhaps you should add a "Tip" referencing the feature to the contrib/citext documentation. -- Peter Geoghegan
pgsql-hackers by date: