Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Date
Msg-id CA+TgmoaRTar_j6SjP6c-ZbMdL6X0U52yJgn-=yEyW1qc17BAkA@mail.gmail.com
In response to Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings  (Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Fri, Jun 9, 2017 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 6/9/17 11:12, Tom Lane wrote:
>>> https://www.postgresql.org/message-id/27064.1134753128@sss.pgh.pa.us
>
>> Good to know.  That just says that if we were to go with the strcoll()
>> result only, things would work correctly.
>
> There's still the hashing problem.

Tom, that mailing list discussion is very illuminating.  Thanks for
digging it up.

Regarding the question of hashing, one way to support that would be if
we had some sort of canonicalization function.  IOW, suppose there
were a collation API call distill() which had the property that
strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
under that collation.  Then, you could define your hash function as
hash_any(distill(X)).  Alternatively, if the collation library
provided its own hashing function, that would be fine too, and
probably faster.
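
To make the distill() idea concrete, here is a rough sketch in Python
rather than backend C.  It is not ICU or PostgreSQL code.  It assumes a
toy collation in which two strings are equal when they are canonically
equivalent ignoring case, loosely following the shape of Unicode's
canonical caseless matching.  The names distill() and collated_hash()
are hypothetical, and hashlib.sha256 merely stands in for hash_any().

```python
import hashlib
import unicodedata


def distill(s: str) -> bytes:
    """Hypothetical canonicalization function for a toy collation.

    Assumption: the collation treats strings as equal when they are
    canonically equivalent ignoring case. Every member of such an
    equivalence class is mapped to the same byte string, so a bytewise
    comparison of distilled values decides collation equality.
    """
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()
    # Case folding can disturb normalization, so normalize once more.
    return unicodedata.normalize("NFD", folded).encode("utf-8")


def collated_hash(s: str) -> int:
    """Hash the distilled form, so collation-equal strings hash alike.

    sha256 is only a stand-in for a real hash function such as
    hash_any(); the first four bytes are kept to mimic a 32-bit hash.
    """
    digest = hashlib.sha256(distill(s)).digest()
    return int.from_bytes(digest[:4], "big")


# Two spellings that differ bytewise but are equal under the toy collation:
precomposed = "Caf\u00e9"    # 'é' as a single code point, capital C
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent
same_bucket = collated_hash(precomposed) == collated_hash(decomposed)
```

The point of the sketch is only that hashing distill(X) instead of X
keeps the hash function consistent with collation equality, which is the
property a hash opfamily would need.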

On the other hand, is there any rule that says we have to support
hashing?  Certainly, if we defined a new datatype collated_text, it
could have a btree opfamily and no hash opfamily.  It's trickier with
only one datatype, but possibly we could come up with a way for an
opfamily to be consulted about whether it is available for a given
choice of collation.  I'm not exactly sure what is possible or
desirable, but I would not be too surprised to hear complaints about
the observed behavior differing from the "pure" ICU behavior because
of the tiebreak, and at least some users might even find it worth
giving up hashing in order to get the exact sort order they need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


