Why do indexes and sorts use the database collation? - Mailing list pgsql-hackers

From Jeff Davis
Subject Why do indexes and sorts use the database collation?
Date
Msg-id b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
Whole thread Raw
Responses Re: Why do indexes and sorts use the database collation?
Re: Why do indexes and sorts use the database collation?
List pgsql-hackers
An "en_US" user doing:

   CREATE TABLE foo(t TEXT PRIMARY KEY);

is providing no indication that they want an index tailored to their
locale. Yet we are creating the index with the "en_US" collation and
therefore imposing huge performance costs (something like 2X slower
index build time than the "C" locale), and also huge dependency
versioning risks that could lead to index corruption and/or wrong
results.

Similarly, a user doing:

   SELECT DISTINCT t FROM bar;

is providing no indication that they care about the collation of "t"
(we are free to choose a HashAgg which makes no ordering guarantee at
all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
"en_US" locale, which is something like 2X slower than the "C" locale.

One of the strongest arguments for using a non-C collation in these
cases is the chance to use a non-deterministic collation, like a case-
insensitive one. But the database collation is always deterministic,
and all deterministic collations have exactly the same definition of
equality, so there's no reason not to use "C".

Another argument is that, if the column is the database collation and
the index is "C", then the index is unusable for text range scans, etc.
But there are two ways to solve that problem:

  1. Set the column collation to "C"; or
  2. Set the index collation to the database collation.

Range scans are often most useful when the text is not actually natural
language, but instead is some kind of formatted text representing
another type of thing, often in ASCII. In that case, the range scan is
really some kind of prefix search or partitioning, and the "C" locale
is probably the right thing to use, and #1 wins.

Granted, there are reasons to want an index to have a particular
collation, in which case it makes sense to opt-in to #2. But in the
common case, the high performance costs and dependency versioning risks
aren't worth it.

Thoughts?

Regards,
    Jeff Davis




pgsql-hackers by date:

Previous
From: jian he
Date:
Subject: Re: maybe a type_sanity. sql bug
Next
From: Michael Paquier
Date:
Subject: Re: A recent message added to pg_upgade