> The libc collation provider is a bad default[1]. The builtin collation
> provider is a good default, so let's use that.
Agreed! I've been in so many situations where a libc collation being the
default has caused problems down the line, but never in one where it
being the default has been helpful.
> In the absence of specific user requirements, these factors weigh
> heavily in favor of the builtin collation provider, and heavily
> against libc.
Even worse, the current default uses whatever was set in the environment
of the session that invokes initdb. This is very unlikely to be the
default anyone wants, especially since these environment variables are
forwarded over ssh by default on Debian-based systems.
Me having sv_SE set on my local computer doesn't make it likely to be a
reasonable default locale if I ssh to a server to run initdb.
> The builtin provider uses code point order, i.e. memcmp(), so the
> final result display order is less human-friendly. For instance, 'Z'
> comes before 'a'.
> That problem is annoying, but *much* easier to fix than the other
> factors. The user might add a COLLATE clause to the final ORDER BY, or
> perform the sort in the application layer or presentation layer.
I'd say that this would be a _good_ feature of choosing a generic Unicode
collation by default. It's immediately obvious that you need to do something
if you want ordering according to some specific language's rules.
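As a quick illustration of what code point order looks like, here is a
small Python sketch. Python compares strings by code point, which I'm
using as a stand-in for the memcmp() behaviour described above; the word
list is made up for the example.

```python
# Made-up sample data mixing upper case, lower case and a non-ASCII letter.
words = ["apple", "Zebra", "banana", "Örebro"]

# Python compares str values code point by code point.
by_code_point = sorted(words)
print(by_code_point)  # ['Zebra', 'apple', 'banana', 'Örebro']

# Sorting the UTF-8 encoded bytes gives the same order, which is why a
# plain byte comparison is enough for code point ordering in UTF-8.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_bytes == by_code_point
```

'Zebra' landing before 'apple' is hard to miss, so nobody is left
believing the order follows their language's rules.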
> Furthermore, in the default case, we don't even really know which
> language and region to use. We infer it from the environment variable
> LC_COLLATE at initdb time, but that's a weak signal: there's little
> reason to think that the OS admin, DBA, and end user are all in the
> same locale.
If I'm a Turkish person working for a German company and my environment
variables happen to specify tr_TR when I run initdb, I have not made a
conscious choice. It may take years before someone reports an issue with
Ö being sorted after O instead of at the end of the alphabet, at which point
rectifying the situation can be unnecessarily tricky.
> I propose changing the default to PG_C_UTF8 because it seems simple
> and practical. However, I'm also fine with PG_UNICODE_FAST if those
> affected by the "full" case mapping find it helpful. "C" is also a
> possibility, but the query semantics suffer. All are better than libc.
These are great options for a default for initdb, since we don't have any
knowledge of which language-specific collation might be appropriate. Maybe we
should also document that it's recommended to set a locale when running
CREATE DATABASE unless the builtin semantics are fine?
--
Anders Åstrand
Percona