Re: Why do indexes and sorts use the database collation? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Why do indexes and sorts use the database collation?
Date
Msg-id 20231113221212.lgjhziuavnmapeu3@awork3.anarazel.de
Whole thread Raw
In response to Re: Why do indexes and sorts use the database collation?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Why do indexes and sorts use the database collation?
Re: Why do indexes and sorts use the database collation?
List pgsql-hackers
Hi,

On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
> I don't think we can just arbitrarily override the default because we
> happen to think "C" is going to be faster. If we could prove that using
> "C" is going to produce exactly the same results as for the implicit
> collation (for a given operation), then we can simply do that. Not sure
> if such proof is possible, though.

Yea, I don't know if there's any interesting cases where we could prove that.

I think there *are* interesting cases where we should prove that non-C
collations are identical. It's imo bonkers that we consider the
"collname.encname" collations distinct from the equivalent "collname"
collation.

We document that we consider "collname" equivalent to
  "collname.<database encoding>":

> Within any particular database, only collations that use that database's
> encoding are of interest. Other entries in pg_collation are ignored. Thus, a
> stripped collation name such as de_DE can be considered unique within a
> given database even though it would not be unique globally. Use of the
> stripped collation names is recommended, since it will make one fewer thing
> you need to change if you decide to change to another database
> encoding. Note however that the default, C, and POSIX collations can be used
> regardless of the database encoding.

Followed by:


> PostgreSQL considers distinct collation objects to be incompatible even when
> they have identical properties. Thus for example, [...]  Mixing stripped and
> non-stripped collation names is therefore not recommended.

Why on earth are we solving this by having multiple pg_collation entries for
exactly the same collation, instead of normalizing the collation-name during
lookup by adding the relevant encoding name if not explicitly specified?  It
makes a lot of sense to not force the user to specify the encoding when it
can't differ.


It's imo similarly absurd that an index with "default" collation cannot be
used when specifying the equivalent collation explicitly in the query and vice
versa.




> >>> Also, wouldn't the intent to use a different collation for the column
> >>> be
> >>> expressed by changing the column's collation?
> >>
> >> The column collation expresses the semantics of that column. If the
> >> user has a database collation of "en_US", they should expect ORDER BY
> >> on that column to be according to that locale unless otherwise
> >> specified.
> >
> > That makes no sense to me. Either the user cares about ordering, in which case
> > the index needs to be in that ordering for efficient ORDER BY, or they don't,
> > in which neither index nor column needs a non-C collation. You partially
> > premised your argument on the content of primary keys typically making non-C
> > collations undesirable!
> >
>
> I may be missing something, but what's the disagreement here? If the
> user cares about ordering, they'll specify ORDER BY with either an
> explicit or the default collation. If the index collation matches, it
> may be useful for the ordering.
>
> Of course, if we feel entitled to create the primary key index with a
> collation of our choosing, that'd make this unpredictable.

Jeff was saying that textual primary keys typically don't need sorting and
because of that we could default to "C", for performance. Part of my response
was that I think the user's intent could be expressed by specifying the column
collation as "C" - to which Jeff replied that that would change the
semantics. Which, to me, seems to completely run counter to his argument that
we could just use "C" for such indexes.



> >>> - Teach the planner to use cheaper collations when ordering for
g> >>> reasons other
> >>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
> >>> joins).
> >>
> >> +1. Where "cheaper" comes from is an interesting question -- is it a
> >> property of the provider or the specific collation? Or do we just call
> >> "C" special?
> >
> > I'd think the specific collation. Even if we initially perhaps just get the
> > default cost from the provider such, it structurally seems the sanest place to
> > locate the cost.
> >
>
> ISTM it's about how complex the rules implemented by the collation are,
> so I agree the cost should be a feature of collations not providers.

I'm not sure analysing the complexity in detail is worth it. ISTM there's a
few "levels" of costliness:

1) memcmp() suffices
2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
3) deterministic collations
4) non-deterministic collations

I'm sure there are graduations, particularly within 3), but I'm not sure it's
realistic / worthwhile to go to that detail. I think a cost model like the
above would provide enough detail to make better decisions than today...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [PATCHES] Post-special page storage TDE support
Next
From: Melanie Plageman
Date:
Subject: lazy_scan_heap() should release lock on buffer before vacuuming FSM