Re: Why do indexes and sorts use the database collation? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Why do indexes and sorts use the database collation?
Date
Msg-id e23815b9-3ed4-2ed0-62b0-e36648d35e5f@enterprisedb.com
Whole thread Raw
In response to Re: Why do indexes and sorts use the database collation?  (Andres Freund <andres@anarazel.de>)
Responses Re: Why do indexes and sorts use the database collation?
List pgsql-hackers
On 11/13/23 23:12, Andres Freund wrote:
> Hi,
> 
> On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
>> I don't think we can just arbitrarily override the default because we
>> happen to think "C" is going to be faster. If we could prove that using
>> "C" is going to produce exactly the same results as for the implicit
>> collation (for a given operation), then we can simply do that. Not sure
>> if such proof is possible, though.
> 
> Yea, I don't know if there's any interesting cases where we could prove that.
> 
> I think there *are* interesting cases where we should prove that non-C
> collations are identical. It's imo bonkers that we consider the
> "collname.encname" collations distinct from the equivalent "collname"
> collation.
> 

Yeah, I agree that seems a bit ... strange.

> We document that we consider "collname" equivalent to
>   "collname.<database encoding>":
> 
>> Within any particular database, only collations that use that database's
>> encoding are of interest. Other entries in pg_collation are ignored. Thus, a
>> stripped collation name such as de_DE can be considered unique within a
>> given database even though it would not be unique globally. Use of the
>> stripped collation names is recommended, since it will make one fewer thing
>> you need to change if you decide to change to another database
>> encoding. Note however that the default, C, and POSIX collations can be used
>> regardless of the database encoding.
> 
> Followed by:
> 
> 
>> PostgreSQL considers distinct collation objects to be incompatible even when
>> they have identical properties. Thus for example, [...]  Mixing stripped and
>> non-stripped collation names is therefore not recommended.
> 
> Why on earth are we solving this by having multiple pg_collation entries for
> exactly the same collation, instead of normalizing the collation-name during
> lookup by adding the relevant encoding name if not explicitly specified?  It
> makes a lot of sense to not force the user to specify the encoding when it
> can't differ.
> 

True, insisting on having multiple separate entries for the same
collation (and not recognizing which collations are the same) seems
somewhat inconvenient.

> 
> It's imo similarly absurd that an index with "default" collation cannot be
> used when specifying the equivalent collation explicitly in the query and vice
> versa.
> 

Right. Having to spell

    COLLATE "default"

and not the actual collation it references to is weird. Similarly, I
just realized that the collation name in pg_database and pg_collation
are not quite consistent. Consider this:

select datcollate from pg_database where datname = 'test';

 datcollate
------------
 C.UTF-8
(1 row)

but then

  test=# select * from t where c = 'x' collate "C.UTF-8";
  ERROR:  collation "C.UTF-8" for encoding "UTF8" does not exist
  LINE 1: select * from t where c = 'x' collate "C.UTF-8";

because the collation is actually known as C.utf8.


> 
> 
> 
>>>>> Also, wouldn't the intent to use a different collation for the column
>>>>> be
>>>>> expressed by changing the column's collation?
>>>>
>>>> The column collation expresses the semantics of that column. If the
>>>> user has a database collation of "en_US", they should expect ORDER BY
>>>> on that column to be according to that locale unless otherwise
>>>> specified.
>>>
>>> That makes no sense to me. Either the user cares about ordering, in which case
>>> the index needs to be in that ordering for efficient ORDER BY, or they don't,
>>> in which neither index nor column needs a non-C collation. You partially
>>> premised your argument on the content of primary keys typically making non-C
>>> collations undesirable!
>>>
>>
>> I may be missing something, but what's the disagreement here? If the
>> user cares about ordering, they'll specify ORDER BY with either an
>> explicit or the default collation. If the index collation matches, it
>> may be useful for the ordering.
>>
>> Of course, if we feel entitled to create the primary key index with a
>> collation of our choosing, that'd make this unpredictable.
> 
> Jeff was saying that textual primary keys typically don't need sorting and
> because of that we could default to "C", for performance. Part of my response
> was that I think the user's intent could be expressed by specifying the column
> collation as "C" - to which Jeff replied that that would change the
> semantics. Which, to me, seems to completely run counter to his argument that
> we could just use "C" for such indexes.
> 

True. I think that's somewhat self-contradictory argument.

It's not clear to me if the argument is meant to apply to indexes on all
columns or just those backing primary keys, but I guess it's the latter.
But that (forcing users to specify collation for PK columns, while using
the default for non-PK columns) seems like a recipe for subtle bugs in
applications.

> 
> 
>>>>> - Teach the planner to use cheaper collations when ordering for
> g> >>> reasons other
>>>>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
>>>>> joins).
>>>>
>>>> +1. Where "cheaper" comes from is an interesting question -- is it a
>>>> property of the provider or the specific collation? Or do we just call
>>>> "C" special?
>>>
>>> I'd think the specific collation. Even if we initially perhaps just get the
>>> default cost from the provider such, it structurally seems the sanest place to
>>> locate the cost.
>>>
>>
>> ISTM it's about how complex the rules implemented by the collation are,
>> so I agree the cost should be a feature of collations not providers.
> 
> I'm not sure analysing the complexity in detail is worth it. ISTM there's a
> few "levels" of costliness:
> 
> 1) memcmp() suffices
> 2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
> 3) deterministic collations
> 4) non-deterministic collations
> 
> I'm sure there are graduations, particularly within 3), but I'm not sure it's
> realistic / worthwhile to go to that detail. I think a cost model like the
> above would provide enough detail to make better decisions than today...
> 

I'm not saying we have to analyze the complexity of the rules. I was
simply agreeing with you that the "cost" should be associated with
individual collations, not the providers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Next
From: Peter Smith
Date:
Subject: Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)