Re: Why do indexes and sorts use the database collation? - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Why do indexes and sorts use the database collation? |
Date | |
Msg-id | e23815b9-3ed4-2ed0-62b0-e36648d35e5f@enterprisedb.com Whole thread Raw |
In response to | Re: Why do indexes and sorts use the database collation? (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Why do indexes and sorts use the database collation?
|
List | pgsql-hackers |
On 11/13/23 23:12, Andres Freund wrote: > Hi, > > On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote: >> I don't think we can just arbitrarily override the default because we >> happen to think "C" is going to be faster. If we could prove that using >> "C" is going to produce exactly the same results as for the implicit >> collation (for a given operation), then we can simply do that. Not sure >> if such proof is possible, though. > > Yea, I don't know if there's any interesting cases where we could prove that. > > I think there *are* interesting cases where we should prove that non-C > collations are identical. It's imo bonkers that we consider the > "collname.encname" collations distinct from the equivalent "collname" > collation. > Yeah, I agree that seems a bit ... strange. > We document that we consider "collname" equivalent to > "collname.<database encoding>": > >> Within any particular database, only collations that use that database's >> encoding are of interest. Other entries in pg_collation are ignored. Thus, a >> stripped collation name such as de_DE can be considered unique within a >> given database even though it would not be unique globally. Use of the >> stripped collation names is recommended, since it will make one fewer thing >> you need to change if you decide to change to another database >> encoding. Note however that the default, C, and POSIX collations can be used >> regardless of the database encoding. > > Followed by: > > >> PostgreSQL considers distinct collation objects to be incompatible even when >> they have identical properties. Thus for example, [...] Mixing stripped and >> non-stripped collation names is therefore not recommended. > > Why on earth are we solving this by having multiple pg_collation entries for > exactly the same collation, instead of normalizing the collation-name during > lookup by adding the relevant encoding name if not explicitly specified? It > makes a lot of sense to not force the user to specify the encoding when it > can't differ. > True, insisting on having multiple separate entries for the same collation (and not recognizing which collations are the same) seems somewhat inconvenient. > > It's imo similarly absurd that an index with "default" collation cannot be > used when specifying the equivalent collation explicitly in the query and vice > versa. > Right. Having to spell COLLATE "default" and not the actual collation it references to is weird. Similarly, I just realized that the collation name in pg_database and pg_collation are not quite consistent. Consider this: select datcollate from pg_database where datname = 'test'; datcollate ------------ C.UTF-8 (1 row) but then test=# select * from t where c = 'x' collate "C.UTF-8"; ERROR: collation "C.UTF-8" for encoding "UTF8" does not exist LINE 1: select * from t where c = 'x' collate "C.UTF-8"; because the collation is actually known as C.utf8. > > > >>>>> Also, wouldn't the intent to use a different collation for the column >>>>> be >>>>> expressed by changing the column's collation? >>>> >>>> The column collation expresses the semantics of that column. If the >>>> user has a database collation of "en_US", they should expect ORDER BY >>>> on that column to be according to that locale unless otherwise >>>> specified. >>> >>> That makes no sense to me. Either the user cares about ordering, in which case >>> the index needs to be in that ordering for efficient ORDER BY, or they don't, >>> in which neither index nor column needs a non-C collation. You partially >>> premised your argument on the content of primary keys typically making non-C >>> collations undesirable! >>> >> >> I may be missing something, but what's the disagreement here? If the >> user cares about ordering, they'll specify ORDER BY with either an >> explicit or the default collation. If the index collation matches, it >> may be useful for the ordering. >> >> Of course, if we feel entitled to create the primary key index with a >> collation of our choosing, that'd make this unpredictable. > > Jeff was saying that textual primary keys typically don't need sorting and > because of that we could default to "C", for performance. Part of my response > was that I think the user's intent could be expressed by specifying the column > collation as "C" - to which Jeff replied that that would change the > semantics. Which, to me, seems to completely run counter to his argument that > we could just use "C" for such indexes. > True. I think that's somewhat self-contradictory argument. It's not clear to me if the argument is meant to apply to indexes on all columns or just those backing primary keys, but I guess it's the latter. But that (forcing users to specify collation for PK columns, while using the default for non-PK columns) seems like a recipe for subtle bugs in applications. > > >>>>> - Teach the planner to use cheaper collations when ordering for > g> >>> reasons other >>>>> than the user's direct request (e.g. DISTINCT/GROUP BY, merge >>>>> joins). >>>> >>>> +1. Where "cheaper" comes from is an interesting question -- is it a >>>> property of the provider or the specific collation? Or do we just call >>>> "C" special? >>> >>> I'd think the specific collation. Even if we initially perhaps just get the >>> default cost from the provider such, it structurally seems the sanest place to >>> locate the cost. >>> >> >> ISTM it's about how complex the rules implemented by the collation are, >> so I agree the cost should be a feature of collations not providers. > > I'm not sure analysing the complexity in detail is worth it. ISTM there's a > few "levels" of costliness: > > 1) memcmp() suffices > 2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort > 3) deterministic collations > 4) non-deterministic collations > > I'm sure there are graduations, particularly within 3), but I'm not sure it's > realistic / worthwhile to go to that detail. I think a cost model like the > above would provide enough detail to make better decisions than today... > I'm not saying we have to analyze the complexity of the rules. I was simply agreeing with you that the "cost" should be associated with individual collations, not the providers. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date:
Previous
From: Peter GeogheganDate:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Next
From: Peter SmithDate:
Subject: Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)