Thread: Why do indexes and sorts use the database collation?

Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
An "en_US" user doing:

   CREATE TABLE foo(t TEXT PRIMARY KEY);

is providing no indication that they want an index tailored to their
locale. Yet we are creating the index with the "en_US" collation and
therefore imposing huge performance costs (something like 2X slower
index build time than the "C" locale), and also huge dependency
versioning risks that could lead to index corruption and/or wrong
results.

Similarly, a user doing:

   SELECT DISTINCT t FROM bar;

is providing no indication that they care about the collation of "t"
(we are free to choose a HashAgg which makes no ordering guarantee at
all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
"en_US" locale, which is something like 2X slower than the "C" locale.

One of the strongest arguments for using a non-C collation in these
cases is the chance to use a non-deterministic collation, like a case-
insensitive one. But the database collation is always deterministic,
and all deterministic collations have exactly the same definition of
equality, so there's no reason not to use "C".

Another argument is that, if the column is the database collation and
the index is "C", then the index is unusable for text range scans, etc.
But there are two ways to solve that problem:

  1. Set the column collation to "C"; or
  2. Set the index collation to the database collation.

Range scans are often most useful when the text is not actually natural
language, but instead is some kind of formatted text representing
another type of thing, often in ASCII. In that case, the range scan is
really some kind of prefix search or partitioning, and the "C" locale
is probably the right thing to use, and #1 wins.

Granted, there are reasons to want an index to have a particular
collation, in which case it makes sense to opt-in to #2. But in the
common case, the high performance costs and dependency versioning risks
aren't worth it.

Thoughts?

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Andres Freund
Date:
Hi,

On 2023-11-10 16:03:16 -0800, Jeff Davis wrote:
> An "en_US" user doing:
>
>    CREATE TABLE foo(t TEXT PRIMARY KEY);
>
> is providing no indication that they want an index tailored to their
> locale. Yet we are creating the index with the "en_US" collation and
> therefore imposing huge performance costs (something like 2X slower
> index build time than the "C" locale), and also huge dependency
> versioning risks that could lead to index corruption and/or wrong
> results.

I guess you are arguing that the user didn't intend to create an index here? I
don't think that is true - users know that pkeys create indexes. If we used C
here, users would often need to create a second index on the same column using
the actual database collation - I think we'd very commonly end up with
complaints that the pkey index doesn't work, because it's been automatically
created with a different collation than the column.

Also, wouldn't the intent to use a different collation for the column be
expressed by changing the column's collation?


> Similarly, a user doing:
>
>    SELECT DISTINCT t FROM bar;
>
> is providing no indication that they care about the collation of "t"
> (we are free to choose a HashAgg which makes no ordering guarantee at
> all). Yet if we choose Sort+GroupAgg, the Sort will be performed in the
> "en_US" locale, which is something like 2X slower than the "C" locale.

OTOH, if we are choosing a groupagg, we might be able to implement that using
an index, which is more likey to exist in the databases collation.  Looks like
we even just look for indexes that are in the database collation.

Might be worth teaching the planner additional smarts here.


> Thoughts?

I seriously doubt its a good idea to change which collations primary keys use
by default.  But I think there's a decent bit of work we could do in the
planner, e.g:

- Teach the planner to take collation costs into account for costing - right
  now index scans with "C" cost the same as index scans with more expensive
  collations. That seems wrong even for equality lookups and would make it
  hard to make improvements to prefer cheaper collations in other situations.

- Teach the planner to use cheaper collations when ordering for reasons other
  than the user's direct request (e.g. DISTINCT/GROUP BY, merge joins).
  

I think we should also explain in our docs that C can be considerably faster -
I couldn't find anything in a quick look.

Greetings,

Andres Freund



Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Fri, 2023-11-10 at 17:19 -0800, Andres Freund wrote:
> I guess you are arguing that the user didn't intend to create an
> index here?

No, obviously the user should expect an index when a primary key is
created. But that doesn't mean that it necessarily needs to be ordered
according to the database collation.

Unfortunately, right now the planner doesn't understand that an index
in the "C" locale can satisfy equality searches and constraint
enforcement for "en_US" (or any other deterministic collation). That's
probably the first thing to fix.

Inequalities and ORDER BYs can't benefit from an index with a different
collation, but lots of indexes don't need that.

> Also, wouldn't the intent to use a different collation for the column
> be
> expressed by changing the column's collation?

The column collation expresses the semantics of that column. If the
user has a database collation of "en_US", they should expect ORDER BY
on that column to be according to that locale unless otherwise
specified.

That doesn't imply that indexes must have a matching collation. In fact
we already allow the column and index collations to differ, it just
doesn't work as well as it should.

>
> OTOH, if we are choosing a groupagg, we might be able to implement
> that using
> an index, which is more likey to exist in the databases collation. 
> Looks like
> we even just look for indexes that are in the database collation.
>
> Might be worth teaching the planner additional smarts here.

Yeah, we don't need to force anything, we could just create a few paths
with appropriate path key information and cost them.

>
> - Teach the planner to take collation costs into account for costing

+1. I noticed that GroupAgg+Sort is often in the same ballpark as
HashAgg in runtime when the collation is "C", but HashAgg is way faster
when the collation is something else.

> - Teach the planner to use cheaper collations when ordering for
> reasons other
>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
> joins).

+1. Where "cheaper" comes from is an interesting question -- is it a
property of the provider or the specific collation? Or do we just call
"C" special?

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Peter Eisentraut
Date:
On 11.11.23 01:03, Jeff Davis wrote:
> But the database collation is always deterministic,

So far!



Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Mon, 2023-11-13 at 13:43 +0100, Peter Eisentraut wrote:
> On 11.11.23 01:03, Jeff Davis wrote:
> > But the database collation is always deterministic,
>
> So far!

Yeah, if we did that, clearly the index collation would need to match
that of the database to be useful. What are the main challenges in
allowing non-deterministic collations at the database level?

If someone opts into a collation (and surely a non-deterministic
collation would be opt-in), then I think it makes sense that they
accept some performance costs and dependency versioning risks for the
functionality.

My point still stands that all deterministic collations are, at least
for equality, identical.

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Andres Freund
Date:
Hi,

On 2023-11-11 23:19:55 -0800, Jeff Davis wrote:
> On Fri, 2023-11-10 at 17:19 -0800, Andres Freund wrote:
> > I guess you are arguing that the user didn't intend to create an
> > index here?
>
> No, obviously the user should expect an index when a primary key is
> created. But that doesn't mean that it necessarily needs to be ordered
> according to the database collation.
>
> Unfortunately, right now the planner doesn't understand that an index
> in the "C" locale can satisfy equality searches and constraint
> enforcement for "en_US" (or any other deterministic collation). That's
> probably the first thing to fix.
>
> Inequalities and ORDER BYs can't benefit from an index with a different
> collation, but lots of indexes don't need that.

But we don't know whether the index is used for that. If we just change the
behaviour, there will be lots of pain around upgrades, because queries will
continue to work but be dog slow.


> > Also, wouldn't the intent to use a different collation for the column
> > be
> > expressed by changing the column's collation?
>
> The column collation expresses the semantics of that column. If the
> user has a database collation of "en_US", they should expect ORDER BY
> on that column to be according to that locale unless otherwise
> specified.

That makes no sense to me. Either the user cares about ordering, in which case
the index needs to be in that ordering for efficient ORDER BY, or they don't,
in which neither index nor column needs a non-C collation. You partially
premised your argument on the content of primary keys typically making non-C
collations undesirable!


> > OTOH, if we are choosing a groupagg, we might be able to implement
> > that using
> > an index, which is more likey to exist in the databases collation.
> > Looks like
> > we even just look for indexes that are in the database collation.
> >
> > Might be worth teaching the planner additional smarts here.
>
> Yeah, we don't need to force anything, we could just create a few paths
> with appropriate path key information and cost them.

I'm not sure it's quite that easy. One issue is obviously that this could lead
to a huge increase in paths we need to keep around due to differing path
keys. We might need to be a bit more aggressive about pruning such paths than
I think we would be today.


> > - Teach the planner to use cheaper collations when ordering for
> > reasons other
> >   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
> > joins).
>
> +1. Where "cheaper" comes from is an interesting question -- is it a
> property of the provider or the specific collation? Or do we just call
> "C" special?

I'd think the specific collation. Even if we initially perhaps just get the
default cost from the provider such, it structurally seems the sanest place to
locate the cost.

Greetings,

Andres Freund



Re: Why do indexes and sorts use the database collation?

From
Tomas Vondra
Date:

On 11/13/23 19:02, Andres Freund wrote:
> Hi,
> 
> On 2023-11-11 23:19:55 -0800, Jeff Davis wrote:
>> On Fri, 2023-11-10 at 17:19 -0800, Andres Freund wrote:
>>> I guess you are arguing that the user didn't intend to create an
>>> index here?
>>
>> No, obviously the user should expect an index when a primary key is
>> created. But that doesn't mean that it necessarily needs to be ordered
>> according to the database collation.
>>
>> Unfortunately, right now the planner doesn't understand that an index
>> in the "C" locale can satisfy equality searches and constraint
>> enforcement for "en_US" (or any other deterministic collation). That's
>> probably the first thing to fix.
>>
>> Inequalities and ORDER BYs can't benefit from an index with a different
>> collation, but lots of indexes don't need that.
> 
> But we don't know whether the index is used for that. If we just change the
> behaviour, there will be lots of pain around upgrades, because queries will
> continue to work but be dog slow.
> 

Yeah. I don't quite agree with the initial argument that not specifying
the collation explicitly in CREATE TABLE or a query means the user does
not care about the collation. We do have the sensible behavior that if
you don't specify a collation, you get the database one as a default.

I don't think we can just arbitrarily override the default because we
happen to think "C" is going to be faster. If we could prove that using
"C" is going to produce exactly the same results as for the implicit
collation (for a given operation), then we can simply do that. Not sure
if such proof is possible, though.

For example, I don't see how we could arbitrarily override the collation
for indexes backing primary keys, because how would you know the user
will never do a sort on it? Not that uncommon with natural primary keys,
I think (not a great practice, but people do that).

Perhaps we could allow the PK index to have a different collation, say
by supporting something like this:

  ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");

And then the planner would just pick the right index, I think.

> 
>>> Also, wouldn't the intent to use a different collation for the column
>>> be
>>> expressed by changing the column's collation?
>>
>> The column collation expresses the semantics of that column. If the
>> user has a database collation of "en_US", they should expect ORDER BY
>> on that column to be according to that locale unless otherwise
>> specified.
> 
> That makes no sense to me. Either the user cares about ordering, in which case
> the index needs to be in that ordering for efficient ORDER BY, or they don't,
> in which neither index nor column needs a non-C collation. You partially
> premised your argument on the content of primary keys typically making non-C
> collations undesirable!
> 

I may be missing something, but what's the disagreement here? If the
user cares about ordering, they'll specify ORDER BY with either an
explicit or the default collation. If the index collation matches, it
may be useful for the ordering.

Of course, if we feel entitled to create the primary key index with a
collation of our choosing, that'd make this unpredictable.

> 
>>> OTOH, if we are choosing a groupagg, we might be able to implement
>>> that using
>>> an index, which is more likey to exist in the databases collation.
>>> Looks like
>>> we even just look for indexes that are in the database collation.
>>>
>>> Might be worth teaching the planner additional smarts here.
>>
>> Yeah, we don't need to force anything, we could just create a few paths
>> with appropriate path key information and cost them.
> 
> I'm not sure it's quite that easy. One issue is obviously that this could lead
> to a huge increase in paths we need to keep around due to differing path
> keys. We might need to be a bit more aggressive about pruning such paths than
> I think we would be today.
> 

Right. There's also the challenge that we determine "interesting
pathkeys" very early, and I'm not sure if we can decide which pathkeys
(for different collations) are cheaper at that point.

> 
>>> - Teach the planner to use cheaper collations when ordering for
>>> reasons other
>>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
>>> joins).
>>
>> +1. Where "cheaper" comes from is an interesting question -- is it a
>> property of the provider or the specific collation? Or do we just call
>> "C" special?
> 
> I'd think the specific collation. Even if we initially perhaps just get the
> default cost from the provider such, it structurally seems the sanest place to
> locate the cost.
> 

ISTM it's about how complex the rules implemented by the collation are,
so I agree the cost should be a feature of collations not providers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why do indexes and sorts use the database collation?

From
Andres Freund
Date:
Hi,

On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
> I don't think we can just arbitrarily override the default because we
> happen to think "C" is going to be faster. If we could prove that using
> "C" is going to produce exactly the same results as for the implicit
> collation (for a given operation), then we can simply do that. Not sure
> if such proof is possible, though.

Yea, I don't know if there's any interesting cases where we could prove that.

I think there *are* interesting cases where we should prove that non-C
collations are identical. It's imo bonkers that we consider the
"collname.encname" collations distinct from the equivalent "collname"
collation.

We document that we consider "collname" equivalent to
  "collname.<database encoding>":

> Within any particular database, only collations that use that database's
> encoding are of interest. Other entries in pg_collation are ignored. Thus, a
> stripped collation name such as de_DE can be considered unique within a
> given database even though it would not be unique globally. Use of the
> stripped collation names is recommended, since it will make one fewer thing
> you need to change if you decide to change to another database
> encoding. Note however that the default, C, and POSIX collations can be used
> regardless of the database encoding.

Followed by:


> PostgreSQL considers distinct collation objects to be incompatible even when
> they have identical properties. Thus for example, [...]  Mixing stripped and
> non-stripped collation names is therefore not recommended.

Why on earth are we solving this by having multiple pg_collation entries for
exactly the same collation, instead of normalizing the collation-name during
lookup by adding the relevant encoding name if not explicitly specified?  It
makes a lot of sense to not force the user to specify the encoding when it
can't differ.


It's imo similarly absurd that an index with "default" collation cannot be
used when specifying the equivalent collation explicitly in the query and vice
versa.




> >>> Also, wouldn't the intent to use a different collation for the column
> >>> be
> >>> expressed by changing the column's collation?
> >>
> >> The column collation expresses the semantics of that column. If the
> >> user has a database collation of "en_US", they should expect ORDER BY
> >> on that column to be according to that locale unless otherwise
> >> specified.
> >
> > That makes no sense to me. Either the user cares about ordering, in which case
> > the index needs to be in that ordering for efficient ORDER BY, or they don't,
> > in which neither index nor column needs a non-C collation. You partially
> > premised your argument on the content of primary keys typically making non-C
> > collations undesirable!
> >
>
> I may be missing something, but what's the disagreement here? If the
> user cares about ordering, they'll specify ORDER BY with either an
> explicit or the default collation. If the index collation matches, it
> may be useful for the ordering.
>
> Of course, if we feel entitled to create the primary key index with a
> collation of our choosing, that'd make this unpredictable.

Jeff was saying that textual primary keys typically don't need sorting and
because of that we could default to "C", for performance. Part of my response
was that I think the user's intent could be expressed by specifying the column
collation as "C" - to which Jeff replied that that would change the
semantics. Which, to me, seems to completely run counter to his argument that
we could just use "C" for such indexes.



> >>> - Teach the planner to use cheaper collations when ordering for
g> >>> reasons other
> >>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
> >>> joins).
> >>
> >> +1. Where "cheaper" comes from is an interesting question -- is it a
> >> property of the provider or the specific collation? Or do we just call
> >> "C" special?
> >
> > I'd think the specific collation. Even if we initially perhaps just get the
> > default cost from the provider such, it structurally seems the sanest place to
> > locate the cost.
> >
>
> ISTM it's about how complex the rules implemented by the collation are,
> so I agree the cost should be a feature of collations not providers.

I'm not sure analysing the complexity in detail is worth it. ISTM there's a
few "levels" of costliness:

1) memcmp() suffices
2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
3) deterministic collations
4) non-deterministic collations

I'm sure there are graduations, particularly within 3), but I'm not sure it's
realistic / worthwhile to go to that detail. I think a cost model like the
above would provide enough detail to make better decisions than today...

Greetings,

Andres Freund



Re: Why do indexes and sorts use the database collation?

From
Tomas Vondra
Date:
On 11/13/23 23:12, Andres Freund wrote:
> Hi,
> 
> On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
>> I don't think we can just arbitrarily override the default because we
>> happen to think "C" is going to be faster. If we could prove that using
>> "C" is going to produce exactly the same results as for the implicit
>> collation (for a given operation), then we can simply do that. Not sure
>> if such proof is possible, though.
> 
> Yea, I don't know if there's any interesting cases where we could prove that.
> 
> I think there *are* interesting cases where we should prove that non-C
> collations are identical. It's imo bonkers that we consider the
> "collname.encname" collations distinct from the equivalent "collname"
> collation.
> 

Yeah, I agree that seems a bit ... strange.

> We document that we consider "collname" equivalent to
>   "collname.<database encoding>":
> 
>> Within any particular database, only collations that use that database's
>> encoding are of interest. Other entries in pg_collation are ignored. Thus, a
>> stripped collation name such as de_DE can be considered unique within a
>> given database even though it would not be unique globally. Use of the
>> stripped collation names is recommended, since it will make one fewer thing
>> you need to change if you decide to change to another database
>> encoding. Note however that the default, C, and POSIX collations can be used
>> regardless of the database encoding.
> 
> Followed by:
> 
> 
>> PostgreSQL considers distinct collation objects to be incompatible even when
>> they have identical properties. Thus for example, [...]  Mixing stripped and
>> non-stripped collation names is therefore not recommended.
> 
> Why on earth are we solving this by having multiple pg_collation entries for
> exactly the same collation, instead of normalizing the collation-name during
> lookup by adding the relevant encoding name if not explicitly specified?  It
> makes a lot of sense to not force the user to specify the encoding when it
> can't differ.
> 

True, insisting on having multiple separate entries for the same
collation (and not recognizing which collations are the same) seems
somewhat inconvenient.

> 
> It's imo similarly absurd that an index with "default" collation cannot be
> used when specifying the equivalent collation explicitly in the query and vice
> versa.
> 

Right. Having to spell

    COLLATE "default"

and not the actual collation it references to is weird. Similarly, I
just realized that the collation name in pg_database and pg_collation
are not quite consistent. Consider this:

select datcollate from pg_database where datname = 'test';

 datcollate
------------
 C.UTF-8
(1 row)

but then

  test=# select * from t where c = 'x' collate "C.UTF-8";
  ERROR:  collation "C.UTF-8" for encoding "UTF8" does not exist
  LINE 1: select * from t where c = 'x' collate "C.UTF-8";

because the collation is actually known as C.utf8.


> 
> 
> 
>>>>> Also, wouldn't the intent to use a different collation for the column
>>>>> be
>>>>> expressed by changing the column's collation?
>>>>
>>>> The column collation expresses the semantics of that column. If the
>>>> user has a database collation of "en_US", they should expect ORDER BY
>>>> on that column to be according to that locale unless otherwise
>>>> specified.
>>>
>>> That makes no sense to me. Either the user cares about ordering, in which case
>>> the index needs to be in that ordering for efficient ORDER BY, or they don't,
>>> in which neither index nor column needs a non-C collation. You partially
>>> premised your argument on the content of primary keys typically making non-C
>>> collations undesirable!
>>>
>>
>> I may be missing something, but what's the disagreement here? If the
>> user cares about ordering, they'll specify ORDER BY with either an
>> explicit or the default collation. If the index collation matches, it
>> may be useful for the ordering.
>>
>> Of course, if we feel entitled to create the primary key index with a
>> collation of our choosing, that'd make this unpredictable.
> 
> Jeff was saying that textual primary keys typically don't need sorting and
> because of that we could default to "C", for performance. Part of my response
> was that I think the user's intent could be expressed by specifying the column
> collation as "C" - to which Jeff replied that that would change the
> semantics. Which, to me, seems to completely run counter to his argument that
> we could just use "C" for such indexes.
> 

True. I think that's somewhat self-contradictory argument.

It's not clear to me if the argument is meant to apply to indexes on all
columns or just those backing primary keys, but I guess it's the latter.
But that (forcing users to specify collation for PK columns, while using
the default for non-PK columns) seems like a recipe for subtle bugs in
applications.

> 
> 
>>>>> - Teach the planner to use cheaper collations when ordering for
> g> >>> reasons other
>>>>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
>>>>> joins).
>>>>
>>>> +1. Where "cheaper" comes from is an interesting question -- is it a
>>>> property of the provider or the specific collation? Or do we just call
>>>> "C" special?
>>>
>>> I'd think the specific collation. Even if we initially perhaps just get the
>>> default cost from the provider such, it structurally seems the sanest place to
>>> locate the cost.
>>>
>>
>> ISTM it's about how complex the rules implemented by the collation are,
>> so I agree the cost should be a feature of collations not providers.
> 
> I'm not sure analysing the complexity in detail is worth it. ISTM there's a
> few "levels" of costliness:
> 
> 1) memcmp() suffices
> 2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
> 3) deterministic collations
> 4) non-deterministic collations
> 
> I'm sure there are graduations, particularly within 3), but I'm not sure it's
> realistic / worthwhile to go to that detail. I think a cost model like the
> above would provide enough detail to make better decisions than today...
> 

I'm not saying we have to analyze the complexity of the rules. I was
simply agreeing with you that the "cost" should be associated with
individual collations, not the providers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why do indexes and sorts use the database collation?

From
Andres Freund
Date:
Hi,

On 2023-11-14 00:02:13 +0100, Tomas Vondra wrote:
> On 11/13/23 23:12, Andres Freund wrote:
> > On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
> >> ISTM it's about how complex the rules implemented by the collation are,
> >> so I agree the cost should be a feature of collations not providers.
> > 
> > I'm not sure analysing the complexity in detail is worth it. ISTM there's a
> > few "levels" of costliness:
> > 
> > 1) memcmp() suffices
> > 2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
> > 3) deterministic collations
> > 4) non-deterministic collations
> > 
> > I'm sure there are graduations, particularly within 3), but I'm not sure it's
> > realistic / worthwhile to go to that detail. I think a cost model like the
> > above would provide enough detail to make better decisions than today...
> > 
> 
> I'm not saying we have to analyze the complexity of the rules. I was
> simply agreeing with you that the "cost" should be associated with
> individual collations, not the providers.

Just to be clear, I didn't intend to contradict you or anything - I was just
outlining my initial thoughts of how we could model the costs.

Greetings,

Andres Freund



Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Mon, 2023-11-13 at 10:02 -0800, Andres Freund wrote:
> > Inequalities and ORDER BYs can't benefit from an index with a
> > different
> > collation, but lots of indexes don't need that.
>
> But we don't know whether the index is used for that.

That will be hard to quantify, but perhaps we can discuss in terms of
the conditions that must be satisfied for pathkeys provided by an index
to be useful (the following is a bit fuzzy, let me know if there's
something major I'm missing):

  a. There needs to be an ORDER BY or inequality somewhere in a query
that involves that text field.
  b. The index must be correlated, or the data cached well enough, or
there must be a highly selective inequality.
  c. For ORDER BY, the result size needs to be large enough for the
pathkeys to really matter vs just sorting at the top of the plan.
  d. The pathkeys must be part of a winning plan (e.g. the winning plan
must keep the indexed field on the outer of a join, or use  MergeJoin).

In my experience, considering a text index speciifically: queries are
less likely to use inequalities on text fields than a timestamp field;
and indexes on text are less likely to be correlated than an index on a
timestamp field. That pushes down the probabilities of (a) and (b)
compared with timestamps. (Timestamps obviously don't have collation;
I'm just using timestamps as a point of reference where index pathkeys
are really useful.)

I know the above are hard to quantify (and not statistically
independent), but I don't think we should take it for granted that
pathkeys on a text index are overwhelmingly useful. I would describe
them as "sometimes useful".

> >
> That makes no sense to me. Either the user cares about ordering, in
> which case
> the index needs to be in that ordering for efficient ORDER BY

I disagree. The user may want top-level ORDER BYs on that field to
return 'a' before 'Z', and that's a very reasonable expectation that
requires a non-"C" collation.

But that does not imply much about what order an index on that field
should be. The fact that an ORDER BY exists satisfies only condition
(a) above. If the other conditions are not met, then the pathkeys
provided by the index are close to useless anyway.

The index itself might be useful for other reasons though, like
constraints or equality lookups. But indexes for those purposes don't
need to provide pathkeys.

> You partially
> premised your argument on the content of primary keys typically
> making non-C
> collations undesirable!

Primary keys requires (in a practical sense) an index to be created,
and that index should be useful for other purposes, too.

Equality lookups are clearly required to implement a primary key, so of
course the index should be useful for any other equality lookups as
well, because that has zero cost.

But "useful for other purposes" is not a blank check. Providing useful
pathkeys offers some marginal utility (assuming the conditions (a)-(e)
are satisfied), but also has a marginal cost (build time and versioning
risks). For typical cases I believe Postgres is on the wrong side of
that trade; that's all I'm saying.

> I'm not sure it's quite that easy. One issue is obviously that this
> could lead
> to a huge increase in paths we need to keep 

If there's a particularly bad case you have in mind, please let me
know. Otherwise we can sort the details out when it comes to a patch.

> >
> I'd think the specific collation. Even if we initially perhaps just
> get the
> default cost from the provider such, it structurally seems the sanest
> place to
> locate the cost.

Makes sense, though I'm thinking we'd still want to special case the
fastest collation as "C".

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Mon, 2023-11-13 at 22:36 +0100, Tomas Vondra wrote:
> Yeah. I don't quite agree with the initial argument that not
> specifying
> the collation explicitly in CREATE TABLE or a query means the user
> does
> not care about the collation.

I didn't argue that the user doesn't care about collation -- we need to
honor the collation semantics of the column. And a column with
unspecified collation must be the database collation (otherwise what
would the database collation mean?). But the index's collation is an
implementation detail that is not necessary to provide the requested
semantics.

I'm arguing that pathkeys are often not even useful for providing the
requested semantics, so why should the user have the pain of poor
performance and versioning risks for every text index in their system?
If the user just wants PK/FK constraints, and equality lookups, then an
index with the "C" collation makes a lot of sense to serve those
purposes.

> For example, I don't see how we could arbitrarily override the
> collation
> for indexes backing primary keys, because how would you know the user
> will never do a sort on it?

The column collation and index collation are tracked separately in the
catalog. The column collation cannot be overridden because it's
semantically signficant, but there are at least some degrees of freedom
we have with the index collation.

I don't think we can completely change the default index collation to
be "C", but perhaps there could be a database-level option to do so,
and that would have no effect on semantics at all. If the user notices
some queries that could benefit from an index with a non-"C" collation,
they can add/replace an index as they see fit.

>  Not that uncommon with natural primary keys,
> I think (not a great practice, but people do that).

Natural keys often have an uncorrelated index, and if the index is not
correlated, it's often not useful ORDER BY.

When I actually think about schemas and plans I've seen in the wild, I
struggle to think of many cases that would really benefit from an index
in a non-"C" collation. The best cases I can think of are where it's
doing some kind of prefix search. That's not rare, but it's also not so
common that I'd like to risk index corruption on every index in the
system by default in case a prefix search is performed.

> Perhaps we could allow the PK index to have a different collation,
> say
> by supporting something like this:
>
>   ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");

Yes, I'd like something like that to be supported. We'd have to check
that, if the collations are different, that both are deterministic.

> And then the planner would just pick the right index, I think.

Right now the planner doesn't seem to understand that an index in the
"C" collation works just fine for answering equality queries. That
should be fixed.

> If the
> user cares about ordering, they'll specify ORDER BY with either an
> explicit or the default collation. If the index collation matches, it
> may be useful for the ordering.

Exactly.

> Of course, if we feel entitled to create the primary key index with a
> collation of our choosing, that'd make this unpredictable.

I wouldn't describe it as "unpredictable". We'd have some defined way
of defaulting the collation of an index which might be affected by a
database option or something. In any case, it would be visible with \d.

Regards,
    Jeff Davis

>



Re: Why do indexes and sorts use the database collation?

From
Laurenz Albe
Date:
On Mon, 2023-11-13 at 22:36 +0100, Tomas Vondra wrote:
> Perhaps we could allow the PK index to have a different collation, say
> by supporting something like this:
>
>   ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");

An appealing idea!  While at it, we could add an INCLUDE clause...

The risk here would be extending standard syntax in a way that might
possibly conflict with future changes to the standard.

Yours,
Laurenz Albe



Re: Why do indexes and sorts use the database collation?

From
Tomas Vondra
Date:

On 11/14/23 02:58, Jeff Davis wrote:
> On Mon, 2023-11-13 at 22:36 +0100, Tomas Vondra wrote:
>> Yeah. I don't quite agree with the initial argument that not
>> specifying
>> the collation explicitly in CREATE TABLE or a query means the user
>> does
>> not care about the collation.
> 
> I didn't argue that the user doesn't care about collation -- we need to
> honor the collation semantics of the column. And a column with
> unspecified collation must be the database collation (otherwise what
> would the database collation mean?). But the index's collation is an
> implementation detail that is not necessary to provide the requested
> semantics.
> 
> I'm arguing that pathkeys are often not even useful for providing the
> requested semantics, so why should the user have the pain of poor
> performance and versioning risks for every text index in their system?
> If the user just wants PK/FK constraints, and equality lookups, then an
> index with the "C" collation makes a lot of sense to serve those
> purposes.
> 

Thanks for the clarification. I agree index's collation can be seen as
an implementation detail, as long as it produces the correct results
(with respect to the column's collation). I'm somewhat skeptical about
doing this automatically, because the collations may be equivalent only
for some operations (and we don't know what the user will do).

My concern is we'll decide to alter the index collation, and then the
user will face the consequences. Presumably we'd no generate incorrect
results, but we'd not be able use an index, causing performance issues.

AFAICS this is a trade-off between known benefits (faster equality
searches, which are common for PK columns) vs. unknown downsides
(performance penalty for operations with unknown frequency).

Not sure it's a decision we can make automatically. But it's mostly
achievable manually, if the user specifies COLLATE "C" for the column.
You're right that changes the semantics of the column, but if the user
only does equality searches, that shouldn't be an issue. And if an
ordering is needed after all, it's possible to specify the collation in
the ORDER BY clause.

I realize you propose to do this automatically for everyone, because few
people will realize how much faster can this be. But maybe there's a way
to make this manual approach more convenient? Say, by allowing the PK to
have a different collation (which I don't think is allowed now).

FWIW I wonder what the impact of doing this automatically would be in
practice. I mean, in my experience the number of tables with TEXT (or
types sensitive to collations) primary keys is fairly low, especially
for tables of non-trivial size (where the performance impact might be
measurable).

>> For example, I don't see how we could arbitrarily override the
>> collation
>> for indexes backing primary keys, because how would you know the user
>> will never do a sort on it?
> 
> The column collation and index collation are tracked separately in the
> catalog. The column collation cannot be overridden because it's
> semantically signficant, but there are at least some degrees of freedom
> we have with the index collation.
> 
> I don't think we can completely change the default index collation to
> be "C", but perhaps there could be a database-level option to do so,
> and that would have no effect on semantics at all. If the user notices
> some queries that could benefit from an index with a non-"C" collation,
> they can add/replace an index as they see fit.
> 

True. What about trying to allow a separate collation for the PK
constraint (and the backing index)?

>>  Not that uncommon with natural primary keys,
>> I think (not a great practice, but people do that).
> 
> Natural keys often have an uncorrelated index, and if the index is not
> correlated, it's often not useful ORDER BY.
> 
> When I actually think about schemas and plans I've seen in the wild, I
> struggle to think of many cases that would really benefit from an index
> in a non-"C" collation. The best cases I can think of are where it's
> doing some kind of prefix search. That's not rare, but it's also not so
> common that I'd like to risk index corruption on every index in the
> system by default in case a prefix search is performed.
> 

OK. I personally don't recall any case where I'd see a collation on PK
indexes as a performance issue. Or maybe I just didn't realize it.

But speaking of natural keys, I recall a couple schemas with natural
keys in code/dimension tables, and it's not uncommon to cluster those
slow-moving tables once in a while. I don't know if ORDER BY queries
were very common on those tables, though.

>> Perhaps we could allow the PK index to have a different collation,
>> say
>> by supporting something like this:
>>
>>   ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");
> 
> Yes, I'd like something like that to be supported. We'd have to check
> that, if the collations are different, that both are deterministic.
> 

OK, I think this answers my earlier question. Now that I think about
this, the one confusing thing with this syntax is that it seems to
assign the collation to the constraint, but in reality we want the
constraint to be enforced with the column's collation and the
alternative collation is for the index.

>> And then the planner would just pick the right index, I think.
> 
> Right now the planner doesn't seem to understand that an index in the
> "C" collation works just fine for answering equality queries. That
> should be fixed.
> 
>> If the
>> user cares about ordering, they'll specify ORDER BY with either an
>> explicit or the default collation. If the index collation matches, it
>> may be useful for the ordering.
> 
> Exactly.
> 
>> Of course, if we feel entitled to create the primary key index with a
>> collation of our choosing, that'd make this unpredictable.
> 
> I wouldn't describe it as "unpredictable". We'd have some defined way
> of defaulting the collation of an index which might be affected by a
> database option or something. In any case, it would be visible with \d.
> 

Perhaps "unpredictable" was not the right word. What I meant to express
is that it happens in the background, possibly confusing for the user.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why do indexes and sorts use the database collation?

From
Peter Eisentraut
Date:
On 13.11.23 17:49, Jeff Davis wrote:
> On Mon, 2023-11-13 at 13:43 +0100, Peter Eisentraut wrote:
>> On 11.11.23 01:03, Jeff Davis wrote:
>>> But the database collation is always deterministic,
>>
>> So far!
> 
> Yeah, if we did that, clearly the index collation would need to match
> that of the database to be useful. What are the main challenges in
> allowing non-deterministic collations at the database level?

Text pattern matching operations (LIKE, ~) don't work.

"Allowing" here is the right word.  We could enable it, and it would 
work just fine, but because of the restriction I mentioned, the 
experience would not be very pleasant.




Re: Why do indexes and sorts use the database collation?

From
Peter Eisentraut
Date:
On 14.11.23 02:58, Jeff Davis wrote:
> If the user just wants PK/FK constraints, and equality lookups, then an
> index with the "C" collation makes a lot of sense to serve those
> purposes.

The problem is that the user has no way to declare whether they just 
want this.  The default assumption is that you get a btree and that is 
useful for range queries.  If the user just wants equality lookups, they 
could use a hash index.  Hash indexes kind of work like what we 
discussed in another message: They use C collation semantics unless the 
collation is declared nondeterministic.  Of course, hash indexes don't 
support uniqueness, but maybe that could be fixed?  And/or we could 
provide some other syntax that say, I want a btree but I just want 
equality lookups?




Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Tue, 2023-11-14 at 17:15 +0100, Peter Eisentraut wrote:
> On 14.11.23 02:58, Jeff Davis wrote:
> > If the user just wants PK/FK constraints, and equality lookups,
> > then an
> > index with the "C" collation makes a lot of sense to serve those
> > purposes.
>
> The problem is that the user has no way to declare whether they just
> want this.

We should add a way to declare that a primary key should create an
index in a particular collation. We need to be careful not to interfere
with the SQL standard, but other than that, I think this is non-
controversial.

>   The default assumption is that you get a btree and that is
> useful for range queries.

As I've said elsewhere in this thread, I think the benefit of these
pathkeys are overstated, and the costs of providing those pathkeys with
an index (performance and corruption risk) are understated.

That being said, obviously we don't want to make any sudden change to
the default behavior that would regress lots of users. But there's lots
of stuff we can do that is not so radical.

>   If the user just wants equality lookups, they
> could use a hash index.

That's a good point, and we should probably support hash indexes for
primary keys. But I don't see a reason to push users toward hash
indexes if they aren't already inclined to use hash over btree. Btree
indexes in the "C" collation work just fine if we fix a planner issue
or two.

Regards,
    Jeff Davis





Re: Why do indexes and sorts use the database collation?

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2023-11-14 at 17:15 +0100, Peter Eisentraut wrote:
>> The problem is that the user has no way to declare whether they just 
>> want this.

> We should add a way to declare that a primary key should create an
> index in a particular collation.

Why should that ever be different from the column's own declared
collation?

            regards, tom lane



Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Tue, 2023-11-14 at 14:47 -0500, Tom Lane wrote:
> Why should that ever be different from the column's own declared
> collation?

Because an index with the "C" collation is more efficient in terms of
building/maintaining/searching the index, and it also doesn't carry
risks of corrupting your PK index when you upgrade libc or other
dependency headaches.

A "C" collation index is also perfectly capable of performing the
duties of a PK index: equality means the exact same thing in every
deterministic collation, so it can enforce the same notion of
uniqueness. It can also be used for ordinary equality lookups in the
same way, though currently our planner doesn't do that (I'll take a
shot at fixing that).

Of course such an index won't offer range scans or pathkeys useful for
ORDER BY on that text column. But as I've argued elsewhere in this
thread, that's less useful than it may seem at first (text indexes are
often uncorrelated). It seems valid to offer this as a trade-off that
users can make.

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Matthias van de Meent
Date:
On Wed, 15 Nov 2023 at 00:28, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Tue, 2023-11-14 at 14:47 -0500, Tom Lane wrote:
> > Why should that ever be different from the column's own declared
> > collation?
>
> Because an index with the "C" collation is more efficient in terms of
> building/maintaining/searching the index, and it also doesn't carry
> risks of corrupting your PK index when you upgrade libc or other
> dependency headaches.

That doesn't really answer the question for me. Why would you have a
primary key that has different collation rules (which include equality
rules) than the columns that this primary key contains? It is not
unlikely that users are misinformed about the behaviour of the
collation they're creating, thus breaking any primary key or equality
lookup that uses indexes auto-converted from that collation to the "C"
collation.

If the collation on my primary key's columns changes from one that is
deterministic to one that isn't, then my primary key surely has to be
reindexed. If the collation of the underlying index was overwritten to
'C' for performance, then that's a problem, right, as we wouldn't have
the expectation that the index is based on the columns' actual
collation's properties?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Wed, 2023-11-15 at 00:52 +0100, Matthias van de Meent wrote:
> That doesn't really answer the question for me. Why would you have a
> primary key that has different collation rules (which include
> equality
> rules)

The equality rules for all deterministic collations are the same: if
the bytes are identical, the values are considered equal; and if the
bytes are not identical, the values are considered unequal.

That's the basis for this entire thread. The "C" collation provides the
same equality semantics as every other deterministic collation, but
with better performance and lower risk. (As long as you don't actually
need range scans or path keys from the index.)

See varstr_cmp() or varstrfastcmp_locale(). Those functions first check
for identical bytes and return 0 if so. If the bytes aren't equal, it
passes it to the collation provider, but if the collation provider
returns 0, we do a final memcmp() to break the tie. You can also see
this in hashtext(), where for deterministic collations it just calls
hash_any() on the bytes.

None of this works for non-deterministic collations (e.g. case
insensitive), but that would be easy to block where necessary.

Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Tue, 2023-11-14 at 13:01 +0100, Tomas Vondra wrote:

> Presumably we'd no generate incorrect
> results, but we'd not be able use an index, causing performance
> issues.

Couldn't use the index for its pathkeys or range scans, but could use
it for equality.

> AFAICS this is a trade-off between known benefits (faster equality
> searches, which are common for PK columns) vs. unknown downsides
> (performance penalty for operations with unknown frequency).

Don't forget the dependency versioning risk: changes to the collation
provider library can corrupt your indexes. That affects even small
tables where performance is not a concern.

> Not sure it's a decision we can make automatically. But it's mostly
> achievable manually, if the user specifies COLLATE "C" for the
> column.

Changing the column collation is the wrong place to do it, in my
opinion. It conflates semantics with performance considerations and
dependency risks.

We already have the catalog support for indexes with a different
collation:

  CREATE TABLE foo (t TEXT COLLATE "en_US");
  INSERT INTO foo SELECT g::TEXT FROM generate_series(1,1000000) g;
  CREATE INDEX foo_idx ON foo (t COLLATE "C");
  ANALYZE foo;

The problem is that:

  EXPLAIN SELECT * FROM foo WHERE t = '345678';

doesn't use the index. And also that there's no way to do it for a PK
index.

> I realize you propose to do this automatically for everyone,

I don't think I proposed that. Perhaps we nudge users in that direction
over time as the utility becomes clear, but I'm not trying to push for
a sudden radical change.

Perhaps many read $SUBJECT as a rhetorical question, but I really do
want to know if I am missing important and common use cases for indexes
in a non-"C" collation.

>  But maybe there's a way
> to make this manual approach more convenient? Say, by allowing the PK
> to
> have a different collation (which I don't think is allowed now).

Yeah, that should be fairly non-controversial.

> FWIW I wonder what the impact of doing this automatically would be in
> practice. I mean, in my experience the number of tables with TEXT (or
> types sensitive to collations) primary keys is fairly low, especially
> for tables of non-trivial size (where the performance impact might be
> measurable).

I think a lot more users would be helped than hurt. But in absolute
numbers, the latter group still represents a lot of regressions, so
let's not do anything radical.

> >
> True. What about trying to allow a separate collation for the PK
> constraint (and the backing index)?

+1.

> >
> OK. I personally don't recall any case where I'd see a collation on
> PK
> indexes as a performance issue. Or maybe I just didn't realize it.

Try a simple experiment building an index with the "C" collation and
then try with a different locale on the same data. Numbers vary, but
I've seen 1.5X to 4X on some simple generated data. Others have
reported much worse numbers on some versions of glibc that are
especially slow with lots of non-latin characters.

> But speaking of natural keys, I recall a couple schemas with natural
> keys in code/dimension tables, and it's not uncommon to cluster those
> slow-moving tables once in a while. I don't know if ORDER BY queries
> were very common on those tables, though.

Yeah, not exactly a common case though.

> >
> OK, I think this answers my earlier question. Now that I think about
> this, the one confusing thing with this syntax is that it seems to
> assign the collation to the constraint, but in reality we want the
> constraint to be enforced with the column's collation and the
> alternative collation is for the index.

Yeah, let's be careful about that. It's still technically correct:
uniqueness in either collation makes sense. But it could be confusing
anyway.

> >
Regards,
    Jeff Davis




Re: Why do indexes and sorts use the database collation?

From
Jeff Davis
Date:
On Mon, 2023-11-13 at 14:12 -0800, Andres Freund wrote:
> Why on earth are we solving this by having multiple pg_collation
> entries for
> exactly the same collation, instead of normalizing the collation-name
> during
> lookup by adding the relevant encoding name if not explicitly
> specified?  It
> makes a lot of sense to not force the user to specify the encoding
> when it
> can't differ.

I'm not aware of it being a common practical problem, so perhaps lack
of motivation. But you're right that it doesn't look very efficient.

We can even go deeper into ICU if we wanted to: lots of locales are
actually aliases to a much smaller number of actual collators. And a
lot are just aliases to the root locale. It's not trivial to reliably
tell if two collators are identical, but in principle it should be
possible: each collation is just a set of tailorings on top of the root
locale, so I suppose if those are equal it's the same collator, right?

> It's imo similarly absurd that an index with "default" collation
> cannot be
> used when specifying the equivalent collation explicitly in the query
> and vice
> versa.

The catalog representation is not ideal to treat the database collation
consistently with other collations. It would be nice to fix that.

> > > >
> Jeff was saying that textual primary keys typically don't need
> sorting and
> because of that we could default to "C", for performance. Part of my
> response
> was that I think the user's intent could be expressed by specifying
> the column
> collation as "C" - to which Jeff replied that that would change the
> semantics. Which, to me, seems to completely run counter to his
> argument that
> we could just use "C" for such indexes.

I am saying we shouldn't prematurely optimize for the case of ORDER BY
on a text PK case by making a an index with a non-"C" collation, given
the costs and risks of non-"C" indexes. Particularly because, even if
there is an ORDER BY, there are several common reasons such an index
would not help anyway.

> > > >
Regards,
    Jeff Davis