Re: Why do indexes and sorts use the database collation? - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Why do indexes and sorts use the database collation?
Date
Msg-id 4c7b88bdc63c01fb51b56447b5e0c6f1dd702ffa.camel@j-davis.com
Whole thread Raw
In response to Re: Why do indexes and sorts use the database collation?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, 2023-11-13 at 10:02 -0800, Andres Freund wrote:
> > Inequalities and ORDER BYs can't benefit from an index with a
> > different
> > collation, but lots of indexes don't need that.
>
> But we don't know whether the index is used for that.

That will be hard to quantify, but perhaps we can discuss in terms of
the conditions that must be satisfied for pathkeys provided by an index
to be useful (the following is a bit fuzzy, let me know if there's
something major I'm missing):

  a. There needs to be an ORDER BY or inequality somewhere in a query
that involves that text field.
  b. The index must be correlated, or the data cached well enough, or
there must be a highly selective inequality.
  c. For ORDER BY, the result size needs to be large enough for the
pathkeys to really matter vs just sorting at the top of the plan.
  d. The pathkeys must be part of a winning plan (e.g. the winning plan
must keep the indexed field on the outer of a join, or use  MergeJoin).

In my experience, considering a text index speciifically: queries are
less likely to use inequalities on text fields than a timestamp field;
and indexes on text are less likely to be correlated than an index on a
timestamp field. That pushes down the probabilities of (a) and (b)
compared with timestamps. (Timestamps obviously don't have collation;
I'm just using timestamps as a point of reference where index pathkeys
are really useful.)

I know the above are hard to quantify (and not statistically
independent), but I don't think we should take it for granted that
pathkeys on a text index are overwhelmingly useful. I would describe
them as "sometimes useful".

> >
> That makes no sense to me. Either the user cares about ordering, in
> which case
> the index needs to be in that ordering for efficient ORDER BY

I disagree. The user may want top-level ORDER BYs on that field to
return 'a' before 'Z', and that's a very reasonable expectation that
requires a non-"C" collation.

But that does not imply much about what order an index on that field
should be. The fact that an ORDER BY exists satisfies only condition
(a) above. If the other conditions are not met, then the pathkeys
provided by the index are close to useless anyway.

The index itself might be useful for other reasons though, like
constraints or equality lookups. But indexes for those purposes don't
need to provide pathkeys.

> You partially
> premised your argument on the content of primary keys typically
> making non-C
> collations undesirable!

Primary keys requires (in a practical sense) an index to be created,
and that index should be useful for other purposes, too.

Equality lookups are clearly required to implement a primary key, so of
course the index should be useful for any other equality lookups as
well, because that has zero cost.

But "useful for other purposes" is not a blank check. Providing useful
pathkeys offers some marginal utility (assuming the conditions (a)-(e)
are satisfied), but also has a marginal cost (build time and versioning
risks). For typical cases I believe Postgres is on the wrong side of
that trade; that's all I'm saying.

> I'm not sure it's quite that easy. One issue is obviously that this
> could lead
> to a huge increase in paths we need to keep 

If there's a particularly bad case you have in mind, please let me
know. Otherwise we can sort the details out when it comes to a patch.

> >
> I'd think the specific collation. Even if we initially perhaps just
> get the
> default cost from the provider such, it structurally seems the sanest
> place to
> locate the cost.

Makes sense, though I'm thinking we'd still want to special case the
fastest collation as "C".

Regards,
    Jeff Davis




pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Next
From: Andres Freund
Date:
Subject: Re: [PATCH] pgbench log file headers