Re: Postgres picks suboptimal index after building of an extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Postgres picks suboptimal index after building of an extended statistics
Date
Msg-id d3ce7422-978e-16a1-aa0e-7c749615797b@enterprisedb.com
Whole thread Raw
In response to Re: Postgres picks suboptimal index after building of an extended statistics  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses Re: Postgres picks suboptimal index after building of an extended statistics
List pgsql-hackers
On 9/25/23 06:30, Andrey Lepikhov wrote:
> ...
> I can't stop thinking about this issue. It is bizarre when Postgres
> chooses a non-unique index if a unique index gives us proof of minimum
> scan.

That's true, but no one implemented this heuristics. So the "proof of
minimum scan" is merely hypothetical - the optimizer is unaware of it.

> I don't see a reason to teach the clauselist_selectivity() routine to
> estimate UNIQUE indexes. We add some cycles, but it will work with btree
> indexes only.

I'm not sure I understand what this is meant to say. Can you elaborate?
We only allow UNIQUE for btree indexes anyway, so what exactly is the
problem here?

> Maybe to change compare_path_costs_fuzzily() and add some heuristic, for
> example:
> "If selectivity of both paths gives us no more than 1 row, prefer to use
> a unique index or an index with least selectivity."
> 

I don't understand how this would work. What do yo mean by "selectivity
of a path"? AFAICS the problem here is that we estimate a path to return
more rows (while we know there can only be 1, thanks to UNIQUE index).

So how would you know the path does not give us more than 1 row? Surely
you're not proposing compare_path_costs_fuzzily() to do something
expensive like analyzing the paths, or something.

Also, how would it work in upper levels? If you just change which path
we keep, but leave the inaccurate row estimate in place, that may fix
that level, but it's certainly going to confuse later planning steps.

IMHO the problem here is that we produce wrong estimate, so we better
fix that, instead of adding band-aids to other places.

This happens because functional dependencies are very simple type of
statistics - it has some expectations about the input data and also the
queries executed on it. For example it assumes the data is reasonably
homogeneous, so that we can calculate a global "degree".

But the data in the example directly violates that - it has 1000 rows
that are very random (so we'd find no dependencies), and 1000 rows with
perfect dependencies. Hence we end with degree=0.5, which approximates
the dependencies to all data. Not great, true, but that's the price for
simplicity of this statistics kind.

So the simplest solution is to disable dependencies on such data sets.
It's a bit inconvenient/unfortunate we build dependencies by default,
and people may not be aware of there assumptions.

Perhaps we could/should make dependency_degree() more pessimistic when
we find some "violations" of the rule (we intentionally are not strict
about it, because data are imperfect). I don't have a good idea how to
change the formulas, but I'm open to the idea in principle.

The other thing we could do is checking for unique indexes in
clauselist_selectivity, and if we find a match we can just skip the
extended statistics altogether. Not sure how expensive this would be,
but for typical cases (with modest number of indexes) perhaps OK. It
wouldn't work without a unique index, but I don't have a better idea.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Next
From: Tomas Vondra
Date:
Subject: Re: Detoasting optionally to make Explain-Analyze less misleading