Re: a wrong index choose when statistics is out of date - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: a wrong index choose when statistics is out of date
Date
Msg-id 5284e0de-7457-490c-9413-14ba90116c92@enterprisedb.com
In response to Re: a wrong index choose when statistics is out of date  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan <zhihuifan1213@163.com> wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events(created_at, a);
>> create index idx_2 on x_events(created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics suggest "created_at = current_timestamp"
>> matches no rows, making both indexes look equally good; in reality the
>> estimate is wrong only because the statistics are out of date.
> 
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could
> happen.
> 
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider.  I did provide some detail in [1]
> about the design that was in my head at that time.  I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
> 
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.
> 
> The exact maths to calculate the certainty_factor for this case I
> don't quite have worked out yet. I plan to work on documenting the
> design of this and try and get a prototype patch out sometime during
> this coming southern hemisphere winter so that there's at least a full
> cycle of feedback opportunity before the PG18 freeze.
> 

I've been thinking about this stuff too, so I'm curious to hear what
kind of plan you come up with. Every time I tried to formalize a more
concrete plan, I ended up with a very complex (and possibly even more
fragile) approach.
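To make the certainty_factor idea concrete, here is a rough Python
sketch. It is purely illustrative: all the names (Path, certainty_factor,
risk_adjusted_cost, the 10% decay rate) are hypothetical, and the real
planner compares C Path structs in add_path(), not objects like these.

```python
# Hypothetical sketch: bias path comparison by a "certainty factor"
# derived from how many clauses the index path leaves unmatched.
# Everything here is made up for illustration.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    cost: float
    unmatched_clauses: int  # quals not enforced by the index itself

def certainty_factor(path: Path) -> float:
    # Each unmatched clause makes the row estimate less trustworthy;
    # decay the factor by 10% per unmatched clause (arbitrary rate).
    return 0.9 ** path.unmatched_clauses

def risk_adjusted_cost(path: Path) -> float:
    # Inflate the cost of less certain paths so that, when raw costs
    # are (near-)equal, the path matching more clauses wins the tie.
    return path.cost / certainty_factor(path)

# The example from upthread: both indexes get identical raw costs
# because the estimate says zero rows match created_at.
idx1 = Path("idx_1 (created_at, a)", cost=1.00, unmatched_clauses=1)
idx2 = Path("idx_2 (created_at, b)", cost=1.00, unmatched_clauses=0)

best = min([idx1, idx2], key=risk_adjusted_cost)
print(best.name)  # idx_2 wins despite the equal raw costs
```

The point is only the tie-breaking behaviour; what function of
unmatched clauses to use, and how it interacts with genuinely cheaper
but riskier paths, is exactly the part that still needs working out.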

I think we'd need to consider a couple of things:


1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals,
i.e. knowing not just the regular estimate, but also a range where the
actual value lies with high confidence (e.g. 90%).

For a single clause this might not be terribly difficult, but for more
complex cases (multiple conditions, ...) it seems far harder :-(
For example, let's say we know confidence intervals for two conditions.
What's the confidence interval when combined using AND or OR?
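As a toy illustration of why that's hard: if we assume the same clause
independence that the point estimates already assume, interval
endpoints could be combined like this. Real confidence intervals would
not compose this simply (correlation, dependence between clauses), so
treat this only as a sketch of how the combined range widens.

```python
# Toy interval arithmetic on selectivity estimates, assuming clause
# independence. (lo, hi) is a selectivity range for one clause.

def and_interval(a, b):
    # independence => combined selectivity is the product,
    # so multiply the endpoints
    return (a[0] * b[0], a[1] * b[1])

def or_interval(a, b):
    # inclusion-exclusion under independence: s1 + s2 - s1*s2
    return (a[0] + b[0] - a[0] * b[0],
            a[1] + b[1] - a[1] * b[1])

c1 = (0.001, 0.05)  # clause 1: tiny point estimate, wide range
c2 = (0.01, 0.02)   # clause 2: fairly tight range

print(and_interval(c1, c2))  # range spans two orders of magnitude
print(or_interval(c1, c2))
```

Even in this naive model the AND interval spans two orders of
magnitude, so the planner would still need a policy for acting on such
wide ranges.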


2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though.
The other thing that matters is how this affects the paths, how robust
the paths are. I mean, if we have alternative paths with costs that flip
somewhere in the confidence interval - which one to pick? Surely one
thing to consider is how fast the costs change for each path.
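To illustrate the flip, here are two hypothetical cost curves over the
row-count range a confidence interval might cover. Nothing here
reflects the actual cost model; the shapes are invented purely to show
a crossover inside the interval.

```python
# Two invented cost functions of the estimated row count. The winner
# flips between the low end and the high end of the interval, which is
# exactly the situation where the point estimate alone can't be trusted.

def cost_index_scan(rows):
    # cheap to start, grows quickly as the estimate is exceeded
    return 10 + 5.0 * rows

def cost_seq_scan(rows):
    # high start-up cost, nearly flat growth
    return 100 + 0.1 * rows

lo, est, hi = 1, 10, 100  # hypothetical interval around the estimate

for label, rows in (("lo", lo), ("est", est), ("hi", hi)):
    a, b = cost_index_scan(rows), cost_seq_scan(rows)
    winner = "index" if a < b else "seqscan"
    print(f"{label}: rows={rows} index={a} seq={b} -> {winner}")
```

The index scan wins at the estimate but loses badly at the top of the
interval; how steeply each curve climbs across the interval is one
possible measure of how robust the path is.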


3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc
solution for one particular path/plan type). So this would be somehow
integrated into the cost model, making it yet more complex. I'm quite
worried about this (not necessarily because of performance reasons).

I wonder if trying to improve the robustness solely by changes in the
planning phase is a very promising approach. I mean - sure, we should
improve that, but by definition it relies on a priori information. And
not only the stats may be stale - it's a very lossy approximation of the
actual data. Even if the stats are perfectly up to date / very detailed,
there's still going to be a lot of uncertainty.


I wonder if we'd be better off if we experimented with more robust
plans, like SmoothScan [1] or g-join [2].


regards

[1]
https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf

[2]
http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


