Re: a wrong index choose when statistics is out of date - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: a wrong index choose when statistics is out of date
Date:
Msg-id: 5284e0de-7457-490c-9413-14ba90116c92@enterprisedb.com
In response to: Re: a wrong index choose when statistics is out of date (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan <zhihuifan1213@163.com> wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events(created_at, a);
>> create index idx_2 on x_events(created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for
>> the above query if the statistics think "created_at = current_timestamp"
>> matches no rows - then both indexes look equally good, but only
>> because the statistics are out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could
> happen.
>
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider. I did provide some detail in [1]
> about the design that was in my head at that time. I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
>
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.
>
> The exact maths to calculate the certainty_factor for this case I
> don't quite have worked out yet. I plan to work on documenting the
> design of this and try and get a prototype patch out sometime during
> this coming southern hemisphere winter, so that there's at least a
> full cycle of feedback opportunity before the PG18 freeze.
>

I've been thinking about this stuff too, so I'm curious to hear what
kind of plan you come up with. Every time I tried to formalize a more
concrete plan, I ended up with a very complex (and possibly yet more
fragile) approach.

I think we'd need to consider a couple of things:

1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals,
i.e. knowing not just the regular estimate, but also a range where the
actual value lies with high confidence (e.g. 90%).

For a single clause this might not be terribly difficult, but for more
complex cases (multiple conditions, ...) it seems far more complex :-(
For example, let's say we know confidence intervals for two conditions.
What's the confidence interval when combined using AND or OR? (See the
rough sketch after point 3.)

2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though.
The other thing that matters is how this affects the paths, i.e. how
robust the paths are. I mean, if we have alternative paths with costs
that flip somewhere in the confidence interval - which one to pick?
Surely one thing to consider is how fast the costs change for each path.

3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc
solution for one particular path/plan type). So this would have to be
integrated into the cost model, making it yet more complex. I'm quite
worried about this (and not just because of the performance
implications).
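To make the AND/OR question in (1) a bit more concrete, here's a rough
standalone sketch (not planner code - SelInterval, sel_and and sel_or
are made-up names) that simply treats the per-clause bounds as
best/worst cases and assumes the clauses are independent:

/*
 * Rough sketch, not planner code: combine per-clause selectivity
 * estimates that carry a [lo, hi] range, assuming the clauses are
 * independent. With correlated clauses the true range can be much
 * wider, so this only illustrates the easy part of the problem.
 */
#include <stdio.h>

typedef struct SelInterval
{
	double		est;	/* point estimate of selectivity */
	double		lo;		/* lower bound (e.g. 5th percentile) */
	double		hi;		/* upper bound (e.g. 95th percentile) */
} SelInterval;

/* AND of two independent clauses: multiply estimate and both bounds */
static SelInterval
sel_and(SelInterval a, SelInterval b)
{
	SelInterval r = {a.est * b.est, a.lo * b.lo, a.hi * b.hi};

	return r;
}

/* OR of two independent clauses: s1 + s2 - s1*s2, applied per bound */
static SelInterval
sel_or(SelInterval a, SelInterval b)
{
	SelInterval r = {a.est + b.est - a.est * b.est,
					 a.lo + b.lo - a.lo * b.lo,
					 a.hi + b.hi - a.hi * b.hi};

	return r;
}

int
main(void)
{
	/* "created_at = now()": stats say ~0 rows, but the range is wide */
	SelInterval created_at = {0.000001, 0.0, 0.01};
	/* "b = 1": a fairly reliable estimate */
	SelInterval b_eq_1 = {0.001, 0.0005, 0.002};

	SelInterval and_sel = sel_and(created_at, b_eq_1);
	SelInterval or_sel = sel_or(created_at, b_eq_1);

	printf("AND: est %g, range [%g, %g]\n", and_sel.est, and_sel.lo, and_sel.hi);
	printf("OR:  est %g, range [%g, %g]\n", or_sel.est, or_sel.lo, or_sel.hi);

	return 0;
}

Even this naive version gives an AND estimate of ~1e-9 with an upper
bound of 2e-5, several orders of magnitude apart - and it ignores
correlation entirely, which is exactly the part that makes combining
the intervals hard.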
That said, I wonder whether trying to improve the robustness solely by
changes in the planning phase is a very promising approach. I mean -
sure, we should improve that, but by definition it relies on a priori
information. And not only may the stats be stale - they're a very lossy
approximation of the actual data. Even if the stats are perfectly up to
date / very detailed, there's still going to be a lot of uncertainty.

I wonder if we'd be better off experimenting with more robust plans,
like SmoothScan [1] or g-join [2].

regards

[1] https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf
[2] http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company