Re: A wrong index choose issue because of inaccurate statistics - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: A wrong index choose issue because of inaccurate statistics |
Date | |
Msg-id | CAApHDvovVWCbeR4v+A4Dkwb=YS_GuJG9OyCm8jZu++cP2xsY_A@mail.gmail.com Whole thread Raw |
In response to | A wrong index choose issue because of inaccurate statistics (Andy Fan <zhihui.fan1213@gmail.com>) |
Responses |
Re: A wrong index choose issue because of inaccurate statistics
Re: A wrong index choose issue because of inaccurate statistics |
List | pgsql-hackers |
On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote: > The one-line fix describe the exact idea in my mind: > > +++ b/src/backend/optimizer/path/costsize.c > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count, > > cpu_run_cost += cpu_per_tuple * tuples_fetched; > > + /* > + * To make the planner more robust to handle some inaccurate statistics > + * issue, we will add a extra cost to qpquals so that the less qpquals > + * the lower cost it has. > + */ > + cpu_run_cost += 0.01 * list_length(qpquals); > + > > This change do fix the issue above, but will it make some other cases worse? My > answer is no based on my current knowledge, and this is most important place > I want to get advised. The mainly impact I can figure out is: it not only > change the cost difference between (a, b) and (a, c) it also cause the cost > difference between Index scan on (a, c) and seq scan. However the > cost different between index scan and seq scan are huge by practice so > I don't think this impact is harmful. Didn't that idea already get shot down in the final paragraph on [1]? I understand that you wish to increase the cost by some seemingly innocent constant to fix your problem case. Here are my thoughts about that: Telling lies is not really that easy to pull off. Bad liers think it's easy and good ones know it's hard. The problem is that the lies can start small, but then at some point the future you must fashion some more lies to account for your initial lie. Rinse and repeat that a few times and before you know it, your course is set well away from the point of truth. I feel the part about "rinse and repeat" applies reasonably well to how join costing works. The lie is likely to be amplified as the join level gets deeper. I think you need to think of a more generic solution and propose that instead. There are plenty of other quirks in the planner that can cause suffering due to inaccurate or non-existing statistics. For example, due to how we multiply individual selectivity estimates, having a few correlated columns in a join condition can cause the number of join rows to be underestimated. Subsequent joins can then end up making bad choices on which join operator to use based on those inaccurate row estimates. There's also a problem with WHERE <x> ORDER BY col LIMIT n; sometimes choosing an index that provides pre-sorted input to the ORDER BY but cannot use <x> as an indexable condition. We don't record any stats to make better choices there, maybe we should, but either way, we're taking a bit risk there as all the rows matching <x> might be right at the end of the index and we might need to scan the entire thing before hitting the LIMIT. For now, we just assume completely even distribution of rows. i.e. If there are 50 rows estimated in the path and the limit is for 5 rows, then we'll assume we need to read 10% of those before finding all the ones we need. In reality, everything matching <x> might be 95% through the index and we could end up reading 100% of rows. That particular problem is not just caused by the uneven distribution of rows in the index, but also from selectivity underestimation. I'd more recently wondered if we shouldn't have some sort of "risk" factor to the cost model. I really don't have ideas on how exactly we would calculate the risk factor in all cases, but in this case, say the risk factor always starts out as 1. We could multiply that risk factor by some >1 const each time we find another index filter qual. add_path() can prefer lower risk paths when the costs are similar. Unsure what the exact add_path logic would be. Perhaps a GUC would need to assist with the decision there. Likewise, with NestedLoopPaths which have a large number of join quals, the risk factor could go up a bit with those so that we take a stronger consideration for hash or merge joins instead. Anyway, it's pretty much a large research subject which would take a lot of work to iron out even just the design. It's likely not a perfect idea, but I think it has a bit more merit that trying to introduce lies to the cost modal to account for a single case where there is a problem. David [1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
pgsql-hackers by date: