Re: A wrong index choose issue because of inaccurate statistics - Mailing list pgsql-hackers
From | Ashutosh Bapat
Subject | Re: A wrong index choose issue because of inaccurate statistics
Date |
Msg-id | CAExHW5vDey+a-3hu0hkvensMPEkCHp2oaf4C6dZ8qz=XFPSzsQ@mail.gmail.com
In response to | Re: A wrong index choose issue because of inaccurate statistics (Pavel Stehule <pavel.stehule@gmail.com>)
Responses | Re: A wrong index choose issue because of inaccurate statistics
List | pgsql-hackers
I know one project where they used the PostgreSQL code base to detect "robust plans": https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf describe the idea. In short, the idea is to annotate a plan with a "bandwidth", i.e. how the plan fares as the statistics degrade. A plan with a slightly higher cost whose cost does not degrade much as the statistics degrade is preferred over a low-cost plan whose cost rises sharply as the statistics degrade. This is similar to what David is suggesting.

On Fri, Jun 5, 2020 at 12:00 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> On Fri, 5 Jun 2020 at 8:19, David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>> > The one-line fix describes the exact idea in my mind:
>> >
>> > +++ b/src/backend/optimizer/path/costsize.c
>> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>> >
>> >      cpu_run_cost += cpu_per_tuple * tuples_fetched;
>> >
>> > +    /*
>> > +     * To make the planner more robust against inaccurate statistics,
>> > +     * add an extra cost per qpqual so that the fewer qpquals a path
>> > +     * has, the lower its cost.
>> > +     */
>> > +    cpu_run_cost += 0.01 * list_length(qpquals);
>> > +
>> >
>> > This change does fix the issue above, but will it make some other cases worse? My answer is no based on my current knowledge, and this is the most important point on which I want advice. The main impact I can see is that it not only changes the cost difference between (a, b) and (a, c), it also changes the cost difference between an index scan on (a, c) and a seq scan. However, the cost difference between an index scan and a seq scan is huge in practice, so I don't think this impact is harmful.
>>
>> Didn't that idea already get shot down in the final paragraph of [1]?
>>
>> I understand that you wish to increase the cost by some seemingly innocent constant to fix your problem case. Here are my thoughts about that: telling lies is not really that easy to pull off. Bad liars think it's easy and good ones know it's hard. The problem is that the lies can start small, but at some point your future self must fashion more lies to account for the initial lie. Rinse and repeat that a few times and, before you know it, your course is set well away from the point of truth. I feel the part about "rinse and repeat" applies reasonably well to how join costing works. The lie is likely to be amplified as the join level gets deeper.
>>
>> I think you need to think of a more generic solution and propose that instead. There are plenty of other quirks in the planner that can cause suffering due to inaccurate or non-existent statistics. For example, because we multiply individual selectivity estimates, having a few correlated columns in a join condition can cause the number of join rows to be underestimated. Subsequent joins can then end up making bad choices about which join operator to use based on those inaccurate row estimates. There's also a problem with WHERE <x> ORDER BY col LIMIT n sometimes choosing an index that provides pre-sorted input to the ORDER BY but cannot use <x> as an indexable condition.
>> We don't record any stats to make better choices there. Maybe we should, but either way, we're taking a big risk there, as all the rows matching <x> might be right at the end of the index and we might need to scan the entire thing before hitting the LIMIT. For now, we just assume a completely even distribution of rows, i.e. if there are 50 rows estimated in the path and the limit is for 5 rows, then we'll assume we need to read 10% of those before finding all the ones we need. In reality, everything matching <x> might be 95% of the way through the index and we could end up reading 100% of the rows. That particular problem is not just caused by the uneven distribution of rows in the index, but also by selectivity underestimation.
>>
>> I'd more recently wondered if we shouldn't have some sort of "risk" factor in the cost model. I really don't have ideas on how exactly we would calculate the risk factor in all cases, but in this case, say the risk factor always starts out as 1. We could multiply that risk factor by some >1 constant each time we find another index filter qual. add_path() could prefer lower-risk paths when the costs are similar. I'm unsure what the exact add_path() logic would be; perhaps a GUC would need to assist with the decision there. Likewise, with NestedLoopPaths which have a large number of join quals, the risk factor could go up a bit so that we give stronger consideration to hash or merge joins instead.
>>
>
> I thought about these ideas too. And I am not alone.
>
> https://hal.archives-ouvertes.fr/hal-01316823/document
>
> Regards
>
> Pavel
>
>> Anyway, it's pretty much a large research subject which would take a lot of work to iron out, even just the design. It's likely not a perfect idea, but I think it has a bit more merit than trying to introduce lies into the cost model to account for a single case where there is a problem.
>>
>> David
>>
>> [1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development

--
Best Wishes,
Ashutosh Bapat
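To see what the one-line patch quoted above actually does to the index choice, here is a minimal, standalone C sketch of the arithmetic. Only cost_index(), qpquals, list_length() and the 0.01 constant come from the quoted patch; FakeIndexPath, surcharged_cost() and the base-cost numbers are hypothetical, chosen to mimic the case where inaccurate statistics make the (a, b) and (a, c) index paths look equally cheap.

```c
/*
 * Sketch only, not planner code: how a small per-qual surcharge, as in the
 * quoted cost_index() patch, breaks a tie between two index paths whose
 * base cost estimates come out identical because the statistics cannot
 * tell them apart.  All numbers are hypothetical.
 */
#include <stdio.h>

typedef struct
{
	const char *name;
	double		base_cost;		/* estimated cost before the surcharge */
	int			n_filter_quals; /* quals applied as filters, not index conds */
} FakeIndexPath;

static double
surcharged_cost(const FakeIndexPath *p)
{
	/* the quoted patch adds 0.01 per qpqual */
	return p->base_cost + 0.01 * p->n_filter_quals;
}

int
main(void)
{
	/*
	 * Index on (a, b): the qual on c must be applied as a filter.
	 * Index on (a, c): both quals become index conditions, no filter.
	 * With inaccurate stats both get the same base cost estimate.
	 */
	FakeIndexPath ab = {"idx_a_b", 100.0, 1};
	FakeIndexPath ac = {"idx_a_c", 100.0, 0};

	printf("%s: %.2f\n", ab.name, surcharged_cost(&ab));
	printf("%s: %.2f\n", ac.name, surcharged_cost(&ac));

	/* the path with fewer filter quals now wins the tie */
	return 0;
}
```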
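David's "risk factor" suggestion is only sketched in prose above; the following standalone C snippet shows one conceptual reading of it: the risk multiplier starts at 1, grows by a constant greater than 1 for each index filter qual, and when two paths have similar costs the lower-risk one is kept. add_path() is the real PostgreSQL function being alluded to, but RiskyPath, prefer_path(), the 1.05 growth constant and the 1% cost fuzz are illustrative assumptions, not an existing API or an agreed design.

```c
/*
 * Conceptual sketch (not PostgreSQL code) of the "risk factor" idea from
 * the quoted mail: each index filter qual inflates a risk multiplier, and
 * when two paths have similar costs the lower-risk one is preferred.
 * The 1.05 growth constant and the 1% cost fuzz are arbitrary placeholders.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	const char *name;
	double		cost;
	double		risk;			/* starts at 1.0 */
} RiskyPath;

static double
path_risk(int n_filter_quals)
{
	double		risk = 1.0;

	for (int i = 0; i < n_filter_quals; i++)
		risk *= 1.05;			/* some >1 constant per filter qual */
	return risk;
}

/* true if 'a' should be kept in preference to 'b' */
static bool
prefer_path(const RiskyPath *a, const RiskyPath *b)
{
	double		fuzz = 1.01;	/* costs within 1% count as "similar" */

	if (a->cost * fuzz < b->cost)
		return true;			/* clearly cheaper wins */
	if (b->cost * fuzz < a->cost)
		return false;
	return a->risk < b->risk;	/* similar cost: lower risk wins */
}

int
main(void)
{
	RiskyPath	ab = {"idx_a_b", 100.0, path_risk(1)};
	RiskyPath	ac = {"idx_a_c", 100.0, path_risk(0)};

	/* equal cost estimates, so the lower-risk path is kept */
	printf("keep %s\n", prefer_path(&ac, &ab) ? ac.name : ab.name);
	return 0;
}
```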