Re: A wrong index choose issue because of inaccurate statistics - Mailing list pgsql-hackers
From: Andy Fan
Subject: Re: A wrong index choose issue because of inaccurate statistics
Msg-id: CAKU4AWrRA4oXrORsCvXgy1r3oC_eGZ5DZBR_TBc0W+O9z_fcaw@mail.gmail.com
In response to: Re: A wrong index choose issue because of inaccurate statistics (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On Fri, Jun 5, 2020 at 2:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> The one-line fix describes the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
> cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> + /*
> + * To make the planner more robust against inaccurate statistics,
> + * we add an extra cost per qpqual so that the fewer qpquals an
> + * index path has, the lower its cost.
> + */
> + cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change does fix the issue above, but will it make some other cases worse? My
> answer is no based on my current knowledge, and this is the most important point
> I want advice on. The main impact I can figure out is: it not only
> changes the cost difference between (a, b) and (a, c), it also changes the cost
> difference between an index scan on (a, c) and a seq scan. However, the
> cost difference between an index scan and a seq scan is huge in practice, so
> I don't think this impact is harmful.
Didn't that idea already get shot down in the final paragraph on [1]?
Thanks for chiming in. My take is that I didn't describe my idea clearly enough, so
both Tomas and Tom didn't spend much time reading it (no offense, and I
understand they have lots of things to do every day), so I re-summarized the
issue to make it easier to read.
In Tomas's reply, he raised concerns about how to fix the issue, since we
don't know how far off the estimate is (1%, 10% and so on), so I emphasized
that I don't actually touch that part. The wrong estimation still plays a bad
role in later joins, but that is not the issue I am trying to fix here.
I understand that you wish to increase the cost by some seemingly
innocent constant to fix your problem case. Here are my thoughts
about that: Telling lies is not really that easy to pull off. Bad
liars think it's easy and good ones know it's hard. The problem is
that the lies can start small, but then at some point the future you
must fashion some more lies to account for your initial lie. Rinse and
repeat that a few times and before you know it, your course is set
well away from the point of truth. I feel the part about "rinse and
repeat" applies reasonably well to how join costing works. The lie is
likely to be amplified as the join level gets deeper.
I agree with that to some extent. However, we could just provide more options to
users. At the same time, I still believe we should provide such options very carefully.
Unsure what the exact add_path logic would be. Perhaps a GUC would
need to assist with the decision there. Likewise, with
NestedLoopPaths which have a large number of join quals, the risk
factor could go up a bit with those so that we take a stronger
consideration for hash or merge joins instead.
I probably underestimated the impact of a large number of join quals.
This looks like a weakness we can't ignore comfortably, so I checked
the code further; maybe we don't need a risk factor for this case.
For the query WHERE a = 1 AND b = 2, both Selectivity(a = 1) and
Selectivity(b = 2) are greater than 0 even if the statistics are stale,
so the IndexSelectivity is greater than 0 as well. Based on this,
IndexSelectivity(A, B) should be less than IndexSelectivity(A, C) for the
above query. However, they still generate the same cost because of the
code below (in genericcostestimate and btcostestimate):
    numIndexTuples = indexSelectivity * index->rel->tuples;
    numIndexTuples = rint(numIndexTuples / num_sa_scans);
    if (numIndexTuples < 1.0)
        numIndexTuples = 1.0;
Later, numIndexTuples is used to calculate the cost. The above code eats the
difference in indexSelectivity.
Many places say we need to "round to integer", but I still haven't figured out
why it is a must. So if we "round to integer" just after the IndexCost
is calculated instead, the issue can be fixed as well.
The attached is the patch I have in mind. Since this patch may lower the index
cost when numIndexTuples < 1.0, it makes the optimizer prefer a
nested loop over a hash join when the loop count is small, which is a common
case in our regression tests where we don't have much data, so there are
several plan changes like that.
Another impact I found in the regression tests is that the optimizer chooses an
IndexScan on a partial index rather than an IndexOnlyScan on a normal index.
I checked the code and didn't find anything wrong, so I'd say that is
because the data set is too small.
I also tested the TPC-H workload; Query 12 changed from a hash join to a nested loop,
and the execution time went from 1605.118 ms to 699.032 ms.
> the root cause of this situation is because IndexSelectivity = 0.
This statement of mine was wrong. I got the 0 from elog(INFO, "%f", IndexSelectivity),
which reported 0 while the actual value is just very small.
Best Regards
Andy Fan