Re: A wrong index choose issue because of inaccurate statistics - Mailing list pgsql-hackers

From David Rowley
Subject Re: A wrong index choose issue because of inaccurate statistics
Date
Msg-id CAApHDvovVWCbeR4v+A4Dkwb=YS_GuJG9OyCm8jZu++cP2xsY_A@mail.gmail.com
Whole thread Raw
In response to A wrong index choose issue because of inaccurate statistics  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: A wrong index choose issue because of inaccurate statistics
Re: A wrong index choose issue because of inaccurate statistics
List pgsql-hackers
On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> The one-line fix describe the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
>         cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> +       /*
> +        * To make the planner more robust to handle some inaccurate statistics
> +        * issue, we will add a extra cost to qpquals so that the less qpquals
> +        * the lower cost it has.
> +        */
> +       cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change do fix the issue above, but will it make some other cases worse? My
> answer is no based on my current knowledge, and this is most important place
> I want to get advised. The mainly impact I can figure out is: it not only
> change the cost difference between (a, b) and (a, c) it also cause the cost
> difference between Index scan on (a, c) and seq scan.  However the
> cost different between index scan and seq scan are huge by practice so
> I don't think this impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly
innocent constant to fix your problem case.  Here are my thoughts
about that: Telling lies is not really that easy to pull off. Bad
liers think it's easy and good ones know it's hard. The problem is
that the lies can start small, but then at some point the future you
must fashion some more lies to account for your initial lie. Rinse and
repeat that a few times and before you know it, your course is set
well away from the point of truth.  I feel the part about "rinse and
repeat" applies reasonably well to how join costing works.  The lie is
likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that
instead.  There are plenty of other quirks in the planner that can
cause suffering due to inaccurate or non-existing statistics. For
example, due to how we multiply individual selectivity estimates,
having a few correlated columns in a join condition can cause the
number of join rows to be underestimated. Subsequent joins can then
end up making bad choices on which join operator to use based on those
inaccurate row estimates.  There's also a problem with WHERE <x> ORDER
BY col LIMIT n; sometimes choosing an index that provides pre-sorted
input to the ORDER BY but cannot use <x> as an indexable condition.
We don't record any stats to make better choices there, maybe we
should, but either way, we're taking a bit risk there as all the rows
matching <x> might be right at the end of the index and we might need
to scan the entire thing before hitting the LIMIT. For now, we just
assume completely even distribution of rows. i.e. If there are 50 rows
estimated in the path and the limit is for 5 rows, then we'll assume
we need to read 10% of those before finding all the ones we need. In
reality, everything matching <x> might be 95% through the index and we
could end up reading 100% of rows. That particular problem is not just
caused by the uneven distribution of rows in the index, but also from
selectivity underestimation.

I'd more recently wondered if we shouldn't have some sort of "risk"
factor to the cost model. I really don't have ideas on how exactly we
would calculate the risk factor in all cases, but in this case,  say
the risk factor always starts out as 1. We could multiply that risk
factor by some >1 const each time we find another index filter qual.
add_path() can prefer lower risk paths when the costs are similar.
Unsure what the exact add_path logic would be. Perhaps a GUC would
need to assist with the decision there.   Likewise, with
NestedLoopPaths which have a large number of join quals, the risk
factor could go up a bit with those so that we take a stronger
consideration for hash or merge joins instead.

Anyway, it's pretty much a large research subject which would take a
lot of work to iron out even just the design. It's likely not a
perfect idea, but I think it has a bit more merit that trying to
introduce lies to the cost modal to account for a single case where
there is a problem.

David

[1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
Next
From: Michael Paquier
Date:
Subject: Re: Removal of currtid()/currtid2() and some table AM cleanup