Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation - Mailing list pgsql-hackers

From Yuto Hayamizu
Subject Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
Date
Msg-id CANE+7D_biXGVz2jhh-y20zNSv_NaQzn=ZkrKKSjx2d9Bs9=RJw@mail.gmail.com
In response to Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Nov 8, 2017 at 6:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Because (1) up to now there's been no need to consider the qual ordering
> till later, and (2) re-doing that sort for each path seemed unduly
> expensive.  If we're to try to estimate whether later quals will be
> reached, then sure the ordering becomes important.  I'm still concerned
> about (2) though.

This patch changes 'set_baserel_size_estimates', which is called only
once per range table entry, so the sort does not happen for each path
and its negative impact on optimizer performance is negligible.

While sorting the quals itself does not cost much, I noticed another
performance problem with this patch: the patched
'set_baserel_size_estimates' is O(N^2), where N is the number of quals,
and can take a long time when a query has a large qual list.  After
sorting the quals, it loops over q_1, q_2, ..., q_N to estimate the
"weighted" cost of each qual q_k.  In each iteration, to calculate the
weight of q_k, it calls clauselist_selectivity({q_1, q_2, ..., q_k}),
which itself loops over the k clauses.
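
To make the shape of the computation concrete, here is a rough pseudo-C
sketch of that loop as I understand it (sorted_quals, cost_of(), and
the local variable names are illustrative stand-ins, not the actual
identifiers in the patch):

    List       *prefix = NIL;
    Cost        total = 0;
    ListCell   *lc;

    foreach(lc, sorted_quals)
    {
        Node       *qual = (Node *) lfirst(lc);
        Selectivity weight;

        prefix = lappend(prefix, qual);

        /* weight of q_k = clauselist_selectivity({q_1, ..., q_k}),
         * which walks all k clauses, so the whole loop is O(N^2) */
        weight = clauselist_selectivity(root, prefix, 0, JOIN_INNER, NULL);

        total += weight * cost_of(qual);    /* cost_of() is a stand-in */
    }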

I've measured the actual impact on optimization time with the
following artificial query:
    SELECT count(*) FROM pgbench_accounts WHERE (randomly generated N quals);

    N       original (msec)    patched (msec)
    50      0.5966             0.938
    100     1.1992             1.5396
    200     1.9167             4.6676
    400     2.7583             12.0786
    800     5.2919             38.0228
    1600    9.3428             167.737

From this result, the patched version is clearly O(N^2).
A hundred quals in a query may seem like an unusually large number, but
there may exist real-world queries with hundreds or thousands of quals,
so I feel this patch needs improvement in handling large N.

My idea for improving this patch is to introduce a threshold N_limit:
for q_1 ... q_{N_limit}, do the same weighted cost estimation as in the
current version of the patch.  For q_{N_limit+1} and beyond, stop
calling clauselist_selectivity for the weight and instead reuse the
result of clauselist_selectivity({q_1, q_2, ..., q_{N_limit}}).  For
example, with N_limit = 100 the additional overhead is only
sub-millisecond per range table entry, and the cost estimation is still
surely better than the current Postgres implementation.
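
In pseudo-C, the mitigation would look roughly like this (again,
N_LIMIT and the surrounding names are only for illustration):

    List       *prefix = NIL;
    Cost        total = 0;
    Selectivity weight = 1.0;
    int         k = 0;
    ListCell   *lc;

    foreach(lc, sorted_quals)
    {
        Node       *qual = (Node *) lfirst(lc);

        if (k < N_LIMIT)
        {
            prefix = lappend(prefix, qual);
            weight = clauselist_selectivity(root, prefix, 0,
                                            JOIN_INNER, NULL);
        }
        /* beyond N_LIMIT, reuse the weight of the first N_LIMIT quals */

        total += weight * cost_of(qual);    /* cost_of() is a stand-in */
        k++;
    }

This keeps the loop at O(N_limit^2 + N), while the first N_limit quals
still get the exact weighted estimate.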

Any thoughts?

