From: Tomas Vondra
Subject: Re: MCV lists for highly skewed distributions
Msg-id: 8a524773-3299-fe7b-de8b-bc977de97bec@2ndquadrant.com
In response to: Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers

On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
> 

Thanks for working on this. The code seems fine to me, although it might
be a good idea to add comments explaining why analyze_mcv_list() starts
with the full MCV list and then removes items (essentially an explanation
of why Jeff's original idea would not work so well).

Actually, one question - when deciding whether to keep an item in the
MCV list, analyze_mcv_list() only compares its frequency with the average
frequency of the rest. But as we remove items from the MCV list, the
average frequency of the non-MCV items grows (we keep removing items with
higher and higher frequencies). That means the estimates for the least
common items will get higher and higher - essentially overestimates. So
couldn't/shouldn't analyze_mcv_list() take this into account too?
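
To illustrate the point: for a value not in the MCV list, the planner
essentially estimates its frequency as (1 - sum of the MCV frequencies)
divided by (ndistinct - size of the MCV list). A rough way to eyeball
that quantity from pg_stats (just a sketch - it ignores null_frac,
assumes a positive n_distinct, and the table/column names are merely
what my tests happen to use):

    -- average frequency the planner will assume for non-MCV values
    select attname,
           (1 - (select sum(f) from unnest(most_common_freqs) f))
             / (n_distinct - array_length(most_common_vals, 1))
             as avg_non_mcv_freq
      from pg_stats
     where tablename = 't2' and attname = 'a';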

I've also done a bunch of testing regarding join cardinality estimates,
because eqjoinsel_inner() is one of the places using MCV lists too. So
I've generated tables with different sizes and data distributions, and
observed how the patch affects the join estimates.

The scripts and results are available here:

   https://github.com/tvondra/join-estimates-tests

The tables were not particularly huge at this point (1000 to 100k rows).
I've also varied the number of distinct values (100 to 10k), the
statistics target (10 to 10k) and the distribution (for each table
independently):

1) uniform

    insert into t1 select random() * $ndistinct1, i
      from generate_series(1,$nrows1) s(i)

2) skewed

    insert into t2 select pow(random(),2) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

3) skewed-inverse

    insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

4) skewed-strong

    insert into t2 select pow(random(),4) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)

5) skewed-strong-inverse

    insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
      from generate_series(1,$nrows2) s(i)
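
The queries are simple equality joins between the two tables (which is
what eqjoinsel_inner() handles) - the exact ones are in the repo, but
they are roughly of this shape (the column name is illustrative):

    -- illustrative only; assumes the generated column is called "a"
    explain analyze select * from t1 join t2 on (t1.a = t2.a);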

The results are a bit too large to attach here - see for example
https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods.

Looking at the Mean Relative Error, i.e. essentially

    MRE = AVERAGE(MAX(estimate/actual, actual/estimate))

over the 20 runs for each combination of parameters, the "average" sheet
shows

     (MRE for patched) / (MRE for master)

and the patch seems to clearly improve accuracy in this case. There are
only a few cases where the estimate gets slightly worse (say, by 10%)
compared to current master. So I think that's nice.
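
In SQL terms that metric would look something like this (a hypothetical
per-run results table, just to make the formula concrete - the actual
numbers live in the spreadsheet):

    -- hypothetical table layout with one row per run
    select nrows, ndistinct, stats_target, distribution,
           avg(greatest(estimate::numeric / actual,
                        actual::numeric / estimate)) as mre
      from results
     where estimate > 0 and actual > 0
     group by 1, 2, 3, 4;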

I'm open to running additional tests with other distributions, table
sizes etc. if needed.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

