Re: [HACKERS] multivariate statistics (v25) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] multivariate statistics (v25)
Date
Msg-id 7dedc58b-f9cf-fe34-ef7f-382f776ce953@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] multivariate statistics (v24)  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] multivariate statistics (v25)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<a78ffb17-70e8-a55a-c10c-66ab575e88ed@2ndquadrant.com>
>> OK,
>>
>> attached is v24 of the patch series, addressing most of the reported
>> issues and comments (at least I believe so). The main changes are:
>
> Unfortunately, 0002 conflicts with the current master
> (4461a9b). Could you rebase them or tell us the commit where this
> patches stand on?
>

Attached is a rebased patch series, otherwise it's the same as v24.

FWIW it was based on 016c990834 from Feb 28, but apparently some recent 
patch caused a minor conflict.

> I only saw the patch files but have some comments.
>
>> 1) I've mostly abandoned the "multivariate" name in favor of
>> "extended", particularly in places referring to stats stored in the
>> pg_statistic_ext in general. "Multivariate" is now used only in places
>> talking about particular types (e.g. multivariate histograms).
>>
>> The "extended" name is more widely used for this type of statistics,
>> and the assumption is that we'll also add other (non-multivariate)
>> types of statistics - e.g. statistics on custom expressions, or some
>> for of join statistics.
>
> In 0005, and
>
> @@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root,
>       * If there are no such stats or not enough attributes, don't waste time
>       * simply skip to estimation using the plain per-column stats.
>       */
> +    if (has_stats(stats, STATS_TYPE_MCV) &&
> ...
> +            /* compute the multivariate stats */
> +            s1 *= clauselist_ext_selectivity(root, mvclauses, stat);
> ====
> @@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, Index relid,
>  }
>
>  /*
> + * estimate selectivity of clauses using multivariate statistic
>
> These comment is left unchanged?  or on purpose? 0007 adds very
> similar texts.
>

Hmm, those comments should be probably changed to "extended".

>> 2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and
>> pg_mv_stats view renamed to pg_stats_ext).
>
> FWIW, "extended statistic" would be abbreviated as
> "ext_statistic" or "extended_stats". Why have you exchanged the
> words?
>

Because this way it's clear it's a version of pg_statistic, and it will 
be sorted right next to it.

>> 3) The structure of pg_statistic_ext was changed as proposed by
>> Alvaro, i.e. the boolean flags were removed and instead we have just a
>> single "char[]" column with list of enabled statistics.
>>
>> 4) I also got rid of the "mv" part in most variable/function/constant
>> names, replacing it by "ext" or something similar. Also mvstats.h got
>> renamed to stats.h.
>>
>> 5) Moved the files from src/backend/utils/mvstats to
>> backend/statistics.
>>
>> 6) Fixed the n_choose_k() overflow issues by using the algorithm
>> proposed by Dean. Also, use the simple formula for num_combinations().
>>
>> 7) I've tweaked data types for a few struct members (in stats.h). I've
>> kept most of the uint32 fields at the top level though, because int16
>> might not be large enough for large statistics and the overhead is
>> minimal (compared to the space needed e.g. for histogram buckets).
>
> Some formulated proof or boundary value test cases might be
> needed (to prevent future trouble). Or any defined behavior on
> overflow of them might be enough. I belive all (or most) of
> overflow-able data has such behavior.
>

That is probably a good idea and I plan to do that.

>> The renames/changes were quite widespread, but I've done my best to
>> fix all the comments and various other places.
>>
>> regards
>
> regards,
>

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: [HACKERS] tuplesort_gettuple_common() and *should_free argument
Next
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] multivariate statistics (v25)