Re: [HACKERS] extended statistics: n-distinct - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] extended statistics: n-distinct
Date
Msg-id 20170321.180024.50058401.horiguchi.kyotaro@lab.ntt.co.jp
In response to [HACKERS] extended statistics: n-distinct  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: [HACKERS] extended statistics: n-distinct  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Thank you for finishing this.

At Mon, 20 Mar 2017 16:02:20 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20170320190220.ixlaueanxegqd5gr@alvherre.pgsql>
> Here is a closer to final version of the multivariate statistics series,
> last posted at
> https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql

I'm sorry, but this seems to conflict with both the current master
(17fa3e8) and a3eac988c267, which is the base of v28.

> If you've always wanted to review multivariate stats, but never found a
> good reason to, now is a terrific time to do so!  (In other words: I
> plan to get this pushed in the not too distant future.)

Great! I'm sorry for not having contributed much, though.

> This is a new thread to present a version of the n-distinct patch that
> IMO is close enough to commit.  There are some work items still.
> There's some discussion on the topic of cross-column statistics:
> https://wiki.postgresql.org/wiki/Cross_Columns_Stats
> 
> This problem is important enough that Kyotaro Horiguchi submitted
> another patch that does the same thing:
> https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
> This patch aims to provide the same functionality, keeping the design
> general enough that other kinds of statistics can be added later (such
> as functional dependencies, histograms and MCVs, all of which have been
> previously submitted as patches by Tomas).

I may be stupid, but I don't get the picture here, specifically
the relation to Tomas's patch. Does this work as infrastructure
for Tomas's mv patch, or is there some other relationship?

> To recap, what this patch provides is a new command of the form
>    CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table
> 
> Note that we put the table name in a separate FROM clause instead of
> together with the column name, so that this is more readily extensible
> to things that are not just columns, for example expressions that might
> involve more than one table (per review from Dean Rasheed).  Currently,
> only one table is supported.
> 
> In this patch, the "opts" can only be "ndistinct", which creates a
> pg_statistic_ext row with the number of distinct groups found in all
> possible combination across that set of columns.  This can be used when
> a GROUP BY or a DISTINCT clause need to estimate the number of distinct
> groups in an aggregation.
> 
> 
> 
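
To make sure I read the "ndistinct" description correctly, here is a
minimal Python sketch of the quantity involved. It is only an
illustration, not the patch's code: the function name and the sample
rows are made up, it counts exactly instead of estimating from an
ANALYZE sample, and it assumes "all possible combinations" means every
subset of two or more of the listed columns.

```python
from itertools import combinations


def ndistinct_all_combinations(rows, columns):
    """Count distinct value groups for every combination of 2+ columns.

    Hypothetical helper for illustration only; it scans all rows and
    counts exactly, whereas real statistics come from a sample.
    """
    result = {}
    for size in range(2, len(columns) + 1):
        for combo in combinations(columns, size):
            groups = {tuple(row[c] for c in combo) for row in rows}
            result[combo] = len(groups)
    return result


# Made-up data in which column b always equals column a.
rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 2},
    {"a": 2, "b": 2, "c": 1},
    {"a": 2, "b": 2, "c": 2},
]
stats = ndistinct_all_combinations(rows, ("a", "b", "c"))
for combo, count in stats.items():
    print(combo, count)
```

In this data (a,b) has only 2 distinct groups, while multiplying the
per-column counts (2 * 2) would suggest 4. That gap is what a GROUP BY
or DISTINCT estimate can pick up from such a statistic.
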
> Some things left to change:
> 
> * Currently, we use the ndistinct value only if the grouping uses
> exactly the set of columns covered by a statistics.  For example, if we
> have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
> the old method, which may result in worse results than if we used the
> number we know about (a,b,c) then applied a fixup to consider the
> distinctness of (d).

Are you planning to also correct the estimation of joins that is
thrown off by strong correlations?
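
For reference, the exact-match rule in the quoted item can be sketched
as below. This is a toy model under my own assumptions, not the actual
estimate_num_groups(): the name sketch_num_groups, the dictionaries and
all the numbers are hypothetical, and "the old method" is reduced to
multiplying per-column distinct counts and clamping to the row count.

```python
def sketch_num_groups(group_cols, per_column_ndistinct, ext_stats, total_rows):
    """Toy group-count estimate (hypothetical, not PostgreSQL code).

    Uses a multi-column ndistinct value only when the grouping columns
    match a statistics object exactly; otherwise falls back to the
    product of per-column distinct counts, clamped to the row count.
    """
    key = frozenset(group_cols)
    if key in ext_stats:
        return min(ext_stats[key], total_rows)

    estimate = 1
    for col in group_cols:
        estimate *= per_column_ndistinct[col]
    return min(estimate, total_rows)


# Made-up numbers: a, b and c are strongly correlated, d is independent.
per_column = {"a": 100, "b": 100, "c": 100, "d": 10}
ext_stats = {frozenset({"a", "b", "c"}): 100}
total_rows = 1_000_000

exact = sketch_num_groups(["a", "b", "c"], per_column, ext_stats, total_rows)
fallback = sketch_num_groups(["a", "b", "c", "d"], per_column, ext_stats, total_rows)
print(exact, fallback)
```

With these numbers the grouping on (a,b,c) uses the stored value, while
adding d discards it entirely and the estimate jumps to the row count.
The fixup described in the quoted text would presumably start from the
(a,b,c) value instead.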

> * Also, estimate_num_groups() looks a bit patchy.  With slightly more
> invasive changes we can make it look more natural.
> 
> * I'm not terribly happy with the header organization.  I think
> VacAttrStats should be in its own (new) src/include/statistics/analyze.h
> for example (which cleans up a bunch of existing stuff a bit), and the
> new files could do with some slight makeover.
> 
> * The current code uses AttrNumber * and int2vector, in places where it
> would be more convenient to use Bitmapsets.
> 
> * We currently try to keep a stats object even if a column in it is
> dropped -- for example, if we have stats on (a,b,c) and drop (b), then
> we still have stats on (a,c).  While this is nice, it creates a bunch of
> weird corner cases, so I'm going to rip that out and just drop the
> statistics instead.  If the user wants stats on (a,c) to remain, they
> can create it after (or before) dropping the column.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



