[HACKERS] extended statistics: n-distinct - Mailing list pgsql-hackers

From Alvaro Herrera
Subject [HACKERS] extended statistics: n-distinct
Date
Msg-id 20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
Whole thread Raw
Responses Re: [HACKERS] extended statistics: n-distinct  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: [HACKERS] extended statistics: n-distinct  (David Rowley <david.rowley@2ndquadrant.com>)
Re: [HACKERS] extended statistics: n-distinct  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
Here is a closer to final version of the multivariate statistics series,
last posted at
https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql
If you've always wanted to review multivariate stats, but never found a
good reason to, now is a terrific time to do so!  (In other words: I
plan to get this pushed in the not too distant future.)

This is a new thread to present a version of the n-distinct patch that
IMO is close enough to commit.  There are some work items still.
There's some discussion on the topic of cross-column statistics:
https://wiki.postgresql.org/wiki/Cross_Columns_Stats

This problem is important enough that Kyotaro Horiguchi submitted
another patch that does the same thing:
https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
This patch aims to provide the same functionality, keeping the design
general enough that other kinds of statistics can be added later (such
as functional dependencies, histograms and MCVs, all of which have been
previously submitted as patches by Tomas).

To recap, what this patch provides is a new command of the form
   CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table

Note that we put the table name in a separate FROM clause instead of
together with the column name, so that this is more readily extensible
to things that are not just columns, for example expressions that might
involve more than one table (per review from Dean Rasheed).  Currently,
only one table is supported.

In this patch, the "opts" can only be "ndistinct", which creates a
pg_statistic_ext row with the number of distinct groups found in all
possible combination across that set of columns.  This can be used when
a GROUP BY or a DISTINCT clause need to estimate the number of distinct
groups in an aggregation.



Some things left to change:

* Currently, we use the ndistinct value only if the grouping uses
exactly the set of columns covered by a statistics.  For example, if we
have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
the old method, which may result in worse results than if we used the
number we know about (a,b,c) then applied a fixup to consider the
distinctness of (d).

* Also, estimate_num_groups() looks a bit patchy.  With slightly more
invasive changes we can make it look more natural.

* I'm not terribly happy with the header organization.  I think
VacAttrStats should be in its own (new) src/include/statistics/analyze.h
for example (which cleans up a bunch of existing stuff a bit), and the
new files could do with some slight makeover.

* The current code uses AttrNumber * and int2vector, in places where it
would be more convenient to use Bitmapsets.

* We currently try to keep a stats object even if a column in it is
dropped -- for example, if we have stats on (a,b,c) and drop (b), then
we still have stats on (a,c).  While this is nice, it creates a bunch of
weird corner cases, so I'm going to rip that out and just drop the
statistics instead.  If the user wants stats on (a,c) to remain, they
can create it after (or before) dropping the column.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: Beena Emerson
Date:
Subject: Re: [HACKERS] increasing the default WAL segment size
Next
From: Petr Jelinek
Date:
Subject: Re: [HACKERS] logical replication access control patches