Thread: [HACKERS] extended statistics: n-distinct

[HACKERS] extended statistics: n-distinct

From
Alvaro Herrera
Date:
Here is a closer to final version of the multivariate statistics series,
last posted at
https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql
If you've always wanted to review multivariate stats, but never found a
good reason to, now is a terrific time to do so!  (In other words: I
plan to get this pushed in the not too distant future.)

This is a new thread to present a version of the n-distinct patch that
IMO is close enough to commit.  There are some work items still.
There's some discussion on the topic of cross-column statistics:
https://wiki.postgresql.org/wiki/Cross_Columns_Stats

This problem is important enough that Kyotaro Horiguchi submitted
another patch that does the same thing:
https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
This patch aims to provide the same functionality, keeping the design
general enough that other kinds of statistics can be added later (such
as functional dependencies, histograms and MCVs, all of which have been
previously submitted as patches by Tomas).

To recap, what this patch provides is a new command of the form
   CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table

Note that we put the table name in a separate FROM clause instead of
together with the column name, so that this is more readily extensible
to things that are not just columns, for example expressions that might
involve more than one table (per review from Dean Rasheed).  Currently,
only one table is supported.

In this patch, the "opts" can only be "ndistinct", which creates a
pg_statistic_ext row with the number of distinct groups found in all
possible combination across that set of columns.  This can be used when
a GROUP BY or a DISTINCT clause need to estimate the number of distinct
groups in an aggregation.



Some things left to change:

* Currently, we use the ndistinct value only if the grouping uses
exactly the set of columns covered by a statistics.  For example, if we
have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
the old method, which may result in worse results than if we used the
number we know about (a,b,c) then applied a fixup to consider the
distinctness of (d).

* Also, estimate_num_groups() looks a bit patchy.  With slightly more
invasive changes we can make it look more natural.

* I'm not terribly happy with the header organization.  I think
VacAttrStats should be in its own (new) src/include/statistics/analyze.h
for example (which cleans up a bunch of existing stuff a bit), and the
new files could do with some slight makeover.

* The current code uses AttrNumber * and int2vector, in places where it
would be more convenient to use Bitmapsets.

* We currently try to keep a stats object even if a column in it is
dropped -- for example, if we have stats on (a,b,c) and drop (b), then
we still have stats on (a,c).  While this is nice, it creates a bunch of
weird corner cases, so I'm going to rip that out and just drop the
statistics instead.  If the user wants stats on (a,c) to remain, they
can create it after (or before) dropping the column.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] extended statistics: n-distinct

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> * I'm not terribly happy with the header organization.  I think
> VacAttrStats should be in its own (new) src/include/statistics/analyze.h
> for example (which cleans up a bunch of existing stuff a bit)

I tried this and it doesn't actually do any good.  Patch attached, which
I intend to throw away.  The main problem is that most files interested
in analyze stuff also wants to do vacuum_delay_point, or access
default_statistics_target, both of which are declared in vacuum.h, so
these files have to include vacuum.h anyway. 

In order to achieve anything worthwhile we'd have to do a three-way
split I think, and that's more trouble that this is worth.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] extended statistics: n-distinct

From
David Rowley
Date:
On 21 March 2017 at 08:02, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Here is a closer to final version of the multivariate statistics series,
> last posted at
> https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql

I've made another pass over the patch.

+      A notice is issued in this case.  Note that there is no guarantee that
+      the existing statistics is anything like the one that would have been
+      created.

I think the final sentence would be better read as "Note that only the
name of the statistics object is considered here. The definition of
the statistics is not considered"

+ * This is not handled automatically by DROP TABLE because statistics are
+ * on their own schema.

"in their own schema" ?

+ /* now that we know the number of attributes, allocate the attribute */
+ item->attrs = (AttrNumber *) palloc0(item->nattrs * sizeof(AttrNumber));

there's still a number of palloc0() calls in mvdist.c that could be
simple palloc() calls.

+ int nmultiple,
+ summultiple;

these seem not initialised to 0 before they're used.

+int
+compare_scalars_simple(const void *a, const void *b, void *arg)

I don't see where this is used.

+int
+compare_scalars_partition(const void *a, const void *b, void *arg)

or this

+int
+multi_sort_compare_dims(int start, int end,

should have a comment

+bool
+stats_are_enabled(HeapTuple htup, char type)

missing comment too

+ appendStringInfo(&buf, "CREATE STATISTICS %s ON (",
+ quote_identifier(NameStr(statextrec->staname)));

I know I wrote this bit, but I did it like that because I didn't know
what stat types would be included. Do we need a WITH(ndistinct) or
something in there?

+ attname = get_relid_attribute_name(statextrec->starelid, attnum);

This one is my fault. It needs a quote_identifier() around it:

postgres=# create table mytable ("select" int, "insert" int);
CREATE TABLE
postgres=# create statistics mytable_stats on ("select","insert") from mytable;
CREATE STATISTICS
postgres=# select pg_get_statisticsextdef(oid) from pg_statistic_ext
where staname = 'mytable_stats';                    pg_get_statisticsextdef
------------------------------------------------------------------CREATE STATISTICS mytable_stats ON (select, insert)
FROMmytable
 

+ while (HeapTupleIsValid(htup = systable_getnext(indscan)))
+ /* TODO maybe include only already built statistics? */
+ result = insert_ordered_oid(result, HeapTupleGetOid(htup));


+ /* XXX ensure this is true. It was broken in v25 0002 */

can now remove this comment. I see you added a check to only allow
stats on tables and MVs.

+ printfPQExpBuffer(&buf,
+  "SELECT oid, stanamespace::regnamespace AS nsp, staname, stakeys,\n"
+  "  (SELECT string_agg(attname::text,', ') \n"

Might need to do something with pg_catalog.quote_ident(attname) here:

postgres=# \d mytable             Table "public.mytable"Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------select | integer |           |          |insert | integer |
|          |
 
Indexes:   "mytable_select_insert_idx" btree ("select", insert)
Statistics:   "public.mytable_stats" WITH (ndistinct) ON (select, insert)

notice lack of quoting on 'select'.

List   *keys; /* String nodes naming referenced columns */

Are these really keys? Sounds more like something you might call it if
it were an index. Maybe it should just be "columns"? Although we need
to consider that one day they might be expressions too.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] extended statistics: n-distinct

From
Kyotaro HORIGUCHI
Date:
Thank you for finishing this.

At Mon, 20 Mar 2017 16:02:20 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20170320190220.ixlaueanxegqd5gr@alvherre.pgsql>
> Here is a closer to final version of the multivariate statistics series,
> last posted at
> https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql

I'm sorry but this seems conflicting the current master(17fa3e8)
and a3eac988c267, which is the base of v28.

> If you've always wanted to review multivariate stats, but never found a
> good reason to, now is a terrific time to do so!  (In other words: I
> plan to get this pushed in the not too distant future.)

Great! But sorry for having contributed not so much.

> This is a new thread to present a version of the n-distinct patch that
> IMO is close enough to commit.  There are some work items still.
> There's some discussion on the topic of cross-column statistics:
> https://wiki.postgresql.org/wiki/Cross_Columns_Stats
> 
> This problem is important enough that Kyotaro Horiguchi submitted
> another patch that does the same thing:
> https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
> This patch aims to provide the same functionality, keeping the design
> general enough that other kinds of statistics can be added later (such
> as functional dependencies, histograms and MCVs, all of which have been
> previously submitted as patches by Tomas).

I may be stupid but I don't get the picture here, specifically
about the relation to Tomas's patch. Does this work as
infrastructure for Tomas's mv patch? Or in some other
relationsip?

> To recap, what this patch provides is a new command of the form
>    CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table
> 
> Note that we put the table name in a separate FROM clause instead of
> together with the column name, so that this is more readily extensible
> to things that are not just columns, for example expressions that might
> involve more than one table (per review from Dean Rasheed).  Currently,
> only one table is supported.
> 
> In this patch, the "opts" can only be "ndistinct", which creates a
> pg_statistic_ext row with the number of distinct groups found in all
> possible combination across that set of columns.  This can be used when
> a GROUP BY or a DISTINCT clause need to estimate the number of distinct
> groups in an aggregation.
> 
> 
> 
> Some things left to change:
> 
> * Currently, we use the ndistinct value only if the grouping uses
> exactly the set of columns covered by a statistics.  For example, if we
> have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
> the old method, which may result in worse results than if we used the
> number we know about (a,b,c) then applied a fixup to consider the
> distinctness of (d).

Do you planning to realize correcting esitimation of joins
perplexed by strong correlations?

> * Also, estimate_num_groups() looks a bit patchy.  With slightly more
> invasive changes we can make it look more natural.
> 
> * I'm not terribly happy with the header organization.  I think
> VacAttrStats should be in its own (new) src/include/statistics/analyze.h
> for example (which cleans up a bunch of existing stuff a bit), and the
> new files could do with some slight makeover.
> 
> * The current code uses AttrNumber * and int2vector, in places where it
> would be more convenient to use Bitmapsets.
> 
> * We currently try to keep a stats object even if a column in it is
> dropped -- for example, if we have stats on (a,b,c) and drop (b), then
> we still have stats on (a,c).  While this is nice, it creates a bunch of
> weird corner cases, so I'm going to rip that out and just drop the
> statistics instead.  If the user wants stats on (a,c) to remain, they
> can create it after (or before) dropping the column.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] extended statistics: n-distinct

From
Alvaro Herrera
Date:
Kyotaro HORIGUCHI wrote:

> At Mon, 20 Mar 2017 16:02:20 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20170320190220.ixlaueanxegqd5gr@alvherre.pgsql>

> > This is a new thread to present a version of the n-distinct patch that
> > IMO is close enough to commit.  There are some work items still.
> > There's some discussion on the topic of cross-column statistics:
> > https://wiki.postgresql.org/wiki/Cross_Columns_Stats
> > 
> > This problem is important enough that Kyotaro Horiguchi submitted
> > another patch that does the same thing:
> > https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
> > This patch aims to provide the same functionality, keeping the design
> > general enough that other kinds of statistics can be added later (such
> > as functional dependencies, histograms and MCVs, all of which have been
> > previously submitted as patches by Tomas).
> 
> I may be stupid but I don't get the picture here, specifically
> about the relation to Tomas's patch. Does this work as
> infrastructure for Tomas's mv patch? Or in some other
> relationsip?

Well, this patch is Tomas' first patch, which I've reviewed and reworked
-- I changed some things that weren't properly finished, cleaned up the
code, made it all more robust, and made sure the sane cases work sanely
while the others rejected promptly (rather than throwing bogus error
messages at a later time, or crashing).

I didn't review your own n-distinct patch.  I don't think there's any
common code, but it would be very useful if you could try your test
scenarios and make sure they are handled sanely by this patch.

Regarding your question:

> Do you planning to realize correcting esitimation of joins
> perplexed by strong correlations?

There is a later patch in Tomas' series which I would like to get to
before PG10 closes, but it's not this patch.  It needs to be rebased on
top of this one.

Attached is v30, which includes some more cleanup.  Detailed commits can
be seen here:
https://github.com/2ndQuadrant/postgres/commits/dev/mvstats-ndistinct
In particular, this includes code from Tomas to consider mixing
ndistinct estimates from multiple multivariate statistic objects, which
is better than the old approach of only using the estimate when a
perfect match was found.  However, I lobotomized Tomas' selfuncs.c code
however and I need to revert that part before pushing -- essentially I
removed examine_variable() processing, which seemed a bit on the
expensive side for what we were doing, but that was a silly mistake.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: extended statistics: n-distinct

From
Alvaro Herrera
Date:
Pushed this after some more tweaking.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services