Re: Multi-column distinctness. - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Multi-column distinctness.
Date
Msg-id 55EBF9DD.9070605@2ndquadrant.com
Whole thread Raw
In response to Multi-column distinctness.  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Multi-column distinctness.  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hello Horiguchi-san,

On 08/28/2015 10:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this patch enables planner to be couscious of inter-column
> correlation.
>
> Sometimes two or more columns in a table has some correlation
> which brings underestimate, which leads to wrong join method and
> ends with slow execution.
>
> Tomas Vondra is now working on heavily-equipped multivariate
> statistics for OLAP usage. In contrast, this is a lightly
> implemented solution which calculates only the ratio between a
> rows estimated by current method and a actual row number. I think
> this doesn't conflict with his work except the grammar part.

I'd like to sign up as a reviewer of this patch, if you're OK with that. 
I don't see the patch in any of the commitfests, though :-(

Now, a few comments based on reading the patches (will try testing them 
in the next few days, hopefully).

1) catalog design
-----------------
I see you've invented a catalog with a slightly different structure than 
I use in the multivariate stats patch, storing the coefficient for each 
combination of columns in a separate row. The idea in the multivariate 
patch is to generate all possible combinations of columns and store the 
coefficients packed in a single row, which requires implementing a more 
complicated data structure. But your patch does not do that (only 
computes coefficient for the one combination).

This is another potential incompatibility with the multivariate patch, 
although it's rather in the background and should not be difficult to 
rework if needed.

2) find_mv_coeffeicient
---------------------
I don't quite see why this is the right thing
    /* Prefer smaller one */    if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)        mv_coef =
mvc->mvccoefficient;

i.e. why it's correct to choose the record with the lower coefficient 
and not the one with most columns (the largest subset). Maybe there's 
some rule that those are in fact the same thing (seems like adding a 
column can only lower the coefficient), but it's not quite obvious and 
would deserve a comment.

3) single statistics limitation
-------------------------------
What annoys me a bit is that this patch only applies a single statistics 
- it won't try "combining" multiple smaller ones (even if they don't 
overlap, which shouldn't be that difficult).

4) no sub-combinations
----------------------
Another slightly annoying thing is that the patch computes statistics 
only for the specified combination of columns, and no subsets. So if you 
have enabled stats on [a,b,c] you won't be able to use that for 
conditions on [a,b], for example. Sure, the user can create the stats 
manually, but IMO the less burden we place on the user the better. It's 
enough that we force users to create the stats at all, we shouldn't 
really force them co create all possible combinations.

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to 
commit the multivariate stats patch. I have no intention in blocking 
this patch for that reasons, but if we might design the syntax to make 
it compatible with the multivariate patch, that'd be nice. But it seems 
to me the syntax is pretty much the same, no?

I.e. it uses
    ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the 
stattypes in ATExecAddDropMvStatistics except for checking there's a 
single entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit 
confusing. Maybe two separate methods would be better?

6) GROUP BY
-----------
One thing I'd really love to see is improvement of GROUP BY clauses. 
That's one of the main sources of pain in analytical queries, IMHO, 
resulting in either OOM issues in Hash Aggregate, or choice of sort 
paths where hash would be more efficient.

regards
Tomas

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Multi-column distinctness.
Next
From: Julien Rouhaud
Date:
Subject: Re: Outdated documentation for PageHeaderData