Home > mailing lists

Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: multivariate statistics (v19)
Date	August 15, 2016 20:50:26
Msg-id	2e275433-53c6-cabe-3344-b2ed625c4a95@2ndquadrant.com Whole thread Raw
In response to	Re: multivariate statistics (v19) (Michael Paquier <michael.paquier@gmail.com>)
List	pgsql-hackers

Tree view

On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) enriching the query tree with multivariate statistics info
>>
>> Right now all the stuff related to multivariate statistics estimation
>> happens in clausesel.c - matching condition to statistics, selection of
>> statistics to use (if there are multiple usable stats), etc. So pretty much
>> all this info is internal to clausesel.c and does not get outside.
>
> This does not seem bad to me as first sight but...
>
>> I'm starting to think that some of the steps (matching quals to stats,
>> selection of stats) should happen in a "preprocess" step before the actual
>> estimation, storing the information (which stats to use, etc.) in a new type
>> of node in the query tree - something like RestrictInfo.
>>
>> I believe this needs to happen sometime after deconstruct_jointree() as that
>> builds RestrictInfos nodes, and looking at planmain.c, right after
>> extract_restriction_or_clauses seems about right. Haven't tried, though.
>>
>> This would move all the "statistics selection" logic from clausesel.c,
>> separating it from the "actual estimation" and simplifying the code.
>>
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated 
in practice. It essentially means doing something like
    rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the 
name suggests the list should only include RestrictInfo nodes, which 
seems to contradict the transformation.

For example with conditions
    WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on 
(a,b,c), we need to note that somehow - my plan was to inject a node 
storing this information, something like (a bit simplified):
    StatisticsInfo {         Oid statisticsoid; /* OID of the statistics */         List *mvconditions; /* estimate
usingthe statistics */         List *otherconditions; /* estimate the old way */    }

But that'd clearly violate the assumption that baserestrictinfo only 
contains RestrictInfo. I don't think it's feasible (or desirable) to 
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it 
wants. But it also means we need to maintain two lists instead of one, 
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e. 
each node would track whether it's estimated the "old way" or using 
multivariate statistics (and which one). But it would require changing 
many of the existing nodes (at least those compatible with multivariate 
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during 
the estimation, as it'd be necessary to look for other nodes to be 
estimated by the same statistics. Which seems to defeat the idea of 
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e. 
keeping a separate list, but I'd welcome other ideas.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Peter Eisentraut
Date: 15 August 2016, 20:36:11
Subject: Re: Let's get rid of the separate minor version numbers for shlibs

From: Jim Nasby
Date: 15 August 2016, 20:59:02
Subject: Re: PATCH: Exclude additional directories in pg_basebackup

Re: multivariate statistics (v19) - Mailing list pgsql-hackers

Previous

Next