Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: multivariate statistics (v19)
Date
Msg-id 2e275433-53c6-cabe-3344-b2ed625c4a95@2ndquadrant.com
Whole thread Raw
In response to Re: multivariate statistics (v19)  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On 08/10/2016 06:41 AM, Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> 1) enriching the query tree with multivariate statistics info
>>
>> Right now all the stuff related to multivariate statistics estimation
>> happens in clausesel.c - matching condition to statistics, selection of
>> statistics to use (if there are multiple usable stats), etc. So pretty much
>> all this info is internal to clausesel.c and does not get outside.
>
> This does not seem bad to me as first sight but...
>
>> I'm starting to think that some of the steps (matching quals to stats,
>> selection of stats) should happen in a "preprocess" step before the actual
>> estimation, storing the information (which stats to use, etc.) in a new type
>> of node in the query tree - something like RestrictInfo.
>>
>> I believe this needs to happen sometime after deconstruct_jointree() as that
>> builds RestrictInfos nodes, and looking at planmain.c, right after
>> extract_restriction_or_clauses seems about right. Haven't tried, though.
>>
>> This would move all the "statistics selection" logic from clausesel.c,
>> separating it from the "actual estimation" and simplifying the code.
>>
>> But more importantly, I think we'll need to show some of the data in EXPLAIN
>> output. With per-column statistics it's fairly straightforward to determine
>> which statistics are used and how. But with multivariate stats things are
>> often more complicated - there may be multiple candidate statistics (e.g.
>> histograms covering different subsets of the conditions), it's possible to
>> apply them in different orders, etc.
>>
>> But EXPLAIN can't show the info if it's ephemeral and available only within
>> clausesel.c (and thrown away after the estimation).
>
> This gives a good reason to not do that in clauserel.c, it would be
> really cool to be able to get some information regarding the stats
> used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated 
in practice. It essentially means doing something like
    rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the 
name suggests the list should only include RestrictInfo nodes, which 
seems to contradict the transformation.

For example with conditions
    WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on 
(a,b,c), we need to note that somehow - my plan was to inject a node 
storing this information, something like (a bit simplified):
    StatisticsInfo {         Oid statisticsoid; /* OID of the statistics */         List *mvconditions; /* estimate
usingthe statistics */         List *otherconditions; /* estimate the old way */    }
 

But that'd clearly violate the assumption that baserestrictinfo only 
contains RestrictInfo. I don't think it's feasible (or desirable) to 
rework all the places to expect both RestrictInfo and the new node.

I can think of two alternatives:

1) keep the transformed list as separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it 
wants. But it also means we need to maintain two lists instead of one, 
and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e. 
each node would track whether it's estimated the "old way" or using 
multivariate statistics (and which one). But it would require changing 
many of the existing nodes (at least those compatible with multivariate 
statistics: currently OpExpr, NullTest, ...).

And it also seems fairly difficult to reconstruct the information during 
the estimation, as it'd be necessary to look for other nodes to be 
estimated by the same statistics. Which seems to defeat the idea of 
preprocessing to some degree.

So I'm not sure what's the best solution. I'm leaning to (1), i.e. 
keeping a separate list, but I'd welcome other ideas.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Let's get rid of the separate minor version numbers for shlibs
Next
From: Jim Nasby
Date:
Subject: Re: PATCH: Exclude additional directories in pg_basebackup