Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: multivariate statistics (v19)
Date
Msg-id CAB7nPqTCmL5ojXi4u-FX-u+J5mvhUaqsyz7zyK+6eojz2Ynuog@mail.gmail.com
Whole thread Raw
In response to Re: multivariate statistics (v19)  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: multivariate statistics (v19)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>
>>>> 2) combining multiple statistics
>>>>
>>>>
>>>> I think the ability to combine multivariate statistics (covering
>>>> different
>>>> subsets of conditions) is important and useful, but I'm starting to
>>>> think
>>>> that the current implementation may not be the correct one (which is
>>>> why I
>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>
>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>> estimating query:
>>>>
>>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>
>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>> patch does
>>>> about this:
>>>>
>>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>
>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>> estimates
>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>> efficient,
>>>> but it only works as long as the query contains conditions
>>>> "connecting" the
>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>> stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece to get a basic support, I'd suggest just to drop that for later.
--
Michael



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: multivariate statistics (v19)
Next
From: Ants Aasma
Date:
Subject: Re: multivariate statistics (v19)