Re: multivariate statistics (v19) - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: multivariate statistics (v19)
Date
Msg-id 3eb2a4b6-911b-29d9-7280-bbf4837acb05@2ndquadrant.com
Whole thread Raw
In response to Re: multivariate statistics (v19)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: multivariate statistics (v19)  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On 10/08/16 13:33, Tomas Vondra wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>> 2) combining multiple statistics
>>>
>>> I think the ability to combine multivariate statistics (covering
>>> different
>>> subsets of conditions) is important and useful, but I'm starting to
>>> think
>>> that the current implementation may not be the correct one (which is
>>> why I
>>> haven't written the SGML docs about this part of the patch series yet).
>>>
>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>> estimating query:
>>>
>>>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>
>>> but that we only have two statistics (a,b) and (b,c). The current
>>> patch does
>>> about this:
>>>
>>>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>
>>> i.e. it estimates the first two conditions using (a,b), and then
>>> estimates
>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>> efficient,
>>> but it only works as long as the query contains conditions
>>> "connecting" the
>>> two statistics. So if we remove the "b=2" condition from the query, this
>>> stops working.
>>
>> This is trying to make the algorithm smarter than the user, which is
>> something I'd think we could live without. In this case statistics on
>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>
>
> I don't think so. Obviously, if you have statistics covering all the
> conditions - great, we can't really do better than that.
>
> But there's a crucial relation between the number of dimensions of the
> statistics and accuracy of the statistics. Let's say you have statistics
> on 8 columns, and you split each dimension twice to build a histogram -
> that's 256 buckets right there, and we only get ~50% selectivity in each
> dimension (the actual histogram building algorithm is more complex, but
> you get the idea).
>

I think it makes sense to pursue this, but I also think we can easily 
live with not having it in the first version that gets committed and 
doing it as follow-up patch.

--   Petr Jelinek                  http://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: multivariate statistics (v19)
Next
From: Michael Paquier
Date:
Subject: Re: multivariate statistics (v19)