Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistics per relation, i.e.
given a table with two extended stats

    CREATE TABLE t (a int, b int, c int, d int);
    CREATE STATISTICS s1 (mcv) ON a, b FROM t;
    CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

    SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

    while ()
    {
        ... apply another MCV list ...
    }

    ... apply functional dependencies ...


I've considered applying both in the loop, but I think that'd be wrong -
the MCV list is expected to contain more information about individual
values (compared to functional deps, which are column-level).
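To illustrate the shape of that loop, here is a hypothetical Python sketch (not the actual C code in the patch) of the greedy selection: in each round the statistics object covering the most not-yet-estimated attributes is applied, and clauses not covered by any object fall back to independent per-clause estimates. The selectivity constants are placeholders, not values the planner would actually compute.

```python
def estimate(clauses, stats, group_sel=0.1, single_sel=0.5):
    """Greedy multi-statistics estimation sketch.

    clauses: list of attribute-name sets, one per clause.
    stats:   dict mapping statistics name -> set of covered attributes.
    group_sel / single_sel: placeholder selectivities standing in for
    the real MCV-based and per-clause estimates.
    """
    estimated = [False] * len(clauses)
    remaining_stats = dict(stats)
    sel = 1.0
    applied = []
    while True:
        # attributes referenced by clauses we have not estimated yet
        attnums = set()
        for c, done in zip(clauses, estimated):
            if not done:
                attnums |= c
        # pick the "best" statistics: the one covering the most remaining
        # attributes (at least two, otherwise extended stats cannot help)
        best, best_cov = None, 1
        for name, keys in remaining_stats.items():
            cov = len(keys & attnums)
            if cov > best_cov:
                best, best_cov = name, cov
        if best is None:
            break
        # estimate all clauses fully covered by the chosen statistics
        for i, c in enumerate(clauses):
            if not estimated[i] and c <= remaining_stats[best]:
                estimated[i] = True
        sel *= group_sel
        applied.append(best)
        del remaining_stats[best]
    # clauses not covered by any statistics: assume independence
    sel *= single_sel ** sum(1 for done in estimated if not done)
    return sel, applied

# WHERE a = 1 AND b = 1 AND c = 1 AND d = 1, with stats on (a,b) and (c,d):
sel, applied = estimate([{"a"}, {"b"}, {"c"}, {"d"}],
                        {"s1": {"a", "b"}, "s2": {"c", "d"}})
```

With the placeholder numbers above, both s1 and s2 get applied (giving group_sel * group_sel), whereas before the patch only one of them would be used.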


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>Hi,
>
>PostgreSQL 10 introduced extended statistics, allowing us to consider
>correlation between columns to improve estimates, and PostgreSQL 12
>added support for MCV statistics. But we still had the limitation that
>we only allowed using a single extended statistics per relation, i.e.
>given a table with two extended stats
>
>   CREATE TABLE t (a int, b int, c int, d int);
>   CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>   CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>
>and a query
>
>   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>
>we only ever used one of the statistics (and we considered them in a not
>particularly well determined order).
>
>This patch addresses this by using as many extended stats as possible,
>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>we pick the "best" applicable statistics (in the sense of covering the
>most attributes) and factor it into the overall estimate.
>
>All this happens where we'd originally consider applying a single MCV
>list, i.e. before even considering the functional dependencies, so
>roughly like this:
>
>   while ()
>   {
>       ... apply another MCV list ...
>   }
>
>   ... apply functional dependencies ...
>
>
>I've considered applying both in the loop, but I think that'd be wrong -
>the MCV list is expected to contain more information about individual
>values (compared to functional deps, which are column-level).
>

Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality
(applying multiple statistics was part of the very first patch series,
although of course in a very different form). So unless there are
objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling
overlapping statistics. Consider a table with columns (a,b,c) and two
extended statistics on (a,b) and (b,c), and query with one clause per
column

   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then we
have just a single clause remaining. What we could do is still apply the
(b,c) statistic, using the already-estimated clause on b as a condition.
So essentially we'd compute

    P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.
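To make the idea concrete, here is a small numeric illustration in Python (assumed data, not PostgreSQL code) of why the conditional decomposition P(a=1 && b=1) * P(c=1 | b=1) can beat treating c as independent when b and c are correlated:

```python
# Rows of a hypothetical table t(a, b, c) in which b and c are correlated.
rows = [(1, 1, 1)] * 4 + [(1, 1, 0)] * 1 + [(0, 0, 0)] * 5

def p(pred):
    """Exact probability of a predicate over the sample rows."""
    return sum(1 for r in rows if pred(r)) / len(rows)

p_ab = p(lambda r: r[0] == 1 and r[1] == 1)  # P(a=1, b=1), from stats on (a,b)
p_b  = p(lambda r: r[1] == 1)                # P(b=1)
p_bc = p(lambda r: r[1] == 1 and r[2] == 1)  # P(b=1, c=1), from stats on (b,c)
p_c  = p(lambda r: r[2] == 1)                # P(c=1), per-column estimate

# Proposed: use the (b,c) statistics conditionally.
conditional = p_ab * (p_bc / p_b)            # P(a,b) * P(c | b)
# Current behaviour: the clause on c is estimated independently.
independent = p_ab * p_c
actual = p(lambda r: r == (1, 1, 1))
```

On this made-up data the conditional estimate is exact (0.4) while the independence assumption gives 0.2; real data won't be this clean, but the direction of the improvement is the point.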

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>Hi,
>>
>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>correlation between columns to improve estimates, and PostgreSQL 12
>>added support for MCV statistics. But we still had the limitation that
>>we only allowed using a single extended statistics per relation, i.e.
>>given a table with two extended stats
>>
>>  CREATE TABLE t (a int, b int, c int, d int);
>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>
>>and a query
>>
>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>
>>we only ever used one of the statistics (and we considered them in a not
>>particularly well determined order).
>>
>>This patch addresses this by using as many extended stats as possible,
>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>we pick the "best" applicable statistics (in the sense of covering the
>>most attributes) and factor it into the overall estimate.
>>
>>All this happens where we'd originally consider applying a single MCV
>>list, i.e. before even considering the functional dependencies, so
>>roughly like this:
>>
>>  while ()
>>  {
>>      ... apply another MCV list ...
>>  }
>>
>>  ... apply functional dependencies ...
>>
>>
>>I've considered applying both in the loop, but I think that'd be wrong -
>>the MCV list is expected to contain more information about individual
>>values (compared to functional deps, which are column-level).
>>
>
>Here is a slightly polished v2 of the patch, the main difference being
>that computing clause_attnums was moved to a separate function.
>

This time with the attachment ;-)


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using multiple extended statistics for estimates

From
Kyotaro Horiguchi
Date:
Hello.

At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in 
> >Here is a slightly polished v2 of the patch, the main difference being
> >that computing clause_attnums was moved to a separate function.
> >
> 
> This time with the attachment ;-)

This patch is fairly straightforward: it repeats what the previous
statext_mcv_clauselist_selectivity() did, as long as the remaining
clauses match any of the MV-MCVs. There is almost no regression in the
cases where zero or just one MV-MCV applies to the given clause list.

It applies cleanly on the current master and seems to work as
expected.


I have some comments.

Could we have a description in the documentation of how multiple
MV-MCVs are used in a query? And don't we need some regression tests?


+/*
+ * statext_mcv_clause_attnums
+ *        Recalculate attnums from compatible but not-yet-estimated clauses.

It returns attnums collected from multiple clause*s*. Is the name
"clause_attnums" (singular "clause") appropriate?

The comment reads as if the function checks the compatibility of each
clause, but that work is actually done on the caller side. I'm not sure
such strictness is required, but it might be better for the comment to
describe what the function actually does.


+ */
+static Bitmapset *
+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
+                           Bitmapset **list_attnums)

The last two parameters have the same type in notation but different
actual types: one is a pointer to a Bitmapset*, and the other is an
array of Bitmapset*. The code in the function itself suggests that, but
it would be helpful to have a brief explanation of the parameters in
the function comment.

+        /*
+         * Recompute attnums in the remaining clauses (we simply use the bitmaps
+         * computed earlier, so that we don't have to inspect the clauses again).
+         */
+        clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),

Couldn't we avoid calling this function twice with the same parameters
in the first round of the loop?

+        foreach(l, clauses)
         {
-            stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
-            *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+            /*
+             * If the clause is compatible with the selected statistics, mark it
+             * as estimated and add it to the list to estimate.
+             */
+            if (list_attnums[listidx] != NULL &&
+                bms_is_subset(list_attnums[listidx], stat->keys))
+            {
+                stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
+                *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+            }

The loop runs through all clauses every time. I agree that this is
better than using a copy of the clauses to avoid stepping on already
estimated clauses, but maybe we need an assertion that listidx is not
already part of estimatedclauses, to make sure no clause is estimated
twice.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Thu, Nov 07, 2019 at 01:38:20PM +0900, Kyotaro Horiguchi wrote:
>Hello.
>
>At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
>> >Here is a slightly polished v2 of the patch, the main difference being
>> >that computing clause_attnums was moved to a separate function.
>> >
>>
>> This time with the attachment ;-)
>
>This patch is a kind of straight-forward, which repeats what the
>previous statext_mcv_clauselist_selectivity did as long as remaining
>clauses matches any of MV-MCVs. Almost no regression in the cases
>where zero or just one MV-MCV applies to the given clause list.
>
>It applies cleanly on the current master and seems working as
>expected.
>
>
>I have some comments.
>
>Could we have a description in the documentation of how multiple
>MV-MCVs are used in a query? And don't we need some regression tests?
>

Yes, regression tests are certainly needed - I thought I'd added them,
but it seems I failed to include them in the patch. Will fix.

I agree it's probably worth mentioning that we can consider multiple
stats, but I'm a bit hesitant to put the exact rules for how we pick the
"best" statistic into the docs. It's not 100% deterministic, and it's
likely we'll need to tweak it a bit in the future.

I'd prefer showing the stats in EXPLAIN, but that's a separate patch.

>
>+/*
>+ * statext_mcv_clause_attnums
>+ *        Recalculate attnums from compatible but not-yet-estimated clauses.
>
>It returns attnums collected from multiple clause*s*. Is the name
>"clause_attnums" (singular "clause") appropriate?
>
>The comment reads as if the function checks the compatibility of each
>clause, but that work is actually done on the caller side. I'm not sure
>such strictness is required, but it might be better for the comment to
>describe what the function actually does.
>

Well, the incompatible clauses have their pre-computed attnums set to
NULL, so technically the comment is correct. But I'll clarify it.

>
>+ */
>+static Bitmapset *
>+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
>+                           Bitmapset **list_attnums)
>
>The last two parameters have the same type in notation but different
>actual types: one is a pointer to a Bitmapset*, and the other is an
>array of Bitmapset*. The code in the function itself suggests that, but
>it would be helpful to have a brief explanation of the parameters in
>the function comment.
>

OK, will explain in a comment.

>+        /*
>+         * Recompute attnums in the remaining clauses (we simply use the bitmaps
>+         * computed earlier, so that we don't have to inspect the clauses again).
>+         */
>+        clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),
>
>Couldn't we avoid calling this function twice with the same parameters
>in the first round of the loop?
>

Hmmm, yeah. That's a good point.

>+        foreach(l, clauses)
>         {
>-            stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>-            *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+            /*
>+             * If the clause is compatible with the selected statistics, mark it
>+             * as estimated and add it to the list to estimate.
>+             */
>+            if (list_attnums[listidx] != NULL &&
>+                bms_is_subset(list_attnums[listidx], stat->keys))
>+            {
>+                stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>+                *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+            }
>
>The loop runs through all clauses every time. I agree that this is
>better than using a copy of the clauses to avoid stepping on already
>estimated clauses, but maybe we need an assertion that listidx is not
>already part of estimatedclauses, to make sure no clause is estimated
>twice.
>

Well, we can't really operate on a smaller "copy" of the list anyway,
because that would break the precalculation logic (the listidx value
would be incorrect for the new list), and tweaking it would be more
expensive than just iterating over all clauses. The assumption is that
we won't see an extremely large number of clauses here.

Adding an assert seems reasonable. And maybe a comment explaining why we
should not see any already-estimated clauses here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/6/19 11:58 AM, Tomas Vondra wrote:
> On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>> On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>> Hi,
>>>
>>> PostgreSQL 10 introduced extended statistics, allowing us to consider
>>> correlation between columns to improve estimates, and PostgreSQL 12
>>> added support for MCV statistics. But we still had the limitation that
>>> we only allowed using a single extended statistics per relation, i.e.
>>> given a table with two extended stats
>>>
>>>  CREATE TABLE t (a int, b int, c int, d int);
>>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>
>>> and a query
>>>
>>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>
>>> we only ever used one of the statistics (and we considered them in a not
>>> particularly well determined order).
>>>
>>> This patch addresses this by using as many extended stats as possible,
>>> by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>> we pick the "best" applicable statistics (in the sense of covering the
>>> most attributes) and factor it into the overall estimate.

Tomas,

Your patch compiles and passes the regression tests for me on debian 
linux under master.

Since your patch does not include modified regression tests, I wrote a 
test that I expected to improve under this new code, but running it both 
before and after applying your patch, there is no change.  Please find 
the modified test attached.  Am I wrong to expect some change in this 
test's output?  If so, can you provide a test example that works 
differently under your patch?

Thanks!


-- 
Mark Dilger

Attachment

Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/9/19 12:33 PM, Mark Dilger wrote:
> 
> 
> On 11/6/19 11:58 AM, Tomas Vondra wrote:
>> On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>> On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>> Hi,
>>>>
>>>> PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>> correlation between columns to improve estimates, and PostgreSQL 12
>>>> added support for MCV statistics. But we still had the limitation that
>>>> we only allowed using a single extended statistics per relation, i.e.
>>>> given a table with two extended stats
>>>>
>>>>  CREATE TABLE t (a int, b int, c int, d int);
>>>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>
>>>> and a query
>>>>
>>>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>
>>>> we only ever used one of the statistics (and we considered them in a 
>>>> not
>>>> particularly well determined order).
>>>>
>>>> This patch addresses this by using as many extended stats as possible,
>>>> by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>> we pick the "best" applicable statistics (in the sense of covering the
>>>> most attributes) and factor it into the overall estimate.
> 
> Tomas,
> 
> Your patch compiles and passes the regression tests for me on debian 
> linux under master.
> 
> Since your patch does not include modified regression tests, I wrote a 
> test that I expected to improve under this new code, but running it both 
> before and after applying your patch, there is no change.

Ok, the attached test passes before applying your patch and fails
afterward, owing to the estimates improving and no longer matching the
expected output.  To be clear, this confirms that your patch works as
expected.

I haven't seen any crashes in several hours of running different tests, 
so I think it looks good.


-- 
Mark Dilger

Attachment

Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Sat, Nov 09, 2019 at 12:33:05PM -0800, Mark Dilger wrote:
>
>
>On 11/6/19 11:58 AM, Tomas Vondra wrote:
>>On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>>Hi,
>>>>
>>>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>>correlation between columns to improve estimates, and PostgreSQL 12
>>>>added support for MCV statistics. But we still had the limitation that
>>>>we only allowed using a single extended statistics per relation, i.e.
>>>>given a table with two extended stats
>>>>
>>>> CREATE TABLE t (a int, b int, c int, d int);
>>>> CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>> CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>
>>>>and a query
>>>>
>>>> SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>
>>>>we only ever used one of the statistics (and we considered them in a not
>>>>particularly well determined order).
>>>>
>>>>This patch addresses this by using as many extended stats as possible,
>>>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>>we pick the "best" applicable statistics (in the sense of covering the
>>>>most attributes) and factor it into the overall estimate.
>
>Tomas,
>
>Your patch compiles and passes the regression tests for me on debian 
>linux under master.
>

Thanks.

>Since your patch does not include modified regression tests, I wrote a 
>test that I expected to improve under this new code, but running it 
>both before and after applying your patch, there is no change.  Please 
>find the modified test attached.  Am I wrong to expect some change in 
>this test's output?  If so, can you provide a test example that works 
>differently under your patch?
>

Those queries are not improved by the patch, because we only support
clauses of the form "Var op Const" for now - your tests use "Var op
Var", so that doesn't work.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Sat, Nov 09, 2019 at 02:32:27PM -0800, Mark Dilger wrote:
>
>
>On 11/9/19 12:33 PM, Mark Dilger wrote:
>>
>>
>>On 11/6/19 11:58 AM, Tomas Vondra wrote:
>>>On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>>>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>>>Hi,
>>>>>
>>>>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>>>correlation between columns to improve estimates, and PostgreSQL 12
>>>>>added support for MCV statistics. But we still had the limitation that
>>>>>we only allowed using a single extended statistics per relation, i.e.
>>>>>given a table with two extended stats
>>>>>
>>>>> CREATE TABLE t (a int, b int, c int, d int);
>>>>> CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>>> CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>>
>>>>>and a query
>>>>>
>>>>> SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>>
>>>>>we only ever used one of the statistics (and we considered 
>>>>>them in a not
>>>>>particularly well determined order).
>>>>>
>>>>>This patch addresses this by using as many extended stats as possible,
>>>>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>>>we pick the "best" applicable statistics (in the sense of covering the
>>>>>most attributes) and factor it into the overall estimate.
>>
>>Tomas,
>>
>>Your patch compiles and passes the regression tests for me on debian 
>>linux under master.
>>
>>Since your patch does not include modified regression tests, I wrote 
>>a test that I expected to improve under this new code, but running 
>>it both before and after applying your patch, there is no change.
>
>Ok, the attached test passes before applying your patch and fails 
>afterward owing to the estimates improving and no longer matching the 
>expected output.  To be clear, this confirms your patch working as 
>expected.
>
>I haven't seen any crashes in several hours of running different 
>tests, so I think it looks good.
>

Yep, thanks for adding the tests. I'll include them into the patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).

There's also a new piece, dealing with functional dependencies. Until
now we did the same thing as for MCV lists - we picked the "best"
extended statistics (with functional dependencies built) and just used
that. At first I thought we might simply do the same loop as for MCV
lists, but that does not really make sense, because we might end up
applying a "weaker" dependency first.

Say for example we have a table with columns (a,b,c,d,e) and functional
dependencies on (a,b,c,d) and (c,d,e), where all the dependencies on
(a,b,c,d) are weaker than (c,d => e). In a query with clauses on all
attributes, this is guaranteed to apply all dependencies from the first
statistic first, which is clearly wrong.

So what this patch does instead is simply merge all the dependencies
from all the relevant stats and treat them as a single collection.
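A tiny Python sketch of the difference (the dependency names and degrees below are made up for illustration): applying statistics one at a time fixes the order within each object, while merging lets the globally strongest dependency be considered first.

```python
# Hypothetical dependency degrees; in reality these come from ANALYZE.
deps_abcd = [("a,b => c", 0.4), ("a,b,c => d", 0.3), ("a => b", 0.2)]
deps_cde  = [("c,d => e", 0.9)]

# Per-statistics application: everything from the first object is
# considered before anything from the second, so the weak 0.2-0.4
# dependencies all precede the strong (c,d => e).
per_stat_order = [d for d, _ in deps_abcd] + [d for d, _ in deps_cde]

# Merged application: treat all dependencies as a single collection and
# always pick the strongest remaining one.
merged = sorted(deps_abcd + deps_cde, key=lambda d: d[1], reverse=True)
merged_order = [d for d, _ in merged]
```

In merged_order, "c,d => e" comes first, matching the intent; the real code also prefers dependencies covering more attributes, which this sketch ignores.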


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/13/19 7:28 AM, Tomas Vondra wrote:
> Hi,
> 
> here's an updated patch, with some minor tweaks based on the review and
> added tests (I ended up reworking those a bit, to make them more like
> the existing ones).

Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch applied 
on top of yours.

-- 
Mark Dilger

Attachment

Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>
>
>On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>Hi,
>>
>>here's an updated patch, with some minor tweaks based on the review and
>>added tests (I ended up reworking those a bit, to make them more like
>>the existing ones).
>
>Thanks, Tomas, for the new patch set!
>
>Attached are my review comments so far, in the form of a patch applied 
>on top of yours.
>

Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
   be useful? Can you explain?

2) I think you're right we can change find_strongest_dependency to do

    /* also skip weaker dependencies when attribute count matches */
    if (strongest->nattributes == dependency->nattributes &&
        strongest->degree >= dependency->degree)
        continue;

   That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

     * TODO: Improve this code comment.  Specifically, why would we
     * ignore that no rows will match?  It seems that such a discovery
     * would allow us to return an estimate of 0 rows, and that would
     * be useful.

   added to dependencies_clauselist_selectivity. Are you saying we
   should also compute selectivity estimates for individual clauses and
   use Min() as a limit? Maybe, but that seems unrelated to the patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/14/19 7:55 AM, Tomas Vondra wrote:
> On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>>
>>
>> On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>> Hi,
>>>
>>> here's an updated patch, with some minor tweaks based on the review and
>>> added tests (I ended up reworking those a bit, to make them more like
>>> the existing ones).
>>
>> Thanks, Tomas, for the new patch set!
>>
>> Attached are my review comments so far, in the form of a patch applied 
>> on top of yours.
>>
> 
> Thanks.
> 
> 1) It's not clear to me why adding 'const' to the List parameters would
>    be useful? Can you explain?

When I first started reviewing the functions, I didn't know if those 
lists were intended to be modified by the function.  Adding 'const' 
helps document that the function does not intend to change them.

> 2) I think you're right we can change find_strongest_dependency to do
> 
>     /* also skip weaker dependencies when attribute count matches */
>     if (strongest->nattributes == dependency->nattributes &&
>         strongest->degree >= dependency->degree)
>         continue;
> 
>    That'll skip some additional dependencies, which seems OK.
> 
> 3) It's not clear to me what you mean by
> 
>      * TODO: Improve this code comment.  Specifically, why would we
>      * ignore that no rows will match?  It seems that such a discovery
>      * would allow us to return an estimate of 0 rows, and that would
>      * be useful.
> 
>    added to dependencies_clauselist_selectivity. Are you saying we
>    should also compute selectivity estimates for individual clauses and
>    use Min() as a limit? Maybe, but that seems unrelated to the patch.

I mean that the comment right above that TODO is hard to understand. You 
seem to be saying that it is good and proper to only take the 
selectivity estimate from the final clause in the list, but then go on 
to say that other clauses might prove that no rows will match.  So that 
implies that by ignoring all but the last clause, we're ignoring such 
other clauses that prove no rows can match.  But why would we be 
ignoring those?

I am not arguing that your code is wrong.  I'm just critiquing the 
hard-to-understand phrasing of that code comment.

-- 
Mark Dilger



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote:
>
>
>On 11/14/19 7:55 AM, Tomas Vondra wrote:
>>On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>>>
>>>
>>>On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>>>Hi,
>>>>
>>>>here's an updated patch, with some minor tweaks based on the review and
>>>>added tests (I ended up reworking those a bit, to make them more like
>>>>the existing ones).
>>>
>>>Thanks, Tomas, for the new patch set!
>>>
>>>Attached are my review comments so far, in the form of a patch 
>>>applied on top of yours.
>>>
>>
>>Thanks.
>>
>>1) It's not clear to me why adding 'const' to the List parameters would
>>   be useful? Can you explain?
>
>When I first started reviewing the functions, I didn't know if those 
>lists were intended to be modified by the function.  Adding 'const' 
>helps document that the function does not intend to change them.
>

Hmmm, ok. I'll think about it, but we're not really using const* in this
way very much I think - at least not in the surrounding code.

>>2) I think you're right we can change find_strongest_dependency to do
>>
>>    /* also skip weaker dependencies when attribute count matches */
>>    if (strongest->nattributes == dependency->nattributes &&
>>        strongest->degree >= dependency->degree)
>>        continue;
>>
>>   That'll skip some additional dependencies, which seems OK.
>>
>>3) It's not clear to me what you mean by
>>
>>     * TODO: Improve this code comment.  Specifically, why would we
>>     * ignore that no rows will match?  It seems that such a discovery
>>     * would allow us to return an estimate of 0 rows, and that would
>>     * be useful.
>>
>>   added to dependencies_clauselist_selectivity. Are you saying we
>>   should also compute selectivity estimates for individual clauses and
>>   use Min() as a limit? Maybe, but that seems unrelated to the patch.
>
>I mean that the comment right above that TODO is hard to understand. 
>You seem to be saying that it is good and proper to only take the 
>selectivity estimate from the final clause in the list, but then go on 
>to say that other clauses might prove that no rows will match.  So 
>that implies that by ignoring all but the last clause, we're ignoring 
>such other clauses that prove no rows can match.  But why would we be 
>ignoring those?
>
>I am not arguing that your code is wrong.  I'm just critiquing the 
>hard-to-understand phrasing of that code comment.
>

Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like

  a = 1::int AND a = 1.0::numeric

where the Consts are of different type, yet both conditions can be true.

So it would be pretty tricky to do this, and the current code does not
even try to do that.

Instead, it just assumes that it's mostly fine to overestimate, because
then at runtime we'll simply end up with 0 rows here.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> For the case with equal Const values that should be mostly obvious, i.e.
> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

> The case with different Const values is harder, unfortunately. It might
> seem obvious that "a=1 AND a=2" means there are no matching rows, but
> that heavily relies on the semantics of the equality operator. And we
> can't simply compare the Const values either, I'm afraid, because there
> are cases with cross-type operators like
>   a = 1::int AND a = 1.0::numeric
> where the Consts are of different type, yet both conditions can be true.

FWIW, there's code in predtest.c to handle exactly that, at least for
types sharing a btree opfamily.  Whether it's worth applying that logic
here is unclear, but note that we've had the ability to recognize
redundant and contradictory clauses for a long time:

regression=# explain select * from tenk1 where two = 1;          
                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 1::bigint; 
                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 2::bigint;
                          QUERY PLAN                           
---------------------------------------------------------------
 Result  (cost=0.00..470.00 rows=1 width=244)
   One-Time Filter: false
   ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=1 width=244)
         Filter: (two = 1)
(4 rows)

It falls down on

regression=# explain select * from tenk1 where two = 1 and two = 2::numeric;
                        QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..520.00 rows=25 width=244)
   Filter: ((two = 1) AND ((two)::numeric = '2'::numeric))
(2 rows)

because numeric isn't in the same opfamily, so these clauses can't be
compared easily.

            regards, tom lane



Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/14/19 12:04 PM, Tomas Vondra wrote:
> Aha, I think I understand now - thanks for the explanation. You're right
> the comment is trying to explain why just taking the last clause for a
> given attnum is fine. I'll try to make the comment clearer.
> 
> For the case with equal Const values that should be mostly obvious, i.e.
> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".
> 
> The case with different Const values is harder, unfortunately. It might
> seem obvious that "a=1 AND a=2" means there are no matching rows, but
> that heavily relies on the semantics of the equality operator. And we
> can't simply compare the Const values either, I'm afraid, because there
> are cases with cross-type operators like
> 
>   a = 1::int AND a = 1.0::numeric
> 
> where the Consts are of different type, yet both conditions can be true.
> 
> So it would be pretty tricky to do this, and the current code does not
> even try to do that.
> 
> Instead, it just assumes that it's mostly fine to overestimate, because
> then at runtime we'll simply end up with 0 rows here.

I'm unsure whether that could be a performance problem at runtime.

I could imagine the planner short-circuiting additional planning when
it finds a plan with zero rows, and so we'd save planner time if we
avoid overestimating.  I don't recall if the planner does anything like
that, or if there are plans to implement such logic, but it might be
good not to rule it out.  Tom's suggestion elsewhere in this thread to
use code in predtest.c sounds good to me.

I don't know if you want to expand the scope of this particular patch to
include that, though.

-- 
Mark Dilger



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Thu, Nov 14, 2019 at 03:16:04PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> For the case with equal Const values that should be mostly obvious, i.e.
>> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".
>
>> The case with different Const values is harder, unfortunately. It might
>> seem obvious that "a=1 AND a=2" means there are no matching rows, but
>> that heavily relies on the semantics of the equality operator. And we
>> can't simply compare the Const values either, I'm afraid, because there
>> are cases with cross-type operators like
>>   a = 1::int AND a = 1.0::numeric
>> where the Consts are of different type, yet both conditions can be true.
>
>FWIW, there's code in predtest.c to handle exactly that, at least for
>types sharing a btree opfamily.  Whether it's worth applying that logic
>here is unclear, but note that we've had the ability to recognize
>redundant and contradictory clauses for a long time:
>
>regression=# explain select * from tenk1 where two = 1;
>                         QUERY PLAN
>------------------------------------------------------------
> Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
>   Filter: (two = 1)
>(2 rows)
>
>regression=# explain select * from tenk1 where two = 1 and two = 1::bigint;
>                         QUERY PLAN
>------------------------------------------------------------
> Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
>   Filter: (two = 1)
>(2 rows)
>
>regression=# explain select * from tenk1 where two = 1 and two = 2::bigint;
>                          QUERY PLAN
>---------------------------------------------------------------
> Result  (cost=0.00..470.00 rows=1 width=244)
>   One-Time Filter: false
>   ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=1 width=244)
>         Filter: (two = 1)
>(4 rows)
>
>It falls down on
>
>regression=# explain select * from tenk1 where two = 1 and two = 2::numeric;
>                        QUERY PLAN
>-----------------------------------------------------------
> Seq Scan on tenk1  (cost=0.00..520.00 rows=25 width=244)
>   Filter: ((two = 1) AND ((two)::numeric = '2'::numeric))
>(2 rows)
>
>because numeric isn't in the same opfamily, so these clauses can't be
>compared easily.
>
>            regards, tom lane

Yeah, and this logic still works - the redundant clauses won't even get
to the selectivity estimation, I think. So maybe the comment is not
really necessary, because the problem does not even arise ...

Maybe we could do something about the cases that predtest.c can't solve,
but it's not clear if we can be much smarter for types with different
opfamilies.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Thu, Nov 14, 2019 at 01:17:02PM -0800, Mark Dilger wrote:
>
>
>On 11/14/19 12:04 PM, Tomas Vondra wrote:
>>Aha, I think I understand now - thanks for the explanation. You're right
>>the comment is trying to explain why just taking the last clause for a
>>given attnum is fine. I'll try to make the comment clearer.
>>
>>For the case with equal Const values that should be mostly obvious, i.e.
>>"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".
>>
>>The case with different Const values is harder, unfortunately. It might
>>seem obvious that "a=1 AND a=2" means there are no matching rows, but
>>that heavily relies on the semantics of the equality operator. And we
>>can't simply compare the Const values either, I'm afraid, because there
>>are cases with cross-type operators like
>>
>>  a = 1::int AND a = 1.0::numeric
>>
>>where the Consts are of different type, yet both conditions can be true.
>>
>>So it would be pretty tricky to do this, and the current code does not
>>even try to do that.
>>
>>Instead, it just assumes that it's mostly fine to overestimate, because
>>then at runtime we'll simply end up with 0 rows here.
>
>I'm unsure whether that could be a performance problem at runtime.
>
>I could imagine the planner short-circuiting additional planning when
>it finds a plan with zero rows, and so we'd save planner time if we
>avoid overestimating.  I don't recall if the planner does anything like
>that, or if there are plans to implement such logic, but it might be
>good not to rule it out.  Tom's suggestion elsewhere in this thread to
>use code in predtest.c sounds good to me.
>

No, AFAIK the planner does not do anything like that - it might probably
do that if it could prove there are no such rows, but that's hardly the
case for estimates based on approximate information (i.e. statistics).

It could do that based on the predicate analysis in predtest.c mentioned
by Tom, although I don't think it does anything beyond tweaking the row
estimate to ~1 row.

>I don't know if you want to expand the scope of this particular patch to
>include that, though.
>

Certainly not. It's an interesting but surprisingly complicated problem,
and this patch simply aims to add a different improvement.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 11/14/19 12:04 PM, Tomas Vondra wrote:
> On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote:
>>
>>
>> On 11/14/19 7:55 AM, Tomas Vondra wrote:
>>> On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>>>>
>>>>
>>>> On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>>>> Hi,
>>>>>
>>>>> here's an updated patch, with some minor tweaks based on the review 
>>>>> and
>>>>> added tests (I ended up reworking those a bit, to make them more like
>>>>> the existing ones).
>>>>
>>>> Thanks, Tomas, for the new patch set!
>>>>
>>>> Attached are my review comments so far, in the form of a patch 
>>>> applied on top of yours.
>>>>
>>>
>>> Thanks.
>>>
>>> 1) It's not clear to me why adding 'const' to the List parameters would
>>>   be useful? Can you explain?
>>
>> When I first started reviewing the functions, I didn't know if those 
>> lists were intended to be modified by the function.  Adding 'const' 
>> helps document that the function does not intend to change them.
>>
> 
> Hmmm, ok. I'll think about it, but we're not really using const* in this
> way very much I think - at least not in the surrounding code.
> 
>>> 2) I think you're right we can change find_strongest_dependency to do
>>>
>>>    /* also skip weaker dependencies when attribute count matches */
>>>    if (strongest->nattributes == dependency->nattributes &&
>>>        strongest->degree >= dependency->degree)
>>>        continue;
>>>
>>>   That'll skip some additional dependencies, which seems OK.
>>>
>>> 3) It's not clear to me what you mean by
>>>
>>>     * TODO: Improve this code comment.  Specifically, why would we
>>>     * ignore that no rows will match?  It seems that such a discovery
>>>     * would allow us to return an estimate of 0 rows, and that would
>>>     * be useful.
>>>
>>>   added to dependencies_clauselist_selectivity. Are you saying we
>>>   should also compute selectivity estimates for individual clauses and
>>>   use Min() as a limit? Maybe, but that seems unrelated to the patch.
>>
>> I mean that the comment right above that TODO is hard to understand. 
>> You seem to be saying that it is good and proper to only take the 
>> selectivity estimate from the final clause in the list, but then go on 
>> to say that other clauses might prove that no rows will match.  So 
>> that implies that by ignoring all but the last clause, we're ignoring 
>> such other clauses that prove no rows can match.  But why would we be 
>> ignoring those?
>>
>> I am not arguing that your code is wrong.  I'm just critiquing the 
>> hard-to-understand phrasing of that code comment.
>>
> 
> Aha, I think I understand now - thanks for the explanation. You're right
> the comment is trying to explain why just taking the last clause for a
> given attnum is fine. I'll try to make the comment clearer.

Are you planning to submit a revised patch for this?



-- 
Mark Dilger



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:
>
>Are you planning to submit a revised patch for this?
>

Yes, I'll submit a rebased version of this patch shortly. It got broken
by the recent fix in choose_best_statistics; it shouldn't take long
to update the patch. I do have a couple more related patches in the
queue, so I want to submit them all at once.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:
>On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:
>>
>>Are you planning to submit a revised patch for this?
>>
>
>Yes, I'll submit a rebased version of this patch shortly. I got broken
>because of the recent fix in choose_best_statistics, shouldn't take long
>to update the patch. I do have a couple more related patches in the
>queue, so I want to submit them all at once.
>

OK, here we go - these two patches allow applying multiple extended
statistics, both for MCV and functional dependencies. Functional
dependencies are simply merged and then applied at once (so without
choose_best_statistics), while MCV statistics are considered in a
greedy manner by calling choose_best_statistics in a loop.
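For illustration, the greedy MCV loop might be sketched in Python roughly like this (a simplified model, not the actual C implementation; the function names and the "covers the most remaining attributes" scoring are assumptions standing in for choose_best_statistics):

```python
def greedy_apply_stats(stats, clause_attnums):
    """stats: list of frozensets, each the attributes covered by one
    extended statistics object. clause_attnums: attributes referenced
    by the query clauses.

    Repeatedly pick the statistics covering the most not-yet-covered
    attributes (the role of choose_best_statistics), stopping once no
    statistics cover at least two remaining attributes. Returns the
    statistics in the order they would be factored into the estimate.
    """
    remaining = set(clause_attnums)
    applied = []
    while True:
        best = max(stats, key=lambda s: len(s & remaining), default=None)
        if best is None or len(best & remaining) < 2:
            break  # no statistics object helps with >= 2 attributes
        applied.append(best)
        remaining -= best  # those attributes are now accounted for
    return applied

# With s1 on (a, b) and s2 on (c, d), a query touching all four
# attributes now uses both statistics instead of just one.
s1, s2 = frozenset({1, 2}), frozenset({3, 4})
assert greedy_apply_stats([s1, s2], {1, 2, 3, 4}) == [s1, s2]
```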

I do have some additional enhancements in the queue, but those are not
fully baked yet, so I'll post them later in separate patches.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:
>On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:
>>On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:
>>>
>>>Are you planning to submit a revised patch for this?
>>>
>>
>>Yes, I'll submit a rebased version of this patch shortly. I got broken
>>because of the recent fix in choose_best_statistics, shouldn't take long
>>to update the patch. I do have a couple more related patches in the
>>queue, so I want to submit them all at once.
>>
>
>OK, here we go - these two patched allow applying multiple extended
>statistics, both for MCV and functional dependencies. Functional
>dependencies are simply merged and then applied at once (so withouth
>choose_best_statistics), statistics are considered in greedy manner by
>calling choose_best_statistics in a loop.
>

OK, this time with the patches actually attached ;-)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 12/5/19 9:51 AM, Tomas Vondra wrote:
> On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:
>> On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:
>>> On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:
>>>>
>>>> Are you planning to submit a revised patch for this?
>>>>
>>>
>>> Yes, I'll submit a rebased version of this patch shortly. I got broken
>>> because of the recent fix in choose_best_statistics, shouldn't take long
>>> to update the patch. I do have a couple more related patches in the
>>> queue, so I want to submit them all at once.
>>>
>>
>> OK, here we go - these two patched allow applying multiple extended
>> statistics, both for MCV and functional dependencies. Functional
>> dependencies are simply merged and then applied at once (so withouth
>> choose_best_statistics), statistics are considered in greedy manner by
>> calling choose_best_statistics in a loop.
>>
> 
> OK, this time with the patches actually attached ;-)

These look good to me.  I added extra tests (not included in this email)
to verify the code on more interesting test cases, such as partitioned
tables and joins.  Your test cases are pretty trivial, just being
selects from a single table.

I'll go mark this "ready for committer".

-- 
Mark Dilger



Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
On Mon, Dec 09, 2019 at 11:56:39AM -0800, Mark Dilger wrote:
>
>
>On 12/5/19 9:51 AM, Tomas Vondra wrote:
>>On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote:
>>>On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote:
>>>>On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote:
>>>>>
>>>>>Are you planning to submit a revised patch for this?
>>>>>
>>>>
>>>>Yes, I'll submit a rebased version of this patch shortly. I got broken
>>>>because of the recent fix in choose_best_statistics, shouldn't take long
>>>>to update the patch. I do have a couple more related patches in the
>>>>queue, so I want to submit them all at once.
>>>>
>>>
>>>OK, here we go - these two patched allow applying multiple extended
>>>statistics, both for MCV and functional dependencies. Functional
>>>dependencies are simply merged and then applied at once (so withouth
>>>choose_best_statistics), statistics are considered in greedy manner by
>>>calling choose_best_statistics in a loop.
>>>
>>
>>OK, this time with the patches actually attached ;-)
>
>These look good to me.  I added extra tests (not included in this email)
>to verify the code on more interesting test cases, such as partitioned
>tables and within joins.  Your test cases are pretty trivial, just being
>selects from a single table.
>

Adding more complex tests like that seems like a good idea - maybe you'd
like to share them?

>I'll go mark this "ready for committer".
>

Thanks for the review. I'll hold off on the commit until the next CF,
though, just to give others a proper opportunity to look at it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Using multiple extended statistics for estimates

From
Mark Dilger
Date:

On 12/9/19 2:00 PM, Tomas Vondra wrote:
>>
>> These look good to me.  I added extra tests (not included in this email)
>> to verify the code on more interesting test cases, such as partitioned
>> tables and within joins.  Your test cases are pretty trivial, just being
>> selects from a single table.
>>
> 
> Adding such more complex tests seem like a good idea, maybe you'd like
> to share them?

You can find them attached.  I did not include them in my earlier email
because they seem a bit unrefined, taking too many lines of code for the
amount of coverage they provide.  But you can prune them down and add
them to the patch if you like.

These only test the functional dependencies.  If you want to include
something like them in your commit, you might create similar tests for
the mcv statistics, too.

-- 
Mark Dilger

Attachment

Re: Using multiple extended statistics for estimates

From
Tomas Vondra
Date:
Hi,

I've pushed these two patches after some minor improvements, mostly
to comments. I ended up not using the extra tests, as it wasn't clear to
me they're worth the extra duration.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services