Thread: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

Hi hackers,

eqjoinsel() can be optimized by not reading MCV stats if at least one of 
the two join attributes is unique. As primary keys are implicitly unique 
this situation can occur frequently. For unique columns no MCV stats are 
stored and eqjoinsel_inner() and eqjoinsel_semi(), called from 
eqjoinsel(), only consider MCV stats in the join selectivity computation 
if they're present on both columns. Attached is a small patch that 
implements the skipping.

With this change we saw some queries improve planning time by more than 
2x, especially with larger values for default_statistics_target. That's 
because get_attstatsslot() deconstructs the array holding the MCV. The 
size of that array depends on default_statistics_target.

Thanks for your consideration!

--
David Geier
(ServiceNow)

Attachment
David Geier <geidav.pg@gmail.com> writes:
> eqjoinsel() can be optimized by not reading MCV stats if at least one of 
> the two join attributes is unique.

There won't *be* any MCV stats for a column that ANALYZE perceives to
be unique, so I'm not quite sure where the claimed savings comes from.

> With this change we saw some queries improve planning time by more than 
> 2x, especially with larger values for default_statistics_target.

Please provide a concrete example.

            regards, tom lane



Hi Tom,
> There won't *be* any MCV stats for a column that ANALYZE perceives to
> be unique, so I'm not quite sure where the claimed savings comes from.
We save if one join attribute is unique while the other isn't. In that 
case stored MCV stats are read for the non-unique attribute but then 
never used. This is because MCV stats in join selectivity estimation are 
only used if they're present on both columns
> Please provide a concrete example.

A super simple case already showing a significant speedup is the 
following. The more ways to join two tables and the more joins overall, 
the higher the expected gain.

CREATE TABLE bar(col INT UNIQUE);
CREATE TABLE foo (col INT);
INSERT INTO foo SELECT generate_series(1, 1000000, 0.5);
SET default_statistics_target = 10000;
ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo, bar WHERE foo.col = bar.col;

Running the above query five times gave me average runtimes of:

- 0.62 ms without the patch and
- 0.48 ms with the patch.

--
David Geier
(ServiceNow)





On 11/14/22 10:19, David Geier wrote:
> Hi Tom,
>> There won't *be* any MCV stats for a column that ANALYZE perceives to
>> be unique, so I'm not quite sure where the claimed savings comes from.
>
> We save if one join attribute is unique while the other isn't. In that
> case stored MCV stats are read for the non-unique attribute but then
> never used. This is because MCV stats in join selectivity estimation are
> only used if they're present on both columns
>

Right - if we only have MCV on one side of the join, we currently end up
loading the MCV we have only to not use it anyway. The uniqueness is a
simple way to detect some of those cases. I'd bet the savings can be
quite significant for small joins and/or cases with large MCV.

I wonder if we might be yet a bit smarter, though.

For example, assume the first attribute is not defined as "unique" but
we still don't have a MCV (it may be unique - or close to unique - in
practice, or maybe it's just uniform distribution). We end up with

  have_mcvs1 = false

Can't we just skip trying to load the second MCV? So we could do

    if (have_mcvs1 && HeapTupleIsValid(vardata2.statsTuple))
    { ... try loading mcv2 ... }

Or perhaps what if we have a function that quickly determines if the
attribute has MCV, without loading it? I'd bet the expensive part of
get_attstatslot() is the deconstruct_array().

We could have a function that only does the first small loop over slots,
and returns true/false if we have a slot of the requested stakind. It
might even check the isunique flag first, to make it more convenient.

And only if both sides return "true" we'd load the MCV, deconstruct the
array and all that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> Or perhaps what if we have a function that quickly determines if the
> attribute has MCV, without loading it? I'd bet the expensive part of
> get_attstatslot() is the deconstruct_array().
> We could have a function that only does the first small loop over slots,
> and returns true/false if we have a slot of the requested stakind.

Yeah, I like this idea.

> It might even check the isunique flag first, to make it more convenient.

That would tie it to this one use-case, which doesn't seem advisable.
I think we should forget the known-unique angle and just do quick
checks to see if both sides have MCVs.

            regards, tom lane



I wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> Or perhaps what if we have a function that quickly determines if the
>> attribute has MCV, without loading it? I'd bet the expensive part of
>> get_attstatslot() is the deconstruct_array().
>> We could have a function that only does the first small loop over slots,
>> and returns true/false if we have a slot of the requested stakind.

> Yeah, I like this idea.

Actually, looking at get_attstatslot, I realize it was already designed
to do that -- just pass zero for flags.  So we could do it as attached.

We could make some consequent simplifications by only retaining one
"have_mcvs" flag, but I'm inclined to leave the rest of the code as-is.
We would not get much gain from that, and it would make this harder
to undo if there ever is a reason to consider just one set of MCVs.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..e0aeaa6909 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2261,6 +2261,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
     Form_pg_statistic stats2 = NULL;
     bool        have_mcvs1 = false;
     bool        have_mcvs2 = false;
+    bool        get_mcv_stats;
     bool        join_is_reversed;
     RelOptInfo *inner_rel;

@@ -2275,11 +2276,25 @@ eqjoinsel(PG_FUNCTION_ARGS)
     memset(&sslot1, 0, sizeof(sslot1));
     memset(&sslot2, 0, sizeof(sslot2));

+    /*
+     * There is no use in fetching one side's MCVs if we lack MCVs for the
+     * other side, so do a quick check to verify that both stats exist.
+     */
+    get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+                     HeapTupleIsValid(vardata2.statsTuple) &&
+                     get_attstatsslot(&sslot1, vardata1.statsTuple,
+                                      STATISTIC_KIND_MCV, InvalidOid,
+                                      0) &&
+                     get_attstatsslot(&sslot2, vardata2.statsTuple,
+                                      STATISTIC_KIND_MCV, InvalidOid,
+                                      0));
+
     if (HeapTupleIsValid(vardata1.statsTuple))
     {
         /* note we allow use of nullfrac regardless of security check */
         stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-        if (statistic_proc_security_check(&vardata1, opfuncoid))
+        if (get_mcv_stats &&
+            statistic_proc_security_check(&vardata1, opfuncoid))
             have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
                                           STATISTIC_KIND_MCV, InvalidOid,
                                           ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2304,8 @@ eqjoinsel(PG_FUNCTION_ARGS)
     {
         /* note we allow use of nullfrac regardless of security check */
         stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-        if (statistic_proc_security_check(&vardata2, opfuncoid))
+        if (get_mcv_stats &&
+            statistic_proc_security_check(&vardata2, opfuncoid))
             have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
                                           STATISTIC_KIND_MCV, InvalidOid,
                                           ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);


On Fri, Nov 18, 2022 at 9:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Actually, looking at get_attstatslot, I realize it was already designed
to do that -- just pass zero for flags.  So we could do it as attached.
 
Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists without loading it.  Do you think we
need to emphasize this use case in the comments for 'flags'?  It seems
currently there is no such use case in the codes on HEAD.

I wonder whether we need to also check statistic_proc_security_check()
when determining if MCVs exists in both sides.

Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
> return whether the slot type exists without loading it.  Do you think we
> need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.

> I wonder whether we need to also check statistic_proc_security_check()
> when determining if MCVs exists in both sides.

Yeah, I thought about hoisting the statistic_proc_security_check
tests up into get_mcv_stats.  I don't think it's a great idea
though.  Again, it'd complicate untangling this if we ever
generalize the use of MCVs in this function.  Also, I don't
think we should be micro-optimizing the case where the security
check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles
here.

            regards, tom lane



Thanks everyone for the great feedback and suggestions.

>
>> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
>> return whether the slot type exists without loading it.  Do you think we
>> need to emphasize this use case in the comments for 'flags'?
> Perhaps, it's not really obvious now.

Comment added.


> I wonder whether we need to also check statistic_proc_security_check()
>> when determining if MCVs exists in both sides.
> Yeah, I thought about hoisting the statistic_proc_security_check
> tests up into get_mcv_stats.  I don't think it's a great idea
> though.  Again, it'd complicate untangling this if we ever
> generalize the use of MCVs in this function.  Also, I don't
> think we should be micro-optimizing the case where the security
> check doesn't pass --- if it doesn't, you're going to be hurting
> from bad plans a lot more than you are from some wasted cycles
> here.

Sounds reasonable.

Attached is v2 of the patch.
This is basically Tom's version plus a comment for the flags of 
get_attstatslot() as suggested by Richard.

I couldn't come up with any reasonable way of writing an automated test 
for that.
Any ideas?

--
David Geier
(ServiceNow)

Attachment
On 11/18/22 09:54, David Geier wrote:
> Thanks everyone for the great feedback and suggestions.
> 
>>
>>> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
>>> return whether the slot type exists without loading it.  Do you think we
>>> need to emphasize this use case in the comments for 'flags'?
>> Perhaps, it's not really obvious now.
> 
> Comment added.
> 
> 
>> I wonder whether we need to also check statistic_proc_security_check()
>>> when determining if MCVs exists in both sides.
>> Yeah, I thought about hoisting the statistic_proc_security_check
>> tests up into get_mcv_stats.  I don't think it's a great idea
>> though.  Again, it'd complicate untangling this if we ever
>> generalize the use of MCVs in this function.  Also, I don't
>> think we should be micro-optimizing the case where the security
>> check doesn't pass --- if it doesn't, you're going to be hurting
>> from bad plans a lot more than you are from some wasted cycles
>> here.
> 
> Sounds reasonable.
> 
> Attached is v2 of the patch.
> This is basically Tom's version plus a comment for the flags of
> get_attstatslot() as suggested by Richard.
> 

Seems fine. I wonder if we could/could introduce a new constant for 0,
similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

> I couldn't come up with any reasonable way of writing an automated test
> for that.
> Any ideas?
> 

I don't think you can write a test for this, because there is no change
to behavior that can be observed by the user. If one side has no MCV,
the only difference is whether we try to load the other MCV or not.
There's no impact on estimates, because we won't use it.

IMO the best thing we can do is check coverage, that the new code is
exercised in regression tests. And I think that's fine.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 11/18/22 09:54, David Geier wrote:
>> I couldn't come up with any reasonable way of writing an automated test
>> for that.

> I don't think you can write a test for this, because there is no change
> to behavior that can be observed by the user.

Yeah, and the delta in performance is surely too small to be
measured reliably in the buildfarm.  I think coverage will have
to be sufficient.

            regards, tom lane



On 11/18/22 14:00, Tomas Vondra wrote:
> Seems fine. I wonder if we could/could introduce a new constant for 0,
> similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
> magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.
Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.
> I don't think you can write a test for this, because there is no change
> to behavior that can be observed by the user. If one side has no MCV,
> the only difference is whether we try to load the other MCV or not.

Yeah. I thought along the lines of checking the number of pages read 
when the pg_stats entry is not in syscache yet. But that seems awfully 
implementation specific. So no test provided.

-- 
David Geier
(ServiceNow)

Attachment
David Geier <geidav.pg@gmail.com> writes:
> On 11/18/22 14:00, Tomas Vondra wrote:
>> Seems fine. I wonder if we could/could introduce a new constant for 0,
>> similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
>> magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

> Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.

No, I don't think it's a good idea.  The flags argument is documented as,
and used as, a bitmask of multiple options.  Passing zero fits fine with
that and is consistent with what we do elsewhere.  Turning it into
sort-of-an-enum-but-not-really isn't an improvement.

I didn't like your draft comment too much, because it didn't cover
what I think is the most important point: after a call with flags=0
we do not need a matching free_attstatsslot call to avoid leaking
anything.  (If we did, this patch would be a lot hairier.)

I rewrote the comment the way I wanted it and pushed.

            regards, tom lane