Thread: eqjoinsel_semi still sucks ...

eqjoinsel_semi still sucks ...

From: Tom Lane
I looked into Maxim Boguk's complaint of bad estimation of antijoin size:
http://archives.postgresql.org/pgsql-general/2012-05/msg00033.php

I can reproduce what I think the problem is in the regression database.
We do okay with this:

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2);
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=395.26..1003.26 rows=1 width=244) (actual time=264.324..264.324 rows=0 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..47.798 rows=10000 loops=1)
   ->  Hash  (cost=270.26..270.26 rows=10000 width=4) (actual time=129.420..129.420 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 274kB
         ->  Index Only Scan using tenk1_unique2 on tenk1 b  (cost=0.00..270.26 rows=10000 width=4) (actual time=0.422..65.480 rows=10000 loops=1)
               Heap Fetches: 0
 Total runtime: 267.732 ms
(8 rows)

but not so okay when a filter condition is added inside the sub-select:

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=545.50..1091.00 rows=1 width=244) (actual time=123.713..265.185 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.048..46.685 rows=10000 loops=1)
   ->  Hash  (cost=483.00..483.00 rows=5000 width=4) (actual time=123.483..123.483 rows=5000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 137kB
         ->  Seq Scan on tenk1 b  (cost=0.00..483.00 rows=5000 width=4) (actual time=0.059..91.405 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Total runtime: 284.889 ms
(9 rows)

Now, eqjoinsel_semi is correctly estimating that the condition
a.thousand = b.unique2 is unselective in itself: all values of
a.thousand will have join partners in the first case.  The problem comes
in trying to account for the additional filter condition.  The heuristic
we're currently using is to reduce the number of distinct values assumed
for the inner variable according to the selectivity of the additional
conditions.  In this case, though, that results in reducing ndistinct
for b.unique2 from 10000 to 5000, which is still more than ndistinct for
a.thousand (i.e., 1000), so the final selectivity estimate doesn't
change at all.  Oops.
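
To spell out the arithmetic, here's a rough sketch (invented names, not the actual eqjoinsel_semi code) of the fallback estimate in play here:

#include <stdio.h>

/*
 * Rough sketch of the non-MCV fallback described above: the fraction of
 * outer rows assumed to have a join partner is nd2/nd1, capped at 1.
 */
static double
semi_join_fraction(double nd1, double nd2)
{
	/* every outer value finds a partner if the inner side has at least
	 * as many distinct values */
	return (nd1 <= nd2) ? 1.0 : nd2 / nd1;
}

int
main(void)
{
	/* clamping nd2 from 10000 to 5000 changes nothing while nd1 = 1000 */
	printf("nd2 = 10000: %.3f\n", semi_join_fraction(1000, 10000));
	printf("nd2 =  5000: %.3f\n", semi_join_fraction(1000, 5000));
	return 0;
}

Both calls print 1.000, which is why the antijoin row estimate stays pinned at rows=1 in both plans above.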

On reflection I think that the idea of clamping ndistinct beforehand is
just wrong, and what we ought to do instead is apply a multiplier to the
selectivity estimate afterwards.  In the case of a base rel we could
just multiply by the selectivity of its baserestrictinfo list.  For join
rels it's a bit harder to guess how much a given input relation might
have been decimated, but if the join's estimated size is smaller than
the output size of the base rel the correlation var came from, we could
multiply by that ratio (on top of whatever correction came from the base
rel's restriction clauses).
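
For concreteness, a sketch of what I mean (hypothetical names, nothing like committed code):

#include <stdio.h>

/*
 * Hypothetical sketch of the "apply a multiplier afterwards" idea.
 * restrict_selec is the selectivity of the inner rel's baserestrictinfo
 * list; join_ratio is the join's estimated size divided by the output
 * size of the base rel the correlation var came from (1.0 when the var
 * comes directly from a base rel).
 */
static double
corrected_semi_selec(double raw_selec, double restrict_selec,
					 double join_ratio)
{
	double		selec = raw_selec * restrict_selec * join_ratio;

	return (selec > 1.0) ? 1.0 : selec;
}

int
main(void)
{
	/* raw estimate: every a.thousand value has a partner (1.0);
	 * "two = 0" keeps half of the inner rel, so scale by 0.5 */
	printf("corrected selectivity: %.2f\n",
		   corrected_semi_selec(1.0, 0.5, 1.0));
	return 0;
}

That would put the semijoin selectivity at 0.5, i.e. about 5000 matched outer rows and about 5000 rows surviving the antijoin, close to the actual 5090.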

Thoughts?
        regards, tom lane


Re: eqjoinsel_semi still sucks ...

From: Andrey Lepikhov
On 2/5/2012 20:34, Tom Lane wrote:
> On reflection I think that the idea of clamping ndistinct beforehand is
> just wrong, and what we ought to do instead is apply a multiplier to the
> selectivity estimate afterwards.  In the case of a base rel we could
> just multiply by the selectivity of its baserestrictinfo list.  For join
> rels it's a bit harder to guess how much a given input relation might
> have been decimated, but if the join's estimated size is smaller than
> the output size of the base rel the correlation var came from, we could
> multiply by that ratio (on top of whatever correction came from the base
> rel's restriction clauses).
I got stuck in some cases where (due to a tree of filters) the planner
underestimates the JOIN just because ndistinct conveys a huge number to
the selectivity estimation formula, even though the estimates for both
input relations are correct and already limited.
I've tried to understand the logic through commits 0d3b231eebf,
97930cf578e and 7f3eba30c9d, but it is still not clear to me.
So why is the idea of clamping ndistinct terrible in general? Could you
explain your reasons a bit more?

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: eqjoinsel_semi still sucks ...

From: Alena Rybakina
On 23.06.2023 14:28, Andrey Lepikhov wrote:
> On 2/5/2012 20:34, Tom Lane wrote:
>> On reflection I think that the idea of clamping ndistinct beforehand is
>> just wrong, and what we ought to do instead is apply a multiplier to the
>> selectivity estimate afterwards.  In the case of a base rel we could
>> just multiply by the selectivity of its baserestrictinfo list.  For join
>> rels it's a bit harder to guess how much a given input relation might
>> have been decimated, but if the join's estimated size is smaller than
>> the output size of the base rel the correlation var came from, we could
>> multiply by that ratio (on top of whatever correction came from the base
>> rel's restriction clauses).
> I got stuck in some cases where (due to a tree of filters) the planner
> underestimates the JOIN just because ndistinct conveys a huge number to
> the selectivity estimation formula, even though the estimates for both
> input relations are correct and already limited.
> I've tried to understand the logic through commits 0d3b231eebf,
> 97930cf578e and 7f3eba30c9d, but it is still not clear to me.
> So why is the idea of clamping ndistinct terrible in general? Could you
> explain your reasons a bit more?

Hi, I am also interested in understanding this issue, and I dug into the above commits and the topic. I hope this letter helps to sort it out. I found some information on this subject in the discussion [1], more specifically in the email [2].

First of all, according to the post [2] and commit 0d3b231ee, clamping ndistinct is necessary because it allows us to assume that the values eliminated by the restriction clauses are non-MCV values, or potentially even the least-common MCVs, of the inner relation (values that the ANALYZE operation may not keep in the MCV list at all). Part of the code that decides how many distinct values are stored as MCVs is given below:

static void
compute_distinct_stats(VacAttrStatsP stats,
                       AnalyzeAttrFetchFunc fetchfunc,
                       int samplerows,
                       double totalrows)
{
...

/*
     * Decide how many values are worth storing as most-common values. If
     * we are able to generate a complete MCV list (all the values in the
     * sample will fit, and we think these are all the ones in the table),
     * then do so.  Otherwise, store only those values that are
     * significantly more common than the values not in the list.
     *
     * Note: the first of these cases is meant to address columns with
     * small, fixed sets of possible values, such as boolean or enum
     * columns.  If we can *completely* represent the column population by
     * an MCV list that will fit into the stats target, then we should do
     * so and thus provide the planner with complete information.  But if
     * the MCV list is not complete, it's generally worth being more
     * selective, and not just filling it all the way up to the stats
     * target.
     */
    if (track_cnt < track_max && toowide_cnt == 0 &&
      stats->stadistinct > 0 &&
      track_cnt <= num_mcv)
    {
      /* Track list includes all values seen, and all will fit */
      num_mcv = track_cnt;
    }
    else
    {
      int       *mcv_counts;

      /* Incomplete list; decide how many values are worth keeping */
      if (num_mcv > track_cnt)
        num_mcv = track_cnt;

      if (num_mcv > 0)
      {
        mcv_counts = (int *) palloc(num_mcv * sizeof(int));
        for (i = 0; i < num_mcv; i++)
          mcv_counts[i] = track[i].count;

        num_mcv = analyze_mcv_list(mcv_counts, num_mcv,
                       stats->stadistinct,
                       stats->stanullfrac,
                       samplerows, totalrows);
      }
    }

...
}

The reason the idea of clamping is used may be related to the fact that, during eqjoinsel_inner processing, the ndistinct estimate for the join key column is reduced in proportion to the selectivity of the baserel restrictions, which leads to a proportionally larger selectivity estimate for the join condition; we have to assume that the eliminated values are non-MCV values, or potentially the least-common MCVs, of the inner relation, since the most common values are the ones most likely to survive the decimation caused by a lower restriction clause.

This is discussed in [2]; I have copied a fragment below where it can be seen clearly:

If you think about what is happening in eqjoinsel_inner with the patch, we
are reducing the ndistinct estimate for the join key column proportionally
to the selectivity of whatever baserel restrictions apply.  This then
results in proportionally increasing the selectivity number for the join
condition --- in other words, we're more or less cancelling out the effects
of one or the other relation's base restrictions.  So that's pretty broken
in general.

The reason it is important for semi/antijoin inner relations is that this
is actually the only way that restrictions applied to the inner rel get to
impact the join size estimate at all, since set_joinrel_size_estimates is
not going to factor the inner rel size into what it multiplies the join
selectivity against.  In short, I was mistakenly extrapolating from the
observation that it helped to hack the ndistinct estimate for a semijoin's
inner rel, to the conclusion that we should do that for all join input
rels.

...

Yeah, those are estimating that all the outer rows have join partners,
because there are more distinct values in the sub-select than there are in
the outer relation.
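
To make the cancellation described in the first fragment concrete, here is a simplified sketch (invented names, using the 1/max(nd1, nd2) fallback rather than the full MCV logic):

#include <stdio.h>

/* Inner-join fallback estimate: selectivity ~ 1 / max(nd1, nd2) */
static double
inner_join_selec(double nd1, double nd2)
{
	return 1.0 / ((nd1 > nd2) ? nd1 : nd2);
}

int
main(void)
{
	double		rows1 = 10000, rows2 = 10000;
	double		nd1 = 1000, nd2 = 10000;

	/* no restriction on rel2 */
	printf("unfiltered:            %.0f rows\n",
		   rows1 * rows2 * inner_join_selec(nd1, nd2));

	/* a filter keeps half of rel2's rows, and nd2 is clamped in the
	 * same proportion: the two effects cancel and the join size
	 * estimate does not move */
	printf("filtered, clamped nd2: %.0f rows\n",
		   rows1 * (rows2 * 0.5) * inner_join_selec(nd1, nd2 * 0.5));
	return 0;
}

Both printouts say 10000 rows, so the base restriction's effect on the join size estimate has been cancelled out, exactly as described above.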

In addition, part of the answer to why clamping is necessary is in the commit message of 97930cf5: the earlier patch accounted only for baserel restriction clauses that reduce the number of rows coming out of a table (and hence the number of possibly-distinct values of a join variable), but not for join restriction clauses that might have been applied at a lower level of join, and 97930cf5 extends the clamping to cover those as well.

That patch accounted for baserel restriction clauses that reduced the number of rows coming out of a table (and hence the number of possibly-distinct values of a join variable), but not for join restriction clauses that might have been applied at a lower level of join.  To account for the latter, look up the sizes of the min_lefthand and min_righthand inputs of the current join, and clamp with those in the same way as for the base relations.
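
In other words, a simplified sketch of that clamping (invented names, not the committed code) looks roughly like this:

#include <stdio.h>

/*
 * A join variable cannot have more distinct values than the number of
 * rows estimated to reach the join from its side, whether that
 * reduction came from baserel restrictions or from joins applied at a
 * lower level.
 */
static double
clamp_ndistinct(double ndistinct, double input_rel_rows)
{
	return (ndistinct > input_rel_rows) ? input_rel_rows : ndistinct;
}

int
main(void)
{
	/* b.unique2 has 10000 distinct values, but only 5000 rows survive
	 * the "two = 0" restriction */
	printf("clamped ndistinct: %.0f\n", clamp_ndistinct(10000, 5000));
	return 0;
}

Applied to the example at the top of the thread this yields 5000, which is still larger than the 1000 distinct values of a.thousand, which is why the clamping alone did not move the estimate there.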


1. https://www.postgresql.org/message-id/flat/5156.1314829311%40sss.pgh.pa.us#86609b6ac6af405dec67316bfbe9868f

2. https://www.postgresql.org/message-id/raw/5156.1314829311%40sss.pgh.pa.us

----

Regards,

Alena Rybakina

Re: eqjoinsel_semi still sucks ...

From: Tom Lane
Alena Rybakina <lena.ribackina@yandex.ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck in some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because ndistinct conveys a huge number
>> to the selectivity estimation formula, even though the estimates for
>> both input relations are correct and already limited.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d, but it is still not clear to me.
>> So why is the idea of clamping ndistinct terrible in general? Could
>> you explain your reasons a bit more?

> Hi, I am also interested in understanding this issue, and I dug into
> the above commits and the topic. I hope this letter helps to sort it
> out.

Note that nothing's actually been done about the idea I expressed at the
start of this thread.  The behavior is better now: if you try the example
in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"),
and so this example looks good only because the actual semijoin
output size happens to be the same as the inner-join output size.
With a slight adjustment, that's no longer true and the estimate
still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
...

So we still have an issue to fix.  I've not had time to pursue
the idea I expressed at the start of the thread.  Also, it's a
bit scary to change this logic with only a few examples to look
at.  If you could reduce the problem cases you're looking at
to something sharable, that could help move things along.

            regards, tom lane