Thread: Unclamped row estimates whith OR-ed subplans

Unclamped row estimates whith OR-ed subplans

From
"Benjamin Coutu"
Date:
Hello,

please consider the following SQL query:

SELECT * FROM "transactions" WHERE
    "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%') OR
    "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

This yields the following plan on Postgres 11:

Seq Scan on transactions  (cost=67.21..171458.03 rows=1301316 width=1206)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  SubPlan 1
    ->  Bitmap Heap Scan on accounts  (cost=33.36..61.16 rows=46 width=4)
          Recheck Cond: ((name)::text ~~* '%test%'::text)
          ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.35 rows=46 width=0)
                Index Cond: ((name)::text ~~* '%test%'::text)
  SubPlan 2
    ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4)
          Filter: ((name)::text ~~* '%test%'::text)

So the where clause of this query has just two subplans OR-ed together, one is estimated to yield 46 rows and one is
estimatedto yield 5 rows. 
I'd expect the total rows for the seqscan to be estimated at 46 then, following the logic that rows_seqscan =
max(rows_subplan1,rows_subplan2). As you can see, the optimizer estimates a whopping 1301316 rows instead. 

I am absolutely aware that those are hashed sub plans below a seqscan and that Postgres therefore has to scan all
tuplesof the table. But the problem is that upper nodes (which are excluded from this example for simplicity) think
theywill receive 1301316 rows from the seqscan, when in fact they will probably only see a hand full, which the planner
couldhave (easily?) deduced by taking the greater of the two subplan row estimates. 

What am I missing, or is this perhaps a shortfall of the planner?

Thanks,

Ben

--

Bejamin Coutu
ben.coutu@zeyos.com

ZeyOS GmbH & Co. KG
http://www.zeyos.com



Re: Unclamped row estimates whith OR-ed subplans

From
Laurenz Albe
Date:
On Fri, 2020-06-19 at 17:12 +0200, Benjamin Coutu wrote:
> please consider the following SQL query:
> 
> SELECT * FROM "transactions" WHERE 
>         "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%') OR
>         "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')
> 
> This yields the following plan on Postgres 11:
> 
> Seq Scan on transactions  (cost=67.21..171458.03 rows=1301316 width=1206)
>   Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
>   SubPlan 1
>     ->  Bitmap Heap Scan on accounts  (cost=33.36..61.16 rows=46 width=4)
>           Recheck Cond: ((name)::text ~~* '%test%'::text)
>           ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.35 rows=46 width=0)
>                 Index Cond: ((name)::text ~~* '%test%'::text)
>   SubPlan 2
>     ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4)
>           Filter: ((name)::text ~~* '%test%'::text)
> 
> So the where clause of this query has just two subplans OR-ed together, one is estimated to yield 46 rows and one is
estimatedto yield 5 rows.
 
> I'd expect the total rows for the seqscan to be estimated at 46 then, following the logic that rows_seqscan =
max(rows_subplan1,rows_subplan2). As you can see, the optimizer estimates a whopping
 
> 1301316 rows instead.
> 
> I am absolutely aware that those are hashed sub plans below a seqscan and that Postgres therefore has to scan all
tuplesof the table. But the problem is that upper nodes (which are excluded from
 
> this example for simplicity) think they will receive 1301316 rows from the seqscan, when in fact they will probably
onlysee a hand full, which the planner could have (easily?) deduced by taking the
 
> greater of the two subplan row estimates.
> 
> What am I missing, or is this perhaps a shortfall of the planner?

The subplans are executed *fpr each row* found in "transactions",
and the estimate on the subplans is *per execution".

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Unclamped row estimates whith OR-ed subplans

From
Tom Lane
Date:
"Benjamin Coutu" <ben.coutu@zeyos.com> writes:
> please consider the following SQL query:

> SELECT * FROM "transactions" WHERE
>     "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%') OR
>     "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

> This yields the following plan on Postgres 11:

> Seq Scan on transactions  (cost=67.21..171458.03 rows=1301316 width=1206)
>   Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
>   SubPlan 1
>     ->  Bitmap Heap Scan on accounts  (cost=33.36..61.16 rows=46 width=4)
>           Recheck Cond: ((name)::text ~~* '%test%'::text)
>           ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.35 rows=46 width=0)
>                 Index Cond: ((name)::text ~~* '%test%'::text)
>   SubPlan 2
>     ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4)
>           Filter: ((name)::text ~~* '%test%'::text)

> So the where clause of this query has just two subplans OR-ed together, one is estimated to yield 46 rows and one is
estimatedto yield 5 rows. 
> I'd expect the total rows for the seqscan to be estimated at 46 then, following the logic that rows_seqscan =
max(rows_subplan1,rows_subplan2). As you can see, the optimizer estimates a whopping 1301316 rows instead. 

No.  The subplan estimates are for the number of rows produced by one
execution of the subplan, ie the numbers of "accounts" or "contracts"
rows that match those inner WHERE conditions.  This has very little
a-priori relationship to the number of "transactions" rows that will
satisfy the outer WHERE condition.  If we knew that transactions.account
and transactions.contract were unique columns, then we might be able
to say that there shouldn't be more than one outer match per subplan
result row ... but you didn't say that, and it seems unlikely.

(Having said that, I think that the estimates for these cases very
possibly are quite stupid.  But that doesn't mean that 46+5 would
be the right answer.)

            regards, tom lane



Re: Unclamped row estimates whith OR-ed subplans

From
"David G. Johnston"
Date:
On Friday, June 19, 2020, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> I am absolutely aware that those are hashed sub plans below a seqscan and that Postgres therefore has to scan all tuples of the table. But the problem is that upper nodes (which are excluded from
> this example for simplicity) think they will receive 1301316 rows from the seqscan, when in fact they will probably only see a hand full, which the planner could have (easily?) deduced by taking the
> greater of the two subplan row estimates.
>
> What am I missing, or is this perhaps a shortfall of the planner?

The subplans are executed *fpr each row* found in "transactions",
and the estimate on the subplans is *per execution".

I understood Tom’s nearby answer but not this one.  This seems to be referring to correlated subplans which the examples are not.

David J.
 

Re: Unclamped row estimates whith OR-ed subplans

From
"Benjamin Coutu"
Date:
> No.  The subplan estimates are for the number of rows produced by one
> execution of the subplan, ie the numbers of "accounts" or "contracts"
> rows that match those inner WHERE conditions.  This has very little
> a-priori relationship to the number of "transactions" rows that will
> satisfy the outer WHERE condition.  If we knew that transactions.account
> and transactions.contract were unique columns, then we might be able
> to say that there shouldn't be more than one outer match per subplan
> result row ... but you didn't say that, and it seems unlikely.

Thanks Tom, I understand your point.

I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly
distinctand the individual subplan estimates are quite accurate: 

Seq Scan on transactions  (cost=67.81..171458.63 rows=1301316 width=1206) (actual time=69.418..917.594 rows=112
loops=1)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  Rows Removed by Filter: 1792937
  SubPlan 1
    ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=3.053..3.292 rows=111 loops=1)
          Recheck Cond: ((name)::text ~~* '%test%'::text)
          Rows Removed by Index Recheck: 4
          Heap Blocks: exact=104
          ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.505..0.505 rows=118
loops=1)
                Index Cond: ((name)::text ~~* '%test%'::text)
  SubPlan 2
    ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4) (actual time=2.531..2.836 rows=4 loops=1)
          Filter: ((name)::text ~~* '%test%'::text)
          Rows Removed by Filter: 272

For comparison here are the plans for the queries with the individual where clauses:

SELECT * FROM "transactions" WHERE "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%')

Nested Loop  (cost=34.38..488.93 rows=155 width=1206) (actual time=0.599..1.393 rows=112 loops=1)
  ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=0.541..0.796 rows=111 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Rows Removed by Index Recheck: 4
        Heap Blocks: exact=104
        ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.521..0.521 rows=118
loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_account on transactions  (cost=0.43..9.08 rows=21 width=1206) (actual
time=0.004..0.005rows=1 loops=111) 
        Index Cond: (account = accounts."ID")

SELECT * FROM "transactions" WHERE "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

Nested Loop  (cost=3.76..10.10 rows=31662 width=1206) (actual time=0.082..0.082 rows=0 loops=1)
  ->  Bitmap Heap Scan on contracts  (cost=3.64..5.74 rows=5 width=4) (actual time=0.069..0.075 rows=4 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Heap Blocks: exact=2
        ->  Bitmap Index Scan on s_contracts  (cost=0.00..3.64 rows=5 width=0) (actual time=0.060..0.060 rows=4
loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_contract on transactions  (cost=0.12..0.86 rows=1 width=1206) (actual
time=0.001..0.001rows=0 loops=4) 
        Index Cond: (contract = contracts."ID")

The statistics for the columns are:

SELECT attname, null_frac, n_distinct from pg_stats WHERE tablename = 'transactions' AND attname IN ('account',
'contract')

transactions.account:   null_frac=0.025 n_distinct=80277
transactions.contract:  null_frac=1     n_distinct=0     (there are basically no non-null values for field "contract"
intransactions) 

According to pg_class.reltuples the table "transactions" has 1735088 rows.

I'd naively expect the selectivity for an OR of those two hashed subplans given uniform distribution to be:

rows_total =
    ((rows_transactions * (1 - null_frac_account)) / n_distinct_account) * expected_rows_from_subplan1 +
    ((rows_transactions * (1 - null_frac_contract)) / n_distinct_contract) * expected_rows_from_subplan2

=> rows_total =
    ((1735088 * (1 - 0.025)) / 80277) * 46 +
    ((1735088 * (1 - 1)) / 0) * 5

=> rows_total = 969 + 0 /* no non-null values for contract field */

Please forgive the sloppy math but something along this line could be promising.

Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the null_frac
offield transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum estimate for that
query?

Thanks again!

Benjamin Coutu



Re: Unclamped row estimates whith OR-ed subplans

From
Tom Lane
Date:
"Benjamin Coutu" <ben.coutu@zeyos.com> writes:
> I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly
distinctand the individual subplan estimates are quite accurate: 

Yeah, as I said, the estimates you're getting for the OR'd subplans are
pretty stupid.  Once you throw the OR in there, it's not possible to
convert the IN clauses to semi-joins, so they just stay as generic
subplans.  It looks like we have exactly zero intelligence about the
generic case --- unless I'm missing something in clause_selectivity,
you just end up with a default 0.5 selectivity estimate.  So yeah,
there's a lot of room for improvement, whenever anyone finds some
round tuits to work on that.

While you're waiting, you might think about recasting the query to
avoid the OR.  Perhaps you could do a UNION of two scans of the
transactions table?

> Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the
null_fracof field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum
estimatefor that query? 

That I don't understand.  I get a minimal rowcount estimate for an
all-nulls outer table, as long as I'm using just one IN rather than
an OR:

regression=# create table contracts (id int);
CREATE TABLE
regression=# insert into contracts values(1),(2),(3),(4);
INSERT 0 4
regression=# analyze contracts ;
ANALYZE
regression=# create table transactions (contract int);
CREATE TABLE
regression=# insert into transactions select null from generate_series(1,100000);
INSERT 0 100000
regression=# analyze transactions;
ANALYZE
regression=# explain select * from transactions where contract in (select id from contracts);
                                QUERY PLAN
--------------------------------------------------------------------------
 Hash Semi Join  (cost=1.09..1607.59 rows=1 width=4)
   Hash Cond: (transactions.contract = contracts.id)
   ->  Seq Scan on transactions  (cost=0.00..1344.00 rows=100000 width=4)
   ->  Hash  (cost=1.04..1.04 rows=4 width=4)
         ->  Seq Scan on contracts  (cost=0.00..1.04 rows=4 width=4)
(5 rows)

            regards, tom lane



Re: Unclamped row estimates whith OR-ed subplans

From
Michael Lewis
Date:
While you're waiting, you might think about recasting the query to
avoid the OR.  Perhaps you could do a UNION of two scans of the
transactions table?

Minor note- use UNION ALL to avoid the dedupe work if you already know those will be distinct sets, or having duplicates is fine.

Re: Unclamped row estimates whith OR-ed subplans

From
"Benjamin Coutu"
Date:
> While you're waiting, you might think about recasting the query to
> avoid the OR.  Perhaps you could do a UNION of two scans of the
> transactions table?

Thanks for the hint, I am well aware of the workaround for OR via UNION. I am not trying to improve this query per se
asit is the small root of a very complex generated query and it's unfeasible to rewrite it to a UNION in this case. 

The point of my message to the list was to highlight the misestimation, which I couldn't wrap my head around. Maybe
thisdiscussion can give some food for thought to someone who might tackle this in the future. It would surely be great
tohave selectivity estimate smarts for the generic case of OR-ed SubPlans someday. 

>
> > Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the
null_fracof field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum
estimatefor that query? 
>
> That I don't understand.  I get a minimal rowcount estimate for an
> all-nulls outer table, as long as I'm using just one IN rather than

Yeah, that might be worth digging into. Is there any other info apart from those stats that I could provide?