Thread: Suboptimal query plan fixed by replacing OR with UNION

Suboptimal query plan fixed by replacing OR with UNION

From

Steven Schlansker

Date:

05 July 2012, 18:58:09

Hi all,

I have a query which is being optimized very differently depending on whether it is written using an OR clause or a
UNIONclause. 

I believe that the query results should be the same, and even if I've missed something with regards to something small
(e.g.NULL handling) I do not believe that it's a good excuse for the plan the optimizer ends up with. 

First, the table:

public | account                | table    | staging | 7719 MB    |
                   Table "public.account"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 id                | uuid                        | not null
 source            | character varying           | not null
 source_id         | character varying           | not null
 name              | character varying           |
 email             | character varying           |
 photo_url         | character varying           |
 user_id           | uuid                        |
 creation_date     | timestamp without time zone | not null
 modification_date | timestamp without time zone | not null
 first_linked_date | timestamp without time zone |
Indexes:
    "account_pkey" PRIMARY KEY, btree (id)
    "account_source_id_idx" UNIQUE, btree (source, source_id)
    "account_id_user_id_idx" btree (id, user_id)
    "account_user_id_idx" btree (user_id)
    "ness_user_email_idx" btree (email)

Some abbreviated statistics (all queries below were planned after running this ANALYZE statement):

INFO:  analyzing "public.account"
INFO:  "account": scanned 30000 of 987795 pages, containing 742040 live rows and 1932 dead rows; 30000 rows in sample,
24102216estimated total rows 

      attname      | null_frac | avg_width | n_distinct | correlation
-------------------+-----------+-----------+------------+-------------
 email             |    0.9987 |        22 |         -1 |   -0.100607
 creation_date     |         0 |         8 |         -1 |    0.679791
 first_linked_date |         1 |         8 |          0 |
 id                |         0 |        16 |         -1 |    0.680173
 source_id         |         0 |        11 |  -0.949212 |  -0.0792623
 user_id           |    0.9956 |        16 |        129 |  -0.0797483
 source            |         0 |         8 |          6 |  -0.0118729
 modification_date |         0 |         8 |         -1 |     0.93162
 name              |  0.170433 |        14 |     135438 |   -0.005636
 photo_url         |         0 |        49 |     180319 |    0.172699

FWIW, the "estimated total rows" is within 0.01% of the true value.


Now, the problematic query:

SELECT * FROM account
  WHERE user_id in
    (SELECT user_id FROM account
      WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'))
  OR
    id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');

This query gives the plan:

 Seq Scan on account  (cost=29.59..1379485.60 rows=12051109 width=160)
   Filter: ((hashed SubPlan 1) OR (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])))
   SubPlan 1
     ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
           Recheck Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
           ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                 Index Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(7 rows)

I can't imagine why it picks a sequential scan.  Besides the ridiculous estimate, it takes most of a minute to finish.

Running either query independently comes to a very reasonable plan:

ness_user=# explain SELECT * FROM account WHERE
ness_user-# user_id in (SELECT user_id FROM account WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'));
                                                                               QUERY PLAN
                                                 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=66.79..3228.18 rows=2410 width=160)
   ->  HashAggregate  (cost=29.59..29.60 rows=1 width=16)
         ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
               Recheck Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                     Index Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
   ->  Bitmap Heap Scan on account  (cost=37.20..3188.55 rows=803 width=160)
         Recheck Cond: (user_id = public.account.user_id)
         ->  Bitmap Index Scan on account_user_id_idx  (cost=0.00..37.00 rows=803 width=0)
               Index Cond: (user_id = public.account.user_id)
(10 rows)

ness_user=# explain SELECT * FROM account WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');
                                                                         QUERY PLAN
                                     

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=160)
   Recheck Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
   ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
         Index Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(4 rows)

(where "reasonable" is defined as "not a sequential scan")

Upon seeing this -- I had a crazy idea.  What if I just paste them together with a UNION DISTINCT?

ness_user=# explain SELECT * FROM account WHERE
ness_user-# user_id in (SELECT user_id FROM account WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'))
UNIONDISTINCT 
ness_user-# SELECT * FROM account WHERE
ness_user-# id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');
                                                                                     QUERY PLAN
                                                             

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=3342.22..3366.35 rows=2413 width=160)
   ->  Append  (cost=66.79..3281.90 rows=2413 width=160)
         ->  Nested Loop  (cost=66.79..3228.18 rows=2410 width=160)
               ->  HashAggregate  (cost=29.59..29.60 rows=1 width=16)
                     ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
                           Recheck Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
                           ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                                 Index Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Heap Scan on account  (cost=37.20..3188.55 rows=803 width=160)
                     Recheck Cond: (user_id = public.account.user_id)
                     ->  Bitmap Index Scan on account_user_id_idx  (cost=0.00..37.00 rows=803 width=0)
                           Index Cond: (user_id = public.account.user_id)
         ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=160)
               Recheck Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                     Index Cond: (id = ANY
('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(16 rows)


Wow!  Changing the query from using an OR clause to a UNION DISTINCT with two SELECTs reduced the cost from 1379485.60
to3366.35!  And the gains are realized when you actually execute the query. 

Why is using an OR so awful here?  Why does it pick a sequential scan?  Is this an optimizer bug or have I missed
somethingin my queries? 

Thanks much for any advice,
Steven Schlansker

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Tom Lane

Date:

05 July 2012, 19:51:39

Steven Schlansker <steven@likeness.com> writes:
> Why is using an OR so awful here?

Because the OR stops it from being a join (it possibly needs to return
some rows that are not in the semijoin of the two tables).

> Why does it pick a sequential scan?  Is this an optimizer bug

No.  It can't transform OR into a UNION because the results might not
be the same.  I assume you don't care about removal of duplicates, or
have some reason to know that there won't be any ... but the planner
doesn't know that.

            regards, tom lane

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Steven Schlansker

Date:

05 July 2012, 20:17:03

On Jul 5, 2012, at 3:51 PM, Tom Lane wrote:

> Steven Schlansker <steven@likeness.com> writes:
>> Why is using an OR so awful here?
>
> Because the OR stops it from being a join (it possibly needs to return
> some rows that are not in the semijoin of the two tables).
>
>> Why does it pick a sequential scan?  Is this an optimizer bug
>
> No.  It can't transform OR into a UNION because the results might not
> be the same.  I assume you don't care about removal of duplicates, or
> have some reason to know that there won't be any ... but the planner
> doesn't know that.
>

Thanks for the insight here.  It still seems unfortunate that it picks a
sequential scan -- but if there really is no more efficient way to do this,
I will just rewrite the query.

Steven

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Jasen Betts

Date:

05 July 2012, 23:02:16

I note you've decided to rewrite this query as a union

> SELECT * FROM account
>   WHERE user_id in
>     (SELECT user_id FROM account
>       WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'))
>   OR
>     id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');

I notice both arrays (used with = ANY) have the exact same content,

if this is always true you can use a CTE here for the ID=ANY(...)
query and reference the CTE on both sides of the union.

WITH i as (
 SELECT * FROM account WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')
)
SELECT
 * from i
UNION DISTINCT
SELECT
 account.* from account join i on i.user_id = account.userid ;

--
⚂⚃ 100% natural

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Steven Schlansker

Date:

06 July 2012, 20:25:23

On Jul 5, 2012, at 6:35 PM, Jasen Betts wrote:

> I note you've decided to rewrite this query as a union
>
>> SELECT * FROM account
>>  WHERE user_id in
>>    (SELECT user_id FROM account
>>      WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'))
>>  OR
>>    id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');
>
> I notice both arrays (used with = ANY) have the exact same content,
>
> if this is always true you can use a CTE here for the ID=ANY(...)
> query and reference the CTE on both sides of the union.
>

Thanks for the idea!  I'll be sure to incorporate that.  Doesn't fix the unfortunate behavior with OR, though.

> WITH i as (
> SELECT * FROM account WHERE id =
ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')
> )
> SELECT
> * from i
> UNION DISTINCT
> SELECT
> account.* from account join i on i.user_id = account.userid ;
>
> --
> ⚂⚃ 100% natural
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Gurjeet Singh

Date:

07 July 2012, 01:25:16

On Thu, Jul 5, 2012 at 7:16 PM, Steven Schlansker <steven@likeness.com> wrote:

On Jul 5, 2012, at 3:51 PM, Tom Lane wrote:

> Steven Schlansker <steven@likeness.com> writes:
>> Why is using an OR so awful here?
>
> Because the OR stops it from being a join (it possibly needs to return
> some rows that are not in the semijoin of the two tables).
>
>> Why does it pick a sequential scan? Is this an optimizer bug
>
> No. It can't transform OR into a UNION because the results might not
> be the same. I assume you don't care about removal of duplicates, or
> have some reason to know that there won't be any ... but the planner
> doesn't know that.
>

Thanks for the insight here. It still seems unfortunate that it picks a
sequential scan -- but if there really is no more efficient way to do this,
I will just rewrite the query.

It might not be applicable to this case (because of the use of ANY in second branch of OR clause), but some databases provide a feature called OR-Optimization, where the optimizer breaks up the query at OR clause boundaries and uses UNION ALL operator to join the resulting queries, just like you did. Optimizer does need to add additional AND clauses to some of the branches to make sure the result set is not affected.

Just a thought.
--

Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Suboptimal query plan fixed by replacing OR with UNION

From

Steven Schlansker

Date:

09 July 2012, 14:35:42

On Jul 6, 2012, at 9:24 PM, Gurjeet Singh wrote:

> On Thu, Jul 5, 2012 at 7:16 PM, Steven Schlansker <steven@likeness.com> wrote:
>
> On Jul 5, 2012, at 3:51 PM, Tom Lane wrote:
>
> > Steven Schlansker <steven@likeness.com> writes:
> >> Why is using an OR so awful here?
> >
> > Because the OR stops it from being a join (it possibly needs to return
> > some rows that are not in the semijoin of the two tables).
> >
> >> Why does it pick a sequential scan?  Is this an optimizer bug
> >
> > No.  It can't transform OR into a UNION because the results might not
> > be the same.  I assume you don't care about removal of duplicates, or
> > have some reason to know that there won't be any ... but the planner
> > doesn't know that.
> >
>
> Thanks for the insight here.  It still seems unfortunate that it picks a
> sequential scan -- but if there really is no more efficient way to do this,
> I will just rewrite the query.
>
> It might not be applicable to this case (because of the use of ANY in second branch of OR clause), but some databases
providea feature called OR-Optimization, where the optimizer breaks up the query at OR clause boundaries and uses UNION
ALLoperator to join the resulting queries, just like you did. Optimizer does need to add additional AND clauses to some
ofthe branches to make sure the result set is not affected. 
>

That sounds like a great optimization for Postgres, but unfortunately it's far outside of my skill set / time to
contribute,so I'd have to wait for a "real" PG dev to get to it :) 

> Just a thought.
> --
> Gurjeet Singh
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>