Thread: weird execution plan

weird execution plan

From
"Huang, Suya"
Date:

Hi,

 

Can someone figure out why the first query runs so slow  comparing to the second one? They generate the same result…

 

dev=# explain analyze select count(distinct wid) from terms_weekly_20140503 a join port_terms b on a.term=b.terms;

                                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------

Aggregate  (cost=2226181.12..2226181.13 rows=1 width=516) (actual time=18757.318..18757.319 rows=1 loops=1)

   ->  Hash Join  (cost=37.67..2095240.22 rows=52376358 width=516) (actual time=0.758..2496.190 rows=1067696 loops=1)

         Hash Cond: (a.term = b.terms)

         ->  Seq Scan on terms_weekly_20140503 a  (cost=0.00..240738.81 rows=8516481 width=548) (actual time=0.009..951.875 rows=8516481 loops=1)

         ->  Hash  (cost=22.30..22.30 rows=1230 width=32) (actual time=0.690..0.690 rows=1000 loops=1)

               Buckets: 1024  Batches: 1  Memory Usage: 51kB

               ->  Seq Scan on port_terms b  (cost=0.00..22.30 rows=1230 width=32) (actual time=0.009..0.283 rows=1000 loops=1)

Total runtime: 18757.367 ms

(8 rows)

 

Time: 18758.068 ms

 

dev=# explain analyze with x as (select distinct wid from terms_weekly_20140503 a join port_terms b on a.term=b.terms) select count(*) from x;

                                                                           QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Aggregate  (cost=2226187.62..2226187.63 rows=1 width=0) (actual time=2976.011..2976.011 rows=1 loops=1)

   CTE x

     ->  HashAggregate  (cost=2226181.12..2226183.12 rows=200 width=516) (actual time=2827.958..2896.747 rows=212249 loops=1)

           ->  Hash Join  (cost=37.67..2095240.22 rows=52376358 width=516) (actual time=0.734..2470.533 rows=1067696 loops=1)

                 Hash Cond: (a.term = b.terms)

                 ->  Seq Scan on terms_weekly_20140503 a  (cost=0.00..240738.81 rows=8516481 width=548) (actual time=0.009..916.028 rows=8516481 loops=1)

                 ->  Hash  (cost=22.30..22.30 rows=1230 width=32) (actual time=0.669..0.669 rows=1000 loops=1)

                       Buckets: 1024  Batches: 1  Memory Usage: 51kB

                       ->  Seq Scan on port_terms b  (cost=0.00..22.30 rows=1230 width=32) (actual time=0.009..0.269 rows=1000 loops=1)

   ->  CTE Scan on x  (cost=0.00..4.00 rows=200 width=0) (actual time=2827.961..2963.878 rows=212249 loops=1)

Total runtime: 2980.681 ms

(11 rows)

 

Thanks,

Suya

Re: weird execution plan

From
David G Johnston
Date:
Huang, Suya wrote
> Can someone figure out why the first query runs so slow  comparing to the
> second one? They generate the same result...

Try: EXPLAIN (ANALYZE, BUFFERS)

I believe you are only seeing caching effects.

David J.



--
View this message in context: http://postgresql.1045698.n5.nabble.com/weird-execution-plan-tp5818730p5818733.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: weird execution plan

From
"Huang, Suya"
Date:
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of David G
Johnston
Sent: Friday, September 12, 2014 12:45 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] weird execution plan

Huang, Suya wrote
> Can someone figure out why the first query runs so slow  comparing to
> the second one? They generate the same result...

Try: EXPLAIN (ANALYZE, BUFFERS)

I believe you are only seeing caching effects.

David J.



--
View this message in context: http://postgresql.1045698.n5.nabble.com/weird-execution-plan-tp5818730p5818733.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

=================================================================================================================
Both queries have been run several times so cache would have same effect on both of them?  Below is the plan with
bufferinformation. 

dev=# explain (ANALYZE, BUFFERS) with x as (select distinct wid from terms_weekly_20140503 a join port_terms b on
a.term=b.terms)select count(*) from x; 
                                                                            QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2226187.62..2226187.63 rows=1 width=0) (actual time=3533.998..3533.999 rows=1 loops=1)
   Buffers: shared hit=86752 read=68837
   CTE x
     ->  HashAggregate  (cost=2226181.12..2226183.12 rows=200 width=516) (actual time=3383.700..3448.554 rows=212249
loops=1)
           Buffers: shared hit=86752 read=68837
           ->  Hash Join  (cost=37.67..2095240.22 rows=52376358 width=516) (actual time=0.799..3010.906 rows=1067696
loops=1)
                 Hash Cond: (a.term = b.terms)
                 Buffers: shared hit=86752 read=68837
                 ->  Seq Scan on terms_weekly_20140503 a  (cost=0.00..240738.81 rows=8516481 width=548) (actual
time=0.023..1277.352rows=8516481 loops=1) 
                       Buffers: shared hit=86739 read=68835
                 ->  Hash  (cost=22.30..22.30 rows=1230 width=32) (actual time=0.699..0.699 rows=1000 loops=1)
                       Buckets: 1024  Batches: 1  Memory Usage: 51kB
                       Buffers: shared hit=7
                       ->  Seq Scan on port_terms b  (cost=0.00..22.30 rows=1230 width=32) (actual time=0.009..0.270
rows=1000loops=1) 
                             Buffers: shared hit=7
   ->  CTE Scan on x  (cost=0.00..4.00 rows=200 width=0) (actual time=3383.707..3518.884 rows=212249 loops=1)
         Buffers: shared hit=86752 read=68837
 Total runtime: 3541.277 ms
(18 rows)

Time: 3552.505 ms
dev=# explain (analyze,buffers) select count(distinct w_id) from terms_weekly_20140503 a join port_terms b on
a.term=b.terms;
                                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2226181.12..2226181.13 rows=1 width=516) (actual time=18914.881..18914.882 rows=1 loops=1)
   Buffers: shared hit=155589
   ->  Hash Join  (cost=37.67..2095240.22 rows=52376358 width=516) (actual time=0.802..2616.410 rows=1067696 loops=1)
         Hash Cond: (a.term = b.terms)
         Buffers: shared hit=155589
         ->  Seq Scan on terms_weekly_20140503 a  (cost=0.00..240738.81 rows=8516481 width=548) (actual
time=0.010..966.380rows=8516481 loops=1) 
               Buffers: shared hit=155574
         ->  Hash  (cost=22.30..22.30 rows=1230 width=32) (actual time=0.729..0.729 rows=1000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 51kB
               Buffers: shared hit=7
               ->  Seq Scan on port_terms b  (cost=0.00..22.30 rows=1230 width=32) (actual time=0.009..0.300 rows=1000
loops=1)
                     Buffers: shared hit=7
 Total runtime: 18914.933 ms
(13 rows)

Time: 18915.712 ms


Re: weird execution plan

From
David G Johnston
Date:
Huang, Suya wrote
> Both queries have been run several times so cache would have same effect
> on both of them?  Below is the plan with buffer information.

Not everyone does so its nice to make certain - especially since I'm not all
that familiar with the code involved.  But since no one else has answered I
will theorize.

SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl )

vs

SELECT count(DISTINCT col) FROM tbl

The code for "SELECT DISTINCT col" is likely highly efficient because it
works on complete sets of records.

The code for "SELECT count(DISTINCT col)" is at a relative disadvantage
since it must evaluate one row at a time and remember whether it had seen
the same value previously before deciding whether to increment a counter.

With a large number of duplicate rows the process of making the row set
smaller before counting the end result will perform better since fewer rows
must be evaluated in the less efficient count(DISTINCT) expression - the
time saved there more than offset by the fact that you are effectively
passing over that subset of the data a second time.

HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows)

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/weird-execution-plan-tp5818730p5818905.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: weird execution plan

From
"Huang, Suya"
Date:
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of David G
Johnston
Sent: Saturday, September 13, 2014 7:34 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] weird execution plan

>Not everyone does so its nice to make certain - especially since I'm not all that familiar with the code involved.
Butsince no one else has answered I will theorize. 
>
>SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl )
>
>vs
>
>SELECT count(DISTINCT col) FROM tbl
>
>The code for "SELECT DISTINCT col" is likely highly efficient because it works on complete sets of records.
>
>The code for "SELECT count(DISTINCT col)" is at a relative disadvantage since it must evaluate one row at a time and
rememberwhether it had seen the same value previously before deciding whether to >increment a counter. 
>
>With a large number of duplicate rows the process of making the row set smaller before counting the end result will
performbetter since fewer rows must be evaluated in the less efficient count(DISTINCT) >expression - the time saved
theremore than offset by the fact that you are effectively passing over that subset of the data a second time. 
>
>HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows)
>
>David J.

Thanks David!

I am so surprised to the findings you put here. Just did an explain plan on the example you gave and pasted the result
below,you're correct.  

"Select count(distinct col1)" is really  a very common SQL statement we write daily, in Postgres, we need to rewrite it
sothat the aggregate doesn't happen on a very large data sets... I am wondering if this is something to be improved
fromthe optimizer ifself, instead of developers to rewrite SQL. Like having the optimizer just do the counting in the
endinstead of doing it each time.  I used Oracle before, never saw this issue... 

But really thank you for pointing this out, very valuable lesson-learnt in PG SQL writing for me and our developers.

dev=# explain analyze select count(*) from (select distinct wid from terms_weekly) foo;
                                                                          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278656.00..1278656.01 rows=1 width=0) (actual time=24316.335..24316.336 rows=1 loops=1)
   ->  HashAggregate  (cost=1278651.50..1278653.50 rows=200 width=42) (actual time=23899.916..24242.010 rows=1298124
loops=1)
         ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.028..13631.898 rows=42765320
loops=1)
               ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0
loops=1)
               ->  Seq Scan on search_terms_weekly_20140503  (cost=0.00..293352.90 rows=10702190 width=42) (actual
time=0.026..2195.460rows=10702190 loops=1) 
               ->  Seq Scan on search_terms_weekly_20140510  (cost=0.00..298773.53 rows=10878953 width=42) (actual
time=8.244..3163.087rows=10878953 loops=1) 
               ->  Seq Scan on search_terms_weekly_20140517  (cost=0.00..288321.17 rows=10537717 width=41) (actual
time=7.345..2520.531rows=10537717 loops=1) 
               ->  Seq Scan on search_terms_weekly_20140524  (cost=0.00..291290.60 rows=10646460 width=41) (actual
time=8.543..2693.833rows=10646460 loops=1) 
 Total runtime: 24333.830 ms
(9 rows)

dev=# explain analyze select count(distinct wid) from terms_weekly;



                                                                       QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278651.50..1278651.51 rows=1 width=42) (actual time=585774.511..585774.511 rows=1 loops=1)
   ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.019..10656.782 rows=42765320 loops=1)
         ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 width=516) (actual time=0.002..0.002 rows=0
loops=1)
         ->  Seq Scan on search_terms_weekly_20140503  (cost=0.00..293352.90 rows=10702190 width=42) (actual
time=0.017..2225.397rows=10702190 loops=1) 
         ->  Seq Scan on search_terms_weekly_20140510  (cost=0.00..298773.53 rows=10878953 width=42) (actual
time=0.009..2244.918rows=10878953 loops=1) 
         ->  Seq Scan on search_terms_weekly_20140517  (cost=0.00..288321.17 rows=10537717 width=41) (actual
time=0.008..1822.088rows=10537717 loops=1) 
         ->  Seq Scan on search_terms_weekly_20140524  (cost=0.00..291290.60 rows=10646460 width=41) (actual
time=0.006..1561.229rows=10646460 loops=1) 
 Total runtime: 585774.568 ms
(8 rows)