Thread: Seq scan on join, not on subselect? analyze this

Seq scan on join, not on subselect? analyze this

From
Bryce Nesbitt
Date:
I'm a bit confused why the query planner is not restricting my join, and
not using the index.  Two explain analyze statements follow.
Why is the second so much better?

lyell5=> select version();
PostgreSQL 8.3.4 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)

lyell5=> explain analyze select * from article_words join words using
(word_key) where context_key=535462;

+------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                  
QUERY
PLAN                                                                   |

+------------------------------------------------------------------------------------------------------------------------------------------------+
| Hash Join  (cost=192092.90..276920.93 rows=45327 width=17) (actual
time=6020.932..60084.817 rows=777
loops=1)                                  |
|   Hash Cond: (article_words.word_key =
words.word_key)                                                                                        
|
|   ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=8) (actual time=0.031..0.547
rows=777 loops=1) |
|         Index Cond: (context_key =
535462)                                                                                                    
|
|   ->  Hash  (cost=93819.62..93819.62 rows=5653462 width=13) (actual
time=6020.605..6020.605 rows=5651551 loops=1)                              |
|         ->  Seq Scan on words  (cost=0.00..93819.62 rows=5653462
width=13) (actual time=0.006..2010.962 rows=5651551
loops=1)                  |
| Total runtime: 60085.616
ms                                                                                                                   
|

+------------------------------------------------------------------------------------------------------------------------------------------------+


lyell5=> explain analyze select * from words where word_key in (select
word_key from article_words where context_key=535462);

+------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                     
QUERY
PLAN                                                                      |

+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Nested Loop  (cost=56073.81..56091.41 rows=2 width=13) (actual
time=0.808..4.723 rows=777
loops=1)                                                   |
|   ->  HashAggregate  (cost=56073.81..56073.83 rows=2 width=4) (actual
time=0.795..1.072 rows=777
loops=1)                                            |
|         ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=4) (actual time=0.030..0.344
rows=777 loops=1) |
|               Index Cond: (context_key =
535462)                                                                                                    
|
|   ->  Index Scan using words_pkey on words  (cost=0.00..8.78 rows=1
width=13) (actual time=0.003..0.004 rows=1
loops=777)                            |
|         Index Cond: (words.word_key =
article_words.word_key)                                                                                       
|
| Total runtime: 4.936
ms
      
 
|

+------------------------------------------------------------------------------------------------------------------------------------------------------+



Re: Seq scan on join, not on subselect? analyze this

From
Tom Lane
Date:
Bryce Nesbitt <bryce2@obviously.com> writes:
> I'm a bit confused why the query planner is not restricting my join, and
> not using the index.  Two explain analyze statements follow.

Why are the rowcount estimates so far off?  Maybe you need to increase
the statistics target for this table.
        regards, tom lane


Re: Seq scan on join, not on subselect? analyze this

From
Bryce Nesbitt
Date:
Tom Lane wrote:<br /><blockquote cite="mid:4529.1225651101@sss.pgh.pa.us" type="cite"><pre wrap="">Why are the
rowcountestimates so far off?  Maybe you need to increase
 
the statistics target for this table.        regards, tom lane </pre></blockquote> Tom,<br /> How does one tell the
rowcountis off in a query plan?  I've never found a great reference on interpreting the query analyze output!<br /><br
/>Upping the stats target made <i>little</i> difference in the estimated row count, but a <i>huge</i> difference in the
query. It also mattered<br /> only on the context_key column, not the other columns.  Why would that be?<br /><br
/><fontsize="-1"><tt><br /> lyell5=> \pset format wrapped<br /> lyell5=> select * from pg_stats where
tablename='article_words';<br/>
+------------+-------------+----------+-----------+-----------+------------+--------------------+-------------------+---------------------+-------------+<br
/>| schemaname |  tablename  | attname  | null_frac | avg_width | n_distinct |  most_common_vals  | most_common_freqs
| histogram_bounds   | correlation |<br />
+------------+-------------+----------+-----------+-----------+------------+--------------------+-------------------+---------------------+-------------+<br
/>| public     | article_wor | word_key |         0 |         4 |       6361 | {382,226,1155,2130 | {0.003,0.00233333 |
{3,623,1376,2074,31|   0.0292482 |<br /> |            ; ds                                                          ;
,2246,3328,8999,73; ,0.00233333,0.002 ; 54,4601,7269,12770,               |<br />
|                                                                         ; 3,1760,1796}       ; 33333,0.00233333, ;
30017,212496,582068              |<br />
|                                                                                              ; 0.00233333,0.0023 ;
0}                               |<br />
|                                                                                              ;
3333,0.002,0.002,                                    |<br />
|                                                                                              ;
0.002}                                               |<br /> | public     | article_wor | context_ |         0
|        4 |       4671 | {639302,113013,133 | {0.00233333,0.001 | {28,42838,92697,140 |    0.867505 |<br />
|           ; ds          ; key                                           ; 052,211978,508496, ; 66667,0.00166667, ;
684,202950,248442,3              |<br /> |                                                                          ;
545123,590829,5985; 0.00166667,0.0016 ; 38954,403025,498952               |<br />
|                                                                         ; 95,649645,37658}   ; 6667,0.00166667,0 ;
,584048,654070}                  |<br />
|                                                                                              ;
.00166667,0.00166                                    |<br />
|                                                                                              ;
667,0.00166667,0.                                    |<br />
|                                                                                              ;
00133333}                                            |<br />
+------------+-------------+----------+-----------+-----------+------------+--------------------+-------------------+---------------------+-------------+<br
/>lyell5=> analyze verbose article_words;<br /> INFO:  "article_words": scanned 3000 of 1125489 pages, containing
555000live rows<br /> and 0 dead rows; 3000 rows in sample, 208,215,465 estimated total rows<br /><br /> lyell5=>
setdefault_statistics_target to 500;<br /> lyell5=> analyze verbose article_words;<br /> INFO:  "article_words":
scanned150000 of 1125489 pages, containing 27749820 live rows<br /> and 180 dead rows; 150000 rows in sample,
208,214,114estimated total rows<br /><br /><br /><br /> But it makes a huge impact on this query:<br /><br
/></tt></font><tableborder="1" cellpadding="2" cellspacing="2" width="30%"><tbody><tr><td bgcolor="#cccccc"
valign="top">statstarget<font color="#333333"><br /></font></td><td bgcolor="#cccccc" valign="top">sampled<br
/></td><tdbgcolor="#cccccc" valign="top">query duration after analyze<br /></td></tr><tr><td valign="top">10<br
/></td><tdvalign="top">3000<br /></td><td valign="top">80600ms<br /></td></tr><tr><td valign="top">15<br /></td><td
valign="top">4500<br/></td><td valign="top">64000ms</td></tr><tr><td valign="top">20<br /></td><td valign="top">6000<br
/></td><tdvalign="top">4.2ms<br /></td></tr><tr><td valign="top">30<br /></td><td valign="top">9000<br /></td><td
valign="top">4.2ms<br/></td></tr><tr><td valign="top">250<br /></td><td valign="top">75000<br /></td><td
valign="top">4.2ms<br/></td></tr><tr><td valign="top">500<br /></td><td valign="top">?<br /></td><td
valign="top">4.2ms<br/></td></tr><tr><td valign="top">1000<br /></td><td valign="top">300000<br /></td><td
valign="top">4.2ms<br/></td></tr></tbody></table><font size="-1"><tt><br /> Is there any good reason not to set stats
targetto 1000 always?<br /><br /><br /><br /><br /> lyell5=> alter table article_words alter word_key set statistics
30;analyzeverbose article_words;<br /> explain analyze select words.* from article_words join words using (word_key)
wherecontext_key=535462;<br /> INFO:  "article_words": scanned 9000 of 1125489 pages, containing 1665000 live rows and
0dead rows; 9000 rows in sample, 208215465 estimated total rows<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>|                                                                   QUERY
PLAN                                                                  |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>| Nested Loop  (cost=0.00..160395.31 rows=16170 width=13) (actual time=0.034..4.025 rows=777
loops=1)                                           |<br /> |   ->  Index Scan using article_word_idx on
article_words (cost=0.00..18258.96 rows=16170 width=4) (actual time=0.023..0.344 rows=777 loops=1) |<br /> |        
IndexCond: (context_key =
535462)                                                                                                    |<br /> |  
-> Index Scan using words_pkey on words  (cost=0.00..8.78 rows=1 width=13) (actual time=0.003..0.004 rows=1
loops=777)                     |<br /> |         Index Cond: (words.word_key =
article_words.word_key)                                                                                 |<br /> | Total
runtime:4.231
ms                                                                                                                       
|<br/>
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/><br/><br /> lyell5=> alter table article_words alter word_key set statistics 10;analyze verbose article_words;<br
/>explain analyze select words.* from article_words join words using (word_key) where context_key=535462;<br /> INFO: 
"article_words":scanned 3000 of 1125489 pages, containing 555000 live rows and 0 dead rows; 3000 rows in sample,
208215465estimated total rows<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>|                                                                   QUERY
PLAN                                                                  |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>| Hash Join  (cost=192089.54..265653.38 rows=44589 width=13) (actual time=18809.184..80685.239 rows=777
loops=1)                                |<br /> |   Hash Cond: (article_words.word_key =
words.word_key)                                                                                        |<br /> |  
-> Index Scan using article_word_idx on article_words  (cost=0.00..44717.05 rows=44589 width=4) (actual
time=0.025..0.597rows=777 loops=1) |<br /> |         Index Cond: (context_key =
535462)                                                                                                    |<br /> |  
-> Hash  (cost=93818.13..93818.13 rows=5653313 width=13) (actual time=18808.872..18808.872 rows=5651551
loops=1)                           |<br /> |         ->  Seq Scan on words  (cost=0.00..93818.13 rows=5653313
width=13)(actual time=0.007..7845.824 rows=5651551 loops=1)                  |<br /> | Total runtime: 80686.217
ms                                                                                                                   
|<br/>
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/><br/></tt></font> 

Re: Seq scan on join, not on subselect? analyze this

From
"Helio Campos Mello de Andrade"
Date:
Bryce,<br /><br />  - I think that the difference between the two queries has to do with the way postgresql execute
them.<br/><br />    In the first the SGDB does:<br />         1º Creates a temporary table with "m" X "n" rows where
the"m" and "n" are the number of the rows in the tables been joined.<br />         2º Take only the rows that has the
same"work_key"<br />         3º It restricts using the where clause.<br /><br />         OBS: Maybe It use the where
clausefirst on the tables just to minimize the "m" and "n". I not sure about that. Still it creates and "m" X "n"
temporarytable with lots of bad rows.<br /><br />   In the second query the SGDB:<br />      1º Select in
"article_words"only the rows that correspond with the restriction to that "context_key". It results in a much smaller
numberof rows. "k" <<< "n".<br />      2º It uses "k-results" and look for the for the rows where "word_key"
isin the group created by the INNER Query.<br /><br />That's why you have the difference between the query's "Total
runtime".<br/><br />Regards<br /><br />--<br />Helio Campos Mello de Andrade<br /><br /><br /><br /><div
class="gmail_quote">OnSun, Nov 2, 2008 at 3:51 PM, Bryce Nesbitt <span dir="ltr"><<a
href="mailto:bryce2@obviously.com">bryce2@obviously.com</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">I'm a bit confused why
thequery planner is not restricting my join, and<br /> not using the index.  Two explain analyze statements follow.<br
/>Why is the second so much better?<br /><br /> lyell5=> select version();<br /> PostgreSQL 8.3.4 on
x86_64-pc-linux-gnu,compiled by GCC cc (GCC) 4.1.2<br /> 20061115 (prerelease) (Debian 4.1.1-21)<br /><br />
lyell5=>explain analyze select * from article_words join words using<br /> (word_key) where context_key=535462;<br
/>
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>|<br /> QUERY<br /> PLAN                                                                   |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>| Hash Join  (cost=192092.90..276920.93 rows=45327 width=17) (actual<br /> time=6020.932..60084.817 rows=777<br />
loops=1)                                 |<br /> |   Hash Cond: (article_words.word_key =<br /> words.word_key)<br />
|<br/> |   ->  Index Scan using article_word_idx on article_words<br /> (cost=0.00..55960.50 rows=45327 width=8)
(actualtime=0.031..0.547<br /> rows=777 loops=1) |<br /> |         Index Cond: (context_key =<br /> 535462)<br /> |<br
/>|   ->  Hash  (cost=93819.62..93819.62 rows=5653462 width=13) (actual<br /> time=6020.605..6020.605 rows=5651551
loops=1)                             |<br /> |         ->  Seq Scan on words  (cost=0.00..93819.62 rows=5653462<br
/>width=13) (actual time=0.006..2010.962 rows=5651551<br /> loops=1)                  |<br /> | Total runtime:
60085.616<br/> ms<br /> |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------+<br
/><br/><br /> lyell5=> explain analyze select * from words where word_key in (select<br /> word_key from
article_wordswhere context_key=535462);<br />
+------------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>|<br /> QUERY<br /> PLAN                                                                      |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------------+<br
/>| Nested Loop  (cost=56073.81..56091.41 rows=2 width=13) (actual<br /> time=0.808..4.723 rows=777<br /> loops=1)    
                                             |<br /> |   ->  HashAggregate  (cost=56073.81..56073.83 rows=2 width=4)
(actual<br/> time=0.795..1.072 rows=777<br /> loops=1)                                            |<br /> |        
-> Index Scan using article_word_idx on article_words<br /> (cost=0.00..55960.50 rows=45327 width=4) (actual
time=0.030..0.344<br/> rows=777 loops=1) |<br /> |               Index Cond: (context_key =<br /> 535462)<br /> |<br />
|  ->  Index Scan using words_pkey on words  (cost=0.00..8.78 rows=1<br /> width=13) (actual time=0.003..0.004
rows=1<br/> loops=777)                            |<br /> |         Index Cond: (words.word_key =<br />
article_words.word_key)<br/> |<br /> | Total runtime: 4.936<br /> ms<br /> |<br />
+------------------------------------------------------------------------------------------------------------------------------------------------------+<br
/><fontcolor="#888888"><br /><br /> --<br /> Sent via pgsql-sql mailing list (<a
href="mailto:pgsql-sql@postgresql.org">pgsql-sql@postgresql.org</a>)<br/> To make changes to your subscription:<br /><a
href="http://www.postgresql.org/mailpref/pgsql-sql"target="_blank">http://www.postgresql.org/mailpref/pgsql-sql</a><br
/></font></blockquote></div>