Re: query produces 1 GB temp file - Mailing list pgsql-performance

From Dirk.Lutzebaeck@t-online.de (Dirk Lutzebaeck)
Subject Re: query produces 1 GB temp file
Date
Msg-id 42061B35.6000306@aeccom.com
Whole thread Raw
In response to Re: query produces 1 GB temp file  (Greg Stark <gsstark@mit.edu>)
List pgsql-performance
Greg,

Thanks for your analysis. But I dont get any better after bumping
STATISTICS target from 10 to 200.
explain analyze shows that the optimizer is still way off estimating the
rows. Is this normal? It still produces a 1 GB temp file.
I simplified the query a bit, now only two tables are involved (bi, df).
I also vacuumed.


alter table bi alter rc set statistics 200;
alter table bi alter hide set statistics 200;
alter table bi alter co set statistics 200;
alter table bi alter en set statistics 200;
analyze bi;

alter table df alter en set statistics 200;
alter table df alter val_2 set statistics 200;
analyze df;

EXPLAIN ANALYZE
SELECT DISTINCT ON (df.val_9, df.created, df.flatid) df.docindex,
df.flatobj, bi.oid, bi.en
FROM bi,df
WHERE bi.rc=130170467
AND bi.en=df.en
AND bi.co=117305223
AND bi.hide=FALSE
AND (df.val_2='DG' OR df.val_2='SK')
AND df.docstart=1
ORDER BY df.val_9 ASC, df.created DESC
LIMIT 1000 OFFSET 0
;

Limit (cost=82470.09..82480.09 rows=1000 width=646) (actual
time=71768.685..72084.622 rows=1000 loops=1)
-> Unique (cost=82470.09..82643.71 rows=17362 width=646) (actual
time=71768.679..72079.987 rows=1000 loops=1)
-> Sort (cost=82470.09..82513.50 rows=17362 width=646) (actual
time=71768.668..71905.138 rows=22439 loops=1)
Sort Key: df.val_9, df.created, df.flatid
-> Merge Join (cost=80422.51..81247.49 rows=17362 width=646) (actual
time=7657.872..18486.551 rows=703677 loops=1)
Merge Cond: ("outer".en = "inner".en)
-> Sort (cost=55086.74..55340.18 rows=101378 width=8) (actual
time=5606.137..6672.630 rows=471871 loops=1)
Sort Key: bi.en
-> Seq Scan on bi (cost=0.00..46657.47 rows=101378 width=8) (actual
time=0.178..3715.109 rows=472320 loops=1)
Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))
-> Sort (cost=25335.77..25408.23 rows=28982 width=642) (actual
time=2048.039..3677.140 rows=706482 loops=1)
Sort Key: df.en
-> Seq Scan on df (cost=0.00..23187.79 rows=28982 width=642) (actual
time=0.112..1546.580 rows=71978 loops=1)
Filter: (((val_2 = 'DG'::text) OR (val_2 = 'SK'::text)) AND (docstart = 1))


explain analyze select * from bi where rc=130170467;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on bi (cost=0.00..41078.76 rows=190960 width=53) (actual
time=0.157..3066.028 rows=513724 loops=1)
Filter: (rc = 130170467::oid)
Total runtime: 4208.663 ms
(3 rows)


explain analyze select * from bi where co=117305223;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on bi (cost=0.00..41078.76 rows=603988 width=53) (actual
time=0.021..3692.238 rows=945487 loops=1)
Filter: (co = 117305223::oid)
Total runtime: 5786.268 ms
(3 rows)





Greg Stark wrote:

>Dirk Lutzebaeck <lutzeb@aeccom.com> writes:
>
>
>
>>Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
>>tables have been analyzed before.
>>
>>
>
>Really? A lot of the estimates are very far off. If you really just analyzed
>these tables immediately prior to the query then perhaps you should try
>raising the statistics target on spec and co. Or is the problem that there's a
>correlation between those two columns?
>
>
>
>>               ->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677
loops=1)
>>                     ->  Nested Loop  (cost=0.00..5757.17 rows=17 width=51) (actual time=0.467..3216.342 rows=48563
loops=1)
>>                           ->  Nested Loop  (cost=0.00..5606.39 rows=30 width=42) (actual time=0.381..1677.014
rows=48563loops=1) 
>>                                 ->  Index Scan using es_sc_index on es  (cost=0.00..847.71 rows=301 width=8) (actual
time=0.184..46.519rows=5863 loops=1) 
>>                                       Index Cond: ((spec = 122293729) AND (co = 117305223::oid))
>>
>>
>
>The root of your problem,. The optimizer is off by a factor of 20. It thinks
>these two columns are much more selective than they are.
>
>
>
>>                                 ->  Index Scan using bi_env_index on bi  (cost=0.00..15.80 rows=1 width=42) (actual
time=0.052..0.218rows=8 loops=5863) 
>>                                       Index Cond: ("outer".en = bi.en)
>>                                       Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))
>>
>>
>
>It also thinks these three columns are much more selective than they are.
>
>How accurate are its estimates if you just do these?
>
>explain analyze select * from es where spec = 122293729
>explain analyze select * from es where co = 117305223::oid
>explain analyze select * from bi where rc = 130170467::oid
>explain analyze select * from bi where co = 117305223
>explain analyze select * from bi where hide = false
>
>If they're individually accurate then you've run into the familiar problem of
>needing cross-column statistics. If they're individually inaccurate then you
>should try raising the targets on those columns with:
>
>ALTER TABLE [ ONLY ] name [ * ]
>    ALTER [ COLUMN ] column SET STATISTICS integer
>
>and reanalyzing.
>
>
>Dirk Lutzebaeck <lutzeb@aeccom.com> writes:
>
>
>
>>Can some please explain why the temp file is so huge? I understand
>>there are a lot of rows.
>>
>>
>
>Well that I can't explain. 22k rows of width 1361 doesn't sound so big to me
>either. The temporary table does need to store three copies of the records at
>a given time, but still it sounds like an awful lot.
>
>
>
>


pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Slow Restoration of a template1 Database (ALTER GROUP)
Next
From: Dirk.Lutzebaeck@t-online.de (Dirk Lutzebaeck)
Date:
Subject: Re: query produces 1 GB temp file