Re: multi-layered view join performance oddities - Mailing list pgsql-performance

From Svenne Krap
Subject Re: multi-layered view join performance oddities
Date
Msg-id 436511DF.5020800@krap.dk
Whole thread Raw
In response to Re: multi-layered view join performance oddities  ("Steinar H. Gunderson" <sgunderson@bigfoot.com>)
List pgsql-performance
Hi.

Your suggestion with disableing the nested loop really worked well:

rkr=# set enable_nestloop=false;
SET
rkr=# explain analyze select * from ord_result_pct_pretty ;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=230.06..337.49 rows=1 width=174) (actual time=21.893..42.356 rows=2250 loops=1)
   Hash Cond: (("outer".dataset_id = "inner".dataset_id) AND ("outer".nb_property_type_id = "inner".nb_property_type_id))
   ->  Hash Join  (cost=56.94..164.10 rows=26 width=93) (actual time=5.073..17.906 rows=2532 loops=1)
         Hash Cond: ("outer".dataset_id = "inner".id)
         ->  Hash Join  (cost=55.54..161.63 rows=161 width=57) (actual time=4.996..14.775 rows=2532 loops=1)
               Hash Cond: ("outer".institut = "inner".id)
               ->  Append  (cost=54.38..121.72 rows=2476 width=44) (actual time=4.964..11.827 rows=2532 loops=1)
                     ->  HashAggregate  (cost=54.38..57.20 rows=226 width=16) (actual time=4.964..5.174 rows=282 loops=1)
                           ->  Seq Scan on ord_entrydata_current  (cost=0.00..37.50 rows=2250 width=16) (actual time=0.002..1.305 rows=2250 loops=1)
                     ->  Subquery Scan "*SELECT* 2"  (cost=0.00..60.00 rows=2250 width=20) (actual time=0.009..4.948 rows=2250 loops=1)
                           ->  Seq Scan on ord_entrydata_current  (cost=0.00..37.50 rows=2250 width=20) (actual time=0.003..2.098 rows=2250 loops=1)
               ->  Hash  (cost=1.13..1.13 rows=13 width=17) (actual time=0.022..0.022 rows=13 loops=1)
                     ->  Seq Scan on groups g  (cost=0.00..1.13 rows=13 width=17) (actual time=0.003..0.013 rows=13 loops=1)
         ->  Hash  (cost=1.32..1.32 rows=32 width=36) (actual time=0.070..0.070 rows=32 loops=1)
               ->  Seq Scan on ord_dataset od  (cost=0.00..1.32 rows=32 width=36) (actual time=0.009..0.043 rows=32 loops=1)
                     Filter: is_visible
   ->  Hash  (cost=173.07..173.07 rows=10 width=97) (actual time=15.472..15.472 rows=256 loops=1)
         ->  Hash Join  (cost=166.15..173.07 rows=10 width=97) (actual time=14.666..15.203 rows=256 loops=1)
               Hash Cond: ("outer".nb_property_type_id = "inner".id)
               ->  HashAggregate  (cost=165.05..168.15 rows=248 width=40) (actual time=14.619..14.849 rows=288 loops=1)
                     ->  Append  (cost=54.38..121.72 rows=2476 width=44) (actual time=5.012..11.130 rows=2532 loops=1)
                           ->  HashAggregate  (cost=54.38..57.20 rows=226 width=16) (actual time=5.011..5.222 rows=282 loops=1)
                                 ->  Seq Scan on ord_entrydata_current  (cost=0.00..37.50 rows=2250 width=16) (actual time=0.001..1.261 rows=2250 loops=1)
                           ->  Subquery Scan "*SELECT* 2"  (cost=0.00..60.00 rows=2250 width=20) (actual time=0.010..4.308 rows=2250 loops=1)
                                 ->  Seq Scan on ord_entrydata_current  (cost=0.00..37.50 rows=2250 width=20) (actual time=0.002..1.694 rows=2250 loops=1)
               ->  Hash  (cost=1.08..1.08 rows=8 width=57) (actual time=0.026..0.026 rows=8 loops=1)
                     ->  Seq Scan on nb_property_type npt  (cost=0.00..1.08 rows=8 width=57) (actual time=0.004..0.019 rows=8 loops=1)
 Total runtime: 43.297 ms
(28 rows)

Now, the whole question becomes, how do I get the planner to make a better estimation of the returned rows.

I am not sure, I can follow your moving-the-union-all-further-out advice, as I see no different place for the unioning of the two datasets.

Maybe one of the core devs know, where to fiddle :)

Svenne

Steinar H. Gunderson wrote:
On Sun, Oct 30, 2005 at 06:16:04PM +0100, Svenne Krap wrote: 
 Nested Loop  (cost=223.09..338.61 rows=1 width=174) (actual time=20.213..721.361 rows=2250 loops=1)  Join Filter: (("outer".dataset_id = "inner".dataset_id) AND ("outer".nb_property_type_id = "inner".nb_property_type_id))  ->  Hash Join  (cost=58.04..164.26 rows=1 width=150) (actual time=5.510..22.088 rows=2250 loops=1)   
There's horrible misestimation here. It expects one row and thus starts a
nested loop, but gets 2250. No wonder it's slow :-)

The misestimation can be traced all the way down here:
 
         Hash Cond: ("outer".institut = "inner".id)        ->  Hash Join  (cost=56.88..163.00 rows=16 width=137) (actual time=5.473..19.165 rows=2250 loops=1)              Hash Cond: ("outer".dataset_id = "inner".id)              ->  Hash Join  (cost=55.48..160.95 rows=99 width=101) (actual time=5.412..16.264 rows=2250 loops=1)   
where the planner misestimates the selectivity of your join (it estimates 99
rows, and there are 2250).

I've had problems joining with Append nodes in the past, and solved the
problem by moving the UNION ALL a bit out, but I'm not sure if it's a very
good general solution, or a solution to your problems here.

If all else fails, you could "set enable_nestloop=false", but that is not a
good idea in the long run, I'd guess -- it's much better to make sure the
planner has good estimates and let it do the correct decisions from there.

/* Steinar */ 

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: multi-layered view join performance oddities
Next
From: Svenne Krap
Date:
Subject: Re: multi-layered view join performance oddities