self join, parameterized base/join rel path row estimation and generally... - Mailing list pgsql-performance

From Bongseo Jang
Subject self join, parameterized base/join rel path row estimation and generally...
Date
Msg-id CAKPsnf1+ZU0W=UYPB_eAi3u8qATx43tny5Hba4x_B_T71qUa6A@mail.gmail.com
Whole thread Raw
List pgsql-performance
Hi,

I have two plans of a query.
nestloop plan is much faster, but planner chose the slower one, hashjoin.
 
  http://explain.depesz.com/s/Aqs
  http://explain.depesz.com/s/97C

it seems that rows=39698995  are quite overestimated.

-> Nested Loop (cost=0.000..5403.600 rows=39698995 width=45) (actual time=0.392..14.817 rows=943 loops=1)
   -> Nested Loop (cost=0.000..17.600 rows=1 width=8) (actual time=0.241..0.246 rows=1 loops=1)
         -> Index Scan using seven on hotel three (cost=0.000..6.880 rows=1 width=6) (actual time=0.113..0.115 rows=1 loops=1)
                Index Cond: (two = 31750::numeric)
         -> Index Scan using echo on oscar_foxtrot november (cost=0.000..10.710 rows=1 width=14) (actual time=0.117..0.118 rows=1 loops=1)
                 Index Cond: (charlie = three.golf)
   -> Index Scan using zulu on oscar_foxtrot juliet (cost=0.000..3849.200 rows=153679 width=45) (actual time=0.147..14.241 rows=943 loops=1)
         Index Cond: ((uniform_yankee = november.uniform_yankee) AND (uniform_victor = november.uniform_victor))

pg_stats is like this;
> select attname, null_frac, n_distinct, most_common_vals, most_common_freqs from pg_stats where tablename like 'oscar_foxtrot%' and (attname = 'uniform_yankee' or attname = 'uniform_victor')
"uniform_yankee";0;12;"{83886082,83886085}";"{0.9742,0.02}"
"uniform_victor";0;23;"{1342767106,1342308357}";"{0.973467,0.02}"

I assumed that nestloop rows would be more or less inner_path_rows * outer_path_rows with good pg_stats, and good plan could come based on it.

the plan above is not that case. Suspcious of 40 million rows and small number of values(actually two values) making up 98% of distribution. 
so.. I looked up some code and found that rows=153679 is rows of parameterized base rel estimated by eqsel(), and row=39698995 is rows of parameterized join rel by eqjoinsel().
I think wrong plan above comes from the fact that the two estimation cannot be close in general, great difference in my case.

where am i wrong and right? 
Is there recommended approach, related issue, commit and so on i can follow ?

thanks


[[nstallation Info]]
PostgreSQL-9.2.5 (via postgresql yum repository)
OS: Centos 6.3 (custom linux-3.10.12 kernel)
postgresql.conf:
    effective_cache_size = 10000MB
    shared_buffers = 1000MB
    work_mem = 100MB
    maintenance_work_mem = 100MB
HW: CPU 4-core Xeon x 2 sockets, RAM 256GB

--
Regards,
Jang.

 a sound mind in a sound body

pgsql-performance by date:

Previous
From: Josh Kupershmidt
Date:
Subject: Re: pg_repack solves alter table set tablespace lock
Next
From: Ying He
Date:
Subject: Re: pg_repack solves alter table set tablespace lock