Thread: Query performance issue
Hi,
I have a query performance issue: the query takes a very long time, and I can't even get EXPLAIN ANALYZE to return output. The query joins 3 tables with roughly these row counts:
a - 176223509
b - 286887780
c - 214219514
explain
select Count(a."individual_entity_proxy_id")
from "prospect" a
inner join "individual_demographic" b
on a."individual_entity_proxy_id" = b."individual_entity_proxy_id"
inner join "household_demographic" c
on a."household_entity_proxy_id" = c."household_entity_proxy_id"
where (((a."last_contacted_anychannel_dttm" is null)
or (a."last_contacted_anychannel_dttm" < TIMESTAMP '2020-11-23 0:00:00.000000'))
and (a."shared_paddr_with_customer_ind" = 'N')
and (a."profane_wrd_ind" = 'N')
and (a."tmo_ofnsv_name_ind" = 'N')
and (a."has_individual_address" = 'Y')
and (a."has_last_name" = 'Y')
and (a."has_first_name" = 'Y'))
and ((b."tax_bnkrpt_dcsd_ind" = 'N')
and (b."govt_prison_ind" = 'N')
and (b."cstmr_prspct_ind" = 'Prospect'))
and (( c."hspnc_lang_prfrnc_cval" in ('B', 'E', 'X') )
or (c."hspnc_lang_prfrnc_cval" is null));
-- Explain output
"Finalize Aggregate (cost=32813309.28..32813309.29 rows=1 width=8)"
" -> Gather (cost=32813308.45..32813309.26 rows=8 width=8)"
" Workers Planned: 8"
" -> Partial Aggregate (cost=32812308.45..32812308.46 rows=1 width=8)"
" -> Merge Join (cost=23870130.00..32759932.46 rows=20950395 width=8)"
" Merge Cond: (a.individual_entity_proxy_id = b.individual_entity_proxy_id)"
" -> Sort (cost=23870127.96..23922503.94 rows=20950395 width=8)"
" Sort Key: a.individual_entity_proxy_id"
" -> Hash Join (cost=13533600.42..21322510.26 rows=20950395 width=8)"
" Hash Cond: (a.household_entity_proxy_id = c.household_entity_proxy_id)"
" -> Parallel Seq Scan on prospect a (cost=0.00..6863735.60 rows=22171902 width=16)"
" Filter: (((last_contacted_anychannel_dttm IS NULL) OR (last_contacted_anychannel_dttm < '2020-11-23 00:00:00'::timestamp without time zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND (profane_wrd_ind = 'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 'Y'::bpchar) AND (has_last_name = 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))"
" -> Hash (cost=10801715.18..10801715.18 rows=166514899 width=8)"
" -> Seq Scan on household_demographic c (cost=0.00..10801715.18 rows=166514899 width=8)"
" Filter: (((hspnc_lang_prfrnc_cval)::text = ANY ('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS NULL))"
" -> Index Only Scan using indx_individual_demographic_prxyid_taxind_prspctind_prsnind on individual_demographic b (cost=0.57..8019347.13 rows=286887776 width=8)"
" Index Cond: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND (cstmr_prspct_ind = 'Prospect'::text) AND (govt_prison_ind = 'N'::bpchar))"
The tables' DDL is attached in dbfiddle -- Postgres 11 | db<>fiddle
Server configuration is:
Version: 10.11
RAM - 320GB
vCPU - 32
"maintenance_work_mem" 256MB
"work_mem" 1GB
"shared_buffers" 64GB
Any suggestions?
Thanks,
Rj
On Fri, Jan 22, 2021 at 01:53:26AM +0000, Nagaraj Raj wrote:
> Tables ddl are attached in dbfiddle -- Postgres 11 | db<>fiddle
> Server configuration is: Version: 10.11, RAM - 320GB, vCPU - 32,
> "maintenance_work_mem" 256MB, "work_mem" 1GB, "shared_buffers" 64GB
>
> Aggregate  (cost=31.54..31.55 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
>   ->  Nested Loop  (cost=0.00..31.54 rows=1 width=8) (actual time=0.007..0.008 rows=0 loops=1)
>         Join Filter: (a.household_entity_proxy_id = c.household_entity_proxy_id)
>         ->  Nested Loop  (cost=0.00..21.36 rows=1 width=16) (actual time=0.006..0.007 rows=0 loops=1)
>               Join Filter: (a.individual_entity_proxy_id = b.individual_entity_proxy_id)
>               ->  Seq Scan on prospect a  (cost=0.00..10.82 rows=1 width=16) (actual time=0.006..0.006 rows=0 loops=1)
>                     Filter: (((last_contacted_anychannel_dttm IS NULL) OR (last_contacted_anychannel_dttm < '2020-11-23 00:00:00'::timestamp without time zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND (profane_wrd_ind = 'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 'Y'::bpchar) AND (has_last_name = 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))
>               ->  Seq Scan on individual_demographic b  (cost=0.00..10.53 rows=1 width=8) (never executed)
>                     Filter: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND (govt_prison_ind = 'N'::bpchar) AND ((cstmr_prspct_ind)::text = 'Prospect'::text))
>         ->  Seq Scan on household_demographic c  (cost=0.00..10.14 rows=3 width=8) (never executed)
>               Filter: (((hspnc_lang_prfrnc_cval)::text = ANY ('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS NULL))
> Planning Time: 1.384 ms
> Execution Time: 0.206 ms
> 13 rows

It's doing nested loops with estimated rowcount=1, which indicates a bad
underestimate and suggests that the conditions are redundant or correlated.

Maybe you can handle this with MV stats on the correlated columns:

CREATE STATISTICS prospect_stats (dependencies) ON
    shared_paddr_with_customer_ind, profane_wrd_ind, tmo_ofnsv_name_ind,
    has_individual_address, has_last_name, has_first_name
    FROM prospect;
CREATE STATISTICS individual_demographic_stats (dependencies) ON
    tax_bnkrpt_dcsd_ind, govt_prison_ind, cstmr_prspct_ind
    FROM individual_demographic;
ANALYZE prospect, individual_demographic;

Since it's expensive to compute stats on a large number of columns, I'd then
check *which* ones are correlated and only compute MV stats on those. This
will show "col1 => col2: X"; where X approaches 1, the conditions are highly
correlated:

SELECT * FROM pg_statistic_ext;  -- pg_statistic_ext_data since v12

Also, as a diagnostic tool to get "explain analyze" to finish, you can:
SET enable_nestloop = off;

--
Justin
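For reference, once those statistics objects exist and ANALYZE has run, the dependency coefficients can be read back with something like the query below. This is only a sketch for Postgres 10/11 (the stxdependencies column moves to pg_statistic_ext_data in v12+), and it assumes the two object names used above:

SELECT stxname,
       stxrelid::regclass AS table_name,
       stxdependencies              -- functional-dependency strengths, 0..1
FROM pg_statistic_ext
WHERE stxname IN ('prospect_stats', 'individual_demographic_stats');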
On 1/22/21 3:35 AM, Justin Pryzby wrote:
> It's doing nested loops with estimated rowcount=1, which indicates a bad
> underestimate and suggests that the conditions are redundant or correlated.

No, it's not. The dbfiddle does that because it's using empty tables, but the
plan shared by Nagaraj does not contain any nested loops.

Nagaraj, if the EXPLAIN ANALYZE does not complete, there are two things you
can do to determine which part of the plan is causing trouble.

Firstly, you can profile the backend using perf or some other profiler, and
if we're lucky the function names will give us some hints about which node
type is using the CPU.

Secondly, you can "cut" the query into smaller parts and run only parts of
the plan - essentially start from the inner-most join and incrementally add
more and more tables until it gets too long.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
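As a rough sketch of the "cut the query into smaller parts" approach (the filters are abbreviated here; in practice you would keep all of the original predicates), one might start with the inner-most join from the plan and add the third table only once the first step completes:

-- Step 1: inner-most join only (prospect + household_demographic)
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM prospect a
JOIN household_demographic c
  ON a.household_entity_proxy_id = c.household_entity_proxy_id
WHERE a.shared_paddr_with_customer_ind = 'N';   -- plus the other prospect filters

-- Step 2: add individual_demographic only if step 1 finishes in reasonable time
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM prospect a
JOIN household_demographic c
  ON a.household_entity_proxy_id = c.household_entity_proxy_id
JOIN individual_demographic b
  ON a.individual_entity_proxy_id = b.individual_entity_proxy_id
WHERE a.shared_paddr_with_customer_ind = 'N';   -- plus the other filters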
What indexes exist on those tables? How many rows do you expect to get back in total? Is the last_contacted_anychannel_dttm clause restrictive, or does it match most of the prospect table (check pg_stats for the histogram if you don't know)?
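For example, the selectivity of that clause can be estimated from the column statistics (assuming prospect has been analyzed); null_frac covers the IS NULL arm and histogram_bounds the < TIMESTAMP arm:

SELECT null_frac, n_distinct, histogram_bounds
FROM pg_stats
WHERE tablename = 'prospect'
  AND attname = 'last_contacted_anychannel_dttm';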
Are these conditions expected to throw out very few rows, or most of the table?
and (a."shared_paddr_with_customer_ind" = 'N')
and (a."profane_wrd_ind" = 'N')
and (a."tmo_ofnsv_name_ind" = 'N')
and (a."has_individual_address" = 'Y')
and (a."has_last_name" = 'Y')
and (a."has_first_name" = 'Y'))
If you change both joins to EXISTS clauses, do you get the same plan when you run explain?
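For reference, a sketch of the EXISTS form of the query; the filters are unchanged, and note that the count could differ from the join version if either demographic table can match a prospect row more than once:

EXPLAIN
SELECT count(a."individual_entity_proxy_id")
FROM "prospect" a
WHERE ((a."last_contacted_anychannel_dttm" IS NULL)
       OR (a."last_contacted_anychannel_dttm" < TIMESTAMP '2020-11-23 00:00:00'))
  AND a."shared_paddr_with_customer_ind" = 'N'
  AND a."profane_wrd_ind" = 'N'
  AND a."tmo_ofnsv_name_ind" = 'N'
  AND a."has_individual_address" = 'Y'
  AND a."has_last_name" = 'Y'
  AND a."has_first_name" = 'Y'
  AND EXISTS (SELECT 1
              FROM "individual_demographic" b
              WHERE b."individual_entity_proxy_id" = a."individual_entity_proxy_id"
                AND b."tax_bnkrpt_dcsd_ind" = 'N'
                AND b."govt_prison_ind" = 'N'
                AND b."cstmr_prspct_ind" = 'Prospect')
  AND EXISTS (SELECT 1
              FROM "household_demographic" c
              WHERE c."household_entity_proxy_id" = a."household_entity_proxy_id"
                AND (c."hspnc_lang_prfrnc_cval" IN ('B', 'E', 'X')
                     OR c."hspnc_lang_prfrnc_cval" IS NULL));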