RE: [EXTERNAL] Re: Multi-column join + aggregate subquery resulting in infinite run time - Mailing list pgsql-novice

From Vianello, Dan A
Subject RE: [EXTERNAL] Re: Multi-column join + aggregate subquery resulting in infinite run time
Date
Msg-id 605e961868924b5abe41a8987f0b5766@NCEMEXGP001.CORP.CHARTERCOM.com
In response to Re: Multi-column join + aggregate subquery resulting in infinite run time  (Steve Estes <denzera@gmail.com>)
List pgsql-novice

What do results look like with this? 

SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match,
       count(dl.acct_id) AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address
LEFT JOIN prod_dunning_letters dl ON ad.acct_id = dl.acct_id
GROUP BY ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.acct_id;
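If that version also crawls, capturing the actual row counts next to the planner's estimates usually shows where the plan goes wrong. A diagnostic sketch (note that EXPLAIN ANALYZE executes the query, so be ready to cancel it):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match,
       count(dl.acct_id) AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address
LEFT JOIN prod_dunning_letters dl ON ad.acct_id = dl.acct_id
GROUP BY ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.acct_id;
```

Comparing "rows=" (estimate) against "actual rows=" on each node pinpoints which estimate the planner gets wrong.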


From: Steve Estes [mailto:denzera@gmail.com]
Sent: Thursday, July 09, 2020 12:29 PM
To: pgsql-novice@lists.postgresql.org
Subject: [EXTERNAL] Re: Multi-column join + aggregate subquery resulting in infinite run time

Just wanted to bump this in case anyone had any insights.  If you live in a place served by Grubhub or Doordash, I will literally buy you lunch if you can help me figure out why these joins wouldn't take linear time to bolt on, rather than geometrically increasing.  As I envision it, if you've got a common base table serving as the parent in a LEFT JOIN, each additional table you LEFT JOIN onto it should take roughly the same amount of time: there's no interplay to calculate between that table and any others; you're just matching on your keys with one scan through the indexes, and off you go to the next one.  Clearly I'm either thinking about it wrong or doing something wrong as my joins increase beyond one.

-Steve

On Mon, Jul 6, 2020 at 10:59 AM Steve Estes <denzera@gmail.com> wrote:

Hey gang,

I thought I knew what I was doing with queries, but this has more-or-less ruined my 4th of July holiday, so I think it's high time to ask for help.  I imagine it's something trivial and obvious to you.

I've got 3 tables in question here:

prod_account_details => 538k records
    line_number PK,
    acct_id integer,          -- nearly unique, but not quite; acct_id PLUS address is unique
    address varchar(100),
    entity_name varchar(100),
    (... a few other fields ...)
    btree index on acct_id, hash index on address, also btree on (acct_id, address)

prod_customer_profiles => 532k records, nearly 1:1 with account details
    acct_id integer,                    -- corresponds to account_details.acct_id
    match_input_address varchar(100),   -- corresponds to account_details.address
    (... lots of other fields ...)
    btree index on acct_id, hash index on match_input_address, also btree on (acct_id, match_input_address)

prod_dunning_letters => 518k records, 1:M with account_details (covering ~181k acct_ids)
    line_number PK,
    acct_id integer,          -- FK to account_details, not formally via constraint but semantically
    (... a few other fields ...)
    btree index on acct_id

--- Queries ---

So firstly, what works:

(1)
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address;

...returns the 532k matched records in 3.0 seconds.

Now if we INNER join to the dunning_letters table via a grouped subquery:

(2)
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address
INNER JOIN (SELECT count(*) AS num_records, acct_id
            FROM prod_dunning_letters
            GROUP BY acct_id) dl ON ad.acct_id = dl.acct_id;

...it returns fine too, 161k records in 1.3 seconds.

But now if I do something as simple as changing that second join (to the subquery) to a LEFT JOIN, it blows up.  That is, I want all valid accounts, each appended with the count of records in the dunning_letters table, if any exist:

(3)
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address
LEFT JOIN (SELECT count(*) AS num_records, acct_id
           FROM prod_dunning_letters
           GROUP BY acct_id) dl ON ad.acct_id = dl.acct_id;

What happens is, this runs for 20+ seconds, and then starts returning ~100 records every 10 seconds, which at the volume I'm talking about (>500k) would take forever to return.  And next up I need to bolt on the counts for some other, bigger tables too, so we're quickly approaching heat-death-of-the-universe amounts of runtime here.
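One way to test whether the planner's choice of join method, rather than the LEFT JOIN itself, is the problem is to disable nested loops for the session and re-run the query. A diagnostic sketch, not a production setting:

```sql
SET enable_nestloop = off;  -- session-local planner toggle, for diagnosis only

EXPLAIN ANALYZE
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address
LEFT JOIN (SELECT count(*) AS num_records, acct_id
           FROM prod_dunning_letters
           GROUP BY acct_id) dl ON ad.acct_id = dl.acct_id;

RESET enable_nestloop;
```

If the query finishes in seconds this way, the slowness comes from the planner's row-count misestimate driving it into a nested loop, not from the query shape.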

Now what's interesting is, if I relax the join on the first two tables, i.e. drop the address part of the join and only do the acct_id one,

(4)
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match, dl.num_records AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    -- AND ad.address = acx.match_input_address
LEFT JOIN (SELECT count(*) AS num_records, acct_id
           FROM prod_dunning_letters
           GROUP BY acct_id) dl ON ad.acct_id = dl.acct_id;

...it starts returning immediately and spools 695k records in 4.7 seconds.  The problem is that we need the address part of the join in there, because there are some duplicate acct_ids on both sides so we get a bunch of cartesian-join junk.
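The planner's estimate for the two-column join (rows=1 in the plans below) suggests it treats acct_id and address as statistically independent and multiplies their selectivities. On PostgreSQL 10 or later, extended statistics can record the dependency; a possible fix (the statistics names here are arbitrary, and how much the join estimate improves depends on your version):

```sql
-- Record that address is functionally dependent on acct_id, so the
-- two equality conditions aren't treated as independent filters.
CREATE STATISTICS account_details_dep (dependencies)
    ON acct_id, address FROM prod_account_details;
CREATE STATISTICS customer_profiles_dep (dependencies)
    ON acct_id, match_input_address FROM prod_customer_profiles;
ANALYZE prod_account_details;
ANALYZE prod_customer_profiles;
```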

--- EXPLAIN query plans ---

So here's the Explain for query (1), with just the two tables:

Gather  (cost=102995.20..123944.32 rows=1 width=41)
  Workers Planned: 2
  ->  Parallel Hash Join  (cost=101995.20..122944.22 rows=1 width=41)
        Hash Cond: ((ad.acct_id = acx.acct_id) AND (ad.address = (acx.match_input_address)::text))
        ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
        ->  Parallel Hash  (cost=97096.88..97096.88 rows=224288 width=26)
              ->  Parallel Seq Scan on prod_customer_profiles acx  (cost=0.00..97096.88 rows=224288 width=26)

Now, when we add on the third table, its strategy for that join changes.  Here's the Explain for (2), with the inner join that works:

Gather  (cost=22500.40..83636.43 rows=1 width=49)
  Workers Planned: 2
  ->  Nested Loop  (cost=21500.40..82636.33 rows=1 width=49)
        ->  Hash Join  (cost=21500.40..40427.74 rows=56285 width=54)
              Hash Cond: (ad.acct_id = prod_dunning_letters.acct_id)
              ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
              ->  Hash  (cost=19345.40..19345.40 rows=123920 width=15)
                    ->  GroupAggregate  (cost=0.42..18106.20 rows=123920 width=15)
                          Group Key: prod_dunning_letters.acct_id
                          ->  Index Only Scan using prod_dunning_letters_acct_id_idx on prod_dunning_letters  (cost=0.42..14269.36 rows=519529 width=7)
        ->  Index Scan using prod_customer_profiles_match_input_address_idx on prod_customer_profiles acx  (cost=0.00..0.74 rows=1 width=26)
              Index Cond: ((match_input_address)::text = ad.address)
              Filter: (ad.acct_id = acct_id)

But then change it to a LEFT JOIN, as in query (3), and the plan changes to:

Nested Loop Left Join  (cost=102995.63..144838.72 rows=1 width=49)
  Join Filter: (ad.acct_id = prod_dunning_letters.acct_id)
  ->  Gather  (cost=102995.20..123944.32 rows=1 width=41)
        Workers Planned: 2
        ->  Parallel Hash Join  (cost=101995.20..122944.22 rows=1 width=41)
              Hash Cond: ((ad.acct_id = acx.acct_id) AND (ad.address = (acx.match_input_address)::text))
              ->  Parallel Seq Scan on prod_account_details ad  (cost=0.00..14226.43 rows=224343 width=39)
              ->  Parallel Hash  (cost=97096.88..97096.88 rows=224288 width=26)
                    ->  Parallel Seq Scan on prod_customer_profiles acx  (cost=0.00..97096.88 rows=224288 width=26)
  ->  GroupAggregate  (cost=0.42..18106.20 rows=123920 width=15)
        Group Key: prod_dunning_letters.acct_id
        ->  Index Only Scan using prod_dunning_letters_acct_id_idx on prod_dunning_letters  (cost=0.42..14269.36 rows=519529 width=7)


Just reading that, I can't readily tell why #2 finishes fast and #3 runs without end.  If the difference was JUST the use of a left join, then why does #4 finish easily too, just with a single-column join between the first two tables?
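One reading of plan (3): the Gather side is estimated at rows=1, so the planner is happy to put the GroupAggregate over all ~519k dunning_letters rows on the inner side of a nested loop, with no materialization.  If the outer side actually produces ~532k rows, that aggregate gets recomputed per outer row.  A rewrite that sidesteps the problematic join entirely is a correlated scalar subquery; a sketch (note dl_count comes back as 0 rather than NULL for accounts with no letters):

```sql
SELECT ad.acct_id, ad.entity_name, ad.address, acx.flag_infobase_match,
       (SELECT count(*)
        FROM prod_dunning_letters dl
        WHERE dl.acct_id = ad.acct_id) AS dl_count
FROM prod_account_details ad
INNER JOIN prod_customer_profiles acx ON ad.acct_id = acx.acct_id
    AND ad.address = acx.match_input_address;
```

Each count is then a separate lookup on prod_dunning_letters_acct_id_idx, so the cost grows linearly with the number of output rows regardless of the join estimate.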

This has befuddled the rest of my experimentation with it, so I'd really appreciate some insight.

-Steve

