Thread: JOIN on partitions is very slow

JOIN on partitions is very slow

From
daya airody
Date:
Hi folks,

We are using postgreSQL database and I am hitting some limits. I have partitions on company_sale_account table
based on company name

We generate a report on accounts matched between the two. Below is the query:

SELECT DISTINCT cpsa1.*
FROM company_sale_account cpsa1  
 JOIN  company_sale_account cpsa2  ON cpsa1.sale_account_id = cpsa2.sale_account_id
 WHERE  cpsa1.company_name = 'company_a'  
 AND cpsa2.company_name = 'company_b'


We have setup BTREE indexes on sale_account_id column on both the tables.
This worked fine till recently. Now, we have 10 million rows in
company_a partition and 7 million rows in company_b partition. This query is taking
more than 10 minutes.

Below is the explain plan output for it:

  Buffers: shared hit=20125996 read=47811 dirtied=75, temp read=1333427 written=1333427
  I/O Timings: read=19619.322
  ->  Sort  (cost=167950986.43..168904299.23 rows=381325118 width=132) (actual time=517017.334..603691.048 rows=16854094 loops=1)
        Sort Key: cpsa1.crm_account_id, ((cpsa1.account_name)::text), ((cpsa1.account_owner)::text), ((cpsa1.account_type)::text), cpsa1.is_customer, ((date_part('epoch'::text, cpsa1.created_date))::integer), ((hstore_to_json(cpsa1.custom_crm_fields))::tex (...)
        Sort Method: external merge  Disk: 2862656kB
        Buffers: shared hit=20125996 read=47811 dirtied=75, temp read=1333427 written=1333427
        I/O Timings: read=19619.322
        ->  Nested Loop  (cost=0.00..9331268.39 rows=381325118 width=132) (actual time=1.680..118698.570 rows=16854094 loops=1)
              Buffers: shared hit=20125977 read=47811 dirtied=75
              I/O Timings: read=19619.322
              ->  Append  (cost=0.00..100718.94 rows=2033676 width=33) (actual time=0.014..1783.243 rows=2033675 loops=1)
                    Buffers: shared hit=75298 dirtied=75
                    ->  Seq Scan on company_sale_account cpsa2  (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0 loops=1)
                          Filter: ((company_name)::text = 'company_b'::text)
                    ->  Seq Scan on company_sale_account_concur cpsa2_1  (cost=0.00..100718.94 rows=2033675 width=33) (actual time=0.013..938.145 rows=2033675 loops=1)
                          Filter: ((company_name)::text = 'company_b'::text)
                          Buffers: shared hit=75298 dirtied=75
              ->  Append  (cost=0.00..1.97 rows=23 width=355) (actual time=0.034..0.047 rows=8 loops=2033675)
                    Buffers: shared hit=20050679 read=47811
                    I/O Timings: read=19619.322
                    ->  Seq Scan on company_sale_account cpsa1  (cost=0.00..0.00 rows=1 width=4525) (actual time=0.000..0.000 rows=0 loops=2033675)
                          Filter: (((company_name)::text = 'company_a'::text) AND ((cpsa2.sale_account_id)::text = (sale_account_id)::text))
                    ->  Index Scan using ix_csa_adp_sale_account on company_sale_account_adp cpsa1_1  (cost=0.56..1.97 rows=22 width=165) (actual time=0.033..0.042 rows=8 loops=2033675)
                          Index Cond: ((sale_account_id)::text = (cpsa2.sale_account_id)::text)
                          Filter: ((company_name)::text = 'company_a'::text)
                          Buffers: shared hit=20050679 read=47811
                          I/O Timings: read=19619.322
Planning time: 30.853 ms
Execution time: 618218.321 ms


Do you have any suggestion on how to tune postgres.
Please share your thoughts. It would be a great help to me.

Re: JOIN on partitions is very slow

From
Michael Lewis
Date:
Are you able to tweak the query or is that generated by an ORM? What version of Postgres? Which configs have you changed from default? How many partitions do you have? Is there an index on company name?

Anytime I see distinct keyword, I expect it to be a performance bottleneck and wonder about rewriting the query. Even just using group by can be much faster because of how it gets executed.

Re: JOIN on partitions is very slow

From
daya airody
Date:
Yes. I can tweak the query. Version of postgres is 9.5.15. I have about 20 partitions for company_sale_account table. 
I do have an index on company name.

I need to use DISTINCT as i need to remove the duplicates.

Thanks for your time.



On Sun, Mar 22, 2020 at 11:38 PM Michael Lewis <mlewis@entrata.com> wrote:
Are you able to tweak the query or is that generated by an ORM? What version of Postgres? Which configs have you changed from default? How many partitions do you have? Is there an index on company name?

Anytime I see distinct keyword, I expect it to be a performance bottleneck and wonder about rewriting the query. Even just using group by can be much faster because of how it gets executed.

Re: JOIN on partitions is very slow

From
Michael Lewis
Date:
On Mon, Mar 23, 2020 at 1:40 AM daya airody <daya.airody@gmail.com> wrote:
Yes. I can tweak the query. Version of postgres is 9.5.15. I have about 20 partitions for company_sale_account table. 
I do have an index on company name.

I need to use DISTINCT as i need to remove the duplicates.

DISTINCT is a sign of improper joins most of the time in my experience. Often, just changing to group by is faster

SELECT cpsa1.*
FROM company_sale_account cpsa1  
 JOIN  company_sale_account cpsa2  ON cpsa1.sale_account_id = cpsa2.sale_account_id
 WHERE  cpsa1.company_name = 'company_a'  
 AND cpsa2.company_name = 'company_b'
GROUP BY cpsa1.id; --assuming primary key exists, and I forget if the feature that allows only naming primary key in group by might have been introduced with 9.6

It should be noted that 9.5 is about 1 year from being EOL'd so it would be prudent to update to v11 or 12 when possible.

How does the below query perform? By the way, "top posting" (replying with all previous email thread below your reply) is discouraged on these forums. It makes the reviewing archived posts more cumbersome. Instead, please reply with only your message and copying the relevant parts of prior conversation that you are responding to. 

SELECT cpsa1.*
FROM company_sale_account cpsa1  
WHERE cpsa1.company_name = 'company_a' AND EXISTS(SELECT FROM  company_sale_account cpsa2 WHER cpsa1.sale_account_id = cpsa2.sale_account_id AND cpsa2.company_name = 'company_b' );
 

Re: JOIN on partitions is very slow

From
Thomas Kellerer
Date:
Michael Lewis schrieb am 23.03.2020 um 17:16:
> Yes. I can tweak the query. Version of postgres is 9.5.15. I have
> about 20 partitions for company_sale_account table. I do have an
> index on company name.
>
> I need to use DISTINCT as i need to remove the duplicates.
>
>
> DISTINCT is a sign of improper joins most of the time in my
> experience. Often, just changing to group by is faster

As none of the columns of the joined table are used, most probably
this should be re-written as an EXISTS condition.
Then neither GROUP BY nor DISTINCT is needed.






Re: JOIN on partitions is very slow

From
daya airody
Date:


As none of the columns of the joined table are used, most probably
this should be re-written as an EXISTS condition.
Then neither GROUP BY nor DISTINCT is needed.

 
I need the columns from joined tables. To keep it simple, I didn't include them in the query. EXISTS solution won't work for me.