Re: Merge join vs merge semi join against primary key - Mailing list pgsql-general

From Sean Rhea
Subject Re: Merge join vs merge semi join against primary key
Date
Msg-id CACZYdDjgvnyNZnwZNEsgP6D=KtB5fu-iOWMoNHXN9kSWfiQYKQ@mail.gmail.com
Whole thread Raw
In response to Merge join vs merge semi join against primary key  (Sean Rhea <sean.c.rhea@gmail.com>)
Responses Re: Merge join vs merge semi join against primary key
List pgsql-general
It does the merge (not-semi) join:

production=> explain analyze SELECT ac.* FROM balances ac JOIN customers o ON (o.id = ac.customer_id AND o.group_id = 45);
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=2172.47..19959.82 rows=6 width=80) (actual time=114.578..243898.199 rows=7318 loops=1)
   Merge Cond: (ac.customer_id = o.id)
   ->  Index Scan using balances_customer_id_index on balances ac  (cost=0.00..727.42 rows=16876 width=80) (actual time=0.025..20.972 rows=16876 loops=1)
   ->  Index Scan using customers_pkey on customers o  (cost=0.00..64811.57 rows=179 width=8) (actual time=92.174..243813.231 rows=7672 loops=1)
         Filter: (group_id = 45)
         Rows Removed by Filter: 212699113
 Total runtime: 243901.595 ms
(7 rows)

Sean



On Fri, Oct 9, 2015 at 1:09 PM, Igor Neyman <ineyman@perceptron.com> wrote:

Hello,

 

I'm seeing some inexplicable (to me) behavior in PostgreSQL 9.2. I checked

the archives, but I still can't explain it. Apologies if I missed something.

 

1. When I join two tables with "WHERE id IN (...)" versus with an explicit

join, and the join column for the inner table is a primary key, I would expect

the same behavior in both cases, but the optimizer is choosing a merge join in

one case and a merge semi join in the other. There's at most one customer

with a given id. Why not do a semi join?

 

2. Even though the join methods are different, I would expect about the same

performance in either case, but one query takes only a few hundred

milliseconds while the other takes hundreds of seconds. Ouch!

 

Can anyone help me explain this behavior?

 

Some details are below. Let me know if it would be helpful to gather others.

 

Sean

 

 

production=> select version();

                                        version

----------------------------------------------------------------------------------------

 PostgreSQL 9.2.13 on i686-pc-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 32-bit

(1 row)

 

production=> \d customers

                        Table "public.customers"

     Column    |   Type   |                         Modifiers

---------------+----------+-------------------------------------------------------

 id            | bigint   | not null default nextval('customers_id_seq'::regclass)

 group_id      | bigint   |

...

Indexes:

    "customers_pkey" PRIMARY KEY, btree (id)

...

 

production=> select count(*) from customers;

 count

--------

 473733

(1 row)

 

production=> \d balances

                       Table "public.balances"

        Column     |   Type   |                         Modifiers

-------------------+----------+------------------------------------------------------

 id                | bigint   | not null default nextval('balances_id_seq'::regclass)

 balance           | integer  | not null default 0

 tracking_number   | integer  | not null

 customer_id       | bigint   | not null

...

Indexes:

    "balances_pkey" PRIMARY KEY, btree (id)

    "balances_customer_tracking_number_index" UNIQUE, btree (customer_id, tracking_number)

...

 

production=> select count(*) from balances;

 count

-------

 16876

(1 row)

 

production=> analyze verbose customers;

INFO:  analyzing "public.customers"

INFO:  "customers": scanned 14280 of 14280 pages, containing 475288 live rows and 1949 dead rows; 300000 rows in sample, 475288 estimated total rows

ANALYZE

 

production=> analyze verbose balances;

INFO:  analyzing "public.balances"

INFO:  "balances": scanned 202 of 202 pages, containing 16876 live rows and 0 dead rows; 16876 rows in sample, 16876 estimated total rows

ANALYZE

 

production=> explain analyze SELECT * FROM balances where customer_id IN (SELECT id from customers WHERE group_id = 45);

                                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------

 Merge Semi Join  (cost=2442.14..19958.30 rows=16876 width=80) (actual time=119.905..145.126 rows=7318 loops=1)

   Merge Cond: (balances.customer_id = customers.id)

   ->  Index Scan using balances_customer_id_index on balances  (cost=0.00..727.79 rows=16876 width=80) (actual time=0.302..9.477 rows=16876 loops=1)

   ->  Index Scan using customers_pkey on customers  (cost=0.00..64192.97 rows=184 width=8) (actual time=103.354..126.459 rows=359 loops=1)

         Filter: (group_id = 45)

         Rows Removed by Filter: 141684

 Total runtime: 146.659 ms

(7 rows)

 

production=> explain analyze SELECT ac.* FROM balances ac join customers o ON o.id = ac.customer_id WHERE o.group_id = 45;

                                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Merge Join  (cost=2214.50..20216.86 rows=30 width=80) (actual time=185.615..201991.752 rows=7318 loops=1)

   Merge Cond: (ac.customer_id = o.id)

   ->  Index Scan using balances_customer_tracking_number_index on balances ac  (cost=0.00..1007.49 rows=16876 width=80) (actual time=0.068..25.036 rows=16876 loops=1)

   ->  Index Scan using customers_pkey on customers o  (cost=0.00..63836.61 rows=836 width=8) (actual time=159.840..201915.765 rows=7672 loops=1)

         Filter: (group_id = 45)

         Rows Removed by Filter: 212699113

 Total runtime: 201995.044 ms

(7 rows)

 

 

What if you rewrite your second query like this:

 

SELECT ac.*

FROM balances ac JOIN customers o ON (o.id = ac.customer_id AND o.group_id = 45);

 

Regards,

Igor Neyman

 

 


pgsql-general by date:

Previous
From: Sean Rhea
Date:
Subject: Merge join vs merge semi join against primary key
Next
From: droberts
Date:
Subject: Re: Best practices for aggregate table design