Merge joins on index scans - Mailing list pgsql-performance

From James Parks
Subject Merge joins on index scans
Date
Msg-id CAJ3Xv+hGjhFk_GSZzh2YqJFpSGStwmsNeM_9yLd=T_UHWdguSA@mail.gmail.com
List pgsql-performance
Dear pgsql-performance,

I'm having issues with a certain query, and I was hoping you could help me out.

The schema:

(Start with a new database cluster, with either SQL_ASCII or en_US.UTF-8 encoding, using the default server configuration from the pgdg Jessie packages.)

CREATE TABLE a (id bigint primary key, nonce bigint);
CREATE TABLE b (id bigint primary key, a_id bigint not null);
CREATE INDEX a_idx ON b (a_id);
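
For anyone who wants to reproduce the shape of the data without the raw dump, here is a hypothetical generator; the exact values and skew are assumptions pieced together from the counts reported further down, not the original data set:

```sql
-- Assumed synthetic data, approximating the distribution described below:
-- ~490k rows in (a) with ~198 distinct nonce values, and ~83k rows in (b)
-- heavily skewed toward a single a_id (19084 accounts for ~83k of them).
INSERT INTO a (id, nonce)
SELECT g, (g % 198) + 1
FROM generate_series(1, 490166) AS g;

INSERT INTO b (id, a_id)
SELECT g, CASE WHEN g <= 82977 THEN 19084 ELSE g END
FROM generate_series(1, 83427) AS g;
```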

The query:

SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id ASC;

(skip down to [1] and [2] to see the query performance)

What I know:

If you force the query planner to use a merge join on the above query, it takes 10+ minutes to complete using the data described below. If you force a hash join on the same data, it takes ~200 milliseconds. This behavior is the same on both PostgreSQL 9.2.15 and PostgreSQL 9.4.6 (at least as provided by the Debian Jessie repo hosted by postgresql.org), and happens on both i386 and x86_64 platforms. Note: the data for these queries is the same as described below. Let me know if you want me to provide the raw CSVs or similar.

Creating a new PostgreSQL 9.4 cluster (64-bit), creating tables (a) and (b), importing the table data, and then running the above query under EXPLAIN results in a merge join plan, as in [1].

Creating a new PostgreSQL 9.2 cluster (32-bit or 64-bit), creating tables (a) and (b), importing the table data, and then running the above query results in a hash join plan, as in [2].

When running query [1], the postgres process on the machine consumes 100% CPU for a long time (it seems CPU-bound).

What I expected:

I expected the hash join and merge join implementations of this query to have comparable run times, perhaps within an order of magnitude, mostly because the cost estimates for the two plans were very similar. Instead, the "optimal" plan takes more than 1000x longer.

I also expected that the "Rows Removed by Filter: " for the index scan on (a) would not have such a high count, as the number of rows in table (a) (~500,000) is significantly less than the count (2,201,063,696).

What I want to know:

- Is this expected behavior? Can you describe how the merge join algorithm achieves these results?
- Can I avoid this issue by disabling merge joins in the server configuration?
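
If disabling merge joins does turn out to be the right workaround, it can be scoped to a single session rather than the whole server. A sketch (note that `enable_mergejoin = off` only discourages the plan by inflating its estimated cost; it does not forbid it outright):

```sql
SET enable_mergejoin = off;   -- discourage merge joins for this session only
SELECT b.* FROM b JOIN a ON b.a_id = a.id
 WHERE a.nonce = 64 ORDER BY b.id ASC;
RESET enable_mergejoin;       -- restore the default afterwards
```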

Configuration:

The database uses the sample configuration shipped with the Debian Jessie packages of PostgreSQL available at http://apt.postgresql.org/pub/repos/apt/ with the exception that the data directory was explicitly specified.

Information about the data:

Here are some queries that help describe the data I'm working with:

postgres=# select distinct a_id, count(*) from b group by a_id;
  a_id  | count
--------+-------
  49872 |   320
  47994 |     5
  19084 | 82977
  53251 |   100
 109804 |    10
  51738 |     5
  49077 |    10

postgres=# select count(*) from b;
 count
-------
 83427

postgres=# select count(distinct nonce) from a;
 count
-------
   198

postgres=# select count(*) from a;
 count
--------
 490166

postgres=# select count(*) from a where nonce = 64;
 count
-------
   395

Hardware:

2015-era Intel Xeon processors
> 300 GB of RAM (greater than the size of the database by a large margin)
database on hardware raid 1 array on 2 SSDs

Commentary:

This is my first bug report to a major open source project, so I apologize in advance if I've messed it up. Let me know if I have left out key details -- I'm happy to provide them.

Given that there are roughly 500k rows in table (a), the EXPLAIN output's claim that the filter (nonce = 64) removed over 2 billion rows (refer to [1]) suggests that each row in table (b) is being compared against a non-negligible fraction of table (a).
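
As a back-of-the-envelope check (simple arithmetic on the numbers in [1], not a claim about the executor's internals):

```sql
-- 2,201,063,696 filtered rows spread over 83,427 outer rows works out to
-- roughly 26k rows of (a) examined per row of (b):
SELECT 2201063696::bigint / 83427 AS a_rows_per_b_row;  -- 26383
```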

I can probably make this data available as a pg_dump file. Let me know if you think that's necessary, and where I should upload it.

Regards,
James

[1]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id = a.id where a.nonce = 64 order by b.id asc;
                                                             QUERY PLAN                                                            
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7855.25..7855.61 rows=143 width=16) (actual time=752058.415..752080.731 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2096kB
   Buffers: shared hit=2151025721 read=479, temp read=267 written=267
   I/O Timings: read=2.384
   ->  Merge Join  (cost=869.07..7850.13 rows=143 width=16) (actual time=5.718..751760.637 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=2151025721 read=479
         I/O Timings: read=2.384
         ->  Index Scan using a_idx on b  (cost=0.00..2953.35 rows=83427 width=16) (actual time=0.007..68.165 rows=83427 loops=1)
               Buffers: shared hit=1303 read=139
               I/O Timings: read=1.369
         ->  Index Scan using a_pkey on a  (cost=0.00..26163.20 rows=843 width=8) (actual time=5.706..751385.306 rows=83658 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 2201063696
               Buffers: shared hit=2151024418 read=340
               I/O Timings: read=1.015
 Total runtime: 752092.206 ms

[2]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id = a.id where a.nonce = 64 order by b.id asc;
                                                     QUERY PLAN                                                    
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10392.28..10392.64 rows=143 width=16) (actual time=164.415..186.297 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2112kB
   Buffers: shared hit=2514 read=587, temp read=267 written=267
   I/O Timings: read=1.199
   ->  Hash Join  (cost=8787.61..10387.16 rows=143 width=16) (actual time=61.836..113.434 rows=83427 loops=1)
         Hash Cond: (b.a_id = a.id)
         Buffers: shared hit=2514 read=587
         I/O Timings: read=1.199
         ->  Seq Scan on b  (cost=0.00..1285.27 rows=83427 width=16) (actual time=0.011..15.826 rows=83427 loops=1)
               Buffers: shared hit=449 read=2
               I/O Timings: read=0.009
         ->  Hash  (cost=8777.08..8777.08 rows=843 width=8) (actual time=61.812..61.812 rows=395 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               Buffers: shared hit=2065 read=585
               I/O Timings: read=1.190
               ->  Seq Scan on a  (cost=0.00..8777.08 rows=843 width=8) (actual time=0.143..61.609 rows=395 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 489771
                     Buffers: shared hit=2065 read=585
                     I/O Timings: read=1.190
 Total runtime: 198.014 ms

The above numbers were acquired using this version of Postgresql:
PostgreSQL 9.2.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
