Extremely inefficient merge-join - Mailing list pgsql-performance

From Marcin Gozdalik
Subject Extremely inefficient merge-join
Msg-id CADu1mRO28n4FifmrJYQ6TfFFdWS7tvwksE=7FVmuqc7BsL2OeQ@mail.gmail.com
Responses Re: Extremely inefficient merge-join  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Hi

I am occasionally hitting an extremely inefficient merge join. The plan is nominally a merge join, but it effectively degenerates into a nested loop, even though a suitable index is present.

PostgreSQL 9.6.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

Schema of dir_current (some columns left out for brevity):

starfish=# \d sf.dir_current
                                         Table "sf.dir_current"
       Column       |    Type     | Collation | Nullable |                    Default
--------------------+-------------+-----------+----------+-----------------------------------------------
 id                 | bigint      |           | not null | nextval('sf.object_id_seq'::regclass)
 volume_id          | bigint      |           | not null |
 parent_id          | bigint      |           |          |
 blocks             | sf.blkcnt_t |           |          |
 rec_aggrs          | jsonb       |           | not null |
Indexes:
    "dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
    "dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
    "dir_current_volume_id_out_of_sync_time_idx" btree (volume_id, out_of_sync_time) WHERE out_of_sync_time IS NOT NULL, tablespace "sf_current"
    "dir_current_volume_id_path_unq_idx" UNIQUE, btree (volume_id, path text_pattern_ops), tablespace "sf_current"
    "dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
    "dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED

dir_process is created as a temporary table:

CREATE TEMP TABLE dir_process AS (
  SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks, atime, ctime, mtime, sync_time, local_aggrs FROM sf.dir_current
                WHERE ....
        );

CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;

and usually contains a few thousand rows.

Slow query:
  SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks FROM sf.dir_current AS dir
  INNER JOIN dir_process ON dir.parent_id = dir_process.id AND dir.volume_id = dir_process.volume_id
  WHERE dir.volume_id = ANY(volume_ids)

dir_current contains around 750M rows in total, and there are ca. 1.75M rows with volume_id = 5.
Sometimes Postgres chooses a very inefficient plan that loops many times over the same rows, producing hundreds of millions or even billions of intermediate rows:

LOG:  duration: 1125530.496 ms  plan:
        Merge Join  (cost=909.42..40484.01 rows=1 width=456) (actual rows=1 loops=1)
          Merge Cond: (dir.volume_id = dir_process.volume_id)
          Join Filter: (dir.parent_id = dir_process.id)
          Rows Removed by Join Filter: 13583132483
          ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=2000756 loops=1)
                Index Cond: (volume_id = ANY ('{5}'::bigint[]))
          ->  Sort  (cost=909.31..912.70 rows=6789 width=16) (actual rows=13581131729 loops=1)
                Sort Key: dir_process.volume_id
                Sort Method: quicksort  Memory: 511kB
                ->  Seq Scan on dir_process  (cost=0.00..822.89 rows=6789 width=16) (actual rows=6789 loops=1)

LOG:  duration: 3923310.224 ms  plan:
        Merge Join  (cost=0.17..4324.64 rows=1 width=456) (actual rows=529 loops=1)
          Merge Cond: (dir_process.volume_id = dir.volume_id)
          Join Filter: (dir.parent_id = dir_process.id)
          Rows Removed by Join Filter: 831113021
          ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.05..245.00 rows=450 width=16) (actual rows=450 loops=1)
                Heap Fetches: 450
          ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..884052.46 rows=601329 width=456) (actual rows=831113101 loops=1)
                Index Cond: (volume_id = ANY ('{5}'::bigint[]))

LOG:  duration: 10140968.829 ms  plan:
        Merge Join  (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
          Merge Cond: (dir_process.volume_id = dir.volume_id)
          Join Filter: (dir.parent_id = dir_process.id)
          Rows Removed by Join Filter: 2153506735
          ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
                Heap Fetches: 1166
          ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
                Index Cond: (volume_id = ANY ('{5}'::bigint[]))

Note that 2153506389 / 1166 ≈ 1 846 918, and similarly 831113101 / 450 ≈ 1 846 918 — i.e., the full set of ~1.85M dir_current rows for the volume is rescanned once per dir_process row.
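A quick sanity check of that arithmetic (plain division, nothing Postgres-specific) confirms the inner side is being rescanned in full once per outer row:

```python
# Rows reported by the two slower plans: total rows produced by the
# rescanned dir_current side, and the row count of the dir_process side.
plans = [
    (831_113_101, 450),     # second plan
    (2_153_506_389, 1_166), # third plan
]

for total_inner_rows, outer_rows in plans:
    per_rescan = total_inner_rows // outer_rows
    # Each rescan reads the ~1.85M dir_current rows with volume_id = 5.
    print(per_rescan)  # 1846918 in both cases
```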

How can I help the Postgres query planner choose a faster plan?
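One workaround I can think of (a sketch only — `enable_mergejoin` is a standard planner GUC, and `SET LOCAL` confines the change to the current transaction) is to disable merge joins for just this query, nudging the planner toward a hash or nested-loop join instead:

```sql
BEGIN;
-- Discourage the degenerate merge join for this query only;
-- the setting reverts automatically at COMMIT/ROLLBACK.
SET LOCAL enable_mergejoin = off;

SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process
        ON dir.parent_id = dir_process.id
       AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY(volume_ids);
COMMIT;
```

But I would prefer a fix that lets the planner get this right on its own.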

--
Marcin Gozdalik
