Partitionwise JOIN scanning all partitions, even unneeded ones - Mailing list pgsql-general

From Dimitrios Apostolou
Subject Partitionwise JOIN scanning all partitions, even unneeded ones
Date
Msg-id 704abd8e-5b3e-f946-fb7c-58d39eb1268e@gmx.net
Whole thread Raw
Responses Re: Partitionwise JOIN scanning all partitions, even unneeded ones
List pgsql-general
Hello list,

I believe I'm facing a performance bug in PostgreSQL with partitionwise
JOINs. I have reduced the issue to the minimum queries I could, please
read-on for details and see attached files for EXPLAIN output with timings
and database settings.

I'd appreciate help on whether it's a real issue, and if it's unknown I
can forward this to the psql-bugs mailing list. I'd also appreciate any
critique on the clarity of my description and on my schema and queries,
since I'm new to postgres.



==== Short description

+ Each textual "task_id" is described by a unique number "task_n".
+ Each task can have many workitems, and this relation is stored in
   "tasks_mm_workitems". A workitem has a unique number "workitem_n".
+ I have a massive (10G rows) table called test_runs_raw. It contains
   a list of test runs for each workitem_n.
   + TABLE test_runs_raw is PARTITIONED into 1000
     partitions (many of them empty, reserved for future)
     according to RANGE(workitem_n).

SUMMARY: I want to do stuff with all test_runs of a task.
          Basically I want JOIN tables in order to go from one "task_id" to
          one "task_n" to many "workitem_n" to manymany test runs.

PROBLEM: If I JOIN the 3 tables to do that, postgres scans ALL partitions
          of the massive table test_runs_raw, and this takes hours.

          On the other hand, if I omit the 1st part of the JOIN and provide
          the "task_n" manually, and JOIN only the 2 latter tables,
          postgres only scans the relevant partitions and my queries
          return fast.

VERSION:
  PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04)
11.4.0,64-bit 


==== Queries

Here are some seemingly meaningless queries, that are just minimal
reproducers of the issue (you can ignore "DISTINCT" part, it's just there
to limit the output I was getting).

===== Slow query:

EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)   SELECT DISTINCT workitem_n
     FROM task_ids
     JOIN tasks_mm_workitems USING(task_n)
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_id = '1698813977';

The EXPLAIN output here shows a parallel hash join doing seq scans on each
and every partition. Basically the whole 10G rows table is being
seq-scanned.

===== Fast query:

Here I resolve the "task_n" from the "task_id" manually, and omit the
first part of the JOIN.

SELECT task_n FROM task_ids WHERE task_id = '1698813977';
  task_n
--------
   59854
(1 row)

EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)   SELECT DISTINCT workitem_n
     FROM  tasks_mm_workitems
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_n = 59854;

The EXPLAIN output here shows "(never executed)" on most partition seq
scans. The relevant partitions are index-scanned. Basically only a few
partitions are accessed, and only partially through their index on
"workitem_n". This is excellent and what I would expect in all cases.



==== Schema description

> \d+ task_ids
  Column  |  Type   | Collation | Nullable |                 Default
---------+---------+-----------+----------+------------------------------------------
  task_n  | integer |           | not null |
nextval('task_ids_task_n_seq'::regclass)
  task_id | text    |           | not null |
Indexes:
     "task_ids_pkey" PRIMARY KEY, btree (task_n)
     "task_ids_task_id_key" UNIQUE CONSTRAINT, btree (task_id)

> \d tasks_mm_workitems
            Table "public.tasks_mm_workitems"
    Column   |  Type   | Collation | Nullable | Default
------------+---------+-----------+----------+---------
  task_n     | integer |           | not null |
  workitem_n | integer |           | not null |
Indexes:
     "tasks_mm_workitems_pkey" PRIMARY KEY, btree (task_n, workitem_n)
Foreign-key constraints:
     "fk1_tasks_mm_workitems" FOREIGN KEY (task_n) REFERENCES
task_ids(task_n)
     "fk2_tasks_mm_workitems" FOREIGN KEY (workitem_n) REFERENCES
workitem_ids(workitem_n)

> \d test_runs_raw
                                  Partitioned table "public.test_runs_raw"
       Column       |            Type             | Collation | Nullable |
Default
-------------------+-----------------------------+-----------+----------+----------------------------------
  run_n             | bigint                      |           | not null |
generated by default as identity
[...]
  workitem_n        | integer                     |           | not null |
[...]
Partition key: RANGE (workitem_n)
Indexes:
     "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n,
run_n), tablespace "archival_tablespace_1"
Foreign-key constraints:
     [...]
     "test_runs_raw_partitioned_workitem_n_fkey" FOREIGN KEY (workitem_n)
REFERENCES workitem_ids(workitem_n)
Number of partitions: 1000 (Use \d+ to list them.)
Tablespace: "tablespace1"



Thank you in advance,
Dimitris

Attachment

pgsql-general by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: event trigger clarification
Next
From: David Rowley
Date:
Subject: Re: Partitionwise JOIN scanning all partitions, even unneeded ones