Thread: Partitionwise JOIN scanning all partitions, even unneeded ones


From: Dimitrios Apostolou
Hello list,

I believe I'm facing a performance bug in PostgreSQL with partitionwise
JOINs. I have reduced the issue to the smallest queries I could; please
read on for details, and see the attached files for EXPLAIN output with
timings and the database settings.

I'd appreciate help on whether it's a real issue, and if it's unknown I
can forward this to the pgsql-bugs mailing list. I'd also appreciate any
critique on the clarity of my description and on my schema and queries,
since I'm new to postgres.



==== Short description

+ Each textual "task_id" is described by a unique number "task_n".
+ Each task can have many workitems, and this relation is stored in
   "tasks_mm_workitems". A workitem has a unique number "workitem_n".
+ I have a massive (10G rows) table called test_runs_raw. It contains
   a list of test runs for each workitem_n.
   + TABLE test_runs_raw is PARTITIONED into 1000
     partitions (many of them empty, reserved for future)
     according to RANGE(workitem_n).
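
   For reference, the layout could be declared roughly like this (a
   reduced sketch only: the real table has many more columns, and the
   partition name and range bounds here are made up):

   -- Sketch: column list reduced, partition bounds illustrative.
   CREATE TABLE test_runs_raw (
       run_n      bigint  GENERATED BY DEFAULT AS IDENTITY,
       workitem_n integer NOT NULL,
       PRIMARY KEY (workitem_n, run_n)
   ) PARTITION BY RANGE (workitem_n);

   -- One of ~1000 partitions, each covering a range of workitem_n:
   CREATE TABLE test_runs_raw__part_max20k
       PARTITION OF test_runs_raw
       FOR VALUES FROM (0) TO (20000);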

SUMMARY: I want to process all test_runs of a task.
          Basically I want to JOIN the tables in order to go from one
          "task_id" to one "task_n", to many "workitem_n", to many more
          test runs.

PROBLEM: If I JOIN the 3 tables to do that, postgres scans ALL partitions
          of the massive table test_runs_raw, and this takes hours.

          On the other hand, if I omit the first part of the JOIN and
          provide the "task_n" manually, JOINing only the two latter
          tables, postgres scans only the relevant partitions and my
          queries return fast.

VERSION:
  PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


==== Queries

Here are some seemingly meaningless queries that are just minimal
reproducers of the issue (you can ignore the "DISTINCT" part; it's only
there to limit the output I was getting).

===== Slow query:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)   SELECT DISTINCT workitem_n
     FROM task_ids
     JOIN tasks_mm_workitems USING(task_n)
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_id = '1698813977';

The EXPLAIN output here shows a parallel hash join doing seq scans on
each and every partition. Basically the whole 10G-row table is being
seq-scanned.

===== Fast query:

Here I resolve the "task_n" from the "task_id" manually, and omit the
first part of the JOIN.

SELECT task_n FROM task_ids WHERE task_id = '1698813977';
  task_n
--------
   59854
(1 row)

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)   SELECT DISTINCT workitem_n
     FROM  tasks_mm_workitems
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_n = 59854;

The EXPLAIN output here shows "(never executed)" on most partition seq
scans. The relevant partitions are index-scanned. Basically only a few
partitions are accessed, and only partially through their index on
"workitem_n". This is excellent and what I would expect in all cases.
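
As a possible middle ground (I have not fully tested this), the "task_n"
lookup can be inlined as a scalar subquery, which keeps it a single
query; the subquery is evaluated once, and if the planner picks a
parameterised plan, run-time pruning may still apply:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)   SELECT DISTINCT workitem_n
     FROM  tasks_mm_workitems
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_n = (SELECT task_n FROM task_ids
                     WHERE task_id = '1698813977');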



==== Schema description

> \d+ task_ids
  Column  |  Type   | Collation | Nullable |                 Default
---------+---------+-----------+----------+------------------------------------------
  task_n  | integer |           | not null | nextval('task_ids_task_n_seq'::regclass)
  task_id | text    |           | not null |
Indexes:
     "task_ids_pkey" PRIMARY KEY, btree (task_n)
     "task_ids_task_id_key" UNIQUE CONSTRAINT, btree (task_id)

> \d tasks_mm_workitems
            Table "public.tasks_mm_workitems"
    Column   |  Type   | Collation | Nullable | Default
------------+---------+-----------+----------+---------
  task_n     | integer |           | not null |
  workitem_n | integer |           | not null |
Indexes:
     "tasks_mm_workitems_pkey" PRIMARY KEY, btree (task_n, workitem_n)
Foreign-key constraints:
     "fk1_tasks_mm_workitems" FOREIGN KEY (task_n) REFERENCES task_ids(task_n)
     "fk2_tasks_mm_workitems" FOREIGN KEY (workitem_n) REFERENCES workitem_ids(workitem_n)

> \d test_runs_raw
                                  Partitioned table "public.test_runs_raw"
        Column       |            Type             | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+----------------------------------
  run_n             | bigint                      |           | not null | generated by default as identity
[...]
  workitem_n        | integer                     |           | not null |
[...]
Partition key: RANGE (workitem_n)
Indexes:
     "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n), tablespace "archival_tablespace_1"
Foreign-key constraints:
     [...]
     "test_runs_raw_partitioned_workitem_n_fkey" FOREIGN KEY (workitem_n) REFERENCES workitem_ids(workitem_n)
Number of partitions: 1000 (Use \d+ to list them.)
Tablespace: "tablespace1"



Thank you in advance,
Dimitris


Re: Partitionwise JOIN scanning all partitions, even unneeded ones

From: David Rowley
On Thu, 31 Aug 2023 at 07:55, Dimitrios Apostolou <jimis@gmx.net> wrote:
> I'd appreciate help on whether it's a real issue, and if it's unknown I
> can forward this to the psql-bugs mailing list. I'd also appreciate any
> critique on the clarity of my description and on my schema and queries,
> since I'm new to postgres.

Thank you for posting this here first. We often get reports of bugs
that are not bugs, so this saves some traffic there.

> ===== Slow query:
>
> EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)   SELECT DISTINCT workitem_n
>      FROM task_ids
>      JOIN tasks_mm_workitems USING(task_n)
>      JOIN test_runs_raw USING(workitem_n)
>      WHERE task_id = '1698813977';
>
> The EXPLAIN output here shows a parallel hash join doing seq scans on each
> and every partition. Basically the whole 10G rows table is being
> seq-scanned.

I'm sorry to say that this is not a bug; it's simply an optimisation
that we've not yet implemented.  The run-time partition pruning that
produces the "(never executed)" Append subnodes in the fast plan only
happens when the Append (or MergeAppend) is parameterised by some
column from above or from the outer side of the join.  You can see
that in this fragment of your EXPLAIN output:

->  Index Only Scan using test_runs_raw__part_max6180k_pkey on public.test_runs_raw__part_max6180k test_runs_raw_309
    Output: test_runs_raw_309.workitem_n
    Index Cond: (test_runs_raw_309.workitem_n = tasks_mm_workitems.workitem_n)

Note that tasks_mm_workitems is from the outer side of the join.

The same isn't done for Hash Joins, as there is no parameterisation
with that join type.  It is technically possible to do, but it would
mean running the partition pruning algorithm once for each row that
goes into the hash table and taking the union of all the matching
partitions.  That's quite a bit of work, especially if you don't
manage to prune anything in the end.

Having said that, there is some work going on by Richard Guo [1] that
aims to implement this. It is quite a tricky thing to do without
causing needless pruning work in cases where no partitions can be
pruned. If you are interested, you can follow that thread to see the
discussion of the difficulties of implementing this in a way that does
not cause performance regressions for queries where no pruning is
possible.

David

[1] https://commitfest.postgresql.org/44/4512/



Re: Partitionwise JOIN scanning all partitions, even unneeded ones

From: Dimitrios Apostolou
Thank you for the clear explanation; I hope the missing optimisation
gets implemented sooner rather than later. Until then, maybe the query
planner should account for the missing optimisation and avoid *hash*
partitionwise joins.

Indeed I verified that disabling hash join fixed the situation, with both
queries taking almost the same time. Great!
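
Concretely, what I tried was along these lines (sketching it here;
SET LOCAL scopes the change to a single transaction instead of the
whole cluster):

BEGIN;
SET LOCAL enable_hashjoin = off;   -- affects only this transaction
SELECT DISTINCT workitem_n
     FROM task_ids
     JOIN tasks_mm_workitems USING(task_n)
     JOIN test_runs_raw USING(workitem_n)
     WHERE task_id = '1698813977';
COMMIT;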

In the meantime, I'm considering disabling hash joins globally, as I've
had issues with them before (excessive I/O causing much slower
execution than a merge join; see [1] for your answer to my question
back then :). Do you think that would save me from other problems I'm
not aware of, given the huge size of the table I'm querying?

[1] https://www.postgresql.org/message-id/CAApHDvpPVydoNkEqLyBSbjWq8kq8M7YWdkA44rTeA2MNaO3jsw@mail.gmail.com


Regards,
Dimitris