Thread: BUG #17544: Join order for INNER JOIN ... USING with GROUP BY on join column affects selected query plan

The following bug has been logged on the website:

Bug reference:      17544
Logged by:          Eric Sheng
Email address:      _@ericsheng.com
PostgreSQL version: 14.4
Operating system:   Ubuntu 20.04
Description:

It seems like the join order of a simple INNER JOIN ... USING(...) affects
query plan selection even when only joining two tables, when used with GROUP
BY on the join column.

Schema: groups with priority, containing zero or more tasks with a boolean

    CREATE TABLE groups(group_id INT PRIMARY KEY NOT NULL, priority INT);
    CREATE TABLE tasks(task_id INT PRIMARY KEY NOT NULL, group_id INT NOT
NULL REFERENCES groups(group_id), finished BOOLEAN NOT NULL);
    CREATE INDEX groups_priority_index ON groups(priority, group_id);
    CREATE INDEX tasks_group_id_index ON tasks(group_id);

Data: 1000 groups (random priority) with 1000 tasks per group

    DO $$
      BEGIN
        FOR i IN 0 .. 999 LOOP
          INSERT INTO groups(group_id, priority) VALUES(i, FLOOR(RANDOM() *
2000000000)::INT);
          FOR j in 0 .. 999 LOOP
            INSERT INTO tasks(task_id, group_id, finished) VALUES (1000 * i
+ j, i, RANDOM() * 100 < 99);
          END LOOP;
        END LOOP;
      END;
    $$;

Queries: fetching the lowest priority groups and whether or not all tasks in
them have finished (same query twice, just different join order)

    EXPLAIN (ANALYZE, TIMING) SELECT group_id, BOOL_AND(finished) FROM
groups INNER JOIN tasks USING(group_id) GROUP BY group_id, priority ORDER BY
priority ASC LIMIT 10;


QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=15771.99..15772.02 rows=10 width=9) (actual
time=293.054..302.919 rows=10 loops=1)
       ->  Sort  (cost=15771.99..15777.64 rows=2260 width=9) (actual
time=293.051..302.914 rows=10 loops=1)
             Sort Key: groups.priority
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Finalize HashAggregate  (cost=15700.56..15723.16 rows=2260
width=9) (actual time=292.744..302.777 rows=1000 loops=1)
                   Group Key: groups.group_id
                   Batches: 1  Memory Usage: 241kB
                   ->  Gather  (cost=15203.36..15677.96 rows=4520 width=9)
(actual time=291.308..301.725 rows=2836 loops=1)
                         Workers Planned: 2
                         Workers Launched: 2
                         ->  Partial HashAggregate  (cost=14203.36..14225.96
rows=2260 width=9) (actual time=263.344..263.560 rows=945 loops=3)
                               Group Key: groups.group_id
                               Batches: 1  Memory Usage: 241kB
                               Worker 0:  Batches: 1  Memory Usage: 241kB
                               Worker 1:  Batches: 1  Memory Usage: 241kB
                               ->  Hash Join  (cost=60.85..11725.61
rows=495550 width=9) (actual time=0.719..182.977 rows=333333 loops=3)
                                     Hash Cond: (tasks.group_id =
groups.group_id)
                                     ->  Parallel Seq Scan on tasks
(cost=0.00..10361.50 rows=495550 width=5) (actual time=0.044..61.366
rows=333333 loops=3)
                                     ->  Hash  (cost=32.60..32.60 rows=2260
width=8) (actual time=0.627..0.627 rows=1000 loops=3)
                                           Buckets: 4096  Batches: 1  Memory
Usage: 72kB
                                           ->  Seq Scan on groups
(cost=0.00..32.60 rows=2260 width=8) (actual time=0.042..0.298 rows=1000
loops=3)
     Planning Time: 0.310 ms
     Execution Time: 303.294 ms

    EXPLAIN (ANALYZE, TIMING) SELECT group_id, BOOL_AND(finished) FROM tasks
INNER JOIN groups USING(group_id) GROUP BY group_id, priority ORDER BY
priority ASC LIMIT 10;


QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=0.70..1.35 rows=10 width=9) (actual time=1.146..10.606
rows=10 loops=1)
       ->  GroupAggregate  (cost=0.70..64932.27 rows=1000000 width=9)
(actual time=1.144..10.599 rows=10 loops=1)
             Group Key: groups.priority, tasks.group_id
             ->  Nested Loop  (cost=0.70..47432.27 rows=1000000 width=9)
(actual time=0.055..8.028 rows=10001 loops=1)
                   ->  Index Only Scan using groups_priority_index on groups
 (cost=0.28..43.27 rows=1000 width=8) (actual time=0.028..0.036 rows=11
loops=1)
                         Heap Fetches: 0
                   ->  Index Scan using tasks_group_id_index on tasks
(cost=0.42..37.39 rows=1000 width=5) (actual time=0.017..0.433 rows=909
loops=11)
                         Index Cond: (group_id = groups.group_id)
     Planning Time: 0.878 ms
     Execution Time: 10.672 ms

All join orders for inner joins give semantically equivalent results, and
documentation at https://www.postgresql.org/docs/current/explicit-joins.html
indicates that the planner should explore all join orders unless there are
too many tables, so I would have expected the join order here to be
immaterial to the query plan chosen.

Digging a bit deeper, this appears to be due to the USING clause causing the
GROUP BY group_id to be rewritten to `groups.group_id` in the first query
and `tasks.group_id` in the second query, and the former resulting in the
simpler plan not being used. The same difference can be observed running:

    EXPLAIN (ANALYZE, TIMING) SELECT groups.group_id, BOOL_AND(finished)
FROM tasks INNER JOIN groups ON tasks.group_id = groups.group_id GROUP BY
groups.group_id, priority ORDER BY priority ASC LIMIT 10;
    (the slow plan)

    EXPLAIN (ANALYZE, TIMING) SELECT tasks.group_id, BOOL_AND(finished) FROM
tasks INNER JOIN groups ON tasks.group_id = groups.group_id GROUP BY
tasks.group_id, priority ORDER BY priority ASC LIMIT 10;
    (the fast plan)


On 7/10/22 09:52, PG Bug reporting form wrote:
> All join orders for inner joins give semantically equivalent results, and
> documentation at https://www.postgresql.org/docs/current/explicit-joins.html
> indicates that the planner should explore all join orders unless there are
> too many tables, so I would have expected the join order here to be
> immaterial to the query plan chosen.
> 
> Digging a bit deeper, this appears to be due to the USING clause causing the
> GROUP BY group_id to be rewritten to `groups.group_id` in the first query
> and `tasks.group_id` in the second query, and the former resulting in the
> simpler plan not being used. The same difference can be observed running:
> 
>      EXPLAIN (ANALYZE, TIMING) SELECT groups.group_id, BOOL_AND(finished)
> FROM tasks INNER JOIN groups ON tasks.group_id = groups.group_id GROUP BY
> groups.group_id, priority ORDER BY priority ASC LIMIT 10;
>      (the slow plan)
> 
>      EXPLAIN (ANALYZE, TIMING) SELECT tasks.group_id, BOOL_AND(finished) FROM
> tasks INNER JOIN groups ON tasks.group_id = groups.group_id GROUP BY
> tasks.group_id, priority ORDER BY priority ASC LIMIT 10;
>      (the fast plan) 
Documentation speaks the truth - optimizer checks all join permutations. 
But USING clause doesn't pull both group_id vars and implicitly chooses 
only one according to an algorithm, described in 
parse_clause.c::buildMergedJoinVar(). So, because you use group_id in 
upper GROUP BY, optimizer is limited to specific set of strategies, 
because it must use the same grouping variable, as implicitly chosen in 
the JOIN USING clause.

-- 
Regards
Andrey Lepikhov
Postgres Professional



PG Bug reporting form <noreply@postgresql.org> writes:
> It seems like the join order of a simple INNER JOIN ... USING(...) affects
> query plan selection even when only joining two tables, when used with GROUP
> BY on the join column.
> ...
> Digging a bit deeper, this appears to be due to the USING clause causing the
> GROUP BY group_id to be rewritten to `groups.group_id` in the first query
> and `tasks.group_id` in the second query, and the former resulting in the
> simpler plan not being used.

Yeah.  In principle that shouldn't matter, but a confluence of
peculiarities of this specific scenario and a perhaps-overly-aggressive
optimization on the GROUP BY clause prevent the planner from finding
the good plan when the JOIN USING variable is resolved as coming from
the same table as the other GROUP BY column.  Technical details at [1].
I'm not sure whether we'll end up using the quick-hack patch shown there,
but if you're desperate for a fix you could apply that locally.

            regards, tom lane

[1] https://www.postgresql.org/message-id/1657885.1657647073%40sss.pgh.pa.us