Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE - Mailing list pgsql-bugs

From: Sjors Gielen
Subject: Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
Msg-id: 5E875954-484B-4471-8571-7ED4B33974DC@sjorsgielen.nl
In response to: Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE (Sjors Gielen <sjors@sjorsgielen.nl>)
Responses: Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
List: pgsql-bugs
Dear all,

I am not sure I am using the correct list for this - please let me know if I
should move this to pgsql-hackers or pgsql-performance.

While investigating the issue below further, I found another potential issue
regarding CTE materialization: uniqueness information does not seem to be
propagated from the result of the CTE to the CTE Scan node. This not only
causes unnecessary steps in the query plan; the cost of those steps is also
dramatically underestimated, so an incorrect query plan may be chosen. I will
illustrate with an example schema and query.

I am working with the following table, which stores the courses taken by each
student together with the grade received.

CREATE TABLE student_courses (
    id BIGINT NOT NULL,
    course_id BIGINT NOT NULL,
    student_id BIGINT NOT NULL,
    semester_id BIGINT NOT NULL,
    grade BIGINT
);

ALTER TABLE student_courses
    ADD CONSTRAINT pkey PRIMARY KEY (id),
    ADD CONSTRAINT uq UNIQUE (semester_id, student_id, course_id);
CREATE INDEX idx on student_courses using btree (semester_id, student_id);
CREATE INDEX idx2 on student_courses using btree (student_id);
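
To make the example self-contained, the table can be filled with hypothetical sample data of roughly the right shape (the row counts and value ranges here are my invention, not the original dataset, which is linked at the bottom of this mail):

```sql
-- Hypothetical sample data: ~40k students, each enrolled in a handful of
-- semesters. (semester_id, student_id, course_id) stays unique because
-- each (student, semester) pair occurs exactly once in the cross join.
INSERT INTO student_courses (id, course_id, student_id, semester_id, grade)
SELECT row_number() OVER (),
       (random() * 500)::bigint,
       s.student_id,
       sem.semester_id,
       (random() * 10)::bigint
FROM generate_series(1, 40000) AS s(student_id)
CROSS JOIN generate_series(148, 153) AS sem(semester_id);

ANALYZE student_courses;
```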

Now, the issue shows up when we use a materialized CTE to select a set of rows
from the table, ensuring that this selection contains only unique rows, and
then join those rows back against the same table. In this example, we want a
list of each student's grades for all courses taken in their *last* semester
(assuming that semester IDs are in chronological order). To speed up the query
a bit, we include a minimum semester ID.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH last_semester_per_student AS MATERIALIZED (
    SELECT DISTINCT student_id, MAX(semester_id) AS semester_id
    FROM student_courses
    WHERE semester_id > 150
    GROUP BY student_id
)
SELECT count(*)
FROM student_courses
WHERE (student_id, semester_id) IN (
    SELECT student_id, semester_id
    FROM last_semester_per_student
);

This is a shortened version of the query plan:

 Aggregate  (cost=3560.15..3560.16 rows=1 width=8) (actual time=42.937..42.943 rows=1 loops=1)
   CTE last_semester_per_student
     ->  HashAggregate  (cost=1921.54..2142.00 rows=22046 width=16) (actual time=9.945..11.067 rows=21826 loops=1)
           Output: student_courses_1.student_id, (max(student_courses_1.semester_id))
           Group Key: student_courses_1.student_id, max(student_courses_1.semester_id)
           ->  HashAggregate  (cost=1590.85..1811.31 rows=22046 width=16) (actual time=6.165..7.504 rows=21826 loops=1)
                 Output: student_courses_1.student_id, max(student_courses_1.semester_id)
                 Group Key: student_courses_1.student_id
                 ->  Index Only Scan using idx on public.student_courses student_courses_1  (cost=0.42..1393.87 rows=39397 width=16) (actual time=0.035..1.849 rows=39729 loops=1)
   ->  Nested Loop  (cost=496.33..1218.96 rows=79678 width=0) (actual time=16.022..39.995 rows=121968 loops=1)
         ->  HashAggregate  (cost=496.04..498.04 rows=200 width=8) (actual time=15.971..17.263 rows=21826 loops=1)
               Output: last_semester_per_student.student_id
               ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
         ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

As can be seen, the CTE is computed using an Index Only scan, then grouped with
a HashAggregate. After this, there could be duplicate rows (each having a
different course_id) so the outer HashAggregate ensures that the CTE output
contains only unique rows, to satisfy the DISTINCT.

Now the surprise is in the second half of the plan, inside the Nested Loop. The
CTE is already known to return unique rows, and is expected to return 22046 of
them. I believe the HashAggregate around the CTE Scan is intended to make the
rows unique again; but since the CTE rows were already known to be unique, this
HashAggregate is unnecessary. In fact, you can see that the CTE Scan is
expected to produce 22046 rows (quite close to the actual number of 21826),
while the HashAggregate is expected to reduce this to just 200 rows - in
reality it does not reduce the row count at all, and the query planner could
know this. The underestimate makes the Nested Loop look much cheaper than it
really is, since most of its cost comes from the number of iterations of the
outer loop, so the wrong plan may be chosen.
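
As a sanity check that the extra HashAggregate removes nothing, one can verify directly that the CTE output is already unique per (student_id, semester_id). A sketch (with the CTE's MAX column explicitly aliased so it can be referenced):

```sql
-- Returns no rows: every (student_id, semester_id) pair in the CTE output
-- is already unique, so re-aggregating it is a no-op.
WITH last_semester_per_student AS MATERIALIZED (
    SELECT DISTINCT student_id, MAX(semester_id) AS semester_id
    FROM student_courses
    WHERE semester_id > 150
    GROUP BY student_id
)
SELECT student_id, semester_id, count(*)
FROM last_semester_per_student
GROUP BY student_id, semester_id
HAVING count(*) > 1;
```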

Should the correct plan not be simply this?

   ->  Nested Loop
         ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
         ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

I am not certain, but the code generating this path appears to be
create_unique_path in pathnode.c, which already checks whether its child node
yields unique rows, but only supports plain relations and subqueries in this
check, not CTEs. For a CTE it therefore falls back to a naive default estimate
of 200 distinct rows, which here is off by a factor of roughly 100.

Am I correct in thinking this code would be improved by also supporting CTE
subqueries, propagating the uniqueness information from the node producing the
CTE result (in this case, the outer HashAggregate)?

Would a change of this nature be accepted upstream, if I were to put some time
and effort into it?

The exact schema, data and query are available at
https://sjorsgielen.nl/cte-uniqueness.tgz.

Thank you,
Sjors Gielen

> On 31 May 2024, at 22:29, Sjors Gielen <sjors@sjorsgielen.nl> wrote:
>
> Dear all,
>
> FYI - I have just tried the below on PostgreSQL 17-beta1 and the issue still occurs.
>
> To be clear - this is not a regression. This appears to be a query planner bug (?) which reproduces in Postgres 15, 16 and now 17-beta1.
>
> Best,
> Sjors
>
>> On 26 Feb 2024, at 14:11, PG Bug reporting form <noreply@postgresql.org> wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference:      18365
>> Logged by:          Sjors Gielen
>> Email address:
>> PostgreSQL version: 16.2
>> Operating system:   Linux (native amd64 and Docker for Mac arm64)
>> Description:
>>
>> Dear all,
>>
>> I have run into an issue where a query with a CTE performs a sequential scan
>> on a large table (42M rows, ~1min on our prod), while explicitly
>> materializing the CTE simply performs an index only scan (~2s). When I set
>> `enable_seqscan=off` and compare the costs, it turns out the query planner
>> grossly overestimates the cost of a Nested Loop compared to the (as far as I
>> can tell) exact same Nested Loop when the CTE is materialized. I know that
>> the query planner acts on heuristics, so this might not be considered a bug,
>> but the cost values are so wildly different for what should be essentially
>> the same operation, that it might warrant further investigation.
>>
>> I can reproduce the issue on PostgreSQL 15.2, 15.6, 16.2 and 17devel as of
>> 20240223.1636.gd360e3c.
>>
>> I have reported the issue, the schema, the query and the query plan outputs at
>> <https://dba.stackexchange.com/questions/335570/why-is-postgresql-performing-a-sequential-scan-except-when-my-cte-is-materializ>.
>> Also, in order to facilitate reproduction, I have uploaded two reproduction
>> scripts and a 1 GB sample of my dataset which still reproduces the issue at
>> <https://sjorsgielen.nl/psql-bug-report.tar.gz>. `bring-up.sh` runs a given
>> version of the PostgreSQL Docker image listening on local port 15432.
>> `run-test.sh` creates the `testing` schema, creates the DDL, and imports the
>> data from data.sql.bz2. It should be easy to run `run-test.sh` against any
>> Postgres server by changing the DSN at the top of the script. The `testing`
>> schema is expected not to exist.
>>
>> By running `./bring-up.sh 16.2 && ./run-test.sh`, some minutes later, you
>> should see the output of three `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)`. I
>> would expect the cost of the Nested Loop in the first case (materialized
>> view), which is `2246..76657` (around 75k), to be more or less the same as
>> that of the Nested Loop in the third case (sequential scan disabled), which
>> is 230310..797581 (~570k). Because this cost is much higher, even surpassing
>> the Hash Join cost of ~290k, PostgreSQL decides to use a sequential scan in
>> the second case (non-materialized view) which ends up causing a slowdown of
>> ~6.5 times.
>>
>> Is this a bug?
>>
>> Thank you,
>> Sjors Gielen
>>
>



