Thread: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18365
Logged by:          Sjors Gielen
Email address:      postgresql@sjorsgielen.nl
PostgreSQL version: 16.2
Operating system:   Linux (native amd64 and Docker for Mac arm64)
Description:

Dear all,

I have run into an issue where a query with a CTE performs a sequential scan
on a large table (42M rows, ~1min on our prod), while explicitly
materializing the CTE simply performs an index only scan (~2s). When I set
`enable_seqscan=off` and compare the costs, it turns out the query planner
grossly overestimates the cost of a Nested Loop compared to the (as far as I
can tell) exact same Nested Loop when the CTE is materialized. I know that
the query planner acts on heuristics, so this might not be considered a bug,
but the cost values are so wildly different for what should be essentially
the same operation, that it might warrant further investigation.

I can reproduce the issue on PostgreSQL 15.2, 15.6, 16.2 and 17devel as of
20240223.1636.gd360e3c.

I have reported the issue, the schema, the query and the query plan outputs
at
<https://dba.stackexchange.com/questions/335570/why-is-postgresql-performing-a-sequential-scan-except-when-my-cte-is-materializ>.
Also, in order to facilitate reproduction, I have uploaded two reproduction
scripts and a 1 GB sample of my dataset which still reproduces the issue at
<https://sjorsgielen.nl/psql-bug-report.tar.gz>. `bring-up.sh` runs a given
version of the PostgreSQL Docker image listening on local port 15432.
`run-test.sh` creates the `testing` schema, creates the DDL, and imports the
data from `data.sql.bz2`. It should be easy to run `run-test.sh` against any
Postgres server by changing the DSN at the top of the script. The `testing`
schema is expected not to exist.

After running `./bring-up.sh 16.2 && ./run-test.sh`, you should some minutes
later see the output of three `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)` runs. I
would expect the cost of the Nested Loop in the first case (materialized
CTE), which is `2246..76657` (around 75k), to be more or less the same as
that of the Nested Loop in the third case (sequential scan disabled), which
is `230310..797581` (~570k). Because this cost is much higher, even surpassing
the Hash Join cost of ~290k, PostgreSQL decides to use a sequential scan in
the second case (non-materialized CTE), which ends up causing a ~6.5x
slowdown.
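
Since the full query lives only in the linked report, here is a hedged sketch
of the general shape of the two variants being compared (table and column
names are placeholders for illustration, not the actual schema):

```sql
-- Variant 1: explicitly materialized CTE; the planner picks an index only scan.
WITH recent AS MATERIALIZED (
    SELECT product_id, MAX(date) AS date
    FROM stock_history
    GROUP BY product_id
)
SELECT count(*)
FROM stock_history
WHERE (product_id, date) IN (SELECT product_id, date FROM recent);

-- Variant 2: same query without MATERIALIZED (since v12 a single-reference
-- CTE is inlined by default); the planner switches to a sequential scan.
WITH recent AS (
    SELECT product_id, MAX(date) AS date
    FROM stock_history
    GROUP BY product_id
)
SELECT count(*)
FROM stock_history
WHERE (product_id, date) IN (SELECT product_id, date FROM recent);
```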

Is this a bug?

Thank you,
Sjors Gielen


Dear all,

FYI - I have just tried the below on PostgreSQL 17-beta1 and the issue still occurs.

To be clear - this is not a regression. This appears to be a query planner bug (?) which reproduces in Postgres 15, 16,
and now 17-beta1.

Best,
Sjors

> Op 26 feb 2024, om 14:11 heeft PG Bug reporting form <noreply@postgresql.org> het volgende geschreven:
>
> The following bug has been logged on the website:
>
> Bug reference:      18365
> Logged by:          Sjors Gielen
> Email address:
> PostgreSQL version: 16.2
> Operating system:   Linux (native amd64 and Docker for Mac arm64)
> Description:
>
> Dear all,
>
> I have run into an issue where a query with a CTE performs a sequential scan
> on a large table (42M rows, ~1min on our prod), while explicitly
> materializing the CTE simply performs an index only scan (~2s). When I set
> `enable_seqscan=off` and compare the costs, it turns out the query planner
> grossly overestimates the cost of a Nested Loop compared to the (as far as I
> can tell) exact same Nested Loop when the CTE is materialized. I know that
> the query planner acts on heuristics, so this might not be considered a bug,
> but the cost values are so wildly different for what should be essentially
> the same operation, that it might warrant further investigation.
>
> I can reproduce the issue on PostgreSQL 15.2, 15.6, 16.2 and 17devel as of
> 20240223.1636.gd360e3c.
>
> I have reported the issue, the schema, the query and the query plan outputs
> at
> <https://dba.stackexchange.com/questions/335570/why-is-postgresql-performing-a-sequential-scan-except-when-my-cte-is-materializ>.
> Also, in order to facilitate reproduction, I have uploaded two reproduction
> scripts and a 1 GB sample of my dataset which still reproduces the issue at
> <https://sjorsgielen.nl/psql-bug-report.tar.gz>. `bring-up.sh` runs a given
> version of the PostgreSQL Docker image listening on local port 15432.
> `run-test.sh` creates the `testing` schema, creates the DDL, and imports the
> data from data.sql.bz2. It should be easy to run `run-test.sh` against any
> Postgres server by changing the DSN at the top of the script. The `testing`
> schema is expected not to exist.
>
> By running `./bring-up.sh 16.2 && ./run-test.sh`, some minutes later, you
> should see the output of three `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)`. I
> would expect the cost of the Nested Loop in the first case (materialized
> view), which is `2246..76657` (around 75k), to be more or less the same as
> that of the Nested Loop in the third case (sequential scan disabled), which
> is 230310..797581 (~570k). Because this cost is much higher, even surpassing
> the Hash Join cost of ~290k, PostgreSQL decides to use a sequential scan in
> the second case (non-materialized view) which ends up causing a slowdown of
> ~6.5 times.
>
> Is this a bug?
>
> Thank you,
> Sjors Gielen
>




Dear all,

I am not sure I am using the correct list for this - please let me know if I
should move this to pgsql-hackers or pgsql-performance.

While investigating the below issue further, I found another potential issue
regarding CTE materialization: uniqueness information does not seem to be
propagated from the result of the CTE to the CTE Scan node. This not only
causes unnecessary steps in the query plan; the cost of those steps is also
dramatically underestimated, so an incorrect query plan may be chosen. I will
illustrate with an example schema and query.

I am working with the following table, which stores the courses taken by each
student and their grades.

CREATE TABLE student_courses (
    id BIGINT NOT NULL,
    course_id BIGINT NOT NULL,
    student_id BIGINT NOT NULL,
    semester_id BIGINT NOT NULL,
    grade BIGINT
);

ALTER TABLE student_courses
    ADD CONSTRAINT pkey PRIMARY KEY (id),
    ADD CONSTRAINT uq UNIQUE (semester_id, student_id, course_id);
CREATE INDEX idx on student_courses using btree (semester_id, student_id);
CREATE INDEX idx2 on student_courses using btree (student_id);

Now, the issue shows up when we use a materialized CTE to select a set of rows
from the table, ensuring that this selection contains only unique rows, and
then match those rows back against the same table. In this example, we want a
list of all grades students received for the courses they took in their *last*
semester (assuming that semester IDs are in chronological order). To speed up
the query a bit, we include a minimum semester ID.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH last_semester_per_student AS MATERIALIZED (
    SELECT DISTINCT student_id, MAX(semester_id) AS semester_id
    FROM student_courses
    WHERE semester_id > 150
    GROUP BY student_id
)
SELECT count(*)
FROM student_courses
WHERE (student_id, semester_id) IN (
    SELECT student_id, semester_id
    FROM last_semester_per_student
);

This is a shortened version of the query plan:

 Aggregate  (cost=3560.15..3560.16 rows=1 width=8) (actual time=42.937..42.943 rows=1 loops=1)
   CTE last_semester_per_student
     ->  HashAggregate  (cost=1921.54..2142.00 rows=22046 width=16) (actual time=9.945..11.067 rows=21826 loops=1)
           Output: student_courses_1.student_id, (max(student_courses_1.semester_id))
           Group Key: student_courses_1.student_id, max(student_courses_1.semester_id)
           ->  HashAggregate  (cost=1590.85..1811.31 rows=22046 width=16) (actual time=6.165..7.504 rows=21826 loops=1)
                 Output: student_courses_1.student_id, max(student_courses_1.semester_id)
                 Group Key: student_courses_1.student_id
                  ->  Index Only Scan using idx on public.student_courses student_courses_1  (cost=0.42..1393.87 rows=39397 width=16) (actual time=0.035..1.849 rows=39729 loops=1)
   ->  Nested Loop  (cost=496.33..1218.96 rows=79678 width=0) (actual time=16.022..39.995 rows=121968 loops=1)
         ->  HashAggregate  (cost=496.04..498.04 rows=200 width=8) (actual time=15.971..17.263 rows=21826 loops=1)
               Output: last_semester_per_student.student_id
                ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
          ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

As can be seen, the CTE is computed using an Index Only Scan, then grouped with
a HashAggregate. After this, there could be duplicate rows (each coming from a
different course_id), so the outer HashAggregate ensures that the CTE output
contains only unique rows, satisfying the DISTINCT.

Now the surprise is in the second half of the plan, within the Nested Loop. The
CTE is already known to return unique rows, and is expected to return 22046 of
them. I believe the HashAggregate around it is intended to make the rows unique
again; but since the CTE rows are already known to be unique, this
HashAggregate is unnecessary. In fact, you can see that the CTE Scan is
expected to produce 22046 rows (quite close to the actual number of 21826), yet
the HashAggregate is expected to reduce this to 200 rows (in fact, it does not
reduce it at all, and the query planner could know this). This underestimation
makes the cost of the Nested Loop much lower than it should be, since most of
the Nested Loop's cost comes from the number of iterations of the outer loop.
So this could cause the wrong plan to be chosen.

Should the correct plan not be simply this?

   ->  Nested Loop
         ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
         ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

I am not certain, but it appears that the code generating this path might be
`create_unique_path` in `pathnode.c`, which already checks whether its child
node yields unique rows, but only supports relations and subqueries for this
check, not CTEs. Therefore, it falls back to a naive estimate which is off by
a factor of 100x.

Am I correct in thinking this code could be improved by supporting CTE
subqueries as well, propagating the uniqueness information from the
CTE-producing node (in this case, the outer HashAggregate)?

Would a change of this nature be accepted upstream, if I were to put some time
and effort into it?

The exact schema, data and query are available at
https://sjorsgielen.nl/cte-uniqueness.tgz.

Thank you,
Sjors Gielen

> Op 31 mei 2024, om 22:29 heeft Sjors Gielen <sjors@sjorsgielen.nl> het volgende geschreven:
>
> Dear all,
>
> FYI - I have just tried the below on PostgreSQL 17-beta1 and the issue still occurs.
>
> To be clear - this is not a regression. This appears to be a query planner bug (?) which reproduces in Postgres 15, 16 and now 17-beta1.
>
> Best,
> Sjors
>
>> Op 26 feb 2024, om 14:11 heeft PG Bug reporting form <noreply@postgresql.org> het volgende geschreven:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference:      18365
>> Logged by:          Sjors Gielen
>> Email address:
>> PostgreSQL version: 16.2
>> Operating system:   Linux (native amd64 and Docker for Mac arm64)
>> Description:
>>
>> Dear all,
>>
>> I have run into an issue where a query with a CTE performs a sequential scan
>> on a large table (42M rows, ~1min on our prod), while explicitly
>> materializing the CTE simply performs an index only scan (~2s). When I set
>> `enable_seqscan=off` and compare the costs, it turns out the query planner
>> grossly overestimates the cost of a Nested Loop compared to the (as far as I
>> can tell) exact same Nested Loop when the CTE is materialized. I know that
>> the query planner acts on heuristics, so this might not be considered a bug,
>> but the cost values are so wildly different for what should be essentially
>> the same operation, that it might warrant further investigation.
>>
>> I can reproduce the issue on PostgreSQL 15.2, 15.6, 16.2 and 17devel as of
>> 20240223.1636.gd360e3c.
>>
>> I have reported the issue, the schema, the query and the query plan outputs
>> at
>> <https://dba.stackexchange.com/questions/335570/why-is-postgresql-performing-a-sequential-scan-except-when-my-cte-is-materializ>.
>> Also, in order to facilitate reproduction, I have uploaded two reproduction
>> scripts and a 1 GB sample of my dataset which still reproduces the issue at
>> <https://sjorsgielen.nl/psql-bug-report.tar.gz>. `bring-up.sh` runs a given
>> version of the PostgreSQL Docker image listening on local port 15432.
>> `run-test.sh` creates the `testing` schema, creates the DDL, and imports the
>> data from data.sql.bz2. It should be easy to run `run-test.sh` against any
>> Postgres server by changing the DSN at the top of the script. The `testing`
>> schema is expected not to exist.
>>
>> By running `./bring-up.sh 16.2 && ./run-test.sh`, some minutes later, you
>> should see the output of three `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)`. I
>> would expect the cost of the Nested Loop in the first case (materialized
>> view), which is `2246..76657` (around 75k), to be more or less the same as
>> that of the Nested Loop in the third case (sequential scan disabled), which
>> is 230310..797581 (~570k). Because this cost is much higher, even surpassing
>> the Hash Join cost of ~290k, PostgreSQL decides to use a sequential scan in
>> the second case (non-materialized view) which ends up causing a slowdown of
>> ~6.5 times.
>>
>> Is this a bug?
>>
>> Thank you,
>> Sjors Gielen
>>
>




On Tue, 4 Jun 2024 at 02:50, Sjors Gielen <sjors@sjorsgielen.nl> wrote:
> I am not sure I am using the correct list for this - please let me know if I
> should move this to pgsql-hackers or pgsql-performance.

This is the correct place to report bugs.

> Should the correct plan not be simply this?
>
>    ->  Nested Loop
>          ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
>          ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

You're right that the query planner injected a uniquification step
into the plan unnecessarily.  It did this to allow the semi-join to be
transformed into a join so it could swap the join order.

We won't class this issue as a bug, as there is no code in PostgreSQL
that is malfunctioning. We simply don't have anything in place in the
current versions of PostgreSQL to detect which keys a Path is unique
on, which the planner could use to skip performing the uniquification
step.
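
The transformation described here can be illustrated with the schema from the
earlier message, treating the CTE `last_semester_per_student` as a relation.
This is a SQL-level sketch of what the planner reasons about internally, not
what it literally executes:

```sql
-- Semi-join form: each student_courses row is emitted at most once,
-- so the inner side must stay on the inside of the join.
SELECT count(*)
FROM student_courses sc
WHERE (sc.student_id, sc.semester_id) IN
      (SELECT student_id, semester_id FROM last_semester_per_student);

-- Join form after unique-ifying the inner side (the step the planner
-- injected as a HashAggregate); with a plain join, either input may
-- now be placed on the outer side of the Nested Loop:
SELECT count(*)
FROM student_courses sc
JOIN (SELECT DISTINCT student_id, semester_id
      FROM last_semester_per_student) u
  ON sc.student_id = u.student_id
 AND sc.semester_id = u.semester_id;
```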

> I am not certain, but it appears that the code generating this path might be in
> pathnode.c create_unique_path, which already checks whether its child node
> yields unique rows, but only supports relations or subqueries for this check,
> not CTEs. Therefore, it falls back to a naive estimation which is off by a
> factor of 100x.

The row estimation part has been improved in the yet-to-be-released
PostgreSQL 17.  The improvement won't be backpatched into earlier
versions.

> Am I correct in thinking this code would be improved by supporting CTE
> subqueries, as well, transporting the uniqueness information from the
> CTE-producing node (in this case, the outer HashAggregate)?

I'm unsure what you mean by the first part since CTEs already support
that. For the "transporting the uniqueness information" part, yes, if
the planner tracked that it could produce a plan that could be
executed more efficiently.

> Would a change of this nature be accepted upstream, if I were to put some time
> and effort into it?

There has been some effort in [1] to improve this. I've not had enough
spare time to look into how that's progressing.  If you're keen to see
PostgreSQL produce better plans for this type of thing, you're very
welcome to catch up with what's been done so far and assess how you
can best assist with progressing this further. Showing interest on the
thread and testing the latest patch is likely a very good start.

David

[1] https://postgr.es/m/7mlamswjp81p.fsf@e18c07352.et15sqa



On Tue, 27 Feb 2024 at 02:44, PG Bug reporting form
<noreply@postgresql.org> wrote:
> I have run into an issue where a query with a CTE performs a sequential scan
> on a large table (42M rows, ~1min on our prod), while explicitly
> materializing the CTE simply performs an index only scan (~2s). When I set
> `enable_seqscan=off` and compare the costs, it turns out the query planner
> grossly overestimates the cost of a Nested Loop compared to the (as far as I
> can tell) exact same Nested Loop when the CTE is materialized. I know that
> the query planner acts on heuristics, so this might not be considered a bug,
> but the cost values are so wildly different for what should be essentially
> the same operation, that it might warrant further investigation.

> Is this a bug?

I don't see any bugs. What seems to be going on is that the materialized
CTE plan underestimates the number of rows remaining after the CTE is made
distinct (which is done so the planner can use a join rather than a
semi-join).

That's seen in:

->  HashAggregate  (cost=2576.28..2679.33 rows=10305 width=16) (actual time=765.793..893.761 rows=166060 loops=1)

Due to that row underestimation, the planner thinks a parameterized
nested loop is the best way to join the two relations as it thinks the
stock_history_date_product_id_idx index only needs to be looked up
10305 times rather than 166060 times.  With the non-materialized
version, the planner estimates 103051.  That causes it to think the
Nested Loop -> index scan on stock_history_date_product_id_idx is too
expensive and that a hash join with a seq scan is cheaper.  As it turns
out, looking up the index *is* faster, even when that's done 166060 times.

The two parameters that drive the planner's decision on this are
random_page_cost, (you might want to consider lowering that) and
effective_cache_size.  A rough guideline for effective_cache_size is
75% of the machine's RAM. However, much depends on your
shared_buffers setting and what other things run concurrently
on the machine.  Some people have found that lowering random_page_cost
as low as 1.1 helps. The default is 4.0, which has remained unchanged
since the HDD days.  For SSDs, it's often too large.
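
As a concrete illustration of that advice (the values below are examples to
adapt, not recommendations for this specific workload):

```sql
-- Inspect the current values:
SHOW random_page_cost;
SHOW effective_cache_size;

-- Test per-session first, without persisting anything:
SET random_page_cost = 1.1;

-- Then, if the plans improve, persist cluster-wide (requires superuser)
-- and reload the configuration. A machine with 32 GB RAM might use ~24 GB:
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_cache_size = '24GB';
SELECT pg_reload_conf();
```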

I've attached the EXPLAINs I trimmed down and compared to reach this conclusion.

David
