Thread: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 18365
Logged by: Sjors Gielen
Email address: postgresql@sjorsgielen.nl
PostgreSQL version: 16.2
Operating system: Linux (native amd64 and Docker for Mac arm64)
Description:

Dear all,

I have run into an issue where a query with a CTE performs a sequential scan on a large table (42M rows, ~1 min on our prod), while explicitly materializing the CTE simply performs an index-only scan (~2 s). When I set `enable_seqscan=off` and compare the costs, it turns out the query planner grossly overestimates the cost of a Nested Loop compared to the (as far as I can tell) exact same Nested Loop when the CTE is materialized. I know that the query planner acts on heuristics, so this might not be considered a bug, but the cost values are so wildly different for what should be essentially the same operation that it might warrant further investigation.

I can reproduce the issue on PostgreSQL 15.2, 15.6, 16.2 and 17devel as of 20240223.1636.gd360e3c.

I have reported the issue, the schema, the query and the query plan outputs at <https://dba.stackexchange.com/questions/335570/why-is-postgresql-performing-a-sequential-scan-except-when-my-cte-is-materializ>. Also, in order to facilitate reproduction, I have uploaded two reproduction scripts and a 1 GB sample of my dataset which still reproduces the issue at <https://sjorsgielen.nl/psql-bug-report.tar.gz>. `bring-up.sh` runs a given version of the PostgreSQL Docker image listening on local port 15432. `run-test.sh` creates the `testing` schema, creates the DDL, and imports the data from data.sql.bz2. It should be easy to run `run-test.sh` against any Postgres server by changing the DSN at the top of the script. The `testing` schema is expected not to exist.

By running `./bring-up.sh 16.2 && ./run-test.sh`, some minutes later you should see the output of three `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)` statements. I would expect the cost of the Nested Loop in the first case (materialized CTE), which is `2246..76657` (around 75k), to be more or less the same as that of the Nested Loop in the third case (sequential scan disabled), which is `230310..797581` (~570k). Because this cost is much higher, even surpassing the Hash Join cost of ~290k, PostgreSQL decides to use a sequential scan in the second case (non-materialized CTE), which ends up causing a slowdown of ~6.5x.

Is this a bug?

Thank you,
Sjors Gielen
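As an illustrative sketch (not part of the report), the three comparisons printed by run-test.sh have roughly the following shape. The table, index and CTE names below are hypothetical stand-ins; the real schema, query and data are only in the linked archive.

-- Hypothetical stand-in schema; the real table has ~42M rows.
CREATE TABLE history (id bigint PRIMARY KEY, grp bigint, val bigint);
CREATE INDEX ON history (grp);

-- Case 1: explicitly materialized CTE; per the report, the planner
-- picked a Nested Loop over an index-only scan (cost 2246..76657).
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH g AS MATERIALIZED (SELECT DISTINCT grp FROM history)
SELECT count(*) FROM history WHERE grp IN (SELECT grp FROM g);

-- Case 2: default (inlinable) CTE; the planner picked a Hash Join
-- with a sequential scan instead.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH g AS (SELECT DISTINCT grp FROM history)
SELECT count(*) FROM history WHERE grp IN (SELECT grp FROM g);

-- Case 3: same query with sequential scans discouraged, exposing the
-- much higher Nested Loop estimate (230310..797581 in the report).
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH g AS (SELECT DISTINCT grp FROM history)
SELECT count(*) FROM history WHERE grp IN (SELECT grp FROM g);
RESET enable_seqscan;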
Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
From
Sjors Gielen
Date:
Dear all,

FYI - I have just tried the below on PostgreSQL 17-beta1 and the issue still occurs.

To be clear - this is not a regression. This appears to be a query planner bug (?) which reproduces in Postgres 15, 16 and now 17-beta1.

Best,
Sjors

> On 26 Feb 2024, at 14:11, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 18365
>
> I have run into an issue where a query with a CTE performs a sequential scan
> on a large table (42M rows, ~1min on our prod), while explicitly
> materializing the CTE simply performs an index only scan (~2s). [...]
Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
From
Sjors Gielen
Date:
Dear all,

I am not sure I am using the correct list for this - please let me know if I should move this to pgsql-hackers or pgsql-performance.

While investigating the below issue further, I found another potential issue regarding CTE materialization: uniqueness information does not seem to be transported from the result of the CTE to the CTE Scan node. This not only causes unnecessary steps in the query plan, but the cost of such steps is also dramatically underestimated, so an incorrect query plan may be chosen. I will illustrate with an example schema and query.

I am working with the following table, storing courses taken by a student and their grade:

CREATE TABLE student_courses (
    id BIGINT NOT NULL,
    course_id BIGINT NOT NULL,
    student_id BIGINT NOT NULL,
    semester_id BIGINT NOT NULL,
    grade BIGINT
);
ALTER TABLE student_courses
    ADD CONSTRAINT pkey PRIMARY KEY (id),
    ADD CONSTRAINT uq UNIQUE (semester_id, student_id, course_id);
CREATE INDEX idx ON student_courses USING btree (semester_id, student_id);
CREATE INDEX idx2 ON student_courses USING btree (student_id);

Now, the issue shows up when we use a materialized CTE to make a selection of rows from the table, ensuring that this selection contains only unique rows, and then use those rows to match back against the same table. In this example, we want a list of all grades students received for the courses they took in their *last* semester (assuming that semester IDs are in chronological order). To speed up the query a bit, we include a minimum semester ID.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH last_semester_per_student AS MATERIALIZED (
    SELECT DISTINCT student_id, MAX(semester_id)
    FROM student_courses
    WHERE semester_id > 150
    GROUP BY student_id
)
SELECT count(*)
FROM student_courses
WHERE (student_id, semester_id) IN (
    SELECT student_id, semester_id FROM last_semester_per_student
);

This is a shortened version of the query plan:

Aggregate  (cost=3560.15..3560.16 rows=1 width=8) (actual time=42.937..42.943 rows=1 loops=1)
  CTE last_semester_per_student
    ->  HashAggregate  (cost=1921.54..2142.00 rows=22046 width=16) (actual time=9.945..11.067 rows=21826 loops=1)
          Output: student_courses_1.student_id, (max(student_courses_1.semester_id))
          Group Key: student_courses_1.student_id, max(student_courses_1.semester_id)
          ->  HashAggregate  (cost=1590.85..1811.31 rows=22046 width=16) (actual time=6.165..7.504 rows=21826 loops=1)
                Output: student_courses_1.student_id, max(student_courses_1.semester_id)
                Group Key: student_courses_1.student_id
                ->  Index Only Scan using idx on public.student_courses student_courses_1  (cost=0.42..1393.87 rows=39397 width=16) (actual time=0.035..1.849 rows=39729 loops=1)
  ->  Nested Loop  (cost=496.33..1218.96 rows=79678 width=0) (actual time=16.022..39.995 rows=121968 loops=1)
        ->  HashAggregate  (cost=496.04..498.04 rows=200 width=8) (actual time=15.971..17.263 rows=21826 loops=1)
              Output: last_semester_per_student.student_id
              ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
        ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

As can be seen, the CTE is computed using an Index Only Scan, then grouped with a HashAggregate. After this, there could be duplicate rows (each having a different course_id), so the outer HashAggregate ensures that the CTE output contains only unique rows, to satisfy the DISTINCT. Now the surprise is in the second half of the plan, within the Nested Loop.
The CTE is already known to return unique rows, and is expected to return 22046 of them. I believe that the HashAggregate around it is intended to make the rows unique again. But since the CTE rows were already known to be unique, I believe this HashAggregate is unnecessary. In fact, you can see that the CTE Scan is expected to produce 22046 rows (quite close to the actual number of 21826), but the HashAggregate is expected to reduce this to 200 rows (in fact, it does not reduce it at all, and the query planner could know this). This underestimation causes the cost of the Nested Loop to be much lower than it should be, since most of the Nested Loop cost comes from the number of iterations of the outer loop. So this could cause the wrong plans to be chosen.

Should the correct plan not be simply this?

->  Nested Loop
      ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
      ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

I am not certain, but it appears that the code generating this path might be create_unique_path in pathnode.c, which already checks whether its child node yields unique rows, but only supports relations or subqueries for this check, not CTEs. Therefore, it falls back to a naive estimation which is off by a factor of 100x.

Am I correct in thinking this code would be improved by supporting CTE subqueries as well, transporting the uniqueness information from the CTE-producing node (in this case, the outer HashAggregate)? Would a change of this nature be accepted upstream, if I were to put some time and effort into it?

The exact schema, data and query are available at https://sjorsgielen.nl/cte-uniqueness.tgz.

Thank you,
Sjors Gielen

> On 31 May 2024, at 22:29, Sjors Gielen <sjors@sjorsgielen.nl> wrote:
>
> FYI - I have just tried the below on PostgreSQL 17-beta1 and the issue still occurs.
>
> To be clear - this is not a regression. This appears to be a query planner bug (?) which reproduces in Postgres 15, 16 and now 17-beta1. [...]
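As a hedged workaround sketch (not from the thread): since GROUP BY student_id already makes the CTE rows unique, the redundant DISTINCT can be dropped and the IN semi-join rewritten as a plain join, which gives the planner no uniquification step to insert. The AS semester_id alias is added here because the MAX column would otherwise be named max; the posted query has no alias.

-- Plain-join rewrite: the CTE yields at most one row per student_id,
-- so each student_courses row matches at most once and the join is
-- equivalent to the IN form, with no uniquification step needed.
WITH last_semester_per_student AS MATERIALIZED (
    SELECT student_id, MAX(semester_id) AS semester_id
    FROM student_courses
    WHERE semester_id > 150
    GROUP BY student_id
)
SELECT count(*)
FROM student_courses sc
JOIN last_semester_per_student l
  ON sc.student_id = l.student_id
 AND sc.semester_id = l.semester_id;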
Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
From
David Rowley
Date:
On Tue, 4 Jun 2024 at 02:50, Sjors Gielen <sjors@sjorsgielen.nl> wrote:
> I am not sure I am using the correct list for this - please let me know if I
> should move this to pgsql-hackers or pgsql-performance.

This is the correct place to report bugs.

> Should the correct plan not be simply this?
>
> ->  Nested Loop
>       ->  CTE Scan on last_semester_per_student  (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157 rows=21826 loops=1)
>       ->  Index Only Scan using idx2 on public.student_courses  (cost=0.29..3.54 rows=6 width=8) (actual time=0.001..0.001 rows=6 loops=21826)

You're right that the query planner injected a uniquification step into the plan unnecessarily. It did this to allow the semi-join to be transformed into a join so it could swap the join order. We won't class this issue as a bug, as there is no code in PostgreSQL that is malfunctioning. We simply don't have anything in place in the current versions of PostgreSQL to detect which keys a Path is unique on, which the planner could use to skip performing the uniquification step.

> I am not certain, but it appears that the code generating this path might be
> create_unique_path in pathnode.c, which already checks whether its child node
> yields unique rows, but only supports relations or subqueries for this check,
> not CTEs. Therefore, it falls back to a naive estimation which is off by a
> factor of 100x.

The row estimation part has been improved in the yet-to-be-released PostgreSQL 17. The improvement won't be backpatched into earlier versions.

> Am I correct in thinking this code would be improved by supporting CTE
> subqueries as well, transporting the uniqueness information from the
> CTE-producing node (in this case, the outer HashAggregate)?

I'm unsure what you mean by the first part, since CTEs already support that. For the "transporting the uniqueness information" part, yes, if the planner tracked that it could produce a plan that could be executed more efficiently.

> Would a change of this nature be accepted upstream, if I were to put some time
> and effort into it?

There has been some effort in [1] to improve this. I've not had enough spare time to look into how that's progressing. If you're keen to see PostgreSQL produce better plans for this type of thing, you're very welcome to catch up with what's been done so far and assess how you can best assist with progressing this further. Showing interest on the thread and testing the latest patch is likely a very good start.

David

[1] https://postgr.es/m/7mlamswjp81p.fsf@e18c07352.et15sqa
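To illustrate the transform described above (a sketch, not from the thread): the planner rewrites the semi-join form (a) into a plain join against a de-duplicated inner side (b), which is what frees it to put either relation on the outer side. The lsps alias and the AS semester_id column name are assumptions for readability; the table is the student_courses example from earlier in the thread.

-- (a) Semi-join form: each student_courses row is emitted at most once
-- per match, so the inner side need not be unique.
WITH lsps AS MATERIALIZED (
    SELECT student_id, MAX(semester_id) AS semester_id
    FROM student_courses WHERE semester_id > 150 GROUP BY student_id
)
SELECT count(*) FROM student_courses sc
WHERE (sc.student_id, sc.semester_id) IN
      (SELECT student_id, semester_id FROM lsps);

-- (b) What the planner executes instead: unique-ify the inner side
-- first (the HashAggregate in the plan), then join plainly; with a
-- unique inner side the join order can be swapped freely.
WITH lsps AS MATERIALIZED (
    SELECT student_id, MAX(semester_id) AS semester_id
    FROM student_courses WHERE semester_id > 150 GROUP BY student_id
)
SELECT count(*) FROM student_courses sc
JOIN (SELECT DISTINCT student_id, semester_id FROM lsps) u
  ON (sc.student_id, sc.semester_id) = (u.student_id, u.semester_id);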
Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
From
David Rowley
Date:
On Tue, 27 Feb 2024 at 02:44, PG Bug reporting form <noreply@postgresql.org> wrote:
> I have run into an issue where a query with a CTE performs a sequential scan
> on a large table (42M rows, ~1min on our prod), while explicitly
> materializing the CTE simply performs an index only scan (~2s). [...]
>
> Is this a bug?

I don't see any bugs. What seems to be going on is that the materialized CTE underestimates the number of rows after making the CTE distinct so that the planner can use a join rather than a semi-join. That's seen in:

->  HashAggregate  (cost=2576.28..2679.33 rows=10305 width=16) (actual time=765.793..893.761 rows=166060 loops=1)

Due to that row underestimation, the planner thinks a parameterized nested loop is the best way to join the two relations, as it thinks the stock_history_date_product_id_idx index only needs to be looked up 10305 times rather than 166060 times.

With the non-materialized version, the planner estimates 103051 rows. That causes it to think the Nested Loop -> index scan on stock_history_date_product_id_idx is too expensive, and that a hash join with a seq scan is cheaper.

As it turns out, looking up the index *is* faster, even when that's done 166060 times.

The two parameters that drive the planner's decision on this are random_page_cost (you might want to consider lowering that) and effective_cache_size. A rough guideline for effective_cache_size is 75% of the machine's RAM. However, it depends a lot on your shared_buffers setting and what type of other things run concurrently on the machine. Some people have found lowering random_page_cost as low as 1.1 helps. The default is 4.0, which has remained since the HDD days. For SSDs, it's often too large.

I've attached the EXPLAINs I trimmed down and compared to reach this conclusion.

David
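A sketch of trying this tuning advice (not from the thread): the values below are illustrative only and should be validated per machine with EXPLAIN (ANALYZE, BUFFERS) before being made permanent; the 24GB figure assumes a hypothetical 32 GB box.

-- Experiment at session level first: check whether the cheaper index
-- plan is chosen once random I/O is costed closer to SSD reality.
SET random_page_cost = 1.1;        -- default 4.0 dates from spinning disks
SET effective_cache_size = '24GB'; -- ~75% of RAM on a hypothetical 32 GB box
-- ... re-run the problematic query under EXPLAIN (ANALYZE, BUFFERS) ...

-- If the plan and timings improve, persist the settings:
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_cache_size = '24GB';
SELECT pg_reload_conf();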