On Tue, 4 Jun 2024 at 02:50, Sjors Gielen <sjors@sjorsgielen.nl> wrote:
> I am not sure I am using the correct list for this - please let me know if I
> should move this to pgsql-hackers or pgsql-performance.
This is the correct place to report bugs.
> Should the correct plan not be simply this?
>
> -> Nested Loop
> -> CTE Scan on last_semester_per_student (cost=0.00..440.92 rows=22046 width=8) (actual time=9.947..13.157
rows=21826loops=1)
> -> Index Only Scan using idx2 on public.student_courses (cost=0.29..3.54 rows=6 width=8) (actual
time=0.001..0.001rows=6 loops=21826)
You're right that the query planner injected a uniquification step
into the plan unnecessarily. It did this to allow the semi-join to be
transformed into a join so it could swap the join order.
We won't class this issue as a bug as there is no code in PostgreSQL
that is malfunctioning. We simply don't have anything in place in the
current versions of PostgreSQL to detect which keys a Path is unique
on which the planner could use to skip performing the uniquification
step.
> I am not certain, but it appears that the code generating this path might be in
> pathnode.c create_unique_path, which already checks whether its child node
> yields unique rows, but only supports relations or subqueries for this check,
> not CTEs. Therefore, it falls back to a naive estimation which is off by a
> factor of 100x.
The row estimation part has been improved in the yet-to-be-released
PostgreSQL 17. The improvement won't be backpatched into earlier
versions.
> Am I correct in thinking this code would be improved by supporting CTE
> subqueries, as well, transporting the uniqueness information from the
> CTE-producing node (in this case, the outer HashAggregate)?
I'm unsure what you mean by the first part since CTEs already support
that. For the "transporting the uniqueness information" part, yes, if
the planner tracked that it could produce a plan that could be
executed more efficiently.
> Would a change of this nature be accepted upstream, if I were to put some time
> and effort into it?
There has been some effort in [1] to improve this. I've not had enough
spare time to look into how that's progressing. If you're keen to see
PostgreSQL produce better plans for this type of thing, you're very
welcome to catch up with what's been done so far and assess how you
can best assist with progressing this further. Showing interest on the
thread and testing the latest patch is likely a very good start.
David
[1] https://postgr.es/m/7mlamswjp81p.fsf@e18c07352.et15sqa