Thread: Re: Simple query with Planner underestimating rows.

Re: Simple query with Planner underestimating rows.

From
Andrei Lepikhov
Date:
On 1/29/25 02:29, Felipe López Montes wrote:
> Hi all,
> 
> I am using PostgreSQL 17.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 
> 12.4.0, 64-bit.
> 
> I have been facing this issue past weeks and I don't know what else to 
> try, I have a very simple query where the planner is underestimating the 
> number of rows. So instead of estimating the ~ actual number of rows 
> (5120), it estimates only 1.
Your query represents a typical PostgreSQL issue: Here, I see two 
'almost' unique columns: participant_identifier and 
programme_identifier. The result is that the join cardinality on a 
unique column is predicted to be close to the size of the smaller relation.
But in the second join, you have a join by two columns with low selectivity:

ON t3.participant_identifier = t1.participant_identifier
   AND t3.programme_identifier = t2.programme_identifier;

Postgres doesn't gather dependency statistics on two or more columns and 
just multiplies the low selectivities of these clauses, reducing the 
number of rows to a possible minimum - 1.

What you can do? Right now, maybe only pg_hint_plan may help in such a 
situation. However, if you provide some test cases, we may check the 
forgotten feature [1], which enables extended statistics in join clause 
estimations and may push development efforts in that direction.

[1] using extended statistics to improve join estimates
https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com

-- 
regards, Andrei Lepikhov



Re: Simple query with Planner underestimating rows.

From
Andrei Lepikhov
Date:
On 1/29/25 15:32, Felipe López Montes wrote:
> Hi,
> 
> Thank you for your reply,
> 
> I have already tried creating such statistics on t3.programme_identifier 
> and t3.participant_identifier and the plan is the same, however I cannot 
> create them for the right part of the join as they are from different 
> tables (t1 and t2)
Of course, because for now, a join clause can't be estimated by extended 
statistics. It applies only to a scan filter (clause referencing only 
one relation).
I have meant that with the development patch [1] applied, you may create 
two statistics on t3(participant_identifier,programme_identifier) and 
t2(participant_identifier,programme_identifier). These statistics would 
then be used to estimate the join clause and may resolve the problem.

[1] using extended statistics to improve join estimates
https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com

-- 
regards, Andrei Lepikhov