Re: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Showing applied extended statistics in explain Part 2
Date
Msg-id c6668d68-dc2b-4093-a912-d5412672d678@gmail.com
Whole thread Raw
In response to Re: Showing applied extended statistics in explain Part 2  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Showing applied extended statistics in explain Part 2
List pgsql-hackers
On 8/2/2025 20:50, Tomas Vondra wrote:
> 
> 
> On 1/24/25 11:17, Andrei Lepikhov wrote:
>> On 11/1/24 12:22, Tatsuro Yamada wrote:
>> I often use SQL Server to compare execution plans generated by
>> PostgreSQL, and I appreciate how they display the usage of extended
>> statistics. They clearly identify which statistics were utilized during
>> query planning and include them in the summary section of query plan.
>>
> I admit not knowing what exactly SQL Server shows in the explain, but it
> seems helpful to know which part of the plan used which statistics, no?
I wonder if you meant clauses, not a part of the plan. Clauses may be 
used on different levels of the join tree, but they are estimated once.
I meant we may just refer to the statistic used by its name in the 
node's explanation if this node contains estimated clause. I use 
multiple extended statistics. Sometimes, they intersect, and it is hard 
to say which one was used for a specific clause.
> Imagine we only knew an index was used, but not which node used it and
> for what keys. That would be a bit ... useless.
Sure, but I am suspicious about the strong necessity to show a specific 
clause estimated by the statistic. It is almost obvious because of the 
simple choosing algorithm.
> 
> Or what info does the SQL server include in the plan, exactly? Can you
> share an example?
I wouldn't say SQL Server designed it ideally. In an XML file, you can 
find something like this:

<StatisticsInfo LastUpdate="2024-09-09T21:55:04.43" 
ModificationCount="0" SamplingPercent="17.9892" 
Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]" 
Schema="[dbo]" Database="[DB]"></StatisticsInfo>

The main profit here - you see all the stats involved in estimations 
(and their state), even if final plan doesn't contain estimated stuff at 
all.
> 
>> I find this method much easier to implement, as it allows us to see any
>> usage points - remember that `estimate_num_groups` may be applied in
>> multiple places and may not always correspond to a node clause.
>>
> 
> I may be wrong, but I don't quite see why would that be significantly
> easier to implement. You still need to track the information somewhere,
> and it probably needs for individual Path nodes. Because while building
> the plan you don't know which paths will get selected.
In my mind, gathering extended statistics usage data may be implemented 
like it is already done for many parameters in the stat collector. It 
may help identify whether statistics are used frequently or not.
> 
> Although, maybe the selectivity estimation is sufficiently independent
> from the exact paths? In which case we might track it at the plan level.
Not only that argument. Usually, when I request a DBMS about an EXPLAIN, 
my typical desire is to realise which plan we have and why the optimiser 
has skipped another, maybe even more optimal, plan. I wonder if people 
frequently have the same intention.
That means you will never see clauses (and their estimations) that were 
kicked off the plan - ppi_clauses of a parameterised path, for example. 
And it may hide the real mechanic that caused skipping the better plan.
Mentioning all statistics involved in the summary may reveal such info.

> 
> The point about estimate_num_groups is good - I think there will be more
> cases where linking the extended statistics to a clause will be hard.
> But maybe let's not block the whole patch because of that?
Sure, if the current code doesn't block further improvements of showing 
all statistics involved in the query planning process.

I just wanted to emphasize the key idea: quite frequently we don't see 
in explain stuff that triggered suboptimal plan, because it was 
overestimated and excluded: an index, join, clause ... .

Anyway, thanks for answers, I will discover the code more.

-- 
regards, Andrei Lepikhov



pgsql-hackers by date:

Previous
From: Richard Guo
Date:
Subject: Re: Fix outdated code comments in nodeAgg.c
Next
From: Amit Kapila
Date:
Subject: Re: Conflict detection for update_deleted in logical replication