Re: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers
From | Andrei Lepikhov |
---|---|
Subject | Re: Showing applied extended statistics in explain Part 2 |
Date | |
Msg-id | c6668d68-dc2b-4093-a912-d5412672d678@gmail.com Whole thread Raw |
In response to | Re: Showing applied extended statistics in explain Part 2 (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Showing applied extended statistics in explain Part 2
|
List | pgsql-hackers |
On 8/2/2025 20:50, Tomas Vondra wrote: > > > On 1/24/25 11:17, Andrei Lepikhov wrote: >> On 11/1/24 12:22, Tatsuro Yamada wrote: >> I often use SQL Server to compare execution plans generated by >> PostgreSQL, and I appreciate how they display the usage of extended >> statistics. They clearly identify which statistics were utilized during >> query planning and include them in the summary section of query plan. >> > I admit not knowing what exactly SQL Server shows in the explain, but it > seems helpful to know which part of the plan used which statistics, no? I wonder if you meant clauses, not a part of the plan. Clauses may be used on different levels of the join tree, but they are estimated once. I meant we may just refer to the statistic used by its name in the node's explanation if this node contains estimated clause. I use multiple extended statistics. Sometimes, they intersect, and it is hard to say which one was used for a specific clause. > Imagine we only knew an index was used, but not which node used it and > for what keys. That would be a bit ... useless. Sure, but I am suspicious about the strong necessity to show a specific clause estimated by the statistic. It is almost obvious because of the simple choosing algorithm. > > Or what info does the SQL server include in the plan, exactly? Can you > share an example? I wouldn't say SQL Server designed it ideally. In an XML file, you can find something like this: <StatisticsInfo LastUpdate="2024-09-09T21:55:04.43" ModificationCount="0" SamplingPercent="17.9892" Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]" Schema="[dbo]" Database="[DB]"></StatisticsInfo> The main profit here - you see all the stats involved in estimations (and their state), even if final plan doesn't contain estimated stuff at all. > >> I find this method much easier to implement, as it allows us to see any >> usage points - remember that `estimate_num_groups` may be applied in >> multiple places and may not always correspond to a node clause. >> > > I may be wrong, but I don't quite see why would that be significantly > easier to implement. You still need to track the information somewhere, > and it probably needs for individual Path nodes. Because while building > the plan you don't know which paths will get selected. In my mind, gathering extended statistics usage data may be implemented like it is already done for many parameters in the stat collector. It may help identify whether statistics are used frequently or not. > > Although, maybe the selectivity estimation is sufficiently independent > from the exact paths? In which case we might track it at the plan level. Not only that argument. Usually, when I request a DBMS about an EXPLAIN, my typical desire is to realise which plan we have and why the optimiser has skipped another, maybe even more optimal, plan. I wonder if people frequently have the same intention. That means you will never see clauses (and their estimations) that were kicked off the plan - ppi_clauses of a parameterised path, for example. And it may hide the real mechanic that caused skipping the better plan. Mentioning all statistics involved in the summary may reveal such info. > > The point about estimate_num_groups is good - I think there will be more > cases where linking the extended statistics to a clause will be hard. > But maybe let's not block the whole patch because of that? Sure, if the current code doesn't block further improvements of showing all statistics involved in the query planning process. I just wanted to emphasize the key idea: quite frequently we don't see in explain stuff that triggered suboptimal plan, because it was overestimated and excluded: an index, join, clause ... . Anyway, thanks for answers, I will discover the code more. -- regards, Andrei Lepikhov
pgsql-hackers by date: