Thread: Showing applied extended statistics in explain
Hi,

With extended statistics it may not be immediately obvious whether they were applied, and to which clauses. If you have multiple extended statistics, we may also apply them in different orders, etc. And with expressions, there's also the question of matching expressions to the statistics.

So it seems useful to include this in the explain plan - show which statistics were applied, and in which order. Attached is an early PoC patch doing that in VERBOSE mode. I'll add it to the next CF.

A simple example demonstrating the idea:

======================================================================
create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)
  from generate_series(1,100000) s(i);
create statistics s on a, b from t;
analyze t;

test=# explain (verbose) select * from t where a = 1 and b = 1;
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on public.t  (cost=0.00..1943.00 rows=10040 width=8)
   Output: a, b
   Filter: ((t.a = 1) AND (t.b = 1))
   Statistics: public.s  Clauses: ((a = 1) AND (b = 1))
(4 rows)

test=# explain (verbose) select 1 from t group by a, b;
                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=1943.00..1943.10 rows=10 width=12)
   Output: 1, a, b
   Group Key: t.a, t.b
   ->  Seq Scan on public.t  (cost=0.00..1443.00 rows=100000 width=8)
         Output: a, b
         Statistics: public.s  Clauses: (a AND b)
(6 rows)
======================================================================

The current implementation is a bit ugly PoC, with a couple annoying issues that need to be solved:

1) The information is stashed in multiple lists added to a Plan. Maybe there's a better place, and maybe we need to invent a better way to track the info (a new node stashed in a single List).

2) The deparsing is modeled (i.e. copied) from how we deal with index quals, but it's having issues with nested OR clauses, because there are nested RestrictInfo nodes and the deparsing does not expect that.
3) It does not work for functional dependencies, because we effectively "merge" all functional dependencies and apply the entries. Not sure how to display this, but I think it should show the individual dependencies actually applied.

4) The info is collected always, but I guess we should do that only when in explain mode. Not sure how expensive it is.

5) It includes just statistics name + clauses, but maybe we should include additional info (e.g. estimate for that combination of clauses).

6) The clauses in the grouping query are transformed to an AND list, which is wrong. This is easy to fix, I was lazy to do that in a PoC patch.

7) It does not show statistics for individual expressions. I suppose examine_variable could add it to the rel somehow, and maybe we could do that with index expressions too?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
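For issue 3), the individual functional dependencies (and their degrees) that get merged are already visible from SQL, which hints at what the explain output could show instead of the merged result; a sketch against the example table above (the statistics name is illustrative):

```sql
-- a dependencies-only statistics object on the example table
CREATE STATISTICS s_dep (dependencies) ON a, b FROM t;
ANALYZE t;

-- the individual dependencies with their degrees, keyed by column
-- attribute numbers ("1 => 2" means column a determines column b)
SELECT statistics_name, dependencies
  FROM pg_stats_ext
 WHERE statistics_name = 's_dep';
```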
Attachment
On Saturday, March 27, 2021 at 01:50:54 CEST, Tomas Vondra wrote:
> The current implementation is a bit ugly PoC, with a couple annoying
> issues that need to be solved:

Hello Tomas,

I haven't looked at the implementation at all but I think it's an interesting idea.

> 1) The information is stashed in multiple lists added to a Plan. Maybe
> there's a better place, and maybe we need to invent a better way to
> track the info (a new node stashed in a single List).

Yes, this would probably be cleaner.

> 2) The deparsing is modeled (i.e. copied) from how we deal with index
> quals, but it's having issues with nested OR clauses, because there are
> nested RestrictInfo nodes and the deparsing does not expect that.
>
> 3) It does not work for functional dependencies, because we effectively
> "merge" all functional dependencies and apply the entries. Not sure how
> to display this, but I think it should show the individual dependencies
> actually applied.

Yes, that would be useful when trying to understand where an estimation comes from.

> 4) The info is collected always, but I guess we should do that only when
> in explain mode. Not sure how expensive it is.

That would probably be better, yes.

> 5) It includes just statistics name + clauses, but maybe we should
> include additional info (e.g. estimate for that combination of clauses).

I'm not sure the estimate for the combination is that useful, as you have an associated estimated number of rows attached to the node. I think to be able to really make sense of it, a GUC disabling the extended statistics could be useful for the curious DBA to compare the selectivity estimation for a plan with the statistics and a plan without.

> 6) The clauses in the grouping query are transformed to an AND list, which
> is wrong. This is easy to fix, I was lazy to do that in a PoC patch.
>
> 7) It does not show statistics for individual expressions. I suppose
> examine_variable could add it to the rel somehow, and maybe we could do
> that with index expressions too?

Yes, this would be useful for single-expression extended statistics as well as statistics collected from a functional index.

I don't know if it's doable, but if we want to provide insights into how statistics are used, it could be nice to also display the statistics target used. Since the values at the time of the last analyze and the current value might be different, it could be nice to store it along with the stats. I remember having to troubleshoot queries where the problem was an ALTER <TABLE/INDEX> ... SET STATISTICS had not been run as expected, and having that information available in the plan for a complex query might help in diagnosing the problem quicker.

Regards,

-- 
Ronan Dunklau
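To make the statistics-target point above concrete: the target can be set per column and (since v13) per statistics object, and it only takes effect at the next ANALYZE, which is exactly why the stored value and the current setting can diverge. A sketch (the numbers are arbitrary):

```sql
-- per-column statistics target
ALTER TABLE t ALTER COLUMN a SET STATISTICS 1000;

-- per-statistics-object target (PostgreSQL 13+); the default -1
-- derives the target from the referenced columns
ALTER STATISTICS s SET STATISTICS 500;

-- the new targets are only used from the next ANALYZE onwards
ANALYZE t;
```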
> On Sat, Mar 27, 2021 at 01:50:54AM +0100, Tomas Vondra wrote:
> Hi,
>
> With extended statistics it may not be immediately obvious if they were
> applied and to which clauses. If you have multiple extended statistics,
> we may also apply them in different order, etc. And with expressions,
> there's also the question of matching expressions to the statistics.
>
> So it seems useful to include this in the explain plan - show which
> statistics were applied, in which order. Attached is an early PoC patch
> doing that in VERBOSE mode. I'll add it to the next CF.

Hi,

sounds like a useful improvement indeed, thanks for the patch. Do you plan to invest more time in it?

> 1) The information is stashed in multiple lists added to a Plan. Maybe
> there's a better place, and maybe we need to invent a better way to
> track the info (a new node stashed in a single List).
>
> ...
>
> 3) It does not work for functional dependencies, because we effectively
> "merge" all functional dependencies and apply the entries. Not sure how
> to display this, but I think it should show the individual dependencies
> actually applied.
>
> ...
>
> 5) It includes just statistics name + clauses, but maybe we should
> include additional info (e.g. estimate for that combination of clauses).

Yes, a new node would be nice to have. The other questions above are somewhat related to what it should contain, and I guess it depends on the use case this patch targets. E.g. for the case "help to figure out if an extended statistics was applied" even "merged" functional dependencies would be fine, I believe. More advanced plan troubleshooting may benefit from estimates. What exact use case do you have in mind?

> 4) The info is collected always, but I guess we should do that only when
> in explain mode. Not sure how expensive it is.

Maybe it's in fact not that expensive to always collect the info? Adding it as it is now does not increase the number of cache lines the Plan structure occupies (although it comes close to the boundary), and a couple of simple tests with a CPU-bound load of various types doesn't show a significant slowdown. I haven't tried the worst-case scenario with a lot of extended stats involved, but in such situations I can imagine the overhead could also be rather small in comparison with other expenses.

I've got a few more questions after reading the patch:

* Is there any particular reason behind choosing only some scan nodes to display extended stats? E.g. BitmapHeapScan is missing.

* StatisticExtInfo should have _copy etc. node functionality now, right? I've hit "unrecognized node type" with a prepared statement because it's missing.
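The prepared-statement failure can presumably be reproduced along these lines, since a cached generic plan is duplicated via copyObject(), which is where a missing _copy function for StatisticExtInfo would surface; a sketch:

```sql
PREPARE q(int, int) AS SELECT * FROM t WHERE a = $1 AND b = $2;

-- force the generic (cached) plan so the plan tree gets copied
SET plan_cache_mode = force_generic_plan;  -- PostgreSQL 12+

EXPLAIN (VERBOSE) EXECUTE q(1, 1);
```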
Hi,

On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>> On Sat, Mar 27, 2021 at 01:50:54AM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> With extended statistics it may not be immediately obvious if they were
>> applied and to which clauses. If you have multiple extended statistics,
>> we may also apply them in different order, etc. And with expressions,
>> there's also the question of matching expressions to the statistics.
>>
>> So it seems useful to include this in the explain plan - show which
>> statistics were applied, in which order. Attached is an early PoC patch
>> doing that in VERBOSE mode. I'll add it to the next CF.
>
> Hi,
>
> sounds like a useful improvement indeed, thanks for the patch. Do you
> plan to invest more time in it?

Yes. I think providing more insight into which statistics were applied, in which order and to which clauses, is quite desirable.

>> 1) The information is stashed in multiple lists added to a Plan. Maybe
>> there's a better place, and maybe we need to invent a better way to
>> track the info (a new node stashed in a single List).
>>
>> ...
>>
>> 3) It does not work for functional dependencies, because we effectively
>> "merge" all functional dependencies and apply the entries. Not sure how
>> to display this, but I think it should show the individual dependencies
>> actually applied.
>>
>> ...
>>
>> 5) It includes just statistics name + clauses, but maybe we should
>> include additional info (e.g. estimate for that combination of clauses).
>
> Yes, a new node would be nice to have. The other questions above are
> somewhat related to what it should contain, and I guess it depends on
> the use case this patch targets. E.g. for the case "help to figure out
> if an extended statistics was applied" even "merged" functional
> dependencies would be fine, I believe.

What do you mean by "merged" functional dependencies? I guess we could say "these clauses were estimated using dependencies" without listing which exact dependencies were applied.

> More advanced plan troubleshooting may benefit from estimates.

I'm sorry, I don't know what you mean by this. Can you elaborate?

> What exact use case do you have in mind?

Well, my goal was to help users investigating the plan/estimates, because once you create multiple "candidate" statistics objects it may get quite tricky to determine which of them were applied, in what order, etc. It's not all that straightforward, given the various heuristics we use to pick the "best" statistics, apply dependencies last, etc. And I don't quite want to teach the users those rules, because I consider them mostly implementation details that we may want to tweak in the future.

>> 4) The info is collected always, but I guess we should do that only when
>> in explain mode. Not sure how expensive it is.
>
> Maybe it's in fact not that expensive to always collect the info? Adding
> it as it is now does not increase the number of cache lines the Plan
> structure occupies (although it comes close to the boundary), and a
> couple of simple tests with a CPU-bound load of various types doesn't
> show a significant slowdown. I haven't tried the worst-case scenario
> with a lot of extended stats involved, but in such situations I can
> imagine the overhead could also be rather small in comparison with
> other expenses.

Yeah, once there are many statistics it's probably not an issue - the overhead from processing them is likely way higher than copying this extra info. Plus people tend to create statistics when there are issues with planning the queries.

> I've got a few more questions after reading the patch:
>
> * Is there any particular reason behind choosing only some scan nodes to
>   display extended stats? E.g. BitmapHeapScan is missing.

All nodes that may apply extended stats to estimate quals should include this info. I guess BitmapHeapScan may do that for the filter, right?

> * StatisticExtInfo should have _copy etc. node functionality now,
>   right? I've hit "unrecognized node type" with a prepared statement
>   because it's missing.

Yeah, probably.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>>> So it seems useful to include this in the explain plan - show which
>>> statistics were applied, in which order. Attached is an early PoC patch
>>> doing that in VERBOSE mode. I'll add it to the next CF.

> Yes. I think providing more insight into which statistics were applied,
> in which order and to which clauses, is quite desirable.

TBH I do not agree that this is a great idea. I think it's inevitably exposing a lot of unstable internal planner details.

I like even less the aspect that this means a lot of information has to be added to the finished plan in case it's needed for EXPLAIN. Aside from the sheer cost of copying that data around, what happens if for example somebody drops a statistics object between the time of planning and the EXPLAIN? Are we going to start keeping locks on those objects for the lifetime of the plans?

regards, tom lane
On 7/27/21 10:50 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>>>> So it seems useful to include this in the explain plan - show which
>>>> statistics were applied, in which order. Attached is an early PoC patch
>>>> doing that in VERBOSE mode. I'll add it to the next CF.
>
>> Yes. I think providing more insight into which statistics were applied,
>> in which order and to which clauses, is quite desirable.
>
> TBH I do not agree that this is a great idea. I think it's inevitably
> exposing a lot of unstable internal planner details.

Which unstable planner details? IMHO it's not all that different from info about which indexes were picked for the query.

> I like even less the aspect that this means a lot of information has
> to be added to the finished plan in case it's needed for EXPLAIN.

Yes, that's true. I mentioned that in my initial post, and I suggested we might collect it only when in explain mode.

> Aside from the sheer cost of copying that data around, what happens
> if for example somebody drops a statistics object between the time of
> planning and the EXPLAIN? Are we going to start keeping locks on
> those objects for the lifetime of the plans?

Yes, that'd be an issue. I'm not sure what to do about it, short of either locking the (applied) statistics objects, or maybe just simply copying the bits we need for explain (pretty much just the name).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 27, 2021 at 4:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> TBH I do not agree that this is a great idea. I think it's inevitably
> exposing a lot of unstable internal planner details.

Well, that is a risk, but I think I like the alternative even less. Imagine if we had a CREATE INDEX command but no way -- other than running queries and noticing how long they seem to take -- to tell whether it was being used. That would suck, a lot, and this seems to me to be exactly the same. A user who creates a statistics object has a legitimate interest in finding out whether that object is doing anything to a given query that they happen to care about.

> I like even less
> the aspect that this means a lot of information has to be added to the
> finished plan in case it's needed for EXPLAIN. Aside from the sheer
> cost of copying that data around, what happens if for example somebody
> drops a statistics object between the time of planning and the EXPLAIN?
> Are we going to start keeping locks on those objects for the lifetime
> of the plans?

I don't understand the premise of this question. We don't keep locks on tables or indexes involved in a plan for the lifetime of a plan, or on functions or any other kind of object either. We instead arrange to invalidate the plan if those objects are modified or dropped. Why would we not use the same approach here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
> On Tue, Jul 27, 2021 at 10:20:34PM +0200, Tomas Vondra wrote:
>
> >> 1) The information is stashed in multiple lists added to a Plan. Maybe
> >> there's a better place, and maybe we need to invent a better way to
> >> track the info (a new node stashed in a single List).
> >>
> >> ...
> >>
> >> 3) It does not work for functional dependencies, because we effectively
> >> "merge" all functional dependencies and apply the entries. Not sure how
> >> to display this, but I think it should show the individual dependencies
> >> actually applied.
> >>
> >> ...
> >>
> >> 5) It includes just statistics name + clauses, but maybe we should
> >> include additional info (e.g. estimate for that combination of clauses).
> >
> > Yes, a new node would be nice to have. The other questions above are
> > somewhat related to what it should contain, and I guess it depends on
> > the use case this patch targets. E.g. for the case "help to figure out
> > if an extended statistics was applied" even "merged" functional
> > dependencies would be fine, I believe.
>
> What do you mean by "merged" functional dependencies? I guess we could
> say "these clauses were estimated using dependencies" without listing
> which exact dependencies were applied.

Yes, that's exactly what I was thinking. From the original email I got the impression that in case of functional dependencies you plan to display the info only with the individual dependencies (when implemented), or nothing at all. By "merged" I wanted to refer to the statement about "merging" all functional dependencies and applying the entries.

> > More advanced plan troubleshooting may benefit from estimates.
>
> I'm sorry, I don't know what you mean by this. Can you elaborate?

Yeah, sorry for not being clear. The idea was that the question about including "additional info" strongly depends on which use cases the patch tries to address, and I follow up on that further. There is no more hidden detailed meaning here :)

> > I've got a few more questions after reading the patch:
> >
> > * Is there any particular reason behind choosing only some scan nodes to
> >   display extended stats? E.g. BitmapHeapScan is missing.
>
> All nodes that may apply extended stats to estimate quals should include
> this info. I guess BitmapHeapScan may do that for the filter, right?

Yes, something like this (stats output added by me):

    Bitmap Heap Scan on public.tenk1
      Output: unique1
      Recheck Cond: (tenk1.unique1 < 1000)
      Filter: (tenk1.stringu1 = 'xxx'::name)
      Statistics: public.s  Clauses: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))
      ->  Bitmap Index Scan on tenk1_unique1
            Index Cond: (tenk1.unique1 < 1000)
Hi Tomas!
>Well, my goal was to help users investigating the plan/estimates,
>because once you create multiple "candidate" statistics objects it may
>get quite tricky to determine which of them were applied, in what order,
>etc. It's not all that straightforward, given the various heuristics we
>use to pick the "best" statistics, apply dependencies last, etc. And I
>don't quite want to teach the users those rules, because I consider them
>mostly implementation details that we may want to tweak in the future.
You are right. This feature will be very useful for plan tuning with
extended statistics. Therefore, I would like to help develop it. :-D
>5) It includes just statistics name + clauses, but maybe we should
>include additional info (e.g estimate for that combination of clauses).
I thought the above sentence was about what level to aim for in the initial
version. Ideally, we would like to include additional information. However,
it is clear that the more things we have, the longer it will take to develop.
So, I think it is better to commit the initial version at a minimal level to
provide it to users quickly.
As long as the extended stats name is displayed in the EXPLAIN result, it is
possible to figure out which extended statistics were used. That information
alone is valuable to the user.
> 4) The info is collected always, but I guess we should do that only when
>in explain mode. Not sure how expensive it is.
In the case of pg_plan_advsr that I created, I used ProcessUtility_hook to
detect the EXPLAIN command. It searches all queries to find T_ExplainStmt and
sets a flag. I guess we can't use this method in Postgres core, right?
If we have to collect the extra info for all query executions, we need to
reduce overhead. I mean collecting the minimum necessary.
To do that, I think it would be better to display less extended stats info
in EXPLAIN results. For example, displaying only the extended stats names is
fine, I guess. (I haven't understood your patch yet, so if I'm wrong, sorry.)
>6) The clauses in the grouping query are transformed to AND list, which
>is wrong. This is easy to fix, I was lazy to do that in a PoC patch.
6) is related to 5).
If we agree to stop showing the quals of extended stats in the EXPLAIN result,
we can solve this problem by removing that code. Is that okay?
>2) The deparsing is modeled (i.e. copied) from how we deal with index
>quals, but it's having issues with nested OR clauses, because there are
>nested RestrictInfo nodes and the deparsing does not expect that.
>
>3) It does not work for functional dependencies, because we effectively
>"merge" all functional dependencies and apply the entries. Not sure how
>to display this, but I think it should show the individual dependencies
>actually applied.
>
>7) It does not show statistics for individual expressions. I suppose
>examine_variable could add it to the rel somehow, and maybe we could do
>that with index expressions too?
I'm not sure about 2), 3) and 7) above, so I'd like to see some concrete examples
of queries. I would like to include them in the test patterns for regression testing.
To be fixed:
* StatisticExtInfo should have a _copy etc. node functionality now,
right? I've hit "unrecognized node type" with a prepared statement
because it's missing.
* Is there any particular reason behind choosing only some scan nodes to
display extended stats? E.g. BitmapHeapScan is missing.
* (New) In the case of GROUP BY, we should put the extended stats info under the
Hash/Group Aggregate node (not under the Scan node).
* (New) We have to create a regression test including the above test patterns.
Attached patch:
It is a version rebased on the head of master, because there were many hunks
when I applied the previous patch to master.
I'll create the regression tests first.
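A first regression test might mirror the existing stats_ext tests, using COSTS OFF so the output stays stable across machines; a rough sketch (object names are placeholders):

```sql
CREATE TABLE est_tbl (a int, b int);
INSERT INTO est_tbl SELECT mod(i, 10), mod(i, 10)
  FROM generate_series(1, 10000) s(i);
CREATE STATISTICS est_stat ON a, b FROM est_tbl;
ANALYZE est_tbl;

-- stats info expected in the Seq Scan node
EXPLAIN (VERBOSE, COSTS OFF)
  SELECT * FROM est_tbl WHERE a = 1 AND b = 1;

-- stats info expected under the Aggregate node, per the list above
EXPLAIN (VERBOSE, COSTS OFF)
  SELECT 1 FROM est_tbl GROUP BY a, b;

DROP TABLE est_tbl;
```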
Regards,
Tatsuro Yamada
Attachment
As discussed in [1], we're taking this opportunity to return some patchsets that don't appear to be getting enough reviewer interest.

This is not a rejection, since we don't necessarily think there's anything unacceptable about the entry, but it differs from a standard "Returned with Feedback" in that there's probably not much actionable feedback at all. Rather than code changes, what this patch needs is more community interest. You might

- ask people for help with your approach,
- see if there are similar patches that your code could supplement,
- get interested parties to agree to review your patch in a CF, or
- possibly present the functionality in a way that's easier to review overall.

(Doing these things is no guarantee that there will be interest, but it's hopefully better than endlessly rebasing a patchset that is not receiving any feedback from the community.)

Once you think you've built up some community support and the patchset is ready for review, you (or any interested party) can resurrect the patch entry by visiting

    https://commitfest.postgresql.org/38/3050/

and changing the status to "Needs Review", and then changing the status again to "Move to next CF". (Don't forget the second step; hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob

[1] https://postgr.es/m/86140760-8ba5-6f3a-3e6e-5ca6c060bd24@timescale.com