Thread: Showing applied extended statistics in explain

Showing applied extended statistics in explain

From
Tomas Vondra
Date:
Hi,

With extended statistics it may not be immediately obvious if they were
applied and to which clauses. If you have multiple extended statistics,
we may also apply them in different order, etc. And with expressions,
there's also the question of matching expressions to the statistics.

So it seems useful to include this into in the explain plan - show which
statistics were applied, in which order. Attached is an early PoC patch
doing that in VERBOSE mode. I'll add it to the next CF.


A simple example demonstrating the idea:

======================================================================

  create table t (a int, b int);
  insert into t select mod(i,10), mod(i,10)
    from generate_series(1,100000) s(i);

  create statistics s on a, b from t;
  analyze t;

test=# explain (verbose) select * from t where a = 1 and b = 1;
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on public.t  (cost=0.00..1943.00 rows=10040 width=8)
   Output: a, b
   Filter: ((t.a = 1) AND (t.b = 1))
   Statistics: public.s  Clauses: ((a = 1) AND (b = 1))
(4 rows)

test=# explain (verbose) select 1 from t group by a, b;
                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=1943.00..1943.10 rows=10 width=12)
   Output: 1, a, b
   Group Key: t.a, t.b
   ->  Seq Scan on public.t  (cost=0.00..1443.00 rows=100000 width=8)
         Output: a, b
         Statistics: public.s  Clauses: (a AND b)
(6 rows)

======================================================================

The current implementation is a bit ugly PoC, with a couple annoying
issues that need to be solved:

1) The information is stashed in multiple lists added to a Plan. Maybe
there's a better place, and maybe we need to invent a better way to
track the info (a new node stashed in a single List).

2) The deparsing is modeled (i.e. copied) from how we deal with index
quals, but it's having issues with nested OR clauses, because there are
nested RestrictInfo nodes and the deparsing does not expect that.

3) It does not work for functional dependencies, because we effectively
"merge" all functional dependencies and apply the entries. Not sure how
to display this, but I think it should show the individual dependencies
actually applied.

4) The info is collected always, but I guess we should do that only when
in explain mode. Not sure how expensive it is.

5) It includes just statistics name + clauses, but maybe we should
include additional info (e.g estimate for that combination of clauses).

6) The clauses in the grouping query are transformed to AND list, which
is wrong. This is easy to fix, I was lazy to do that in a PoC patch.

7) It does not show statistics for individual expressions. I suppose
examine_variable could add it to the rel somehow, and maybe we could do
that with index expressions too?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Showing applied extended statistics in explain

From
Ronan Dunklau
Date:
Le samedi 27 mars 2021, 01:50:54 CEST Tomas Vondra a écrit :
> The current implementation is a bit ugly PoC, with a couple annoying
> issues that need to be solved:
>
Hello Thomas,

I haven't looked at the implementation at all but I think it's an interesting
idea.


> 1) The information is stashed in multiple lists added to a Plan. Maybe
> there's a better place, and maybe we need to invent a better way to
> track the info (a new node stashed in a single List).

Yes this would probably be cleaner.

>
> 2) The deparsing is modeled (i.e. copied) from how we deal with index
> quals, but it's having issues with nested OR clauses, because there are
> nested RestrictInfo nodes and the deparsing does not expect that.
>
> 3) It does not work for functional dependencies, because we effectively
> "merge" all functional dependencies and apply the entries. Not sure how
> to display this, but I think it should show the individual dependencies
> actually applied.

Yes that would be useful when trying to understand where an estimation comes
from.

>
> 4) The info is collected always, but I guess we should do that only when
> in explain mode. Not sure how expensive it is.

That would probably be better yes.

>
> 5) It includes just statistics name + clauses, but maybe we should
> include additional info (e.g estimate for that combination of clauses).

I'm not sure the estimate for the combination is that useful, as you have an
associated estimated number of rows attached to the node.  I think to be able
to really make sense of it, a GUC disabling the extended statistics could be
useful for the curious DBA to compare the selectivity estimation for a plan
with the statistics and a plan without.

>
> 6) The clauses in the grouping query are transformed to AND list, which
> is wrong. This is easy to fix, I was lazy to do that in a PoC patch.
>
> 7) It does not show statistics for individual expressions. I suppose
> examine_variable could add it to the rel somehow, and maybe we could do
> that with index expressions too?

Yes this would be useful for single-expression extended statistics as well as
statistics collected from a functional index.

I don't know if it's doable, but if we want to provide insights into how
statistics are used, it could be nice to also display the statistics target
used. Since the values at the time of the last analyze and the current value
might be different, it could be nice to store it along with the stats. I
remember having to troubleshoot queries where the problem was an ALTER <TABLE/
INDEX> ... SET STATISTICS had not been run as expected, and having that
information available in the plan for a complex query might help in diagnosing
the problem quicker.

Regards,

--
Ronan Dunklau





Re: Showing applied extended statistics in explain

From
Dmitry Dolgov
Date:
> On Sat, Mar 27, 2021 at 01:50:54AM +0100, Tomas Vondra wrote:
> Hi,
>
> With extended statistics it may not be immediately obvious if they were
> applied and to which clauses. If you have multiple extended statistics,
> we may also apply them in different order, etc. And with expressions,
> there's also the question of matching expressions to the statistics.
>
> So it seems useful to include this into in the explain plan - show which
> statistics were applied, in which order. Attached is an early PoC patch
> doing that in VERBOSE mode. I'll add it to the next CF.

Hi,

sounds like a useful improvement indeed, thanks for the patch. Do you
plan to invest more time in it?

> 1) The information is stashed in multiple lists added to a Plan. Maybe
> there's a better place, and maybe we need to invent a better way to
> track the info (a new node stashed in a single List).
>
> ...
>
> 3) It does not work for functional dependencies, because we effectively
> "merge" all functional dependencies and apply the entries. Not sure how
> to display this, but I think it should show the individual dependencies
> actually applied.
>
> ...
>
> 5) It includes just statistics name + clauses, but maybe we should
> include additional info (e.g estimate for that combination of clauses).

Yes, a new node would be nice to have. The other questions above are
somewhat related to what it should contain, and I guess it depends on
the use case this patch targets. E.g. for the case "help to figure out
if an extended statistics was applied" even "merged" functional
dependencies would be fine I believe. More advanced plan troubleshooting
may benefit from estimates. What exactly use case do you have in mind?

> 4) The info is collected always, but I guess we should do that only when
> in explain mode. Not sure how expensive it is.

Maybe it's in fact not that expensive to always collect the info? Adding
it as it is now do not increase number of cache lines Plan structure
occupies (although it comes close to the boundary), and a couple of
simple tests with CPU bounded load of various types doesn't show
significant slowdown. I haven't tried the worst case scenario with a lot
of extended stats involved, but in such situations I can imagine the
overhead also could be rather small in comparison with other expenses.

I've got few more questions after reading the patch:

* Is there any particular reason behind choosing only some scan nodes to
  display extended stats? E.g. BitmapHeapScan is missing.

* StatisticExtInfo should have a _copy etc. node functionalty now,
  right? I've hit "unrecognized node type" with a prepared statement
  because it's missing.



Re: Showing applied extended statistics in explain

From
Tomas Vondra
Date:
Hi,

On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>> On Sat, Mar 27, 2021 at 01:50:54AM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> With extended statistics it may not be immediately obvious if they were
>> applied and to which clauses. If you have multiple extended statistics,
>> we may also apply them in different order, etc. And with expressions,
>> there's also the question of matching expressions to the statistics.
>>
>> So it seems useful to include this into in the explain plan - show which
>> statistics were applied, in which order. Attached is an early PoC patch
>> doing that in VERBOSE mode. I'll add it to the next CF.
> 
> Hi,
> 
> sounds like a useful improvement indeed, thanks for the patch. Do you
> plan to invest more time in it?
> 

Yes. I think providing more insight into which statistics were applied,
in which order and to which clauses, is quite desirable.

>> 1) The information is stashed in multiple lists added to a Plan. Maybe
>> there's a better place, and maybe we need to invent a better way to
>> track the info (a new node stashed in a single List).
>>
>> ...
>>
>> 3) It does not work for functional dependencies, because we effectively
>> "merge" all functional dependencies and apply the entries. Not sure how
>> to display this, but I think it should show the individual dependencies
>> actually applied.
>>
>> ...
>>
>> 5) It includes just statistics name + clauses, but maybe we should
>> include additional info (e.g estimate for that combination of clauses).
> 
> Yes, a new node would be nice to have. The other questions above are
> somewhat related to what it should contain, and I guess it depends on
> the use case this patch targets. E.g. for the case "help to figure out
> if an extended statistics was applied" even "merged" functional
> dependencies would be fine I believe.

What do you mean by "merged" functional dependencies? I guess we could
say "these clauses were estimated using dependencies" without listing
which exact dependencies were applied.

> More advanced plan troubleshooting may benefit from estimates.

I'm sorry, I don't know what you mean by this. Can you elaborate?

> What exactly use case do you have in mind?
Well, my goal was to help users investigating the plan/estimates,
because once you create multiple "candidate" statistics objects it may
get quite tricky to determine which of them were applied, in what order,
etc. It's not all that straightforward, given the various heuristics we
use to pick the "best" statistics, apply dependencies last, etc. And I
don't quite want to teach the users those rules, because I consider them
mostly implementation details that wee may want to tweak in the future.

>> 4) The info is collected always, but I guess we should do that only when
>> in explain mode. Not sure how expensive it is.
> 
> Maybe it's in fact not that expensive to always collect the info? Adding
> it as it is now do not increase number of cache lines Plan structure
> occupies (although it comes close to the boundary), and a couple of
> simple tests with CPU bounded load of various types doesn't show
> significant slowdown. I haven't tried the worst case scenario with a lot
> of extended stats involved, but in such situations I can imagine the
> overhead also could be rather small in comparison with other expenses.
> 

Yeah, once there are many statistics it's probably not an issue - the
overhead from processing them is likely way higher than copying this
extra info. Plus people tend to create statistics when there are issues
with planning the queries.

> I've got few more questions after reading the patch:
> 
> * Is there any particular reason behind choosing only some scan nodes to
>   display extended stats? E.g. BitmapHeapScan is missing.
> 

All nodes that may apply extended stats to estimate quals should include
this info. I guess BitmapHeapScan may do that for the filter, right?

> * StatisticExtInfo should have a _copy etc. node functionalty now,
>   right? I've hit "unrecognized node type" with a prepared statement
>   because it's missing.
> 

Yeah, probably.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Showing applied extended statistics in explain

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>>> So it seems useful to include this into in the explain plan - show which
>>> statistics were applied, in which order. Attached is an early PoC patch
>>> doing that in VERBOSE mode. I'll add it to the next CF.

> Yes. I think providing more insight into which statistics were applied,
> in which order and to which clauses, is quite desirable.

TBH I do not agree that this is a great idea.  I think it's inevitably
exposing a lot of unstable internal planner details.  I like even less
the aspect that this means a lot of information has to be added to the
finished plan in case it's needed for EXPLAIN.  Aside from the sheer
cost of copying that data around, what happens if for example somebody
drops a statistic object between the time of planning and the EXPLAIN?
Are we going to start keeping locks on those objects for the lifetime
of the plans?

            regards, tom lane



Re: Showing applied extended statistics in explain

From
Tomas Vondra
Date:
On 7/27/21 10:50 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> On 7/27/21 12:21 PM, Dmitry Dolgov wrote:
>>>> So it seems useful to include this into in the explain plan - show which
>>>> statistics were applied, in which order. Attached is an early PoC patch
>>>> doing that in VERBOSE mode. I'll add it to the next CF.
> 
>> Yes. I think providing more insight into which statistics were applied,
>> in which order and to which clauses, is quite desirable.
> 
> TBH I do not agree that this is a great idea.  I think it's inevitably
> exposing a lot of unstable internal planner details.

Which unstable planner details? IMHO it's not all that different from
info about which indexes were picked for the query.

> I like even less the aspect that this means a lot of information has
> to be added to the finished plan in case it's needed for EXPLAIN.
Yes, that's true. I mentioned that in my initial post, and I suggested
we might collect it only when in explain mode.

> Aside from the sheer cost of copying that data around, what happens
> if for example somebody drops a statistic object between the time of
> planning and the EXPLAIN? Are we going to start keeping locks on
> those objects for the lifetime of the plans?
> 

Yes, that'd be an issue. I'm not sure what to do about it, short of
either locking the (applied) statistics objects, or maybe just simply
copying the bits we need for explain (pretty much just name).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Showing applied extended statistics in explain

From
Robert Haas
Date:
On Tue, Jul 27, 2021 at 4:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> TBH I do not agree that this is a great idea.  I think it's inevitably
> exposing a lot of unstable internal planner details.

Well, that is a risk, but I think I like the alternative even less.
Imagine if we had a CREATE INDEX command but no way -- other than
running queries and noticing how long they seem to take -- to tell
whether it was being used. That would suck, a lot, and this seems to
me to be exactly the same. A user who creates a statistics object has
a legitimate interest in finding out whether that object is doing
anything to a given query that they happen to care about.

> I like even less
> the aspect that this means a lot of information has to be added to the
> finished plan in case it's needed for EXPLAIN.  Aside from the sheer
> cost of copying that data around, what happens if for example somebody
> drops a statistic object between the time of planning and the EXPLAIN?
> Are we going to start keeping locks on those objects for the lifetime
> of the plans?

I don't understand the premise of this question. We don't keep locks
on tables or indexes involved in a plan for the lifetime of a plan, or
on functions or any other kind of object either. We instead arrange to
invalidate the plan if those objects are modified or dropped. Why
would we not use the same approach here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Showing applied extended statistics in explain

From
Dmitry Dolgov
Date:
> On Tue, Jul 27, 2021 at 10:20:34PM +0200, Tomas Vondra wrote:
>
> >> 1) The information is stashed in multiple lists added to a Plan. Maybe
> >> there's a better place, and maybe we need to invent a better way to
> >> track the info (a new node stashed in a single List).
> >>
> >> ...
> >>
> >> 3) It does not work for functional dependencies, because we effectively
> >> "merge" all functional dependencies and apply the entries. Not sure how
> >> to display this, but I think it should show the individual dependencies
> >> actually applied.
> >>
> >> ...
> >>
> >> 5) It includes just statistics name + clauses, but maybe we should
> >> include additional info (e.g estimate for that combination of clauses).
> >
> > Yes, a new node would be nice to have. The other questions above are
> > somewhat related to what it should contain, and I guess it depends on
> > the use case this patch targets. E.g. for the case "help to figure out
> > if an extended statistics was applied" even "merged" functional
> > dependencies would be fine I believe.
>
> What do you mean by "merged" functional dependencies? I guess we could
> say "these clauses were estimated using dependencies" without listing
> which exact dependencies were applied.

Yes, that's exactly what I was thinking. From the original email I've
got an impression that in case of functional dependencies you plan to
display the info only with the individual dependencies (when
implemented) or nothing at all. By "merged" I wanted to refer to the
statement about "merge" all functional dependencies and apply the
entries.

> > More advanced plan troubleshooting may benefit from estimates.
>
> I'm sorry, I don't know what you mean by this. Can you elaborate?

Yeah, sorry for not being clear. The idea was that the question about including
"additional info" strongly depends on which use cases the patch tries to
address, and I follow up on that further. There is no more hidden detailed
meaning here :)

> > I've got few more questions after reading the patch:
> >
> > * Is there any particular reason behind choosing only some scan nodes to
> >   display extended stats? E.g. BitmapHeapScan is missing.
> >
>
> All nodes that may apply extended stats to estimate quals should include
> this info. I guess BitmapHeapScan may do that for the filter, right?
>

Yes, something like this (stats output added by me):

     Bitmap Heap Scan on public.tenk1
       Output: unique1
       Recheck Cond: (tenk1.unique1 < 1000)
       Filter: (tenk1.stringu1 = 'xxx'::name)
       Statistics: public.s Clauses: ((unique1 < 1000) AND (stringu1 = 'xxx'::name))
       ->  Bitmap Index Scan on tenk1_unique1
             Index Cond: (tenk1.unique1 < 1000)



Re: Showing applied extended statistics in explain

From
Tatsuro Yamada
Date:
Hi Tomas!

>> What exactly use case do you have in mind?
>Well, my goal was to help users investigating the plan/estimates,
>because once you create multiple "candidate" statistics objects it may
>get quite tricky to determine which of them were applied, in what order,
>etc. It's not all that straightforward, given the various heuristics we
>use to pick the "best" statistics, apply dependencies last, etc. And I
>don't quite want to teach the users those rules, because I consider them
>mostly implementation details that wee may want to tweak in the future.

You are right. This feature will be very useful for plan tuning with
extended statistics. Therefore, I would like to help develop it. :-D


>5) It includes just statistics name + clauses, but maybe we should
>include additional info (e.g estimate for that combination of clauses).

I thought the above sentence was about what level to aim for in the initial 
version. Ideally, we would like to include additional information. However, 
it is clear that the more things we have, the longer it will take to develop.
So, I think it is better to commit the initial version at a minimal level to 
provide it to users quickly. 

As long as an Extended stats name is displayed in EXPLAIN result, it is 
possible to figure out which extended statistics are used. That information 
alone is valuable to the user.


> 4) The info is collected always, but I guess we should do that only when
>in explain mode. Not sure how expensive it is.

In the case of pg_plan_advsr that I created, I used ProcessUtility_hook to
detect EXPLAIN command. It searches all queries to find T_ExplainStmt, and
set a flag. I guess we can't use this method in Postgres core, right?

If we have to collect the extra info for all query executions, we need to
reduce overhead. I mean collecting the minimum necessary.
To do that, I think it would be better to display less Extended stats info
in EXPLAIN results. For example, displaying only extended stats names is fine,
I guess. (I haven't understood your patch yet, so If I say wrong, sorry)


>6) The clauses in the grouping query are transformed to AND list, which
>is wrong. This is easy to fix, I was lazy to do that in a PoC patch.

6) is related 5).
If we agree to remove showing quals of extended stats in EXPLAIN result,
We can solve this problem by removing the code. Is it okay?


>2) The deparsing is modeled (i.e. copied) from how we deal with index
>quals, but it's having issues with nested OR clauses, because there are
>nested RestrictInfo nodes and the deparsing does not expect that.
>
>3) It does not work for functional dependencies, because we effectively
>"merge" all functional dependencies and apply the entries. Not sure how
>to display this, but I think it should show the individual dependencies
>actually applied.
>
>7) It does not show statistics for individual expressions. I suppose
>examine_variable could add it to the rel somehow, and maybe we could do
>that with index expressions too?

I'm not sure about 2) 3) and 7) above, so I'd like to see some concrete examples 
of queries. I would like to include it in the test pattern for regression testing.


To be fixed:

* StatisticExtInfo should have a _copy etc. node functionality now,
 right? I've hit "unrecognized node type" with a prepared statement
 because it's missing.

* Is there any particular reason behind choosing only some scan nodes to
 display extended stats? E.g. BitmapHeapScan is missing.

* (New) In the case of Group by, we should put Extended stats info under the
 Hash/Group Aggregate node (not under Scan node).

* (New) We have to create a regression test including the above test patterns.


Attached patch:

It is a rebased version on the head of the master because there were many Hunks 
when I applied the previous patch on master.

I'll create regression tests firstly.

Regards,
Tatsuro Yamada


Attachment

Re: Showing applied extended statistics in explain

From
Jacob Champion
Date:
As discussed in [1], we're taking this opportunity to return some
patchsets that don't appear to be getting enough reviewer interest.

This is not a rejection, since we don't necessarily think there's
anything unacceptable about the entry, but it differs from a standard
"Returned with Feedback" in that there's probably not much actionable
feedback at all. Rather than code changes, what this patch needs is more
community interest. You might

- ask people for help with your approach,
- see if there are similar patches that your code could supplement,
- get interested parties to agree to review your patch in a CF, or
- possibly present the functionality in a way that's easier to review
  overall.

(Doing these things is no guarantee that there will be interest, but
it's hopefully better than endlessly rebasing a patchset that is not
receiving any feedback from the community.)

Once you think you've built up some community support and the patchset
is ready for review, you (or any interested party) can resurrect the
patch entry by visiting

    https://commitfest.postgresql.org/38/3050/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob

[1] https://postgr.es/m/86140760-8ba5-6f3a-3e6e-5ca6c060bd24@timescale.com