Thread: Showing applied extended statistics in explain Part 2

Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

01 March 2024, 00:19:42

Hi,

This original patch made by Tomas improves the usability of extended statistics,
so I rebased it on 362de947, and I'd like to re-start developing it.

The previous thread [1] suggested something to solve. I'll try to solve it as
best I can, but I think this feature is worth it with some limitations.
Please find the attached file.

[1] https://www.postgresql.org/message-id/flat/8081617b-d80f-ae2b-b79f-ea7e926f9fcf%40enterprisedb.com

Regards,
Tatsuro Yamada
NTT Open Source Software Center

Attachment

0001-show-stats-in-explain-rebased-on-362de947.patch

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

02 March 2024, 00:11:26

On 3/1/24 01:19, Tatsuro Yamada wrote:
> Hi,
> 
> This original patch made by Tomas improves the usability of extended statistics, 
> so I rebased it on 362de947, and I'd like to re-start developing it.
>  
> The previous thread [1] suggested something to solve. I'll try to solve it as 
> best I can, but I think this feature is worth it with some limitations.
> Please find the attached file.
> 

Thank you for the interest in moving this patch forward. And I agree
it's worth to cut some of the stuff if it's necessary to make it work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

12 June 2024, 11:13:56

Hello Yamada-san,

I finally got to look at this patch again - apologies for taking so
long, I'm well aware it's rather frustrating to wait for feedback. I'll
try to pay more attention to this patch, and don't hesitate to ping me
off-list if immediate input is needed.

I looked at the patch from March 1 [1], which applies mostly with some
minor bitrot, but no major conflicts. A couple review comments:


1) The patch is not added to the CF app, which I think is a mistake. Can
you please add it to the 2024-07 commitfest? Otherwise people may not be
aware of it, won't do reviews etc. It'll require posting a rebased
patch, but should not be a big deal.


2) Not having the patch in a CF also means cfbot is not running tests on
it. Which is unfortunate, because the patch actually has an a bug cfbot
would find - I've noticed it after running the tests through the github
CI, see [2].

FWIW I very much recommend setting up this CI and using it during
development, it turned out to be very valuable for me as it tests on a
range of systems, and more convenient than the rpi5 machines I used for
that purposes before. See src/tools/ci/README for details.


3) The bug has this symptom:

  ERROR:  unrecognized node type: 268
  CONTEXT:  PL/pgSQL function check_estimated_rows(text) line 7 ...
  STATEMENT:  SELECT * FROM check_estimated_rows('SELECT a, b FROM ...

but it only shows on the FreeBSD machine (in CI). But that's simply
because that's running tests with "-DCOPY_PARSE_PLAN_TREES", which
always copies the query plan, to make sure all the nodes can be copied.
And we don't have a copy function for the StatisticExtInfo node (that's
the node 268), so it fails.

FWIW you can have this failure even on a regular build, you just need to
do explain on a prepared statement (with an extended statistics):

  CREATE TABLE t (a int, b int);

  INSERT INTO t SELECT (i/100), (i/100)
    FROM generate_series(1,1000) s(i);

  CREATE STATISTICS ON a,b FROM t;

  ANALYZE t;

  PREPARE ps (INT, INT) AS SELECT * FROM t WHERE a = $1 AND b = $2;

  EXPLAIN EXECUTE ps(5,5);

  ERROR:  unrecognized node type: 268


4) I can think of two basic ways to fix this issue - either allow
copying of the StatisticExtInto node, or represent the information in a
different way (e.g. add a new node for that purpose, or use existing
nodes to do that).

I don't think we should do the first thing - the StatisticExtInfo is in
pathnodes.h, is designed to be used during path construction, has a lot
of fields that we likely don't need to show stuff in explain - but we'd
need to copy those too, and that seems useless / expensive.

So I suggest we invent a new (much simpler) node, tracking only the bits
we actually need for in the explain part. Or alternatively, if you think
adding a separate node is an overkill, maybe we could keep just an OID
of the statistics we applied, and the explain would lookup the name?

But I think having a new node might might also make the patch simpler,
as the new struct could combine the information the patch keeps in three
separate lists. Instead, there's be just one new list in Plan, members
would be the new node type, and each element would be

  (statistics OID, list of clauses, flag is_or)

or something like that.


5) In [3] Tom raised two possible issues with doing this - cost of
copying the information, and locking. For the performance concerns, I
think the first thing we should do is measuring how expensive it is. I
suggest measuring the overhead for about three basic cases:

 - table with no extended stats
 - table with a couple (1-10?) extended stats
 - table with a lot of (100?) extended stats

And see the impact of the patch. That is, measure the planning time with
master and with the patch applied. The table size does not matter much,
I think - this should measure just the planning, not execute the query.
In practice the extra costs will get negligible as the execution time
grows. But we're measuring the worst case, so that's fine.

For the locking, I agree with Robert [4] that this probably should not
be an issue - I don't see why would this be different from indexes etc.
But I haven't thought about that too much, so maybe investigate and test
this a bit more (that plans get invalidated if the statistics changes,
and so on).


6) I'm not sure we want to have this under EXPLAIN (VERBOSE). It's what
I did in the initial PoC patch, but maybe we should invent a new flag
for this purpose, otherwise VERBOSE will cover too much stuff? I'm
thinking about "STATS" for example.

This would probably mean the patch should also add a new auto_explain
"log_" flag to enable/disable this.


7) The patch really needs some docs - I'd mention this in the EXPLAIN
docs, probably. There's also a chapter about estimates, maybe that
should mention this too? Try searching for places in the SGML docs
mentioning extended stats and/or explain, I guess.

For tests, I guess stats_ext is the right place to test this. I'm not
sure what's the best way to do this, though. If it's covered by VERBOSE,
that seems it might be unstable - and that would be an issue. But maybe
we might add a function similar to check_estimated_rows(), but verifying
 the query used the expected statistics to estimate expected clauses.

But maybe with the new explain "STATS" flag it would be easier, because
we could do EXPLAIN (COSTS OFF, STATS ON) and that would be stable.

As for what needs to be tested, I don't think we need to test how we
match queries/clauses to statistics - that's already tested. It's fine
to focus just on displaying the expected stuff. I'd take a couple of the
existing tests, and check those. And then also add a couple tests for
prepared statements, and invalidation of a plan after an extended
statistics gets dropped, etc.


So there's stuff to do to make this committable, but hopefully this
review gives you some guidance regarding what/how ;-)


regards


[1]
https://www.postgresql.org/message-id/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com

[2] https://cirrus-ci.com/build/6436352672137216

[3] https://www.postgresql.org/message-id/459863.1627419001%40sss.pgh.pa.us

[4]
https://www.postgresql.org/message-id/CA%2BTgmoZU34zo4%3DhyqgLH16iGpHQ6%2BQAesp7k5a1cfZB%3D%2B9xtsw%40mail.gmail.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

26 June 2024, 09:06:43

Hi Tomas!

Thanks for the comments!

1) The patch is not added to the CF app, which I think is a mistake. Can
you please add it to the 2024-07 commitfest? Otherwise people may not be
aware of it, won't do reviews etc. It'll require posting a rebased
patch, but should not be a big deal.

I added the patch to the 2024-07 commitfest today.

2) Not having the patch in a CF also means cfbot is not running tests on
it. Which is unfortunate, because the patch actually has an a bug cfbot
would find - I've noticed it after running the tests through the github
CI, see [2].
3) The bug has this symptom:
ERROR: unrecognized node type: 268
CONTEXT: PL/pgSQL function check_estimated_rows(text) line 7 ...
STATEMENT: SELECT * FROM check_estimated_rows('SELECT a, b FROM ...
4) I can think of two basic ways to fix this issue - either allow
copying of the StatisticExtInto node, or represent the information in a
different way (e.g. add a new node for that purpose, or use existing
nodes to do that).

Thanks for the info. I'll investigate using cfbot.

To fix the problem, I understand we need to create a new struct like

(statistics OID, list of clauses, flag is_or).

5) In [3] Tom raised two possible issues with doing this - cost of
copying the information, and locking. For the performance concerns, I
think the first thing we should do is measuring how expensive it is. I
suggest measuring the overhead for about three basic cases:

Okay, I'll measure it once the patch is completed and check the overhead.

I read [3][4] and in my opinion I agree with Robert.
As with indexes, there should be a mechanism for determining whether

extended statistics are used or not. If it were available, users would be able to

tune using extended statistics and get better execution plans.

6) I'm not sure we want to have this under EXPLAIN (VERBOSE). It's what
I did in the initial PoC patch, but maybe we should invent a new flag
for this purpose, otherwise VERBOSE will cover too much stuff? I'm
thinking about "STATS" for example.

This would probably mean the patch should also add a new auto_explain
"log_" flag to enable/disable this.

I thought it might be better to do this, so I'll fix it.

7) The patch really needs some docs - I'd mention this in the EXPLAIN
docs, probably. There's also a chapter about estimates, maybe that
should mention this too? Try searching for places in the SGML docs
mentioning extended stats and/or explain, I guess.

I plan to create documentation after the specifications are finalized.

For tests, I guess stats_ext is the right place to test this. I'm not
sure what's the best way to do this, though. If it's covered by VERBOSE,
that seems it might be unstable - and that would be an issue. But maybe
we might add a function similar to check_estimated_rows(), but verifying
the query used the expected statistics to estimate expected clauses.

As for testing, I think it's more convenient for reviewers to include it in the patch,

so I'm thinking of including it in the next patch.

So there's stuff to do to make this committable, but hopefully this
review gives you some guidance regarding what/how ;-)

Thank you! It helps me a lot!

The attached patch does not correspond to the above comment.

But it does solve some of the issues mentioned in previous threads.

The next patch is planned to include:
6) Add stats option to explain command
8) Add regression test (stats_ext.sql)
4) Add new node (resolve errors in cfbot and prepared statement)

Regards,

Tatsuro Yamada

[1]
https://www.postgresql.org/message-id/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com

[2] https://cirrus-ci.com/build/6436352672137216

[3] https://www.postgresql.org/message-id/459863.1627419001%40sss.pgh.pa.us

[4]
https://www.postgresql.org/message-id/CA%2BTgmoZU34zo4%3DhyqgLH16iGpHQ6%2BQAesp7k5a1cfZB%3D%2B9xtsw%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

0001-show-stats-in-explain-rebased-on-15c9ac36.patch

Re: Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

28 June 2024, 06:45:35

Hi Tomas,

The attached patch does not correspond to the above comment.
But it does solve some of the issues mentioned in previous threads.

Oops, I made a mistake sending a patch on my previous email.

Attached patch is the right patch.

Regards,

Tatsuro Yamada

Attachment

0001-Add-Showing-applied-extended-statistics-in-explain-r3.patch

Re: Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

28 June 2024, 11:16:03

Hi Tomas and All,

Attached file is a new patch including:

6) Add stats option to explain command

7) The patch really needs some docs (partly)

>4) Add new node (resolve errors in cfbot and prepared statement)

I tried adding a new node in pathnode.h, but it doesn't work well.

So, it needs more time to implement it successfully because this is

the first time to add a new node in it.

8) Add regression test (stats_ext.sql)

Actually, I am not yet able to add new test cases to stats_ext.sql.

Instead, I created a simple test (test.sql) and have attached it.
Also, output.txt is the test result.

To add new test cases to stats_ext.sql,
I'd like to decide on a strategy for modifying it. In particular, there are

381 places where the check_estimated_rows function is used, so should I

include the same number of tests, or should we include the bare minimum

of tests that cover the code path? I think only the latter would be fine.

Any advice is appreciated. :-D

P.S.

I'm going to investigate how to use CI this weekend hopefully.

Regards,

Tatsuro Yamada

Attachment

RE: Showing applied extended statistics in explain Part 2

From

Date:

12 July 2024, 10:09:42

Hi,

Thanks for working the feature. As a user, I find it useful, and I'd like to use
it in v18! Although I've just started start looking into it, I have a few questions.


(1)

Is it better to make the order of output consistent? For example, even
though there are three clauses shown in the below case, the order does not
match.
* "Filter" shows that "id1" is first.
* "Ext Stats" shows that "id2" is first.

-- An example
DROP TABLE IF EXISTS test;
CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i));
create statistics test_s1 on id1, id2 from test; analyze;

=# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..23092.77 rows=84311 width=20)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
         Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))                                -- here
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))    -- here
(5 rows)



(2)

Do we really need the schema names without VERBOSE option? As in the above case,
"Ext Stats" shows schema name "public", even though the table name "test" isn't
shown with its schema name.

Additionally, if the VERBOSE option is specified, should the column names also be
printed with namespace?

=# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..22947.37 rows=82857 width=20)
   Output: id1, id2, id3, value
   Workers Planned: 2
   ->  Parallel Seq Scan on public.test  (cost=0.00..13661.67 rows=34524 width=20)
         Output: id1, id2, id3, value
         Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))   -- here
(7 rows)



(3)

I might be misunderstanding something, but do we need the clauses? Is there any
case where users would want to know the clauses? For example, wouldn't the
following be sufficient?

> Ext Stats: id1, id2 using test_s1



(4)

The extended statistics with "dependencies" or "ndistinct" option don't seem to
be shown in EXPLAIN output. Am I missing something? (Is this expected?)

I tested the examples in the documentation. Although it might work with
"mcv" option, I can't confirm that it works because "unrecognized node type"
error occurred in my environment.
https://www.postgresql.org/docs/current/sql-createstatistics.html

(It might be wrong since I'm beginner with extended stats codes.)
IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(),
and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct().


-- doesn't work with "dependencies" option?
=# \dX
                        List of extended statistics
 Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV
--------+---------+--------------------+-----------+--------------+---------
 public | s1      | a, b FROM t1       | (null)    | defined      | (null)
(2 rows)

=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3)
         Filter: ((a = 1) AND (b = 0))
         Rows Removed by Filter: 333300
 Planning Time: 0.246 ms
 Execution Time: 50.361 ms
(8 rows)

-- doesn't work with "ndistinct"?
=# \dX
                                             List of extended statistics
 Schema | Name |                            Definition                            | Ndistinct | Dependencies |  MCV
--------+------+------------------------------------------------------------------+-----------+--------------+--------
 public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined   | (null)       | (null)
(1 row)

postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1)
   Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone)
   Rows Removed by Filter: 480961
 Planning Time: 0.088 ms
 Execution Time: 144.590 ms
(5 rows)

-- doesn't work with "mvc". It might work, but the error happens in my environments
=# \dX
                    List of extended statistics
 Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV
--------+------+--------------+-----------+--------------+---------
 public | s2   | a, b FROM t2 | (null)    | (null)       | defined
(1 row)

-- I encountered the error with the query.
=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
ERROR:  unrecognized node type: 268



Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

15 July 2024, 14:03:34

On 6/26/24 11:06, Tatsuro Yamada wrote:
> Hi Tomas!
> 
> Thanks for the comments!
> 
> 1) The patch is not added to the CF app, which I think is a mistake. Can
>> you please add it to the 2024-07 commitfest? Otherwise people may not be
>> aware of it, won't do reviews etc. It'll require posting a rebased
>> patch, but should not be a big deal.
>>
> 
> I added the patch to the 2024-07 commitfest today.
> 
> 
> 2) Not having the patch in a CF also means cfbot is not running tests on
>> it. Which is unfortunate, because the patch actually has an a bug cfbot
>> would find - I've noticed it after running the tests through the github
>> CI, see [2].
>> 3) The bug has this symptom:
>>   ERROR:  unrecognized node type: 268
>>   CONTEXT:  PL/pgSQL function check_estimated_rows(text) line 7 ...
>>   STATEMENT:  SELECT * FROM check_estimated_rows('SELECT a, b FROM ...
>> 4) I can think of two basic ways to fix this issue - either allow
>> copying of the StatisticExtInto node, or represent the information in a
>> different way (e.g. add a new node for that purpose, or use existing
>> nodes to do that).
>>
> 
> Thanks for the info. I'll investigate using cfbot.
> To fix the problem, I understand we need to create a new struct like
> (statistics OID, list of clauses, flag is_or).
> 

Yes, something like that, in the plannodes.h.

> 
> 5) In [3] Tom raised two possible issues with doing this - cost of
>> copying the information, and locking. For the performance concerns, I
>> think the first thing we should do is measuring how expensive it is. I
>> suggest measuring the overhead for about three basic cases:
>>
> 
> Okay, I'll measure it once the patch is completed and check the overhead.
> I read [3][4] and in my opinion I agree with Robert.
> As with indexes, there should be a mechanism for determining whether
> extended statistics are used or not. If it were available, users would be
> able to
> tune using extended statistics and get better execution plans.
> 

I do agree with that, but I also understand Tom's concerns about the
costs. His concern is that to make this work, we have to keep/copy the
information for all queries, even if that user never does explain.

Yes, we do the same thing (copy of some pieces) for indexes, and from
this point of view it's equally reasonable. But there's the difference
that for indexes it's always been done this way, hence it's considered
"the baseline", while for extended stats we've not copied the data until
this patch, so it'd be seen as a regression.

I think there are two ways to deal with this - ideally, we'd show that
the overhead is negligible (~noise). And if it's measurable, we'd need
to argue that it's worth it - but that's much harder, IMHO.

So I'd suggest you try to measure the overhead on a couple cases (simple
query with 0 or more statistics applied).

> 
> 
>> 6) I'm not sure we want to have this under EXPLAIN (VERBOSE). It's what
>> I did in the initial PoC patch, but maybe we should invent a new flag
>> for this purpose, otherwise VERBOSE will cover too much stuff? I'm
>> thinking about "STATS" for example.
>>
>> This would probably mean the patch should also add a new auto_explain
>> "log_" flag to enable/disable this.
>>
> 
> I thought it might be better to do this, so I'll fix it.
> 

OK

> 
> 
>> 7) The patch really needs some docs - I'd mention this in the EXPLAIN
>> docs, probably. There's also a chapter about estimates, maybe that
>> should mention this too? Try searching for places in the SGML docs
>> mentioning extended stats and/or explain, I guess.
>>
> 
> I plan to create documentation after the specifications are finalized.
> 

I'm, not sure that's a good approach. Maybe it doesn't need to be
mentioned in the section explaining how estimates work, but it'd be good
to have it at least in the EXPLAIN command docs. The thing is - docs are
a nice way for reviewers to learn about how the feature is expected to
work / be used. Yes, it may need to be adjusted if the patch changes,
but it's likely much easier than changing the code.

> 
> 
>> For tests, I guess stats_ext is the right place to test this. I'm not
>> sure what's the best way to do this, though. If it's covered by VERBOSE,
>> that seems it might be unstable - and that would be an issue. But maybe
>> we might add a function similar to check_estimated_rows(), but verifying
>>  the query used the expected statistics to estimate expected clauses.
>>
> 
> As for testing, I think it's more convenient for reviewers to include it in
> the patch,
> so I'm thinking of including it in the next patch.
> 

I'm not sure I understand what you mean - what is more convenient to
include in the patch & you plan to include in the next patch version?

My opinion is that there clearly need to be some regression tests, be it
in stats_ext.sql or in some other script. But to make it easier, we
might have a function similar to check_estimated_rows() which would
extract just the interesting part of the plan.

> 
> So there's stuff to do to make this committable, but hopefully this
>> review gives you some guidance regarding what/how ;-)
>>
> 
> Thank you! It helps me a lot!
> 
> The attached patch does not correspond to the above comment.
> But it does solve some of the issues mentioned in previous threads.
> 
> The next patch is planned to include:
> 6) Add stats option to explain command
> 8) Add regression test (stats_ext.sql)
> 4) Add new node (resolve errors in cfbot and prepared statement)
> 

Sounds good.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

15 July 2024, 14:15:36

On 6/28/24 13:16, Tatsuro Yamada wrote:
> Hi Tomas and All,
> 
> Attached file is a new patch including:
>     6) Add stats option to explain command
>     7) The patch really needs some docs (partly)
> 
>  >4) Add new node (resolve errors in cfbot and prepared statement)
> 
> I tried adding a new node in pathnode.h, but it doesn't work well.
> So, it needs more time to implement it successfully because this is
> the first time to add a new node in it.
> 

I'm not sure why it didn't work well, and I haven't tried adding the
struct myself so I might be missing something important, but m
assumption was the new struct would go to plannodes.h. The planning
works in phases:

  parse -> build Path nodes -> pick cheapest Path -> create Plan

and it's the Plan that is printed by EXPLAIN. The pathnodes.h and
plannodes.h match this, so if it's expected to be in Plan it should go
to plannodes.h I think.

> 
>> 8) Add regression test (stats_ext.sql)
> 
> 
> Actually, I am not yet able to add new test cases to stats_ext.sql.

Why is that not possible? Can you explain?

> Instead, I created a simple test (test.sql) and have attached it.
> Also, output.txt is the test result.
> 
> To add new test cases to stats_ext.sql,
> I'd like to decide on a strategy for modifying it. In particular, there are
> 381 places where the check_estimated_rows function is used, so should I
> include the same number of tests, or should we include the bare minimum
> of tests that cover the code path? I think only the latter would be fine.
> Any advice is appreciated. :-D
> 

I don't understand. My suggestion was to create a new function, similar
to check_estimated_rows(), that's get a query, do EXPLAIN and extract
the list of applied statistics. Similar to what check_estimated_rows()
does for number of rows.

I did not mean to suggest you modify check_estimated_rows() to extract
both the number of rows and statistics, nor to modify the existing tests
(that's not very useful, because there's only one extended statistics in
each of those tests, and by testing the estimate we implicitly test that
it's applied).

My suggestion is to add a couple new queries, with multiple statistics
and multiple clauses etc. And then test the patch on those. You could do
simple EXPLAIN (COSTS OFF), or add the new function to make it a bit
more stable (but maybe it's not worth it).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

15 July 2024, 14:40:19

Hi,

Let me share my opinion on those questions ...


On 7/12/24 12:09, Masahiro.Ikeda@nttdata.com wrote:
> Hi,
> 
> Thanks for working the feature. As a user, I find it useful, and I'd like to use
> it in v18! Although I've just started start looking into it, I have a few questions.
> 
> 
> (1)
> 
> Is it better to make the order of output consistent? For example, even
> though there are three clauses shown in the below case, the order does not
> match.
> * "Filter" shows that "id1" is first.
> * "Ext Stats" shows that "id2" is first.
> 
> -- An example
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
> INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i));
> create statistics test_s1 on id1, id2 from test; analyze;
> 
> =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
>                                       QUERY PLAN                                       
> ---------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..23092.77 rows=84311 width=20)
>    Workers Planned: 2
>    ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
>          Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))                                -- here
>          Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))    -- here
> (5 rows)
> 

I don't think we need to make the order consistent. It probably wouldn't
hurt, but I'm not sure it's even possible for all scan types - for
example in an index scan, the clauses might be split between index
conditions and filters, etc.

> 
> 
> (2)
> 
> Do we really need the schema names without VERBOSE option? As in the above case,
> "Ext Stats" shows schema name "public", even though the table name "test" isn't
> shown with its schema name.
> 
> Additionally, if the VERBOSE option is specified, should the column names also be
> printed with namespace?
> 
> =# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
>                                       QUERY PLAN                                       
> ---------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..22947.37 rows=82857 width=20)
>    Output: id1, id2, id3, value
>    Workers Planned: 2
>    ->  Parallel Seq Scan on public.test  (cost=0.00..13661.67 rows=34524 width=20)
>          Output: id1, id2, id3, value
>          Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
>          Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))   -- here
> (7 rows)
> 

Yeah, I don't think there's a good reason to force printing schema for
the statistics, if it's not needed for the table. The rules should be
the same, I think.

> 
> 
> (3)
> 
> I might be misunderstanding something, but do we need the clauses? Is there any
> case where users would want to know the clauses? For example, wouldn't the
> following be sufficient?
> 
>> Ext Stats: id1, id2 using test_s1
> 

The stats may overlap, and some clauses may be matching multiple of
them. And some statistics do not support all clause types (e.g.
functional dependencies work only with equality conditions). Yes, you
might deduce which statistics are used for which clause, but it's not
trivial - interpreting explain is already not trivial, let's not make it
harder.

(If tracking the exact clauses turns out to be expensive, we might
revisit this - it might make it cheaper).

> 
> 
> (4)
> 
> The extended statistics with "dependencies" or "ndistinct" option don't seem to
> be shown in EXPLAIN output. Am I missing something? (Is this expected?)
> 
> I tested the examples in the documentation. Although it might work with
> "mcv" option, I can't confirm that it works because "unrecognized node type"
> error occurred in my environment.
> https://www.postgresql.org/docs/current/sql-createstatistics.html
> 
> (It might be wrong since I'm beginner with extended stats codes.)
> IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(),
> and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct().
> 
> 
> -- doesn't work with "dependencies" option?
> =# \dX
>                         List of extended statistics
>  Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV   
> --------+---------+--------------------+-----------+--------------+---------
>  public | s1      | a, b FROM t1       | (null)    | defined      | (null)
> (2 rows)
> 
> =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
>                                                     QUERY PLAN                                                     
> -------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3)
>          Filter: ((a = 1) AND (b = 0))
>          Rows Removed by Filter: 333300
>  Planning Time: 0.246 ms
>  Execution Time: 50.361 ms
> (8 rows)
> 
> -- doesn't work with "ndistinct"?
> =# \dX
>                                              List of extended statistics
>  Schema | Name |                            Definition                            | Ndistinct | Dependencies |  MCV

>
--------+------+------------------------------------------------------------------+-----------+--------------+--------
>  public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined   | (null)       |
(null)
> (1 row)
> 
> postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
>   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
>                                                 QUERY PLAN                                                
> ----------------------------------------------------------------------------------------------------------
>  Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1)
>    Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone)
>    Rows Removed by Filter: 480961
>  Planning Time: 0.088 ms
>  Execution Time: 144.590 ms
> (5 rows)
> 
> -- doesn't work with "mvc". It might work, but the error happens in my environments
> =# \dX
>                     List of extended statistics
>  Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV   
> --------+------+--------------+-----------+--------------+---------
>  public | s2   | a, b FROM t2 | (null)    | (null)       | defined
> (1 row)
> 
> -- I encountered the error with the query.
> =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
> ERROR:  unrecognized node type: 268
> 
> 

Yes, you're right we don't show some stats. For dependencies there's the
problem that we don't apply them individually, so it's not really
possible to map clauses to individual stats. I wonder if we might have a
special "entry" to show clauses estimated by the functional dependencies
combined from all stats (instead of a particular statistics).

For ndistinct, I think we don't show this because it doesn't go through
clauselist_selectivity, which is the only thing I modified in the PoC.
But I guess we might improve estimate_num_groups() to track the stats in
a similar way, I guess.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: Showing applied extended statistics in explain Part 2

From

Date:

18 July 2024, 10:37:42

> Let me share my opinion on those questions ...

Thanks! I could understand the patch well thanks to your comments.


> On 7/12/24 12:09, Masahiro.Ikeda@nttdata.com wrote:
> > Is it better to make the order of output consistent? For example, even
> > though there are three clauses shown in the below case, the order does
> > not match.
> > * "Filter" shows that "id1" is first.
> > * "Ext Stats" shows that "id2" is first.
> >
> > -- An example
> > DROP TABLE IF EXISTS test;
> > CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
> > INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM
> > generate_series(1,1000000) s(i)); create statistics test_s1 on id1,
> > id2 from test; analyze;
> >
> > =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
> >                                       QUERY PLAN
> > ----------------------------------------------------------------------
> > -----------------  Gather  (cost=1000.00..23092.77 rows=84311
> > width=20)
> >    Workers Planned: 2
> >    ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
> >          Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))
> -- here
> >          Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))
> -- here
> > (5 rows)
> >
>
> I don't think we need to make the order consistent. It probably wouldn't hurt, but I'm
> not sure it's even possible for all scan types - for example in an index scan, the clauses
> might be split between index conditions and filters, etc.

OK, I understand it isn't unexpected behavior.


> > (3)
> >
> > I might be misunderstanding something, but do we need the clauses? Is
> > there any case where users would want to know the clauses? For
> > example, wouldn't the following be sufficient?
> >
> >> Ext Stats: id1, id2 using test_s1
> >
>
> The stats may overlap, and some clauses may be matching multiple of them. And some
> statistics do not support all clause types (e.g.
> functional dependencies work only with equality conditions). Yes, you might deduce
> which statistics are used for which clause, but it's not trivial - interpreting explain is
> already not trivial, let's not make it harder.
>
> (If tracking the exact clauses turns out to be expensive, we might revisit this - it might
> make it cheaper).

Thanks. I agree that we need to show the clauses.


> > (4)
> >
> > The extended statistics with "dependencies" or "ndistinct" option
> > don't seem to be shown in EXPLAIN output. Am I missing something? (Is
> > this expected?)
> >
> > I tested the examples in the documentation. Although it might work
> > with "mcv" option, I can't confirm that it works because "unrecognized node type"
> > error occurred in my environment.
> > https://urldefense.com/v3/__https://www.postgresql.org/docs/current/sq
> > l-createstatistics.html__;!!GCTRfqYYOYGmgK_z!9H-FTXrhg7cr0U2r4PoKEeWM1
> > v9feP8I8zlNyhf-801n-KI8bIMAxOQgaetSTpek3ECk2_FKWEsuApVZ-ys-ka7rfjX8ANB
> > 9zQ$
> >
> > (It might be wrong since I'm beginner with extended stats codes.)
> > IIUC, the reason is that the patch only handles
> > statext_mcv_clauselist_selectivity(),
> > and doesn't handle dependencies_clauselist_selectivity() and
> estimate_multivariate_ndistinct().
> >
> >
> > -- doesn't work with "dependencies" option?
> > =# \dX
> >                         List of extended statistics
> >  Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> >  public | s1      | a, b FROM t1       | (null)    | defined      | (null)
> > (2 rows)
> >
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
> >                                                     QUERY PLAN
> > ----------------------------------------------------------------------
> > ---------------------------------------------
> >  Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual
> time=0.214..50.327 rows=100 loops=1)
> >    Workers Planned: 2
> >    Workers Launched: 2
> >    ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual
> time=30.300..46.610 rows=33 loops=3)
> >          Filter: ((a = 1) AND (b = 0))
> >          Rows Removed by Filter: 333300  Planning Time: 0.246 ms
> > Execution Time: 50.361 ms
> > (8 rows)
> >
> > -- doesn't work with "ndistinct"?
> > =# \dX
> >                                              List of extended statistics
> >  Schema | Name |                            Definition                            |
> Ndistinct | Dependencies |  MCV
> >
> --------+------+------------------------------------------------------------------+------
> -----+--------------+--------
> >  public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 |
> defined   | (null)       | (null)
> > (1 row)
> >
> > postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> >   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> >                                                 QUERY PLAN
> > ----------------------------------------------------------------------
> > ------------------------------------
> >  Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual
> time=0.027..143.199 rows=44640 loops=1)
> >    Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp
> without time zone)
> >    Rows Removed by Filter: 480961
> >  Planning Time: 0.088 ms
> >  Execution Time: 144.590 ms
> > (5 rows)
> >
> > -- doesn't work with "mvc". It might work, but the error happens in my
> > environments =# \dX
> >                     List of extended statistics
> >  Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV
> > --------+------+--------------+-----------+--------------+---------
> >  public | s2   | a, b FROM t2 | (null)    | (null)       | defined
> > (1 row)
> >
> > -- I encountered the error with the query.
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b =
> > 1);
> > ERROR:  unrecognized node type: 268
> >
> >
>
> Yes, you're right we don't show some stats. For dependencies there's the problem that
> we don't apply them individually, so it's not really possible to map clauses to individual
> stats. I wonder if we might have a special "entry" to show clauses estimated by the
> functional dependencies combined from all stats (instead of a particular statistics).

OK, I understand it's intended behavior for "dependencies" and we need to consider how to
show them in EXPLAIN output in future.


> For ndistinct, I think we don't show this because it doesn't go through
> clauselist_selectivity, which is the only thing I modified in the PoC.
> But I guess we might improve estimate_num_groups() to track the stats in a similar way,
> I guess.

Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because
the number of clauses is one though it goes through clauselist_selectivity().


> > ERROR:  unrecognized node type: 268

Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch?
change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to
show extra information for example "kind" of "StatisticExtInfo"?

The above is just one idea came up with while I read the following comments of header
of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy.
  * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
  * There are some subsidiary structs that are useful to copy, though.


By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice.
I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and "Group".
I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users
understand
when the stats are used, but I don't have any idea now.

-- I tested with the example of CREATE STATISTICS documentation.
psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
                                                                 QUERY PLAN
                     

---------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1)
   Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
   ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2)
               Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
               Sort Method: quicksort  Memory: 32kB
               Worker 0:  Sort Method: quicksort  Memory: 32kB
               ->  Partial HashAggregate  (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768
rows=249loops=2) 
                     Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
                     Batches: 1  Memory Usage: 45kB
                     Worker 0:  Batches: 1  Memory Usage: 45kB
                     ->  Parallel Seq Scan on t3  (cost=0.00..6963.66 rows=309177 width=16) (actual time=0.021..171.214
rows=262800loops=2) 
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)  --
here
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)  --
here
 Planning Time: 114327.206 ms
 Execution Time: 288.007 ms
(18 rows)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment

change_to_applied_stats_has_list_of_oids.diff

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

18 July 2024, 21:36:11

On 7/18/24 12:37, Masahiro.Ikeda@nttdata.com wrote:
>> Let me share my opinion on those questions ...
> ...>
>> For ndistinct, I think we don't show this because it doesn't go through
>> clauselist_selectivity, which is the only thing I modified in the PoC.
>> But I guess we might improve estimate_num_groups() to track the stats in a similar way,
>> I guess.
> 
> Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because
> the number of clauses is one though it goes through clauselist_selectivity().
> 

Ah, I see I misunderstood the original report. The query used was

  EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
    WHERE date_trunc('month', a) = '2020-01-01'::timestamp;

And it has nothing to do with the number of clauses being one neither.

The problem is this estimate is handled by examine_variable() matching
the expression to the "expression" stats, and injecting it into the
variable, so that the clauselist_selectivity() sees these stats.

This would happen even if you build just expression statistics on each
of the date_trunc() calls, and then tried a query with two clauses:

  CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
  CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;

  EXPLAIN SELECT * FROM t3
    WHERE date_trunc('month', a) = '2020-01-01'::timestamp
      AND date_trunc('day', 'a') = '2020-01-01'::timestamp;

Not sure how to handle this - we could remember when explain_variable()
injects statistics like this, I guess. But do we know that each call to
examine_variable() is for estimation? And do we know for which clause?

> 
>>> ERROR:  unrecognized node type: 268
> 
> Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
> because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch?
> change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to
> show extra information for example "kind" of "StatisticExtInfo"?
> 
> The above is just one idea came up with while I read the following comments of header
> of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy.
>   * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
>   * There are some subsidiary structs that are useful to copy, though.
> 

I do think tracking just the OID would work, because we already know how
to copy List objects. But if we want to also track the clauses, we'd
have to keep multiple lists, right? That seems a bit inconvenient.

> 
> By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice.
> I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and
"Group".
> I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users
understand
> when the stats are used, but I don't have any idea now.
> 
> -- I tested with the example of CREATE STATISTICS documentation.
> psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
>                                                                  QUERY PLAN
                      
 
>
---------------------------------------------------------------------------------------------------------------------------------------------
>  Group  (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1)
>    Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
>    ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1)
>          Workers Planned: 1
>          Workers Launched: 1
>          ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2)
>                Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
>                Sort Method: quicksort  Memory: 32kB
>                Worker 0:  Sort Method: quicksort  Memory: 32kB
>                ->  Partial HashAggregate  (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768
rows=249loops=2)
 
>                      Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
>                      Batches: 1  Memory Usage: 45kB
>                      Worker 0:  Batches: 1  Memory Usage: 45kB
>                      ->  Parallel Seq Scan on t3  (cost=0.00..6963.66 rows=309177 width=16) (actual
time=0.021..171.214rows=262800 loops=2)
 
>                            Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
--here
 
>                            Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
--here
 
>  Planning Time: 114327.206 ms
>  Execution Time: 288.007 ms
> (18 rows)
> 

I haven't looked into this, but my guess would be this is somehow
related to the parallelism - there's one parallel worker, which means we
have 2 processes to report stats for (leader + worker). And you get two
copies of the "Ext Stats" line. Could be a coincidence, ofc, but maybe
there's a loop to print some worker info, and you print the statistics
info in it?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: Showing applied extended statistics in explain Part 2

From

Date:

19 July 2024, 10:17:23

> On 7/18/24 12:37, Masahiro.Ikeda@nttdata.com wrote:
> >> Let me share my opinion on those questions ...
> > ...>
> >> For ndistinct, I think we don't show this because it doesn't go
> >> through clauselist_selectivity, which is the only thing I modified in the PoC.
> >> But I guess we might improve estimate_num_groups() to track the stats
> >> in a similar way, I guess.
> >
> > Thanks. IIUC, the reason is that it doesn't go through
> > statext_clauselist_selectivity() because the number of clauses is one though it goes
> through clauselist_selectivity().
> >
>
> Ah, I see I misunderstood the original report. The query used was
>
>   EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
>     WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
>
> And it has nothing to do with the number of clauses being one neither.
>
> The problem is this estimate is handled by examine_variable() matching the expression
> to the "expression" stats, and injecting it into the variable, so that the
> clauselist_selectivity() sees these stats.
>
> This would happen even if you build just expression statistics on each of the
> date_trunc() calls, and then tried a query with two clauses:
>
>   CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
>   CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;
>
>   EXPLAIN SELECT * FROM t3
>     WHERE date_trunc('month', a) = '2020-01-01'::timestamp
>       AND date_trunc('day', 'a') = '2020-01-01'::timestamp;
>
> Not sure how to handle this - we could remember when explain_variable() injects
> statistics like this, I guess. But do we know that each call to
> examine_variable() is for estimation? And do we know for which clause?

I see. The issue is related to extended statistics for single expression. As a
first step, it's ok for me that we don't support it.

The below is just an idea to know clauses...
Although I'm missing something, can callers of examine_variable()
for estimation to rebuild the clauses from partial information of "OpExpr"?

Only clause_selectivity_ext() knows the information of actual full clauses.
But we don't need full information. It's enough to know the information
to show "OpExpr" for EXPLAIN.

get_oper_expr() deparse "OpExpr" using only the operator oid and arguments
in get_oper_expr().

If so, the caller to estimate, for example eqsel_internal(), scalarineqsel_wrapper()
and so on, seems to be able to know the "OpExpr" information, which are operator
oid and arguments, and used extended statistics easily to show for EXPLAIN.

# Memo: the call path of the estimation function
 caller to estimate selectivity (eqsel_internal()/scalargtjoinsel_wrappter()/...)
  -> get_restriction_variable()/get_join_valiables()
   -> examine_variable()


> >>> ERROR:  unrecognized node type: 268
> >
> > Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
> > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the
> current patch?
> > change_to_applied_stats_has_list_of_oids.diff is the change I assumed.
> > Do you have any plan to show extra information for example "kind" of
> "StatisticExtInfo"?
> >
> > The above is just one idea came up with while I read the following
> > comments of header of pathnodes.h, and to support copy "StatisticExtInfo" will leads
> many other nodes to support copy.
> >   * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
> >   * There are some subsidiary structs that are useful to copy, though.
> >
>
> I do think tracking just the OID would work, because we already know how to copy List
> objects. But if we want to also track the clauses, we'd have to keep multiple lists, right?
> That seems a bit inconvenient.

Understood. In future, we might show not only the applied_clauses but also the clauses of
its extended statistics (StatisticExtInfo->exprs).


> > By the way, I found curios result while I tested with the above patch. It shows same
> "Ext Stats" twice.
> > I think it's expected behavior because the stat is used when estimate the cost of
> "Partial HashAggregate" and "Group".
> > I've shared the result because I could not understand soon when I saw
> > it first time. I think it's better to let users understand when the stats are used, but I
> don't have any idea now.
> >
> > -- I tested with the example of CREATE STATISTICS documentation.
> > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day',
> a) FROM t3 GROUP BY 1, 2;
> >                                                                  QUERY PLAN
> > ----------------------------------------------------------------------
> > ----------------------------------------------------------------------
> > -  Group  (cost=9530.56..9576.18 rows=365 width=16) (actual
> > time=286.908..287.909 rows=366 loops=1)
> >    Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> >    ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual
> time=286.904..287.822 rows=498 loops=1)
> >          Workers Planned: 1
> >          Workers Launched: 1
> >          ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual
> time=282.905..282.919 rows=249 loops=2)
> >                Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> >                Sort Method: quicksort  Memory: 32kB
> >                Worker 0:  Sort Method: quicksort  Memory: 32kB
> >                ->  Partial HashAggregate  (cost=8509.54..8515.02 rows=365
> width=16) (actual time=282.716..282.768 rows=249 loops=2)
> >                      Group Key: date_trunc('month'::text, a), date_trunc('day'::text,
> a)
> >                      Batches: 1  Memory Usage: 45kB
> >                      Worker 0:  Batches: 1  Memory Usage: 45kB
> >                      ->  Parallel Seq Scan on t3  (cost=0.00..6963.66 rows=309177
> width=16) (actual time=0.021..171.214 rows=262800 loops=2)
> >                            Ext Stats: public.s3  Clauses: date_trunc('month'::text,
> a), date_trunc('day'::text, a)  -- here
> >                            Ext Stats: public.s3  Clauses:
> > date_trunc('month'::text, a), date_trunc('day'::text, a)  -- here
> > Planning Time: 114327.206 ms  Execution Time: 288.007 ms
> > (18 rows)
> >
>
> I haven't looked into this, but my guess would be this is somehow related to the
> parallelism - there's one parallel worker, which means we have 2 processes to report
> stats for (leader + worker). And you get two copies of the "Ext Stats" line. Could be a
> coincidence, ofc, but maybe there's a loop to print some worker info, and you print the
> statistics info in it?

I think yes and no. In the above case, it relates to parallelism, but it doesn't print the
information per each worker.

-- Make the number of workers is 5 and EXPLAIN without ANALYZE option.
-- But "Ext Stats" is printed only twice.
=# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Group  (cost=4449.49..4489.50 rows=365 width=16)
   Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
   ->  Gather Merge  (cost=4449.49..4478.55 rows=1825 width=16)
         Workers Planned: 5
         ->  Sort  (cost=4449.41..4450.32 rows=365 width=16)
               Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
               ->  Partial HashAggregate  (cost=4428.40..4433.88 rows=365 width=16)
                     Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
                     ->  Parallel Seq Scan on t3  (cost=0.00..3902.80 rows=105120 width=16)
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
(11 rows)

When creating a group path, it creates partial grouping paths if possible, and then
creates the final grouping path. At this time, both the partial grouping path and
the final grouping path use the same RelOptInfo to repeatedly use the extended
statistics to know how many groups there will be. That's why it outputs only twice.
There may be other similar calculation for partial paths.

# The call path of the above query
 create_grouping_paths
  create_ordinary_grouping_paths
   create_partial_grouping_paths
    get_number_of_groups
     estimate_num_groups
      estimate_multivariate_ndistinct  -- first time to estimate the number of groups for partial grouping path
   get_number_of_groups
    estimate_num_groups
     estimate_multivariate_ndistinct  -- second time to estimate the number of groups for final grouping path


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Re: Showing applied extended statistics in explain Part 2

From

Ilia Evdokimov

Date:

18 November 2024, 15:52:42

Hi everyone!

Thank you for your work.

1) While exploring extended statistics, I encountered a bug that occurs 
when using EXPLAIN (STATS) with queries containing OR conditions:

CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t SELECT x/10+1, x, x + 10, x * 2 FROM 
generate_series(1,10000) g(x);
CREATE STATISTICS ON a, b FROM t;
CREATE STATISTICS ON c, d FROM t;
ANALYZE;

The following query works as expected:

EXPLAIN (STATS) SELECT * FROM t WHERE a > 0 AND b > 0 AND c > 0 AND d > 0;
                            QUERY PLAN
----------------------------------------------------------------
  Seq Scan on t  (cost=0.00..255.00 rows=10000 width=16)
    Filter: ((a > 0) AND (b > 0) AND (c > 0) AND (d > 0))
    Ext Stats: public.t_a_b_stat  Clauses: ((a > 0) AND (b > 0))
    Ext Stats: public.t_c_d_stat  Clauses: ((c > 0) AND (d > 0))
(4 rows)

However, when using OR conditions, the following query results in an error:

EXPLAIN (ANALYZE, STATS) SELECT * FROM t WHERE a > 0 AND b > 0 OR c > 0 
AND d > 0;
ERROR:  unrecognized node type: 314

2) It would be great if the STATS flag appeared as an option when 
pressing Tab during query input in the psql command-line interface.

Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

19 November 2024, 00:15:55

On 11/18/24 13:52, Ilia Evdokimov wrote:
> Hi everyone!
> 
> Thank you for your work.
> 
> 1) While exploring extended statistics, I encountered a bug that occurs
> when using EXPLAIN (STATS) with queries containing OR conditions:
> 
> CREATE TABLE t (a int, b int, c int, d int);
> INSERT INTO t SELECT x/10+1, x, x + 10, x * 2 FROM
> generate_series(1,10000) g(x);
> CREATE STATISTICS ON a, b FROM t;
> CREATE STATISTICS ON c, d FROM t;
> ANALYZE;
> 
> The following query works as expected:
> 
> EXPLAIN (STATS) SELECT * FROM t WHERE a > 0 AND b > 0 AND c > 0 AND d > 0;
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Seq Scan on t  (cost=0.00..255.00 rows=10000 width=16)
>    Filter: ((a > 0) AND (b > 0) AND (c > 0) AND (d > 0))
>    Ext Stats: public.t_a_b_stat  Clauses: ((a > 0) AND (b > 0))
>    Ext Stats: public.t_c_d_stat  Clauses: ((c > 0) AND (d > 0))
> (4 rows)
> 
> However, when using OR conditions, the following query results in an error:
> 
> EXPLAIN (ANALYZE, STATS) SELECT * FROM t WHERE a > 0 AND b > 0 OR c > 0
> AND d > 0;
> ERROR:  unrecognized node type: 314
> 

I believe this is the issue I mentioned when I first posted the original
version of this patch:

> 2) The deparsing is modeled (i.e. copied) from how we deal with index
> quals, but it's having issues with nested OR clauses, because there
> are nested RestrictInfo nodes and the deparsing does not expect that.

In other words, the AND-clauses happens to be parsed like this:

  BoolExpr (boolop=and)
    RestrictInfo (clause=OpExpr, ...)
    RestrictInfo (clause=OpExpr, ...)

And the deparse_expression() machinery does not expect RI nodes, which
is where the error message comes from.

An obvious solution would be to extend get_rule_expr() like this:

  case T_RestrictInfo:
    get_rule_expr((Node *) ((RestrictInfo *) node)->clause,
                  context, showimplicit);
    break;

But I think this would be wrong - AFAIK ruleutils.c is meant to handle
these nodes. As the ruleutils.c header says:

    Functions to convert stored expressions/querytrees back to
    source text

And RestrictInfo surely is not meant to be stored anywhere - it's a
runtime only node, caching some data.

So I think the correct solution is to not pass any expressions with
RestrictInfo to deparse_expression(). Either by stripping the nodes, or
by not adding them at all.

The patch tries to do the stripping by maybe_extract_actual_clauses(),
but that only looks at the top node, and that is not sufficient here.
Maybe it would be possible to walk the whole tree, and remove all the
RestrictInfos nodes - including intermediate ones, not just the top. But
I'm not quite sure it wouldn't cause issues elsewhere (assuming it
modifies the existing nodes). It still feels a bit like fixing a problem
we shouldn't really have ...

The only alternative approach I can think of is to make sure we never
add any RestrictInfo nodes in these lists. But we build them for
selectivity estimation, and the RestrictInfo is meant for that.

So I'm a bit unsure what to do about this :-(

In any case, I think this shows the patch needs more tests.

> 2) It would be great if the STATS flag appeared as an option when
> pressing Tab during query input in the psql command-line interface.
> 

True. Tab autocomplete would be nice.

regards

-- 
Tomas Vondra

Re: Showing applied extended statistics in explain Part 2

From

Ilia Evdokimov

Date:

02 December 2024, 20:00:23

On 19.11.2024 00:38, Tomas Vondra wrote:
> On 11/18/24 22:15, Tomas Vondra wrote:
>> ...
>>
>> So I think the correct solution is to not pass any expressions with
>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or
>> by not adding them at all.
>>
>> The patch tries to do the stripping by maybe_extract_actual_clauses(),
>> but that only looks at the top node, and that is not sufficient here.
>> Maybe it would be possible to walk the whole tree, and remove all the
>> RestrictInfos nodes - including intermediate ones, not just the top. But
>> I'm not quite sure it wouldn't cause issues elsewhere (assuming it
>> modifies the existing nodes). It still feels a bit like fixing a problem
>> we shouldn't really have ...
>>
> To make this idea a bit more concrete, here's a patch removing all
> RestrictInfo nodes in show_stat_qual(). But still, it feels wrong.
>
>
> regards
>

Yes, removing all 'RestrictInfos' during deparsing using 
'expression_tree_mutator()' is not optimal. However, I don't see an 
alternative. Perhaps it could be done this earlier in extended_stats.c 
to avoid the need for cleanup later ...

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Showing applied extended statistics in explain Part 2

From

Tom Lane

Date:

03 December 2024, 02:05:16

Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> writes:
> On 19.11.2024 00:38, Tomas Vondra wrote:
>> On 11/18/24 22:15, Tomas Vondra wrote:
>>> So I think the correct solution is to not pass any expressions with
>>> RestrictInfo to deparse_expression(). Either by stripping the nodes, or
>>> by not adding them at all.
>>> 
>>> The patch tries to do the stripping by maybe_extract_actual_clauses(),
>>> but that only looks at the top node, and that is not sufficient here.

Pardon me for being late to the party, but I don't understand why this
is difficult.  There should never be more than one layer of
RestrictInfos, at the top level of an implicitly-ANDed list of quals.
The only exception to this is that RestrictInfos representing OR
clauses have an additional field "orclause" that attaches
RestrictInfos to the elements of the OR list --- but the main "clause"
field doesn't look like that, and you can just ignore "orclause" if it
doesn't suit you.  So ISTM this doesn't need to be any harder than
what extract_actual_clauses() does (and has done for decades).

            regards, tom lane

Re: Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

05 February, 09:28:10

Hi All,

Thank you everyone for your cooperation with comments on the patch and solution ideas.
I am sorting through your review comments now. And after rebasing the patch, I plan to

send a patch that addresses the comments as much as possible to -hackers by Feb 21 at the latest.

Therefore, the status of the patch for this feature from last month's commit fest will be

"moving to the next commit fest".

P.S.

Thanks for letting me know, Ilia.

Thanks,

Tatsuro Yamada

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

08 February, 16:50:57

On 1/24/25 11:17, Andrei Lepikhov wrote:
> On 11/1/24 12:22, Tatsuro Yamada wrote:
>> Hi All,
>>
>> I apologize for not being able to continue development due to various
>> circumstances.
>> The attached file is the rebased patch.
>> I will now catch up on the discussion and try to revise the patch.
> I wonder why it’s so important to know exactly where and who has used
> extended statistics.
> 
> I often use SQL Server to compare execution plans generated by
> PostgreSQL, and I appreciate how they display the usage of extended
> statistics. They clearly identify which statistics were utilized during
> query planning and include them in the summary section of query plan.
> 
I admit not knowing what exactly SQL Server shows in the explain, but it
seems helpful to know which part of the plan used which statistics, no?
Imagine we only knew an index was used, but not which node used it and
for what keys. That would be a bit ... useless.

Or what info does the SQL server include in the plan, exactly? Can you
share an example?

> I find this method much easier to implement, as it allows us to see any
> usage points - remember that `estimate_num_groups` may be applied in
> multiple places and may not always correspond to a node clause.
> 

I may be wrong, but I don't quite see why would that be significantly
easier to implement. You still need to track the information somewhere,
and it probably needs for individual Path nodes. Because while building
the plan you don't know which paths will get selected.

Although, maybe the selectivity estimation is sufficiently independent
from the exact paths? In which case we might track it at the plan level.

Still, I don't think that really makes this much easier to implement.
The code would only move a little bit to a different place, but other
than that it would remain the same.

> It’s worth noting that some clauses may be transformed during the
> planning process, and the scan filter may not align with the estimated
> clause. It’s possible that certain clauses might not appear in the final
> estimated plan based on the extended statistics.
> 
> If necessary, we could add an `extstat_ID` to the summary to reference
> it in the plan nodes.
> 

Not sure, but I'd prefer not to add "indirection" the people would have
to follow in the explain plan. I'd much rather duplicate the same info
(which should be rare anyway, we usually don't have the same statistics
used in multiple places in one query).

The point about estimate_num_groups is good - I think there will be more
cases where linking the extended statistics to a clause will be hard.
But maybe let's not block the whole patch because of that?

regards

-- 
Tomas Vondra

Re: Showing applied extended statistics in explain Part 2

From

Andrei Lepikhov

Date:

10 February, 12:09:53

On 8/2/2025 20:50, Tomas Vondra wrote:
> 
> 
> On 1/24/25 11:17, Andrei Lepikhov wrote:
>> On 11/1/24 12:22, Tatsuro Yamada wrote:
>> I often use SQL Server to compare execution plans generated by
>> PostgreSQL, and I appreciate how they display the usage of extended
>> statistics. They clearly identify which statistics were utilized during
>> query planning and include them in the summary section of query plan.
>>
> I admit not knowing what exactly SQL Server shows in the explain, but it
> seems helpful to know which part of the plan used which statistics, no?
I wonder if you meant clauses, not a part of the plan. Clauses may be 
used on different levels of the join tree, but they are estimated once.
I meant we may just refer to the statistic used by its name in the 
node's explanation if this node contains estimated clause. I use 
multiple extended statistics. Sometimes, they intersect, and it is hard 
to say which one was used for a specific clause.
> Imagine we only knew an index was used, but not which node used it and
> for what keys. That would be a bit ... useless.
Sure, but I am suspicious about the strong necessity to show a specific 
clause estimated by the statistic. It is almost obvious because of the 
simple choosing algorithm.
> 
> Or what info does the SQL server include in the plan, exactly? Can you
> share an example?
I wouldn't say SQL Server designed it ideally. In an XML file, you can 
find something like this:

<StatisticsInfo LastUpdate="2024-09-09T21:55:04.43" 
ModificationCount="0" SamplingPercent="17.9892" 
Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]" 
Schema="[dbo]" Database="[DB]"></StatisticsInfo>

The main profit here - you see all the stats involved in estimations 
(and their state), even if final plan doesn't contain estimated stuff at 
all.
> 
>> I find this method much easier to implement, as it allows us to see any
>> usage points - remember that `estimate_num_groups` may be applied in
>> multiple places and may not always correspond to a node clause.
>>
> 
> I may be wrong, but I don't quite see why would that be significantly
> easier to implement. You still need to track the information somewhere,
> and it probably needs for individual Path nodes. Because while building
> the plan you don't know which paths will get selected.
In my mind, gathering extended statistics usage data may be implemented 
like it is already done for many parameters in the stat collector. It 
may help identify whether statistics are used frequently or not.
> 
> Although, maybe the selectivity estimation is sufficiently independent
> from the exact paths? In which case we might track it at the plan level.
Not only that argument. Usually, when I request a DBMS about an EXPLAIN, 
my typical desire is to realise which plan we have and why the optimiser 
has skipped another, maybe even more optimal, plan. I wonder if people 
frequently have the same intention.
That means you will never see clauses (and their estimations) that were 
kicked off the plan - ppi_clauses of a parameterised path, for example. 
And it may hide the real mechanic that caused skipping the better plan.
Mentioning all statistics involved in the summary may reveal such info.

> 
> The point about estimate_num_groups is good - I think there will be more
> cases where linking the extended statistics to a clause will be hard.
> But maybe let's not block the whole patch because of that?
Sure, if the current code doesn't block further improvements of showing 
all statistics involved in the query planning process.

I just wanted to emphasize the key idea: quite frequently we don't see 
in explain stuff that triggered suboptimal plan, because it was 
overestimated and excluded: an index, join, clause ... .

Anyway, thanks for answers, I will discover the code more.

-- 
regards, Andrei Lepikhov

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

10 February, 18:56:45

On 2/10/25 10:09, Andrei Lepikhov wrote:
> On 8/2/2025 20:50, Tomas Vondra wrote:
>>
>>
>> On 1/24/25 11:17, Andrei Lepikhov wrote:
>>> On 11/1/24 12:22, Tatsuro Yamada wrote:
>>> I often use SQL Server to compare execution plans generated by
>>> PostgreSQL, and I appreciate how they display the usage of extended
>>> statistics. They clearly identify which statistics were utilized during
>>> query planning and include them in the summary section of query plan.
>>>
>> I admit not knowing what exactly SQL Server shows in the explain, but it
>> seems helpful to know which part of the plan used which statistics, no?
>
> I wonder if you meant clauses, not a part of the plan. Clauses may be
> used on different levels of the join tree, but they are estimated once.
>
> I meant we may just refer to the statistic used by its name in the
> node's explanation if this node contains estimated clause. I use
> multiple extended statistics. Sometimes, they intersect, and it is hard
> to say which one was used for a specific clause.

I did mean "part of the plan", i.e. the actual operation. That being
said, maybe you're right the exact node doesn't matter all that much,
and it'd be better to have a separate list of stats for the whole plan.

That'd probably work better for stuff like estimate_num_groups(), which
is hard to assign to a particular operation ...

>> Imagine we only knew an index was used, but not which node used it and
>> for what keys. That would be a bit ... useless.
> Sure, but I am suspicious about the strong necessity to show a specific
> clause estimated by the statistic. It is almost obvious because of the
> simple choosing algorithm.

You could say the same thing about indexes, The process of picking an
index is fairly straightforward, same as determining keys and filters
for the index scan. Yet we still show all of this in the plan. The
algorithm may be simple, but the outcomes are far from obvious. And we
may also enhance/rework the algorithm in the future ... I don't think we
should require intimate knowledge of these details from users.

Also, a couple paragraphs back you wrote:

    Sometimes, they intersect, and it is hard to say which one was used
    for a specific clause.

Doesn't that really contradict your claim that it's "almost obvious"?

>>
>> Or what info does the SQL server include in the plan, exactly? Can you
>> share an example?
> I wouldn't say SQL Server designed it ideally. In an XML file, you can
> find something like this:
> 
> <StatisticsInfo LastUpdate="2024-09-09T21:55:04.43"
> ModificationCount="0" SamplingPercent="17.9892"
> Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]"
> Schema="[dbo]" Database="[DB]"></StatisticsInfo>
> 
> The main profit here - you see all the stats involved in estimations
> (and their state), even if final plan doesn't contain estimated stuff at
> all.

OK, that seems very underwhelming. I still think we should show which
clauses were estimated using which statistics object.

>>
>>> I find this method much easier to implement, as it allows us to see any
>>> usage points - remember that `estimate_num_groups` may be applied in
>>> multiple places and may not always correspond to a node clause.
>>>
>>
>> I may be wrong, but I don't quite see why would that be significantly
>> easier to implement. You still need to track the information somewhere,
>> and it probably needs for individual Path nodes. Because while building
>> the plan you don't know which paths will get selected.
> In my mind, gathering extended statistics usage data may be implemented
> like it is already done for many parameters in the stat collector. It
> may help identify whether statistics are used frequently or not.

Isn't the statistics collector dealing with a completely different use
case? Sure, maybe it would be interesting to track how many queries used
which statistics object / for what purpose. A bit like we track counts
for indexes etc.

But that seems very different from what this patch aims to do, which is
to track info about how a particular query used extended stats.

I suppose we could have a "buffer" to remember which statistics objects
were used by a single statement/transaction, just like we do for tables
or indexes. But I still think we should show clauses estimated by each
statistics object in a given query, and this would not help with that.

>>
>> Although, maybe the selectivity estimation is sufficiently independent
>> from the exact paths? In which case we might track it at the plan level.
> Not only that argument. Usually, when I request a DBMS about an EXPLAIN,
> my typical desire is to realise which plan we have and why the optimiser
> has skipped another, maybe even more optimal, plan. I wonder if people
> frequently have the same intention.
> That means you will never see clauses (and their estimations) that were
> kicked off the plan - ppi_clauses of a parameterised path, for example.
> And it may hide the real mechanic that caused skipping the better plan.
> Mentioning all statistics involved in the summary may reveal such info.
> 

I don't quite agree with this argument. Explain is meant to describe one
particular plan, not why some other plans were not selected. So I don't
think we should be adding details about estimating clauses from some
other alternative plan. We're not showing info about costing for other
indexes either, for example.

Also, if we started adding more and more statistics, that'd be just an
extra argument against the "almost obvious" claim. Because not only
you'd have to deduce which clauses were estimated by each object, you'd
also first have to figure out which statistics were used for the plan.

>>
>> The point about estimate_num_groups is good - I think there will be more
>> cases where linking the extended statistics to a clause will be hard.
>> But maybe let's not block the whole patch because of that?
> Sure, if the current code doesn't block further improvements of showing
> all statistics involved in the query planning process.
> 
> I just wanted to emphasize the key idea: quite frequently we don't see
> in explain stuff that triggered suboptimal plan, because it was
> overestimated and excluded: an index, join, clause ... .
> 
> Anyway, thanks for answers, I will discover the code more.
> 

Thanks for the discussion.

I wonder what Yamada-san thinks about these suggestions ... He's the one
actually developing the patch, so I'd like to know his opinions.

regards

-- 
Tomas Vondra

Re: Showing applied extended statistics in explain Part 2

From

Tatsuro Yamada

Date:

12 February, 08:02:20

Hi Tomas and ALL,

>I wonder what Yamada-san thinks about these suggestions ... He's the one
>actually developing the patch, so I'd like to know his opinions.

I will state my thoughts on the two points of discussion.

>>>> I often use SQL Server to compare execution plans generated by
>>>> PostgreSQL, and I appreciate how they display the usage of extended
>>>> statistics. They clearly identify which statistics were utilized during
>>>> query planning and include them in the summary section of query plan.
>>>>
>>> I admit not knowing what exactly SQL Server shows in the explain, but it
>>> seems helpful to know which part of the plan used which statistics, no?
>>
>> I wonder if you meant clauses, not a part of the plan. Clauses may be
>> used on different levels of the join tree, but they are estimated once.
>>
>> I meant we may just refer to the statistic used by its name in the
>> node's explanation if this node contains estimated clause. I use
>> multiple extended statistics. Sometimes, they intersect, and it is hard
>> to say which one was used for a specific clause.
>
>I did mean "part of the plan", i.e. the actual operation. That being
>said, maybe you're right the exact node doesn't matter all that much,
>and it'd be better to have a separate list of stats for the whole plan.
>
>That'd probably work better for stuff like estimate_num_groups(), which
>is hard to assign to a particular operation ...

The patch currently under development aims to display the actual usage of
extended statistics on a plan-by-plan basis.
I thought that being able to check the extended statistics used by plan nodes
in a plan would make it easier to tune plans using extended statistics.

On the other hand, I also think that a mechanism to check the usage of
extended statistics on a DB-by-DB basis would be useful.
I talked about this idea in my talk at PGConf.dev 2024.
This is similar to pg_stat_user_indexes for indexes. If you can find unused or
infrequently used extended statistics, you can delete unnecessary ones.

However, this functionality is not included in this patch. I plan to provide it as

a separate patch after the current patch is committed.

>>> I may be wrong, but I don't quite see why would that be significantly
>>> easier to implement. You still need to track the information somewhere,
>>> and it probably needs for individual Path nodes. Because while building
>>> the plan you don't know which paths will get selected.
>> In my mind, gathering extended statistics usage data may be implemented
>> like it is already done for many parameters in the stat collector. It
>> may help identify whether statistics are used frequently or not.
>
>Isn't the statistics collector dealing with a completely different use
>case? Sure, maybe it would be interesting to track how many queries used
>which statistics object / for what purpose. A bit like we track counts
>for indexes etc.
>
>But that seems very different from what this patch aims to do, which is
>to track info about how a particular query used extended stats.
>
>I suppose we could have a "buffer" to remember which statistics objects
>were used by a single statement/transaction, just like we do for tables
>or indexes. But I still think we should show clauses estimated by each
>statistics object in a given query, and this would not help with that.

Ideally, it would be nice to be able to show which extended statistics
were used for which clauses. However, based on the discussion so far,
this seems difficult to implement.

A second best solution would be to add a summary of the extended statistics
used at the end of the plan. This alone would make it easier for users to
use the extended statistics.

The expected output is as follows:
- Do not output "Clauses: ".
- Display only the names of the extended statistics used.
- If displaying the names of multiple extended statistics,
should they be joined by commas? (Or should Ext Stats: be displayed
on multiple lines?)

e.g.,
explain (stats) select 1 from t where a = 1 and b =1;

## Output from the current patch
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1943.00 rows=10210 width=4)
Filter: ((a = 1) AND (b = 1))
Ext Stats: s Clauses: ((a = 1) AND (b = 1))
(3 rows)

## Output for displaying summary extended statistics
QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..1943.00 rows=10210 width=4)
Filter: ((a = 1) AND (b = 1))
Planning:
Ext Stats: s <== Is it acceptable?

If we do not display "Clause" information, I believe that there is
no need to deparse clauses, and therefore no need to improve the deparse
function to enable recursive traversal of expression trees.

To further the discussion, I will modify the patch to display the summary
as shown above and send it to -hackers.

P.S.
I have tried to understand the discussion correctly, but please
let me know if I have misunderstood anything.

Regards,

Tatsuro Yamada

Re: Showing applied extended statistics in explain Part 2

From

Tomas Vondra

Date:

12 February, 14:11:36

On 2/12/25 06:02, Tatsuro Yamada wrote:
> Hi Tomas and ALL,
> 
>>I wonder what Yamada-san thinks about these suggestions ... He's the one
>>actually developing the patch, so I'd like to know his opinions.
> 
> I will state my thoughts on the two points of discussion.
> 
>>>>> I often use SQL Server to compare execution plans generated by
>>>>> PostgreSQL, and I appreciate how they display the usage of extended
>>>>> statistics. They clearly identify which statistics were utilized during
>>>>> query planning and include them in the summary section of query plan.
>>>>>
>>>> I admit not knowing what exactly SQL Server shows in the explain, but it
>>>> seems helpful to know which part of the plan used which statistics, no?
>>>
>>> I wonder if you meant clauses, not a part of the plan. Clauses may be
>>> used on different levels of the join tree, but they are estimated once.
>>>
>>> I meant we may just refer to the statistic used by its name in the
>>> node's explanation if this node contains estimated clause. I use
>>> multiple extended statistics. Sometimes, they intersect, and it is hard
>>> to say which one was used for a specific clause.
>>
>>I did mean "part of the plan", i.e. the actual operation. That being
>>said, maybe you're right the exact node doesn't matter all that much,
>>and it'd be better to have a separate list of stats for the whole plan.
>>
>>That'd probably work better for stuff like estimate_num_groups(), which
>>is hard to assign to a particular operation ...
> 
> 
> The patch currently under development aims to display the actual usage of
> extended statistics on a plan-by-plan basis.
> I thought that being able to check the extended statistics used by plan
> nodes
> in a plan would make it easier to tune plans using extended statistics.
> 
> On the other hand, I also think that a mechanism to check the usage of
> extended statistics on a DB-by-DB basis would be useful.
> I talked about this idea in my talk at PGConf.dev 2024.
> This is similar to pg_stat_user_indexes for indexes. If you can find
> unused or
> infrequently used extended statistics, you can delete unnecessary ones.
> However, this functionality is not included in this patch. I plan to
> provide it as 
> a separate patch after the current patch is committed.
> 

+1 to treating the DB-level stuff (i.e. system catalog showing how often
was each statistics used, etc.) as a separate feature / patch. I did not
mean to suggest this should be included in this patch, my point was that
it's very different from what this patch aimed to do.

> 
>>>> I may be wrong, but I don't quite see why would that be significantly
>>>> easier to implement. You still need to track the information somewhere,
>>>> and it probably needs for individual Path nodes. Because while building
>>>> the plan you don't know which paths will get selected.
>>> In my mind, gathering extended statistics usage data may be implemented
>>> like it is already done for many parameters in the stat collector. It
>>> may help identify whether statistics are used frequently or not.
>>
>>Isn't the statistics collector dealing with a completely different use
>>case? Sure, maybe it would be interesting to track how many queries used
>>which statistics object / for what purpose. A bit like we track counts
>>for indexes etc.
>>
>>But that seems very different from what this patch aims to do, which is
>>to track info about how a particular query used extended stats.
>>
>>I suppose we could have a "buffer" to remember which statistics objects
>>were used by a single statement/transaction, just like we do for tables
>>or indexes. But I still think we should show clauses estimated by each
>>statistics object in a given query, and this would not help with that.
> 
> Ideally, it would be nice to be able to show which extended statistics
> were used for which clauses. However, based on the discussion so far,
> this seems difficult to implement.
> 

Maybe I'm missing something, but why would this be difficult to
implement? There was a problem with RestrictInfo nodes, but doesn't
Tom's suggestion [1] to use extract_actual_clauses() address this? I
might have done something silly in the PoC patch.

https://www.postgresql.org/message-id/1691487.1733180716@sss.pgh.pa.us

> A second best solution would be to add a summary of the extended statistics
> used at the end of the plan. This alone would make it easier for users to
> use the extended statistics.
> 
> The expected output is as follows:
>     - Do not output "Clauses: ".
>     - Display only the names of the extended statistics used.
>     - If displaying the names of multiple extended statistics,
>       should they be joined by commas? (Or should Ext Stats: be displayed
>       on multiple lines?)
> 

The list of stats is better than no information at all, but I don't
quite see why we couldn't show clauses, and to me the clauses seem like
a pretty important piece of information.

> e.g.,
> explain (stats) select 1 from t where a = 1 and b =1;
> 
> ## Output from the current patch
>                        QUERY PLAN
> --------------------------------------------------------
>  Seq Scan on t  (cost=0.00..1943.00 rows=10210 width=4)
>    Filter: ((a = 1) AND (b = 1))
>    Ext Stats: s  Clauses: ((a = 1) AND (b = 1))
> (3 rows)
> 
> 
> ## Output for displaying summary extended statistics
>                        QUERY PLAN
> --------------------------------------------------------
>  Seq Scan on t  (cost=0.00..1943.00 rows=10210 width=4)
>    Filter: ((a = 1) AND (b = 1))
>  Planning:
>    Ext Stats: s  <== Is it acceptable?
> 
> If we do not display "Clause" information, I believe that there is
> no need to deparse clauses, and therefore no need to improve the deparse
> function to enable recursive traversal of expression trees.
> 

What do you mean by the need to improve the deparse function? Is this
about the nested RestrictInfos? Have you tried doing that Tom suggested
in [1], or does it not address the issue?

> To further the discussion, I will modify the patch to display the summary
> as shown above and send it to -hackers.
> 
> P.S.
> I have tried to understand the discussion correctly, but please
> let me know if I have misunderstood anything.
> 

Thanks for working on this!


regards

-- 
Tomas Vondra

Re: Showing applied extended statistics in explain Part 2

From

Andrei Lepikhov

Date:

24 April, 17:31:14

On 2/10/25 16:56, Tomas Vondra wrote:
> On 2/10/25 10:09, Andrei Lepikhov wrote:
>> On 8/2/2025 20:50, Tomas Vondra wrote:
>> <StatisticsInfo LastUpdate="2024-09-09T21:55:04.43"
>> ModificationCount="0" SamplingPercent="17.9892"
>> Statistics="[_WA_Sys_00000001_0A55DF1D]" Table="[_InfoRgX]"
>> Schema="[dbo]" Database="[DB]"></StatisticsInfo>
>>
>> The main profit here - you see all the stats involved in estimations
>> (and their state), even if final plan doesn't contain estimated stuff at
>> all.
> 
> OK, that seems very underwhelming. I still think we should show which
> clauses were estimated using which statistics object.
To understand how it may work, I employed the EXPLAIN extensibility 
introduced in PG 18 to show the use of plain statistics [1]. It looks 
like the following:

EXPLAIN (COSTS OFF, STAT ON)
SELECT * FROM sc_a WHERE x=1 AND y LIKE 'a';

  Seq Scan on sc_a
    Filter: ((y ~~ 'a'::text) AND (x = 1))
  Statistics:
    "sc_a.y: 1 times, stats: { MCV: 10 values, Correlation,
        ndistinct: 10.0000, nullfrac: 0.0000, width: 5 }
    "sc_a.x: 1 times, stats: { Histogram: 0 values, Correlation,
         ndistinct: -1.0000, nullfrac: 0.0000, width: 4 }

As you can see, stat usage is summarised at the end of the EXPLAIN. It 
contains information about the column, how many times it was used and 
the parameters of statistic slots.
Of course, being an extension it is constrained a lot, but even there is 
the profit:
1. You may see types of statistics exist on the column
2. Precision of the histogram or MCV (statistic_target) on a specific 
table - some users forget to increase it on large (or partitioned) tables
3. You have basic stat like nullfrac, ndistinct without the necessity to 
teach personnel how to gather it on a production instance safely.

Also, using it in real cases, I realised that it would be highly 
profitable to specify which statistic type was used to estimate this 
specific clause.

Of course, extended statistics have their own specifics, which may 
require another output format. Just consider this example too.

[1] https://github.com/danolivo/pg_index_stats

-- 
regards, Andrei Lepikhov