RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

From
Subject RE: Showing applied extended statistics in explain Part 2
Date
Msg-id TYWPR01MB109820AEAB1E18F89404A8304B1AC2@TYWPR01MB10982.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Showing applied extended statistics in explain Part 2  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Showing applied extended statistics in explain Part 2
List pgsql-hackers
> Let me share my opinion on those questions ...

Thanks! I could understand the patch well thanks to your comments.


> On 7/12/24 12:09, Masahiro.Ikeda@nttdata.com wrote:
> > Is it better to make the order of output consistent? For example, even
> > though there are three clauses shown in the below case, the order does
> > not match.
> > * "Filter" shows that "id1" is first.
> > * "Ext Stats" shows that "id2" is first.
> >
> > -- An example
> > DROP TABLE IF EXISTS test;
> > CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
> > INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM
> > generate_series(1,1000000) s(i)); create statistics test_s1 on id1,
> > id2 from test; analyze;
> >
> > =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
> >                                       QUERY PLAN
> > ----------------------------------------------------------------------
> > -----------------  Gather  (cost=1000.00..23092.77 rows=84311
> > width=20)
> >    Workers Planned: 2
> >    ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
> >          Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))
> -- here
> >          Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))
> -- here
> > (5 rows)
> >
>
> I don't think we need to make the order consistent. It probably wouldn't hurt, but I'm
> not sure it's even possible for all scan types - for example in an index scan, the clauses
> might be split between index conditions and filters, etc.

OK, I understand it isn't unexpected behavior.


> > (3)
> >
> > I might be misunderstanding something, but do we need the clauses? Is
> > there any case where users would want to know the clauses? For
> > example, wouldn't the following be sufficient?
> >
> >> Ext Stats: id1, id2 using test_s1
> >
>
> The stats may overlap, and some clauses may be matching multiple of them. And some
> statistics do not support all clause types (e.g.
> functional dependencies work only with equality conditions). Yes, you might deduce
> which statistics are used for which clause, but it's not trivial - interpreting explain is
> already not trivial, let's not make it harder.
>
> (If tracking the exact clauses turns out to be expensive, we might revisit this - it might
> make it cheaper).

Thanks. I agree that we need to show the clauses.


> > (4)
> >
> > The extended statistics with "dependencies" or "ndistinct" option
> > don't seem to be shown in EXPLAIN output. Am I missing something? (Is
> > this expected?)
> >
> > I tested the examples in the documentation. Although it might work
> > with "mcv" option, I can't confirm that it works because "unrecognized node type"
> > error occurred in my environment.
> > https://urldefense.com/v3/__https://www.postgresql.org/docs/current/sq
> > l-createstatistics.html__;!!GCTRfqYYOYGmgK_z!9H-FTXrhg7cr0U2r4PoKEeWM1
> > v9feP8I8zlNyhf-801n-KI8bIMAxOQgaetSTpek3ECk2_FKWEsuApVZ-ys-ka7rfjX8ANB
> > 9zQ$
> >
> > (It might be wrong since I'm beginner with extended stats codes.)
> > IIUC, the reason is that the patch only handles
> > statext_mcv_clauselist_selectivity(),
> > and doesn't handle dependencies_clauselist_selectivity() and
> estimate_multivariate_ndistinct().
> >
> >
> > -- doesn't work with "dependencies" option?
> > =# \dX
> >                         List of extended statistics
> >  Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> > --------+---------+--------------------+-----------+--------------+---
> >  public | s1      | a, b FROM t1       | (null)    | defined      | (null)
> > (2 rows)
> >
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
> >                                                     QUERY PLAN
> > ----------------------------------------------------------------------
> > ---------------------------------------------
> >  Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual
> time=0.214..50.327 rows=100 loops=1)
> >    Workers Planned: 2
> >    Workers Launched: 2
> >    ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual
> time=30.300..46.610 rows=33 loops=3)
> >          Filter: ((a = 1) AND (b = 0))
> >          Rows Removed by Filter: 333300  Planning Time: 0.246 ms
> > Execution Time: 50.361 ms
> > (8 rows)
> >
> > -- doesn't work with "ndistinct"?
> > =# \dX
> >                                              List of extended statistics
> >  Schema | Name |                            Definition                            |
> Ndistinct | Dependencies |  MCV
> >
> --------+------+------------------------------------------------------------------+------
> -----+--------------+--------
> >  public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 |
> defined   | (null)       | (null)
> > (1 row)
> >
> > postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
> >   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
> >                                                 QUERY PLAN
> > ----------------------------------------------------------------------
> > ------------------------------------
> >  Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual
> time=0.027..143.199 rows=44640 loops=1)
> >    Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp
> without time zone)
> >    Rows Removed by Filter: 480961
> >  Planning Time: 0.088 ms
> >  Execution Time: 144.590 ms
> > (5 rows)
> >
> > -- doesn't work with "mvc". It might work, but the error happens in my
> > environments =# \dX
> >                     List of extended statistics
> >  Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV
> > --------+------+--------------+-----------+--------------+---------
> >  public | s2   | a, b FROM t2 | (null)    | (null)       | defined
> > (1 row)
> >
> > -- I encountered the error with the query.
> > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b =
> > 1);
> > ERROR:  unrecognized node type: 268
> >
> >
>
> Yes, you're right we don't show some stats. For dependencies there's the problem that
> we don't apply them individually, so it's not really possible to map clauses to individual
> stats. I wonder if we might have a special "entry" to show clauses estimated by the
> functional dependencies combined from all stats (instead of a particular statistics).

OK, I understand it's intended behavior for "dependencies" and we need to consider how to
show them in EXPLAIN output in future.


> For ndistinct, I think we don't show this because it doesn't go through
> clauselist_selectivity, which is the only thing I modified in the PoC.
> But I guess we might improve estimate_num_groups() to track the stats in a similar way,
> I guess.

Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because
the number of clauses is one though it goes through clauselist_selectivity().


> > ERROR:  unrecognized node type: 268

Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch?
change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to
show extra information for example "kind" of "StatisticExtInfo"?

The above is just one idea came up with while I read the following comments of header
of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy.
  * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
  * There are some subsidiary structs that are useful to copy, though.


By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice.
I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and "Group".
I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users
understand
when the stats are used, but I don't have any idea now.

-- I tested with the example of CREATE STATISTICS documentation.
psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
                                                                 QUERY PLAN
                     

---------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1)
   Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
   ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2)
               Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
               Sort Method: quicksort  Memory: 32kB
               Worker 0:  Sort Method: quicksort  Memory: 32kB
               ->  Partial HashAggregate  (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768
rows=249loops=2) 
                     Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
                     Batches: 1  Memory Usage: 45kB
                     Worker 0:  Batches: 1  Memory Usage: 45kB
                     ->  Parallel Seq Scan on t3  (cost=0.00..6963.66 rows=309177 width=16) (actual time=0.021..171.214
rows=262800loops=2) 
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)  --
here
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)  --
here
 Planning Time: 114327.206 ms
 Execution Time: 288.007 ms
(18 rows)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment

pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Add mention of execution time memory for enable_partitionwise_* GUCs
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [Proposal] Add foreign-server health checks infrastructure