RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers
From | |
---|---|
Subject | RE: Showing applied extended statistics in explain Part 2 |
Date | |
Msg-id | TYWPR01MB109820AEAB1E18F89404A8304B1AC2@TYWPR01MB10982.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Showing applied extended statistics in explain Part 2 (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Showing applied extended statistics in explain Part 2
|
List | pgsql-hackers |
> Let me share my opinion on those questions ... Thanks! I could understand the patch well thanks to your comments. > On 7/12/24 12:09, Masahiro.Ikeda@nttdata.com wrote: > > Is it better to make the order of output consistent? For example, even > > though there are three clauses shown in the below case, the order does > > not match. > > * "Filter" shows that "id1" is first. > > * "Ext Stats" shows that "id2" is first. > > > > -- An example > > DROP TABLE IF EXISTS test; > > CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32)); > > INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM > > generate_series(1,1000000) s(i)); create statistics test_s1 on id1, > > id2 from test; analyze; > > > > =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10); > > QUERY PLAN > > ---------------------------------------------------------------------- > > ----------------- Gather (cost=1000.00..23092.77 rows=84311 > > width=20) > > Workers Planned: 2 > > -> Parallel Seq Scan on test (cost=0.00..13661.67 rows=35130 width=20) > > Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10))) > -- here > > Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) > -- here > > (5 rows) > > > > I don't think we need to make the order consistent. It probably wouldn't hurt, but I'm > not sure it's even possible for all scan types - for example in an index scan, the clauses > might be split between index conditions and filters, etc. OK, I understand it isn't unexpected behavior. > > (3) > > > > I might be misunderstanding something, but do we need the clauses? Is > > there any case where users would want to know the clauses? For > > example, wouldn't the following be sufficient? > > > >> Ext Stats: id1, id2 using test_s1 > > > > The stats may overlap, and some clauses may be matching multiple of them. And some > statistics do not support all clause types (e.g. > functional dependencies work only with equality conditions). Yes, you might deduce > which statistics are used for which clause, but it's not trivial - interpreting explain is > already not trivial, let's not make it harder. > > (If tracking the exact clauses turns out to be expensive, we might revisit this - it might > make it cheaper). Thanks. I agree that we need to show the clauses. > > (4) > > > > The extended statistics with "dependencies" or "ndistinct" option > > don't seem to be shown in EXPLAIN output. Am I missing something? (Is > > this expected?) > > > > I tested the examples in the documentation. Although it might work > > with "mcv" option, I can't confirm that it works because "unrecognized node type" > > error occurred in my environment. > > https://urldefense.com/v3/__https://www.postgresql.org/docs/current/sq > > l-createstatistics.html__;!!GCTRfqYYOYGmgK_z!9H-FTXrhg7cr0U2r4PoKEeWM1 > > v9feP8I8zlNyhf-801n-KI8bIMAxOQgaetSTpek3ECk2_FKWEsuApVZ-ys-ka7rfjX8ANB > > 9zQ$ > > > > (It might be wrong since I'm beginner with extended stats codes.) > > IIUC, the reason is that the patch only handles > > statext_mcv_clauselist_selectivity(), > > and doesn't handle dependencies_clauselist_selectivity() and > estimate_multivariate_ndistinct(). > > > > > > -- doesn't work with "dependencies" option? > > =# \dX > > List of extended statistics > > Schema | Name | Definition | Ndistinct | Dependencies | MCV > > --------+---------+--------------------+-----------+--------------+--- > > --------+---------+--------------------+-----------+--------------+--- > > --------+---------+--------------------+-----------+--------------+--- > > public | s1 | a, b FROM t1 | (null) | defined | (null) > > (2 rows) > > > > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0); > > QUERY PLAN > > ---------------------------------------------------------------------- > > --------------------------------------------- > > Gather (cost=1000.00..11685.00 rows=100 width=8) (actual > time=0.214..50.327 rows=100 loops=1) > > Workers Planned: 2 > > Workers Launched: 2 > > -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=42 width=8) (actual > time=30.300..46.610 rows=33 loops=3) > > Filter: ((a = 1) AND (b = 0)) > > Rows Removed by Filter: 333300 Planning Time: 0.246 ms > > Execution Time: 50.361 ms > > (8 rows) > > > > -- doesn't work with "ndistinct"? > > =# \dX > > List of extended statistics > > Schema | Name | Definition | > Ndistinct | Dependencies | MCV > > > --------+------+------------------------------------------------------------------+------ > -----+--------------+-------- > > public | s3 | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | > defined | (null) | (null) > > (1 row) > > > > postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp; > > QUERY PLAN > > ---------------------------------------------------------------------- > > ------------------------------------ > > Seq Scan on t3 (cost=0.00..10210.01 rows=45710 width=8) (actual > time=0.027..143.199 rows=44640 loops=1) > > Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp > without time zone) > > Rows Removed by Filter: 480961 > > Planning Time: 0.088 ms > > Execution Time: 144.590 ms > > (5 rows) > > > > -- doesn't work with "mvc". It might work, but the error happens in my > > environments =# \dX > > List of extended statistics > > Schema | Name | Definition | Ndistinct | Dependencies | MCV > > --------+------+--------------+-----------+--------------+--------- > > public | s2 | a, b FROM t2 | (null) | (null) | defined > > (1 row) > > > > -- I encountered the error with the query. > > =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = > > 1); > > ERROR: unrecognized node type: 268 > > > > > > Yes, you're right we don't show some stats. For dependencies there's the problem that > we don't apply them individually, so it's not really possible to map clauses to individual > stats. I wonder if we might have a special "entry" to show clauses estimated by the > functional dependencies combined from all stats (instead of a particular statistics). OK, I understand it's intended behavior for "dependencies" and we need to consider how to show them in EXPLAIN output in future. > For ndistinct, I think we don't show this because it doesn't go through > clauselist_selectivity, which is the only thing I modified in the PoC. > But I guess we might improve estimate_num_groups() to track the stats in a similar way, > I guess. Thanks. IIUC, the reason is that it doesn't go through statext_clauselist_selectivity() because the number of clauses is one though it goes through clauselist_selectivity(). > > ERROR: unrecognized node type: 268 Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo" because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the current patch? change_to_applied_stats_has_list_of_oids.diff is the change I assumed. Do you have any plan to show extra information for example "kind" of "StatisticExtInfo"? The above is just one idea came up with while I read the following comments of header of pathnodes.h, and to support copy "StatisticExtInfo" will leads many other nodes to support copy. * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes. * There are some subsidiary structs that are useful to copy, though. By the way, I found curios result while I tested with the above patch. It shows same "Ext Stats" twice. I think it's expected behavior because the stat is used when estimate the cost of "Partial HashAggregate" and "Group". I've shared the result because I could not understand soon when I saw it first time. I think it's better to let users understand when the stats are used, but I don't have any idea now. -- I tested with the example of CREATE STATISTICS documentation. psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Group (cost=9530.56..9576.18 rows=365 width=16) (actual time=286.908..287.909 rows=366 loops=1) Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) -> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual time=286.904..287.822 rows=498 loops=1) Workers Planned: 1 Workers Launched: 1 -> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual time=282.905..282.919 rows=249 loops=2) Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) Sort Method: quicksort Memory: 32kB Worker 0: Sort Method: quicksort Memory: 32kB -> Partial HashAggregate (cost=8509.54..8515.02 rows=365 width=16) (actual time=282.716..282.768 rows=249loops=2) Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a) Batches: 1 Memory Usage: 45kB Worker 0: Batches: 1 Memory Usage: 45kB -> Parallel Seq Scan on t3 (cost=0.00..6963.66 rows=309177 width=16) (actual time=0.021..171.214 rows=262800loops=2) Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) -- here Planning Time: 114327.206 ms Execution Time: 288.007 ms (18 rows) Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
pgsql-hackers by date: