RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers
From | |
---|---|
Subject | RE: Showing applied extended statistics in explain Part 2 |
Date | |
Msg-id | TYWPR01MB109820D4151E844AC0A368BD3B1AD2@TYWPR01MB10982.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Showing applied extended statistics in explain Part 2 (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
List | pgsql-hackers |
> On 7/18/24 12:37, Masahiro.Ikeda@nttdata.com wrote: > >> Let me share my opinion on those questions ... > > ...> > >> For ndistinct, I think we don't show this because it doesn't go > >> through clauselist_selectivity, which is the only thing I modified in the PoC. > >> But I guess we might improve estimate_num_groups() to track the stats > >> in a similar way, I guess. > > > > Thanks. IIUC, the reason is that it doesn't go through > > statext_clauselist_selectivity() because the number of clauses is one though it goes > through clauselist_selectivity(). > > > > Ah, I see I misunderstood the original report. The query used was > > EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 > WHERE date_trunc('month', a) = '2020-01-01'::timestamp; > > And it has nothing to do with the number of clauses being one neither. > > The problem is this estimate is handled by examine_variable() matching the expression > to the "expression" stats, and injecting it into the variable, so that the > clauselist_selectivity() sees these stats. > > This would happen even if you build just expression statistics on each of the > date_trunc() calls, and then tried a query with two clauses: > > CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3; > CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3; > > EXPLAIN SELECT * FROM t3 > WHERE date_trunc('month', a) = '2020-01-01'::timestamp > AND date_trunc('day', 'a') = '2020-01-01'::timestamp; > > Not sure how to handle this - we could remember when explain_variable() injects > statistics like this, I guess. But do we know that each call to > examine_variable() is for estimation? And do we know for which clause? I see. The issue is related to extended statistics for single expression. As a first step, it's ok for me that we don't support it. The below is just an idea to know clauses... Although I'm missing something, can callers of examine_variable() for estimation to rebuild the clauses from partial information of "OpExpr"? Only clause_selectivity_ext() knows the information of actual full clauses. But we don't need full information. It's enough to know the information to show "OpExpr" for EXPLAIN. get_oper_expr() deparse "OpExpr" using only the operator oid and arguments in get_oper_expr(). If so, the caller to estimate, for example eqsel_internal(), scalarineqsel_wrapper() and so on, seems to be able to know the "OpExpr" information, which are operator oid and arguments, and used extended statistics easily to show for EXPLAIN. # Memo: the call path of the estimation function caller to estimate selectivity (eqsel_internal()/scalargtjoinsel_wrappter()/...) -> get_restriction_variable()/get_join_valiables() -> examine_variable() > >>> ERROR: unrecognized node type: 268 > > > > Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo" > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the > current patch? > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed. > > Do you have any plan to show extra information for example "kind" of > "StatisticExtInfo"? > > > > The above is just one idea came up with while I read the following > > comments of header of pathnodes.h, and to support copy "StatisticExtInfo" will leads > many other nodes to support copy. > > * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes. > > * There are some subsidiary structs that are useful to copy, though. > > > > I do think tracking just the OID would work, because we already know how to copy List > objects. But if we want to also track the clauses, we'd have to keep multiple lists, right? > That seems a bit inconvenient. Understood. In future, we might show not only the applied_clauses but also the clauses of its extended statistics (StatisticExtInfo->exprs). > > By the way, I found curios result while I tested with the above patch. It shows same > "Ext Stats" twice. > > I think it's expected behavior because the stat is used when estimate the cost of > "Partial HashAggregate" and "Group". > > I've shared the result because I could not understand soon when I saw > > it first time. I think it's better to let users understand when the stats are used, but I > don't have any idea now. > > > > -- I tested with the example of CREATE STATISTICS documentation. > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day', > a) FROM t3 GROUP BY 1, 2; > > QUERY PLAN > > ---------------------------------------------------------------------- > > ---------------------------------------------------------------------- > > - Group (cost=9530.56..9576.18 rows=365 width=16) (actual > > time=286.908..287.909 rows=366 loops=1) > > Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) > > -> Gather Merge (cost=9530.56..9572.53 rows=365 width=16) (actual > time=286.904..287.822 rows=498 loops=1) > > Workers Planned: 1 > > Workers Launched: 1 > > -> Sort (cost=8530.55..8531.46 rows=365 width=16) (actual > time=282.905..282.919 rows=249 loops=2) > > Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) > > Sort Method: quicksort Memory: 32kB > > Worker 0: Sort Method: quicksort Memory: 32kB > > -> Partial HashAggregate (cost=8509.54..8515.02 rows=365 > width=16) (actual time=282.716..282.768 rows=249 loops=2) > > Group Key: date_trunc('month'::text, a), date_trunc('day'::text, > a) > > Batches: 1 Memory Usage: 45kB > > Worker 0: Batches: 1 Memory Usage: 45kB > > -> Parallel Seq Scan on t3 (cost=0.00..6963.66 rows=309177 > width=16) (actual time=0.021..171.214 rows=262800 loops=2) > > Ext Stats: public.s3 Clauses: date_trunc('month'::text, > a), date_trunc('day'::text, a) -- here > > Ext Stats: public.s3 Clauses: > > date_trunc('month'::text, a), date_trunc('day'::text, a) -- here > > Planning Time: 114327.206 ms Execution Time: 288.007 ms > > (18 rows) > > > > I haven't looked into this, but my guess would be this is somehow related to the > parallelism - there's one parallel worker, which means we have 2 processes to report > stats for (leader + worker). And you get two copies of the "Ext Stats" line. Could be a > coincidence, ofc, but maybe there's a loop to print some worker info, and you print the > statistics info in it? I think yes and no. In the above case, it relates to parallelism, but it doesn't print the information per each worker. -- Make the number of workers is 5 and EXPLAIN without ANALYZE option. -- But "Ext Stats" is printed only twice. =# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Group (cost=4449.49..4489.50 rows=365 width=16) Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) -> Gather Merge (cost=4449.49..4478.55 rows=1825 width=16) Workers Planned: 5 -> Sort (cost=4449.41..4450.32 rows=365 width=16) Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a)) -> Partial HashAggregate (cost=4428.40..4433.88 rows=365 width=16) Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a) -> Parallel Seq Scan on t3 (cost=0.00..3902.80 rows=105120 width=16) Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) Ext Stats: public.s3 Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a) (11 rows) When creating a group path, it creates partial grouping paths if possible, and then creates the final grouping path. At this time, both the partial grouping path and the final grouping path use the same RelOptInfo to repeatedly use the extended statistics to know how many groups there will be. That's why it outputs only twice. There may be other similar calculation for partial paths. # The call path of the above query create_grouping_paths create_ordinary_grouping_paths create_partial_grouping_paths get_number_of_groups estimate_num_groups estimate_multivariate_ndistinct -- first time to estimate the number of groups for partial grouping path get_number_of_groups estimate_num_groups estimate_multivariate_ndistinct -- second time to estimate the number of groups for final grouping path Regards, -- Masahiro Ikeda NTT DATA CORPORATION
pgsql-hackers by date: