RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

From
Subject RE: Showing applied extended statistics in explain Part 2
Date
Msg-id TYWPR01MB109820D4151E844AC0A368BD3B1AD2@TYWPR01MB10982.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Showing applied extended statistics in explain Part 2  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
> On 7/18/24 12:37, Masahiro.Ikeda@nttdata.com wrote:
> >> Let me share my opinion on those questions ...
> > ...>
> >> For ndistinct, I think we don't show this because it doesn't go
> >> through clauselist_selectivity, which is the only thing I modified in the PoC.
> >> But I guess we might improve estimate_num_groups() to track the stats
> >> in a similar way, I guess.
> >
> > Thanks. IIUC, the reason is that it doesn't go through
> > statext_clauselist_selectivity() because the number of clauses is one though it goes
> through clauselist_selectivity().
> >
>
> Ah, I see I misunderstood the original report. The query used was
>
>   EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
>     WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
>
> And it has nothing to do with the number of clauses being one neither.
>
> The problem is this estimate is handled by examine_variable() matching the expression
> to the "expression" stats, and injecting it into the variable, so that the
> clauselist_selectivity() sees these stats.
>
> This would happen even if you build just expression statistics on each of the
> date_trunc() calls, and then tried a query with two clauses:
>
>   CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3;
>   CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3;
>
>   EXPLAIN SELECT * FROM t3
>     WHERE date_trunc('month', a) = '2020-01-01'::timestamp
>       AND date_trunc('day', 'a') = '2020-01-01'::timestamp;
>
> Not sure how to handle this - we could remember when explain_variable() injects
> statistics like this, I guess. But do we know that each call to
> examine_variable() is for estimation? And do we know for which clause?

I see. The issue is related to extended statistics for single expression. As a
first step, it's ok for me that we don't support it.

The below is just an idea to know clauses...
Although I'm missing something, can callers of examine_variable()
for estimation to rebuild the clauses from partial information of "OpExpr"?

Only clause_selectivity_ext() knows the information of actual full clauses.
But we don't need full information. It's enough to know the information
to show "OpExpr" for EXPLAIN.

get_oper_expr() deparse "OpExpr" using only the operator oid and arguments
in get_oper_expr().

If so, the caller to estimate, for example eqsel_internal(), scalarineqsel_wrapper()
and so on, seems to be able to know the "OpExpr" information, which are operator
oid and arguments, and used extended statistics easily to show for EXPLAIN.

# Memo: the call path of the estimation function
 caller to estimate selectivity (eqsel_internal()/scalargtjoinsel_wrappter()/...)
  -> get_restriction_variable()/get_join_valiables()
   -> examine_variable()


> >>> ERROR:  unrecognized node type: 268
> >
> > Regarding the above error, do "applied_stats" need have the list of "StatisticExtInfo"
> > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN output in the
> current patch?
> > change_to_applied_stats_has_list_of_oids.diff is the change I assumed.
> > Do you have any plan to show extra information for example "kind" of
> "StatisticExtInfo"?
> >
> > The above is just one idea came up with while I read the following
> > comments of header of pathnodes.h, and to support copy "StatisticExtInfo" will leads
> many other nodes to support copy.
> >   * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes.
> >   * There are some subsidiary structs that are useful to copy, though.
> >
>
> I do think tracking just the OID would work, because we already know how to copy List
> objects. But if we want to also track the clauses, we'd have to keep multiple lists, right?
> That seems a bit inconvenient.

Understood. In future, we might show not only the applied_clauses but also the clauses of
its extended statistics (StatisticExtInfo->exprs).


> > By the way, I found curios result while I tested with the above patch. It shows same
> "Ext Stats" twice.
> > I think it's expected behavior because the stat is used when estimate the cost of
> "Partial HashAggregate" and "Group".
> > I've shared the result because I could not understand soon when I saw
> > it first time. I think it's better to let users understand when the stats are used, but I
> don't have any idea now.
> >
> > -- I tested with the example of CREATE STATISTICS documentation.
> > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), date_trunc('day',
> a) FROM t3 GROUP BY 1, 2;
> >                                                                  QUERY PLAN
> > ----------------------------------------------------------------------
> > ----------------------------------------------------------------------
> > -  Group  (cost=9530.56..9576.18 rows=365 width=16) (actual
> > time=286.908..287.909 rows=366 loops=1)
> >    Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> >    ->  Gather Merge  (cost=9530.56..9572.53 rows=365 width=16) (actual
> time=286.904..287.822 rows=498 loops=1)
> >          Workers Planned: 1
> >          Workers Launched: 1
> >          ->  Sort  (cost=8530.55..8531.46 rows=365 width=16) (actual
> time=282.905..282.919 rows=249 loops=2)
> >                Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
> >                Sort Method: quicksort  Memory: 32kB
> >                Worker 0:  Sort Method: quicksort  Memory: 32kB
> >                ->  Partial HashAggregate  (cost=8509.54..8515.02 rows=365
> width=16) (actual time=282.716..282.768 rows=249 loops=2)
> >                      Group Key: date_trunc('month'::text, a), date_trunc('day'::text,
> a)
> >                      Batches: 1  Memory Usage: 45kB
> >                      Worker 0:  Batches: 1  Memory Usage: 45kB
> >                      ->  Parallel Seq Scan on t3  (cost=0.00..6963.66 rows=309177
> width=16) (actual time=0.021..171.214 rows=262800 loops=2)
> >                            Ext Stats: public.s3  Clauses: date_trunc('month'::text,
> a), date_trunc('day'::text, a)  -- here
> >                            Ext Stats: public.s3  Clauses:
> > date_trunc('month'::text, a), date_trunc('day'::text, a)  -- here
> > Planning Time: 114327.206 ms  Execution Time: 288.007 ms
> > (18 rows)
> >
>
> I haven't looked into this, but my guess would be this is somehow related to the
> parallelism - there's one parallel worker, which means we have 2 processes to report
> stats for (leader + worker). And you get two copies of the "Ext Stats" line. Could be a
> coincidence, ofc, but maybe there's a loop to print some worker info, and you print the
> statistics info in it?

I think yes and no. In the above case, it relates to parallelism, but it doesn't print the
information per each worker.

-- Make the number of workers is 5 and EXPLAIN without ANALYZE option.
-- But "Ext Stats" is printed only twice.
=# EXPLAIN (STATS) SELECT date_trunc('month', a), date_trunc('day', a) FROM t3 GROUP BY 1, 2;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Group  (cost=4449.49..4489.50 rows=365 width=16)
   Group Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
   ->  Gather Merge  (cost=4449.49..4478.55 rows=1825 width=16)
         Workers Planned: 5
         ->  Sort  (cost=4449.41..4450.32 rows=365 width=16)
               Sort Key: (date_trunc('month'::text, a)), (date_trunc('day'::text, a))
               ->  Partial HashAggregate  (cost=4428.40..4433.88 rows=365 width=16)
                     Group Key: date_trunc('month'::text, a), date_trunc('day'::text, a)
                     ->  Parallel Seq Scan on t3  (cost=0.00..3902.80 rows=105120 width=16)
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
                           Ext Stats: public.s3  Clauses: date_trunc('month'::text, a), date_trunc('day'::text, a)
(11 rows)

When creating a group path, it creates partial grouping paths if possible, and then
creates the final grouping path. At this time, both the partial grouping path and
the final grouping path use the same RelOptInfo to repeatedly use the extended
statistics to know how many groups there will be. That's why it outputs only twice.
There may be other similar calculation for partial paths.

# The call path of the above query
 create_grouping_paths
  create_ordinary_grouping_paths
   create_partial_grouping_paths
    get_number_of_groups
     estimate_num_groups
      estimate_multivariate_ndistinct  -- first time to estimate the number of groups for partial grouping path
   get_number_of_groups
    estimate_num_groups
     estimate_multivariate_ndistinct  -- second time to estimate the number of groups for final grouping path


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION



pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: Set log_lock_waits=on by default
Next
From: Aleksander Alekseev
Date:
Subject: Re: Build with LTO / -flto on macOS