Home > mailing lists

RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

From
Subject	RE: Showing applied extended statistics in explain Part 2
Date	July 12, 2024 10:09:42
Msg-id	TYWPR01MB109825D40CB18BED80B6B4140B1A62@TYWPR01MB10982.jpnprd01.prod.outlook.com Whole thread Raw
In response to	Re: Showing applied extended statistics in explain Part 2 (Tatsuro Yamada <yamatattsu@gmail.com>)
Responses	Re: Showing applied extended statistics in explain Part 2
List	pgsql-hackers

Tree view

Hi,

Thanks for working the feature. As a user, I find it useful, and I'd like to use
it in v18! Although I've just started start looking into it, I have a few questions.


(1)

Is it better to make the order of output consistent? For example, even
though there are three clauses shown in the below case, the order does not
match.
* "Filter" shows that "id1" is first.
* "Ext Stats" shows that "id2" is first.

-- An example
DROP TABLE IF EXISTS test;
CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i));
create statistics test_s1 on id1, id2 from test; analyze;

=# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..23092.77 rows=84311 width=20)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
         Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))                                -- here
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))    -- here
(5 rows)



(2)

Do we really need the schema names without VERBOSE option? As in the above case,
"Ext Stats" shows schema name "public", even though the table name "test" isn't
shown with its schema name.

Additionally, if the VERBOSE option is specified, should the column names also be
printed with namespace?

=# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..22947.37 rows=82857 width=20)
   Output: id1, id2, id3, value
   Workers Planned: 2
   ->  Parallel Seq Scan on public.test  (cost=0.00..13661.67 rows=34524 width=20)
         Output: id1, id2, id3, value
         Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1))   -- here
(7 rows)



(3)

I might be misunderstanding something, but do we need the clauses? Is there any
case where users would want to know the clauses? For example, wouldn't the
following be sufficient?

> Ext Stats: id1, id2 using test_s1



(4)

The extended statistics with "dependencies" or "ndistinct" option don't seem to
be shown in EXPLAIN output. Am I missing something? (Is this expected?)

I tested the examples in the documentation. Although it might work with
"mcv" option, I can't confirm that it works because "unrecognized node type"
error occurred in my environment.
https://www.postgresql.org/docs/current/sql-createstatistics.html

(It might be wrong since I'm beginner with extended stats codes.)
IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(),
and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct().


-- doesn't work with "dependencies" option?
=# \dX
                        List of extended statistics
 Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV
--------+---------+--------------------+-----------+--------------+---------
 public | s1      | a, b FROM t1       | (null)    | defined      | (null)
(2 rows)

=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3)
         Filter: ((a = 1) AND (b = 0))
         Rows Removed by Filter: 333300
 Planning Time: 0.246 ms
 Execution Time: 50.361 ms
(8 rows)

-- doesn't work with "ndistinct"?
=# \dX
                                             List of extended statistics
 Schema | Name |                            Definition                            | Ndistinct | Dependencies |  MCV
--------+------+------------------------------------------------------------------+-----------+--------------+--------
 public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined   | (null)       | (null)
(1 row)

postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1)
   Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone)
   Rows Removed by Filter: 480961
 Planning Time: 0.088 ms
 Execution Time: 144.590 ms
(5 rows)

-- doesn't work with "mvc". It might work, but the error happens in my environments
=# \dX
                    List of extended statistics
 Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV
--------+------+--------------+-----------+--------------+---------
 public | s2   | a, b FROM t2 | (null)    | (null)       | defined
(1 row)

-- I encountered the error with the query.
=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
ERROR:  unrecognized node type: 268



Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

pgsql-hackers by date:

From: Antonin Houska
Date: 12 July 2024, 10:01:11
Subject: Re: Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?

From: Junwang Zhao
Date: 12 July 2024, 10:15:37
Subject: Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers

Previous

Next