RE: Showing applied extended statistics in explain Part 2 - Mailing list pgsql-hackers
From | |
---|---|
Subject | RE: Showing applied extended statistics in explain Part 2 |
Date | |
Msg-id | TYWPR01MB109825D40CB18BED80B6B4140B1A62@TYWPR01MB10982.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Showing applied extended statistics in explain Part 2 (Tatsuro Yamada <yamatattsu@gmail.com>) |
Responses |
Re: Showing applied extended statistics in explain Part 2
|
List | pgsql-hackers |
Hi, Thanks for working the feature. As a user, I find it useful, and I'd like to use it in v18! Although I've just started start looking into it, I have a few questions. (1) Is it better to make the order of output consistent? For example, even though there are three clauses shown in the below case, the order does not match. * "Filter" shows that "id1" is first. * "Ext Stats" shows that "id2" is first. -- An example DROP TABLE IF EXISTS test; CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32)); INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM generate_series(1,1000000) s(i)); create statistics test_s1 on id1, id2 from test; analyze; =# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10); QUERY PLAN --------------------------------------------------------------------------------------- Gather (cost=1000.00..23092.77 rows=84311 width=20) Workers Planned: 2 -> Parallel Seq Scan on test (cost=0.00..13661.67 rows=35130 width=20) Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10))) -- here Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here (5 rows) (2) Do we really need the schema names without VERBOSE option? As in the above case, "Ext Stats" shows schema name "public", even though the table name "test" isn't shown with its schema name. Additionally, if the VERBOSE option is specified, should the column names also be printed with namespace? =# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10); QUERY PLAN --------------------------------------------------------------------------------------- Gather (cost=1000.00..22947.37 rows=82857 width=20) Output: id1, id2, id3, value Workers Planned: 2 -> Parallel Seq Scan on public.test (cost=0.00..13661.67 rows=34524 width=20) Output: id1, id2, id3, value Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10))) Ext Stats: public.test_s1 Clauses: (((id2 = 2) OR (id2 > 10)) AND (id1 = 1)) -- here (7 rows) (3) I might be misunderstanding something, but do we need the clauses? Is there any case where users would want to know the clauses? For example, wouldn't the following be sufficient? > Ext Stats: id1, id2 using test_s1 (4) The extended statistics with "dependencies" or "ndistinct" option don't seem to be shown in EXPLAIN output. Am I missing something? (Is this expected?) I tested the examples in the documentation. Although it might work with "mcv" option, I can't confirm that it works because "unrecognized node type" error occurred in my environment. https://www.postgresql.org/docs/current/sql-createstatistics.html (It might be wrong since I'm beginner with extended stats codes.) IIUC, the reason is that the patch only handles statext_mcv_clauselist_selectivity(), and doesn't handle dependencies_clauselist_selectivity() and estimate_multivariate_ndistinct(). -- doesn't work with "dependencies" option? =# \dX List of extended statistics Schema | Name | Definition | Ndistinct | Dependencies | MCV --------+---------+--------------------+-----------+--------------+--------- public | s1 | a, b FROM t1 | (null) | defined | (null) (2 rows) =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0); QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Gather (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 rows=100 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on t1 (cost=0.00..10675.00 rows=42 width=8) (actual time=30.300..46.610 rows=33 loops=3) Filter: ((a = 1) AND (b = 0)) Rows Removed by Filter: 333300 Planning Time: 0.246 ms Execution Time: 50.361 ms (8 rows) -- doesn't work with "ndistinct"? =# \dX List of extended statistics Schema | Name | Definition | Ndistinct | Dependencies | MCV --------+------+------------------------------------------------------------------+-----------+--------------+-------- public | s3 | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM t3 | defined | (null) | (null) (1 row) postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 WHERE date_trunc('month', a) = '2020-01-01'::timestamp; QUERY PLAN ---------------------------------------------------------------------------------------------------------- Seq Scan on t3 (cost=0.00..10210.01 rows=45710 width=8) (actual time=0.027..143.199 rows=44640 loops=1) Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp without time zone) Rows Removed by Filter: 480961 Planning Time: 0.088 ms Execution Time: 144.590 ms (5 rows) -- doesn't work with "mvc". It might work, but the error happens in my environments =# \dX List of extended statistics Schema | Name | Definition | Ndistinct | Dependencies | MCV --------+------+--------------+-----------+--------------+--------- public | s2 | a, b FROM t2 | (null) | (null) | defined (1 row) -- I encountered the error with the query. =# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1); ERROR: unrecognized node type: 268 Regards, -- Masahiro Ikeda NTT DATA CORPORATION
pgsql-hackers by date: