Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | jian he |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | CACJufxEBza9HAt6piUTkx6TvW76Qn0SYgytMcnUqXiLt7NX4Kg@mail.gmail.com Whole thread Raw |
In response to | Re: POC: GROUP BY optimization (jian he <jian.universality@gmail.com>) |
List | pgsql-hackers |
On Mon, May 20, 2024 at 4:54 PM jian he <jian.universality@gmail.com> wrote: > > > The behavior is still the same as the master. > meaning that below quoted queries are still using "Presorted Key: x". > > > > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w; > > > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,y,z; > > > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,w,y; > > > EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,w,z,y; > > > the above part will use: > > > -> Incremental Sort > > > Sort Key: x, $, $, $ > > > Presorted Key: x > > > -> Index Scan using t1_x_y_idx on t1 > > > > 2. Could you try to find the reason? > the following are my understanding, it may be wrong... > I think my previous thread only explained that two paths have the same cost, the planner chooses the one that was first added into the pathlist. but didn't explain why Incremental Sort path, presorted by one key and presorted by two keys yield the same cost. -------------------------------------------- if (rel->tuples > 0) { /* * Clamp to size of rel, or size of rel / 10 if multiple Vars. The * fudge factor is because the Vars are probably correlated but we * don't know by how much. We should never clamp to less than the * largest ndistinct value for any of the Vars, though, since * there will surely be at least that many groups. */ double clamp = rel->tuples; if (relvarcount > 1) { clamp *= 0.1; if (clamp < relmaxndistinct) { clamp = relmaxndistinct; /* for sanity in case some ndistinct is too large: */ if (clamp > rel->tuples) clamp = rel->tuples; } } if (reldistinct > clamp) reldistinct = clamp; ... } i think, the above code[0] snippet within function estimate_num_groups makes Incremental Sort Path not pickup the optimal presorted keys. see original problem thread: [1] ---------------------------------------------------------------------------------------- CREATE TABLE t1 AS SELECT (i % 10)::numeric AS x,(i % 10)::int8 AS y,'abc' || i % 10 AS z, i::int4 AS w FROM generate_series(1, 100) AS i; CREATE INDEX t1_x_y_idx ON t1 (x, y); ANALYZE t1; SET enable_hashagg = off; SET enable_seqscan = off; EXPLAIN (COSTS OFF) SELECT count(*) FROM t1 GROUP BY x,z,y,w; will generate 2 Incremental Sort path, one: "Presorted Key: x", another one: "Presorted Key: x,y". The first Incremental Sort path is added first. The function estimate_num_groups returned value (input_groups) is the main key to calculate the cost of Incremental Sort path! But here, the estimate_num_groups function returned the same value: 10 for "Presorted Key: x" and "Presorted Key: x,y". then later cost_incremental_sort returns the same cost for "Presorted Key: x" and "Presorted Key: x,y". (line refers to src/backend/utils/adt/selfuncs line number). why "Presorted Key: x,y" return 10 in estimate_num_groups: line 3667 assign clamp to 100.0, because rel->tuples is 100. line 3671 clamp *= 0.1; make clamp because 10. line 3680, 3681 `if (reldistinct > clamp) branch` make reldistinct from 100 to 10. line 3733 make `numdistinct *= reldistinct;` makes numdistinct because 10. If I change the total number of rows in a relation, or change the distinct number of y values then the planner will use "Incremental Sort Presorted Key: x,y". for example: CREATE TABLE t10 AS SELECT (i % 10)::numeric AS x,(i % 11)::int8 AS y,'abc' || i % 10 AS z, i::float4 AS w FROM generate_series(1, 1E2) AS i; CREATE INDEX t10_x_y_idx ON t10 (x, y); ANALYZE t10; The above setup will make the following query using "Incremental Sort Presorted Key: x,y". EXPLAIN (COSTS OFF) SELECT count(*) FROM t10 GROUP BY x,z,y,w; EXPLAIN (COSTS OFF) SELECT count(*) FROM t10 GROUP BY x,w,y,z; EXPLAIN (COSTS OFF) SELECT count(*) FROM t10 GROUP BY x,z,w,y; EXPLAIN (COSTS OFF) SELECT count(*) FROM t10 GROUP BY x,w,z,y; summary: * the regression test setup at [2] can make some cases not pickup the best optimal Incremental Sort path. we use this test in src/test/regress/sql/aggregates.sql -- Engage incremental sort EXPLAIN (COSTS OFF) SELECT count(*) FROM btg GROUP BY z, y, w, x; but if we use: EXPLAIN (COSTS OFF) SELECT count(*) FROM btg GROUP BY x, z, w, y; then the plan is not the best. * The regression test setup makes estimate_num_groups code logic return the same value. therefore make Incremental Sort presort by one key, two keys yield the same cost. the cost_incremental_sort will return the same cost. * https://git.postgresql.org/cgit/postgresql.git/tree/src/test/regress/sql/aggregates.sql#n1194 maybe we can change: CREATE TABLE btg AS SELECT i % 10 AS x, i % 10 AS y, 'abc' || i % 10 AS z, i AS w FROM generate_series(1, 100) AS i; to CREATE TABLE btg AS SELECT i % 10 AS x, i % 10 AS y, 'abc' || i % 10 AS z, i AS w FROM generate_series(1, 1000) AS i; [0] https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/utils/adt/selfuncs.c#n3669 [1] https://www.postgresql.org/message-id/CACJufxGt99nZ%2Bnir%2BaB6pFQ%3DK8oNiHAQ3OELqSbGMqNxok8nxA%40mail.gmail.com [2] https://git.postgresql.org/cgit/postgresql.git/tree/src/test/regress/sql/aggregates.sql#n1194
pgsql-hackers by date: