Re: Consider the number of columns in the sort cost model - Mailing list pgsql-hackers
From | Alena Rybakina
Subject | Re: Consider the number of columns in the sort cost model
Date |
Msg-id | cbdf6cd0-9fce-49e8-9b31-ef031c18c082@postgrespro.ru
In response to | Re: Consider the number of columns in the sort cost model (Kirill Reshke <reshkekirill@gmail.com>)
Responses | Re: Consider the number of columns in the sort cost model
List | pgsql-hackers
On 10/28/24 16:48, Alena Rybakina wrote:
Thanks for the confirmation.
On 23.10.2024 04:39, Andrei Lepikhov wrote:
Yes.
On 15/10/2024 12:15, David Rowley wrote:
To be honest, I didn't find information about this in the code, but did I understand everything correctly?
And the last patch is a demo of how I'm going to use the previous three patches and add one more strategy to improve the order of columns in the GROUP-BY clause.
When you said that, I saw it. You are right, checking for 0 is what stops us from useless recalculation. I noticed it right away.
Hmm, I don't see multiple calculations. em_ndistinct was made specifically for this purpose. Can you provide a specific case and code lines?
2. I noticed that statistics of distinct values are calculated several times. Maybe I missed something, but this could be avoided if these statistics were saved after the first calculation. For example, I saw that you may calculate the same statistics for the same equivalence members in both cost_incremental_sort and identify_sort_ecmember. Is it possible to store this information in a structure and reuse it later?
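Just to be sure we mean the same thing, here is roughly how I understand the caching now. This is only a sketch: em_ndistinct is the field from the patch, while estimate_member_ndistinct and the surrounding variables are names I made up for illustration.

	/* Sketch of the lazy caching: 0.0 in em_ndistinct means "not computed yet". */
	if (em->em_ndistinct == 0.0)
	{
		/* hypothetical helper standing in for the statistics lookup */
		double		ndist = estimate_member_ndistinct(root, em);

		/* cache the estimate; use -1.0 when no statistics are available */
		em->em_ndistinct = (ndist > 0.0) ? ndist : -1.0;
	}

	if (em->em_ndistinct > 0.0)
		nsortgroups *= em->em_ndistinct;	/* later callers just reuse the cached value */

So the statistics lookup happens at most once per equivalence member, and every later use is a plain field read.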
Thank you :)
I think you can just update your compiler. But I added the ndist initialisation to make more compilers happy :).
3. I think you should initialize the variable ndist in your patch. The compiler complained when I built your code:
costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized]
6694 | em->em_ndistinct = ndist;
| ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
6680 | double ndist;
| ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1
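For reference, the initialisation I meant is just the obvious one, along these lines. Only the initialiser matters here; the rest is my guess at the shape of identify_sort_ecmember, using the standard get_variable_numdistinct helper as a stand-in.

	bool		isdefault;
	double		ndist = 1.0;	/* initialise so every code path assigns something */

	if (HeapTupleIsValid(vardata.statsTuple))
		ndist = get_variable_numdistinct(&vardata, &isdefault);

	em->em_ndistinct = ndist;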
+ Assert (node != NULL);
+
examine_variable(root, node, 0, &vardata);
if (!HeapTupleIsValid(vardata.statsTuple))
    continue;

I don't think so. At least not until you provide a case where the get_sortgroupclause_expr function returns NULL.
What's more, remember that the 0004 patch here is just to show the perspective and is still under construction.
Anyway, thanks, I found out that the patch set doesn't apply correctly because of 828e94c. So, see the new version in the attachment.
Yes, you are right: if this case is possible, then it is connected with an error in the formation of the target list.
I played around with the examples a bit and there is something I couldn't figure out. When I added the same values to different columns - first in a, later in b - the order of the columns for the sort operation doesn't change. Isn't this a mistake?
create table a (x1 int, y1 int);
create table b (x2 int, y2 int);
insert into a values (NULL, NULL);
insert into a values (NULL, 1);
insert into a values (1, 1);
insert into a values (1, NULL);
create index a_x1_idx on a(x1);
create index b_x2_idx on b(x2);
create index a_y1_idx on a(y1);
create index b_y2_idx on b(y2);
insert into b select 1, 2 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 1, 3 from generate_series(3,30) as id;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1)
Hash Cond: (b.x2 = a.x1)
-> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1)
Group Key: b.x2, b.y2
Batches: 1 Memory Usage: 40kB
-> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1)
-> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1)
Planning Time: 1.649 ms
Execution Time: 0.485 ms
(11 rows)
delete from b;
insert into b select 2, 1 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 3, 1 from generate_series(3,30) as id;
vacuum analyze;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1)
Hash Cond: (a.x1 = b.x2)
-> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1)
-> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1)
Group Key: b.x2, b.y2
Batches: 1 Memory Usage: 24kB
-> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1)
Planning Time: 0.391 ms
Execution Time: 0.151 ms
(11 rows)
I noticed a spelling mistake: "significance phusics" (presumably "physics").
/*
* Calculate multiplier reflecting the number of comparisons which executor
* have to perform during the sort with this specific order of columns.
*
* The comparison factor f = 1.+F(pathkeys). There 1. incapsulates the
* second-order of significance phusics which cost function doesn't consider.
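And just to check that I read the comment correctly: I understand the factor as a multiplier on the usual comparison count, roughly like this. This is only my own illustration using the usual cost_sort locals; compute_pathkeys_factor is an invented name, not a function from the patch.

	/*
	 * f = 1.0 + F(pathkeys); F() covers the second-order effects (e.g. how
	 * many columns each comparison has to look at) that the base cost model
	 * ignores.
	 */
	double		cmp_factor = 1.0 + compute_pathkeys_factor(pathkeys);	/* invented name */
	double		comparison_cost = 2.0 * cpu_operator_cost * cmp_factor;

	/* the usual N * log2(N) comparisons for a full sort */
	run_cost += comparison_cost * tuples * LOG2(tuples);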
--
Regards,
Alena Rybakina
Postgres Professional