Re: Consider the number of columns in the sort cost model - Mailing list pgsql-hackers
From | Alena Rybakina
Subject | Re: Consider the number of columns in the sort cost model
Date |
Msg-id | cbdf6cd0-9fce-49e8-9b31-ef031c18c082@postgrespro.ru
In response to | Re: Consider the number of columns in the sort cost model (Kirill Reshke <reshkekirill@gmail.com>)
Responses | Re: Consider the number of columns in the sort cost model
List | pgsql-hackers
On 10/28/24 16:48, Alena Rybakina wrote:
Thanks for the confirmation.
On 23.10.2024 04:39, Andrei Lepikhov wrote:
Yes.
On 15/10/2024 12:15, David Rowley wrote:
To be honest, I didn't find information about this in the code, but did I understand everything correctly?
And the last patch is a demo of how I'm going to use the previous three patches and add one more strategy to improve the order of columns in the GROUP-BY clause.
When you said that, I saw it. You are right, checking for 0 is what stops us from useless recalculation. I noticed it right away.
Hmm, I don't see multiple calculations. em_ndistinct was made specifically for this purpose. Can you provide a specific case and code lines?
2. I noticed that statistics of distinct values are calculated several times. Maybe I missed something, but this could be avoided if these statistics were saved after the first calculation. For example, I saw that you may calculate the same statistics for the same equivalence members in both cost_incremental_sort and identify_sort_ecmember. Is it possible to store this information in a structure and reuse it later?
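Just to be sure we mean the same thing, here is roughly how I understand the caching now. This is only a sketch: em_ndistinct is the field from the patch, while estimate_member_ndistinct and the surrounding variables are names I made up for illustration.

	/* Sketch of the lazy caching: 0.0 in em_ndistinct means "not computed yet". */
	if (em->em_ndistinct == 0.0)
	{
		/* hypothetical helper standing in for the statistics lookup */
		double		ndist = estimate_member_ndistinct(root, em);

		/* cache the estimate; use -1.0 when no statistics are available */
		em->em_ndistinct = (ndist > 0.0) ? ndist : -1.0;
	}

	if (em->em_ndistinct > 0.0)
		nsortgroups *= em->em_ndistinct;	/* later callers just reuse the cached value */

So the statistics lookup happens at most once per equivalence member, and every later use is a plain field read.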
Thank you :)
I think you can just update your compiler. But I added the ndist initialisation to make more compilers happy :).
3. I think you should initialize the variable ndist in your patch. The compiler complained when I built your code:
costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized]
6694 | em->em_ndistinct = ndist;
| ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
6680 | double ndist;
| ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1
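For reference, the initialisation I meant is just the obvious one, along these lines. Only the initialiser matters here; the rest is my guess at the shape of identify_sort_ecmember, using the standard get_variable_numdistinct helper as a stand-in.

	bool		isdefault;
	double		ndist = 1.0;	/* initialise so every code path assigns something */

	if (HeapTupleIsValid(vardata.statsTuple))
		ndist = get_variable_numdistinct(&vardata, &isdefault);

	em->em_ndistinct = ndist;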
+ Assert (node != NULL);
+
examine_variable(root, node, 0, &vardata);
if (!HeapTupleIsValid(vardata.statsTuple))
    continue;

I don't think so. At least not until you provide a case where the get_sortgroupclause_expr function returns NULL.
What's more, remember that the 0004 patch here is just to show the perspective and is still under construction.
Anyway, thanks, I found out that the patch set doesn't apply correctly because of 828e94c. So, see the new version in the attachment.
Yes, you are right: if this case is possible, then it is connected with an error in the formation of the target list.
I played around with the examples a bit and there is something I couldn't figure out. When I added the same values to different columns - first in a, later in b - the order of the columns for the sort operation doesn't change. Isn't this a mistake?
create table a (x1 int, y1 int);
create table b (x2 int, y2 int);
insert into a values (NULL, NULL);
insert into a values (NULL, 1);
insert into a values (1, 1);
insert into a values (1, NULL);
create index a_x1_idx on a(x1);
create index b_x2_idx on b(x2);
create index a_y1_idx on a(y1);
create index b_y2_idx on b(y2);
insert into b select 1, 2 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 1, 3 from generate_series(3,30) as id;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1)
Hash Cond: (b.x2 = a.x1)
-> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1)
Group Key: b.x2, b.y2
Batches: 1 Memory Usage: 40kB
-> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1)
-> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1)
Planning Time: 1.649 ms
Execution Time: 0.485 ms
(11 rows)
delete from b;
insert into b select 2, 1 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 3, 1 from generate_series(3,30) as id;
vacuum analyze;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1)
Hash Cond: (a.x1 = b.x2)
-> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1)
-> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1)
Group Key: b.x2, b.y2
Batches: 1 Memory Usage: 24kB
-> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1)
Planning Time: 0.391 ms
Execution Time: 0.151 ms
(11 rows)
I noticed a spelling mistake: "significance phusics" (presumably "physics").
/*
* Calculate multiplier reflecting the number of comparisons which executor
* have to perform during the sort with this specific order of columns.
*
* The comparison factor f = 1.+F(pathkeys). There 1. incapsulates the
* second-order of significance phusics which cost function doesn't consider.
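And just to check that I read the comment correctly: I understand the factor as a multiplier on the usual comparison count, roughly like this. This is only my own illustration using the usual cost_sort locals; compute_pathkeys_factor is an invented name, not a function from the patch.

	/*
	 * f = 1.0 + F(pathkeys); F() covers the second-order effects (e.g. how
	 * many columns each comparison has to look at) that the base cost model
	 * ignores.
	 */
	double		cmp_factor = 1.0 + compute_pathkeys_factor(pathkeys);	/* invented name */
	double		comparison_cost = 2.0 * cpu_operator_cost * cmp_factor;

	/* the usual N * log2(N) comparisons for a full sort */
	run_cost += comparison_cost * tuples * LOG2(tuples);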
--
Regards,
Alena Rybakina
Postgres Professional