Re: Incremental Sort Cost Estimation Instability - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: Incremental Sort Cost Estimation Instability
Date
Msg-id 6291ee9c-a87d-43a9-b848-43a53cef7762@postgrespro.ru
In response to Incremental Sort Cost Estimation Instability  (Andrei Lepikhov <lepihov@gmail.com>)
Responses Re: Incremental Sort Cost Estimation Instability
List pgsql-hackers

Hi!

On 07.11.2024 08:57, Andrei Lepikhov wrote:
On 10/8/24 11:33, Andrei Lepikhov wrote:
On 9/23/24 20:02, Andrei Lepikhov wrote:
On 12/9/2024 12:12, David Rowley wrote:
On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepihov@gmail.com> wrote:
Minor change to make compiler and cfbot happy
Now, this thread looks connected to [1]. However, it still has independent value, which can be discussed separately.
After the introduction of the em->em_ndistinct cache, I played around with the idea of letting estimate_num_groups use this cache. In the process, I found one more instability case, like the following:

DROP TABLE IF EXISTS test;
CREATE TABLE test (x int, y int, z int);
INSERT INTO test (x,y,z) (SELECT random()*1E5, random()*2, random() FROM generate_series(1,1e4));
VACUUM ANALYZE test;

EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY x,y;
EXPLAIN SELECT count(*) FROM test WHERE x=y GROUP BY y,x;

Here, you can see that depending on the initial order of the grouping columns, Postgres chooses different columns for grouping. That leads to different estimates - one of them is definitely wrong:

GroupAggregate  (cost=181.41..182.29 rows=50 width=16)
GroupAggregate  (cost=181.41..181.82 rows=3 width=16)

That happens because, when estimating the number of groups, Postgres doesn't consider the EquivalenceClass, which could let it correct the group estimation at low cost.
It could be done inside make_pathkeys_for_sortclauses_extended by choosing the column with the lower number of distinct values, but IMO it is better to do it at the moment the number of groups is estimated.
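
Just to make the discrepancy concrete (an illustrative check against the test table above, not part of the patch): the qual x = y restricts both columns to the few values they have in common, so with the sample data at most three groups are possible, and the real group count is close to the lower of the two estimates:

-- Count the actual groups produced by the query above.
SELECT count(*) AS groups
FROM (SELECT 1 FROM test WHERE x = y GROUP BY x, y) AS g;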

Thoughts? Is it a real issue or just an impractical corner case?

The new version of the patch is attached.

[1] https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

But you haven't considered the case where you need to use non-cached values, for example, if ndistinct has already changed. Look: here x has the minimum ndistinct first, and then column z does:

alena@postgres=# delete from test;
DELETE 10000
alena@postgres=# INSERT INTO test (x,y,z) (SELECT id%3, id*2, id FROM generate_series(1,1e4) as id);
INSERT 0 10000
alena@postgres=# EXPLAIN SELECT * FROM test where x=y ORDER BY x,y,z;
                          QUERY PLAN                          
--------------------------------------------------------------
 Sort  (cost=196.88..197.02 rows=56 width=12)
   Sort Key: x, z
   ->  Seq Scan on test  (cost=0.00..195.25 rows=56 width=12)
         Filter: (x = y)
(4 rows)

alena@postgres=# delete from test;
DELETE 10000
alena@postgres=# INSERT INTO test (x,y,z) (SELECT id, id*2, id%3 FROM generate_series(1,1e4) as id);
INSERT 0 10000
alena@postgres=# vacuum analyze;
VACUUM
alena@postgres=# EXPLAIN SELECT * FROM test where x=y ORDER BY x,y,z;
                          QUERY PLAN                          
--------------------------------------------------------------
 Sort  (cost=235.41..235.54 rows=50 width=12)
   Sort Key: x, z
   ->  Seq Scan on test  (cost=0.00..234.00 rows=50 width=12)
         Filter: (x = y)
(4 rows)

But the order of the columns does not change, as you can see.
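
For reference, the statistics the planner is actually working from at each step can be inspected directly (just an illustrative check on the same test table; without a fresh vacuum analyze, pg_stats still describes the previous contents of the table):

-- Per-column ndistinct as currently recorded in pg_statistic.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'test' AND attname IN ('x', 'y', 'z');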

-- 
Regards,
Alena Rybakina
Postgres Professional
