Aggregate node doesn't include cost for sorting - Mailing list pgsql-hackers

From David Geier
Subject Aggregate node doesn't include cost for sorting
Date
Msg-id 2691f881-d8de-dd57-5793-381d7d13e3cc@gmail.com
Whole thread Raw
Responses Re: Aggregate node doesn't include cost for sorting  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the 
input being pre-sorted or not. If however the input is not sorted, the 
Aggregate node must additionally perform a sort which can impact runtime 
significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000), 
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual 
time=1684.705..1684.809 rows=1 loops=1)
    ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71) 
(actual time=0.018..343.280 rows=100000 loops=1)
  Planning Time: 0.317 ms
  Execution Time: 1685.543 ms


-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual 
time=819.028..819.041 rows=1 loops=1)
    ->  Index Only Scan using foo_idx on foo (cost=6506.42..6506.42 
rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
          Heap Fetches: 100000
          Heap Prefetches: 1334
  Planning Time: 0.438 ms
  Execution Time: 819.515 ms

The cost of the Aggregate node is in both cases the same (250.0) while 
its runtime is 1.3s in the unsorted case and 0.4s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit 
an explicit Sort node in the plan so that it's visible to the user what 
is happening? As soon as there's also a GROUP BY in the query, a Sort 
node occurs in the plan. This seems inconsistent.

-- 
David Geier
(ServiceNow)




pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Alvaro Herrera
Date:
Subject: Re: on placeholder entries in view rule action query's range table