Re: Disk-based hash aggregate's cost model - Mailing list pgsql-hackers
From: Jeff Davis
Subject: Re: Disk-based hash aggregate's cost model
Date:
Msg-id: 56c977908f33be8838759849c21f2370a10f7683.camel@j-davis.com
In response to: Re: Disk-based hash aggregate's cost model (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Disk-based hash aggregate's cost model
List: pgsql-hackers
On Wed, 2020-09-02 at 17:35 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > create table text10m(t text collate "C.UTF-8", i int, n numeric);
> > insert into text10m select s.g::text, s.g, s.g::numeric from (select
> > (random()*1000000000)::int as g from generate_series(1,10000000)) s;
> > explain analyze select distinct t from text10m;
>
> Note that you won't get what Postgres considers to be the C collation
> unless you specify "collate C" -- "C.UTF-8" is the C collation exposed
> by glibc. The difference matters a lot, because only the former can
> use abbreviated keys (unless you manually #define TRUST_STRXFRM). And
> even without abbreviated keys it's probably still significantly faster
> for other reasons.

Thank you. I reran with:

create table text10m2(t text collate "C", i int, n numeric);
-- same data, same queries

And the new table is:

Plan     | work | 10M  | 100M INT4 | 100M | 10M  | 10M
         | mem  | INT4 | 10M grps  | INT4 | TEXT | TEXTC
---------+------+------+-----------+------+------+------
HashAgg  | 4MB  |   88 |        63 |   82 |   78 |    80
HashAgg  | 1TB  |   41 |        37 |   33 |   38 |    43
Sort     | 4MB  |  182 |       188 |  174 |   37 |   146
Sort     | 1TB  |  184 |       231 |  189 |   30 |   149
HashAgg* | 4MB  |  192 |       131 |  178 |  166 |   176

*: patched

For the 'COLLATE "C"' case, the costs still come out almost the same
between HashAgg and Sort, but the runtimes are much closer. So even if
it did flip the plan from HashAgg to Sort, it would go from 9.5s
(HashAgg) to 12s (Sort), which is not so bad.

So the patched version looks good to me at this point. It accounts for
Tomas's observations about IO:

"The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same."

by penalizing HashAgg disk costs by 2X. The patch also accounts for his
other observation about missing CPU costs by costing the spilling.

Tomas framed the CPU costs as the cost of the extra lookups, but the
extra lookups happen only in the cases where a tuple misses in the hash
table and needs to spill. So I think it's reasonable to consider the
extra lookups as a part of the spill cost.

The remaining problems are:

* comparison costs for Sort should be adjusted to make them relatively
  consistent between data types
* in-memory HashAgg is unfairly favored in a lot of cases

I don't think either of those problems needs to be (or should be) fixed
in 13, but we can revisit in 14 if there are reports of bad plans.

Regards,
        Jeff Davis
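
P.S. In case it helps anyone reproducing the table: each cell can be
timed along these lines (a sketch, assuming the text10m2 table above;
the enable_* GUCs just force the planner's hand at a given work_mem):

SET work_mem = '4MB';       -- or '1TB' for the in-memory rows
SET enable_sort = off;      -- time the HashAgg plan
EXPLAIN ANALYZE SELECT DISTINCT t FROM text10m2;
RESET enable_sort;
SET enable_hashagg = off;   -- time the Sort plan
EXPLAIN ANALYZE SELECT DISTINCT t FROM text10m2;
RESET enable_hashagg;

(The HashAgg* row of course additionally requires a build with the
patch applied.)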
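
Relatedly, the spill volume behind the 2X disk penalty can also be
eyeballed from within Postgres -- this is logical temp-file usage, not
the physical I/O that iosnoop measures -- since the v13 instrumentation
makes EXPLAIN ANALYZE report it for both spilling operators:

SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT DISTINCT t FROM text10m2;
-- HashAggregate shows:  Batches: ...  Memory Usage: ...kB  Disk Usage: ...kB
-- Sort shows:           Sort Method: external merge  Disk: ...kB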