From: Tomas Vondra
Subject: Re: Trouble with hashagg spill I/O pattern and costing
Date: Thu, 21 May 2020
Msg-id: 20200521143040.zgn7ealdasfkmfcb@development
In response to: Re: Trouble with hashagg spill I/O pattern and costing (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Trouble with hashagg spill I/O pattern and costing
List: pgsql-hackers

On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote:
>On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:
>>
>>...
>>
>>The problem is that the hashagg plan runs in ~1400 seconds, while the
>>groupagg only takes ~360. And per explain analyze, the difference really
>>is in the aggregation - if we subtract the seqscan, the sort+groupagg
>>takes about 310s:
>>
>>   ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
>>               Sort Key: lineitem_1.l_partkey
>>               Sort Method: external merge  Disk: 3569544kB
>>               ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)
>>
>>while the hashagg takes ~1330s:
>>
>>   ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         Planned Partitions: 128
>>         Peak Memory Usage: 4249 kB
>>         Disk Usage: 26321840 kB
>>         HashAgg Batches: 16512
>>         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)
>>
>>And that's while only writing 26GB, compared to 35GB in the sorted plan,
>>and with cost being ~16M vs. ~43M (so roughly inverse).
>>
>
>I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
>in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
>writes out *way less* data to the temp file.
>
>The reason is revealed by explain verbose:
>
>  ->  GroupAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Sort
>              Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>              Sort Key: lineitem_1.l_partkey
>              ->  Seq Scan on public.lineitem lineitem_1
>                    Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>
>  ->  HashAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Seq Scan on public.lineitem lineitem_1
>              Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
>                      lineitem_1.l_suppkey, lineitem_1.l_linenumber,
>                      lineitem_1.l_quantity, lineitem_1.l_extendedprice,
>                      lineitem_1.l_discount, lineitem_1.l_tax,
>                      lineitem_1.l_returnflag, lineitem_1.l_linestatus,
>                      lineitem_1.l_shipdate, lineitem_1.l_commitdate,
>                      lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
>                      lineitem_1.l_shipmode, lineitem_1.l_comment
>
>It seems that in the hashagg case we're not applying projection in the
>seqscan, forcing us to serialize far more data (essentially the whole
>lineitem table).
>
>It's probably still worth tweaking the I/O pattern, I think.
>

OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to
CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched
version that pre-allocates 128 blocks, and the results seem pretty nice
(timings in seconds):

     work_mem    sort      hash      hash+tlist
    -------------------------------------------
        4MB       331       478             188
      128MB       222       434             210

which I guess is what we wanted ...
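
Since the fix really is a one-liner, here's the gist of it - a sketch
of the change in create_agg_plan() in
src/backend/optimizer/plan/createplan.c (the attached patch is
authoritative, so treat the exact hunk as approximate):

        /*
         * Roughly, the tlist flags in createplan.c mean:
         *   CP_EXACT_TLIST - plan must return the specified tlist
         *   CP_SMALL_TLIST - prefer narrower tlists
         *   CP_LABEL_TLIST - tlist must contain sortgrouprefs
         *
         * Asking the child only for sortgroupref labels lets the
         * seqscan emit whole lineitem rows, which a spilling hashagg
         * then writes to the temp files. Asking for a small tlist
         * instead forces a projection below the agg, so only the
         * grouping column and the aggregate input get serialized.
         */
    -   subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
    +   subplan = create_plan_recurse(root, best_path->subpath, CP_SMALL_TLIST);

(Note this requests the narrow tlist even when the hashagg is not
expected to spill; whether that should be made conditional is a
separate question.)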

I'll give it a try on the other machine (temp on SATA), but I don't see
why it would not behave just as nicely.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

