Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Trouble with hashagg spill I/O pattern and costing
Msg-id 88dd39d782178539d7ee46f98b98579940e2a12e.camel@j-davis.com
In response to Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
> I think there are two related problems - with costing and with
> excessive I/O due to using logical tapes.

Thank you for the detailed analysis. I am still digesting this
information.

> This kinda makes me question whether logical tapes are the right tool
> for hashagg. I've read the explanation in logtape.c why it's about the
> same amount of I/O as using separate files, but IMO that only really
> works for I/O patterns similar to merge sort - the more I think about
> this, the more I'm convinced we should just do what hashjoin is doing.

Fundamentally, Sort writes sequentially and reads randomly, while
HashAgg writes randomly and reads sequentially.

If the random writes of HashAgg end up too fragmented on disk, then
the "sequential" reads are not really sequential anyway. The only way
to avoid fragmentation on disk is to preallocate space for the
tape/file.

BufFile (relying more on the OS) would probably do a better job of
preallocating the disk space in a useful way; whereas logtape.c makes
it easier to manage buffers and the overall number of files created
(thereby allowing higher fanout of partitions).

We have a number of possibilities here:

1. Improve costing to reflect that HashAgg is creating more random
I/Os than Sort.
2. Reduce the partition fanout in the hopes that the OS does a better
job with readahead.
3. Switch back to BufFile, in which case we probably need to reduce the
fanout for other reasons.
4. Change logtape.c to allow preallocation or to write in larger
blocks.
5. Change BufFile to allow more control over buffer usage, and switch
to that.

#1 or #2 are the least invasive, and I think we can get a satisfactory
solution by combining those.

I saw good results with the high fanout and low work_mem when there is
still a lot of system memory. That's a nice benefit, but perhaps it's
safer to use a lower fanout (which will lead to recursion) until we
get a better handle on the I/O patterns.

Perhaps you could try recompiling with a lower maximum number of
partitions and rerunning the query? How much would we have to lower it
for either the cost to approach reality or the OS readahead to become
effective?

Regards,
    Jeff Davis
