Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Trouble with hashagg spill I/O pattern and costing
Msg-id 88dd39d782178539d7ee46f98b98579940e2a12e.camel@j-davis.com
In response to Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
> I think there are two related problems - with costing and with
> excessive I/O due to using logical tapes.

Thank you for the detailed analysis. I am still digesting this
information.

> This kinda makes me question whether logical tapes are the right tool
> for hashagg. I've read the explanation in logtape.c why it's about the
> same amount of I/O as using separate files, but IMO that only really
> works for I/O patterns similar to merge sort - the more I think about
> this, the more I'm convinced we should just do what hashjoin is doing.

Fundamentally, Sort writes sequentially and reads randomly, while
HashAgg writes randomly and reads sequentially.

If the random writes of HashAgg end up too fragmented on disk, then
the "sequential" reads are not really sequential anyway. The only way
to avoid fragmentation on disk is to preallocate space for the
tape/file.

BufFile (relying more on the OS) would probably do a better job of
preallocating the disk space in a useful way; whereas logtape.c makes
it easier to manage buffers and the overall number of files created
(thereby allowing higher fanout of partitions).

We have a number of possibilities here:

1. Improve costing to reflect that HashAgg is creating more random
I/Os than Sort.
2. Reduce the partition fanout in the hopes that the OS does a better
job with readahead.
3. Switch back to BufFile, in which case we probably need to reduce the
fanout for other reasons.
4. Change logtape.c to allow preallocation or to write in larger
blocks.
5. Change BufFile to allow more control over buffer usage, and switch
to that.

#1 or #2 are the least invasive, and I think we can get a satisfactory
solution by combining those.

I saw good results with the high fanout and low work_mem when there is
still a lot of system memory. That's a nice benefit, but perhaps it's
safer to use a lower fanout (which will lead to recursion) until we
get a better handle on the I/O patterns.

Perhaps you could try recompiling with a lower maximum number of
partitions and rerunning the query? How much would we have to lower it
for either the cost to approach reality or the OS readahead to become
effective?

Regards,
    Jeff Davis
