Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Trouble with hashagg spill I/O pattern and costing
Date
Msg-id 20200521191318.gw3xjivrxracvzmg@development
In response to Re: Trouble with hashagg spill I/O pattern and costing  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Trouble with hashagg spill I/O pattern and costing
Re: Trouble with hashagg spill I/O pattern and costing
List pgsql-hackers
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And is there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
>> instead of just a single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>

I think the PoC patch goes in the right direction. I have two ideas for
improving it a bit:

1) Instead of assigning the pages one by one, we can easily extend the
API to allow getting a range of blocks, so that we don't need to call
ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
with the requested number of blocks. And we could keep just the min/max
of the free blocks in that range, not an array with a fixed number of
elements.

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.
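
To make (1) and (2) concrete, here is a minimal sketch of how they might
fit together. It is a toy model, not the actual logtape.c code: all the
toy_* names and struct fields are hypothetical, the range allocator only
extends the file (it ignores the freelist entirely), and the 32/1024
limits are simply the numbers mentioned in this thread.

```c
#include <assert.h>

#define TOY_PREALLOC_MIN 32     /* initial range size, as in the PoC */
#define TOY_PREALLOC_MAX 1024   /* cap suggested above; an assumption */

/* Toy stand-in for the tape set: only tracks how far the file extends. */
typedef struct ToyTapeSet
{
    long nBlocksAllocated;      /* blocks handed out so far */
} ToyTapeSet;

/* Per-tape state: just the bounds of the pre-allocated range. */
typedef struct ToyTape
{
    long next_block;            /* next unused block in the range */
    long end_block;             /* one past the last block; empty if equal */
    int  prealloc_size;         /* size of the next range to request */
} ToyTape;

/*
 * Hypothetical range API in the spirit of ltsGetFreeBlockRange: reserve
 * nblocks contiguous blocks at the end of the file, return the first one.
 */
static long
toy_get_free_block_range(ToyTapeSet *lts, int nblocks)
{
    long first = lts->nBlocksAllocated;

    assert(nblocks > 0);
    lts->nBlocksAllocated += nblocks;
    return first;
}

/* Return the next block for this tape, refilling the range if exhausted. */
static long
toy_tape_next_block(ToyTapeSet *lts, ToyTape *tape)
{
    if (tape->next_block == tape->end_block)
    {
        tape->next_block = toy_get_free_block_range(lts, tape->prealloc_size);
        tape->end_block = tape->next_block + tape->prealloc_size;

        /* self-tuning: double the next request, up to the cap */
        if (tape->prealloc_size < TOY_PREALLOC_MAX)
            tape->prealloc_size *= 2;
    }
    return tape->next_block++;
}
```

With two tapes starting at TOY_PREALLOC_MIN, the first tape gets blocks
0..31, the second gets 32..95 only after it asks, and so on, so each
tape writes to contiguous runs that grow as the tape grows.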

IIUC the danger of pre-allocating blocks is that we might not fill them,
resulting in a temp file much larger than necessary. It might be harmless
on some (most?) current filesystems that don't actually allocate space
for blocks that are never written, but it also confuses our accounting
of temporary file sizes. So we should try to limit that, and growing the
number of pre-allocated blocks over time seems reasonable.
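
As an illustration of the accounting issue, the sketch below separates
blocks that were reserved from blocks that were actually handed out for
writing. It is only a sketch under assumptions: the ToyRange struct and
the toy_* functions are hypothetical, and it assumes each tape tracks
its pre-allocated range as a [next_block, end_block) pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-tape range: blocks [next_block, end_block) are unused. */
typedef struct ToyRange
{
    long next_block;
    long end_block;
} ToyRange;

/* Blocks that were pre-allocated to some tape but never handed out. */
static long
toy_unwritten_blocks(const ToyRange *tapes, size_t ntapes)
{
    long unused = 0;

    for (size_t i = 0; i < ntapes; i++)
    {
        assert(tapes[i].next_block <= tapes[i].end_block);
        unused += tapes[i].end_block - tapes[i].next_block;
    }
    return unused;
}

/* Blocks actually used: everything reserved minus the unused tails. */
static long
toy_written_blocks(long nBlocksAllocated,
                   const ToyRange *tapes, size_t ntapes)
{
    return nBlocksAllocated - toy_unwritten_blocks(tapes, ntapes);
}
```

For example, with 96 blocks reserved, one tape at {5, 32} and another at
{40, 96}, only 13 blocks are in use while 83 are reserved but unwritten.
The unused tail per tape is bounded by the current range size, which is
why starting small and growing limits the overshoot.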

Both (1) and (2) seem fairly simple, not much more complex than the
current PoC patch.

I also wonder if we could collect / report useful statistics about I/O
on the temporary file, not just the size. I mean, how many pages we've
written/read, how sequential it was, etc. But some of that is probably
only visible at the OS level (e.g. we have no insight into how the
kernel combines writes in page cache, etc.). This is clearly a matter
for v14, though.
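
Just to illustrate what such a counter might look like on our side, a
minimal sketch follows. Everything here is hypothetical (the ToyIOStats
struct and toy_iostats_* functions do not exist anywhere), and it treats
a write as "sequential" only when it targets the block right after the
previous write, which says nothing about what the kernel does with it.

```c
#include <assert.h>

/* Hypothetical per-file write statistics. */
typedef struct ToyIOStats
{
    long nwrites;       /* total block writes */
    long nsequential;   /* writes to the block right after the previous one */
    long last_block;    /* block number of the previous write */
} ToyIOStats;

static void
toy_iostats_init(ToyIOStats *stats)
{
    stats->nwrites = 0;
    stats->nsequential = 0;
    stats->last_block = -1;
}

/* Record one block write and classify it as sequential or not. */
static void
toy_iostats_record_write(ToyIOStats *stats, long blocknum)
{
    assert(blocknum >= 0);

    /* the very first write has no predecessor, so it never counts */
    if (stats->nwrites > 0 && blocknum == stats->last_block + 1)
        stats->nsequential++;

    stats->nwrites++;
    stats->last_block = blocknum;
}
```

The ratio nsequential / nwrites would give a rough idea of how
sequential the write pattern looks from our side of the page cache.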


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


