Re: logtape.c stats don't account for unused "prefetched" block numbers - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: logtape.c stats don't account for unused "prefetched" block numbers
Date
Msg-id CAH2-Wzn_o7tE2+hRVvwSFghRb75AJ5g-nqGzDUqLYMexjOAe=g@mail.gmail.com
Whole thread Raw
In response to Re: logtape.c stats don't account for unused "prefetched" block numbers  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: logtape.c stats don't account for unused "prefetched" block numbers  (Peter Geoghegan <pg@bowt.ie>)
Re: logtape.c stats don't account for unused "prefetched" block numbers  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Tue, Sep 8, 2020 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Preallocation showed significant gains for HashAgg, and BufFile doesn't
> support sparse writes. So, for HashAgg, it seems like we should just
> update the comment and consider it the price of using BufFile.

> For Sort, we can just disable preallocation.

+1.

I think that you can push sort-no-prealloc.patch without delay. That
looks good to me.

> Right now, it seems nBlocksAllocated means "number of blocks returned
> by ltsGetFreeBlock(), plus nHoleBlocks".

Just to be clear: I'm assuming that we must honor the original intent
of earlier code in my remarks here (in particular, code added by
7ac4a389a7d). This may not be exactly where we end up, but it's a good
starting point.

nHoleBlocks was added by the parallel CREATE INDEX commit in 2018,
which added BufFile/logtape.c concatenation. Whereas nBlocksAllocated
was added by the earlier bugfix commit 7ac4a389a7d in 2017 (the bugfix
I've referenced several times upthread). Clearly nBlocksAllocated
cannot be defined in terms of some other thing that wasn't there when
it was first added (I refer to nHoleBlocks). In general nHoleBlocks
can only be non-zero when the logical tapeset has been unified in a
leader process (for a parallel CREATE INDEX).

nBlocksAllocated is not the same thing as nBlocksWritten, though the
difference is more subtle than you suggest. nBlocksAllocated actually
means (or should mean) "the number of blocks allocated to the file",
which is usually the same thing that a stat() call or tools like "ls"
are expected report for the underlying temp file once the merge phase
of an external sort is reached (assuming that you only need one temp
file for a BufFile, and didn't use parallelism/concatenation, which is
the common case). That's why nBlocksAllocated is what
LogicalTapeSetBlocks() returns (pretty much). At least, the original
post-7ac4a389a7d version of LogicalTapeSetBlocks() was just "return
nBlocksAllocated;". nHoleBlocks was added for parallel CI, but that
was only supposed to compensate for the holes left behind by
concatenation/parallel sort, without changing the logtape.c space
management design in any fundamental way.

Obviously you must be wondering what the difference is, if it's not
just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal
to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but
it's almost always equal in practice (again, barring nHoleBlocks !=
0). It's possible that a tuplesort will not have flushed the last
block at a point when LogicalTapeSetBlocks() is called -- it will have
allocated the block, but not yet written it to the BufFile. IOW, as
far as tuplesort.c is concerned the data is written to tape, but it
happens to not have been passed through to the OS via write(), or even
passed through to BufFileWrite() -- it happens to still be in one of
the small per-tape write buffers. When this occurs, a small amount of
dirty data in said per-tape buffer is considered written by
tuplesort.c, but from the point of view of logtape.c it is allocated
but not yet "written" (by which I mean not yet passed to buffile.c,
which actually does its own buffering, which it can neglect to flush
immediately in turn).

It's possible that tuplesort.c will need to call
LogicalTapeSetBlocks() at an earlier point after all tuples are
written but before they're "flushed" in logtape.c/buffile.c. We need
to avoid confusion when that happens. We want to insulate tuplesort.c
from implementation details that are private to logtape.c and/or
buffile.c. Bear in mind that nBlocksAllocated was originally only ever
supposed to have a value equal to nBlocksWritten, or the value
nBlocksWritten + 1. It is reasonable to want to hide the buffering
from LogicalTapeSetBlocks() once you realize that this mechanism is
only supposed to smooth-over an edge case involving one extra block
that will be written out in a moment anyway.

What does all of this mean for the new preallocation stuff that
benefits HashAggs that spill? Well, I'm not sure. I was specifically
concerned that somebody would end up misusing the ltsWriteBlock()
allocated-but-not-written thing in this way back in 2017, and said so
at the time -- that's why commit 7ac4a389a7d added comments about all
this to ltsWriteBlock(). For external sorts, that we're agreed won't
be using preallocation anyway, I think that we should go back to
reporting allocated blocks from LogicalTapeSetBlocks() -- very often
this is nBlocksWritten, but occasionally it's nBlocksWritten + 1. I
haven't yet refreshed my memory on the exact details of when you get
one behavior rather than the other, but I know it is possible in
practice with a tuplesort on Postgres 12. It might depend on subtle
issues like the alignment with BufFile segments -- see my test case
from 2017 to get an idea of how to make it easier to reveal problems
in this area:

https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com

We still need to put the reliance on ltsWriteBlock() allocating many
blocks before they've been logically written on some kind of formal
footing for Postgres 13 -- it is now possible that an all-zero block
will be left behind even after we're done writing and have flushed all
temp buffers, which is a new thing. In cases when the
zero-block-written thing happened on Postgres 12, we would later flush
out a block that overwrote every zero block -- that happened reliably.
ltsWriteBlock()'s loop only "preallocated" blocks it *knew* would get
filled with real data shortly afterwards, as an implementation
expedient -- not as an optimization. This is no longer the case.

At a minimum, we need to update the old ltsWriteBlock()
allocated-but-not-written comments to acknowledge that the HashAgg
case exists and has different concerns. We must also determine whether
we have the same issue with written-but-not-yet-flushed data for the
new nodeAgg.c caller. You're not doing the ltsWriteBlock()
loop-that-writes-zero-blocks thing because you have an unflushed
buffer from another tape -- you're doing it to preallocate and avoid
possible fragmentation. I'm mostly okay with doing the preallocation
that way, but that needs to be reconciled with the original design.
And the original design needs to continue to do the same things for
tuplesort.c, and maybe nodeAgg.c, too.

I think that the return value of LogicalTapeSetBlocks() should be at
least nBlocksWritten, while also including blocks that we know that
flushing dirty buffered data out will write in a moment, too (note
that I'm still pretending nHoleBlocks doesn't exist because it's not
important in my remarks here). IOW, it ought to include preallocated
blocks (for HashAgg), while not failing to count one extra block that
happens to still be buffered but is written as far as the logtape.c
caller is concerned (certainly for tuplesort caller, and maybe for
HashAgg caller too).

> nBlocksWritten seems to mean "the logical size of the BufFile". The
> BufFile can have holes in it after concatenation, but from the
> perspective of logtape.c, nBlocksWritten seems like a better fit for
> instrumentation purposes. So I'd be inclined to return (nBlocksWritten
> - nHoleBlocks).
>
> The only thing I can think of that would be better is if BufFile
> tracked for itself the logical vs. physical size, which might be a good
> improvement to make (and would mean that logtape.c wouldn't be
> responsible for tracking the holes itself).

I don't really think that that's workable, for what it's worth. The
"holes" left behind by concatenation (and counted by nHoleBlocks) are
ranges that logtape.c can never reuse that are "between" worker tapes.
They are necessary because logtape.c needs to be able to read back
block number metadata from worker temp files (it makes sense of them
by applying an offset). ISTM that the logical vs physical size
distinction will have to be tracked by logtape.c for as long as it
buffers data for writes. It's the natural way to do it IMV.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Logical Replication - detail message with names of missing columns
Next
From: Peter Smith
Date:
Subject: Re: Implement UNLOGGED clause for COPY FROM