Thread: logtape.c stats don't account for unused "prefetched" block numbers
Commit 896ddf9b added preallocation to logtape.c to avoid excessive fragmentation in the context of hash aggs that spill and have many batches/tapes. Apparently the preallocation doesn't actually perform any filesystem operations, so the new mechanism should be zero overhead when "preallocated" blocks aren't actually used after all (right?). However, I notice that this breaks the statistics shown by things like trace_sort, and even EXPLAIN ANALYZE. LogicalTapeSetBlocks() didn't get the memo about preallocation.

The easiest way to spot the issue is to compare trace_sort output on v13 with output for the same case in v12 -- the "%u disk blocks used" statistics are consistently higher on v13, especially for cases with many tapes. I spotted the bug when I noticed that v13 external sorts reportedly use more or less disk space when fewer or more tapes are involved (again, this came from trace_sort). That doesn't make sense -- the total amount of space used for external sort temp files should be practically fixed, aside from insignificant rounding effects. Reducing the amount of memory by orders of magnitude in a Postgres 12 tuplesort will hardly affect the "%u disk blocks used" trace_sort output at all. That's what we need to get back to.

This bug probably won't be difficult to fix. Actually, we have had similar problems in the past. The fix could be as simple as teaching LogicalTapeSetBlocks() about this new variety of "sparse allocation". Although maybe the preallocation stuff should somehow be rolled into the much older nHoleBlocks stuff. Unsure.

-- Peter Geoghegan
On 2020-Jul-30, Peter Geoghegan wrote: > Commit 896ddf9b added prefetching to logtape.c to avoid excessive > fragmentation in the context of hash aggs that spill and have many > batches/tapes. Apparently the preallocation doesn't actually perform > any filesystem operations, so the new mechanism should be zero > overhead when "preallocated" blocks aren't actually used after all > (right?). However, I notice that this breaks the statistics shown by > things like trace_sort, and even EXPLAIN ANALYZE. > LogicalTapeSetBlocks() didn't get the memo about preallocation. This open item hasn't received any replies. I think Peter knows how to fix it already, but no patch has been posted ... It'd be good to get a move on it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This open item hasn't received any replies. I think Peter knows how to
> fix it already, but no patch has been posted ... It'd be good to get a
> move on it.

I picked this up again today. It's not obvious what we should do. It's true that the instrumentation doesn't accurately reflect the on-disk temp file overhead. That is, it doesn't agree with the high watermark temp file size I see in the pgsql_tmp directory, which is a clear regression compared to earlier releases (where tuplesort was the only user of logtape.c). But it's also true that we need to use somewhat more temp file space for a tuplesort in Postgres 13, because we use the preallocation stuff for tuplesort -- though probably without getting any benefit for it.

I haven't figured out how to correct the accounting just yet. In fact, I'm not sure that this isn't some kind of leak of blocks from the freelist, which shouldn't happen at all. The code is complicated enough that I wasn't able to work that out in the couple of hours I spent on it today. I can pick it up again tomorrow.

BTW, this MaxAllocSize freeBlocksLen check is wrong -- doesn't match the later repalloc allocation:

    if (lts->nFreeBlocks >= lts->freeBlocksLen)
    {
        /*
         * If the freelist becomes very large, just return and leak this free
         * block.
         */
        if (lts->freeBlocksLen * 2 > MaxAllocSize)
            return;

        lts->freeBlocksLen *= 2;
        lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
                                            lts->freeBlocksLen * sizeof(long));
    }

-- Peter Geoghegan
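A standalone sketch of the mismatch Peter is pointing at: the later repalloc() request is freeBlocksLen * sizeof(long) bytes, so the guard needs to compare bytes, not the element count. The names below only mimic logtape.c, MaxAllocSize is hard-coded here for illustration, and the eventual fix may well look different:

```c
#include <assert.h>
#include <stddef.h>

#define MaxAllocSize ((size_t) 0x3fffffff)  /* 1 GB - 1, as in memutils.h */

/*
 * Grow a freelist of 'len' longs, as the quoted code does when it fills up.
 * Returns the doubled length, or 0 to signal "leak this free block".  The
 * guard checks the byte size of the upcoming repalloc() request, which is
 * len * 2 * sizeof(long) -- comparing the element count alone, as the
 * quoted code does, lets the byte-size request blow past MaxAllocSize.
 */
static long
grow_freelist(long len)
{
    if ((size_t) len * 2 * sizeof(long) > MaxAllocSize)
        return 0;
    return len * 2;
}
```

With the quoted element-count check, a freelist of MaxAllocSize / sizeof(long) entries would pass the guard and then ask repalloc() for roughly 2 * MaxAllocSize bytes.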
On Tue, Sep 1, 2020 at 5:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > This open item hasn't received any replies. I think Peter knows how to
> > fix it already, but no patch has been posted ... It'd be good to get a
> > move on it.
>
> I picked this up again today.

One easy way to get logtape.c to behave in the same way as Postgres 12 for a multi-pass external sort (i.e. to use fewer blocks and to report the number of blocks used accurately) is to #define TAPE_WRITE_PREALLOC_MIN and TAPE_WRITE_PREALLOC_MAX to 1. So it looks like the problem is in the preallocation stuff added by commit 896ddf9b3cd, and not the new heap-based free list logic added by commit c02fdc92230. That's good news, because it means that the problem may be fairly well isolated -- commit 896ddf9b3cd was a pretty small and isolated thing.

The comments in ltsWriteBlock() added by the 2017 bugfix commit 7ac4a389a7d clearly say that the zero block writing stuff is only supposed to happen at the edge of a tape boundary, which ought to be rare -- see the comment block in ltsWriteBlock(). And yet the new preallocation stuff explicitly relies on ltsWriteBlock() writing zero blocks much more frequently. I'm concerned that that can result in increased and unnecessary I/O, especially for sorts, but also for hash aggs that spill. I'm also concerned that having preallocated-but-unwritten blocks confuses the accounting used by trace_sort/LogicalTapeSetBlocks().

Separately, it's possible to make the trace_sort/LogicalTapeSetBlocks() instrumentation agree with the filesystem by replacing the use of nBlocksAllocated within LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the instrumentation correct without changing the current behavior at all.
But I'm not ready to endorse that approach, since it's not quite clear what nBlocksAllocated and nBlocksWritten mean right now -- those two fields were both added by the aforementioned 2017 bugfix commit, which introduced the "allocated vs written" distinction in the first place. We should totally disable the preallocation stuff for external sorts in any case. External sorts are naturally characterized by relatively large, distinct batching of reads and writes -- preallocation cannot help. -- Peter Geoghegan
On Sat, 2020-09-05 at 12:03 -0700, Peter Geoghegan wrote: > We should totally disable the preallocation stuff for external sorts > in any case. External sorts are naturally characterized by relatively > large, distinct batching of reads and writes -- preallocation cannot > help. Patch attached to disable preallocation for Sort. I'm still looking into the other concerns. Regards, Jeff Davis
Attachment
On Sat, 2020-09-05 at 12:03 -0700, Peter Geoghegan wrote: > The comments in ltsWriteBlock() added by the 2017 bugfix commit > 7ac4a389a7d clearly say that the zero block writing stuff is only > supposed to happen at the edge of a tape boundary, which ought to be > rare -- see the comment block in ltsWriteBlock(). And yet the new > preallocation stuff explicitly relies on that it writing zero blocks > much more frequently. I'm concerned that that can result in increased > and unnecessary I/O, especially for sorts, but also for hash aggs > that > spill. I'm also concerned that having preallocated-but-allocated > blocks confuses the accounting used by > trace_sort/LogicalTapeSetBlocks(). Preallocation showed significant gains for HashAgg, and BufFile doesn't support sparse writes. So, for HashAgg, it seems like we should just update the comment and consider it the price of using BufFile. (Aside: is there a reason why BufFile doesn't support sparse writes, or is it just a matter of implementation?) For Sort, we can just disable preallocation. > Separately, it's possible to make the > trace_sort/LogicalTapeSetBlocks() instrumentation agree with the > filesystem by replacing the use of nBlocksAllocated within > LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the > instrumentation correct without changing the current behavior at all. > But I'm not ready to endorse that approach, since it's not quite > clear > what nBlocksAllocated and nBlocksWritten mean right now -- those two > fields were both added by the aforementioned 2017 bugfix commit, > which > introduced the "allocated vs written" distinction in the first place Right now, it seems nBlocksAllocated means "number of blocks returned by ltsGetFreeBlock(), plus nHoleBlocks". nBlocksWritten seems to mean "the logical size of the BufFile". The BufFile can have holes in it after concatenation, but from the perspective of logtape.c, nBlocksWritten seems like a better fit for instrumentation purposes. 
So I'd be inclined to return (nBlocksWritten - nHoleBlocks). The only thing I can think of that would be better is if BufFile tracked for itself the logical vs. physical size, which might be a good improvement to make (and would mean that logtape.c wouldn't be responsible for tracking the holes itself). Thoughts? Regards, Jeff Davis
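As a rough sketch of the two candidate definitions being compared here (the struct is a simplified stand-in, not the real LogicalTapeSet, and blocks_current assumes the v13 formula is roughly nBlocksAllocated minus nHoleBlocks):

```c
#include <assert.h>

/* Simplified stand-in for the counters kept by a LogicalTapeSet */
typedef struct
{
    long nBlocksAllocated;  /* blocks returned by ltsGetFreeBlock(), plus holes */
    long nBlocksWritten;    /* logical size of the BufFile, in blocks */
    long nHoleBlocks;       /* unusable blocks left by worker-tape concatenation */
} TapeCounters;

/* Roughly what LogicalTapeSetBlocks() reports today (an assumption) */
static long
blocks_current(const TapeCounters *c)
{
    return c->nBlocksAllocated - c->nHoleBlocks;
}

/* Jeff's proposal: report what actually reached the file */
static long
blocks_proposed(const TapeCounters *c)
{
    return c->nBlocksWritten - c->nHoleBlocks;
}
```

The difference only shows up when blocks have been allocated (e.g. preallocated) without ever being written.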
On Tue, Sep 8, 2020 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Preallocation showed significant gains for HashAgg, and BufFile doesn't
> support sparse writes. So, for HashAgg, it seems like we should just
> update the comment and consider it the price of using BufFile.
>
> For Sort, we can just disable preallocation.

+1. I think that you can push sort-no-prealloc.patch without delay. That looks good to me.

> Right now, it seems nBlocksAllocated means "number of blocks returned
> by ltsGetFreeBlock(), plus nHoleBlocks".

Just to be clear: I'm assuming that we must honor the original intent of earlier code in my remarks here (in particular, code added by 7ac4a389a7d). This may not be exactly where we end up, but it's a good starting point.

nHoleBlocks was added by the parallel CREATE INDEX commit in 2018, which added BufFile/logtape.c concatenation. Whereas nBlocksAllocated was added by the earlier bugfix commit 7ac4a389a7d in 2017 (the bugfix I've referenced several times upthread). Clearly nBlocksAllocated cannot be defined in terms of some other thing that wasn't there when it was first added (I refer to nHoleBlocks). In general nHoleBlocks can only be non-zero when the logical tapeset has been unified in a leader process (for a parallel CREATE INDEX).

nBlocksAllocated is not the same thing as nBlocksWritten, though the difference is more subtle than you suggest. nBlocksAllocated actually means (or should mean) "the number of blocks allocated to the file", which is usually the same thing that a stat() call or tools like "ls" are expected to report for the underlying temp file once the merge phase of an external sort is reached (assuming that you only need one temp file for a BufFile, and didn't use parallelism/concatenation, which is the common case). That's why nBlocksAllocated is what LogicalTapeSetBlocks() returns (pretty much). At least, the original post-7ac4a389a7d version of LogicalTapeSetBlocks() was just "return nBlocksAllocated;".
nHoleBlocks was added for parallel CREATE INDEX, but that was only supposed to compensate for the holes left behind by concatenation/parallel sort, without changing the logtape.c space management design in any fundamental way.

Obviously you must be wondering what the difference is, if it's not just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but it's almost always equal in practice (again, barring nHoleBlocks != 0). It's possible that a tuplesort will not have flushed the last block at a point when LogicalTapeSetBlocks() is called -- it will have allocated the block, but not yet written it to the BufFile. IOW, as far as tuplesort.c is concerned the data is written to tape, but it happens to not have been passed through to the OS via write(), or even passed through to BufFileWrite() -- it happens to still be in one of the small per-tape write buffers. When this occurs, a small amount of dirty data in said per-tape buffer is considered written by tuplesort.c, but from the point of view of logtape.c it is allocated but not yet "written" (by which I mean not yet passed to buffile.c, which actually does its own buffering, which it can neglect to flush immediately in turn).

It's possible that tuplesort.c will need to call LogicalTapeSetBlocks() at an earlier point, after all tuples are written but before they're "flushed" in logtape.c/buffile.c. We need to avoid confusion when that happens. We want to insulate tuplesort.c from implementation details that are private to logtape.c and/or buffile.c. Bear in mind that nBlocksAllocated was originally only ever supposed to have a value equal to nBlocksWritten, or the value nBlocksWritten + 1. It is reasonable to want to hide the buffering from LogicalTapeSetBlocks() once you realize that this mechanism is only supposed to smooth over an edge case involving one extra block that will be written out in a moment anyway.
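A toy model of that edge case (not the real logtape.c write path; all names here are invented) showing how nBlocksAllocated can briefly run one ahead of nBlocksWritten until the last partial block is flushed:

```c
#include <assert.h>

#define BLCKSZ 8192

/* Toy model of one tape's write path; all names are invented */
typedef struct
{
    long nBlocksAllocated;  /* blocks assigned to tape data so far */
    long nBlocksWritten;    /* blocks actually handed to the BufFile */
    int  pos;               /* bytes sitting in the per-tape write buffer */
} MockLTS;

/* tuplesort.c-style write: data lands in the per-tape buffer first */
static void
mock_write(MockLTS *lts, int nbytes)
{
    if (lts->pos == 0)
        lts->nBlocksAllocated++;        /* block allocated on first use */
    lts->pos += nbytes;
    while (lts->pos >= BLCKSZ)          /* full blocks are flushed at once */
    {
        lts->nBlocksWritten++;
        lts->pos -= BLCKSZ;
        if (lts->pos > 0)
            lts->nBlocksAllocated++;    /* overflow starts the next block */
    }
}

/* Flushing writes out the final partial block, if any */
static void
mock_flush(MockLTS *lts)
{
    if (lts->pos > 0)
    {
        lts->nBlocksWritten++;
        lts->pos = 0;
    }
}
```

Until mock_flush() runs, the last partial block is "written" from the caller's point of view but only allocated from the tape set's.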
What does all of this mean for the new preallocation stuff that benefits HashAggs that spill? Well, I'm not sure. I was specifically concerned that somebody would end up misusing the ltsWriteBlock() allocated-but-not-written thing in this way back in 2017, and said so at the time -- that's why commit 7ac4a389a7d added comments about all this to ltsWriteBlock().

For external sorts, which we're agreed won't be using preallocation anyway, I think that we should go back to reporting allocated blocks from LogicalTapeSetBlocks() -- very often this is nBlocksWritten, but occasionally it's nBlocksWritten + 1. I haven't yet refreshed my memory on the exact details of when you get one behavior rather than the other, but I know it is possible in practice with a tuplesort on Postgres 12. It might depend on subtle issues like the alignment with BufFile segments -- see my test case from 2017 to get an idea of how to make it easier to reveal problems in this area:

https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com

We still need to put the reliance on ltsWriteBlock() allocating many blocks before they've been logically written on some kind of formal footing for Postgres 13 -- it is now possible that an all-zero block will be left behind even after we're done writing and have flushed all temp buffers, which is a new thing. In cases when the zero-block-written thing happened on Postgres 12, we would later flush out a block that overwrote every zero block -- that happened reliably. ltsWriteBlock()'s loop only "preallocated" blocks it *knew* would get filled with real data shortly afterwards, as an implementation expedient -- not as an optimization. This is no longer the case. At a minimum, we need to update the old ltsWriteBlock() allocated-but-not-written comments to acknowledge that the HashAgg case exists and has different concerns.
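The ltsWriteBlock() behavior under discussion can be sketched with a toy model (hypothetical names, absurdly small blocks): writing a block number beyond the current end of the file first pads the gap with zero blocks, and under preallocation some of those zero blocks may now never be overwritten:

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16
#define BLKSZ   4               /* tiny "blocks", purely for illustration */

static char file[NBLOCKS][BLKSZ];
static int  nWritten = 0;       /* blocks physically written so far */

/* Toy version of ltsWriteBlock(): pad any gap before blocknum with zeroes */
static void
write_block(int blocknum, const char *data)
{
    while (nWritten < blocknum)
        memset(file[nWritten++], 0, BLKSZ);     /* zero-fill skipped blocks */
    memcpy(file[blocknum], data, BLKSZ);
    if (blocknum == nWritten)
        nWritten++;
}
```

In Postgres 12 every zero-filled gap block was reliably overwritten with real data later; with preallocation that guarantee is gone.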
We must also determine whether we have the same issue with written-but-not-yet-flushed data for the new nodeAgg.c caller. You're not doing the ltsWriteBlock() loop-that-writes-zero-blocks thing because you have an unflushed buffer from another tape -- you're doing it to preallocate and avoid possible fragmentation. I'm mostly okay with doing the preallocation that way, but that needs to be reconciled with the original design. And the original design needs to continue to do the same things for tuplesort.c, and maybe nodeAgg.c, too.

I think that the return value of LogicalTapeSetBlocks() should be at least nBlocksWritten, while also including blocks that we know flushing dirty buffered data will write in a moment (note that I'm still pretending nHoleBlocks doesn't exist because it's not important in my remarks here). IOW, it ought to include preallocated blocks (for HashAgg), while not failing to count one extra block that happens to still be buffered but is written as far as the logtape.c caller is concerned (certainly for the tuplesort caller, and maybe for the HashAgg caller too).

> nBlocksWritten seems to mean "the logical size of the BufFile". The
> BufFile can have holes in it after concatenation, but from the
> perspective of logtape.c, nBlocksWritten seems like a better fit for
> instrumentation purposes. So I'd be inclined to return (nBlocksWritten
> - nHoleBlocks).
>
> The only thing I can think of that would be better is if BufFile
> tracked for itself the logical vs. physical size, which might be a good
> improvement to make (and would mean that logtape.c wouldn't be
> responsible for tracking the holes itself).

I don't really think that that's workable, for what it's worth. The "holes" left behind by concatenation (and counted by nHoleBlocks) are ranges that logtape.c can never reuse that are "between" worker tapes.
They are necessary because logtape.c needs to be able to read back block number metadata from worker temp files (it makes sense of them by applying an offset). ISTM that the logical vs physical size distinction will have to be tracked by logtape.c for as long as it buffers data for writes. It's the natural way to do it IMV. -- Peter Geoghegan
On Thu, Sep 10, 2020 at 6:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Obviously you must be wondering what the difference is, if it's not
> just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal
> to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but
> it's almost always equal in practice (again, barring nHoleBlocks != 0).

Noticing that you pushed a commit to disable preallocation for external sorts, I tried to determine if there are any remaining problems. As far as I can tell there are none -- evidently the loop logic in ltsWriteBlock() performs both its original task (per commit 7ac4a389a7d) and the new task of preallocation for its HashAggs-that-spill caller.

There is a case in the regression tests (including the Postgres 12 regression tests) that relies on the loop within ltsWriteBlock() for an external sort. FWIW, that happens in the "cluster clstr_tst4 using cluster_sort" cluster tuplesort. The trace_sort output (and the temp file size) is now consistent across versions 12 and 13.

I'll probably close out this open item tomorrow. I need to think about it some more, but right now everything looks good. I think I'll probably end up pushing a commit with more explanatory comments.

Thank you
-- Peter Geoghegan
On Fri, Sep 11, 2020 at 6:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'll probably close out this open item tomorrow. I need to think about > it some more, but right now everything looks good. I think I'll > probably end up pushing a commit with more explanatory comments. That said, we still need to make sure that the preallocation instrumentation for HashAggs-that-spill is sensible -- it has to actually match the temp file size. It would be awkward if we just used nBlocksWritten within LogicalTapeSetBlocks() in the case where we didn't preallocate (or in all cases). Not entirely sure what to do about that just yet. -- Peter Geoghegan
On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > It would be awkward if we just used nBlocksWritten within > LogicalTapeSetBlocks() in the case where we didn't preallocate (or in > all cases). Not entirely sure what to do about that just yet. I guess that that's the logical thing to do, as in the attached patch. What do you think, Jeff? -- Peter Geoghegan
Attachment
On Mon, 2020-09-14 at 14:24 -0700, Peter Geoghegan wrote: > On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It would be awkward if we just used nBlocksWritten within > > LogicalTapeSetBlocks() in the case where we didn't preallocate (or > > in > > all cases). Not entirely sure what to do about that just yet. > > I guess that that's the logical thing to do, as in the attached > patch. Hi Peter, In the comment in the patch, you say: "In practice this probably doesn't matter because we'll be called after the flush anyway, but be tidy." By which I assume you mean that LogicalTapeRewindForRead() will be called before LogicalTapeSetBlocks(). If that's the intention of LogicalTapeSetBlocks(), should we just make it a requirement that there are no open write buffers for any tapes when it's called? Then we could just use nBlocksWritten in both cases, right? (Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might be a mistake in HashAgg where it will keep the write buffers around longer than necessary. If I recall correctly, it was my intention to rewind for reading immediately after the batch was finished, which is why I made the read buffer lazily-allocated.) Regards, Jeff Davis
On 2020-Sep-14, Peter Geoghegan wrote: > On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It would be awkward if we just used nBlocksWritten within > > LogicalTapeSetBlocks() in the case where we didn't preallocate (or in > > all cases). Not entirely sure what to do about that just yet. > > I guess that that's the logical thing to do, as in the attached patch. I don't understand this patch. Or maybe I should say I don't understand the code you're patching. Why isn't the correct answer *always* nBlocksWritten? The comment in LogicalTapeSet says: "nBlocksWritten is the size of the underlying file, in BLCKSZ blocks." so if LogicalTapeSetBlocks wants to do what its comment says, that is, "Obtain total disk space currently used by a LogicalTapeSet, in blocks." then it seems like they're an exact match. Either that, or more than zero of those comments are lying. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 14, 2020 at 3:24 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I don't understand this patch. Or maybe I should say I don't understand > the code you're patching. Why isn't the correct answer *always* > nBlocksWritten? The comment in LogicalTapeSet says: I think that they are an exact match in practice (i.e. nBlocksWritten == nBlocksAllocated), given when and how we call LogicalTapeSetBlocks(). -- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:20 PM Jeff Davis <pgsql@j-davis.com> wrote:
> In the comment in the patch, you say:
>
> "In practice this probably doesn't matter because we'll be called after
> the flush anyway, but be tidy."
>
> By which I assume you mean that LogicalTapeRewindForRead() will be
> called before LogicalTapeSetBlocks().

Yeah, I'm pretty sure that that's an equivalent way of expressing the same idea. It appears that this assumption holds, though only when we're not using preallocation (i.e. it doesn't necessarily hold for the HashAggs-that-spill case, as I go into below).

> If that's the intention of LogicalTapeSetBlocks(), should we just make
> it a requirement that there are no open write buffers for any tapes
> when it's called? Then we could just use nBlocksWritten in both cases,
> right?

That does seem appealing. Perhaps it could be enforced by an assertion.

> (Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might
> be a mistake in HashAgg where it will keep the write buffers around
> longer than necessary. If I recall correctly, it was my intention to
> rewind for reading immediately after the batch was finished, which is
> why I made the read buffer lazily-allocated.)

If I add the assertion described above and run the regression tests, it fails within "select_distinct" (and at other points). This is the specific code:

--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1284,6 +1284,7 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
      * (In practice this probably doesn't matter because we'll be called after
      * the flush anyway, but be tidy.)
      */
+    Assert(lts->nBlocksWritten == lts->nBlocksAllocated);
     if (lts->enable_prealloc)
         return lts->nBlocksWritten;

Maybe the LogicalTapeRewindForRead() inconsistency you mention could be fixed, which would enable the simplification you suggested. What do you think?

-- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that they are an exact match in practice (i.e. nBlocksWritten > == nBlocksAllocated), given when and how we call > LogicalTapeSetBlocks(). Just to be clear: this is only true for external sorts. The preallocation stuff can make nBlocksAllocated quite a lot higher. That's probably why adding a new "Assert(lts->nBlocksWritten == lts->nBlocksAllocated)" assertion fails during the regression tests, though there might be other reasons as well (I'm thinking of the LogicalTapeRewindForRead() inconsistency Jeff mentioned). -- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If I add the assertion described above and run the regression tests,
> it fails within "select_distinct" (and at other points). This is the
> specific code:

This variant of the same assertion works fine:

+    Assert(lts->enable_prealloc || lts->nBlocksWritten == lts->nBlocksAllocated);

(This is hardly surprising, though.)

-- Peter Geoghegan
On Mon, 2020-09-14 at 15:54 -0700, Peter Geoghegan wrote: > Maybe the LogicalTapeRewindForRead() inconsistency you mention could > be fixed, which would enable the simplification you suggested. What > do > you think? Yes, it was apparently an oversight. Patch attached. RC1 was just stamped, are we in a sensitive time or is it still possible to backport this to REL_13_STABLE? If not, that's fine, I'll just commit it to master. It's a little less important after 9878b643, which reduced the overpartitioning issue. Regards, Jeff Davis
Attachment
On Mon, Sep 14, 2020 at 5:50 PM Jeff Davis <pgsql@j-davis.com> wrote: > Yes, it was apparently an oversight. Patch attached. This is closer to how logical tapes are used within tuplesort.c. I notice that this leads to about a 50% reduction in temp file usage for a test case involving very little work_mem (work_mem is set to 64). But it doesn't seem to make as much difference with more work_mem. It probably has something to do with recursion during spilling. > RC1 was just stamped, are we in a sensitive time or is it still > possible to backport this to REL_13_STABLE? Testing indicates that this still doesn't make "nBlocksWritten == nBlocksAllocated" when the instrumentation is used for HashAggs-that-spill. I'm not sure what I was talking about earlier when I connected this with the main/instrumentation issue, since preallocation used by logtape.c to help HashAggs-that-spill necessarily reserves blocks without writing them out for a while (the fires in California have made it difficult to be productive). You might write blocks out as zero blocks first, and then only write the real data later (overwriting the zero blocks). But no matter how the writes among tapes are interlaced, the fact is that nBlocksAllocated can exceed nBlocksWritten by at least one block per active tape. If we really wanted to ensure "nBlocksWritten == nBlocksAllocated", wouldn't it be necessary for LogicalTapeSetBlocks() to go through the remaining preallocated blocks from each tape and count the number of blocks "logically preallocated" (by ltsGetPreallocBlock()) but not yet "physically preallocated" (by being written out as zero blocks within ltsWriteBlock())? That count would have to be subtracted, because nBlocksAllocated includes logically preallocated blocks, without regard for whether they've been physically preallocated. But we only know the difference by checking against nBlocksWritten, so we might as well just use my patch from earlier. 
(I'm not arguing that we should, I'm just pointing out the logical though perhaps absurd conclusion.) -- Peter Geoghegan
On Mon, Sep 14, 2020 at 6:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not sure what I was talking about earlier when I connected this > with the main/instrumentation issue, since preallocation used by > logtape.c to help HashAggs-that-spill necessarily reserves blocks > without writing them out for a while (the fires in California have > made it difficult to be productive). You might write blocks out as > zero blocks first, and then only write the real data later > (overwriting the zero blocks). But no matter how the writes among > tapes are interlaced, the fact is that nBlocksAllocated can exceed > nBlocksWritten by at least one block per active tape. Oh, wait. Of course the point was that we wouldn't even have to use nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make the assumption that nBlocksWritten could be used for all callers in all cases. Which is a reasonable assumption once you enforce that there are no active write buffers. Which is evidently a good idea anyway, since it saves on temp file disk space in HashAggs-that-spill/prealloc cases with very little work_mem. -- Peter Geoghegan
On Mon, Sep 14, 2020 at 7:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Oh, wait. Of course the point was that we wouldn't even have to use
> nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make
> the assumption that nBlocksWritten could be used for all callers in
> all cases. Which is a reasonable assumption once you enforce that
> there are no active write buffers. Which is evidently a good idea
> anyway, since it saves on temp file disk space in
> HashAggs-that-spill/prealloc cases with very little work_mem.

Let's assume that we'll teach LogicalTapeSetBlocks() to use nBlocksWritten in place of nBlocksAllocated in all cases, as now seems likely. Rather than asserting "nBlocksWritten == nBlocksAllocated" inside LogicalTapeSetBlocks() (as I suggested earlier at one point), we could instead teach LogicalTapeSetBlocks() to iterate through each tape from the tapeset and make sure each tape has no writes buffered (so everything must be flushed). We could add a loop that would only be used on assert-enabled builds.

This looping-through-tapes-to-assert code would justify relying on nBlocksWritten in LogicalTapeSetBlocks(), and would make sure that we don't let any bugs like this slip in in the future. It would necessitate that we commit Jeff's hashagg-release-write-buffers.patch patch from earlier, I think, but that seems like a good idea anyway.

You suggested this yourself, Jeff (my suggestion about the assertion is just an expansion on your suggestion from earlier). This all seems like a good idea to me. Can you write a patch that adjusts LogicalTapeSetBlocks() along these lines? Hopefully the assertion loop thing won't reveal some other problem with this plan.

Thanks
-- Peter Geoghegan
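A sketch of the proposed shape, using simplified stand-in types rather than logtape.c's real ones (in the actual patch the loop would presumably be compiled only in assert-enabled builds, e.g. under USE_ASSERT_CHECKING):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for a tape set and its per-tape write state */
typedef struct
{
    bool writing;   /* tape still open for writing? */
    int  nbytes;    /* dirty bytes buffered, not yet passed to the BufFile */
} MockTape;

typedef struct
{
    int      nTapes;
    MockTape tapes[8];
    long     nBlocksWritten;
} MockTapeSet;

/*
 * Proposed LogicalTapeSetBlocks() shape: check that no tape holds unflushed
 * write data, which is what justifies reporting nBlocksWritten directly.
 */
static long
mock_tapeset_blocks(const MockTapeSet *lts)
{
    for (int i = 0; i < lts->nTapes; i++)
        assert(!lts->tapes[i].writing || lts->tapes[i].nbytes == 0);
    return lts->nBlocksWritten;
}
```

The assertion fires exactly in the case Jeff identified: a caller that asks for the block count while write buffers are still dirty.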
On Mon, 2020-09-14 at 19:29 -0700, Peter Geoghegan wrote: > Let's assume that we'll teach LogicalTapeSetBlocks() to use > nBlocksWritten in place of nBlocksAllocated in all cases, as now > seems > likely. Rather than asserting "nBlocksWritten == nBlocksAllocated" > inside LogicalTapeSetBlocks() (as I suggested earlier at one point), > we could instead teach LogicalTapeSetBlocks() to iterate through each > tape from the tapeset and make sure each tape has no writes buffered > (so everything must be flushed). We could add a loop that would only > be used on assert-enabled builds. Sounds reasonable. > You suggested this yourself, Jeff (my suggestion about the assertion > is just an expansion on your suggestion from earlier). This all seems > like a good idea to me. Can you write a patch that adjusts > LogicalTapeSetBlocks() along these lines? Hopefully the assertion > loop > thing won't reveal some other problem with this plan. Sure. Will backporting either patch into REL_13_STABLE now interfere with RC1 release in any way? Regards, Jeff Davis
On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > Sure. Will backporting either patch into REL_13_STABLE now interfere > with RC1 release in any way? The RMT will discuss this. It would help if there was a patch ready to go. Thanks -- Peter Geoghegan
On Mon, 2020-09-14 at 20:48 -0700, Peter Geoghegan wrote: > On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > > Sure. Will backporting either patch into REL_13_STABLE now > > interfere > > with RC1 release in any way? > > The RMT will discuss this. > > It would help if there was a patch ready to go. Attached. HashAgg seems to accurately reflect the filesize, as does Sort. Regards, Jeff Davis
On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> We still need to put the reliance on ltsWriteBlock() allocating many
> blocks before they've been logically written on some kind of formal
> footing for Postgres 13 -- it is now possible that an all-zero block
> will be left behind even after we're done writing and have flushed
> all temp buffers, which is a new thing.

Is the current direction of this thread (i.e. the two posted patches)
addressing your concern here?

Regards,
	Jeff Davis
On Mon, Sep 14, 2020 at 11:52 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> > We still need to put the reliance on ltsWriteBlock() allocating
> > many blocks before they've been logically written on some kind of
> > formal footing for Postgres 13 -- it is now possible that an
> > all-zero block will be left behind even after we're done writing
> > and have flushed all temp buffers, which is a new thing.
>
> Is the current direction of this thread (i.e. the two posted
> patches) addressing your concern here?

Yes.

--
Peter Geoghegan
On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Sure. Will backporting either patch into REL_13_STABLE now
> > interfere with RC1 release in any way?
>
> The RMT will discuss this.

It is okay to skip RC1 and commit the patch/patches for 13 itself.
Please wait until after Tom has pushed the rc1 tag. This will probably
happen tomorrow.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>> Sure. Will backporting either patch into REL_13_STABLE now
>>> interfere with RC1 release in any way?

> It is okay to skip RC1 and commit the patch/patches for 13 itself.
> Please wait until after Tom has pushed the rc1 tag. This will
> probably happen tomorrow.

I plan to tag rc1 in around six hours, ~2200UTC today, barring
trouble reports from packagers (none so far). Feel free to push your
patch once the tag appears.

			regards, tom lane
On Mon, Sep 14, 2020 at 11:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached. HashAgg seems to accurately reflect the filesize, as does
> Sort.

For the avoidance of doubt: I think that this is the right way to go,
and that it should be committed shortly, before we stamp 13.0. The
same goes for hashagg-release-write-buffers.patch.

Thanks
--
Peter Geoghegan
I wrote:
> I plan to tag rc1 in around six hours, ~2200UTC today, barring
> trouble reports from packagers (none so far). Feel free to push your
> patch once the tag appears.

The tag is applied, though for some reason the pgsql-committers auto
e-mail about new tags hasn't been working lately.

			regards, tom lane
On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The tag is applied, though for some reason the pgsql-committers auto
> e-mail about new tags hasn't been working lately.

Thanks. FWIW I did get the automated email shortly after you sent
this email.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The tag is applied, though for some reason the pgsql-committers
>> auto e-mail about new tags hasn't been working lately.

> Thanks. FWIW I did get the automated email shortly after you sent
> this email.

Yeah, it did show up here too, about an hour after I pushed the tag.
The last several taggings have been delayed similarly, and I think
at least one never was reported at all.

			regards, tom lane
On 2020-Sep-15, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The tag is applied, though for some reason the pgsql-committers
> >> auto e-mail about new tags hasn't been working lately.
>
> > Thanks. FWIW I did get the automated email shortly after you sent
> > this email.
>
> Yeah, it did show up here too, about an hour after I pushed the tag.
> The last several taggings have been delayed similarly, and I think
> at least one never was reported at all.

I approved it about half an hour after it got into the moderation
queue. They get moderated because the noreply@postgresql.org address
which appears as sender is not subscribed to any list. I also added
that address to the whitelist now, but whether that's a great fix in
the long run is debatable.

--
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> They get moderated because the noreply@postgresql.org address which
> appears as sender is not subscribed to any list.

Ah-hah.

> I also added that address to the whitelist now, but whether that's a
> great fix in the long run is debatable.

Can/should we change the address that originates such messages?

			regards, tom lane