Thread: logtape.c stats don't account for unused "prefetched" block numbers
Commit 896ddf9b added preallocation to logtape.c to avoid excessive fragmentation in the context of hash aggs that spill and have many batches/tapes. Apparently the preallocation doesn't actually perform any filesystem operations, so the new mechanism should be zero overhead when "preallocated" blocks aren't actually used after all (right?). However, I notice that this breaks the statistics shown by things like trace_sort, and even EXPLAIN ANALYZE. LogicalTapeSetBlocks() didn't get the memo about preallocation.

The easiest way to spot the issue is to compare trace_sort output on v13 with output for the same case in v12 -- the "%u disk blocks used" statistics are consistently higher on v13, especially for cases with many tapes. I spotted the bug when I noticed that v13 external sorts reportedly use more or less disk space when fewer or more tapes are involved (again, this came from trace_sort). That doesn't make sense -- the total amount of space used for external sort temp files should be practically fixed, aside from insignificant rounding effects. Reducing the amount of memory by orders of magnitude in a Postgres 12 tuplesort will hardly affect the "%u disk blocks used" trace_sort output at all. That's what we need to get back to.

This bug probably won't be difficult to fix. Actually, we have had similar problems in the past. The fix could be as simple as teaching LogicalTapeSetBlocks() about this new variety of "sparse allocation". Although maybe the preallocation stuff should somehow be rolled into the much older nHoleBlocks stuff. Unsure.

-- Peter Geoghegan
On 2020-Jul-30, Peter Geoghegan wrote: > Commit 896ddf9b added prefetching to logtape.c to avoid excessive > fragmentation in the context of hash aggs that spill and have many > batches/tapes. Apparently the preallocation doesn't actually perform > any filesystem operations, so the new mechanism should be zero > overhead when "preallocated" blocks aren't actually used after all > (right?). However, I notice that this breaks the statistics shown by > things like trace_sort, and even EXPLAIN ANALYZE. > LogicalTapeSetBlocks() didn't get the memo about preallocation. This open item hasn't received any replies. I think Peter knows how to fix it already, but no patch has been posted ... It'd be good to get a move on it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> This open item hasn't received any replies. I think Peter knows how to
> fix it already, but no patch has been posted ... It'd be good to get a
> move on it.

I picked this up again today. It's not obvious what we should do. It's true that the instrumentation doesn't accurately reflect the on-disk temp file overhead. That is, it doesn't agree with the high watermark temp file size I see in the pgsql_tmp directory, which is a clear regression compared to earlier releases (where tuplesort was the only user of logtape.c). But it's also true that we need to use somewhat more temp file space for a tuplesort in Postgres 13, because we use the preallocation stuff for tuplesort -- though probably without getting any benefit for it.

I haven't figured out how to correct the accounting just yet. In fact, I'm not sure that this isn't some kind of leak of blocks from the freelist, which shouldn't happen at all. The code is complicated enough that I wasn't able to work that out in the couple of hours I spent on it today. I can pick it up again tomorrow.

BTW, this MaxAllocSize freeBlocksLen check is wrong -- doesn't match the later repalloc allocation:

    if (lts->nFreeBlocks >= lts->freeBlocksLen)
    {
        /*
         * If the freelist becomes very large, just return and leak this free
         * block.
         */
        if (lts->freeBlocksLen * 2 > MaxAllocSize)
            return;

        lts->freeBlocksLen *= 2;
        lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
                                            lts->freeBlocksLen * sizeof(long));
    }

-- Peter Geoghegan
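A standalone sketch of the mismatch Peter is pointing at: the later repalloc() request is freeBlocksLen * sizeof(long) bytes, so the guard needs to compare bytes, not the element count. The names below only mimic logtape.c, MaxAllocSize is hard-coded here for illustration, and the eventual fix may well look different:

```c
#include <assert.h>
#include <stddef.h>

#define MaxAllocSize ((size_t) 0x3fffffff)  /* 1 GB - 1, as in memutils.h */

/*
 * Grow a freelist of 'len' longs, as the quoted code does when it fills up.
 * Returns the doubled length, or 0 to signal "leak this free block".  The
 * guard checks the byte size of the upcoming repalloc() request, which is
 * len * 2 * sizeof(long) -- comparing the element count alone, as the
 * quoted code does, lets the byte-size request blow past MaxAllocSize.
 */
static long
grow_freelist(long len)
{
    if ((size_t) len * 2 * sizeof(long) > MaxAllocSize)
        return 0;
    return len * 2;
}
```

With the quoted element-count check, a freelist of MaxAllocSize / sizeof(long) entries would pass the guard and then ask repalloc() for roughly 2 * MaxAllocSize bytes.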
On Tue, Sep 1, 2020 at 5:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > This open item hasn't received any replies. I think Peter knows how to
> > fix it already, but no patch has been posted ... It'd be good to get a
> > move on it.
>
> I picked this up again today.

One easy way to get logtape.c to behave in the same way as Postgres 12 for a multi-pass external sort (i.e. to use fewer blocks and to report the number of blocks used accurately) is to #define TAPE_WRITE_PREALLOC_MIN and TAPE_WRITE_PREALLOC_MAX to 1. So it looks like the problem is in the preallocation stuff added by commit 896ddf9b3cd, and not the new heap-based free list logic added by commit c02fdc92230. That's good news, because it means that the problem may be fairly well isolated -- commit 896ddf9b3cd was a pretty small and isolated thing.

The comments in ltsWriteBlock() added by the 2017 bugfix commit 7ac4a389a7d clearly say that the zero block writing stuff is only supposed to happen at the edge of a tape boundary, which ought to be rare -- see the comment block in ltsWriteBlock(). And yet the new preallocation stuff explicitly relies on ltsWriteBlock() writing zero blocks much more frequently. I'm concerned that that can result in increased and unnecessary I/O, especially for sorts, but also for hash aggs that spill. I'm also concerned that having preallocated-but-unwritten blocks confuses the accounting used by trace_sort/LogicalTapeSetBlocks().

Separately, it's possible to make the trace_sort/LogicalTapeSetBlocks() instrumentation agree with the filesystem by replacing the use of nBlocksAllocated within LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the instrumentation correct without changing the current behavior at all.
But I'm not ready to endorse that approach, since it's not quite clear what nBlocksAllocated and nBlocksWritten mean right now -- those two fields were both added by the aforementioned 2017 bugfix commit, which introduced the "allocated vs written" distinction in the first place. We should totally disable the preallocation stuff for external sorts in any case. External sorts are naturally characterized by relatively large, distinct batching of reads and writes -- preallocation cannot help. -- Peter Geoghegan
On Sat, 2020-09-05 at 12:03 -0700, Peter Geoghegan wrote: > We should totally disable the preallocation stuff for external sorts > in any case. External sorts are naturally characterized by relatively > large, distinct batching of reads and writes -- preallocation cannot > help. Patch attached to disable preallocation for Sort. I'm still looking into the other concerns. Regards, Jeff Davis
Attachment
On Sat, 2020-09-05 at 12:03 -0700, Peter Geoghegan wrote: > The comments in ltsWriteBlock() added by the 2017 bugfix commit > 7ac4a389a7d clearly say that the zero block writing stuff is only > supposed to happen at the edge of a tape boundary, which ought to be > rare -- see the comment block in ltsWriteBlock(). And yet the new > preallocation stuff explicitly relies on that it writing zero blocks > much more frequently. I'm concerned that that can result in increased > and unnecessary I/O, especially for sorts, but also for hash aggs > that > spill. I'm also concerned that having preallocated-but-allocated > blocks confuses the accounting used by > trace_sort/LogicalTapeSetBlocks(). Preallocation showed significant gains for HashAgg, and BufFile doesn't support sparse writes. So, for HashAgg, it seems like we should just update the comment and consider it the price of using BufFile. (Aside: is there a reason why BufFile doesn't support sparse writes, or is it just a matter of implementation?) For Sort, we can just disable preallocation. > Separately, it's possible to make the > trace_sort/LogicalTapeSetBlocks() instrumentation agree with the > filesystem by replacing the use of nBlocksAllocated within > LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the > instrumentation correct without changing the current behavior at all. > But I'm not ready to endorse that approach, since it's not quite > clear > what nBlocksAllocated and nBlocksWritten mean right now -- those two > fields were both added by the aforementioned 2017 bugfix commit, > which > introduced the "allocated vs written" distinction in the first place Right now, it seems nBlocksAllocated means "number of blocks returned by ltsGetFreeBlock(), plus nHoleBlocks". nBlocksWritten seems to mean "the logical size of the BufFile". The BufFile can have holes in it after concatenation, but from the perspective of logtape.c, nBlocksWritten seems like a better fit for instrumentation purposes. 
So I'd be inclined to return (nBlocksWritten - nHoleBlocks). The only thing I can think of that would be better is if BufFile tracked for itself the logical vs. physical size, which might be a good improvement to make (and would mean that logtape.c wouldn't be responsible for tracking the holes itself). Thoughts? Regards, Jeff Davis
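As a rough sketch of the two candidate definitions being compared here (the struct is a simplified stand-in, not the real LogicalTapeSet, and blocks_current assumes the v13 formula is roughly nBlocksAllocated minus nHoleBlocks):

```c
#include <assert.h>

/* Simplified stand-in for the counters kept by a LogicalTapeSet */
typedef struct
{
    long nBlocksAllocated;  /* blocks returned by ltsGetFreeBlock(), plus holes */
    long nBlocksWritten;    /* logical size of the BufFile, in blocks */
    long nHoleBlocks;       /* unusable blocks left by worker-tape concatenation */
} TapeCounters;

/* Roughly what LogicalTapeSetBlocks() reports today (an assumption) */
static long
blocks_current(const TapeCounters *c)
{
    return c->nBlocksAllocated - c->nHoleBlocks;
}

/* Jeff's proposal: report what actually reached the file */
static long
blocks_proposed(const TapeCounters *c)
{
    return c->nBlocksWritten - c->nHoleBlocks;
}
```

The difference only shows up when blocks have been allocated (e.g. preallocated) without ever being written.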
On Tue, Sep 8, 2020 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Preallocation showed significant gains for HashAgg, and BufFile doesn't
> support sparse writes. So, for HashAgg, it seems like we should just
> update the comment and consider it the price of using BufFile.
>
> For Sort, we can just disable preallocation.

+1. I think that you can push sort-no-prealloc.patch without delay. That looks good to me.

> Right now, it seems nBlocksAllocated means "number of blocks returned
> by ltsGetFreeBlock(), plus nHoleBlocks".

Just to be clear: I'm assuming that we must honor the original intent of earlier code in my remarks here (in particular, code added by 7ac4a389a7d). This may not be exactly where we end up, but it's a good starting point.

nHoleBlocks was added by the parallel CREATE INDEX commit in 2018, which added BufFile/logtape.c concatenation. Whereas nBlocksAllocated was added by the earlier bugfix commit 7ac4a389a7d in 2017 (the bugfix I've referenced several times upthread). Clearly nBlocksAllocated cannot be defined in terms of some other thing that wasn't there when it was first added (I refer to nHoleBlocks). In general nHoleBlocks can only be non-zero when the logical tapeset has been unified in a leader process (for a parallel CREATE INDEX).

nBlocksAllocated is not the same thing as nBlocksWritten, though the difference is more subtle than you suggest. nBlocksAllocated actually means (or should mean) "the number of blocks allocated to the file", which is usually the same thing that a stat() call or tools like "ls" are expected to report for the underlying temp file once the merge phase of an external sort is reached (assuming that you only need one temp file for a BufFile, and didn't use parallelism/concatenation, which is the common case). That's why nBlocksAllocated is what LogicalTapeSetBlocks() returns (pretty much). At least, the original post-7ac4a389a7d version of LogicalTapeSetBlocks() was just "return nBlocksAllocated;".
nHoleBlocks was added for parallel CREATE INDEX, but that was only supposed to compensate for the holes left behind by concatenation/parallel sort, without changing the logtape.c space management design in any fundamental way.

Obviously you must be wondering what the difference is, if it's not just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but it's almost always equal in practice (again, barring nHoleBlocks != 0). It's possible that a tuplesort will not have flushed the last block at a point when LogicalTapeSetBlocks() is called -- it will have allocated the block, but not yet written it to the BufFile. IOW, as far as tuplesort.c is concerned the data is written to tape, but it happens to not have been passed through to the OS via write(), or even passed through to BufFileWrite() -- it happens to still be in one of the small per-tape write buffers. When this occurs, a small amount of dirty data in said per-tape buffer is considered written by tuplesort.c, but from the point of view of logtape.c it is allocated but not yet "written" (by which I mean not yet passed to buffile.c, which actually does its own buffering, which it can neglect to flush immediately in turn).

It's possible that tuplesort.c will need to call LogicalTapeSetBlocks() at an earlier point, after all tuples are written but before they're "flushed" in logtape.c/buffile.c. We need to avoid confusion when that happens. We want to insulate tuplesort.c from implementation details that are private to logtape.c and/or buffile.c. Bear in mind that nBlocksAllocated was originally only ever supposed to have a value equal to nBlocksWritten, or the value nBlocksWritten + 1. It is reasonable to want to hide the buffering from LogicalTapeSetBlocks() once you realize that this mechanism is only supposed to smooth over an edge case involving one extra block that will be written out in a moment anyway.
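A toy model of that edge case (not the real logtape.c write path; all names here are invented) showing how nBlocksAllocated can briefly run one ahead of nBlocksWritten until the last partial block is flushed:

```c
#include <assert.h>

#define BLCKSZ 8192

/* Toy model of one tape's write path; all names are invented */
typedef struct
{
    long nBlocksAllocated;  /* blocks assigned to tape data so far */
    long nBlocksWritten;    /* blocks actually handed to the BufFile */
    int  pos;               /* bytes sitting in the per-tape write buffer */
} MockLTS;

/* tuplesort.c-style write: data lands in the per-tape buffer first */
static void
mock_write(MockLTS *lts, int nbytes)
{
    if (lts->pos == 0)
        lts->nBlocksAllocated++;        /* block allocated on first use */
    lts->pos += nbytes;
    while (lts->pos >= BLCKSZ)          /* full blocks are flushed at once */
    {
        lts->nBlocksWritten++;
        lts->pos -= BLCKSZ;
        if (lts->pos > 0)
            lts->nBlocksAllocated++;    /* overflow starts the next block */
    }
}

/* Flushing writes out the final partial block, if any */
static void
mock_flush(MockLTS *lts)
{
    if (lts->pos > 0)
    {
        lts->nBlocksWritten++;
        lts->pos = 0;
    }
}
```

Until mock_flush() runs, the last partial block is "written" from the caller's point of view but only allocated from the tape set's.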
What does all of this mean for the new preallocation stuff that benefits HashAggs that spill? Well, I'm not sure. I was specifically concerned that somebody would end up misusing the ltsWriteBlock() allocated-but-not-written thing in this way back in 2017, and said so at the time -- that's why commit 7ac4a389a7d added comments about all this to ltsWriteBlock().

For external sorts, which we're agreed won't be using preallocation anyway, I think that we should go back to reporting allocated blocks from LogicalTapeSetBlocks() -- very often this is nBlocksWritten, but occasionally it's nBlocksWritten + 1. I haven't yet refreshed my memory on the exact details of when you get one behavior rather than the other, but I know it is possible in practice with a tuplesort on Postgres 12. It might depend on subtle issues like the alignment with BufFile segments -- see my test case from 2017 to get an idea of how to make it easier to reveal problems in this area:

https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com

We still need to put the reliance on ltsWriteBlock() allocating many blocks before they've been logically written on some kind of formal footing for Postgres 13 -- it is now possible that an all-zero block will be left behind even after we're done writing and have flushed all temp buffers, which is a new thing. In cases when the zero-block-written thing happened on Postgres 12, we would later flush out a block that overwrote every zero block -- that happened reliably. ltsWriteBlock()'s loop only "preallocated" blocks it *knew* would get filled with real data shortly afterwards, as an implementation expedient -- not as an optimization. This is no longer the case. At a minimum, we need to update the old ltsWriteBlock() allocated-but-not-written comments to acknowledge that the HashAgg case exists and has different concerns.
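The ltsWriteBlock() behavior under discussion can be sketched with a toy model (hypothetical names, absurdly small blocks): writing a block number beyond the current end of the file first pads the gap with zero blocks, and under preallocation some of those zero blocks may now never be overwritten:

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16
#define BLKSZ   4               /* tiny "blocks", purely for illustration */

static char file[NBLOCKS][BLKSZ];
static int  nWritten = 0;       /* blocks physically written so far */

/* Toy version of ltsWriteBlock(): pad any gap before blocknum with zeroes */
static void
write_block(int blocknum, const char *data)
{
    while (nWritten < blocknum)
        memset(file[nWritten++], 0, BLKSZ);     /* zero-fill skipped blocks */
    memcpy(file[blocknum], data, BLKSZ);
    if (blocknum == nWritten)
        nWritten++;
}
```

In Postgres 12 every zero-filled gap block was reliably overwritten with real data later; with preallocation that guarantee is gone.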
We must also determine whether we have the same issue with written-but-not-yet-flushed data for the new nodeAgg.c caller. You're not doing the ltsWriteBlock() loop-that-writes-zero-blocks thing because you have an unflushed buffer from another tape -- you're doing it to preallocate and avoid possible fragmentation. I'm mostly okay with doing the preallocation that way, but that needs to be reconciled with the original design. And the original design needs to continue to do the same things for tuplesort.c, and maybe nodeAgg.c, too.

I think that the return value of LogicalTapeSetBlocks() should be at least nBlocksWritten, while also including blocks that we know flushing dirty buffered data will write in a moment (note that I'm still pretending nHoleBlocks doesn't exist because it's not important in my remarks here). IOW, it ought to include preallocated blocks (for HashAgg), while not failing to count one extra block that happens to still be buffered but is written as far as the logtape.c caller is concerned (certainly for the tuplesort caller, and maybe for the HashAgg caller too).

> nBlocksWritten seems to mean "the logical size of the BufFile". The
> BufFile can have holes in it after concatenation, but from the
> perspective of logtape.c, nBlocksWritten seems like a better fit for
> instrumentation purposes. So I'd be inclined to return (nBlocksWritten
> - nHoleBlocks).
>
> The only thing I can think of that would be better is if BufFile
> tracked for itself the logical vs. physical size, which might be a good
> improvement to make (and would mean that logtape.c wouldn't be
> responsible for tracking the holes itself).

I don't really think that that's workable, for what it's worth. The "holes" left behind by concatenation (and counted by nHoleBlocks) are ranges that logtape.c can never reuse that are "between" worker tapes.
They are necessary because logtape.c needs to be able to read back block number metadata from worker temp files (it makes sense of them by applying an offset). ISTM that the logical vs physical size distinction will have to be tracked by logtape.c for as long as it buffers data for writes. It's the natural way to do it IMV. -- Peter Geoghegan
On Thu, Sep 10, 2020 at 6:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Obviously you must be wondering what the difference is, if it's not
> just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal
> to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but
> it's almost always equal in practice (again, barring nHoleBlocks != 0).

Noticing that you pushed a commit to disable preallocation for external sorts, I tried to determine if there are any remaining problems. As far as I can tell there are none -- evidently the loop logic in ltsWriteBlock() performs both its original task (per commit 7ac4a389a7d) and the new task of preallocation for its HashAggs-that-spill caller.

There is a case in the regression tests (including the Postgres 12 regression tests) that relies on the loop within ltsWriteBlock() for an external sort. FWIW, that happens in the "cluster clstr_tst4 using cluster_sort" cluster tuplesort. The trace_sort output (and the temp file size) is now consistent across versions 12 and 13.

I'll probably close out this open item tomorrow. I need to think about it some more, but right now everything looks good. I think I'll probably end up pushing a commit with more explanatory comments.

Thank you
-- Peter Geoghegan
On Fri, Sep 11, 2020 at 6:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'll probably close out this open item tomorrow. I need to think about > it some more, but right now everything looks good. I think I'll > probably end up pushing a commit with more explanatory comments. That said, we still need to make sure that the preallocation instrumentation for HashAggs-that-spill is sensible -- it has to actually match the temp file size. It would be awkward if we just used nBlocksWritten within LogicalTapeSetBlocks() in the case where we didn't preallocate (or in all cases). Not entirely sure what to do about that just yet. -- Peter Geoghegan
On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > It would be awkward if we just used nBlocksWritten within > LogicalTapeSetBlocks() in the case where we didn't preallocate (or in > all cases). Not entirely sure what to do about that just yet. I guess that that's the logical thing to do, as in the attached patch. What do you think, Jeff? -- Peter Geoghegan
Attachment
On Mon, 2020-09-14 at 14:24 -0700, Peter Geoghegan wrote: > On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It would be awkward if we just used nBlocksWritten within > > LogicalTapeSetBlocks() in the case where we didn't preallocate (or > > in > > all cases). Not entirely sure what to do about that just yet. > > I guess that that's the logical thing to do, as in the attached > patch. Hi Peter, In the comment in the patch, you say: "In practice this probably doesn't matter because we'll be called after the flush anyway, but be tidy." By which I assume you mean that LogicalTapeRewindForRead() will be called before LogicalTapeSetBlocks(). If that's the intention of LogicalTapeSetBlocks(), should we just make it a requirement that there are no open write buffers for any tapes when it's called? Then we could just use nBlocksWritten in both cases, right? (Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might be a mistake in HashAgg where it will keep the write buffers around longer than necessary. If I recall correctly, it was my intention to rewind for reading immediately after the batch was finished, which is why I made the read buffer lazily-allocated.) Regards, Jeff Davis
On 2020-Sep-14, Peter Geoghegan wrote: > On Fri, Sep 11, 2020 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It would be awkward if we just used nBlocksWritten within > > LogicalTapeSetBlocks() in the case where we didn't preallocate (or in > > all cases). Not entirely sure what to do about that just yet. > > I guess that that's the logical thing to do, as in the attached patch. I don't understand this patch. Or maybe I should say I don't understand the code you're patching. Why isn't the correct answer *always* nBlocksWritten? The comment in LogicalTapeSet says: "nBlocksWritten is the size of the underlying file, in BLCKSZ blocks." so if LogicalTapeSetBlocks wants to do what its comment says, that is, "Obtain total disk space currently used by a LogicalTapeSet, in blocks." then it seems like they're an exact match. Either that, or more than zero of those comments are lying. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 14, 2020 at 3:24 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I don't understand this patch. Or maybe I should say I don't understand > the code you're patching. Why isn't the correct answer *always* > nBlocksWritten? The comment in LogicalTapeSet says: I think that they are an exact match in practice (i.e. nBlocksWritten == nBlocksAllocated), given when and how we call LogicalTapeSetBlocks(). -- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:20 PM Jeff Davis <pgsql@j-davis.com> wrote:
> In the comment in the patch, you say:
>
> "In practice this probably doesn't matter because we'll be called after
> the flush anyway, but be tidy."
>
> By which I assume you mean that LogicalTapeRewindForRead() will be
> called before LogicalTapeSetBlocks().

Yeah, I'm pretty sure that that's an equivalent way of expressing the same idea. It appears that this assumption holds, though only when we're not using preallocation (i.e. it doesn't necessarily hold for the HashAggs-that-spill case, as I go into below).

> If that's the intention of LogicalTapeSetBlocks(), should we just make
> it a requirement that there are no open write buffers for any tapes
> when it's called? Then we could just use nBlocksWritten in both cases,
> right?

That does seem appealing. Perhaps it could be enforced by an assertion.

> (Aside: HashAgg calls it before LogicalTapeRewindForRead(). That might
> be a mistake in HashAgg where it will keep the write buffers around
> longer than necessary. If I recall correctly, it was my intention to
> rewind for reading immediately after the batch was finished, which is
> why I made the read buffer lazily-allocated.)

If I add the assertion described above and run the regression tests, it fails within "select_distinct" (and at other points). This is the specific code:

--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1284,6 +1284,7 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
      * (In practice this probably doesn't matter because we'll be called after
      * the flush anyway, but be tidy.)
      */
+    Assert(lts->nBlocksWritten == lts->nBlocksAllocated);
     if (lts->enable_prealloc)
         return lts->nBlocksWritten;

Maybe the LogicalTapeRewindForRead() inconsistency you mention could be fixed, which would enable the simplification you suggested. What do you think?

-- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that they are an exact match in practice (i.e. nBlocksWritten > == nBlocksAllocated), given when and how we call > LogicalTapeSetBlocks(). Just to be clear: this is only true for external sorts. The preallocation stuff can make nBlocksAllocated quite a lot higher. That's probably why adding a new "Assert(lts->nBlocksWritten == lts->nBlocksAllocated)" assertion fails during the regression tests, though there might be other reasons as well (I'm thinking of the LogicalTapeRewindForRead() inconsistency Jeff mentioned). -- Peter Geoghegan
On Mon, Sep 14, 2020 at 3:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If I add the assertion described above and run the regression tests,
> it fails within "select_distinct" (and at other points). This is the
> specific code:

This variant of the same assertion works fine:

+    Assert(lts->enable_prealloc || lts->nBlocksWritten == lts->nBlocksAllocated);

(This is hardly surprising, though.)

-- Peter Geoghegan
On Mon, 2020-09-14 at 15:54 -0700, Peter Geoghegan wrote: > Maybe the LogicalTapeRewindForRead() inconsistency you mention could > be fixed, which would enable the simplification you suggested. What > do > you think? Yes, it was apparently an oversight. Patch attached. RC1 was just stamped, are we in a sensitive time or is it still possible to backport this to REL_13_STABLE? If not, that's fine, I'll just commit it to master. It's a little less important after 9878b643, which reduced the overpartitioning issue. Regards, Jeff Davis
Attachment
On Mon, Sep 14, 2020 at 5:50 PM Jeff Davis <pgsql@j-davis.com> wrote: > Yes, it was apparently an oversight. Patch attached. This is closer to how logical tapes are used within tuplesort.c. I notice that this leads to about a 50% reduction in temp file usage for a test case involving very little work_mem (work_mem is set to 64). But it doesn't seem to make as much difference with more work_mem. It probably has something to do with recursion during spilling. > RC1 was just stamped, are we in a sensitive time or is it still > possible to backport this to REL_13_STABLE? Testing indicates that this still doesn't make "nBlocksWritten == nBlocksAllocated" when the instrumentation is used for HashAggs-that-spill. I'm not sure what I was talking about earlier when I connected this with the main/instrumentation issue, since preallocation used by logtape.c to help HashAggs-that-spill necessarily reserves blocks without writing them out for a while (the fires in California have made it difficult to be productive). You might write blocks out as zero blocks first, and then only write the real data later (overwriting the zero blocks). But no matter how the writes among tapes are interlaced, the fact is that nBlocksAllocated can exceed nBlocksWritten by at least one block per active tape. If we really wanted to ensure "nBlocksWritten == nBlocksAllocated", wouldn't it be necessary for LogicalTapeSetBlocks() to go through the remaining preallocated blocks from each tape and count the number of blocks "logically preallocated" (by ltsGetPreallocBlock()) but not yet "physically preallocated" (by being written out as zero blocks within ltsWriteBlock())? That count would have to be subtracted, because nBlocksAllocated includes logically preallocated blocks, without regard for whether they've been physically preallocated. But we only know the difference by checking against nBlocksWritten, so we might as well just use my patch from earlier. 
(I'm not arguing that we should, I'm just pointing out the logical though perhaps absurd conclusion.) -- Peter Geoghegan
On Mon, Sep 14, 2020 at 6:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not sure what I was talking about earlier when I connected this > with the main/instrumentation issue, since preallocation used by > logtape.c to help HashAggs-that-spill necessarily reserves blocks > without writing them out for a while (the fires in California have > made it difficult to be productive). You might write blocks out as > zero blocks first, and then only write the real data later > (overwriting the zero blocks). But no matter how the writes among > tapes are interlaced, the fact is that nBlocksAllocated can exceed > nBlocksWritten by at least one block per active tape. Oh, wait. Of course the point was that we wouldn't even have to use nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make the assumption that nBlocksWritten could be used for all callers in all cases. Which is a reasonable assumption once you enforce that there are no active write buffers. Which is evidently a good idea anyway, since it saves on temp file disk space in HashAggs-that-spill/prealloc cases with very little work_mem. -- Peter Geoghegan
On Mon, Sep 14, 2020 at 7:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Oh, wait. Of course the point was that we wouldn't even have to use
> nBlocksAllocated in LogicalTapeSetBlocks() anymore -- we would make
> the assumption that nBlocksWritten could be used for all callers in
> all cases. Which is a reasonable assumption once you enforce that
> there are no active write buffers. Which is evidently a good idea
> anyway, since it saves on temp file disk space in
> HashAggs-that-spill/prealloc cases with very little work_mem.

Let's assume that we'll teach LogicalTapeSetBlocks() to use nBlocksWritten in place of nBlocksAllocated in all cases, as now seems likely. Rather than asserting "nBlocksWritten == nBlocksAllocated" inside LogicalTapeSetBlocks() (as I suggested earlier at one point), we could instead teach LogicalTapeSetBlocks() to iterate through each tape from the tapeset and make sure each tape has no writes buffered (so everything must be flushed). We could add a loop that would only be used on assert-enabled builds.

This looping-through-tapes-to-assert code would justify relying on nBlocksWritten in LogicalTapeSetBlocks(), and would make sure that we don't let any bugs like this slip in in the future. It would necessitate that we commit Jeff's hashagg-release-write-buffers.patch patch from earlier, I think, but that seems like a good idea anyway.

You suggested this yourself, Jeff (my suggestion about the assertion is just an expansion on your suggestion from earlier). This all seems like a good idea to me. Can you write a patch that adjusts LogicalTapeSetBlocks() along these lines? Hopefully the assertion loop thing won't reveal some other problem with this plan.

Thanks
-- Peter Geoghegan
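A sketch of the proposed shape, using simplified stand-in types rather than logtape.c's real ones (in the actual patch the loop would presumably be compiled only in assert-enabled builds, e.g. under USE_ASSERT_CHECKING):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for a tape set and its per-tape write state */
typedef struct
{
    bool writing;   /* tape still open for writing? */
    int  nbytes;    /* dirty bytes buffered, not yet passed to the BufFile */
} MockTape;

typedef struct
{
    int      nTapes;
    MockTape tapes[8];
    long     nBlocksWritten;
} MockTapeSet;

/*
 * Proposed LogicalTapeSetBlocks() shape: check that no tape holds unflushed
 * write data, which is what justifies reporting nBlocksWritten directly.
 */
static long
mock_tapeset_blocks(const MockTapeSet *lts)
{
    for (int i = 0; i < lts->nTapes; i++)
        assert(!lts->tapes[i].writing || lts->tapes[i].nbytes == 0);
    return lts->nBlocksWritten;
}
```

The assertion fires exactly in the case Jeff identified: a caller that asks for the block count while write buffers are still dirty.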
On Mon, 2020-09-14 at 19:29 -0700, Peter Geoghegan wrote: > Let's assume that we'll teach LogicalTapeSetBlocks() to use > nBlocksWritten in place of nBlocksAllocated in all cases, as now > seems > likely. Rather than asserting "nBlocksWritten == nBlocksAllocated" > inside LogicalTapeSetBlocks() (as I suggested earlier at one point), > we could instead teach LogicalTapeSetBlocks() to iterate through each > tape from the tapeset and make sure each tape has no writes buffered > (so everything must be flushed). We could add a loop that would only > be used on assert-enabled builds. Sounds reasonable. > You suggested this yourself, Jeff (my suggestion about the assertion > is just an expansion on your suggestion from earlier). This all seems > like a good idea to me. Can you write a patch that adjusts > LogicalTapeSetBlocks() along these lines? Hopefully the assertion > loop > thing won't reveal some other problem with this plan. Sure. Will backporting either patch into REL_13_STABLE now interfere with RC1 release in any way? Regards, Jeff Davis
On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > Sure. Will backporting either patch into REL_13_STABLE now interfere > with RC1 release in any way? The RMT will discuss this. It would help if there was a patch ready to go. Thanks -- Peter Geoghegan
On Mon, 2020-09-14 at 20:48 -0700, Peter Geoghegan wrote: > On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > > Sure. Will backporting either patch into REL_13_STABLE now > > interfere > > with RC1 release in any way? > > The RMT will discuss this. > > It would help if there was a patch ready to go. Attached. HashAgg seems to accurately reflect the filesize, as does Sort. Regards, Jeff Davis
On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> We still need to put the reliance on ltsWriteBlock() allocating many
> blocks before they've been logically written on some kind of formal
> footing for Postgres 13 -- it is now possible that an all-zero block
> will be left behind even after we're done writing and have flushed
> all temp buffers, which is a new thing.

Is the current direction of this thread (i.e. the two posted patches)
addressing your concern here?

Regards,
	Jeff Davis
On Mon, Sep 14, 2020 at 11:52 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> > We still need to put the reliance on ltsWriteBlock() allocating
> > many blocks before they've been logically written on some kind of
> > formal footing for Postgres 13 -- it is now possible that an
> > all-zero block will be left behind even after we're done writing
> > and have flushed all temp buffers, which is a new thing.
>
> Is the current direction of this thread (i.e. the two posted
> patches) addressing your concern here?

Yes.

--
Peter Geoghegan
On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Sure. Will backporting either patch into REL_13_STABLE now
> > interfere with RC1 release in any way?
>
> The RMT will discuss this.

It is okay to skip RC1 and commit the patch/patches for 13 itself.
Please wait until after Tom has pushed the rc1 tag. This will probably
happen tomorrow.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>> Sure. Will backporting either patch into REL_13_STABLE now
>>> interfere with RC1 release in any way?

> It is okay to skip RC1 and commit the patch/patches for 13 itself.
> Please wait until after Tom has pushed the rc1 tag. This will
> probably happen tomorrow.

I plan to tag rc1 in around six hours, ~2200UTC today, barring
trouble reports from packagers (none so far). Feel free to push your
patch once the tag appears.

			regards, tom lane
On Mon, Sep 14, 2020 at 11:44 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached. HashAgg seems to accurately reflect the filesize, as does
> Sort.

For the avoidance of doubt: I think that this is the right way to go,
and that it should be committed shortly, before we stamp 13.0. The
same goes for hashagg-release-write-buffers.patch.

Thanks
--
Peter Geoghegan
I wrote:
> I plan to tag rc1 in around six hours, ~2200UTC today, barring
> trouble reports from packagers (none so far). Feel free to push your
> patch once the tag appears.

The tag is applied, though for some reason the pgsql-committers auto
e-mail about new tags hasn't been working lately.

			regards, tom lane
On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The tag is applied, though for some reason the pgsql-committers auto
> e-mail about new tags hasn't been working lately.

Thanks. FWIW I did get the automated email shortly after you sent
this email.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The tag is applied, though for some reason the pgsql-committers
>> auto e-mail about new tags hasn't been working lately.

> Thanks. FWIW I did get the automated email shortly after you sent
> this email.

Yeah, it did show up here too, about an hour after I pushed the tag.
The last several taggings have been delayed similarly, and I think
at least one never was reported at all.

			regards, tom lane
On 2020-Sep-15, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > On Tue, Sep 15, 2020 at 3:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The tag is applied, though for some reason the pgsql-committers
> >> auto e-mail about new tags hasn't been working lately.
>
> > Thanks. FWIW I did get the automated email shortly after you sent
> > this email.
>
> Yeah, it did show up here too, about an hour after I pushed the tag.
> The last several taggings have been delayed similarly, and I think
> at least one never was reported at all.

I approved it about half an hour after it got into the moderation
queue. They get moderated because the noreply@postgresql.org address
which appears as sender is not subscribed to any list. I also added
that address to the whitelist now, but whether that's a great fix in
the long run is debatable.

--
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> They get moderated because the noreply@postgresql.org address which
> appears as sender is not subscribed to any list.

Ah-hah.

> I also added that address to the whitelist now, but whether that's a
> great fix in the long run is debatable.

Can/should we change the address that originates such messages?

			regards, tom lane