Thread: Spurious "apparent wraparound" via SimpleLruTruncate() rounding
While testing an xidStopLimit corner case, I got this:

3656710 2019-01-05 00:05:13.910 GMT LOG:  automatic aggressive vacuum to prevent wraparound of table "test.pg_toast.pg_toast_826": index scans: 0
3656710 2019-01-05 00:05:16.912 GMT LOG:  could not truncate directory "pg_xact": apparent wraparound
3656710 2019-01-05 00:05:16.912 GMT DEBUG:  transaction ID wrap limit is 4294486400, limited by database with OID 1
3656710 2019-01-05 00:05:16.912 GMT WARNING:  database "template1" must be vacuumed within 481499 transactions
3656710 2019-01-05 00:05:16.912 GMT HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.

I think the WARNING was correct about having 481499 XIDs left before
xidWrapLimit, and the spurious "apparent wraparound" arose from this
rounding-down in SimpleLruTruncate():

	cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT;
...
	/*
	 * While we are holding the lock, make an important safety check: the
	 * planned cutoff point must be <= the current endpoint page.  Otherwise we
	 * have already wrapped around, and proceeding with the truncation would
	 * risk removing the current segment.
	 */
	if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
	{
		LWLockRelease(shared->ControlLock);
		ereport(LOG,
				(errmsg("could not truncate directory \"%s\": apparent wraparound",
						ctl->Dir)));

We round "cutoffPage" to make ctl->PagePrecedes(segpage, cutoffPage) return
false for the segment containing the cutoff page.  CLOGPagePrecedes() (and
most SLRU PagePrecedes methods) implements a circular address space.  Hence,
the rounding also causes ctl->PagePrecedes(segpage, cutoffPage) to return
true for the segment furthest in the future relative to the unrounded
cutoffPage (if it exists).  That's bad.  Such a segment rarely exists,
because xidStopLimit protects 1000000 XIDs, and the rounding moves truncation
by no more than (BLCKSZ * CLOG_XACTS_PER_BYTE * SLRU_PAGES_PER_SEGMENT - 1) =
1048575 XIDs.  Thus, I expect to see this problem at 4.9% of xidStopLimit
values.  I expect this is easier to see with multiStopLimit, which protects
only 100 mxid.

The main consequence is the false alarm.  A prudent DBA will want to react
to true wraparound, but no such wraparound has occurred.  Also, we
temporarily waste disk space in pg_xact.  This feels like a recipe for
future bugs.  The fix I have in mind, attached, is to change instances of
ctl->PagePrecedes(FIRST_PAGE_OF_SEGMENT, ROUNDED_cutoffPage) to
ctl->PagePrecedes(LAST_PAGE_OF_SEGMENT, cutoffPage).  I'm inclined not to
back-patch this; does anyone favor back-patching?

Thanks,
nm
Attachment
On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:
> The main consequence is the false alarm.  A prudent DBA will want to react to
> true wraparound, but no such wraparound has occurred.  Also, we temporarily
> waste disk space in pg_xact.  This feels like a recipe for future bugs.  The
> fix I have in mind, attached, is to change instances of
> ctl->PagePrecedes(FIRST_PAGE_OF_SEGMENT, ROUNDED_cutoffPage) to
> ctl->PagePrecedes(LAST_PAGE_OF_SEGMENT, cutoffPage).  I'm inclined not to
> back-patch this; does anyone favor back-patching?

To avoid wasting more of anyone's time: that patch is bad; I'll update this
thread when I have something better.
On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:
> The main consequence is the false alarm.

That conclusion was incorrect.  On further study, I was able to reproduce
data loss via either of two weaknesses in the "apparent wraparound" test:

1. The result of the test is valid only until we release the SLRU ControlLock,
   which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
   segments for deletion.  Once we release that lock, latest_page_number can
   advance.  This creates a TOCTOU race condition, allowing excess deletion:

   [local] test=# table trunc_clog_concurrency ;
   ERROR:  could not access status of transaction 2149484247
   DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

2. By the time the "apparent wraparound" test fires, we've already WAL-logged
   the truncation.  clog_redo() suppresses the "apparent wraparound" test,
   then deletes too much.  Startup then fails:

   881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
   881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
   881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1

Fixes are available:

a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
   wrong; I will correct it in a separate message.)

b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other
   than AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
   checkpoint, or in the startup process.  Hence, also arrange for only one
   backend to call SimpleLruTruncate(AsyncCtl) at a time.

c. Test "apparent wraparound" before writing WAL, and don't WAL-log
   truncations that "apparent wraparound" forces us to skip.

d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This
   removes the TOCTOU race condition, but TransactionIdDidCommit() and other
   key operations would be waiting behind filesystem I/O.

e. Have the SLRU track a "low cutoff" for an ongoing truncation.  Initially,
   the low cutoff is the page furthest in the past relative to cutoffPage
   (the "high cutoff").  If SimpleLruZeroPage() wishes to use a page in the
   truncation range, it would acquire an LWLock and increment the low cutoff.
   Before unlinking any segment, SlruScanDirCbDeleteCutoff() would take the
   same LWLock and recheck the segment against the latest low cutoff.

With both (a) and (b), the only way I'd know to reach the "apparent
wraparound" message is to restart in single-user mode and burn XIDs to the
point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
and I propose (c) for HEAD only.  I don't want (d), which threatens
performance too much.  I would rather not have (e), because I expect it's
more complex than (b) and fixes strictly less than (b) fixes.

Can you see a way to improve on that plan?  Can you see other bugs of this
nature that this plan does not fix?

Thanks,
nm
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> On further study, I was able to reproduce data loss

> Fixes are available:
>
> a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
>    wrong; I will correct it in a separate message.)

Here's a corrected version.  I now delete a segment only if both its first
page and its last page are considered to precede the cutoff; see the new
comment at SlruMayDeleteSegment().
Attachment
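The attachment itself is not reproduced in this archive, but the shape of
the both-ends test described above can be sketched as follows.  The names
and constants here are simplified stand-ins modeled on slru.c, not the
literal patch:

```c
#include <stdint.h>
#include <stdbool.h>

#define SLRU_PAGES_PER_SEGMENT 32

/* Circular page comparison in the style of the SLRU PagePrecedes callbacks. */
static bool
PagePrecedes(uint32_t page1, uint32_t page2)
{
	return (int32_t) (page1 - page2) < 0;
}

/*
 * A segment may be unlinked only if BOTH its first and its last page
 * precede the unrounded cutoff.  The segment containing the cutoff fails
 * the last-page test, and the segment diametrically opposite the cutoff
 * fails the first-page test, so both survive truncation.
 */
static bool
SlruMayDeleteSegment(uint32_t segpage, uint32_t cutoffPage)
{
	uint32_t	seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;

	return PagePrecedes(segpage, cutoffPage) &&
		PagePrecedes(seg_last_page, cutoffPage);
}
```

A genuinely old segment passes both tests and is removed; the cutoff's own
segment and the antipodal segment each fail exactly one test and are kept,
which replaces the old rounding trick without its wraparound hazard.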
Sorry in advance for link-breaking message forced by gmail..

https://www.postgresql.org/message-id/flat/20190202083822.GC32531@gust.leadboat.com

> 1. The result of the test is valid only until we release the SLRU ControlLock,
>    which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
>    segments for deletion.  Once we release that lock, latest_page_number can
>    advance.  This creates a TOCTOU race condition, allowing excess deletion:
>
>    [local] test=# table trunc_clog_concurrency ;
>    ERROR:  could not access status of transaction 2149484247
>    DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

It seems like some other vacuum process saw a larger cutoff page?  If I'm not
missing something, the missing page is no longer the "recently-populated"
page at the time (as I understand it, the last page that holds valid data).
Couldn't we just ignore ENOENT there?

> 2. By the time the "apparent wraparound" test fires, we've already WAL-logged
>    the truncation.  clog_redo() suppresses the "apparent wraparound" test,
>    then deletes too much.  Startup then fails:

I agree that if truncation is skipped after the WAL record is issued, it will
lead to data loss at the next recovery.  But the following log..:

> 881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
> 881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
> 881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1

If it came from the same cause as 1, the error is simply ignorable, so
recovery stopping at it is unacceptable; the ENOENT is just ignorable for
the same reason.

As a result, I agree with (a) (fix rounding) and (c) (test wraparound before
writing WAL), but I'm not sure about the others.  And an additional fix for
the ignorable ENOENT is needed.  What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jul 24, 2019 at 05:27:18PM +0900, Kyotaro Horiguchi wrote:
> Sorry in advance for link-breaking message forced by gmail..

Using the archives page "Resend email" link avoids that.

> https://www.postgresql.org/message-id/flat/20190202083822.GC32531@gust.leadboat.com
>
> > 1. The result of the test is valid only until we release the SLRU ControlLock,
> >    which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
> >    segments for deletion.  Once we release that lock, latest_page_number can
> >    advance.  This creates a TOCTOU race condition, allowing excess deletion:
> >
> >    [local] test=# table trunc_clog_concurrency ;
> >    ERROR:  could not access status of transaction 2149484247
> >    DETAIL:  Could not open file "pg_xact/0801": No such file or directory.
>
> It seems like some other vacuum process saw a larger cutoff page?

No, just one VACUUM suffices.

> If I'm not missing something, the missing page is no longer the
> "recently-populated" page at the time (as I understand it, the last
> page that holds valid data).  Couldn't we just ignore ENOENT there?

The server reported this error while attempting to read CLOG to determine
whether a tuple's xmin committed or aborted.  That ENOENT means the relevant
CLOG page is not available.  To ignore that ENOENT, the server would need to
guess whether to consider the xmin committed or consider it aborted.  So, no,
we can't just ignore the ENOENT.
On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
>On Sat, Feb 02, 2019 at 03:38:22AM -0500, Noah Misch wrote:
>> The main consequence is the false alarm.
>
>That conclusion was incorrect.  On further study, I was able to reproduce data
>loss via either of two weaknesses in the "apparent wraparound" test:
>
>1. The result of the test is valid only until we release the SLRU ControlLock,
>   which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
>   segments for deletion.  Once we release that lock, latest_page_number can
>   advance.  This creates a TOCTOU race condition, allowing excess deletion:
>
>   [local] test=# table trunc_clog_concurrency ;
>   ERROR:  could not access status of transaction 2149484247
>   DETAIL:  Could not open file "pg_xact/0801": No such file or directory.
>
>2. By the time the "apparent wraparound" test fires, we've already WAL-logged
>   the truncation.  clog_redo() suppresses the "apparent wraparound" test,
>   then deletes too much.  Startup then fails:
>
>   881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of transaction 708112327
>   881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file "pg_xact/02A3": No such file or directory.
>   881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) exited with exit code 1
>
>Fixes are available:
>
>a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
>   wrong; I will correct it in a separate message.)
>
>b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
>   AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
>   checkpoint, or in the startup process.  Hence, also arrange for only one
>   backend to call SimpleLruTruncate(AsyncCtl) at a time.
>
>c. Test "apparent wraparound" before writing WAL, and don't WAL-log
>   truncations that "apparent wraparound" forces us to skip.
>
>d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This removes
>   the TOCTOU race condition, but TransactionIdDidCommit() and other key
>   operations would be waiting behind filesystem I/O.
>
>e. Have the SLRU track a "low cutoff" for an ongoing truncation.  Initially,
>   the low cutoff is the page furthest in the past relative to cutoffPage (the
>   "high cutoff").  If SimpleLruZeroPage() wishes to use a page in the
>   truncation range, it would acquire an LWLock and increment the low cutoff.
>   Before unlinking any segment, SlruScanDirCbDeleteCutoff() would take the
>   same LWLock and recheck the segment against the latest low cutoff.
>
>With both (a) and (b), the only way I'd know to reach the "apparent
>wraparound" message is to restart in single-user mode and burn XIDs to the
>point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
>and I propose (c) for HEAD only.  I don't want (d), which threatens
>performance too much.  I would rather not have (e), because I expect it's more
>complex than (b) and fixes strictly less than (b) fixes.
>
>Can you see a way to improve on that plan?  Can you see other bugs of this
>nature that this plan does not fix?
>

Seems reasonable, although I wonder how much more expensive just doing (d)
would be.  It seems by far the least complex solution, and it moves "just"
the SlruScanDirectory() call before the lock.  It's true it adds I/O
requests, OTOH it's just unlink() without fsync() and I'd expect the number
of files to be relatively low.  Plus we already do SimpleLruWaitIO and
SlruInternalWritePage in the loop.

BTW isn't it an issue that SlruInternalDeleteSegment does not do any fsync
calls after unlinking the segments?  If the system crashes/reboots before
this becomes persistent (i.e. some of the segments reappear), won't that
cause a problem?

It's a bit unfortunate that a patch for a data corruption / loss issue
(even if a low-probability one) fell through multiple commitfests.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jan 05, 2020 at 01:33:55AM +0100, Tomas Vondra wrote:
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> >a. Fix the rounding in SimpleLruTruncate().  (The patch I posted upthread is
> >   wrong; I will correct it in a separate message.)
> >
> >b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other than
> >   AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> >   checkpoint, or in the startup process.  Hence, also arrange for only one
> >   backend to call SimpleLruTruncate(AsyncCtl) at a time.
> >
> >c. Test "apparent wraparound" before writing WAL, and don't WAL-log
> >   truncations that "apparent wraparound" forces us to skip.
> >
> >d. Hold the ControlLock for the entirety of SimpleLruTruncate().  This removes
> >   the TOCTOU race condition, but TransactionIdDidCommit() and other key
> >   operations would be waiting behind filesystem I/O.
> >
> >With both (a) and (b), the only way I'd know to reach the "apparent
> >wraparound" message is to restart in single-user mode and burn XIDs to the
> >point of bona fide wraparound.  Hence, I propose to back-patch (a) and (b),
> >and I propose (c) for HEAD only.  I don't want (d), which threatens
> >performance too much.  I would rather not have (e), because I expect it's more
> >complex than (b) and fixes strictly less than (b) fixes.
>
> Seems reasonable, although I wonder how much more expensive would just
> doing (d) be.  It seems by far the least complex solution, and it moves
> "just" the SlruScanDirectory() call before the lock.  It's true it adds
> I/O requests, OTOH it's just unlink() without fsync() and I'd expect the
> number of files to be relatively low.  Plus we already do SimpleLruWaitIO
> and SlruInternalWritePage in the loop.

Trivial read-only transactions often need CLogControlLock to check tuple
visibility.  If an unlink() takes 1s, stalling read-only transactions for
that 1s is a big problem.  SimpleLruWaitIO() and SlruInternalWritePage()
release the control lock during I/O, then re-acquire it.  (Moreover,
SimpleLruTruncate() rarely calls them.  Calling them implies a page old
enough to truncate was modified after the most recent checkpoint.)

> BTW isn't that an issue that SlruInternalDeleteSegment does not do any
> fsync calls after unlinking the segments?  If the system crashes/reboots
> before this becomes persistent (i.e. some of the segments reappear, won't
> that cause a problem)?

I think not; we could turn SlruInternalDeleteSegment() into a no-op, and the
only SQL-visible consequence would be extra disk usage.  CheckPoint fields
tell the server what region of slru data is meaningful, and segments outside
that range merely waste disk space.  (If that's not true and can't be made
true, we'd also need to stop ignoring the unlink() return value.)

> It's a bit unfortunate that a patch for a data corruption / loss issue
> (even if a low-probability one) fell through multiple commitfests.

Thanks for investing in steps to fix that.
Noah Misch <noah@leadboat.com> writes:
> On Sun, Jan 05, 2020 at 01:33:55AM +0100, Tomas Vondra wrote:
>> It's a bit unfortunate that a patch for a data corruption / loss issue
>> (even if a low-probability one) fell through multiple commitfests.

> Thanks for investing in steps to fix that.

Yeah, this patch has been waiting in the queue for way too long :-(.
I spent some time studying this, and I have a few comments/questions:

1. It seems like this discussion is conflating two different issues.
The original complaint was about a seemingly-bogus log message "could
not truncate directory "pg_xact": apparent wraparound".  Your theory
about that, IIUC, is that SimpleLruTruncate's initial round-back of
cutoffPage to a segment boundary moved us from a state where
shared->latest_page_number doesn't logically precede the cutoffPage to
one where it does, triggering the message.  So removing the initial
round-back, and then doing whatever's needful to compensate for that in
the later processing, is a reasonable fix to prevent the bogus warning.
However, you're also discussing whether or not an SLRU segment file that
is close to the wraparound boundary should get removed or not.  As far
as I can see that's 100% independent of issuance of the log message, no?
This might not affect the code substance of the patch at all; but it
seems like we need to be clear about it in our discussion, and maybe
the comments need to change too.

2. I wonder whether we have an issue even with rounding back to the
SLRU page boundary, as is done by each caller before we ever get to
SimpleLruTruncate.  I'm pretty sure that the actual anti-wraparound
protections are all precise to the XID level, so that there's some
daylight between what SimpleLruTruncate is told and the range of
data that the higher-level code thinks is of interest.  Maybe we
should restructure things so that we keep the original cutoff XID
(or equivalent) all the way through, and compare the start/end
positions of a segment file directly to that.

3. It feels like the proposed test of cutoff position against both
ends of a segment is a roundabout way of fixing the problem.  I
wonder whether we ought not pass *both* the cutoff and the current
endpoint (latest_page_number) down to the truncation logic, and
have it compare against both of those values.

To try to clarify this in my head, I thought about an image of
the modular XID space as an octagon, where each side would correspond to
a segment file if we chose numbers such that there were only 8 possible
segment files.  Let's say that nextXID is somewhere in the bottommost
octagon segment.  The oldest possible value for the truncation cutoff
XID is a bit less than "halfway around" from nextXID; so it could be
in the topmost octagon segment, if the minimum permitted daylight-
till-wraparound is less than the SLRU segment size (which it is).
Then, if we round the cutoff XID "back" to a segment boundary, most
of the current (bottommost) segment is now less than halfway around
from that cutoff point, and in particular the current segment's
starting page is exactly halfway around.  Because of the way that
TransactionIdPrecedes works, the current segment will be considered to
precede that cutoff point (the int32 difference comes out as exactly
2^31 which is negative).  Boom, data loss, because we'll decide the
current segment is removable.

I think that your proposed patch does fix this, but I'm not quite
sold that the corner cases (where the cutoff XID is itself exactly
at a page boundary) are right.  In any case, I think it'd be more
robust to be comparing explicitly against a notion of the latest
in-use page number, instead of backing into it from an assumption
that the cutoff XID itself is less than halfway around.  I wonder
if we ought to dodge the problem by having a higher minimum value
of the required daylight-before-wraparound, so that the cutoff point
couldn't be in the diametrically-opposite-current segment but would
have to be at least one segment before that.

In the end, I believe that all of this logic was written under an
assumption that we should never get into a situation where we are
so close to the wraparound threshold that considerations like these
would manifest.  Maybe we can get it right, but I don't have huge
faith in it.

It also bothers me that some of the callers of SimpleLruTruncate
have explicit one-count backoffs of the cutoff point and others
do not.  There's no obvious reason for the difference, so I wonder
if that isn't something we should have across-the-board, or else
adjust SimpleLruTruncate to do the equivalent thing internally.

I haven't thought much yet about your second point about race
conditions arising from nextXID possibly moving before we finish
the deletion scan.  Maybe we could integrate a fix for that issue,
along the lines of (1) see an SLRU segment file, (2) determine that
it appears to precede the cutoff XID, if so (3) acquire the control
lock and fetch latest_page_number, compare against that to verify
that the segment file is old and not new, then (4) unlink if that
still holds.

			regards, tom lane
On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
> Yeah, this patch has been waiting in the queue for way too long :-(.

Thanks for reviewing.

> I spent some time studying this, and I have a few comments/questions:
>
> 1. It seems like this discussion is conflating two different issues.
> The original complaint was about a seemingly-bogus log message "could
> not truncate directory "pg_xact": apparent wraparound".  Your theory
> about that, IIUC, is that SimpleLruTruncate's initial round-back of
> cutoffPage to a segment boundary moved us from a state where
> shared->latest_page_number doesn't logically precede the cutoffPage to
> one where it does, triggering the message.  So removing the initial
> round-back, and then doing whatever's needful to compensate for that in
> the later processing, is a reasonable fix to prevent the bogus warning.
> However, you're also discussing whether or not an SLRU segment file that
> is close to the wraparound boundary should get removed or not.

When the newest XID and the oldest XID fall in "opposite" segments in the
XID space, we must not unlink the segment containing the newest XID.  That
is the chief goal at present.

> As far as I can see that's 100% independent of issuance of the log
> message, no?

Perhaps confusing is that the first message of the thread and the subject
line contain wrong claims, which I corrected in the 2019-02-13 message[1].
Due to point (2) in [1], it's essential to make the "apparent wraparound"
message a can't-happen event.  Hence, I'm not looking to improve the message
or its firing conditions.  I want to fix the excess segment deletion, at
which point the message will become unreachable except under single-user
mode.

> 2. I wonder whether we have an issue even with rounding back to the
> SLRU page boundary, as is done by each caller before we ever get to
> SimpleLruTruncate.  I'm pretty sure that the actual anti-wraparound
> protections are all precise to the XID level, so that there's some
> daylight between what SimpleLruTruncate is told and the range of
> data that the higher-level code thinks is of interest.  Maybe we
> should restructure things so that we keep the original cutoff XID
> (or equivalent) all the way through, and compare the start/end
> positions of a segment file directly to that.

Currently, slru.c knows nothing about the division of pages into records.
Hmm.  To keep oldestXact all the way through, I suppose the PagePrecedes
callback could become one or more record-oriented (e.g. XID-oriented)
callbacks.  The current scheme just relies on TransactionIdToPage() in
TruncateCLOG().  If TransactionIdToPage() had a bug, all sorts of CLOG
lookups would do wrong things.  Hence, I think today's scheme is tougher to
get wrong.  Do you see it differently?

> 3. It feels like the proposed test of cutoff position against both
> ends of a segment is a roundabout way of fixing the problem.  I
> wonder whether we ought not pass *both* the cutoff and the current
> endpoint (latest_page_number) down to the truncation logic, and
> have it compare against both of those values.

Since latest_page_number can keep changing throughout SlruScanDirectory()
execution, that would give a false impression of control.  Better to
demonstrate that the xidWrapLimit machinery keeps latest_page_number within
acceptable constraints than to ascribe significance to a comparison with a
stale latest_page_number.

> To try to clarify this in my head, I thought about an image of
> the modular XID space as an octagon, where each side would correspond to
> a segment file if we chose numbers such that there were only 8 possible
> segment files.  [...]  Because of the way that
> TransactionIdPrecedes works, the current segment will be considered to
> precede that cutoff point (the int32 difference comes out as exactly
> 2^31 which is negative).  Boom, data loss, because we'll decide the
> current segment is removable.

Exactly.
https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
uses your octagon to show the behaviors before and after this patch.

> I think that your proposed patch does fix this, but I'm not quite
> sold that the corner cases (where the cutoff XID is itself exactly
> at a page boundary) are right.

That's a good thing to worry about.  More specifically, I think the edge
case to check is when oldestXact is the last XID of a _segment_.  That case
maximizes the XIDs we can delete.  At that time, xidWrapLimit should
likewise fall near the end of some opposing segment that we refuse to
unlink.

> It also bothers me that some of the callers of SimpleLruTruncate
> have explicit one-count backoffs of the cutoff point and others
> do not.  There's no obvious reason for the difference, so I wonder
> if that isn't something we should have across-the-board, or else
> adjust SimpleLruTruncate to do the equivalent thing internally.

Consider the case of PerformOffsetsTruncation().  If newOldestMulti is the
first of a page, then SimpleLruTruncate() gets the previous page.  If that
page and the newOldestMulti page fall in different segments, could we unlink
the segment that contained newOldestMulti, or does some other off-by-one
compensate?  I'm not sure.  I do know that to lose mxact data this way, one
must reach multiStopLimit, restart in single-user mode, and consume all the
way to the edge of multiWrapLimit.  Considering the rarity of those
preconditions, I am not inclined to bundle a fix with $SUBJECT.  (Even OID
reuse race conditions may present more risk than this does.)  If someone
does pursue a fix here, I recommend looking at other fixes before making the
other callers subtract one.  The subtraction in TruncateSUBTRANS() and
PerformOffsetsTruncation() is a hack.

[1] https://postgr.es/m/20190214072623.GA1139206@rfd.leadboat.com
Noah Misch <noah@leadboat.com> writes:
> On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
>> 1. It seems like this discussion is conflating two different issues.

> When the newest XID and the oldest XID fall in "opposite" segments in the XID
> space, we must not unlink the segment containing the newest XID.  That is the
> chief goal at present.

Got it.  Thanks for clarifying the scope of the patch.

>> 3. It feels like the proposed test of cutoff position against both
>> ends of a segment is a roundabout way of fixing the problem.  I
>> wonder whether we ought not pass *both* the cutoff and the current
>> endpoint (latest_page_number) down to the truncation logic, and
>> have it compare against both of those values.

> Since latest_page_number can keep changing throughout SlruScanDirectory()
> execution, that would give a false impression of control.  Better to
> demonstrate that the xidWrapLimit machinery keeps latest_page_number within
> acceptable constraints than to ascribe significance to a comparison with a
> stale latest_page_number.

Perhaps.  I'm prepared to accept that line of argument so far as the clog
SLRU goes, but I'm not convinced that the other SLRUs have equally robust
defenses against advancing too far.  So on the whole I'd rather that the
SLRU logic handled this issue strictly on the basis of what it knows,
without assumptions about what calling code may be doing.  Still, maybe
we only really care about the risk for the clog SLRU?

>> To try to clarify this in my head, I thought about an image of
>> the modular XID space as an octagon, where each side would correspond to
>> a segment file if we chose numbers such that there were only 8 possible
>> segment files.

> Exactly.
> https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
> uses your octagon to show the behaviors before and after this patch.

Cool, thanks for drafting that up.  (My original sketch was not of
publishable quality ;-).)  To clarify, the upper annotations probably
ought to read "nextXid <= xidWrapLimit"?  And "cutoffPage" ought to be
affixed to the orange dot at lower right of the center image?

I agree that this diagram depicts why we have a problem right now,
and the right-hand image shows what we want to have happen.
What's a little less clear is whether the proposed patch achieves
that effect.

In particular, after studying this awhile, it seems like removal
of the initial "cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT"
adjustment isn't really affecting anything.  It's already the case
that just by allowing oldestXact to get rounded back to an SLRU page
boundary, we've created some daylight between oldestXact and the
cutoff point.  Rounding back further within the same SLRU segment
changes nothing.  (For example, suppose that oldestXact is already
within the oldest page of its SLRU segment.  Then either rounding
rule has the same effect.  But there's still a little bit of room
for xidWrapLimit to be in the opposite SLRU segment, allowing
trouble.)

So I think what we're actually trying to accomplish here is to
ensure that instead of deleting up to half of the SLRU space
before the cutoff, we delete up to half-less-one-segment.
Maybe it should be half-less-two-segments, just to provide some
cushion against edge cases.  Reading the first comment in
SetTransactionIdLimit makes one not want to trust too much in
arguments based on the exact value of xidWrapLimit, while for
the other SLRUs it was already unclear whether the edge cases
were exactly right.

In any case, it feels like the specific solution you have here,
of testing both ends of the segment, is a roundabout way of
providing that one-segment slop; and it doesn't help if we decide
we need two-segment slop.  Can we write the test in a way that
explicitly provides N segments of slop?

			regards, tom lane
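One hypothetical shape for an explicit "N segments of slop" test, purely as
a sketch: the nslop parameter, the function name, and this formulation are
not from any posted patch.  The idea is to keep a segment whenever shifting
its first page nslop segments further into the past would carry it across
the point diametrically opposite the cutoff:

```c
#include <stdint.h>
#include <stdbool.h>

#define SLRU_PAGES_PER_SEGMENT 32

/* Circular page comparison in the style of the SLRU PagePrecedes callbacks. */
static bool
PagePrecedes(uint32_t page1, uint32_t page2)
{
	return (int32_t) (page1 - page2) < 0;
}

/*
 * With nslop == 0 this reduces to the both-ends test; each increment of
 * nslop additionally protects one more segment adjacent to the antipode
 * of cutoffPage, i.e. the oldest segments that would otherwise be the
 * first candidates for deletion.
 */
static bool
SlruMayDeleteSegmentSlop(uint32_t segpage, uint32_t cutoffPage, int nslop)
{
	uint32_t	seg_last_page = segpage + SLRU_PAGES_PER_SEGMENT - 1;
	uint32_t	shifted_first = segpage - (uint32_t) nslop * SLRU_PAGES_PER_SEGMENT;

	return PagePrecedes(shifted_first, cutoffPage) &&
		PagePrecedes(seg_last_page, cutoffPage);
}
```

For example, the oldest deletable segment (the one just past the antipode)
passes with nslop == 0 but is kept with nslop == 1, while clearly-old
segments remain deletable either way.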
On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
> >> 3. It feels like the proposed test of cutoff position against both
> >> ends of a segment is a roundabout way of fixing the problem. I
> >> wonder whether we ought not pass *both* the cutoff and the current
> >> endpoint (latest_page_number) down to the truncation logic, and
> >> have it compare against both of those values.
>
> > Since latest_page_number can keep changing throughout SlruScanDirectory()
> > execution, that would give a false impression of control. Better to
> > demonstrate that the xidWrapLimit machinery keeps latest_page_number within
> > acceptable constraints than to ascribe significance to a comparison with a
> > stale latest_page_number.
>
> Perhaps. I'm prepared to accept that line of argument so far as the clog
> SLRU goes, but I'm not convinced that the other SLRUs have equally robust
> defenses against advancing too far. So on the whole I'd rather that the
> SLRU logic handled this issue strictly on the basis of what it knows,
> without assumptions about what calling code may be doing. Still, maybe
> we only really care about the risk for the clog SLRU?

PerformOffsetsTruncation() is the most at-risk, since a single VACUUM could burn millions of multixacts via FreezeMultiXactId() calls. (To make that happen in single-user mode, I suspect one could use prepared transactions as active lockers and/or in-progress updaters.) I'm not concerned about other SLRUs. TruncateCommitTs() moves in lockstep with TruncateCLOG(). The other SimpleLruTruncate() callers handle data that becomes obsolete at every postmaster restart.

> > Exactly.
> > https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
> > uses your octagon to show the behaviors before and after this patch.
>
> Cool, thanks for drafting that up. (My original sketch was not of
> publishable quality ;-).)
> To clarify, the upper annotations probably
> ought to read "nextXid <= xidWrapLimit"?

It diagrams the scenario of nextXid reaching xidWrapLimit, so the green dot represents both values.

> And "cutoffPage" ought
> to be affixed to the orange dot at lower right of the center image?

No; oldestXact and cutoffPage have the same position in that diagram, because the patch causes the cutoffPage variable to denote the page that contains oldestXact. I've now added an orange dot to show that.

> I agree that this diagram depicts why we have a problem right now,
> and the right-hand image shows what we want to have happen.
> What's a little less clear is whether the proposed patch achieves
> that effect.
>
> In particular, after studying this awhile, it seems like removal
> of the initial "cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT"
> adjustment isn't really affecting anything.

True. The set of unlink() calls needs to be the same for oldestXact in the first page of a segment, in the last page, or in some interior page. Removing the rounding neither helps nor hurts correctness.

> So I think what we're actually trying to accomplish here is to
> ensure that instead of deleting up to half of the SLRU space
> before the cutoff, we delete up to half-less-one-segment.
> Maybe it should be half-less-two-segments, just to provide some
> cushion against edge cases. Reading the first comment in
> SetTransactionIdLimit makes one not want to trust too much in
> arguments based on the exact value of xidWrapLimit, while for
> the other SLRUs it was already unclear whether the edge cases
> were exactly right.

That could be interesting insurance. While it would be sad for us to miss an edge case and print "must be vacuumed within 2 transactions" when wrap has already happened, reaching that message implies the DBA burned ~1M XIDs, all in single-user mode. More plausible is FreezeMultiXactId() overrunning the limit by tens of segments.
Hence, if we do buy this insurance, let's skip far more segments. For example, instead of unlinking segments representing up to 2^31 past XIDs, we could divide that into an upper half that we unlink and a lower half. The lower half will stay in place; eventually, XID consumption will overwrite it. Truncation behavior won't change until the region of CLOG for pre-oldestXact XIDs exceeds 256 MiB. Beyond that threshold, vac_truncate_clog() will unlink the upper 256 MiB and leave the rest. CLOG maximum would rise from 512 MiB to 768 MiB. Would that be worthwhile?
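[Editorial note: the sizes quoted in the proposal can be sanity-checked with the stock CLOG density of four transactions per byte (two status bits each). A back-of-envelope sketch, with invented helper names:]

```c
#include <assert.h>
#include <stdint.h>

/* Stock CLOG density: two status bits per transaction, so four
 * transactions per byte.  clog_bytes() is an invented helper. */
#define CLOG_XACTS_PER_BYTE 4
#define MiB ((uint64_t) 1024 * 1024)

static uint64_t
clog_bytes(uint64_t nxacts)
{
	return nxacts / CLOG_XACTS_PER_BYTE;
}
```

The 2^31 past XIDs that are unlink-eligible today occupy 512 MiB of CLOG; keeping the lower 256 MiB of that in place while continuing to retain up to 512 MiB of live data yields the 768 MiB maximum mentioned above.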
Noah Misch <noah@leadboat.com> writes:
> On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
>> So I think what we're actually trying to accomplish here is to
>> ensure that instead of deleting up to half of the SLRU space
>> before the cutoff, we delete up to half-less-one-segment.
>> Maybe it should be half-less-two-segments, just to provide some
>> cushion against edge cases. Reading the first comment in
>> SetTransactionIdLimit makes one not want to trust too much in
>> arguments based on the exact value of xidWrapLimit, while for
>> the other SLRUs it was already unclear whether the edge cases
>> were exactly right.

> That could be interesting insurance. While it would be sad for us to miss an
> edge case and print "must be vacuumed within 2 transactions" when wrap has
> already happened, reaching that message implies the DBA burned ~1M XIDs, all
> in single-user mode. More plausible is FreezeMultiXactId() overrunning the
> limit by tens of segments. Hence, if we do buy this insurance, let's skip far
> more segments. For example, instead of unlinking segments representing up to
> 2^31 past XIDs, we could divide that into an upper half that we unlink and a
> lower half. The lower half will stay in place; eventually, XID consumption
> will overwrite it. Truncation behavior won't change until the region of CLOG
> for pre-oldestXact XIDs exceeds 256 MiB. Beyond that threshold,
> vac_truncate_clog() will unlink the upper 256 MiB and leave the rest. CLOG
> maximum would rise from 512 MiB to 768 MiB. Would that be worthwhile?

Hmm. I'm not particularly concerned about the disk-space-consumption angle, but I do wonder about whether we'd be sacrificing the ability to recover cleanly from a situation that the code does let you get into. However, as long as we're sure that the system will ultimately reuse/recycle a not-deleted old segment file without complaint, it's hard to find much fault with your proposal.
Temporarily wasting some disk space is a lot more palatable than corrupting data, and these code paths are necessarily not terribly well tested. So +1 for more insurance.

			regards, tom lane
On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> >> So I think what we're actually trying to accomplish here is to
> >> ensure that instead of deleting up to half of the SLRU space
> >> before the cutoff, we delete up to half-less-one-segment.
> >> Maybe it should be half-less-two-segments, just to provide some
> >> cushion against edge cases. Reading the first comment in
> >> SetTransactionIdLimit makes one not want to trust too much in
> >> arguments based on the exact value of xidWrapLimit, while for
> >> the other SLRUs it was already unclear whether the edge cases
> >> were exactly right.
>
> > That could be interesting insurance. While it would be sad for us to miss an
> > edge case and print "must be vacuumed within 2 transactions" when wrap has
> > already happened, reaching that message implies the DBA burned ~1M XIDs, all
> > in single-user mode. More plausible is FreezeMultiXactId() overrunning the
> > limit by tens of segments. Hence, if we do buy this insurance, let's skip far
> > more segments. For example, instead of unlinking segments representing up to
> > 2^31 past XIDs, we could divide that into an upper half that we unlink and a
> > lower half. The lower half will stay in place; eventually, XID consumption
> > will overwrite it. Truncation behavior won't change until the region of CLOG
> > for pre-oldestXact XIDs exceeds 256 MiB. Beyond that threshold,
> > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest. CLOG
> > maximum would rise from 512 MiB to 768 MiB. Would that be worthwhile?
>
> Hmm. I'm not particularly concerned about the disk-space-consumption
> angle, but I do wonder about whether we'd be sacrificing the ability to
> recover cleanly from a situation that the code does let you get into.
> However, as long as we're sure that the system will ultimately reuse/
> recycle a not-deleted old segment file without complaint, it's hard to
> find much fault with your proposal.

That is the trade-off. By distancing ourselves from the wraparound edge cases, we'll get more segment recycling (heretofore an edge case). Fortunately, recycling doesn't change behavior as you approach some limit; it works or it doesn't.

> Temporarily wasting some disk
> space is a lot more palatable than corrupting data, and these code
> paths are necessarily not terribly well tested. So +1 for more
> insurance.

Okay, I'll give that a try. I expect this will replace the PagePrecedes callback with a PageDiff callback such that PageDiff(a, b) < 0 iff PagePrecedes(a, b). PageDiff callbacks shall distribute return values uniformly in [INT_MIN, INT_MAX]. SimpleLruTruncate() will unlink segments where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.
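[Editorial note: a sketch of what such a PageDiff might look like for a CLOG-sized page space (2^17 pages: 2^32 XIDs at 32768 XIDs per page with BLCKSZ = 8192). The names page_diff and seg_unlink_ok are placeholders, not the eventual patch's API.]

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* CLOG at BLCKSZ = 8192: 2^32 XIDs / 32768 XIDs-per-page = 2^17 pages.
 * SCALE = 2^32 / NPAGES spreads results over [INT_MIN, INT_MAX]. */
#define NPAGES (1 << 17)
#define SCALE  (1 << 15)

/* Signed modular page distance, scaled per the message above. */
static int
page_diff(int a, int b)
{
	uint32_t	d = ((uint32_t) (a - b)) % NPAGES;	/* 0 .. NPAGES-1 */
	int32_t		ds = (d >= NPAGES / 2) ? (int32_t) d - NPAGES : (int32_t) d;

	return ds * SCALE;
}

/* Unlink-eligibility rule from the message: strictly in the past
 * (diff < 0), but within a quarter-circle of the cutoff
 * (diff > INT_MIN / 2); anything older is kept as insurance. */
static int
seg_unlink_ok(int segpage, int cutoffpage)
{
	int		d = page_diff(segpage, cutoffpage);

	return d < 0 && d > INT_MIN / 2;
}
```

Here PagePrecedes(a, b) corresponds to page_diff(a, b) < 0, and the extra d > INT_MIN / 2 clause is the insurance: the furthest-past region is left in place for XID consumption to overwrite.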
On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote:
> On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> > >> So I think what we're actually trying to accomplish here is to
> > >> ensure that instead of deleting up to half of the SLRU space
> > >> before the cutoff, we delete up to half-less-one-segment.
> > >> Maybe it should be half-less-two-segments, just to provide some
> > >> cushion against edge cases. Reading the first comment in
> > >> SetTransactionIdLimit makes one not want to trust too much in
> > >> arguments based on the exact value of xidWrapLimit, while for
> > >> the other SLRUs it was already unclear whether the edge cases
> > >> were exactly right.
>
> > > That could be interesting insurance. While it would be sad for us to miss an
> > > edge case and print "must be vacuumed within 2 transactions" when wrap has
> > > already happened, reaching that message implies the DBA burned ~1M XIDs, all
> > > in single-user mode. More plausible is FreezeMultiXactId() overrunning the
> > > limit by tens of segments. Hence, if we do buy this insurance, let's skip far
> > > more segments. For example, instead of unlinking segments representing up to
> > > 2^31 past XIDs, we could divide that into an upper half that we unlink and a
> > > lower half. The lower half will stay in place; eventually, XID consumption
> > > will overwrite it. Truncation behavior won't change until the region of CLOG
> > > for pre-oldestXact XIDs exceeds 256 MiB. Beyond that threshold,
> > > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest. CLOG
> > > maximum would rise from 512 MiB to 768 MiB. Would that be worthwhile?
>
> > Temporarily wasting some disk
> > space is a lot more palatable than corrupting data, and these code
> > paths are necessarily not terribly well tested. So +1 for more
> > insurance.
>
> Okay, I'll give that a try.
> I expect this will replace the PagePrecedes
> callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
> PagePrecedes(a, b). PageDiff callbacks shall distribute return values
> uniformly in [INT_MIN,INT_MAX]. SimpleLruTruncate() will unlink segments
> where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.

While doing so, I found that slru-truncate-modulo-v2.patch did get edge cases wrong, as you feared. In particular, if the newest XID reached xidStopLimit and was in the first page of a segment, TruncateCLOG() would delete its segment. Attached slru-truncate-modulo-v3.patch fixes that; as restitution, I added unit tests covering that and other scenarios. Reaching the bug via XIDs was hard, requiring one to burn 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in single-user mode. I expect the bug was easier to reach via pg_multixact.

The insurance patch stacks on top of the bug fix patch. It does have a negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest to skip truncation in corrupted clusters. SlruScanDirCbFindEarliest() gives nonsense answers if "future" segments exist. That can happen today, but the patch creates new ways to make it happen. The symptom is wasting yet more space in pg_multixact. I am okay with this, since it arises only after one fills pg_multixact 50% full. There are alternatives. We could weaken the corruption defense in TruncateMultiXact() or look for another implementation of equivalent defense. We could unlink, say, 75% or 95% of the "past" instead of 50% (this patch) or >99.99% (today's behavior).

Thanks,
nm
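[Editorial note: the both-ends rule that the fixed patch implements can again be shown on the shrunken octagon model. Names and constants are illustrative only, not the patch's code.]

```c
#include <assert.h>

/* Shrunken circular page space: 64 pages, 8-page segments.  Invented
 * names for illustration. */
#define TOY_TOTAL_PAGES 64
#define TOY_PAGES_PER_SEG 8

/* Circular "precedes": true iff a is in the half-circle before b. */
static int
toy_page_precedes(int a, int b)
{
	int		d = (a - b) & (TOY_TOTAL_PAGES - 1);

	if (d >= TOY_TOTAL_PAGES / 2)
		d -= TOY_TOTAL_PAGES;
	return d < 0;
}

/* Both-ends rule: a segment may be unlinked only if BOTH its first and
 * last pages precede the (unrounded) cutoff page. */
static int
toy_seg_deletable(int segfirst, int cutoff)
{
	return toy_page_precedes(segfirst, cutoff) &&
		toy_page_precedes(segfirst + TOY_PAGES_PER_SEG - 1, cutoff);
}
```

With cutoff page 5, the genuinely old segment starting at page 56 passes both checks and may be unlinked, while the segment starting at page 32, which holds the newest pages, fails the first-page check and survives even though its last page (39) does "precede" the cutoff on the circle.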
Hi Noah!

I've found this thread in the CF looking for something to review.

> On 9 Nov 2020, at 09:53, Noah Misch <noah@leadboat.com> wrote:
>
> Rebased both patches, necessitated by commit c732c3f (a repair of commit
> dee663f). As I mentioned on another branch of the thread, I'd be content if
> someone reviews the slru-truncate-modulo patch and disclaims knowledge of the
> slru-truncate-insurance patch; I would then abandon the latter patch.
> <slru-truncate-modulo-v5.patch><slru-truncate-t-insurance-v4.patch>

Commit c732c3f adds some SYNC_FORGET_REQUESTs. slru-truncate-modulo-v5.patch fixes an off-by-one error in functions like *PagePrecedes(int page1, int page2). slru-truncate-t-insurance-v4.patch ensures that off-by-one errors do not inflict data loss. While I agree that fixing the error is better than hiding it, I could not figure out how c732c3f is connected to these patches. Can you please give me a few pointers on how to understand this connection?

Best regards, Andrey Borodin.
> On 2 Jan 2021, at 01:35, Noah Misch <noah@leadboat.com> wrote:
>
> There's no
> other connection to this thread, and one can review patches on this thread
> without studying commit c732c3f.

OK, thanks! Do I understand correctly that this is a bugfix that needs to be back-patched? Thus we should not refactor the four identical *PagePrecedes(int page1, int page2) functions into one generic function?

Since the functions are not symmetric anymore, maybe we should have better names for the arguments than "page1" and "page2"? At least in the dev branch.

Is it common practice to embed tests into assert checking like in SlruPagePrecedesUnitTests()?

SLRU seems nowhere near simple, BTW. The only simple place is the naive caching algorithm. I remember there was a thread about making relations from SLRUs.

Best regards, Andrey Borodin.
> On 1 Jan 2021, at 23:05, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I've found this thread in CF looking for something to review.

We discussed the patches with Noah offlist. I'm resending a summary to the list.

There are two patches:
1. slru-truncate-modulo-v5.patch
2. slru-truncate-t-insurance-v4.patch

It would be a bit easier to review if the patches were versioned together (v5 both), because the 2nd patch applies on top of the 1st. Also, the 2nd patch has a problem if applied with git (in async.c).

The first patch fixes a bug where SLRU truncation around the wrapping point can happen too early.

The basic idea of the patch is: "If we want to delete a range we must be eligible to delete its beginning and ending." So to test whether a page is deletable, it tests that the first and last xids (or other SLRU's units) are of no interest. To test whether a segment is deletable, it tests whether its first and last pages can be deleted.

The patch adds tests in an unusual manner: they are implemented as assert functions. The tests are fast; they only check basic and edge cases. But the tests will not be run if Postgres is built without asserts.

I'm a little suspicious of the implementation of the *PagePrecedes(int page1, int page2) functions. Consider the following example from the patch:

	static bool
	CommitTsPagePrecedes(int page1, int page2)
	{
		TransactionId xid1;
		TransactionId xid2;

		xid1 = ((TransactionId) page1) * COMMIT_TS_XACTS_PER_PAGE;
		xid1 += FirstNormalTransactionId + 1;
		xid2 = ((TransactionId) page2) * COMMIT_TS_XACTS_PER_PAGE;
		xid2 += FirstNormalTransactionId + 1;

		return (TransactionIdPrecedes(xid1, xid2) &&
				TransactionIdPrecedes(xid1, xid2 + COMMIT_TS_XACTS_PER_PAGE - 1));
	}

We are adding FirstNormalTransactionId to the xids to avoid them being special xids. We add COMMIT_TS_XACTS_PER_PAGE - 1 to xid2 to shift it to the end of the page. But due to += FirstNormalTransactionId we shift slightly past the page boundaries and risk that xid2 + COMMIT_TS_XACTS_PER_PAGE - 1 can become FrozenTransactionId (FirstNormalTransactionId - 1).
Thus we add +1 to all values in scope. While the logic is correct, the coding is difficult. Maybe we could just use

	page1_first_normal_xid = ((TransactionId) page1) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
	page2_first_normal_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + FirstNormalTransactionId;
	page2_last_xid = ((TransactionId) page2) * CLOG_XACTS_PER_PAGE + CLOG_XACTS_PER_PAGE - 1;

But I'm not insisting on this.

The following comment is not correct for 1 KB and 4 KB pages:

	+ * At every supported BLCKSZ, (1 << 31) % COMMIT_TS_XACTS_PER_PAGE == 128.

All the notes above are not blocking the fix; I just wanted to let the committer know about them. I think that it's very important to have this bug fixed.

The second patch converts the binary *PagePrecedes() functions to *PageDiff()s and adds logic to avoid deleting pages in suspicious cases. This logic depends on the scale of the returned diff value: it expects that overflow happens between INT_MIN and INT_MAX. Thus it prevents page deletion if page_diff <= INT_MIN / 2 (too far from the current cleaning point; and in normal cases, of course).

It must be a comparison here, not a bare truth-value test:

	-		ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
	+		ctl->PageDiff(segpage, trunc->earliestExistingPage))

This

	int			diff_max = ((QUEUE_MAX_PAGE + 1) / 2) - 1,

seems to be the functional equivalent of

	int			diff_max = ((QUEUE_MAX_PAGE - 1) / 2),

What I like about the patch is that it removes all the trickery described above around "+ FirstNormalTransactionId + 1".

AFAICS the overall purpose of the 2nd patch is to help clusters corrupted by other bugs avoid deleting SLRU segments. I'm a little bit afraid that this kind of patch can hide bugs (while potentially saving some users' data). That said, this patch seems like a useful precaution. Maybe we could emit scary warnings if SLRU segments do not stack into a continuous range?

Thanks!

Best regards, Andrey Borodin.
> On 9 Jan 2021, at 15:17, Noah Misch <noah@leadboat.com> wrote:
>
>> This
>> 	int			diff_max = ((QUEUE_MAX_PAGE + 1) / 2) - 1,
>> seems to be the functional equivalent of
>> 	int			diff_max = ((QUEUE_MAX_PAGE - 1) / 2),
>
> Do you think one conveys the concept better than the other?

I see now that the subsequent comments mention "(QUEUE_MAX_PAGE+1)/2", so I think there is no need to change anything in the patch here.

>> I'm a little bit afraid that this kind of patch can hide bugs (while potentially saving some users' data). That said, this patch seems like a useful precaution. Maybe we could emit scary warnings if SLRU segments do not stack into a continuous range?
>
> Scary warnings are good for an observation that implies a bug, but the
> slru-truncate-t-insurance patch causes such an outcome in non-bug cases where
> it doesn't happen today. In other words, discontinuous ranges of SLRU
> segments would be even more common after that patch. For example, it would
> happen anytime oldestXID advances by more than ~1B at a time.

Uhm, I thought that if there are going to be more than ~1B xids, we are going to keep all segments forever and the range will still be continuous. Or am I missing something?

Thanks!

Best regards, Andrey Borodin.
> On 10 Jan 2021, at 03:15, Noah Misch <noah@leadboat.com> wrote:
>
> No; it deletes the most recent ~1B and leaves the older segments. An
> exception is multixact, as described in the commit message and the patch's
> change to a comment in TruncateMultiXact().

Thanks for the clarification.

One more thing: the retention point at 3/4 of the overall space (half of wraparound) seems more or less arbitrary to me. Why not 5/8 or 9/16?

Can you please send revised patches with the fixes?

Thanks!

Best regards, Andrey Borodin.
> On 10 Jan 2021, at 14:43, Noah Misch <noah@leadboat.com> wrote:
>
>> Can you please send revised patches with the fixes?
>
> Attached.
> <slru-truncate-modulo-v6.patch>
> <slru-truncate-t-insurance-v5.patch>

I'm marking the patch as ready for committer.

I can't tell whether we should back-patch the insurance patch or not: it potentially fixes unknown bugs, and potentially contains unknown bugs. I can't reason because of such uncertainty. I've tried to look for any potential problem, and so far I see none. Chances are <slru-truncate-t-insurance-v5.patch> makes the code less error-prone.

The fix <slru-truncate-modulo-v6.patch> is certainly worth back-patching.

Thanks!

Best regards, Andrey Borodin.
> On 12 Jan 2021, at 13:49, Noah Misch <noah@leadboat.com> wrote:
>
> What do you think of abandoning slru-truncate-t-insurance entirely? As of
> https://postgr.es/m/20200330052809.GB2324620@rfd.leadboat.com I liked the idea
> behind it, despite its complicating the system for hackers and DBAs. The
> TruncateMultiXact() interaction rendered it less appealing. In v14+, commit
> cd5e822 mitigates the kind of bugs that slru-truncate-t-insurance mitigates,
> further reducing the latter's value. slru-truncate-t-insurance does mitigate
> larger trespasses into unlink-eligible space, though.

It seems to me that not committing the insurance patch is not a mistake. Let's abandon slru-truncate-t-insurance for now.

>> The fix <slru-truncate-modulo-v6.patch> is certainly worth back-patching.
>
> I'll push it on Saturday, probably.

Thanks!

Best regards, Andrey Borodin.