Thread: SimpleLruTruncate() mutual exclusion
I'm forking this thread from
https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com, which
reported a race condition involving the "apparent wraparound" safety check
in SimpleLruTruncate():

On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> 1. The result of the test is valid only until we release the SLRU
> ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> cutoff to evaluate segments for deletion. Once we release that lock,
> latest_page_number can advance. This creates a TOCTOU race condition,
> allowing excess deletion:
>
> [local] test=# table trunc_clog_concurrency ;
> ERROR: could not access status of transaction 2149484247
> DETAIL: Could not open file "pg_xact/0801": No such file or directory.

> Fixes are available:

> b. Arrange so only one backend runs vac_truncate_clog() at a time. Other
> than AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> checkpoint, or in the startup process. Hence, also arrange for only one
> backend to call SimpleLruTruncate(AsyncCtl) at a time.

More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to
one backend per cluster. This, attached, was rather straightforward.

I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of a VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, autovacuum has this logic:

     * ... we skip
     * this if (1) we found no work to do and (2) we skipped at least one
     * table due to concurrent autovacuum activity. In that case, the
     * other worker has already done it, or will do so when it finishes.
     */
    if (did_vacuum || !found_concurrent_worker)
        vac_update_datfrozenxid();

That makes me relatively unworried. I did consider some alternatives:

- Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If
  we find the need for pg_database updates, take the lock and scan
  pg_class again to get final numbers. This doubles the work in cases that
  end up taking the lock, so I'm not betting on it being a net win.

- Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
  backend is already waiting. This is similar to autovacuum's
  found_concurrent_worker test. It is tempting. I'm not proposing it,
  because it changes the states possible when a manual VACUUM completes.
  Today, you can predict post-VACUUM datfrozenxid from post-VACUUM
  relfrozenxid values. If a manual VACUUM could skip
  vac_update_datfrozenxid() this way, datfrozenxid could lag until some
  concurrent VACUUM finishes.

Thanks,
nm
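A minimal sketch of the shape of this approach follows; the identifier
names here are placeholders of mine, and the attached patch is
authoritative for the real names and exact placement:

    /*
     * Sketch only. One backend per database: take a heavyweight lock
     * keyed on the current database before computing new frozen IDs.
     */
    void
    vac_update_datfrozenxid(void)
    {
        LOCKTAG     tag;

        /* hypothetical tag; blocks concurrent updaters of this database */
        SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
        (void) LockAcquire(&tag, ExclusiveLock, false, false);

        /* ... existing pg_class scan and pg_database update ... */

        /* the lock is released automatically at transaction end */
    }

    /*
     * One backend per cluster: wrap the truncation decision and the
     * segment unlinks in a single cluster-wide LWLock.
     */
    static void
    vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti,
                      TransactionId lastSaneFrozenXid,
                      MultiXactId lastSaneMinMulti)
    {
        /* hypothetical LWLock serializing SLRU truncation */
        LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);

        /* ... scan pg_database for the minima, call TruncateCLOG() and
         * TruncateMultiXact(), then SetTransactionIdLimit() ... */

        LWLockRelease(WrapLimitsVacuumLock);
    }

asyncQueueAdvanceTail() would get the same one-per-cluster treatment with
its own lock.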
Attachment
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> I'm forking this thread from
> https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com, which
> reported a race condition involving the "apparent wraparound" safety
> check in SimpleLruTruncate():
>
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > 1. The result of the test is valid only until we release the SLRU
> > ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> > cutoff to evaluate segments for deletion. Once we release that lock,
> > latest_page_number can advance. This creates a TOCTOU race condition,
> > allowing excess deletion:
> >
> > [local] test=# table trunc_clog_concurrency ;
> > ERROR: could not access status of transaction 2149484247
> > DETAIL: Could not open file "pg_xact/0801": No such file or directory.
>
> > Fixes are available:
>
> > b. Arrange so only one backend runs vac_truncate_clog() at a time.
> > Other than AsyncCtl, every SLRU truncation appears in
> > vac_truncate_clog(), in a checkpoint, or in the startup process.
> > Hence, also arrange for only one backend to call
> > SimpleLruTruncate(AsyncCtl) at a time.
>
> More specifically, restrict vac_update_datfrozenxid() to one backend per
> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail()
> to one backend per cluster. This, attached, was rather straightforward.

Rebased. The conflicts were limited to comments and documentation.

> I wonder about performance in a database with millions of small
> relations, particularly considering my intent to back-patch this. In
> such databases, vac_update_datfrozenxid() can be a major part of a
> VACUUM's cost. Two things work in our favor. First,
> vac_update_datfrozenxid() runs once per VACUUM command, not once per
> relation. Second, autovacuum has this logic:
>
>      * ... we skip
>      * this if (1) we found no work to do and (2) we skipped at least one
>      * table due to concurrent autovacuum activity. In that case, the
>      * other worker has already done it, or will do so when it finishes.
>      */
>     if (did_vacuum || !found_concurrent_worker)
>         vac_update_datfrozenxid();
>
> That makes me relatively unworried. I did consider some alternatives:
>
> - Run vac_update_datfrozenxid()'s pg_class scan before taking a lock.
>   If we find the need for pg_database updates, take the lock and scan
>   pg_class again to get final numbers. This doubles the work in cases
>   that end up taking the lock, so I'm not betting on it being a net win.
>
> - Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
>   backend is already waiting. This is similar to autovacuum's
>   found_concurrent_worker test. It is tempting. I'm not proposing it,
>   because it changes the states possible when a manual VACUUM completes.
>   Today, you can predict post-VACUUM datfrozenxid from post-VACUUM
>   relfrozenxid values. If a manual VACUUM could skip
>   vac_update_datfrozenxid() this way, datfrozenxid could lag until some
>   concurrent VACUUM finishes.
Attachment
Noah Misch <noah@leadboat.com> writes:
> On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
>>> b. Arrange so only one backend runs vac_truncate_clog() at a time.
>>> Other than AsyncCtl, every SLRU truncation appears in
>>> vac_truncate_clog(), in a checkpoint, or in the startup process.
>>> Hence, also arrange for only one backend to call
>>> SimpleLruTruncate(AsyncCtl) at a time.

>> More specifically, restrict vac_update_datfrozenxid() to one backend per
>> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail()
>> to one backend per cluster. This, attached, was rather straightforward.

> Rebased. The conflicts were limited to comments and documentation.

I tried to review this, along with your adjacent patch to adjust the
segment-roundoff logic, but I didn't get far. I see the point that the
roundoff might create problems when we are close to hitting wraparound.
I do not, however, see why serializing vac_truncate_clog helps. I'm
pretty sure it was intentional that multiple backends could run
truncation directory scans concurrently, and I don't really want to give
that up if possible.

Also, if I understand the data-loss hazard properly, it's what you said
in the other thread: the latest_page_number could advance after we make
our decision about what to truncate, and then maybe we could truncate
newly-added data. We surely don't want to lock out the operations that
can advance latest_page_number, so how does serializing
vac_truncate_clog help?

I wonder whether what we need to do is add some state in shared memory
saying "it is only safe to create pages before here", and make
SimpleLruZeroPage or its callers check that. The truncation logic would
advance that value, but only after completing a scan. (Oh ..., hmm,
maybe the point of serializing truncations is to ensure that we know
that we can safely advance that?)

Can you post whatever script you used to detect/reproduce the problem in
the first place?

			regards, tom lane

PS: also, re-reading this code, it's worrisome that we are not checking
for failure of the unlink calls. I think the idea was that it didn't
really matter, because if we did re-use an existing file we'd just
re-zero it; hence failing the truncate is an overreaction. But probably
some comments about that are in order.
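To make that shared-memory idea concrete, a rough sketch follows; the
field name, check, and message are invented for illustration, and no
patch on this thread implements them:

    /* Hypothetical addition to SlruSharedData. */
    typedef struct SlruSharedData
    {
        /* ... existing fields, including latest_page_number ... */

        /*
         * First page that must not be created yet, because a truncation
         * scan may still treat it as removable. The truncation logic
         * would advance this only after its directory scan completes.
         */
        int         boundary_page_number;
    } SlruSharedData;

    int
    SimpleLruZeroPage(SlruCtl ctl, int pageno)
    {
        SlruShared  shared = ctl->shared;

        /* caller holds the SLRU ControlLock exclusively */
        if (!ctl->PagePrecedes(pageno, shared->boundary_page_number))
            elog(ERROR, "cannot create SLRU page %d at or past bound %d",
                 pageno, shared->boundary_page_number);

        /* ... existing logic: pick a victim slot, zero the page, and
         * advance latest_page_number ... */
    }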
On Mon, Jul 29, 2019 at 12:58:17PM -0400, Tom Lane wrote:
> > On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> >>> b. Arrange so only one backend runs vac_truncate_clog() at a time.
> >>> Other than AsyncCtl, every SLRU truncation appears in
> >>> vac_truncate_clog(), in a checkpoint, or in the startup process.
> >>> Hence, also arrange for only one backend to call
> >>> SimpleLruTruncate(AsyncCtl) at a time.
>
> >> More specifically, restrict vac_update_datfrozenxid() to one backend
> >> per database, and restrict vac_truncate_clog() and
> >> asyncQueueAdvanceTail() to one backend per cluster. This, attached,
> >> was rather straightforward.
>
> I'm pretty sure it was intentional that multiple backends could run
> truncation directory scans concurrently, and I don't really want to
> give that up if possible.

I want to avoid a design that requires painstaking concurrency analysis.
Such analysis is worth it for heap_update(), but vac_truncate_clog()
isn't enough of a hot path. If there's a way to make vac_truncate_clog()
easy to analyze and still somewhat concurrent, fair.

> Also, if I understand the data-loss hazard properly, it's what you said
> in the other thread: the latest_page_number could advance after we make
> our decision about what to truncate, and then maybe we could truncate
> newly-added data. We surely don't want to lock out the operations that
> can advance latest_page_number, so how does serializing
> vac_truncate_clog help?
>
> I wonder whether what we need to do is add some state in shared memory
> saying "it is only safe to create pages before here", and make
> SimpleLruZeroPage or its callers check that. The truncation logic would
> advance that value, but only after completing a scan. (Oh ..., hmm,
> maybe the point of serializing truncations is to ensure that we know
> that we can safely advance that?)

vac_truncate_clog() already records "it is only safe to create pages
before here" in ShmemVariableCache->xidWrapLimit, which it updates after
any unlinks. The trouble comes when two vac_truncate_clog() runs proceed
in parallel and you get a sequence of events like this:

  vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 1 unlinks segment ABCD
  vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
  vac_truncate_clog() instance 1 finishes
  some backend calls SimpleLruZeroPage(), creating segment ABCD
  vac_truncate_clog() instance 2 unlinks segment ABCD

Serializing vac_truncate_clog() fixes that.

> Can you post whatever script you used to detect/reproduce the problem
> in the first place?

I've attached it, but it's pretty hackish. Apply the patch on commit
7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
your home directory, and run "make trunc-check". This incorporates
xid-burn acceleration code that Jeff Janes shared with -hackers at some
point.

> PS: also, re-reading this code, it's worrisome that we are not checking
> for failure of the unlink calls. I think the idea was that it didn't
> really matter, because if we did re-use an existing file we'd just
> re-zero it; hence failing the truncate is an overreaction. But probably
> some comments about that are in order.

That's my understanding. We'll zero any page before reusing it. Failing
the whole vac_truncate_clog() (and therefore not advancing the wrap
limit) would do more harm than the bit of wasted disk space. Still, a
LOG or WARNING message would be fine, I think.
Thanks,
nm
Attachment
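For the unlink-failure point above, something along these lines in
slru.c's segment-removal path would keep the forgiving behavior while
leaving a trace (a sketch; the message wording and exact placement are
assumptions of this illustration):

    /*
     * Sketch: where SlruScanDirCbDeleteCutoff()'s helper removes a
     * segment file. Keep going on failure -- the page gets re-zeroed if
     * the file is ever reused -- but log what happened.
     */
    snprintf(path, MAXPGPATH, "%s/%s", ctl->Dir, filename);
    if (unlink(path) != 0)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));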
On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> vac_truncate_clog() instance 1 unlinks segment ABCD
> vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> vac_truncate_clog() instance 1 finishes
> some backend calls SimpleLruZeroPage(), creating segment ABCD
> vac_truncate_clog() instance 2 unlinks segment ABCD
>
> Serializing vac_truncate_clog() fixes that.

I've wondered before (in a -bugs thread[1] about unexplained pg_serial
wraparound warnings) if we could map 64-bit xids to wide SLRU file names
that never wrap around and make this class of problem go away.
Unfortunately, multixacts would need 64-bit support too...

[1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com
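As an illustration of what non-recycling names could look like (purely a
sketch; the type and function are invented here, and the name width is
arbitrary):

    /*
     * Hypothetical: name SLRU segments by a 64-bit segment number, so
     * names grow monotonically and are never reused, unlike today's
     * 4-hex-digit names that recycle after wraparound.
     */
    typedef uint64 SlruSegNumber;

    static void
    SlruFileName64(SlruCtl ctl, char *path, SlruSegNumber segno)
    {
        /* 16 hex digits cover the full 64-bit space, so no collisions */
        snprintf(path, MAXPGPATH, "%s/%016llX", ctl->Dir,
                 (unsigned long long) segno);
    }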
On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
> On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> > vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> > vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> > vac_truncate_clog() instance 1 unlinks segment ABCD
> > vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> > vac_truncate_clog() instance 1 finishes
> > some backend calls SimpleLruZeroPage(), creating segment ABCD
> > vac_truncate_clog() instance 2 unlinks segment ABCD
> >
> > Serializing vac_truncate_clog() fixes that.
>
> I've wondered before (in a -bugs thread[1] about unexplained pg_serial
> wraparound warnings) if we could map 64-bit xids to wide SLRU file
> names that never wrap around and make this class of problem go away.
> Unfortunately, multixacts would need 64-bit support too...
>
> [1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com

That change sounds good to me.
On Mon, Nov 04, 2019 at 03:43:09PM -0800, Noah Misch wrote:
> On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
> > On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> > > vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> > > vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> > > vac_truncate_clog() instance 1 unlinks segment ABCD
> > > vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> > > vac_truncate_clog() instance 1 finishes
> > > some backend calls SimpleLruZeroPage(), creating segment ABCD
> > > vac_truncate_clog() instance 2 unlinks segment ABCD
> > >
> > > Serializing vac_truncate_clog() fixes that.
> >
> > I've wondered before (in a -bugs thread[1] about unexplained pg_serial
> > wraparound warnings) if we could map 64-bit xids to wide SLRU file
> > names that never wrap around and make this class of problem go away.
> > Unfortunately, multixacts would need 64-bit support too...
> >
> > [1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com
>
> That change sounds good to me.

I do think $SUBJECT should move forward independently of that.
> On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:

Hi,

> > Also, if I understand the data-loss hazard properly, it's what you
> > said in the other thread: the latest_page_number could advance after
> > we make our decision about what to truncate, and then maybe we could
> > truncate newly-added data. We surely don't want to lock out the
> > operations that can advance latest_page_number, so how does
> > serializing vac_truncate_clog help?
> >
> > I wonder whether what we need to do is add some state in shared
> > memory saying "it is only safe to create pages before here", and make
> > SimpleLruZeroPage or its callers check that. The truncation logic
> > would advance that value, but only after completing a scan.
> > (Oh ..., hmm, maybe the point of serializing truncations is to ensure
> > that we know that we can safely advance that?)
>
> vac_truncate_clog() already records "it is only safe to create pages
> before here" in ShmemVariableCache->xidWrapLimit, which it updates
> after any unlinks. The trouble comes when two vac_truncate_clog() runs
> proceed in parallel and you get a sequence of events like this:
>
>   vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
>   vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
>   vac_truncate_clog() instance 1 unlinks segment ABCD
>   vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
>   vac_truncate_clog() instance 1 finishes
>   some backend calls SimpleLruZeroPage(), creating segment ABCD
>   vac_truncate_clog() instance 2 unlinks segment ABCD
>
> Serializing vac_truncate_clog() fixes that.

I'm probably missing something, so I just wanted to clarify. Do I
understand correctly that thread [1] and this one are independent, and
that the scenario above assumes we're at the end of the XID space, but
not necessarily having rounding issues? I'm a bit confused, since the
repro script does something about cutoffPage, and I'm not sure whether
that's important or not.

> > Can you post whatever script you used to detect/reproduce the problem
> > in the first place?
>
> I've attached it, but it's pretty hackish. Apply the patch on commit
> 7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
> your home directory, and run "make trunc-check". This incorporates
> xid-burn acceleration code that Jeff Janes shared with -hackers at some
> point.

I've tried to use it to understand the problem better, but somehow I
couldn't reproduce the issue via the suggested script. I applied it to
7170268 (and also tried a rebase onto master, with the same results)
with conf-test-pg in place, but after going through all the steps there
are no errors like:

    ERROR: could not access status of transaction 2149484247
    DETAIL: Could not open file "pg_xact/0801": No such file or directory.

Is there anything extra one needs to do to reproduce the problem, maybe
adjust delays or something?

[1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
On Sun, Nov 17, 2019 at 12:56:52PM +0100, Dmitry Dolgov wrote:
> > On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:
> > > Also, if I understand the data-loss hazard properly, it's what you
> > > said in the other thread: the latest_page_number could advance after
> > > we make our decision about what to truncate, and then maybe we could
> > > truncate newly-added data. We surely don't want to lock out the
> > > operations that can advance latest_page_number, so how does
> > > serializing vac_truncate_clog help?
> > >
> > > I wonder whether what we need to do is add some state in shared
> > > memory saying "it is only safe to create pages before here", and
> > > make SimpleLruZeroPage or its callers check that. The truncation
> > > logic would advance that value, but only after completing a scan.
> > > (Oh ..., hmm, maybe the point of serializing truncations is to
> > > ensure that we know that we can safely advance that?)
> >
> > vac_truncate_clog() already records "it is only safe to create pages
> > before here" in ShmemVariableCache->xidWrapLimit, which it updates
> > after any unlinks. The trouble comes when two vac_truncate_clog()
> > runs proceed in parallel and you get a sequence of events like this:
> >
> >   vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> >   vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> >   vac_truncate_clog() instance 1 unlinks segment ABCD
> >   vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> >   vac_truncate_clog() instance 1 finishes
> >   some backend calls SimpleLruZeroPage(), creating segment ABCD
> >   vac_truncate_clog() instance 2 unlinks segment ABCD
> >
> > Serializing vac_truncate_clog() fixes that.
>
> I'm probably missing something, so I just wanted to clarify. Do I
> understand correctly that thread [1] and this one are independent, and
> that the scenario above assumes we're at the end of the XID space, but
> not necessarily having rounding issues? I'm a bit confused, since the
> repro script does something about cutoffPage, and I'm not sure whether
> that's important or not.

I think the repro recipe contained an early fix for the other thread's
bug. While they're independent in principle, I likely never reproduced
this bug without having a fix in place for the other thread's bug. The
bug in the other thread was easier to hit.

> > > Can you post whatever script you used to detect/reproduce the
> > > problem in the first place?
> >
> > I've attached it, but it's pretty hackish. Apply the patch on commit
> > 7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
> > your home directory, and run "make trunc-check". This incorporates
> > xid-burn acceleration code that Jeff Janes shared with -hackers at
> > some point.
>
> I've tried to use it to understand the problem better, but somehow I
> couldn't reproduce the issue via the suggested script.

That is unfortunate.

> Is there anything extra one needs to do to reproduce the problem, maybe
> adjust delays or something?

It wouldn't surprise me. I did all my testing on one or two systems; the
hard-coded delays suffice there, but I'm sure there exist systems needing
different delays.

Though I did reproduce this bug, I'm motivated by the abstract problem
more than any particular way to reproduce it. Commit 996d273 inspired
me; by removing a GetCurrentTransactionId(), it allowed the global xmin
to advance at times it previously could not. That subtly changed the
concurrency possibilities. I think safe, parallel SimpleLruTruncate() is
difficult to maintain and helps too rarely to justify such maintenance.
That's why I propose eliminating the concurrency.

> [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
> On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
>
> Though I did reproduce this bug, I'm motivated by the abstract problem
> more than any particular way to reproduce it. Commit 996d273 inspired
> me; by removing a GetCurrentTransactionId(), it allowed the global xmin
> to advance at times it previously could not. That subtly changed the
> concurrency possibilities. I think safe, parallel SimpleLruTruncate()
> is difficult to maintain and helps too rarely to justify such
> maintenance. That's why I propose eliminating the concurrency.

Sure, I see the point and the possibility of the issue itself, but of
course it's easier to reason about an issue I can reproduce :)

> I wonder about performance in a database with millions of small
> relations, particularly considering my intent to back-patch this. In
> such databases, vac_update_datfrozenxid() can be a major part of a
> VACUUM's cost. Two things work in our favor. First,
> vac_update_datfrozenxid() runs once per VACUUM command, not once per
> relation. Second, autovacuum has this logic:
>
>      * ... we skip
>      * this if (1) we found no work to do and (2) we skipped at least one
>      * table due to concurrent autovacuum activity. In that case, the
>      * other worker has already done it, or will do so when it finishes.
>      */
>     if (did_vacuum || !found_concurrent_worker)
>         vac_update_datfrozenxid();
>
> That makes me relatively unworried. I did consider some alternatives:

Btw, I've performed a few experiments with parallel vacuuming of 10^4
small tables receiving some small inserts; the results look like this:

    # with patch
    # funclatency -u bin/postgres:vac_update_datfrozenxid
         usecs           : count     distribution
             0 -> 1      : 0        |                                        |
             2 -> 3      : 0        |                                        |
             4 -> 7      : 0        |                                        |
             8 -> 15     : 0        |                                        |
            16 -> 31     : 0        |                                        |
            32 -> 63     : 0        |                                        |
            64 -> 127    : 0        |                                        |
           128 -> 255    : 0        |                                        |
           256 -> 511    : 0        |                                        |
           512 -> 1023   : 3        |***                                     |
          1024 -> 2047   : 38       |****************************************|
          2048 -> 4095   : 15       |***************                         |
          4096 -> 8191   : 15       |***************                         |
          8192 -> 16383  : 2        |**                                      |

    # without patch
    # funclatency -u bin/postgres:vac_update_datfrozenxid
         usecs           : count     distribution
             0 -> 1      : 0        |                                        |
             2 -> 3      : 0        |                                        |
             4 -> 7      : 0        |                                        |
             8 -> 15     : 0        |                                        |
            16 -> 31     : 0        |                                        |
            32 -> 63     : 0        |                                        |
            64 -> 127    : 0        |                                        |
           128 -> 255    : 0        |                                        |
           256 -> 511    : 0        |                                        |
           512 -> 1023   : 5        |****                                    |
          1024 -> 2047   : 49       |****************************************|
          2048 -> 4095   : 11       |********                                |
          4096 -> 8191   : 5        |****                                    |
          8192 -> 16383  : 1        |                                        |

In general it seems that latency tends to be a bit higher with the
patch, but I don't think the difference is significant.
> On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
>
> > I'm probably missing something, so I just wanted to clarify. Do I
> > understand correctly that thread [1] and this one are independent,
> > and that the scenario above assumes we're at the end of the XID
> > space, but not necessarily having rounding issues? I'm a bit
> > confused, since the repro script does something about cutoffPage, and
> > I'm not sure whether that's important or not.
>
> I think the repro recipe contained an early fix for the other thread's
> bug. While they're independent in principle, I likely never reproduced
> this bug without having a fix in place for the other thread's bug. The
> bug in the other thread was easier to hit.

Just to clarify, since the CF item for this patch was withdrawn
recently: does it mean that eventually the thread [1] covers this one
too?

[1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
> > On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
> > > I'm probably missing something, so I just wanted to clarify. Do I
> > > understand correctly that thread [1] and this one are independent,
> > > and that the scenario above assumes we're at the end of the XID
> > > space, but not necessarily having rounding issues? I'm a bit
> > > confused, since the repro script does something about cutoffPage,
> > > and I'm not sure whether that's important or not.
> >
> > I think the repro recipe contained an early fix for the other
> > thread's bug. While they're independent in principle, I likely never
> > reproduced this bug without having a fix in place for the other
> > thread's bug. The bug in the other thread was easier to hit.
>
> Just to clarify, since the CF item for this patch was withdrawn
> recently: does it mean that eventually the thread [1] covers this one
> too?
>
> [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

I withdrew $SUBJECT because, if someone reviews one of my patches, I want
it to be the one you cite in [1]. I plan not to commit [1] without a
Ready for Committer, and I plan not to commit $SUBJECT before committing
[1]. I would be willing to commit $SUBJECT without getting a review,
however.
On Fri, Jun 28, 2019 at 10:06:28AM -0700, Noah Misch wrote:
> On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> > I'm forking this thread from
> > https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com,
> > which reported a race condition involving the "apparent wraparound"
> > safety check in SimpleLruTruncate():
> >
> > On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > > 1. The result of the test is valid only until we release the SLRU
> > > ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> > > cutoff to evaluate segments for deletion. Once we release that lock,
> > > latest_page_number can advance. This creates a TOCTOU race
> > > condition, allowing excess deletion:
> > >
> > > [local] test=# table trunc_clog_concurrency ;
> > > ERROR: could not access status of transaction 2149484247
> > > DETAIL: Could not open file "pg_xact/0801": No such file or directory.
> >
> > > Fixes are available:
> >
> > > b. Arrange so only one backend runs vac_truncate_clog() at a time.
> > > Other than AsyncCtl, every SLRU truncation appears in
> > > vac_truncate_clog(), in a checkpoint, or in the startup process.
> > > Hence, also arrange for only one backend to call
> > > SimpleLruTruncate(AsyncCtl) at a time.
> >
> > More specifically, restrict vac_update_datfrozenxid() to one backend
> > per database, and restrict vac_truncate_clog() and
> > asyncQueueAdvanceTail() to one backend per cluster. This, attached,
> > was rather straightforward.
>
> Rebased. The conflicts were limited to comments and documentation.

Rebased, most notably over the lock renames of commit 5da1493. In
LockTagTypeNames, I did s/frozen IDs/frozenid/ for consistency with that
commit having done s/speculative token/spectoken/.
Attachment
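For readers not opening the attachment, the lock-tag plumbing described
above amounts to something like this (a sketch reconstructed from the
description; the attachment is authoritative):

    /* New lock tag type, displayed as "frozenid" in pg_locks. */
    typedef enum LockTagType
    {
        LOCKTAG_RELATION,
        /* ... existing entries ... */
        LOCKTAG_DATABASE_FROZEN_IDS     /* pg_database.datfrozenxid */
    } LockTagType;

    const char *const LockTagTypeNames[] = {
        "relation",
        /* ... existing entries ... */
        "frozenid"
    };

    /* ID info for a frozen-IDs lock is: database OID */
    #define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag, dboid) \
        ((locktag).locktag_field1 = (dboid), \
         (locktag).locktag_field2 = 0, \
         (locktag).locktag_field3 = 0, \
         (locktag).locktag_field4 = 0, \
         (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
         (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)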
On Fri, Jan 31, 2020 at 12:42:13AM -0800, Noah Misch wrote:
> On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
> > Just to clarify, since the CF item for this patch was withdrawn
> > recently: does it mean that eventually the thread [1] covers this one
> > too?
> >
> > [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
>
> I withdrew $SUBJECT because, if someone reviews one of my patches, I
> want it to be the one you cite in [1]. I plan not to commit [1] without
> a Ready for Committer, and I plan not to commit $SUBJECT before
> committing [1]. I would be willing to commit $SUBJECT without getting a
> review, however.

After further reflection, I plan to push $SUBJECT shortly after
2020-08-13, not waiting for [1] to change status. Reasons:

- While I put significant weight on the fact that I couldn't reproduce
  $SUBJECT problems without first fixing [1], I'm no longer confident of
  that representing real-world experiences. Reproducing [1] absolutely
  requires a close approach to a wrap limit, but $SUBJECT problems might
  not.

- Adding locks won't change correct functional behavior to incorrect
  functional behavior.

- By pushing at that particular time, the fix ordinarily will appear in
  v13.0 before appearing in a back-branch release. If problematic
  contention arises quickly in the field, that's a more comfortable way
  to discover it.