Re: SimpleLruTruncate() mutual exclusion - Mailing list pgsql-hackers
| From | Noah Misch |
| --- | --- |
| Subject | Re: SimpleLruTruncate() mutual exclusion |
| Date | |
| Msg-id | 20190628170628.GA1238361@rfd.leadboat.com |
| In response to | SimpleLruTruncate() mutual exclusion (Noah Misch <noah@leadboat.com>) |
| Responses | Re: SimpleLruTruncate() mutual exclusion; Re: SimpleLruTruncate() mutual exclusion |
| List | pgsql-hackers |
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> I'm forking this thread from
> https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com, which
> reported a race condition involving the "apparent wraparound" safety check in
> SimpleLruTruncate():
>
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > 1. The result of the test is valid only until we release the SLRU ControlLock,
> > which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
> > segments for deletion. Once we release that lock, latest_page_number can
> > advance. This creates a TOCTOU race condition, allowing excess deletion:
> >
> > [local] test=# table trunc_clog_concurrency ;
> > ERROR: could not access status of transaction 2149484247
> > DETAIL: Could not open file "pg_xact/0801": No such file or directory.
>
> Fixes are available:
>
> b. Arrange so only one backend runs vac_truncate_clog() at a time. Other than
>    AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
>    checkpoint, or in the startup process. Hence, also arrange for only one
>    backend to call SimpleLruTruncate(AsyncCtl) at a time.
>
> More specifically, restrict vac_update_datfrozenxid() to one backend per
> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
> backend per cluster. This, attached, was rather straightforward.

Rebased. The conflicts were limited to comments and documentation.

> I wonder about performance in a database with millions of small relations,
> particularly considering my intent to back-patch this. In such databases,
> vac_update_datfrozenxid() can be a major part of the VACUUM's cost. Two
> things work in our favor. First, vac_update_datfrozenxid() runs once per
> VACUUM command, not once per relation. Second, autovacuum has this logic:
>
>      * ... we skip
>      * this if (1) we found no work to do and (2) we skipped at least one
>      * table due to concurrent autovacuum activity. In that case, the other
>      * worker has already done it, or will do so when it finishes.
>      */
>     if (did_vacuum || !found_concurrent_worker)
>         vac_update_datfrozenxid();
>
> That makes me relatively unworried. I did consider some alternatives:
>
> - Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If we
>   find the need for pg_database updates, take the lock and scan pg_class again
>   to get final numbers. This doubles the work in cases that end up taking the
>   lock, so I'm not betting on it being a net win.
>
> - Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
>   backend is already waiting. This is similar to autovacuum's
>   found_concurrent_worker test. It is tempting. I'm not proposing it,
>   because it changes the states possible when manual VACUUM completes. Today,
>   you can predict post-VACUUM datfrozenxid from post-VACUUM relfrozenxid
>   values. If manual VACUUM could skip vac_update_datfrozenxid() this way,
>   datfrozenxid could lag until some concurrent VACUUM finishes.
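To make the proposed serialization concrete, here is a minimal sketch of one way to restrict vac_update_datfrozenxid() to one backend per database using only PostgreSQL's generic heavyweight-lock API (LockAcquire over an object locktag). It is an illustration under assumptions, not the attached patch: the locktag choice (tagging the current database's pg_database row) and the function name are hypothetical.

```c
/*
 * Illustrative sketch only, not the attached patch: serialize
 * vac_update_datfrozenxid() to one backend per database by taking an
 * exclusive heavyweight lock before the pg_class scan.  The locktag
 * choice and the function name here are assumptions for this example.
 */
#include "postgres.h"

#include "catalog/pg_database.h"
#include "miscadmin.h"
#include "storage/lock.h"

static void
vac_update_datfrozenxid_serialized(void)
{
	LOCKTAG		tag;

	/* Use the current database's pg_database row as the lock object. */
	SET_LOCKTAG_OBJECT(tag, MyDatabaseId, DatabaseRelationId,
					   MyDatabaseId, 0);

	/*
	 * Only one backend per database gets past this point at a time.  The
	 * LockWaiterCount() alternative from the last bullet above would
	 * instead check this tag for queued waiters and skip the update.
	 */
	(void) LockAcquire(&tag, ExclusiveLock, false, false);

	/*
	 * ... scan pg_class, compute new datfrozenxid/datminmxid, and update
	 * the pg_database row here ...
	 */

	LockRelease(&tag, ExclusiveLock, false);
}
```

The cluster-wide cases (vac_truncate_clog() and asyncQueueAdvanceTail()) could presumably be serialized the same way with a single cluster-wide locktag rather than a per-database one.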
Attachment