Thread: SimpleLruTruncate() mutual exclusion
I'm forking this thread from
https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com, which
reported a race condition involving the "apparent wraparound" safety check
in SimpleLruTruncate():

On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> 1. The result of the test is valid only until we release the SLRU
> ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> cutoff to evaluate segments for deletion. Once we release that lock,
> latest_page_number can advance. This creates a TOCTOU race condition,
> allowing excess deletion:
>
> [local] test=# table trunc_clog_concurrency ;
> ERROR: could not access status of transaction 2149484247
> DETAIL: Could not open file "pg_xact/0801": No such file or directory.

> Fixes are available:

> b. Arrange so only one backend runs vac_truncate_clog() at a time. Other
> than AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> checkpoint, or in the startup process. Hence, also arrange for only one
> backend to call SimpleLruTruncate(AsyncCtl) at a time.

More specifically, restrict vac_update_datfrozenxid() to one backend per
database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to
one backend per cluster. This, attached, was rather straightforward.

I wonder about performance in a database with millions of small relations,
particularly considering my intent to back-patch this. In such databases,
vac_update_datfrozenxid() can be a major part of a VACUUM's cost. Two
things work in our favor. First, vac_update_datfrozenxid() runs once per
VACUUM command, not once per relation. Second, autovacuum has this logic:

     * ... we skip
     * this if (1) we found no work to do and (2) we skipped at least one
     * table due to concurrent autovacuum activity. In that case, the
     * other worker has already done it, or will do so when it finishes.
     */
    if (did_vacuum || !found_concurrent_worker)
        vac_update_datfrozenxid();

That makes me relatively unworried. I did consider some alternatives:

- Run vac_update_datfrozenxid()'s pg_class scan before taking a lock. If
  we find the need for pg_database updates, take the lock and scan
  pg_class again to get final numbers. This doubles the work in cases that
  end up taking the lock, so I'm not betting on it being a net win.

- Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
  backend is already waiting. This is similar to autovacuum's
  found_concurrent_worker test. It is tempting. I'm not proposing it,
  because it changes the states possible when a manual VACUUM completes.
  Today, you can predict post-VACUUM datfrozenxid from post-VACUUM
  relfrozenxid values. If a manual VACUUM could skip
  vac_update_datfrozenxid() this way, datfrozenxid could lag until some
  concurrent VACUUM finishes.

Thanks,
nm
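A minimal sketch of the shape of this approach follows; the identifier
names here are placeholders of mine, and the attached patch is
authoritative for the real names and exact placement:

    /*
     * Sketch only. One backend per database: take a heavyweight lock
     * keyed on the current database before computing new frozen IDs.
     */
    void
    vac_update_datfrozenxid(void)
    {
        LOCKTAG     tag;

        /* hypothetical tag; blocks concurrent updaters of this database */
        SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
        (void) LockAcquire(&tag, ExclusiveLock, false, false);

        /* ... existing pg_class scan and pg_database update ... */

        /* the lock is released automatically at transaction end */
    }

    /*
     * One backend per cluster: wrap the truncation decision and the
     * segment unlinks in a single cluster-wide LWLock.
     */
    static void
    vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti,
                      TransactionId lastSaneFrozenXid,
                      MultiXactId lastSaneMinMulti)
    {
        /* hypothetical LWLock serializing SLRU truncation */
        LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);

        /* ... scan pg_database for the minima, call TruncateCLOG() and
         * TruncateMultiXact(), then SetTransactionIdLimit() ... */

        LWLockRelease(WrapLimitsVacuumLock);
    }

asyncQueueAdvanceTail() would get the same one-per-cluster treatment with
its own lock.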
Attachment
On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> I'm forking this thread from
> https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com, which
> reported a race condition involving the "apparent wraparound" safety
> check in SimpleLruTruncate():
>
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > 1. The result of the test is valid only until we release the SLRU
> > ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> > cutoff to evaluate segments for deletion. Once we release that lock,
> > latest_page_number can advance. This creates a TOCTOU race condition,
> > allowing excess deletion:
> >
> > [local] test=# table trunc_clog_concurrency ;
> > ERROR: could not access status of transaction 2149484247
> > DETAIL: Could not open file "pg_xact/0801": No such file or directory.
>
> > Fixes are available:
>
> > b. Arrange so only one backend runs vac_truncate_clog() at a time.
> > Other than AsyncCtl, every SLRU truncation appears in
> > vac_truncate_clog(), in a checkpoint, or in the startup process.
> > Hence, also arrange for only one backend to call
> > SimpleLruTruncate(AsyncCtl) at a time.
>
> More specifically, restrict vac_update_datfrozenxid() to one backend per
> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail()
> to one backend per cluster. This, attached, was rather straightforward.

Rebased. The conflicts were limited to comments and documentation.

> I wonder about performance in a database with millions of small
> relations, particularly considering my intent to back-patch this. In
> such databases, vac_update_datfrozenxid() can be a major part of a
> VACUUM's cost. Two things work in our favor. First,
> vac_update_datfrozenxid() runs once per VACUUM command, not once per
> relation. Second, autovacuum has this logic:
>
>      * ... we skip
>      * this if (1) we found no work to do and (2) we skipped at least one
>      * table due to concurrent autovacuum activity. In that case, the
>      * other worker has already done it, or will do so when it finishes.
>      */
>     if (did_vacuum || !found_concurrent_worker)
>         vac_update_datfrozenxid();
>
> That makes me relatively unworried. I did consider some alternatives:
>
> - Run vac_update_datfrozenxid()'s pg_class scan before taking a lock.
>   If we find the need for pg_database updates, take the lock and scan
>   pg_class again to get final numbers. This doubles the work in cases
>   that end up taking the lock, so I'm not betting on it being a net win.
>
> - Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
>   backend is already waiting. This is similar to autovacuum's
>   found_concurrent_worker test. It is tempting. I'm not proposing it,
>   because it changes the states possible when a manual VACUUM completes.
>   Today, you can predict post-VACUUM datfrozenxid from post-VACUUM
>   relfrozenxid values. If a manual VACUUM could skip
>   vac_update_datfrozenxid() this way, datfrozenxid could lag until some
>   concurrent VACUUM finishes.
Attachment
Noah Misch <noah@leadboat.com> writes:
> On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
>>> b. Arrange so only one backend runs vac_truncate_clog() at a time.
>>> Other than AsyncCtl, every SLRU truncation appears in
>>> vac_truncate_clog(), in a checkpoint, or in the startup process.
>>> Hence, also arrange for only one backend to call
>>> SimpleLruTruncate(AsyncCtl) at a time.

>> More specifically, restrict vac_update_datfrozenxid() to one backend per
>> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail()
>> to one backend per cluster. This, attached, was rather straightforward.

> Rebased. The conflicts were limited to comments and documentation.

I tried to review this, along with your adjacent patch to adjust the
segment-roundoff logic, but I didn't get far. I see the point that the
roundoff might create problems when we are close to hitting wraparound.
I do not, however, see why serializing vac_truncate_clog helps. I'm
pretty sure it was intentional that multiple backends could run
truncation directory scans concurrently, and I don't really want to give
that up if possible.

Also, if I understand the data-loss hazard properly, it's what you said
in the other thread: the latest_page_number could advance after we make
our decision about what to truncate, and then maybe we could truncate
newly-added data. We surely don't want to lock out the operations that
can advance latest_page_number, so how does serializing
vac_truncate_clog help?

I wonder whether what we need to do is add some state in shared memory
saying "it is only safe to create pages before here", and make
SimpleLruZeroPage or its callers check that. The truncation logic would
advance that value, but only after completing a scan. (Oh ..., hmm,
maybe the point of serializing truncations is to ensure that we know
that we can safely advance that?)

Can you post whatever script you used to detect/reproduce the problem in
the first place?

			regards, tom lane

PS: also, re-reading this code, it's worrisome that we are not checking
for failure of the unlink calls. I think the idea was that it didn't
really matter, because if we did re-use an existing file we'd just
re-zero it; hence failing the truncate is an overreaction. But probably
some comments about that are in order.
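To make that shared-memory idea concrete, a rough sketch follows; the
field name, check, and message are invented for illustration, and no
patch on this thread implements them:

    /* Hypothetical addition to SlruSharedData. */
    typedef struct SlruSharedData
    {
        /* ... existing fields, including latest_page_number ... */

        /*
         * First page that must not be created yet, because a truncation
         * scan may still treat it as removable. The truncation logic
         * would advance this only after its directory scan completes.
         */
        int         boundary_page_number;
    } SlruSharedData;

    int
    SimpleLruZeroPage(SlruCtl ctl, int pageno)
    {
        SlruShared  shared = ctl->shared;

        /* caller holds the SLRU ControlLock exclusively */
        if (!ctl->PagePrecedes(pageno, shared->boundary_page_number))
            elog(ERROR, "cannot create SLRU page %d at or past bound %d",
                 pageno, shared->boundary_page_number);

        /* ... existing logic: pick a victim slot, zero the page, and
         * advance latest_page_number ... */
    }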
On Mon, Jul 29, 2019 at 12:58:17PM -0400, Tom Lane wrote:
> > On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> >>> b. Arrange so only one backend runs vac_truncate_clog() at a time.
> >>> Other than AsyncCtl, every SLRU truncation appears in
> >>> vac_truncate_clog(), in a checkpoint, or in the startup process.
> >>> Hence, also arrange for only one backend to call
> >>> SimpleLruTruncate(AsyncCtl) at a time.
>
> >> More specifically, restrict vac_update_datfrozenxid() to one backend
> >> per database, and restrict vac_truncate_clog() and
> >> asyncQueueAdvanceTail() to one backend per cluster. This, attached,
> >> was rather straightforward.
>
> I'm pretty sure it was intentional that multiple backends could run
> truncation directory scans concurrently, and I don't really want to
> give that up if possible.

I want to avoid a design that requires painstaking concurrency analysis.
Such analysis is worth it for heap_update(), but vac_truncate_clog()
isn't enough of a hot path. If there's a way to make vac_truncate_clog()
easy to analyze and still somewhat concurrent, fair.

> Also, if I understand the data-loss hazard properly, it's what you said
> in the other thread: the latest_page_number could advance after we make
> our decision about what to truncate, and then maybe we could truncate
> newly-added data. We surely don't want to lock out the operations that
> can advance latest_page_number, so how does serializing
> vac_truncate_clog help?
>
> I wonder whether what we need to do is add some state in shared memory
> saying "it is only safe to create pages before here", and make
> SimpleLruZeroPage or its callers check that. The truncation logic would
> advance that value, but only after completing a scan. (Oh ..., hmm,
> maybe the point of serializing truncations is to ensure that we know
> that we can safely advance that?)

vac_truncate_clog() already records "it is only safe to create pages
before here" in ShmemVariableCache->xidWrapLimit, which it updates after
any unlinks. The trouble comes when two vac_truncate_clog() runs proceed
in parallel and you get a sequence of events like this:

  vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
  vac_truncate_clog() instance 1 unlinks segment ABCD
  vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
  vac_truncate_clog() instance 1 finishes
  some backend calls SimpleLruZeroPage(), creating segment ABCD
  vac_truncate_clog() instance 2 unlinks segment ABCD

Serializing vac_truncate_clog() fixes that.

> Can you post whatever script you used to detect/reproduce the problem
> in the first place?

I've attached it, but it's pretty hackish. Apply the patch on commit
7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
your home directory, and run "make trunc-check". This incorporates
xid-burn acceleration code that Jeff Janes shared with -hackers at some
point.

> PS: also, re-reading this code, it's worrisome that we are not checking
> for failure of the unlink calls. I think the idea was that it didn't
> really matter, because if we did re-use an existing file we'd just
> re-zero it; hence failing the truncate is an overreaction. But probably
> some comments about that are in order.

That's my understanding. We'll zero any page before reusing it. Failing
the whole vac_truncate_clog() (and therefore not advancing the wrap
limit) would do more harm than the bit of wasted disk space. Still, a
LOG or WARNING message would be fine, I think.
Thanks,
nm
Attachment
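For the unlink-failure point above, something along these lines in
slru.c's segment-removal path would keep the forgiving behavior while
leaving a trace (a sketch; the message wording and exact placement are
assumptions of this illustration):

    /*
     * Sketch: where SlruScanDirCbDeleteCutoff()'s helper removes a
     * segment file. Keep going on failure -- the page gets re-zeroed if
     * the file is ever reused -- but log what happened.
     */
    snprintf(path, MAXPGPATH, "%s/%s", ctl->Dir, filename);
    if (unlink(path) != 0)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));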
On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> vac_truncate_clog() instance 1 unlinks segment ABCD
> vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> vac_truncate_clog() instance 1 finishes
> some backend calls SimpleLruZeroPage(), creating segment ABCD
> vac_truncate_clog() instance 2 unlinks segment ABCD
>
> Serializing vac_truncate_clog() fixes that.

I've wondered before (in a -bugs thread[1] about unexplained pg_serial
wraparound warnings) if we could map 64-bit xids to wide SLRU file names
that never wrap around and make this class of problem go away.
Unfortunately, multixacts would need 64-bit support too...

[1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com
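As an illustration of what non-recycling names could look like (purely a
sketch; the type and function are invented here, and the name width is
arbitrary):

    /*
     * Hypothetical: name SLRU segments by a 64-bit segment number, so
     * names grow monotonically and are never reused, unlike today's
     * 4-hex-digit names that recycle after wraparound.
     */
    typedef uint64 SlruSegNumber;

    static void
    SlruFileName64(SlruCtl ctl, char *path, SlruSegNumber segno)
    {
        /* 16 hex digits cover the full 64-bit space, so no collisions */
        snprintf(path, MAXPGPATH, "%s/%016llX", ctl->Dir,
                 (unsigned long long) segno);
    }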
On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
> On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> > vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> > vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> > vac_truncate_clog() instance 1 unlinks segment ABCD
> > vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> > vac_truncate_clog() instance 1 finishes
> > some backend calls SimpleLruZeroPage(), creating segment ABCD
> > vac_truncate_clog() instance 2 unlinks segment ABCD
> >
> > Serializing vac_truncate_clog() fixes that.
>
> I've wondered before (in a -bugs thread[1] about unexplained pg_serial
> wraparound warnings) if we could map 64-bit xids to wide SLRU file
> names that never wrap around and make this class of problem go away.
> Unfortunately, multixacts would need 64-bit support too...
>
> [1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com

That change sounds good to me.
On Mon, Nov 04, 2019 at 03:43:09PM -0800, Noah Misch wrote:
> On Mon, Nov 04, 2019 at 03:26:35PM +1300, Thomas Munro wrote:
> > On Thu, Aug 1, 2019 at 6:51 PM Noah Misch <noah@leadboat.com> wrote:
> > > vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> > > vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> > > vac_truncate_clog() instance 1 unlinks segment ABCD
> > > vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> > > vac_truncate_clog() instance 1 finishes
> > > some backend calls SimpleLruZeroPage(), creating segment ABCD
> > > vac_truncate_clog() instance 2 unlinks segment ABCD
> > >
> > > Serializing vac_truncate_clog() fixes that.
> >
> > I've wondered before (in a -bugs thread[1] about unexplained pg_serial
> > wraparound warnings) if we could map 64-bit xids to wide SLRU file
> > names that never wrap around and make this class of problem go away.
> > Unfortunately, multixacts would need 64-bit support too...
> >
> > [1] https://www.postgresql.org/message-id/flat/CAEBTBzuS-01t12GGVD6qCezce8EFD8aZ1V%2Bo_3BZ%3DbuVLQBtRg%40mail.gmail.com
>
> That change sounds good to me.

I do think $SUBJECT should move forward independently of that.
> On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:

Hi,

> > Also, if I understand the data-loss hazard properly, it's what you
> > said in the other thread: the latest_page_number could advance after
> > we make our decision about what to truncate, and then maybe we could
> > truncate newly-added data. We surely don't want to lock out the
> > operations that can advance latest_page_number, so how does
> > serializing vac_truncate_clog help?
> >
> > I wonder whether what we need to do is add some state in shared
> > memory saying "it is only safe to create pages before here", and make
> > SimpleLruZeroPage or its callers check that. The truncation logic
> > would advance that value, but only after completing a scan.
> > (Oh ..., hmm, maybe the point of serializing truncations is to ensure
> > that we know that we can safely advance that?)
>
> vac_truncate_clog() already records "it is only safe to create pages
> before here" in ShmemVariableCache->xidWrapLimit, which it updates
> after any unlinks. The trouble comes when two vac_truncate_clog() runs
> proceed in parallel and you get a sequence of events like this:
>
>   vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
>   vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
>   vac_truncate_clog() instance 1 unlinks segment ABCD
>   vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
>   vac_truncate_clog() instance 1 finishes
>   some backend calls SimpleLruZeroPage(), creating segment ABCD
>   vac_truncate_clog() instance 2 unlinks segment ABCD
>
> Serializing vac_truncate_clog() fixes that.

I'm probably missing something, so I just wanted to clarify. Do I
understand correctly that thread [1] and this one are independent, and
that the scenario above assumes we're at the end of the XID space, but
not necessarily having rounding issues? I'm a bit confused, since the
repro script does something about cutoffPage, and I'm not sure whether
that's important or not.

> > Can you post whatever script you used to detect/reproduce the problem
> > in the first place?
>
> I've attached it, but it's pretty hackish. Apply the patch on commit
> 7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
> your home directory, and run "make trunc-check". This incorporates
> xid-burn acceleration code that Jeff Janes shared with -hackers at some
> point.

I've tried to use it to understand the problem better, but somehow I
couldn't reproduce the issue via the suggested script. I applied it to
7170268 (and also tried a rebase onto master, with the same results)
with conf-test-pg in place, but after going through all the steps there
are no errors like:

    ERROR: could not access status of transaction 2149484247
    DETAIL: Could not open file "pg_xact/0801": No such file or directory.

Is there anything extra one needs to do to reproduce the problem, maybe
adjust delays or something?

[1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
On Sun, Nov 17, 2019 at 12:56:52PM +0100, Dmitry Dolgov wrote:
> > On Wed, Jul 31, 2019 at 11:51:17PM -0700, Noah Misch wrote:
> > > Also, if I understand the data-loss hazard properly, it's what you
> > > said in the other thread: the latest_page_number could advance after
> > > we make our decision about what to truncate, and then maybe we could
> > > truncate newly-added data. We surely don't want to lock out the
> > > operations that can advance latest_page_number, so how does
> > > serializing vac_truncate_clog help?
> > >
> > > I wonder whether what we need to do is add some state in shared
> > > memory saying "it is only safe to create pages before here", and
> > > make SimpleLruZeroPage or its callers check that. The truncation
> > > logic would advance that value, but only after completing a scan.
> > > (Oh ..., hmm, maybe the point of serializing truncations is to
> > > ensure that we know that we can safely advance that?)
> >
> > vac_truncate_clog() already records "it is only safe to create pages
> > before here" in ShmemVariableCache->xidWrapLimit, which it updates
> > after any unlinks. The trouble comes when two vac_truncate_clog()
> > runs proceed in parallel and you get a sequence of events like this:
> >
> >   vac_truncate_clog() instance 1 starts, considers segment ABCD eligible to unlink
> >   vac_truncate_clog() instance 2 starts, considers segment ABCD eligible to unlink
> >   vac_truncate_clog() instance 1 unlinks segment ABCD
> >   vac_truncate_clog() instance 1 calls SetTransactionIdLimit()
> >   vac_truncate_clog() instance 1 finishes
> >   some backend calls SimpleLruZeroPage(), creating segment ABCD
> >   vac_truncate_clog() instance 2 unlinks segment ABCD
> >
> > Serializing vac_truncate_clog() fixes that.
>
> I'm probably missing something, so I just wanted to clarify. Do I
> understand correctly that thread [1] and this one are independent, and
> that the scenario above assumes we're at the end of the XID space, but
> not necessarily having rounding issues? I'm a bit confused, since the
> repro script does something about cutoffPage, and I'm not sure whether
> that's important or not.

I think the repro recipe contained an early fix for the other thread's
bug. While they're independent in principle, I likely never reproduced
this bug without having a fix in place for the other thread's bug. The
bug in the other thread was easier to hit.

> > > Can you post whatever script you used to detect/reproduce the
> > > problem in the first place?
> >
> > I've attached it, but it's pretty hackish. Apply the patch on commit
> > 7170268, make sure your --bindir is in $PATH, copy "conf-test-pg" to
> > your home directory, and run "make trunc-check". This incorporates
> > xid-burn acceleration code that Jeff Janes shared with -hackers at
> > some point.
>
> I've tried to use it to understand the problem better, but somehow I
> couldn't reproduce the issue via the suggested script.

That is unfortunate.

> Is there anything extra one needs to do to reproduce the problem, maybe
> adjust delays or something?

It wouldn't surprise me. I did all my testing on one or two systems; the
hard-coded delays suffice there, but I'm sure there exist systems needing
different delays.

Though I did reproduce this bug, I'm motivated by the abstract problem
more than any particular way to reproduce it. Commit 996d273 inspired
me; by removing a GetCurrentTransactionId(), it allowed the global xmin
to advance at times it previously could not. That subtly changed the
concurrency possibilities. I think safe, parallel SimpleLruTruncate() is
difficult to maintain and helps too rarely to justify such maintenance.
That's why I propose eliminating the concurrency.

> [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
> On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
>
> Though I did reproduce this bug, I'm motivated by the abstract problem
> more than any particular way to reproduce it. Commit 996d273 inspired
> me; by removing a GetCurrentTransactionId(), it allowed the global xmin
> to advance at times it previously could not. That subtly changed the
> concurrency possibilities. I think safe, parallel SimpleLruTruncate()
> is difficult to maintain and helps too rarely to justify such
> maintenance. That's why I propose eliminating the concurrency.

Sure, I see the point and the possibility of the issue itself, but of
course it's easier to reason about an issue I can reproduce :)

> I wonder about performance in a database with millions of small
> relations, particularly considering my intent to back-patch this. In
> such databases, vac_update_datfrozenxid() can be a major part of a
> VACUUM's cost. Two things work in our favor. First,
> vac_update_datfrozenxid() runs once per VACUUM command, not once per
> relation. Second, autovacuum has this logic:
>
>      * ... we skip
>      * this if (1) we found no work to do and (2) we skipped at least one
>      * table due to concurrent autovacuum activity. In that case, the
>      * other worker has already done it, or will do so when it finishes.
>      */
>     if (did_vacuum || !found_concurrent_worker)
>         vac_update_datfrozenxid();
>
> That makes me relatively unworried. I did consider some alternatives:

Btw, I've performed a few experiments with parallel vacuuming of 10^4
small tables receiving some small inserts; the results look like this:

    # with patch
    # funclatency -u bin/postgres:vac_update_datfrozenxid
         usecs           : count     distribution
             0 -> 1      : 0        |                                        |
             2 -> 3      : 0        |                                        |
             4 -> 7      : 0        |                                        |
             8 -> 15     : 0        |                                        |
            16 -> 31     : 0        |                                        |
            32 -> 63     : 0        |                                        |
            64 -> 127    : 0        |                                        |
           128 -> 255    : 0        |                                        |
           256 -> 511    : 0        |                                        |
           512 -> 1023   : 3        |***                                     |
          1024 -> 2047   : 38       |****************************************|
          2048 -> 4095   : 15       |***************                         |
          4096 -> 8191   : 15       |***************                         |
          8192 -> 16383  : 2        |**                                      |

    # without patch
    # funclatency -u bin/postgres:vac_update_datfrozenxid
         usecs           : count     distribution
             0 -> 1      : 0        |                                        |
             2 -> 3      : 0        |                                        |
             4 -> 7      : 0        |                                        |
             8 -> 15     : 0        |                                        |
            16 -> 31     : 0        |                                        |
            32 -> 63     : 0        |                                        |
            64 -> 127    : 0        |                                        |
           128 -> 255    : 0        |                                        |
           256 -> 511    : 0        |                                        |
           512 -> 1023   : 5        |****                                    |
          1024 -> 2047   : 49       |****************************************|
          2048 -> 4095   : 11       |********                                |
          4096 -> 8191   : 5        |****                                    |
          8192 -> 16383  : 1        |                                        |

In general it seems that latency tends to be a bit higher with the
patch, but I don't think the difference is significant.
> On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
>
> > I'm probably missing something, so I just wanted to clarify. Do I
> > understand correctly that thread [1] and this one are independent,
> > and that the scenario above assumes we're at the end of the XID
> > space, but not necessarily having rounding issues? I'm a bit
> > confused, since the repro script does something about cutoffPage, and
> > I'm not sure whether that's important or not.
>
> I think the repro recipe contained an early fix for the other thread's
> bug. While they're independent in principle, I likely never reproduced
> this bug without having a fix in place for the other thread's bug. The
> bug in the other thread was easier to hit.

Just to clarify, since the CF item for this patch was withdrawn
recently: does it mean that eventually the thread [1] covers this one
too?

[1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
> > On Sun, Nov 17, 2019 at 10:14:26PM -0800, Noah Misch wrote:
> > > I'm probably missing something, so I just wanted to clarify. Do I
> > > understand correctly that thread [1] and this one are independent,
> > > and that the scenario above assumes we're at the end of the XID
> > > space, but not necessarily having rounding issues? I'm a bit
> > > confused, since the repro script does something about cutoffPage,
> > > and I'm not sure whether that's important or not.
> >
> > I think the repro recipe contained an early fix for the other
> > thread's bug. While they're independent in principle, I likely never
> > reproduced this bug without having a fix in place for the other
> > thread's bug. The bug in the other thread was easier to hit.
>
> Just to clarify, since the CF item for this patch was withdrawn
> recently: does it mean that eventually the thread [1] covers this one
> too?
>
> [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

I withdrew $SUBJECT because, if someone reviews one of my patches, I want
it to be the one you cite in [1]. I plan not to commit [1] without a
Ready for Committer, and I plan not to commit $SUBJECT before committing
[1]. I would be willing to commit $SUBJECT without getting a review,
however.
On Fri, Jun 28, 2019 at 10:06:28AM -0700, Noah Misch wrote:
> On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> > I'm forking this thread from
> > https://postgr.es/m/flat/20190202083822.GC32531@gust.leadboat.com,
> > which reported a race condition involving the "apparent wraparound"
> > safety check in SimpleLruTruncate():
> >
> > On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > > 1. The result of the test is valid only until we release the SLRU
> > > ControlLock, which we do before SlruScanDirCbDeleteCutoff() uses the
> > > cutoff to evaluate segments for deletion. Once we release that lock,
> > > latest_page_number can advance. This creates a TOCTOU race
> > > condition, allowing excess deletion:
> > >
> > > [local] test=# table trunc_clog_concurrency ;
> > > ERROR: could not access status of transaction 2149484247
> > > DETAIL: Could not open file "pg_xact/0801": No such file or directory.
> >
> > > Fixes are available:
> >
> > > b. Arrange so only one backend runs vac_truncate_clog() at a time.
> > > Other than AsyncCtl, every SLRU truncation appears in
> > > vac_truncate_clog(), in a checkpoint, or in the startup process.
> > > Hence, also arrange for only one backend to call
> > > SimpleLruTruncate(AsyncCtl) at a time.
> >
> > More specifically, restrict vac_update_datfrozenxid() to one backend
> > per database, and restrict vac_truncate_clog() and
> > asyncQueueAdvanceTail() to one backend per cluster. This, attached,
> > was rather straightforward.
>
> Rebased. The conflicts were limited to comments and documentation.

Rebased, most notably over the lock renames of commit 5da1493. In
LockTagTypeNames, I did s/frozen IDs/frozenid/ for consistency with that
commit having done s/speculative token/spectoken/.
Attachment
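For readers not opening the attachment, the lock-tag plumbing described
above amounts to something like this (a sketch reconstructed from the
description; the attachment is authoritative):

    /* New lock tag type, displayed as "frozenid" in pg_locks. */
    typedef enum LockTagType
    {
        LOCKTAG_RELATION,
        /* ... existing entries ... */
        LOCKTAG_DATABASE_FROZEN_IDS     /* pg_database.datfrozenxid */
    } LockTagType;

    const char *const LockTagTypeNames[] = {
        "relation",
        /* ... existing entries ... */
        "frozenid"
    };

    /* ID info for a frozen-IDs lock is: database OID */
    #define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag, dboid) \
        ((locktag).locktag_field1 = (dboid), \
         (locktag).locktag_field2 = 0, \
         (locktag).locktag_field3 = 0, \
         (locktag).locktag_field4 = 0, \
         (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
         (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)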
On Fri, Jan 31, 2020 at 12:42:13AM -0800, Noah Misch wrote:
> On Thu, Jan 30, 2020 at 04:34:33PM +0100, Dmitry Dolgov wrote:
> > Just to clarify, since the CF item for this patch was withdrawn
> > recently: does it mean that eventually the thread [1] covers this one
> > too?
> >
> > [1]: https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com
>
> I withdrew $SUBJECT because, if someone reviews one of my patches, I
> want it to be the one you cite in [1]. I plan not to commit [1] without
> a Ready for Committer, and I plan not to commit $SUBJECT before
> committing [1]. I would be willing to commit $SUBJECT without getting a
> review, however.

After further reflection, I plan to push $SUBJECT shortly after
2020-08-13, not waiting for [1] to change status. Reasons:

- While I put significant weight on the fact that I couldn't reproduce
  $SUBJECT problems without first fixing [1], I'm no longer confident of
  that representing real-world experiences. Reproducing [1] absolutely
  requires a close approach to a wrap limit, but $SUBJECT problems might
  not.

- Adding locks won't change correct functional behavior to incorrect
  functional behavior.

- By pushing at that particular time, the fix ordinarily will appear in
  v13.0 before appearing in a back-branch release. If problematic
  contention arises quickly in the field, that's a more comfortable way
  to discover it.