Thread: USE_BARRIER_SMGRRELEASE on Linux?
I've noticed that my WAL pre-allocation patches [0] routinely fail with
"tablespace is not empty" errors on Linux [1]:

    DROP TABLESPACE regress_tblspace_renamed;
    +ERROR: tablespace "regress_tblspace_renamed" is not empty

This appears to have been discussed elsewhere (and fixed) for Windows
[2] [3] [4].  However, I'm able to reliably reproduce the issue on Linux
when the WAL pre-allocation patches are applied.  Adding a short sleep
before dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems
to clear up the issue.

My current thinking is that the performance boost from pre-allocating WAL
is revealing this issue, but I'm not 100% certain it's not due to a bug in
my code.  I'm continuing to investigate.

[0] https://commitfest.postgresql.org/37/3158/
[1] https://api.cirrus-ci.com/v1/artifact/task/4678319084929024/log/src/test/regress/regression.diffs
[2] https://postgr.es/m/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4eb2176

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi Nathan,

On Wed, Feb 16, 2022 at 12:58 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've noticed that my WAL pre-allocation patches [0] routinely fail with
> "tablespace is not empty" errors on Linux [1]:
>
>     DROP TABLESPACE regress_tblspace_renamed;
>     +ERROR: tablespace "regress_tblspace_renamed" is not empty
>
> This appears to have been discussed elsewhere (and fixed) for Windows
> [2] [3] [4].  However, I'm able to reliably reproduce the issue on Linux
> when the WAL pre-allocation patches are applied.  Adding a short sleep
> before dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems
> to clear up the issue.
>
> My current thinking is that the performance boost from pre-allocating WAL
> is revealing this issue, but I'm not 100% certain it's not due to a bug in
> my code.  I'm continuing to investigate.

Huh.  I applied your patches and saw no failure on my local Linux hacking
station.  I noticed that it failed a couple of times recently on Linux CI,
as you mentioned:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3158

But it didn't fail before (sorry, cfbot branch names needlessly include
the commitfest number, which is stupid and breaks the history; I'll fix
that one day):

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3158

It also hasn't failed on those other Unices (FreeBSD, macOS).

I pushed a hack to log the name of the file that's getting in the way:

2022-02-16 01:10:54.180 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 01:10:54.180 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] STATEMENT: DROP TABLESPACE regress_tblspace_renamed;
2022-02-16 01:10:54.187 UTC [21880][checkpointer] LOG: checkpoint starting: immediate force wait
2022-02-16 01:10:54.188 UTC [21880][checkpointer] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] STATEMENT: DROP TABLESPACE regress_tblspace_renamed;
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] ERROR: tablespace "regress_tblspace_renamed" is not empty

So we can see that the checkpoint didn't get rid of (presumed) tombstone
file 16748.  At least in unpatched PostgreSQL, this shouldn't be possible
if 16748 is on the pending delete list, since CreateCheckPoint() runs
SyncPostCheckpoint() before returning, and only after that do we kick the
condition variable and release the backend.  And USE_BARRIER_SMGRRELEASE
shouldn't really make a difference on a Unix, since all it does is close
file descriptors; it doesn't unlink anything.

I tried adding another log line so I could see the unlink() happening,
but with that it doesn't fail (though I'm still trying).  We may be in
heisenbug territory.

A conspiracy theorist with a siege mentality might wonder if Linux has
developed a name cache invalidation bug, so we're seeing a slightly
out-of-date directory listing (hmm, that CI image's kernel did go from
5.10.84-1 to 5.10.92-1 ~24 days ago), but I don't really understand what
your patch has to do with that...  Hrmph.
On Wed, Feb 16, 2022 at 3:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I tried adding another log line so I could see the unlink() happening,
> but with that it doesn't fail (though I'm still trying).  We may be in
> heisenbug territory.

This time I was luckier:

https://api.cirrus-ci.com/v1/artifact/task/4915815005814784/log/src/test/regress/log/postmaster.log

Interesting lines:

2022-02-16 02:08:50.933 UTC [14496][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 02:08:50.935 UTC [14089][checkpointer] LOG: checkpoint starting: immediate force wait
2022-02-16 02:08:50.937 UTC [14089][checkpointer] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 02:08:50.937 UTC [14496][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
...
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG: shutting down
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG: checkpoint starting: shutdown immediate
...
2022-02-16 02:10:10.080 UTC [14089][checkpointer] LOG: XXXX mdunlinkfiletag "pg_tblspc/16501/PG_15_202202141/16384/16748"

So it's not getting unlinked until the *next* checkpoint cycle.  I don't
know why.
Hi,

On 2022-02-16 15:09:15 +1300, Thomas Munro wrote:
> I pushed a hack to log the name of the file that's getting in the way:

FWIW, I think we should just have that in core.  Right now it's just
pointlessly hard to debug.  I think just reporting the first file would
be good enough...

Greetings,

Andres Freund
On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

On my machine (5.11.0-43), it looks like the test starts failing after
cc50080.  That commit adjusted some regression tests, so I'm assuming it's
not the root cause.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-02-16 15:34:08 +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

It might be helpful to know what the relfilenode maps to, so we know the
operations done to it.  Perhaps logging in heap_create() and
RelationSetNewRelfilenode() would do the trick?

Greetings,

Andres Freund
Hi,

On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> > So it's not getting unlinked until the *next* checkpoint cycle.  I
> > don't know why.
>
> On my machine (5.11.0-43), it looks like the test starts failing after
> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
> not the root cause.

That makes some sense.  Before that commit, we didn't create database
objects before tablespace.sql.  A scheduling change like that could make
some pre-existing issues much more visible.

I think we also used to see the Windows drop-tablespace issues less
commonly than we have recently (outside the aio tree, at least)...

Greetings,

Andres Freund
On Tue, Feb 15, 2022 at 09:06:45PM -0800, Andres Freund wrote:
> On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
>> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
>> > So it's not getting unlinked until the *next* checkpoint cycle.  I
>> > don't know why.
>>
>> On my machine (5.11.0-43), it looks like the test starts failing after
>> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
>> not the root cause.
>
> That makes some sense.  Before that commit, we didn't create database
> objects before tablespace.sql.  A scheduling change like that could make
> some pre-existing issues much more visible.
>
> I think we also used to see the Windows drop-tablespace issues less
> commonly than we have recently (outside the aio tree, at least)...

It looks like register_unlink_segment() is called prior to the checkpoint,
but the checkpointer does not call RememberSyncRequest() until after
SyncPreCheckpoint().  This means that the requests are registered with the
next checkpoint cycle count, so they aren't processed until the next
checkpoint.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
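To make that timing concrete, here is a minimal, self-contained sketch of
the interaction (a simplified model only, not PostgreSQL source; all of
the names below are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model: an unlink request is only stamped with a cycle
 * number when the "checkpointer" absorbs it from the shared queue. */
typedef struct
{
    const char *path;
    uint64_t    stamped_cycle;
} PendingUnlink;

static uint64_t cycle_ctr = 0;      /* advanced at the start of each checkpoint */

static void absorb(PendingUnlink *req)
{
    req->stamped_cycle = cycle_ctr; /* too late if the counter was already bumped */
}

static bool unlink_this_cycle(const PendingUnlink *req)
{
    return req->stamped_cycle < cycle_ctr;  /* only older-cycle entries qualify */
}

int main(void)
{
    PendingUnlink req = { "pg_tblspc/.../16748", 0 };

    /* The backend queues the unlink and then requests a checkpoint, but
     * the checkpointer advances the cycle counter *before* absorbing: */
    cycle_ctr++;                    /* cf. SyncPreCheckpoint() */
    absorb(&req);                   /* cf. AbsorbSyncRequests()/RememberSyncRequest() */

    /* The entry carries the new cycle number, so this checkpoint skips it. */
    printf("unlinked by this checkpoint? %s\n",
           unlink_this_cycle(&req) ? "yes" : "no");   /* prints "no" */
    return 0;
}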
On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
> It looks like register_unlink_segment() is called prior to the checkpoint,
> but the checkpointer does not call RememberSyncRequest() until after
> SyncPreCheckpoint().  This means that the requests are registered with the
> next checkpoint cycle count, so they aren't processed until the next
> checkpoint.

Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
out of the critical section in CreateCheckPoint().  Patch attached.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
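Under the same simplified model, the reordering described here looks
roughly like the following (a sketch of the idea only, not the attached
patch; the names are stand-ins for the real routines):

#include <stdint.h>

static uint64_t cycle_ctr;                 /* illustrative checkpoint cycle counter */

static void absorb_pending_requests(void)
{
    /* stand-in for AbsorbSyncRequests(): stamp queued requests with the
     * cycle counter as it is right now */
}

/* Drain already-queued requests while the old cycle number is still
 * current, then start the new cycle; only requests arriving after this
 * point are deferred to the following checkpoint. */
static void pre_checkpoint_sketch(void)
{
    absorb_pending_requests();
    cycle_ctr++;                           /* stand-in for SyncPreCheckpoint() */
}

int main(void)
{
    pre_checkpoint_sketch();
    return 0;
}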
On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>> It looks like register_unlink_segment() is called prior to the checkpoint,
>> but the checkpointer does not call RememberSyncRequest() until after
>> SyncPreCheckpoint().  This means that the requests are registered with the
>> next checkpoint cycle count, so they aren't processed until the next
>> checkpoint.
>
> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
> out of the critical section in CreateCheckPoint().  Patch attached.

An alternative fix might be to call AbsorbSyncRequests() after increasing
the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
just before checkpointing where new requests are registered for the
checkpoint following the one about to begin.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 16, 2022 at 08:44:42AM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
>> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>>> It looks like register_unlink_segment() is called prior to the checkpoint,
>>> but the checkpointer does not call RememberSyncRequest() until after
>>> SyncPreCheckpoint().  This means that the requests are registered with the
>>> next checkpoint cycle count, so they aren't processed until the next
>>> checkpoint.
>>
>> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
>> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
>> out of the critical section in CreateCheckPoint().  Patch attached.
>
> An alternative fix might be to call AbsorbSyncRequests() after increasing
> the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
> just before checkpointing where new requests are registered for the
> checkpoint following the one about to begin.

Here's a patch that adds a call to AbsorbSyncRequests() in
CheckpointerMain() instead of SyncPreCheckpoint().  I've also figured out
a way to reproduce the issue without the pre-allocation patches applied:

1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
   increment ckpt_started.
2. In session 1, run the following commands:
   a. CREATE TABLESPACE test LOCATION '/path/to/dir';
   b. CREATE TABLE test TABLESPACE test AS SELECT 1;
3. In session 2, start a checkpoint.
4. In session 1, run these commands:
   a. ALTER TABLE test SET TABLESPACE pg_default;
   b. DROP TABLESPACE test;  -- fails
   c. DROP TABLESPACE test;  -- succeeds

With the attached patch applied, the first attempt at dropping the
tablespace no longer fails.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
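The placement this version describes can be sketched like so (again only
a simplified illustration with stand-in names, not the attached patch):

#include <stdio.h>

static int ckpt_started;                   /* stand-in for the ckpt_started counter */

static void absorb_pending_requests(void)
{
    /* stand-in for AbsorbSyncRequests() */
}

static void create_checkpoint(void)
{
    /* stand-in for CreateCheckPoint() */
}

/* Announce that a checkpoint is underway, then immediately drain the
 * request queue, so anything submitted before this point is handled by
 * the checkpoint about to run rather than by the next one. */
static void checkpointer_iteration_sketch(void)
{
    ckpt_started++;
    absorb_pending_requests();
    create_checkpoint();
}

int main(void)
{
    checkpointer_iteration_sketch();
    printf("checkpoints started: %d\n", ckpt_started);
    return 0;
}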
Hi,

On 2022-02-16 09:37:21 -0800, Nathan Bossart wrote:
> I've also figured out a way to reproduce the issue without the
> pre-allocation patches applied:
>
> 1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
>    increment ckpt_started.
> 2. In session 1, run the following commands:
>    a. CREATE TABLESPACE test LOCATION '/path/to/dir';
>    b. CREATE TABLE test TABLESPACE test AS SELECT 1;
> 3. In session 2, start a checkpoint.
> 4. In session 1, run these commands:
>    a. ALTER TABLE test SET TABLESPACE pg_default;
>    b. DROP TABLESPACE test;  -- fails
>    c. DROP TABLESPACE test;  -- succeeds

Did you check whether this is a problem recently introduced or
long-lived?

Does USE_BARRIER_SMGRRELEASE actually prevent this problem?  Or was it
just that it influences the timing in a way that makes it less likely?

ISTM that the problem is partially caused by having multiple "checkpoint"
counters that are used in different ways, but then only waiting for one of
them.  I wonder whether a better approach could be to either unify the
different counters, or to request / wait for the sync counter
specifically?

Historically the sync stuff was something in md.c that the global system
didn't really know anything about, but now it's a "proper" subsystem, so
we can change the separation of concerns a bit more.

> Here's a patch that adds a call to AbsorbSyncRequests() in
> CheckpointerMain() instead of SyncPreCheckpoint().

That doesn't strike me as great architecturally.  E.g. in theory the same
problem could exist in single-user mode.  I think it doesn't today,
because RegisterSyncRequest() will effectively "absorb" it immediately,
but that kind of feels like an implementation detail?

Greetings,

Andres Freund
On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
> Did you check whether this is a problem recently introduced or
> long-lived?

I've reproduced it back to v9.3.  I'm assuming it's much older than that.

> Does USE_BARRIER_SMGRRELEASE actually prevent this problem?  Or was it
> just that it influences the timing in a way that makes it less likely?

I think it just influences the timing.  I believe the WAL pre-allocation
stuff triggers the issue because it adds a step between
AbsorbSyncRequests() and incrementing the started counter.

> ISTM that the problem is partially caused by having multiple "checkpoint"
> counters that are used in different ways, but then only waiting for one of
> them.  I wonder whether a better approach could be to either unify the
> different counters, or to request / wait for the sync counter
> specifically?
>
> Historically the sync stuff was something in md.c that the global system
> didn't really know anything about, but now it's a "proper" subsystem, so
> we can change the separation of concerns a bit more.

If the counters were unified, I think we might still need to do an extra
absorb after incrementing it, and we'd need to make sure that all of those
requests were tagged with the previous counter value so that they are
processed in the current checkpoint.  If callers requested with a specific
counter value, they might need to lock the counter (ckpt_lck) when making
requests.  Maybe that is okay.

>> Here's a patch that adds a call to AbsorbSyncRequests() in
>> CheckpointerMain() instead of SyncPreCheckpoint().
>
> That doesn't strike me as great architecturally.  E.g. in theory the same
> problem could exist in single-user mode.  I think it doesn't today,
> because RegisterSyncRequest() will effectively "absorb" it immediately,
> but that kind of feels like an implementation detail?

Yeah, maybe that is a reason to add an absorb somewhere within
CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
be sufficient for single-user mode if the requests were not absorbed
immediately.

[0] https://www.postgresql.org/message-id/attachment/130994/v1-0001-call-AbsorbSyncRequests-before-advancing-checkpoi.patch

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 16, 2022 at 01:00:53PM -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
>> That doesn't strike me as great architecturally.  E.g. in theory the same
>> problem could exist in single-user mode.  I think it doesn't today,
>> because RegisterSyncRequest() will effectively "absorb" it immediately,
>> but that kind of feels like an implementation detail?
>
> Yeah, maybe that is a reason to add an absorb somewhere within
> CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
> be sufficient for single-user mode if the requests were not absorbed
> immediately.

Here is a new patch.  This is essentially the same as v1, but I've spruced
up the comments and the commit message.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Feb 19, 2022 at 7:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is a new patch.  This is essentially the same as v1, but I've spruced
> up the comments and the commit message.

Hi Nathan,

Pushed and back-patched (it's slightly different before 12).  Thanks!
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote:
> Pushed and back-patched (it's slightly different before 12).  Thanks!

Thank you!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com