Thread: USE_BARRIER_SMGRRELEASE on Linux?
I've noticed that my WAL pre-allocation patches [0] routinely fail with
"tablespace is not empty" errors on Linux [1]:

    DROP TABLESPACE regress_tblspace_renamed;
    +ERROR: tablespace "regress_tblspace_renamed" is not empty

This appears to have been discussed elsewhere (and fixed) for Windows
[2] [3] [4].  However, I'm able to reliably reproduce the issue on Linux
when the WAL pre-allocation patches are applied.  Adding a short sleep
before dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems
to clear up the issue.

My current thinking is that the performance boost from pre-allocating WAL
is revealing this issue, but I'm not 100% certain it's not due to a bug in
my code.  I'm continuing to investigate.

[0] https://commitfest.postgresql.org/37/3158/
[1] https://api.cirrus-ci.com/v1/artifact/task/4678319084929024/log/src/test/regress/regression.diffs
[2] https://postgr.es/m/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com
[4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4eb2176

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi Nathan,

On Wed, Feb 16, 2022 at 12:58 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've noticed that my WAL pre-allocation patches [0] routinely fail with
> "tablespace is not empty" errors on Linux [1]:
>
>     DROP TABLESPACE regress_tblspace_renamed;
>     +ERROR: tablespace "regress_tblspace_renamed" is not empty
>
> This appears to have been discussed elsewhere (and fixed) for Windows
> [2] [3] [4].  However, I'm able to reliably reproduce the issue on Linux
> when the WAL pre-allocation patches are applied.  Adding a short sleep
> before dropping the tablespace or defining USE_BARRIER_SMGRRELEASE seems
> to clear up the issue.
>
> My current thinking is that the performance boost from pre-allocating WAL
> is revealing this issue, but I'm not 100% certain it's not due to a bug in
> my code.  I'm continuing to investigate.

Huh.  I applied your patches and saw no failure on my local Linux hacking
station.  I noticed that it failed a couple of times recently on Linux CI,
as you mentioned:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3158

But it didn't fail before (sorry, cfbot branch names needlessly include
the commitfest number, which is stupid and breaks the history; I'll fix
that one day):

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3158

It also hasn't failed on those other Unices (FreeBSD, macOS).

I pushed a hack to log the name of the file that's getting in the way:

2022-02-16 01:10:54.180 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 01:10:54.180 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] STATEMENT: DROP TABLESPACE regress_tblspace_renamed;
2022-02-16 01:10:54.187 UTC [21880][checkpointer] LOG: checkpoint starting: immediate force wait
2022-02-16 01:10:54.188 UTC [21880][checkpointer] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] STATEMENT: DROP TABLESPACE regress_tblspace_renamed;
2022-02-16 01:10:54.189 UTC [22286][client backend] [pg_regress/tablespace][3/682:918] ERROR: tablespace "regress_tblspace_renamed" is not empty

So we can see that the checkpoint didn't get rid of (presumed) tombstone
file 16748.  At least in unpatched PostgreSQL, this shouldn't be possible
if 16748 is on the pending delete list, since CreateCheckPoint() runs
SyncPostCheckpoint() before returning, and only after that do we kick the
condition variable and release the backend.  And USE_BARRIER_SMGRRELEASE
shouldn't really make a difference on a Unix, since all it does is close
file descriptors; it doesn't unlink anything.

I tried adding another log line so I could see the unlink() happening,
but with that it doesn't fail (though I'm still trying).  We may be in
heisenbug territory.

A conspiracy theorist with a siege mentality might wonder if Linux has
developed a name cache invalidation bug, so we're seeing a slightly
out-of-date directory listing (hmm, that CI image's kernel did go from
5.10.84-1 to 5.10.92-1 ~24 days ago), but I don't really understand what
your patch has to do with that...  Hrmph.
On Wed, Feb 16, 2022 at 3:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I tried adding another log line so I could see the unlink() happening,
> but with that it doesn't fail (though I'm still trying).  We may be in
> heisenbug territory.

This time I was luckier:

https://api.cirrus-ci.com/v1/artifact/task/4915815005814784/log/src/test/regress/log/postmaster.log

Interesting lines:

2022-02-16 02:08:50.933 UTC [14496][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
2022-02-16 02:08:50.935 UTC [14089][checkpointer] LOG: checkpoint starting: immediate force wait
2022-02-16 02:08:50.937 UTC [14089][checkpointer] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=274 kB, estimate=7214 kB
2022-02-16 02:08:50.937 UTC [14496][client backend] [pg_regress/tablespace][3/682:918] LOG: XXXX found 16748
...
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG: shutting down
2022-02-16 02:10:10.010 UTC [14089][checkpointer] LOG: checkpoint starting: shutdown immediate
...
2022-02-16 02:10:10.080 UTC [14089][checkpointer] LOG: XXXX mdunlinkfiletag "pg_tblspc/16501/PG_15_202202141/16384/16748"

So it's not getting unlinked until the *next* checkpoint cycle.  I don't
know why.
Hi,

On 2022-02-16 15:09:15 +1300, Thomas Munro wrote:
> I pushed a hack to log the name of the file that's getting in the way:

FWIW, I think we should just have that in core.  Right now it's just
pointlessly hard to debug.  I think just reporting the first file would
be good enough...

Greetings,

Andres Freund
On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

On my machine (5.11.0-43), it looks like the test starts failing after
cc50080.  That commit adjusted some regression tests, so I'm assuming it's
not the root cause.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-02-16 15:34:08 +1300, Thomas Munro wrote:
> So it's not getting unlinked until the *next* checkpoint cycle.  I
> don't know why.

It might be helpful to know what the relfilenode maps to, so we know the
operations done to it.  Perhaps logging in heap_create() and
RelationSetNewRelfilenode() would do the trick?

Greetings,

Andres Freund
Hi,

On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
> > So it's not getting unlinked until the *next* checkpoint cycle.  I
> > don't know why.
>
> On my machine (5.11.0-43), it looks like the test starts failing after
> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
> not the root cause.

That makes some sense.  Before that commit, we didn't create database
objects before tablespace.sql.  A scheduling change like that could make
some pre-existing issues much more visible.

I think we also used to see the Windows drop-tablespace issues less
commonly than we have recently (outside the aio tree, at least)...

Greetings,

Andres Freund
On Tue, Feb 15, 2022 at 09:06:45PM -0800, Andres Freund wrote:
> On 2022-02-15 20:59:11 -0800, Nathan Bossart wrote:
>> On Wed, Feb 16, 2022 at 03:34:08PM +1300, Thomas Munro wrote:
>> > So it's not getting unlinked until the *next* checkpoint cycle.  I
>> > don't know why.
>>
>> On my machine (5.11.0-43), it looks like the test starts failing after
>> cc50080.  That commit adjusted some regression tests, so I'm assuming it's
>> not the root cause.
>
> That makes some sense.  Before that commit, we didn't create database
> objects before tablespace.sql.  A scheduling change like that could make
> some pre-existing issues much more visible.
>
> I think we also used to see the Windows drop-tablespace issues less
> commonly than we have recently (outside the aio tree, at least)...

It looks like register_unlink_segment() is called prior to the checkpoint,
but the checkpointer does not call RememberSyncRequest() until after
SyncPreCheckpoint().  This means that the requests are registered with the
next checkpoint cycle count, so they aren't processed until the next
checkpoint.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
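To make that timing concrete, here is a minimal, self-contained sketch of
the interaction (a simplified model only, not PostgreSQL source; all of
the names below are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model: an unlink request is only stamped with a cycle
 * number when the "checkpointer" absorbs it from the shared queue. */
typedef struct
{
    const char *path;
    uint64_t    stamped_cycle;
} PendingUnlink;

static uint64_t cycle_ctr = 0;      /* advanced at the start of each checkpoint */

static void absorb(PendingUnlink *req)
{
    req->stamped_cycle = cycle_ctr; /* too late if the counter was already bumped */
}

static bool unlink_this_cycle(const PendingUnlink *req)
{
    return req->stamped_cycle < cycle_ctr;  /* only older-cycle entries qualify */
}

int main(void)
{
    PendingUnlink req = { "pg_tblspc/.../16748", 0 };

    /* The backend queues the unlink and then requests a checkpoint, but
     * the checkpointer advances the cycle counter *before* absorbing: */
    cycle_ctr++;                    /* cf. SyncPreCheckpoint() */
    absorb(&req);                   /* cf. AbsorbSyncRequests()/RememberSyncRequest() */

    /* The entry carries the new cycle number, so this checkpoint skips it. */
    printf("unlinked by this checkpoint? %s\n",
           unlink_this_cycle(&req) ? "yes" : "no");   /* prints "no" */
    return 0;
}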
On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
> It looks like register_unlink_segment() is called prior to the checkpoint,
> but the checkpointer does not call RememberSyncRequest() until after
> SyncPreCheckpoint().  This means that the requests are registered with the
> next checkpoint cycle count, so they aren't processed until the next
> checkpoint.

Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
out of the critical section in CreateCheckPoint().  Patch attached.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
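Under the same simplified model, the reordering described here looks
roughly like the following (a sketch of the idea only, not the attached
patch; the names are stand-ins for the real routines):

#include <stdint.h>

static uint64_t cycle_ctr;                 /* illustrative checkpoint cycle counter */

static void absorb_pending_requests(void)
{
    /* stand-in for AbsorbSyncRequests(): stamp queued requests with the
     * cycle counter as it is right now */
}

/* Drain already-queued requests while the old cycle number is still
 * current, then start the new cycle; only requests arriving after this
 * point are deferred to the following checkpoint. */
static void pre_checkpoint_sketch(void)
{
    absorb_pending_requests();
    cycle_ctr++;                           /* stand-in for SyncPreCheckpoint() */
}

int main(void)
{
    pre_checkpoint_sketch();
    return 0;
}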
On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>> It looks like register_unlink_segment() is called prior to the checkpoint,
>> but the checkpointer does not call RememberSyncRequest() until after
>> SyncPreCheckpoint().  This means that the requests are registered with the
>> next checkpoint cycle count, so they aren't processed until the next
>> checkpoint.
>
> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
> out of the critical section in CreateCheckPoint().  Patch attached.

An alternative fix might be to call AbsorbSyncRequests() after increasing
the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
just before checkpointing where new requests are registered for the
checkpoint following the one about to begin.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 16, 2022 at 08:44:42AM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:57:32PM -0800, Nathan Bossart wrote:
>> On Tue, Feb 15, 2022 at 10:14:04PM -0800, Nathan Bossart wrote:
>>> It looks like register_unlink_segment() is called prior to the checkpoint,
>>> but the checkpointer does not call RememberSyncRequest() until after
>>> SyncPreCheckpoint().  This means that the requests are registered with the
>>> next checkpoint cycle count, so they aren't processed until the next
>>> checkpoint.
>>
>> Calling AbsorbSyncRequests() before advancing the checkpoint cycle counter
>> seems to fix the issue.  However, this requires moving SyncPreCheckpoint()
>> out of the critical section in CreateCheckPoint().  Patch attached.
>
> An alternative fix might be to call AbsorbSyncRequests() after increasing
> the ckpt_started counter in CheckpointerMain().  AFAICT there is a window
> just before checkpointing where new requests are registered for the
> checkpoint following the one about to begin.

Here's a patch that adds a call to AbsorbSyncRequests() in
CheckpointerMain() instead of SyncPreCheckpoint().  I've also figured out
a way to reproduce the issue without the pre-allocation patches applied:

1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
   increment ckpt_started.
2. In session 1, run the following commands:
   a. CREATE TABLESPACE test LOCATION '/path/to/dir';
   b. CREATE TABLE test TABLESPACE test AS SELECT 1;
3. In session 2, start a checkpoint.
4. In session 1, run these commands:
   a. ALTER TABLE test SET TABLESPACE pg_default;
   b. DROP TABLESPACE test;  -- fails
   c. DROP TABLESPACE test;  -- succeeds

With the attached patch applied, the first attempt at dropping the
tablespace no longer fails.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
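The placement this version describes can be sketched like so (again only
a simplified illustration with stand-in names, not the attached patch):

#include <stdio.h>

static int ckpt_started;                   /* stand-in for the ckpt_started counter */

static void absorb_pending_requests(void)
{
    /* stand-in for AbsorbSyncRequests() */
}

static void create_checkpoint(void)
{
    /* stand-in for CreateCheckPoint() */
}

/* Announce that a checkpoint is underway, then immediately drain the
 * request queue, so anything submitted before this point is handled by
 * the checkpoint about to run rather than by the next one. */
static void checkpointer_iteration_sketch(void)
{
    ckpt_started++;
    absorb_pending_requests();
    create_checkpoint();
}

int main(void)
{
    checkpointer_iteration_sketch();
    printf("checkpoints started: %d\n", ckpt_started);
    return 0;
}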
Hi,

On 2022-02-16 09:37:21 -0800, Nathan Bossart wrote:
> I've also figured out a way to reproduce the issue without the
> pre-allocation patches applied:
>
> 1. In checkpointer.c, add a 30 second sleep before acquiring ckpt_lck to
>    increment ckpt_started.
> 2. In session 1, run the following commands:
>    a. CREATE TABLESPACE test LOCATION '/path/to/dir';
>    b. CREATE TABLE test TABLESPACE test AS SELECT 1;
> 3. In session 2, start a checkpoint.
> 4. In session 1, run these commands:
>    a. ALTER TABLE test SET TABLESPACE pg_default;
>    b. DROP TABLESPACE test;  -- fails
>    c. DROP TABLESPACE test;  -- succeeds

Did you check whether this is a problem recently introduced or
long-lived?

Does USE_BARRIER_SMGRRELEASE actually prevent this problem?  Or was it
just that it influences the timing in a way that makes it less likely?

ISTM that the problem is partially caused by having multiple "checkpoint"
counters that are used in different ways, but then only waiting for one of
them.  I wonder whether a better approach could be to either unify the
different counters, or to request / wait for the sync counter
specifically?

Historically the sync stuff was something in md.c that the global system
didn't really know anything about, but now it's a "proper" subsystem, so
we can change the separation of concerns a bit more.

> Here's a patch that adds a call to AbsorbSyncRequests() in
> CheckpointerMain() instead of SyncPreCheckpoint().

That doesn't strike me as great architecturally.  E.g. in theory the same
problem could exist in single-user mode.  I think it doesn't today,
because RegisterSyncRequest() will effectively "absorb" it immediately,
but that kind of feels like an implementation detail?

Greetings,

Andres Freund
On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
> Did you check whether this is a problem recently introduced or
> long-lived?

I've reproduced it back to v9.3.  I'm assuming it's much older than that.

> Does USE_BARRIER_SMGRRELEASE actually prevent this problem?  Or was it
> just that it influences the timing in a way that makes it less likely?

I think it just influences the timing.  I believe the WAL pre-allocation
stuff triggers the issue because it adds a step between
AbsorbSyncRequests() and incrementing the started counter.

> ISTM that the problem is partially caused by having multiple "checkpoint"
> counters that are used in different ways, but then only waiting for one of
> them.  I wonder whether a better approach could be to either unify the
> different counters, or to request / wait for the sync counter
> specifically?
>
> Historically the sync stuff was something in md.c that the global system
> didn't really know anything about, but now it's a "proper" subsystem, so
> we can change the separation of concerns a bit more.

If the counters were unified, I think we might still need to do an extra
absorb after incrementing it, and we'd need to make sure that all of those
requests were tagged with the previous counter value so that they are
processed in the current checkpoint.  If callers requested with a specific
counter value, they might need to lock the counter (ckpt_lck) when making
requests.  Maybe that is okay.

>> Here's a patch that adds a call to AbsorbSyncRequests() in
>> CheckpointerMain() instead of SyncPreCheckpoint().
>
> That doesn't strike me as great architecturally.  E.g. in theory the same
> problem could exist in single-user mode.  I think it doesn't today,
> because RegisterSyncRequest() will effectively "absorb" it immediately,
> but that kind of feels like an implementation detail?

Yeah, maybe that is a reason to add an absorb somewhere within
CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
be sufficient for single-user mode if the requests were not absorbed
immediately.

[0] https://www.postgresql.org/message-id/attachment/130994/v1-0001-call-AbsorbSyncRequests-before-advancing-checkpoi.patch

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 16, 2022 at 01:00:53PM -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 11:27:31AM -0800, Andres Freund wrote:
>> That doesn't strike me as great architecturally.  E.g. in theory the same
>> problem could exist in single-user mode.  I think it doesn't today,
>> because RegisterSyncRequest() will effectively "absorb" it immediately,
>> but that kind of feels like an implementation detail?
>
> Yeah, maybe that is a reason to add an absorb somewhere within
> CreateCheckPoint() instead, like v1 [0] does.  Then the extra absorb would
> be sufficient for single-user mode if the requests were not absorbed
> immediately.

Here is a new patch.  This is essentially the same as v1, but I've spruced
up the comments and the commit message.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Feb 19, 2022 at 7:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is a new patch.  This is essentially the same as v1, but I've spruced
> up the comments and the commit message.

Hi Nathan,

Pushed and back-patched (it's slightly different before 12).  Thanks!
On Wed, Mar 16, 2022 at 06:40:23PM +1300, Thomas Munro wrote:
> Pushed and back-patched (it's slightly different before 12).  Thanks!

Thank you!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com