Thread: race condition when writing pg_control
Hi hackers, I believe I've discovered a race condition between the startup and checkpointer processes that can cause a CRC mismatch in the pg_control file. If a cluster crashes at the right time, the following error appears when you attempt to restart it: FATAL: incorrect checksum in control file This appears to be caused by some code paths in xlog_redo() that update ControlFile without taking the ControlFileLock. The attached patch seems to be sufficient to prevent the CRC mismatch in the control file, but perhaps this is a symptom of a bigger problem with concurrent modifications of ControlFile->checkPointCopy.nextFullXid. Nathan
Attachment
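As an illustration of the kind of change being discussed here (an excerpt-style sketch in the style of xlog.c, not the attached patch; the surrounding replay code is elided):

/*
 * Sketch only: take ControlFileLock around the ControlFile update during
 * replay of a checkpoint record, so that a concurrent UpdateControlFile()
 * in the checkpointer sees either the old or the new value, never a torn one.
 */
CheckPoint  checkPoint;

memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
/* ControlFile->checkPointCopy always tracks the latest checkpoint XID */
ControlFile->checkPointCopy.nextFullXid = checkPoint.nextFullXid;
LWLockRelease(ControlFileLock);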
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote: > I believe I've discovered a race condition between the startup and > checkpointer processes that can cause a CRC mismatch in the pg_control > file. If a cluster crashes at the right time, the following error > appears when you attempt to restart it: > > FATAL: incorrect checksum in control file > > This appears to be caused by some code paths in xlog_redo() that > update ControlFile without taking the ControlFileLock. The attached > patch seems to be sufficient to prevent the CRC mismatch in the > control file, but perhaps this is a symptom of a bigger problem with > concurrent modifications of ControlFile->checkPointCopy.nextFullXid. This does indeed look pretty dodgy. CreateRestartPoint() running in the checkpointer does UpdateControlFile() to compute a checksum and write it out, but xlog_redo() processing XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without interlocking. It looks like the ancestors of that line were there since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran UpdateControlFile() directly in the startup process (immediately after that update), so no interlocking problem. Then in cdd46c76548 (2009), RecoveryRestartPoint() was split up so that CreateRestartPoint() ran in another process.
On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > I believe I've discovered a race condition between the startup and > > checkpointer processes that can cause a CRC mismatch in the pg_control > > file. If a cluster crashes at the right time, the following error > > appears when you attempt to restart it: > > > > FATAL: incorrect checksum in control file > > > > This appears to be caused by some code paths in xlog_redo() that > > update ControlFile without taking the ControlFileLock. The attached > > patch seems to be sufficient to prevent the CRC mismatch in the > > control file, but perhaps this is a symptom of a bigger problem with > > concurrent modifications of ControlFile->checkPointCopy.nextFullXid. > > This does indeed look pretty dodgy. CreateRestartPoint() running in > the checkpointer does UpdateControlFile() to compute a checksum and > write it out, but xlog_redo() processing > XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without > interlocking. It looks like the ancestors of that line were there > since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran > UpdateControLFile() directly in the startup process (immediately after > that update), so no interlocking problem. Then in cdd46c76548 (2009), > RecoveryRestartPoint() was split up so that CreateRestartPoint() ran > in another process. Here's a version with a commit message added. I'll push this to all releases in a day or two if there are no objections.
Attachment
On 2020/05/22 13:51, Thomas Munro wrote: > On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote: >>> I believe I've discovered a race condition between the startup and >>> checkpointer processes that can cause a CRC mismatch in the pg_control >>> file. If a cluster crashes at the right time, the following error >>> appears when you attempt to restart it: >>> >>> FATAL: incorrect checksum in control file >>> >>> This appears to be caused by some code paths in xlog_redo() that >>> update ControlFile without taking the ControlFileLock. The attached >>> patch seems to be sufficient to prevent the CRC mismatch in the >>> control file, but perhaps this is a symptom of a bigger problem with >>> concurrent modifications of ControlFile->checkPointCopy.nextFullXid. >> >> This does indeed look pretty dodgy. CreateRestartPoint() running in >> the checkpointer does UpdateControlFile() to compute a checksum and >> write it out, but xlog_redo() processing >> XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without >> interlocking. It looks like the ancestors of that line were there >> since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran >> UpdateControLFile() directly in the startup process (immediately after >> that update), so no interlocking problem. Then in cdd46c76548 (2009), >> RecoveryRestartPoint() was split up so that CreateRestartPoint() ran >> in another process. > > Here's a version with a commit message added. I'll push this to all > releases in a day or two if there are no objections. +1 to push the patch. Per my quick check, XLogReportParameters() seems to have the similar issue, i.e., it updates the control file without taking ControlFileLock. Maybe we should fix this at the same time? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote: > Per my quick check, XLogReportParameters() seems to have the similar issue, > i.e., it updates the control file without taking ControlFileLock. > Maybe we should fix this at the same time? Yeah. It also checks the control file values, implying that we should have LW_SHARED taken at least at the beginning, but this lock cannot be upgraded, so we need LW_EXCLUSIVE the whole time. I am wondering if we should check with an assert that ControlFileLock is held when going through UpdateControlFile(). We have one code path at the beginning of redo, close to the backup_label file checks, where we don't need a lock, but we could just pass down a boolean flag to the routine to handle that case. Another good thing about having an assert is that any new caller of UpdateControlFile() would need to think about the need for a lock. -- Michael
Attachment
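For illustration, a sketch of the assertion being proposed, assuming the existing LWLockHeldByMeInMode() helper; how the one early-startup caller gets handled (a boolean escape hatch or simply taking the lock) is the open question here:

static void
UpdateControlFile(void)
{
    /*
     * Hypothetical assertion: insist that the caller holds ControlFileLock
     * exclusively, so that any new caller has to think about locking.
     */
    Assert(LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE));

    /* recompute the CRC and write pg_control out, as before */
    update_controlfile(DataDir, ControlFile, true);
}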
On 5/21/20, 9:52 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote: > Here's a version with a commit message added. I'll push this to all > releases in a day or two if there are no objections. Looks good to me. Thanks! Nathan
On 5/22/20, 10:40 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote: >> Per my quick check, XLogReportParameters() seems to have the similar issue, >> i.e., it updates the control file without taking ControlFileLock. >> Maybe we should fix this at the same time? > > Yeah. It also checks the control file values, implying that we should > have LW_SHARED taken at least at the beginning, but this lock cannot > be upgraded we need LW_EXCLUSIVE the whole time. I am wondering if we > should check with an assert if ControlFileLock is taken when going > through UpdateControlFile(). We have one code path at the beginning > of redo where we don't need a lock close to the backup_label file > checks, but we could just pass down a boolean flag to the routine to > handle that case. Another good thing in having an assert is that any > new caller of UpdateControlFile() would need to think about the need > of a lock. While an assertion in UpdateControlFile() would not have helped us catch the problem I initially reported, it does seem worthwhile to add it. I have attached a patch that adds this assertion and also attempts to fix XLogReportParameters(). Since there is only one place where we feel it is safe to call UpdateControlFile() without a lock, I just changed it to take the lock. I don't think this adds any sort of significant contention risk, and IMO it is a bit cleaner than the boolean flag. For the XLogReportParameters() fix, I simply added an exclusive lock acquisition for the portion that updates the values in shared memory and calls UpdateControlFile(). IIUC the first part of this function that accesses several ControlFile values should be safe, as none of them can be updated after server start. Nathan
Attachment
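Roughly, the XLogReportParameters() change described above amounts to something like the following (an abbreviated sketch, not the attached patch):

/*
 * Sketch only: hold ControlFileLock across the shared-memory updates and
 * the UpdateControlFile() call.  The ControlFile reads earlier in the
 * function stay unlocked, since only the startup process changes those
 * fields during recovery.
 */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

ControlFile->MaxConnections = MaxConnections;
ControlFile->max_worker_processes = max_worker_processes;
ControlFile->max_wal_senders = max_wal_senders;
ControlFile->max_prepared_xacts = max_prepared_xacts;
ControlFile->max_locks_per_xact = max_locks_per_xact;
ControlFile->wal_level = wal_level;
ControlFile->wal_log_hints = wal_log_hints;
ControlFile->track_commit_timestamp = track_commit_timestamp;
UpdateControlFile();

LWLockRelease(ControlFileLock);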
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: > While an assertion in UpdateControlFile() would not have helped us > catch the problem I initially reported, it does seem worthwhile to add > it. I have attached a patch that adds this assertion and also > attempts to fix XLogReportParameters(). Since there is only one place > where we feel it is safe to call UpdateControlFile() without a lock, I > just changed it to take the lock. I don't think this adds any sort of > significant contention risk, and IMO it is a bit cleaner than the > boolean flag. Let's see what Fujii-san and Thomas think about that. I'd rather avoid taking a lock here because we don't need it and because it makes things IMO confusing with the beginning of StartupXLOG() where a lot of the fields are read, even if we go without this extra assertion. > For the XLogReportParameters() fix, I simply added an exclusive lock > acquisition for the portion that updates the values in shared memory > and calls UpdateControlFile(). IIUC the first part of this function > that accesses several ControlFile values should be safe, as none of > them can be updated after server start. They can get updated when replaying a XLOG_PARAMETER_CHANGE record. But you are right as all of this happens in the startup process, so your patch looks right to me here. -- Michael
Attachment
On 2020/05/27 16:10, Michael Paquier wrote: > On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: >> While an assertion in UpdateControlFile() would not have helped us >> catch the problem I initially reported, it does seem worthwhile to add >> it. I have attached a patch that adds this assertion and also >> attempts to fix XLogReportParameters(). Since there is only one place >> where we feel it is safe to call UpdateControlFile() without a lock, I >> just changed it to take the lock. I don't think this adds any sort of >> significant contention risk, and IMO it is a bit cleaner than the >> boolean flag. > > Let's see what Fujii-san and Thomas think about that. I'd rather > avoid taking a lock here because we don't need it and because it makes > things IMO confusing with the beginning of StartupXLOG() where a lot > of the fields are read, even if we go without this extra assertion. I have no strong opinion about this, but I tend to agree with Michael here. >> For the XLogReportParameters() fix, I simply added an exclusive lock >> acquisition for the portion that updates the values in shared memory >> and calls UpdateControlFile(). IIUC the first part of this function >> that accesses several ControlFile values should be safe, as none of >> them can be updated after server start. > > They can get updated when replaying a XLOG_PARAMETER_CHANGE record. > But you are right as all of this happens in the startup process, so > your patch looks right to me here. LGTM. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 5/29/20, 12:24 AM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote: > On 2020/05/27 16:10, Michael Paquier wrote: >> On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: >>> While an assertion in UpdateControlFile() would not have helped us >>> catch the problem I initially reported, it does seem worthwhile to add >>> it. I have attached a patch that adds this assertion and also >>> attempts to fix XLogReportParameters(). Since there is only one place >>> where we feel it is safe to call UpdateControlFile() without a lock, I >>> just changed it to take the lock. I don't think this adds any sort of >>> significant contention risk, and IMO it is a bit cleaner than the >>> boolean flag. >> >> Let's see what Fujii-san and Thomas think about that. I'd rather >> avoid taking a lock here because we don't need it and because it makes >> things IMO confusing with the beginning of StartupXLOG() where a lot >> of the fields are read, even if we go without this extra assertion. > > I have no strong opinion about this, but I tend to agree with Michael here. > >>> For the XLogReportParameters() fix, I simply added an exclusive lock >>> acquisition for the portion that updates the values in shared memory >>> and calls UpdateControlFile(). IIUC the first part of this function >>> that accesses several ControlFile values should be safe, as none of >>> them can be updated after server start. >> >> They can get updated when replaying a XLOG_PARAMETER_CHANGE record. >> But you are right as all of this happens in the startup process, so >> your patch looks right to me here. > > LGTM. Thanks for the feedback. I've attached a new set of patches. Nathan
Attachment
On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote: > Thanks for the feedback. I've attached a new set of patches. Thanks for splitting the set. 0001 and 0002 are the minimum set for back-patching, and it would be better to merge them together. 0003 is debatable and not an actual bug fix, so I would refrain from doing a backpatch. It does not seem that there is a strong consensus in favor of 0003 either. Thomas, are you planning to look at this patch set? -- Michael
Attachment
On Tue, Jun 2, 2020 at 5:24 PM Michael Paquier <michael@paquier.xyz> wrote: > On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote: > > Thanks for the feedback. I've attached a new set of patches. > > Thanks for splitting the set. 0001 and 0002 are the minimum set for > back-patching, and it would be better to merge them together. 0003 is > debatable and not an actual bug fix, so I would refrain from doing a > backpatch. It does not seem that there is a strong consensus in favor > of 0003 either. > > Thomas, are you planning to look at this patch set? Sorry for my radio silence, I got tangled up with a couple of conferences. I'm planning to look at 0001 and 0002 shortly.
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > Sorry for my radio silence, I got tangled up with a couple of > conferences. I'm planning to look at 0001 and 0002 shortly. Thanks! -- Michael
Attachment
On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote: > On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > > Sorry for my radio silence, I got tangled up with a couple of > > conferences. I'm planning to look at 0001 and 0002 shortly. > > Thanks! I pushed 0001 and 0002, squashed into one commit. I'm not sure about 0003. If we're going to do that, wouldn't it be better to just acquire the lock in that one extra place in StartupXLOG(), rather than introducing the extra parameter?
On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote: > I pushed 0001 and 0002, squashed into one commit. I'm not sure about > 0003. If we're going to do that, wouldn't it be better to just > acquire the lock in that one extra place in StartupXLOG(), rather than > introducing the extra parameter? Thanks! The approach for 0003 was discussed a bit upthread [0]. I do not have a strong opinion, but I lean towards just acquiring the lock. Nathan [0] https://postgr.es/m/20200527071053.GD103662%40paquier.xyz
On Mon, Jun 08, 2020 at 03:25:31AM +0000, Bossart, Nathan wrote: > On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote: >> I pushed 0001 and 0002, squashed into one commit. I'm not sure about >> 0003. If we're going to do that, wouldn't it be better to just >> acquire the lock in that one extra place in StartupXLOG(), rather than >> introducing the extra parameter? > > Thanks! The approach for 0003 was discussed a bit upthread [0]. I do > not have a strong opinion, but I lean towards just acquiring the lock. Fujii-san has provided an answer upthread that can maybe be translated as a +0.3~0.4: https://www.postgresql.org/message-id/fc796148-7d63-47bb-e91d-e09b62a502e9@oss.nttdata.com FWIW, I'd rather not take the lock, as that's not necessary, and just add the parameter if I were to do it. Now I would be fine as well with just taking the lock if you decide that's simpler, as long as we add this new assertion as a safety net for future changes. -- Michael
Attachment
On Fri, May 29, 2020 at 12:54 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2020/05/27 16:10, Michael Paquier wrote:
>> On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
>>> While an assertion in UpdateControlFile() would not have helped us
>>> catch the problem I initially reported, it does seem worthwhile to add
>>> it. I have attached a patch that adds this assertion and also
>>> attempts to fix XLogReportParameters(). Since there is only one place
>>> where we feel it is safe to call UpdateControlFile() without a lock, I
>>> just changed it to take the lock. I don't think this adds any sort of
>>> significant contention risk, and IMO it is a bit cleaner than the
>>> boolean flag.
>>
>> Let's see what Fujii-san and Thomas think about that. I'd rather
>> avoid taking a lock here because we don't need it and because it makes
>> things IMO confusing with the beginning of StartupXLOG() where a lot
>> of the fields are read, even if we go without this extra assertion.
> I have no strong opinion about this, but I tend to agree with Michael here.
I don't have a strong opinion about this either, but I like Nathan's
approach more: just take the lock in the startup process as well, for
simplicity, if that is not hurting much. Since everywhere apart from the
startup process we have to take the lock to update the control file,
having separate treatment for the startup process looks confusing to me, IMHO.
Regards,
Amul
On Sun, Jun 7, 2020 at 10:49 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > > > Sorry for my radio silence, I got tangled up with a couple of > > > conferences. I'm planning to look at 0001 and 0002 shortly. > > > > Thanks! > > I pushed 0001 and 0002, squashed into one commit. I'm not sure about > 0003. If we're going to do that, wouldn't it be better to just > acquire the lock in that one extra place in StartupXLOG(), rather than > introducing the extra parameter? Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl fail and see this message in the replica log [2]. 2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect checksum in control file I'm pretty sure it's not related to my commit. So, I was looking for existing reports of this error message. It's a long shot, since 0001 and 0002 were already pushed, but this is the only recent report I could find of "FATAL: incorrect checksum in control file" in pgsql-hackers or bugs archives. I do see this thread from 2016 [3] which might be relevant because the reported bug was also on Windows. - Melanie [1] https://cirrus-ci.com/task/4626725689098240 [2] https://api.cirrus-ci.com/v1/artifact/task/4626725689098240/testrun/build/testrun/recovery/018_wal_optimize/log/018_wal_optimize_node_replica.log [3] https://www.postgresql.org/message-id/flat/CAEepm%3D0hh_Dvd2Q%2BfcjYpkVzSoNX2%2Bf167cYu5nwu%3Dqh5HZhJw%40mail.gmail.com#042e9ec55c782370ab49c3a4ef254f4a
On Thu, May 16, 2024 at 12:19:22PM -0400, Melanie Plageman wrote: > Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl > fail and see this message in the replica log [2]. > > 2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect > checksum in control file > > I'm pretty sure it's not related to my commit. So, I was looking for > existing reports of this error message. Yeah, I don't see how it could be related. > It's a long shot, since 0001 and 0002 were already pushed, but this is > the only recent report I could find of "FATAL: incorrect checksum in > control file" in pgsql-hackers or bugs archives. > > I do see this thread from 2016 [3] which might be relevant because the > reported bug was also on Windows. I suspect it will be difficult to investigate this one too much further unless we can track down a copy of the control file with the bad checksum. Other than searching for any new code that isn't doing the appropriate locking, maybe we could search the buildfarm for any other occurrences. I also see some threads concerning whether the way we are reading/writing the control file is atomic. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > I suspect it will be difficult to investigate this one too much further > unless we can track down a copy of the control file with the bad checksum. > Other than searching for any new code that isn't doing the appropriate > locking, maybe we could search the buildfarm for any other occurrences. I > also seem some threads concerning whether the way we are reading/writing > the control file is atomic. The intention was certainly always that it be atomic. If it isn't we have got *big* trouble. regards, tom lane
Hi, On 2024-05-16 14:50:50 -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: > > I suspect it will be difficult to investigate this one too much further > > unless we can track down a copy of the control file with the bad checksum. > > Other than searching for any new code that isn't doing the appropriate > > locking, maybe we could search the buildfarm for any other occurrences. I > > also seem some threads concerning whether the way we are reading/writing > > the control file is atomic. > > The intention was certainly always that it be atomic. If it isn't > we have got *big* trouble. We unfortunately do *know* that on several systems e.g. basebackup can read a partially written control file, while the control file is being updated. Thomas addressed this partially for frontend code, but not yet for backend code. See https://postgr.es/m/CA%2BhUKGLhLGCV67NuTiE%3Detdcw5ChMkYgpgFsa9PtrXm-984FYA%40mail.gmail.com Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2024-05-16 14:50:50 -0400, Tom Lane wrote: >> The intention was certainly always that it be atomic. If it isn't >> we have got *big* trouble. > We unfortunately do *know* that on several systems e.g. basebackup can read a > partially written control file, while the control file is being > updated. Yeah, but can't we just retry that if we get a bad checksum? What had better be atomic is the write to disk. Systems that can't manage POSIX semantics for concurrent reads and writes are annoying, but not fatal ... regards, tom lane
Hi, On 2024-05-16 15:01:31 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2024-05-16 14:50:50 -0400, Tom Lane wrote: > >> The intention was certainly always that it be atomic. If it isn't > >> we have got *big* trouble. > > > > We unfortunately do *know* that on several systems e.g. basebackup can read a > > partially written control file, while the control file is being > > updated. > > Yeah, but can't we just retry that if we get a bad checksum? Retry what/where precisely? We can avoid the issue in basebackup.c by taking ControlFileLock at the right moment - but that doesn't address pg_start/stop_backup based backups. Hence the patch in the referenced thread, which moves to replacing the control file by atomic rename if there are base backups ongoing. > What had better be atomic is the write to disk. That is still true to my knowledge. > Systems that can't manage POSIX semantics for concurrent reads and writes > are annoying, but not fatal ... I think part of the issue is that people don't agree on what POSIX says about a read that's concurrent with a write... See e.g. https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic Greetings, Andres Freund
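To make the "just retry" idea concrete, a rough sketch of a retrying reader (read_whole_file() and the retry count are hypothetical; this is not code from the referenced thread):

static void
ReadControlFileWithRetry(const char *path, ControlFileData *out)
{
    for (int attempt = 0; attempt < 10; attempt++)
    {
        pg_crc32c   crc;

        read_whole_file(path, out, sizeof(*out));   /* hypothetical helper */

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, (char *) out, offsetof(ControlFileData, crc));
        FIN_CRC32C(crc);

        if (EQ_CRC32C(crc, out->crc))
            return;             /* got a consistent copy */

        /* assume we raced with a concurrent writer; back off and retry */
        pg_usleep(10000L);
    }

    ereport(FATAL,
            (errmsg("incorrect checksum in control file")));
}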
The specific problem here is that LocalProcessControlFile() runs in every launched child for EXEC_BACKEND builds. Windows uses EXEC_BACKEND, and Windows' NTFS file system is one of the two file systems known to this list to have the concurrent read/write data mashing problem (the other being ext4).
On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote: > The specific problem here is that LocalProcessControlFile() runs in > every launched child for EXEC_BACKEND builds. Windows uses > EXEC_BACKEND, and Windows' NTFS file system is one of the two file > systems known to this list to have the concurrent read/write data > mashing problem (the other being ext4). Phngh... this is surprisingly difficult to fix. Things that don't work: We "just" need to acquire ControlFileLock while reading the file or examining the object in shared memory, or get a copy of it, passed through the EXEC_BACKEND BackendParameters that was acquired while holding the lock, but the current location of this code in child startup is too early to use LWLocks, and the postmaster can't acquire locks either so it can't even safely take a copy to pass on. You could reorder startup so that we are allowed to acquire LWLocks in children at that point, but then you'd need to convince yourself that there is no danger of breaking some ordering requirement in external preload libraries, and figure out what to do about children that don't even attach to shared memory. Maybe that's possible, but that doesn't sound like a good idea to back-patch. First idea I've come up with to avoid all of that: pass a copy of the "proto-controlfile", to coin a term for the one read early in postmaster startup by LocalProcessControlFile(). As far as I know, the only reason we need it is to suck some settings out of it that don't change while a cluster is running (mostly can't change after initdb, and checksums can only be {en,dis}abled while down). Right? Children can just "import" that sucker instead of calling LocalProcessControlFile() to figure out the size of WAL segments yada yada, I think? Later they will attach to the real one in shared memory for all future purposes, once normal interlocking is allowed. I dunno. Draft patch attached. Better plans welcome. This passes CI on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts?
Attachment
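In outline, the "proto-controlfile" idea might look something like this (names such as ProtoControlFile and SerializedControlFile are made up for illustration, not necessarily what the attached draft uses):

/*
 * 1. Postmaster startup, right after LocalProcessControlFile(): save the
 *    locally-read copy before ControlFile is later repointed at shared
 *    memory.
 */
memcpy(&ProtoControlFile, ControlFile, sizeof(ControlFileData));

/* 2. EXEC_BACKEND launch, e.g. in save_backend_variables(): ship it. */
memcpy(&param->SerializedControlFile, &ProtoControlFile, sizeof(ControlFileData));

/* 3. Child startup, instead of calling LocalProcessControlFile(): */
ControlFile = palloc(sizeof(ControlFileData));
memcpy(ControlFile, &param->SerializedControlFile, sizeof(ControlFileData));

/*
 * Later, once shared memory is attached, ControlFile is repointed at the
 * shared copy and the usual ControlFileLock interlocking applies.
 */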
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote: > On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > The specific problem here is that LocalProcessControlFile() runs in > > every launched child for EXEC_BACKEND builds. Windows uses > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file > > systems known to this list to have the concurrent read/write data > > mashing problem (the other being ext4). > First idea idea I've come up with to avoid all of that: pass a copy of > the "proto-controlfile", to coin a term for the one read early in > postmaster startup by LocalProcessControlFile(). As far as I know, > the only reason we need it is to suck some settings out of it that > don't change while a cluster is running (mostly can't change after > initdb, and checksums can only be {en,dis}abled while down). Right? > Children can just "import" that sucker instead of calling > LocalProcessControlFile() to figure out the size of WAL segments yada > yada, I think? Later they will attach to the real one in shared > memory for all future purposes, once normal interlocking is allowed. I like that strategy, particularly because it recreates what !EXEC_BACKEND backends inherit from the postmaster. It might prevent future bugs that would have been specific to EXEC_BACKEND. > I dunno. Draft patch attached. Better plans welcome. This passes CI > on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts? Looks reasonable. I didn't check over every detail, given the draft status.
On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah@leadboat.com> wrote: > On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote: > > On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > The specific problem here is that LocalProcessControlFile() runs in > > > every launched child for EXEC_BACKEND builds. Windows uses > > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file > > > systems known to this list to have the concurrent read/write data > > > mashing problem (the other being ext4). > > > First idea idea I've come up with to avoid all of that: pass a copy of > > the "proto-controlfile", to coin a term for the one read early in > > postmaster startup by LocalProcessControlFile(). As far as I know, > > the only reason we need it is to suck some settings out of it that > > don't change while a cluster is running (mostly can't change after > > initdb, and checksums can only be {en,dis}abled while down). Right? > > Children can just "import" that sucker instead of calling > > LocalProcessControlFile() to figure out the size of WAL segments yada > > yada, I think? Later they will attach to the real one in shared > > memory for all future purposes, once normal interlocking is allowed. > > I like that strategy, particularly because it recreates what !EXEC_BACKEND > backends inherit from the postmaster. It might prevent future bugs that would > have been specific to EXEC_BACKEND. Thanks for looking! Yeah, that is a good way to put it. The only other idea I can think of is that the Postmaster could take all of the things that LocalProcessControlFile() wants to extract from the file, and transfer them via that struct used for EXEC_BACKEND as individual variables, instead of this new proto-controlfile copy. I think it would be a bigger change with no obvious-to-me additional benefit, so I didn't try it. > > I dunno. Draft patch attached. Better plans welcome. This passes CI > > on Linux systems afflicted by EXEC_BACKEND, and Windows. Thoughts? > > Looks reasonable. I didn't check over every detail, given the draft status. I'm going to upgrade this to a proposal: https://commitfest.postgresql.org/49/5124/ I wonder how often this happens in the wild.
Hello Thomas, 15.07.2024 06:44, Thomas Munro wrote: > I'm going to upgrade this to a proposal: > > https://commitfest.postgresql.org/49/5124/ > > I wonder how often this happens in the wild. Please look at a recent failure [1], produced by buildfarm animal culicidae, which tests EXEC_BACKEND. I guess it's caused by the issue discussed. Maybe it would make sense to construct a reliable reproducer for the issue (I could not find a ready-to-use recipe in this thread)... What do you think? [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-07-24%2004%3A08%3A23 Best regards, Alexander
On Mon, Jul 15, 2024 at 03:44:48PM +1200, Thomas Munro wrote: > On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah@leadboat.com> wrote: > > On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote: > > > On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > The specific problem here is that LocalProcessControlFile() runs in > > > > every launched child for EXEC_BACKEND builds. Windows uses > > > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file > > > > systems known to this list to have the concurrent read/write data > > > > mashing problem (the other being ext4). > > > > > First idea idea I've come up with to avoid all of that: pass a copy of > > > the "proto-controlfile", to coin a term for the one read early in > > > postmaster startup by LocalProcessControlFile(). As far as I know, > > > the only reason we need it is to suck some settings out of it that > > > don't change while a cluster is running (mostly can't change after > > > initdb, and checksums can only be {en,dis}abled while down). Right? > > > Children can just "import" that sucker instead of calling > > > LocalProcessControlFile() to figure out the size of WAL segments yada > > > yada, I think? Later they will attach to the real one in shared > > > memory for all future purposes, once normal interlocking is allowed. > > > > I like that strategy, particularly because it recreates what !EXEC_BACKEND > > backends inherit from the postmaster. It might prevent future bugs that would > > have been specific to EXEC_BACKEND. > > Thanks for looking! Yeah, that is a good way to put it. Oops, the way I put it turned out to be false. Postmaster has ControlFile pointing to shared memory before forking backends, so !EXEC_BACKEND children are born that way. In the postmaster, ControlFile->checkPointCopy->redo does change after each checkpoint. > The only other idea I can think of is that the Postmaster could take > all of the things that LocalProcessControlFile() wants to extract from > the file, and transfer them via that struct used for EXEC_BACKEND as > individual variables, instead of this new proto-controlfile copy. I > think it would be a bigger change with no obvious-to-me additional > benefit, so I didn't try it. Yeah, that would be more future-proof but a bigger change. One could argue for a yet-larger refactor so even the !EXEC_BACKEND case doesn't read those fields from ControlFile memory. Then we could get rid of ControlFile ever being set to something other than NULL or a shmem pointer. ControlFileData's mix of initdb-time fields, postmaster-start-time fields, and changes-anytime fields is inconvenient here. The unknown is the value of that future proofing. Much EXEC_BACKEND early startup code is shared with postmaster startup, which can assume it's the only process. I can't rule out a future bug where that shared code does a read that's harmless in postmaster startup but harmful when an EXEC_BACKEND child runs the same read. For a changes-anytime field, the code would already be subtly buggy in EXEC_BACKEND today, since it would be reading without an LWLock. For a postmaster-start-time field, things should be okay so long as we capture the proto ControlFileData after the last change to such fields. That invariant is not trivial to achieve, but it's not gravely hard either. A possible middle option would be to use the proto control file, but explicitly set its changes-anytime fields to bogus values. What's your preference? I don't think any of these would be bad decisions. 
It could be clearer after enumerating how many ControlFile fields get used this early.
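If the "bogus values" middle option were chosen, the poisoning step might look roughly like this (a hypothetical sketch reusing the made-up ProtoControlFile name from the outline above; the field choice is illustrative):

/*
 * Deliberately clobber the changes-anytime fields in the copy handed to
 * EXEC_BACKEND children, so an accidental early read of them yields
 * obviously-wrong values instead of plausible stale ones.
 */
ProtoControlFile.checkPoint = InvalidXLogRecPtr;
ProtoControlFile.minRecoveryPoint = InvalidXLogRecPtr;
memset(&ProtoControlFile.checkPointCopy, 0x7f, sizeof(ProtoControlFile.checkPointCopy));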