Thread: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.

Thoughts?

--
nathan
Attachment
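The attached patch itself is not reproduced in the archive. Judging from the snippet quoted later in the thread, the change presumably lives in dumpDatabase() in src/bin/pg_dump/pg_dump.c; a minimal sketch of that kind of change, with the version check and field names assumed rather than taken from the actual patch, might look like:

/*
 * Sketch only, not the actual attached patch: request FILE_COPY when pg_dump
 * runs in binary-upgrade mode against a server that has CREATE DATABASE ...
 * STRATEGY (v15+).  creaQry and qdatname are the variables used by
 * dumpDatabase() as shown in the snippet quoted downthread.
 */
appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0",
                  qdatname);
if (dopt->binary_upgrade && fout->remoteVersion >= 150000)
    appendPQExpBufferStr(creaQry, " STRATEGY = FILE_COPY");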
On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:

> I noticed that the "Restoring database schemas in the new cluster" part of
> pg_upgrade can take a while if you have many databases, so I experimented
> with a couple different settings to see if there are any easy ways to speed
> it up. The FILE_COPY strategy for CREATE DATABASE helped quite
> significantly on my laptop. For ~3k empty databases, this step went from
> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
> made a similar change for initdb, so there might even be an argument for
> back-patching this to v15 (where STRATEGY was introduced). One thing I
> still need to verify is that this doesn't harm anything when there are lots
> of objects in the databases, i.e., more WAL generated during many
> concurrent CREATE-DATABASE-induced checkpoints.
>
> Thoughts?
Why not use it too, if not binary_upgrade?
else
{
appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
qdatname);
}
It seems to me that it also improves in any use.
best regards,
Ranier Vilela
On Wed, Jun 05, 2024 at 01:47:09PM -0300, Ranier Vilela wrote:
> Why not use it too, if not binary_upgrade?
>
> else
> {
>     appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
>                       qdatname);
> }
>
> It seems to me that it also improves in any use.

Well, if that is true, and I'm not sure it is, then we should probably consider changing the default STRATEGY in the server instead.

I haven't looked too deeply, but my assumption is that when fsync is disabled (as it is when restoring schemas during pg_upgrade), both checkpointing and copying the template database are sufficiently fast to beat writing out all the data to WAL. Furthermore, in my test, all the databases were basically empty, so I suspect that some CREATE DATABASE commands could piggy-back on checkpoints initiated by others. This might not be the case when there are many objects in each database, and that is a scenario I have yet to test.

--
nathan
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.
>>
>> Thoughts?
>
> Why not use it too, if not binary_upgrade?

Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.

>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch.

As for "on my laptop", that sounds very reasonable, but could you check the performance on systems with larger shared buffer configurations? I'd imagine (but haven't checked) that binary upgrade target systems may already be using the shared_buffers from their source system, which would cause a severe regression when the to-be-upgraded system has large shared buffers. For initdb the database size is known in advance and shared_buffers is approximately empty, but the same is not (always) true when we're doing binary upgrades.

Kind regards,

Matthias van de Meent
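For context on the "2 checkpoints": with the FILE_COPY strategy the server requests an immediate, forced checkpoint both before and after copying the template's directory. A simplified sketch of that code path, paraphrasing src/backend/commands/dbcommands.c rather than quoting it:

/*
 * Simplified paraphrase of the FILE_COPY path in CREATE DATABASE; the copy
 * step is elided.  Not the verbatim source.
 */

/* Checkpoint #1: flush the template's dirty buffers so a plain filesystem
 * copy of its directory is consistent. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);

/* ... copy the template database's directories and WAL-log
 * XLOG_DBASE_CREATE_FILE_COPY ... */

/* Checkpoint #2: ensure the XLOG_DBASE_CREATE_FILE_COPY record never has to
 * be replayed, since replay would redo the whole directory copy. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

Both checkpoints are immediate and forced, so every FILE_COPY CREATE DATABASE pays the full checkpoint cost twice, which is the cost being discussed here.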
On Wed, Jun 05, 2024 at 07:28:42PM +0200, Matthias van de Meent wrote:
> As for "on my laptop", that sounds very reasonable, but could you check the performance on systems with larger shared buffer configurations? I'd imagine (but haven't checked) that binary upgrade target systems may already be using the shared_buffers from their source system, which would cause a severe regression when the to-be-upgraded system has large shared buffers. For initdb the database size is known in advance and shared_buffers is approximately empty, but the same is not (always) true when we're doing binary upgrades.

Will do. FWIW I haven't had much luck improving pg_upgrade times by adjusting server settings, but I also haven't explored it all that much.

--
nathan
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>
>> On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>>
>>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.
>>>
>>> Thoughts?
>>
>> Why not use it too, if not binary_upgrade?
>
> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.

I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>>
>>> Why not use it too, if not binary_upgrade?
>>
>> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.
>
> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.

While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.

Now that I think about it, arguably, we shouldn't need to run checkpoints during binary upgrade for the FILE_COPY strategy after we've restored the template1 database and created a checkpoint after that: All other databases use template1 as their template database, and the checkpoint is there mostly to guarantee the FS knows about all changes in the template database before we task it with copying the template database over to our new database, so the protections we get from more checkpoints are practically useless.

If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.

Kind regards,

Matthias van de Meent
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>>>
>>>> Why not use it too, if not binary_upgrade?
>>>
>>> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.
>>
>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>
> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.

Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?

> Now that I think about it, arguably, we shouldn't need to run checkpoints during binary upgrade for the FILE_COPY strategy after we've restored the template1 database and created a checkpoint after that: All other databases use template1 as their template database, and the checkpoint is there mostly to guarantee the FS knows about all changes in the template database before we task it with copying the template database over to our new database, so the protections we get from more checkpoints are practically useless.
>
> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.

Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>
>>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>>
>> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.
>
> Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?

My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

I'll note that the documentation for upgrading with pg_upgrade has the step for updating postgresql.conf / postgresql.auto.conf only after pg_upgrade has run already, but that may not be how it's actually used: after all, we don't have full control in this process, the user is the one who provides the new cluster with initdb.

>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>
> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.

Indeed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
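The O(shared_buffers) cost comes from the buffer-sync pass: a checkpoint walks every buffer header in shared_buffers to find the dirty ones, regardless of how few are actually dirty. A rough sketch of that loop, simplified from BufferSync() in src/backend/storage/buffer/bufmgr.c (the real code also builds a sorted write list and throttles the writes):

/* Rough simplification of BufferSync(), not the verbatim source. */
for (int buf_id = 0; buf_id < NBuffers; buf_id++)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
    uint32      buf_state = LockBufHdr(bufHdr);

    if (buf_state & BM_DIRTY)
        buf_state |= BM_CHECKPOINT_NEEDED;      /* queued for writing later */

    UnlockBufHdr(bufHdr, buf_state);
}

With the default 8kB block size, shared_buffers=100GB means roughly 13 million buffer headers to examine per checkpoint, paid twice per FILE_COPY CREATE DATABASE.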
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>>
>>>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>>
>>>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>>>
>>> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.
>>
>> Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?
>
> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

Okay, that's a valid point.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:

* 10k empty databases, 128MB shared_buffers
  WAL_LOG:   1m 01s
  FILE_COPY: 0m 22s

* 10k empty databases, 100GB shared_buffers
  WAL_LOG:   2m 03s
  FILE_COPY: 5m 08s

* 2.5k databases with 10k tables each, 128MB shared_buffers
  WAL_LOG:   17m 20s
  FILE_COPY: 16m 44s

* 2.5k databases with 10k tables each, 100GB shared_buffers
  WAL_LOG:   16m 39s
  FILE_COPY: 15m 21s

I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.

> I'll note that the documentation for upgrading with pg_upgrade has the step for updating postgresql.conf / postgresql.auto.conf only after pg_upgrade has run already, but that may not be how it's actually used: after all, we don't have full control in this process, the user is the one who provides the new cluster with initdb.

Good point. I think it's clear that FILE_COPY is not necessarily a win in all cases for pg_upgrade.

>>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>>
>> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.
>
> Indeed.

It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.

I do like the idea of skipping a bunch of unnecessary operations in binary upgrade mode, since it'll help me in my goal of speeding up pg_upgrade. But I'm a bit hesitant to get too fancy here and to introduce a bunch of new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all that exciting. However, we've already sprinkled such checks quite liberally, so maybe I'm being too cautious...

--
nathan
On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
>> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).
>
> I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:
>
> * 10k empty databases, 128MB shared_buffers
>   WAL_LOG:   1m 01s
>   FILE_COPY: 0m 22s
>
> * 10k empty databases, 100GB shared_buffers
>   WAL_LOG:   2m 03s
>   FILE_COPY: 5m 08s
>
> * 2.5k databases with 10k tables each, 128MB shared_buffers
>   WAL_LOG:   17m 20s
>   FILE_COPY: 16m 44s
>
> * 2.5k databases with 10k tables each, 100GB shared_buffers
>   WAL_LOG:   16m 39s
>   FILE_COPY: 15m 21s
>
> I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.

If you still have the test data set up, could you test the attached patch (which does skip the checkpoints in FILE_COPY mode during binary upgrades)?

>>>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>>>
>>> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.
>>
>> Indeed.
>
> It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.

Thanks for reminding me. It seems I misunderstood the reason why we first process template1 in create_new_objects, as I didn't read the comments thoroughly enough.

> I do like the idea of skipping a bunch of unnecessary operations in binary upgrade mode, since it'll help me in my goal of speeding up pg_upgrade. But I'm a bit hesitant to get too fancy here and to introduce a bunch of new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all that exciting. However, we've already sprinkled such checks quite liberally, so maybe I'm being too cautious...

Hmm, yes. From an IO perspective I think this could be an improvement, but let's check the numbers first.

Kind regards,

Matthias van de Meent
Attachment
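The patch attached here is likewise not inlined in the archive. Based on the description above, its core presumably amounts to guarding the two FILE_COPY checkpoints with IsBinaryUpgrade; a hedged sketch, assumed to apply to the FILE_COPY path in src/backend/commands/dbcommands.c:

/*
 * Sketch of the described change, not the attached patch itself: during
 * binary upgrade nothing else modifies the template database and a crash
 * means redoing the whole upgrade, so the extra checkpoints buy nothing.
 */
if (!IsBinaryUpgrade)
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);

/* ... copy the template database's directories ... */

if (!IsBinaryUpgrade)
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT);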
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
> On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:
>>
>> * 10k empty databases, 128MB shared_buffers
>>   WAL_LOG:   1m 01s
>>   FILE_COPY: 0m 22s
>>
>> * 10k empty databases, 100GB shared_buffers
>>   WAL_LOG:   2m 03s
>>   FILE_COPY: 5m 08s
>>
>> * 2.5k databases with 10k tables each, 128MB shared_buffers
>>   WAL_LOG:   17m 20s
>>   FILE_COPY: 16m 44s
>>
>> * 2.5k databases with 10k tables each, 100GB shared_buffers
>>   WAL_LOG:   16m 39s
>>   FILE_COPY: 15m 21s
>>
>> I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.
>
> If you still have the test data set up, could you test the attached patch (which does skip the checkpoints in FILE_COPY mode during binary upgrades)?

With your patch, I see the following:

* 10k empty databases, 128MB shared_buffers: 0m 27s
* 10k empty databases, 100GB shared_buffers: 1m 44s

I believe the reason the large buffer cache test is still quite a bit slower is due to the truncation of pg_largeobject (specifically its call to DropRelationsAllBuffers()). This TRUNCATE command was added in commit bbe08b8.

--
nathan
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
> On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.
>
> Thanks for reminding me. It seems I misunderstood the reason why we first process template1 in create_new_objects, as I didn't read the comments thoroughly enough.

Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

--
nathan
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

Here is an updated patch.

--
nathan
Attachment
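As with the other attachments, the updated patch is not inlined. Going by the discussion, the pg_upgrade side presumably boils down to issuing one explicit CHECKPOINT on the new cluster right after prepare_new_globals(); a sketch using pg_upgrade's existing helpers, where the exact placement and wording are assumptions rather than the actual patch:

/*
 * Sketch only: flush the pg_class updates made by set_frozenxids() before
 * the user databases are created with FILE_COPY.  Assumed to run between
 * prepare_new_globals() and create_new_objects() in pg_upgrade.c.
 */
{
    PGconn     *conn = connectToServer(&new_cluster, "template1");

    PQclear(executeQueryOrDie(conn, "CHECKPOINT"));
    PQfinish(conn);
}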
On Fri, 14 Jun 2024 at 23:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

Good catch, I hadn't thought of that.

> Here is an updated patch.

LGTM.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.
>
> Here is an updated patch.

OK, I have a (probably) stupid question. The comment says:

+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).

But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jun 19, 2024 at 09:17:00AM -0400, Robert Haas wrote:
> OK, I have a (probably) stupid question. The comment says:
>
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
>
> But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

My understanding is that you basically have to restart the upgrade from scratch if that happens. I suppose there could be a problem if you try to use the half-upgraded cluster after a crash, but I imagine you have a good chance of encountering other problems if you do that, too. So I don't think we care...

--
nathan
On Wed, 19 Jun 2024 at 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.
>>
>> Here is an updated patch.
>
> OK, I have a (probably) stupid question. The comment says:
>
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
>
> But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

As Nathan just said, AFAIK we don't have a way to resume progress from a crashed pg_upgrade, which implies you currently have to start over with a fresh initdb. This patch wouldn't change that for the worse, it'd only add one more process that depends on that behaviour.

Maybe in the future that'll change if pg_upgrade and pg_dump are made smart enough to resume their restore progress across forceful disconnects, but for now this patch seems like a nice performance improvement.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Wed, Jun 19, 2024 at 08:37:17AM -0500, Nathan Bossart wrote:
> My understanding is that you basically have to restart the upgrade from scratch if that happens. I suppose there could be a problem if you try to use the half-upgraded cluster after a crash, but I imagine you have a good chance of encountering other problems if you do that, too. So I don't think we care...

It's never been assumed that it would be safe to redo a pg_upgrade after a crash on a cluster initdb'd for the upgrade, so I don't think we need to care about that, either. One failure I suspect would quickly be faced is OIDs getting reused again as these are currently kept consistent.

--
Michael
Attachment
Committed.

--
nathan