Thread: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.

Thoughts?

--
nathan
Attachment
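The attached patch itself is not reproduced in the archive. Judging from the snippet quoted later in the thread, the change presumably lives in dumpDatabase() in src/bin/pg_dump/pg_dump.c; a minimal sketch of that kind of change, with the version check and field names assumed rather than taken from the actual patch, might look like:

/*
 * Sketch only, not the actual attached patch: request FILE_COPY when pg_dump
 * runs in binary-upgrade mode against a server that has CREATE DATABASE ...
 * STRATEGY (v15+).  creaQry and qdatname are the variables used by
 * dumpDatabase() as shown in the snippet quoted downthread.
 */
appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0",
                  qdatname);
if (dopt->binary_upgrade && fout->remoteVersion >= 150000)
    appendPQExpBufferStr(creaQry, " STRATEGY = FILE_COPY");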
On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:

> I noticed that the "Restoring database schemas in the new cluster" part of
> pg_upgrade can take a while if you have many databases, so I experimented
> with a couple different settings to see if there are any easy ways to speed
> it up. The FILE_COPY strategy for CREATE DATABASE helped quite
> significantly on my laptop. For ~3k empty databases, this step went from
> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41
> made a similar change for initdb, so there might even be an argument for
> back-patching this to v15 (where STRATEGY was introduced). One thing I
> still need to verify is that this doesn't harm anything when there are lots
> of objects in the databases, i.e., more WAL generated during many
> concurrent CREATE-DATABASE-induced checkpoints.
>
> Thoughts?
Why not use it too, if not binary_upgrade?
else
{
appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
qdatname);
}
It seems to me that it also improves in any use.
best regards,
Ranier Vilela
On Wed, Jun 05, 2024 at 01:47:09PM -0300, Ranier Vilela wrote:
> Why not use it too, if not binary_upgrade?
>
> else
> {
>     appendPQExpBuffer(creaQry, "CREATE DATABASE %s WITH TEMPLATE = template0 STRATEGY = FILE_COPY",
>                       qdatname);
> }
>
> It seems to me that it also improves in any use.

Well, if that is true, and I'm not sure it is, then we should probably consider changing the default STRATEGY in the server instead.

I haven't looked too deeply, but my assumption is that when fsync is disabled (as it is when restoring schemas during pg_upgrade), both checkpointing and copying the template database are sufficiently fast to beat writing out all the data to WAL. Furthermore, in my test, all the databases were basically empty, so I suspect that some CREATE DATABASE commands could piggy-back on checkpoints initiated by others. This might not be the case when there are many objects in each database, and that is a scenario I have yet to test.

--
nathan
On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.
>>
>> Thoughts?
>
> Why not use it too, if not binary_upgrade?

Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.

>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch.

As for "on my laptop", that sounds very reasonable, but could you check the performance on systems with larger shared buffer configurations? I'd imagine (but haven't checked) that binary upgrade target systems may already be using the shared_buffers from their source system, which would cause a severe regression when the to-be-upgraded system has large shared buffers. For initdb the database size is known in advance and shared_buffers is approximately empty, but the same is not (always) true when we're doing binary upgrades.

Kind regards,

Matthias van de Meent
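For context on the "2 checkpoints": with the FILE_COPY strategy the server requests an immediate, forced checkpoint both before and after copying the template's directory. A simplified sketch of that code path, paraphrasing src/backend/commands/dbcommands.c rather than quoting it:

/*
 * Simplified paraphrase of the FILE_COPY path in CREATE DATABASE; the copy
 * step is elided.  Not the verbatim source.
 */

/* Checkpoint #1: flush the template's dirty buffers so a plain filesystem
 * copy of its directory is consistent. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                  CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);

/* ... copy the template database's directories and WAL-log
 * XLOG_DBASE_CREATE_FILE_COPY ... */

/* Checkpoint #2: ensure the XLOG_DBASE_CREATE_FILE_COPY record never has to
 * be replayed, since replay would redo the whole directory copy. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

Both checkpoints are immediate and forced, so every FILE_COPY CREATE DATABASE pays the full checkpoint cost twice, which is the cost being discussed here.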
On Wed, Jun 05, 2024 at 07:28:42PM +0200, Matthias van de Meent wrote:
> As for "on my laptop", that sounds very reasonable, but could you check the performance on systems with larger shared buffer configurations? I'd imagine (but haven't checked) that binary upgrade target systems may already be using the shared_buffers from their source system, which would cause a severe regression when the to-be-upgraded system has large shared buffers. For initdb the database size is known in advance and shared_buffers is approximately empty, but the same is not (always) true when we're doing binary upgrades.

Will do. FWIW I haven't had much luck improving pg_upgrade times by adjusting server settings, but I also haven't explored it all that much.

--
nathan
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>
>> On Tue, Jun 4, 2024 at 16:39, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>>
>>> I noticed that the "Restoring database schemas in the new cluster" part of pg_upgrade can take a while if you have many databases, so I experimented with a couple different settings to see if there are any easy ways to speed it up. The FILE_COPY strategy for CREATE DATABASE helped quite significantly on my laptop. For ~3k empty databases, this step went from ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 made a similar change for initdb, so there might even be an argument for back-patching this to v15 (where STRATEGY was introduced). One thing I still need to verify is that this doesn't harm anything when there are lots of objects in the databases, i.e., more WAL generated during many concurrent CREATE-DATABASE-induced checkpoints.
>>>
>>> Thoughts?
>>
>> Why not use it too, if not binary_upgrade?
>
> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.

I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>>
>>> Why not use it too, if not binary_upgrade?
>>
>> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.
>
> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.

While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.

Now that I think about it, arguably, we shouldn't need to run checkpoints during binary upgrade for the FILE_COPY strategy after we've restored the template1 database and created a checkpoint after that: All other databases use template1 as their template database, and the checkpoint is there mostly to guarantee the FS knows about all changes in the template database before we task it with copying the template database over to our new database, so the protections we get from more checkpoints are practically useless.

If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.

Kind regards,

Matthias van de Meent
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>>>
>>>> Why not use it too, if not binary_upgrade?
>>>
>>> Because in the normal case (not during binary_upgrade) you don't want to have to generate 2 checkpoints for every created database, especially not when your shared buffers are large. Checkpoints' costs scale approximately linearly with the size of shared buffers, so being able to skip those checkpoints (with strategy=WAL_LOG) will save a lot of performance in the systems where this performance impact matters most.
>>
>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>
> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.

Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?

> Now that I think about it, arguably, we shouldn't need to run checkpoints during binary upgrade for the FILE_COPY strategy after we've restored the template1 database and created a checkpoint after that: All other databases use template1 as their template database, and the checkpoint is there mostly to guarantee the FS knows about all changes in the template database before we task it with copying the template database over to our new database, so the protections we get from more checkpoints are practically useless.
>
> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.

Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>
>>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>>
>> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.
>
> Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?

My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

I'll note that the documentation for upgrading with pg_upgrade has the step for updating postgresql.conf / postgresql.auto.conf only after pg_upgrade has run already, but that may not be how it's actually used: after all, we don't have full control in this process, the user is the one who provides the new cluster with initdb.

>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>
> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.

Indeed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
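The O(shared_buffers) cost comes from the buffer-sync pass: a checkpoint walks every buffer header in shared_buffers to find the dirty ones, regardless of how few are actually dirty. A rough sketch of that loop, simplified from BufferSync() in src/backend/storage/buffer/bufmgr.c (the real code also builds a sorted write list and throttles the writes):

/* Rough simplification of BufferSync(), not the verbatim source. */
for (int buf_id = 0; buf_id < NBuffers; buf_id++)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
    uint32      buf_state = LockBufHdr(bufHdr);

    if (buf_state & BM_DIRTY)
        buf_state |= BM_CHECKPOINT_NEEDED;      /* queued for writing later */

    UnlockBufHdr(bufHdr, buf_state);
}

With the default 8kB block size, shared_buffers=100GB means roughly 13 million buffer headers to examine per checkpoint, paid twice per FILE_COPY CREATE DATABASE.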
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 7 Jun 2024 at 10:28, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>
>>> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>>
>>>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>>>
>>>> I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster.
>>>
>>> While you would be correct if there were no operations happening in the system, during binary upgrade we're still actively modifying catalogs; and this is done with potentially many concurrent jobs. I think it's not unlikely that this would impact performance.
>>
>> Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right?
>
> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

Okay, that's a valid point.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).

I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:

* 10k empty databases, 128MB shared_buffers
  WAL_LOG:   1m 01s
  FILE_COPY: 0m 22s

* 10k empty databases, 100GB shared_buffers
  WAL_LOG:   2m 03s
  FILE_COPY: 5m 08s

* 2.5k databases with 10k tables each, 128MB shared_buffers
  WAL_LOG:   17m 20s
  FILE_COPY: 16m 44s

* 2.5k databases with 10k tables each, 100GB shared_buffers
  WAL_LOG:   16m 39s
  FILE_COPY: 15m 21s

I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.

> I'll note that the documentation for upgrading with pg_upgrade has the step for updating postgresql.conf / postgresql.auto.conf only after pg_upgrade has run already, but that may not be how it's actually used: after all, we don't have full control in this process, the user is the one who provides the new cluster with initdb.

Good point. I think it's clear that FILE_COPY is not necessarily a win in all cases for pg_upgrade.

>>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>>
>> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.
>
> Indeed.

It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.

I do like the idea of skipping a bunch of unnecessary operations in binary upgrade mode, since it'll help me in my goal of speeding up pg_upgrade. But I'm a bit hesitant to get too fancy here and to introduce a bunch of new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all that exciting. However, we've already sprinkled such checks quite liberally, so maybe I'm being too cautious...

--
nathan
On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
>> My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already set up with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB).
>
> I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:
>
> * 10k empty databases, 128MB shared_buffers
>   WAL_LOG:   1m 01s
>   FILE_COPY: 0m 22s
>
> * 10k empty databases, 100GB shared_buffers
>   WAL_LOG:   2m 03s
>   FILE_COPY: 5m 08s
>
> * 2.5k databases with 10k tables each, 128MB shared_buffers
>   WAL_LOG:   17m 20s
>   FILE_COPY: 16m 44s
>
> * 2.5k databases with 10k tables each, 100GB shared_buffers
>   WAL_LOG:   16m 39s
>   FILE_COPY: 15m 21s
>
> I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.

If you still have the test data set up, could you test the attached patch (which does skip the checkpoints in FILE_COPY mode during binary upgrades)?

>>>> If such a change were implemented (i.e. no checkpoints for FILE_COPY in binary upgrade, with a single manual checkpoint after restoring template1 in create_new_objects) I think most of my concerns with this patch would be alleviated.
>>>
>>> Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway.
>>
>> Indeed.
>
> It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.

Thanks for reminding me. It seems I misunderstood the reason why we first process template1 in create_new_objects, as I didn't read the comments thoroughly enough.

> I do like the idea of skipping a bunch of unnecessary operations in binary upgrade mode, since it'll help me in my goal of speeding up pg_upgrade. But I'm a bit hesitant to get too fancy here and to introduce a bunch of new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all that exciting. However, we've already sprinkled such checks quite liberally, so maybe I'm being too cautious...

Hmm, yes. From an IO perspective I think this could be an improvement, but let's check the numbers first.

Kind regards,

Matthias van de Meent
Attachment
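The patch attached here is likewise not inlined in the archive. Based on the description above, its core presumably amounts to guarding the two FILE_COPY checkpoints with IsBinaryUpgrade; a hedged sketch, assumed to apply to the FILE_COPY path in src/backend/commands/dbcommands.c:

/*
 * Sketch of the described change, not the attached patch itself: during
 * binary upgrade nothing else modifies the template database and a crash
 * means redoing the whole upgrade, so the extra checkpoints buy nothing.
 */
if (!IsBinaryUpgrade)
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);

/* ... copy the template database's directories ... */

if (!IsBinaryUpgrade)
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT);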
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
> On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I did a handful of benchmarks on an r5.24xlarge that seem to prove your point. The following are the durations of the pg_restore step of pg_upgrade:
>>
>> * 10k empty databases, 128MB shared_buffers
>>   WAL_LOG:   1m 01s
>>   FILE_COPY: 0m 22s
>>
>> * 10k empty databases, 100GB shared_buffers
>>   WAL_LOG:   2m 03s
>>   FILE_COPY: 5m 08s
>>
>> * 2.5k databases with 10k tables each, 128MB shared_buffers
>>   WAL_LOG:   17m 20s
>>   FILE_COPY: 16m 44s
>>
>> * 2.5k databases with 10k tables each, 100GB shared_buffers
>>   WAL_LOG:   16m 39s
>>   FILE_COPY: 15m 21s
>>
>> I was surprised with the last result, but there's enough other stuff happening during such a test that I hesitate to conclude much.
>
> If you still have the test data set up, could you test the attached patch (which does skip the checkpoints in FILE_COPY mode during binary upgrades)?

With your patch, I see the following:

* 10k empty databases, 128MB shared_buffers: 0m 27s
* 10k empty databases, 100GB shared_buffers: 1m 44s

I believe the reason the large buffer cache test is still quite a bit slower is due to the truncation of pg_largeobject (specifically its call to DropRelationsAllBuffers()). This TRUNCATE command was added in commit bbe08b8.

--
nathan
On Tue, Jun 11, 2024 at 10:39:51AM +0200, Matthias van de Meent wrote:
> On Tue, 11 Jun 2024 at 04:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> It looks like pg_dump always uses template0, so AFAICT we don't even need the suggested manual checkpoint after restoring template1.
>
> Thanks for reminding me. It seems I misunderstood the reason why we first process template1 in create_new_objects, as I didn't read the comments thoroughly enough.

Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

--
nathan
On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

Here is an updated patch.

--
nathan
Attachment
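As with the other attachments, the updated patch is not inlined. Going by the discussion, the pg_upgrade side presumably boils down to issuing one explicit CHECKPOINT on the new cluster right after prepare_new_globals(); a sketch using pg_upgrade's existing helpers, where the exact placement and wording are assumptions rather than the actual patch:

/*
 * Sketch only: flush the pg_class updates made by set_frozenxids() before
 * the user databases are created with FILE_COPY.  Assumed to run between
 * prepare_new_globals() and create_new_objects() in pg_upgrade.c.
 */
{
    PGconn     *conn = connectToServer(&new_cluster, "template1");

    PQclear(executeQueryOrDie(conn, "CHECKPOINT"));
    PQfinish(conn);
}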
On Fri, 14 Jun 2024 at 23:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.

Good catch, I hadn't thought of that.

> Here is an updated patch.

LGTM.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.
>
> Here is an updated patch.

OK, I have a (probably) stupid question. The comment says:

+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).

But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jun 19, 2024 at 09:17:00AM -0400, Robert Haas wrote:
> OK, I have a (probably) stupid question. The comment says:
>
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
>
> But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

My understanding is that you basically have to restart the upgrade from scratch if that happens. I suppose there could be a problem if you try to use the half-upgraded cluster after a crash, but I imagine you have a good chance of encountering other problems if you do that, too. So I don't think we care...

--
nathan
On Wed, 19 Jun 2024 at 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
>>> Actually, I think you are right that we need a manual checkpoint, except I think we need it to be after prepare_new_globals(). set_frozenxids() connects to each database (including template0) and updates a bunch of pg_class rows, and we probably want those on disk before we start copying the files to create all the user's databases.
>>
>> Here is an updated patch.
>
> OK, I have a (probably) stupid question. The comment says:
>
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
>
> But what happens if the system crashes during pg_upgrade? Does this patch make things worse than they are today? And should we care?

As Nathan just said, AFAIK we don't have a way to resume progress from a crashed pg_upgrade, which implies you currently have to start over with a fresh initdb. This patch wouldn't change that for the worse, it'd only add one more process that depends on that behaviour.

Maybe in the future that'll change if pg_upgrade and pg_dump are made smart enough to resume their restore progress across forceful disconnects, but for now this patch seems like a nice performance improvement.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Wed, Jun 19, 2024 at 08:37:17AM -0500, Nathan Bossart wrote:
> My understanding is that you basically have to restart the upgrade from scratch if that happens. I suppose there could be a problem if you try to use the half-upgraded cluster after a crash, but I imagine you have a good chance of encountering other problems if you do that, too. So I don't think we care...

It's never been assumed that it would be safe to redo a pg_upgrade after a crash on a cluster initdb'd for the upgrade, so I don't think we need to care about that, either. One failure I suspect would quickly be faced is OIDs getting reused again as these are currently kept consistent.

--
Michael
Attachment
Committed.

--
nathan