Thread: Shortest offline window on database migration

Shortest offline window on database migration

From
Haroldo Kerry
Date:
Hello,

We are migrating our PostgreSQL 9.6.10 database (with streaming replication active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full backup, and then about 1h to fully restore it with checksum enabled on the new array, plus 2h to recreate the replica on the old array.

Although all synthetic tests (pgbench) indicate the new disk array is faster, we will only be 100% confident once we see its performance on production, so our backup plan is using our replica database on the older array. If the new array performance is poor during production ramp up, we can switch to the replica with little impact to our customers.

Problem is the offline window for backup, restore the full database with checksum and recreate the replica is about 5h:30m.

One thing that occurred to us to shorten the offline window was restoring the database to both the master and replica in parallel (of course we would configure the replica as master do restore the database), that would shave 1h of the total time. Although this is not documented we thought that restoring the same database to identical servers would result in binary identical data files.

We tried this in lab. As this is not a kosher way to create a replica, we ran a checksum comparison of all data files, and we ended up having a lot of differences. Bummer. Both master and replica worked (no errors on logs), but we ended up insecure about this path because of the binary differences on data files.
But in principle it should work, right?
Has anyone been through this type of problem?


Regards,
Haroldo Kerry

Re: Shortest offline window on database migration

From
Tomas Vondra
Date:
On Thu, May 30, 2019 at 12:08:04PM -0300, Haroldo Kerry wrote:
>Hello,
>
>We are migrating our PostgreSQL 9.6.10 database (with streaming replication
>active) to a faster disk array.
>We are using this opportunity to enable checksums, so we will have to do a
>full backup-restore.
>The database size is about 500GB, it takes about 2h:30min for a full
>backup, and then about 1h to fully restore it with checksum enabled on the
>new array, plus 2h to recreate the replica on the old array.
>
>Although all synthetic tests (pgbench) indicate the new disk array is
>faster, we will only be 100% confident once we see its performance on
>production, so our backup plan is using our replica database on the older
>array. If the new array performance is poor during production ramp up, we
>can switch to the replica with little impact to our customers.
>
>Problem is the offline window for backup, restore the full database with
>checksum and recreate the replica is about 5h:30m.
>
>One thing that occurred to us to shorten the offline window was restoring
>the database to both the master and replica in parallel (of course we would
>configure the replica as master do restore the database), that would shave
>1h of the total time. Although this is not documented we thought that
>restoring the same database to identical servers would result in binary
>identical data files.
>
>We tried this in lab. As this is not a kosher way to create a replica, we
>ran a checksum comparison of all data files, and we ended up having a lot
>of differences. Bummer. Both master and replica worked (no errors on logs),
>but we ended up insecure about this path because of the binary differences
>on data files.
>But in principle it should work, right?

What should work? Backup using pg_dump and restore certainly won't give
you the same binary files - the commit timestamps will be different,
operations may happen in a different order (esp. with parallel restore),
and so on. And the instances don't start as a copy anyway, so there will
be different system IDs, etc.

So no, this is not a valid way to provision master/standby cluster.

>Has anyone been through this type of problem?
>

Unfortunately, I don't think there's a much better solution that what you
initially described - dump/restore, and then creating a replica.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Shortest offline window on database migration

From
Steven Winfield
Date:
>Has anyone been through this type of problem?
>

You could set up a new, empty db (with checksums enabled, etc.) on the new hardware and then use logical replication to
syncacross all the data from the existing cluster. 
(This logical replica could be doing binary replication to hot standbys too, if you like).

When the sync has finished you could perhaps gradually shift read-only load over to the new db, and finally switch
writeload too - your downtime would then be limited to how long this final cut-over takes. 

Steve.




Re: Shortest offline window on database migration

From
Haroldo Kerry
Date:
Hello Steven,
Thanks a lot for the idea, it had not thought about it.
@Joshua @Tomas, thanks for clarifying why it doesn't work!

Best regards,
Haroldo Kerry

On Thu, May 30, 2019 at 12:54 PM Steven Winfield <Steven.Winfield@cantabcapital.com> wrote:
>Has anyone been through this type of problem?
>

You could set up a new, empty db (with checksums enabled, etc.) on the new hardware and then use logical replication to sync across all the data from the existing cluster.
(This logical replica could be doing binary replication to hot standbys too, if you like).

When the sync has finished you could perhaps gradually shift read-only load over to the new db, and finally switch write load too - your downtime would then be limited to how long this final cut-over takes.

Steve.




This email is confidential. If you are not the intended recipient, please advise us immediately and delete this message. The registered name of Cantab- part of GAM Systematic is Cantab Capital Partners LLP. See - http://www.gam.com/en/Legal/Email+disclosures+EU for further information on confidentiality, the risks of non-secure electronic communication, and certain disclosures which we are required to make in accordance with applicable legislation and regulations. If you cannot access this link, please notify us by reply message and we will send the contents to you.

GAM Holding AG and its subsidiaries (Cantab – GAM Systematic) will collect and use information about you in the course of your interactions with us. Full details about the data types we collect and what we use this for and your related rights is set out in our online privacy policy at https://www.gam.com/en/legal/privacy-policy. Please familiarise yourself with this policy and check it from time to time for updates as it supplements this notice


--

Haroldo Kerry

CTO/COO

Rua do Rócio, 220, 7° andar, conjunto 72

São Paulo – SP / CEP 04552-000

hkerry@callix.com.br

www.callix.com.br

Re: Shortest offline window on database migration

From
Haroldo Kerry
Date:
Hello Steven,
Unfortunately logical replication is a pg10+ feature. One more reason for upgrading from 9.6.10 :-)

Regards,
Haroldo Kerry

On Thu, May 30, 2019 at 1:23 PM Haroldo Kerry <hkerry@callix.com.br> wrote:
Hello Steven,
Thanks a lot for the idea, it had not thought about it.
@Joshua @Tomas, thanks for clarifying why it doesn't work!

Best regards,
Haroldo Kerry

On Thu, May 30, 2019 at 12:54 PM Steven Winfield <Steven.Winfield@cantabcapital.com> wrote:
>Has anyone been through this type of problem?
>

You could set up a new, empty db (with checksums enabled, etc.) on the new hardware and then use logical replication to sync across all the data from the existing cluster.
(This logical replica could be doing binary replication to hot standbys too, if you like).

When the sync has finished you could perhaps gradually shift read-only load over to the new db, and finally switch write load too - your downtime would then be limited to how long this final cut-over takes.

Steve.




This email is confidential. If you are not the intended recipient, please advise us immediately and delete this message. The registered name of Cantab- part of GAM Systematic is Cantab Capital Partners LLP. See - http://www.gam.com/en/Legal/Email+disclosures+EU for further information on confidentiality, the risks of non-secure electronic communication, and certain disclosures which we are required to make in accordance with applicable legislation and regulations. If you cannot access this link, please notify us by reply message and we will send the contents to you.

GAM Holding AG and its subsidiaries (Cantab – GAM Systematic) will collect and use information about you in the course of your interactions with us. Full details about the data types we collect and what we use this for and your related rights is set out in our online privacy policy at https://www.gam.com/en/legal/privacy-policy. Please familiarise yourself with this policy and check it from time to time for updates as it supplements this notice


--

Haroldo Kerry

CTO/COO

Rua do Rócio, 220, 7° andar, conjunto 72

São Paulo – SP / CEP 04552-000

hkerry@callix.com.br

www.callix.com.br



--

Haroldo Kerry

CTO/COO

Rua do Rócio, 220, 7° andar, conjunto 72

São Paulo – SP / CEP 04552-000

hkerry@callix.com.br

www.callix.com.br

Re: Shortest offline window on database migration

From
Jeff Janes
Date:
On Thu, May 30, 2019 at 11:08 AM Haroldo Kerry <hkerry@callix.com.br> wrote:
Hello,

We are migrating our PostgreSQL 9.6.10 database (with streaming replication active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full backup, and then about 1h to fully restore it with checksum enabled on the new array, plus 2h to recreate the replica on the old array.

As others have noticed, your "trick" won't work.  So back to basics.  Are you using the best degree of parallelization on each one of these tasks?  What is the bottleneck of each one (CPU, disk, network)? how are you creating the replica?  Can you share the actual command lines for each one?  It seems odd that the dump (which only needs to dump the index and constraint definitions) is so much slower than the restore (which actually needs to build those indexes and validate the constraints). Is that because the dump is happening from the old slow disk and restore a new fast ones?  Same with creating the replica, why is that slower than actually doing the restore?

It sounds like you are planning on blowing away the old master server on the old array as soon as the upgrade is complete, so you can re-use that space to build the new replica?  That doesn't seem very safe to me--what if during the rebuilding of the replica you run into a major problem and have to roll the whole thing back?  What will the old array which is holding the current replica server be doing in all of this?

Cheers,

Jeff

Re: Shortest offline window on database migration

From
Ian Lawrence Barwick
Date:
> 2019年5月31日(金) 6:53 Haroldo Kerry <hkerry@callix.com.br>:
>>> On Thu, May 30, 2019 at 12:54 PM Steven Winfield <Steven.Winfield@cantabcapital.com> wrote:
>>
>>     >Has anyone been through this type of problem?
>>     >
>>
>>     You could set up a new, empty db (with checksums enabled, etc.) on the new hardware and then use logical
replicationto sync across all the data from the existing cluster. 
>>     (This logical replica could be doing binary replication to hot standbys too, if you like).
>>
>>     When the sync has finished you could perhaps gradually shift read-only load over to the new db, and finally
switchwrite load too - your downtime would then be limited to how long this final cut-over takes. 
>>
>     Steve.
>     Hello Steven,
>     Unfortunately logical replication is a pg10+ feature. One more reason for upgrading from 9.6.10 :-)

Have you looked at the pglogical extension from 2ndQuadrant?

    https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE

Regards

Ian Barwick

--
  Ian Barwick                   https://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Shortest offline window on database migration

From
Fabio Pardi
Date:
On 5/30/19 5:08 PM, Haroldo Kerry wrote:
> Hello,
> 
> We are migrating our PostgreSQL 9.6.10 database (with streaming
> replication active) to a faster disk array.
> We are using this opportunity to enable checksums, 


I would stay away from performing 2 big changes in one go.


regards,

fabio pardi





Re: Shortest offline window on database migration

From
Haroldo Kerry
Date:
Jeff,

We are using the following command to dump the database:

docker exec pg-1 bash -c 'pg_dump -v -U postgres -Fc --file=/var/lib/postgresql/backup/sc2-ssd.bkp smartcenter2_prod' 2>> /var/log/sc2-bkp-ssd.log &

The bottleneck at dump is CPU (a single one, on a 44 thread server), as we are using the -Fc option, that does not allow multiple jobs.
We tried some time ago to use the --jobs option of pg_dump but it was slower, even with more threads. Our guess is the sheer volume of files outweighs the processing gains of using a compressed file output. Also pg_dump even with multiple jobs spends a lot of time (1h+) on the "reading dependency data" section that seems to be single threaded (our application is multi-tenant and we use schemas to separate tenant data, hence we have a lot of tables).
We are dumping the backup to the old array.

We are creating the replica using :
docker exec pg-2 pg_basebackup -h 192.168.0.107 -U replication -P --xlog -D /var/lib/postgresql/data_9.6 
and it is taking 1h10m , instead of the 2h I reported initially, because we were using rsync with checksums to do it, after experimenting with pg_basebackup we found out it is faster, rsync was taking 1h just to calculate all checksums. Thanks for your insight on this taking too long.

Regarding blowing the old array after the upgrade is complete:
Our plan keeps the old array data volume. We restore the backup to the new array with checksums, delete the old replica (space issues), and use the space for the new replica with checksums.
If the new array does not work as expected, we switch to the replica with the old array. If all things go wrong, we can just switch back to the old array data volume (without a replica for some time). 

I'm glad to report that we executed the plan over dawn (4h downtime in the end) and everything worked, the new array is performing as expected.
New array database volume:
4 x Intel 960GB SATA SSD D3-S4610 on a RAID 10 configuration, on a local PERC H730 Controller.
Specs:
  • Mfr part number: SSDSC2KG960G801
  • Form Factor: 2.5 inch
  • Latency: read - 36 µs; write - 37 µs
  • Random Read (100% span): 96, 000 IOPS
  • Random Write (100% span): 51, 000 IOPS

Old array database volume (now used for the replica server):
7 x Samsung 480GB SAS SSD PM-1633 on a Dell SC2020 Compellent iSCSI storage, with dual 1 Gbps interfaces (max bandwidth 2 Gbps)
Specs:  Sequential Read Up to 1,400 MB/s Sequential Write Up to 930 MB/s Random Read Up to 200,000 IOPS Random Write Up to 37,000 IOPS 

Kind of surprisingly to us the local array outperforms the older (and more expensive) by more than 2x.
 
Thanks for the help.

Regards,
Haroldo Kerry

On Thu, May 30, 2019 at 10:42 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, May 30, 2019 at 11:08 AM Haroldo Kerry <hkerry@callix.com.br> wrote:
Hello,

We are migrating our PostgreSQL 9.6.10 database (with streaming replication active) to a faster disk array.
We are using this opportunity to enable checksums, so we will have to do a full backup-restore.
The database size is about 500GB, it takes about 2h:30min for a full backup, and then about 1h to fully restore it with checksum enabled on the new array, plus 2h to recreate the replica on the old array.

As others have noticed, your "trick" won't work.  So back to basics.  Are you using the best degree of parallelization on each one of these tasks?  What is the bottleneck of each one (CPU, disk, network)? how are you creating the replica?  Can you share the actual command lines for each one?  It seems odd that the dump (which only needs to dump the index and constraint definitions) is so much slower than the restore (which actually needs to build those indexes and validate the constraints). Is that because the dump is happening from the old slow disk and restore a new fast ones?  Same with creating the replica, why is that slower than actually doing the restore?

It sounds like you are planning on blowing away the old master server on the old array as soon as the upgrade is complete, so you can re-use that space to build the new replica?  That doesn't seem very safe to me--what if during the rebuilding of the replica you run into a major problem and have to roll the whole thing back?  What will the old array which is holding the current replica server be doing in all of this?

Cheers,

Jeff


--

Haroldo Kerry

CTO/COO

Rua do Rócio, 220, 7° andar, conjunto 72

São Paulo – SP / CEP 04552-000

hkerry@callix.com.br

www.callix.com.br

Re: Shortest offline window on database migration

From
Stephen Frost
Date:
Greetings,

* Haroldo Kerry (hkerry@callix.com.br) wrote:
> The bottleneck at dump is CPU (a single one, on a 44 thread server), as we
> are using the -Fc option, that does not allow multiple jobs.
> We tried some time ago to use the --jobs option of pg_dump but it was
> slower, even with more threads. Our guess is the sheer volume of files
> outweighs the processing gains of using a compressed file output. Also
> pg_dump even with multiple jobs spends a lot of time (1h+) on the "reading
> dependency data" section that seems to be single threaded (our application
> is multi-tenant and we use schemas to separate tenant data, hence we have a
> lot of tables).

You might want to reconsider using the separate-schemas-for-tenants
approach.  This isn't the only annoyance you can run into with lots and
lots of tables.  That said, are you using the newer version of pg_dump
(which is what you should be doing when migrating to a newer version of
PG, always)?  We've improved it over time, though I can't recall off-hand
if this particular issue was improved of in-between the releases being
discussed here.  Of course, lots of little files and dealing with them
could drag down performance when working in parallel.  Still a bit
surprised that it's ending up slower than -Fc.

> We are creating the replica using :
> docker exec pg-2 pg_basebackup -h 192.168.0.107 -U replication -P --xlog -D
> /var/lib/postgresql/data_9.6
> and it is taking 1h10m , instead of the 2h I reported initially, because we
> were using rsync with checksums to do it, after experimenting with
> pg_basebackup we found out it is faster, rsync was taking 1h just to
> calculate all checksums. Thanks for your insight on this taking too long.

So, it's a bit awkward still, unfortunately, but you can use pgbackrest
to effectively give you a parallel-replica-build.  The steps are
something like:

Get pgbackrest WAL archiving up and going, with the repo on the
destination server/filesystem, but have 'compress=n' in the
pgbackrest.conf for the repo.

Run: pgbackrest --stanza=mydb --type=full --process-max=8 backup

Once that's done, just do:

mv /path/to/repo/backup/mydb/20190605-120000F/pg_data /new/pgdata
chmod -R g-rwx /new/pgdata

Then in /new/pgdata, create a recovery.conf file like:

restore_command = 'pgbackrest --stanza=mydb archive-get %f "%p"'

And start up the DB server.

We have some ideas about how make that whole thing cleaner but the
rewrite into C has delayed our efforts, perhaps once that's done (this
fall), we can look at it.

Of course, you won't have an actual backup of the new database server at
that point yet, so you'll want to clean things up and make that happen
ASAP.  Another option, which is what I usually recommend, is just to
take a new backup (properly) and then do a restore from it, but that'll
obviously take longer since there's two copies being done instead of one
(though you can parallelize to your heart's content, so it can still be
quite fast if you have enough CPU and I/O).

Thanks,

Stephen

Attachment