Thread: Difference in the tablespace folders on primary and secondary nodes
I recently set up a Postgres15 master-slave (Primary-secondary) cluster on 2 nodes. At the time of starting, I ensured that data files are exactly the same on both the nodes. The size of the DB is 1.5TB.
The directory structure on both nodes looks as follows:
```
- /PROD/datadg/
|
|-> /PROD/datadg/tablespace
| |-> /PROD/datadg/tablespace/tablespace1
| |-> /PROD/datadg/tablespace/tablespace2
|
|-> /PROD/datadg/data
| |-> /PROD/datadg/data/pg_tblspc
| | | -> /PROD/datadg/data/pg_tblspc/16432 -> /PROD/datadg/tablespace/tablespace1
| | | -> /PROD/datadg/data/pg_tblspc/16433 -> /PROD/datadg/tablespace/tablespace2
```
Almost a week later now, I see almost a 2GB size difference in the tablespace folders on the 2 nodes. I also see some file count difference on both the nodes.
`autovacuum` is on on both the nodes and there aren't any `idle in transaction` queries on the slave node. Also there hasn't been any disruption on the streaming replication. I did not get any error like `WAL segment already removed` or so on the slave node. `pg_stat_replication` on the master node also doesn't show anything out of the ordinary and the `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not see a difference in counts of most tables either, haven't verified for all of them.
So my **first question** is:
* Why is there a difference in the files in the tablespace folder? I can understand the difference in the modification timestamps, but some files are just missing on the slave node.
Now if I were to run `vacuumdb` on the master node, there are chances that the slave node will break and give an error like this
```
PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at 875E/21A70BD0 for
Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel 16405/16419/533716933, fork 2, blk 26; blkref #1: rel 16405/16419/533716933, blk 853758",,,,"","startup
```
In the case when slave node breaks, these are the steps I usually do to bring the slave node back:
1) Start `pg_backup_start('backup')` on the master node
2) rsync the files from master to slave by running the following on the slave node:
```
rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data --exclude 'pg_log' --exclude 'pg_replslot'
```
3. Stop `pg_backup_stop()` on master node
4. Start the slave node again and it usually works, even though the tablespace files might not still be the same.
**Second question**:
* What is the best way to bring the slave node back? Is the `rsync` between tablespaces required? And if yes, what is the best method to do it for very large databases, something maybe as big as 30TB or more. I don't want to `rsync` all the files even if the timestamp on them is different. So is a command like this safe to do? Or should an option like `--checksum` be used?
```
rsync -av --delete master_node:/PROD/datadg/tablespace/ /PROD/datadg/tablespace --size-only
```
**Third question:**
* Is it advised to run `vacuumdb` before or after bringing the slave node back again?
The directory structure on both nodes looks as follows:
```
- /PROD/datadg/
|
|-> /PROD/datadg/tablespace
| |-> /PROD/datadg/tablespace/tablespace1
| |-> /PROD/datadg/tablespace/tablespace2
|
|-> /PROD/datadg/data
| |-> /PROD/datadg/data/pg_tblspc
| | | -> /PROD/datadg/data/pg_tblspc/16432 -> /PROD/datadg/tablespace/tablespace1
| | | -> /PROD/datadg/data/pg_tblspc/16433 -> /PROD/datadg/tablespace/tablespace2
```
Almost a week later now, I see almost a 2GB size difference in the tablespace folders on the 2 nodes. I also see some file count difference on both the nodes.
`autovacuum` is on on both the nodes and there aren't any `idle in transaction` queries on the slave node. Also there hasn't been any disruption on the streaming replication. I did not get any error like `WAL segment already removed` or so on the slave node. `pg_stat_replication` on the master node also doesn't show anything out of the ordinary and the `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not see a difference in counts of most tables either, haven't verified for all of them.
So my **first question** is:
* Why is there a difference in the files in the tablespace folder? I can understand the difference in the modification timestamps, but some files are just missing on the slave node.
Now if I were to run `vacuumdb` on the master node, there are chances that the slave node will break and give an error like this
```
PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at 875E/21A70BD0 for
Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel 16405/16419/533716933, fork 2, blk 26; blkref #1: rel 16405/16419/533716933, blk 853758",,,,"","startup
```
In the case when slave node breaks, these are the steps I usually do to bring the slave node back:
1) Start `pg_backup_start('backup')` on the master node
2) rsync the files from master to slave by running the following on the slave node:
```
rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data --exclude 'pg_log' --exclude 'pg_replslot'
```
3. Stop `pg_backup_stop()` on master node
4. Start the slave node again and it usually works, even though the tablespace files might not still be the same.
**Second question**:
* What is the best way to bring the slave node back? Is the `rsync` between tablespaces required? And if yes, what is the best method to do it for very large databases, something maybe as big as 30TB or more. I don't want to `rsync` all the files even if the timestamp on them is different. So is a command like this safe to do? Or should an option like `--checksum` be used?
```
rsync -av --delete master_node:/PROD/datadg/tablespace/ /PROD/datadg/tablespace --size-only
```
**Third question:**
* Is it advised to run `vacuumdb` before or after bringing the slave node back again?
This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).
Greetings, * Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote: > I recently set up a Postgres15 master-slave (Primary-secondary) cluster on > 2 nodes. At the time of starting, I ensured that data files are exactly the > same on both the nodes. The size of the DB is 1.5TB. > The directory structure on both nodes looks as follows: > ``` > - /PROD/datadg/ > | > |-> /PROD/datadg/tablespace > | |-> /PROD/datadg/tablespace/tablespace1 > | |-> /PROD/datadg/tablespace/tablespace2 > | > |-> /PROD/datadg/data > | |-> /PROD/datadg/data/pg_tblspc > | | | -> /PROD/datadg/data/pg_tblspc/16432 -> > /PROD/datadg/tablespace/tablespace1 > | | | -> /PROD/datadg/data/pg_tblspc/16433 -> > /PROD/datadg/tablespace/tablespace2 > > ``` > Almost a week later now, I see almost a 2GB size difference in the > tablespace folders on the 2 nodes. I also see some file count difference on > both the nodes. Not really enough to go on here. > `autovacuum` is on on both the nodes and there aren't any `idle in > transaction` queries on the slave node. Also there hasn't been any > disruption on the streaming replication. I did not get any error like `WAL > segment already removed` or so on the slave node. `pg_stat_replication` on > the master node also doesn't show anything out of the ordinary and the > `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not > see a difference in counts of most tables either, haven't verified for all > of them. > So my **first question** is: > > * Why is there a difference in the files in the tablespace folder? I can > understand the difference in the modification timestamps, but some files > are just missing on the slave node. Unlogged tables would certainly be a pretty easy explanation of size differences between the two. There's also temporary files that might be created on one system but not the other for in-progress queries. These aren't the only possibilities but just a couple of likely candidates. > Now if I were to run `vacuumdb` on the master node, there are chances that > the slave node will break and give an error like this > ``` > PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at > 875E/21A70BD0 for > Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel > 16405/16419/533716933, fork 2, blk 26; blkref #1: rel > 16405/16419/533716933, blk 853758",,,,"","startup > ``` Did you actually run vacuumdb and actually see this? Are you able to reproduce it? > In the case when slave node breaks, these are the steps I usually do to > bring the slave node back: > 1) Start `pg_backup_start('backup')` on the master node > 2) rsync the files from master to slave by running the following on the > slave node: > ``` > rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data > --exclude 'pg_log' --exclude 'pg_replslot' > ``` > 3. Stop `pg_backup_stop()` on master node This really isn't good enough as rsync will use timestamp/size by default and this also doesn't grab and restore the absolutely required backup_label file that's returned from pg_backup_stop(). Basically, this is not a valid way to perform a backup/restore of PG. Please go read the documentation for how to use the low-level API ... or, even better, use an existing well maintained backup/restore tool for this. > 4. Start the slave node again and it usually works, even though the > tablespace files might not still be the same. If you didn't grab the backup_label from pg_backup_stop() and drop it into place on the replica, then you've almost certainly got a corrupt system. > **Second question**: > > * What is the best way to bring the slave node back? Is the `rsync` between > tablespaces required? And if yes, what is the best method to do it for very > large databases, something maybe as big as 30TB or more. I don't want to > `rsync` all the files even if the timestamp on them is different. So is a > command like this safe to do? Or should an option like `--checksum` be > used? > ``` > rsync -av --delete master_node:/PROD/datadg/tablespace/ > /PROD/datadg/tablespace --size-only > ``` Using --size-only is an absolutely horrid idea and will miss lots of changes. What you're doing isn't working because you're just creating corrupt replicas by not following the process correctly for performing a backup and restore of PG. This stuff isn't easy. Use a tool for it, especially for larger systems. > **Third question:** > > * Is it advised to run `vacuumdb` before or after bringing the slave node > back again? No, this isn't necessary. Thanks, Stephen
Attachment
Hi Stephen
Thank you for your reply.
Basically, this is not a valid way to perform a backup/restore of PG.
Is it not valid only for PG 15 or even for earlier versions? I have always referred to this https://www.postgresql.org/docs/8.1/backup-online.html
better, use an existing well maintained backup/restore tool for this
Is there any tool you could recommend? A tool to bring back the standby node when it was down for a day or so for some OS/firmware upgrade.
I have tried pgBackRest, but it requires a control node. So anything that doesn't involve a 3rd server for this situation would be helpful.
I have tried pgBackRest, but it requires a control node. So anything that doesn't involve a 3rd server for this situation would be helpful.
this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop()
I tried running pg_backup_start('backup') and pg_backup_stop() on my DB, but I did not see any backup_label_file being created.
psql (15.1)
Type "help" for help.
postgres=# select pg_backup_start('backup');
pg_backup_start
-----------------
68/32000028
(1 row)
postgres=# select pg_backup_stop();
NOTICE: all required WAL segments have been archived
pg_backup_stop
-------------------------------------------------------------------------------
(68/32000100,"START WAL LOCATION: 68/32000028 (file 000000010000006800000032)+
CHECKPOINT LOCATION: 68/32000060 +
BACKUP METHOD: streamed +
BACKUP FROM: primary +
START TIME: 2023-07-26 08:51:28 JST +
LABEL: backup +
START TIMELINE: 1 +
","16724 /PGDATA/datadg/tbs1 +
16725 /PGDATA/datadg/tbs2 +
")
(1 row)
Type "help" for help.
postgres=# select pg_backup_start('backup');
pg_backup_start
-----------------
68/32000028
(1 row)
postgres=# select pg_backup_stop();
NOTICE: all required WAL segments have been archived
pg_backup_stop
-------------------------------------------------------------------------------
(68/32000100,"START WAL LOCATION: 68/32000028 (file 000000010000006800000032)+
CHECKPOINT LOCATION: 68/32000060 +
BACKUP METHOD: streamed +
BACKUP FROM: primary +
START TIME: 2023-07-26 08:51:28 JST +
LABEL: backup +
START TIMELINE: 1 +
","16724 /PGDATA/datadg/tbs1 +
16725 /PGDATA/datadg/tbs2 +
")
(1 row)
I read the documentation on this page https://www.postgresql.org/docs/current/functions-admin.html
The desired contents of the backup label file and the tablespace map file are returned as part of the result of the function and must be written to files in the backup area.
I don't understand, "must be written to files in the backup area". Does it mean we need to manually create a file first on the master node and then rsync it to the backup node?
Thanks
On Tue, Jul 25, 2023 at 9:22 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote:
> I recently set up a Postgres15 master-slave (Primary-secondary) cluster on
> 2 nodes. At the time of starting, I ensured that data files are exactly the
> same on both the nodes. The size of the DB is 1.5TB.
> The directory structure on both nodes looks as follows:
> ```
> - /PROD/datadg/
> |
> |-> /PROD/datadg/tablespace
> | |-> /PROD/datadg/tablespace/tablespace1
> | |-> /PROD/datadg/tablespace/tablespace2
> |
> |-> /PROD/datadg/data
> | |-> /PROD/datadg/data/pg_tblspc
> | | | -> /PROD/datadg/data/pg_tblspc/16432 ->
> /PROD/datadg/tablespace/tablespace1
> | | | -> /PROD/datadg/data/pg_tblspc/16433 ->
> /PROD/datadg/tablespace/tablespace2
>
> ```
> Almost a week later now, I see almost a 2GB size difference in the
> tablespace folders on the 2 nodes. I also see some file count difference on
> both the nodes.
Not really enough to go on here.
> `autovacuum` is on on both the nodes and there aren't any `idle in
> transaction` queries on the slave node. Also there hasn't been any
> disruption on the streaming replication. I did not get any error like `WAL
> segment already removed` or so on the slave node. `pg_stat_replication` on
> the master node also doesn't show anything out of the ordinary and the
> `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not
> see a difference in counts of most tables either, haven't verified for all
> of them.
> So my **first question** is:
>
> * Why is there a difference in the files in the tablespace folder? I can
> understand the difference in the modification timestamps, but some files
> are just missing on the slave node.
Unlogged tables would certainly be a pretty easy explanation of size
differences between the two. There's also temporary files that might be
created on one system but not the other for in-progress queries. These
aren't the only possibilities but just a couple of likely candidates.
> Now if I were to run `vacuumdb` on the master node, there are chances that
> the slave node will break and give an error like this
> ```
> PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at
> 875E/21A70BD0 for
> Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel
> 16405/16419/533716933, fork 2, blk 26; blkref #1: rel
> 16405/16419/533716933, blk 853758",,,,"","startup
> ```
Did you actually run vacuumdb and actually see this? Are you able to
reproduce it?
> In the case when slave node breaks, these are the steps I usually do to
> bring the slave node back:
> 1) Start `pg_backup_start('backup')` on the master node
> 2) rsync the files from master to slave by running the following on the
> slave node:
> ```
> rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data
> --exclude 'pg_log' --exclude 'pg_replslot'
> ```
> 3. Stop `pg_backup_stop()` on master node
This really isn't good enough as rsync will use timestamp/size by
default and this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop(). Basically,
this is not a valid way to perform a backup/restore of PG. Please go
read the documentation for how to use the low-level API ... or, even
better, use an existing well maintained backup/restore tool for this.
> 4. Start the slave node again and it usually works, even though the
> tablespace files might not still be the same.
If you didn't grab the backup_label from pg_backup_stop() and drop it
into place on the replica, then you've almost certainly got a corrupt
system.
> **Second question**:
>
> * What is the best way to bring the slave node back? Is the `rsync` between
> tablespaces required? And if yes, what is the best method to do it for very
> large databases, something maybe as big as 30TB or more. I don't want to
> `rsync` all the files even if the timestamp on them is different. So is a
> command like this safe to do? Or should an option like `--checksum` be
> used?
> ```
> rsync -av --delete master_node:/PROD/datadg/tablespace/
> /PROD/datadg/tablespace --size-only
> ```
Using --size-only is an absolutely horrid idea and will miss lots of
changes. What you're doing isn't working because you're just creating
corrupt replicas by not following the process correctly for performing a
backup and restore of PG.
This stuff isn't easy. Use a tool for it, especially for larger
systems.
> **Third question:**
>
> * Is it advised to run `vacuumdb` before or after bringing the slave node
> back again?
No, this isn't necessary.
Thanks,
Stephen
This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).
Hi Stephen
I got the latest documentation and understood that I was supposed to run
`select * from pg_backup_stop();` and store the labelfile output to a file in the data directory of the secondary node.
I also understood that this Low-level API backup is not really a recommended way and I need to move to another method.
However, till I fully set up a tool like pgBackRest (of which you are one of the contributors - very impressive!) for PROD, can you please tell me:
1) Are there any other options that are safe to use yet fast? Like you said rsync --size-only would miss a lot of changes, but anything that would also not start syncing every data file with only different timestamp, even though it is exactly the same, including the checksum.
2) While rsyncing from the master node, do I need to exclude `pg_wal` folder?
3) The replica/standby node should have access to the WAL segments archived during time between start and stop backup, right?
4) What kind of data corruption would have been introduced due to the absence of labelfile and if there is a way to identify the corruption? So for example, if I am able to reindex all the tables, select all the data and vacuum the db, do I still need to pg_dump and pg_restore those DBs?
Thanks
On Wed, Jul 26, 2023 at 11:00 AM Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote:
Hi StephenThank you for your reply.Basically, this is not a valid way to perform a backup/restore of PG.Is it not valid only for PG 15 or even for earlier versions? I have always referred to this https://www.postgresql.org/docs/8.1/backup-online.html
better, use an existing well maintained backup/restore tool for thisIs there any tool you could recommend? A tool to bring back the standby node when it was down for a day or so for some OS/firmware upgrade.
I have tried pgBackRest, but it requires a control node. So anything that doesn't involve a 3rd server for this situation would be helpful.this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop()I tried running pg_backup_start('backup') and pg_backup_stop() on my DB, but I did not see any backup_label_file being created.psql (15.1)
Type "help" for help.
postgres=# select pg_backup_start('backup');
pg_backup_start
-----------------
68/32000028
(1 row)
postgres=# select pg_backup_stop();
NOTICE: all required WAL segments have been archived
pg_backup_stop
-------------------------------------------------------------------------------
(68/32000100,"START WAL LOCATION: 68/32000028 (file 000000010000006800000032)+
CHECKPOINT LOCATION: 68/32000060 +
BACKUP METHOD: streamed +
BACKUP FROM: primary +
START TIME: 2023-07-26 08:51:28 JST +
LABEL: backup +
START TIMELINE: 1 +
","16724 /PGDATA/datadg/tbs1 +
16725 /PGDATA/datadg/tbs2 +
")
(1 row)I read the documentation on this page https://www.postgresql.org/docs/current/functions-admin.htmlThe desired contents of the backup label file and the tablespace map file are returned as part of the result of the function and must be written to files in the backup area.I don't understand, "must be written to files in the backup area". Does it mean we need to manually create a file first on the master node and then rsync it to the backup node?ThanksOn Tue, Jul 25, 2023 at 9:22 PM Stephen Frost <sfrost@snowman.net> wrote:Greetings,
* Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote:
> I recently set up a Postgres15 master-slave (Primary-secondary) cluster on
> 2 nodes. At the time of starting, I ensured that data files are exactly the
> same on both the nodes. The size of the DB is 1.5TB.
> The directory structure on both nodes looks as follows:
> ```
> - /PROD/datadg/
> |
> |-> /PROD/datadg/tablespace
> | |-> /PROD/datadg/tablespace/tablespace1
> | |-> /PROD/datadg/tablespace/tablespace2
> |
> |-> /PROD/datadg/data
> | |-> /PROD/datadg/data/pg_tblspc
> | | | -> /PROD/datadg/data/pg_tblspc/16432 ->
> /PROD/datadg/tablespace/tablespace1
> | | | -> /PROD/datadg/data/pg_tblspc/16433 ->
> /PROD/datadg/tablespace/tablespace2
>
> ```
> Almost a week later now, I see almost a 2GB size difference in the
> tablespace folders on the 2 nodes. I also see some file count difference on
> both the nodes.
Not really enough to go on here.
> `autovacuum` is on on both the nodes and there aren't any `idle in
> transaction` queries on the slave node. Also there hasn't been any
> disruption on the streaming replication. I did not get any error like `WAL
> segment already removed` or so on the slave node. `pg_stat_replication` on
> the master node also doesn't show anything out of the ordinary and the
> `sent_lsn`, `write_lsn` and `flush_lsn` are regularly updated. I can not
> see a difference in counts of most tables either, haven't verified for all
> of them.
> So my **first question** is:
>
> * Why is there a difference in the files in the tablespace folder? I can
> understand the difference in the modification timestamps, but some files
> are just missing on the slave node.
Unlogged tables would certainly be a pretty easy explanation of size
differences between the two. There's also temporary files that might be
created on one system but not the other for in-progress queries. These
aren't the only possibilities but just a couple of likely candidates.
> Now if I were to run `vacuumdb` on the master node, there are chances that
> the slave node will break and give an error like this
> ```
> PANIC,XX000,"WAL contains references to invalid pages",,,,,"WAL redo at
> 875E/21A70BD0 for
> Heap2/VISIBLE: cutoff xid 60350476 flags 0x01; blkref #0: rel
> 16405/16419/533716933, fork 2, blk 26; blkref #1: rel
> 16405/16419/533716933, blk 853758",,,,"","startup
> ```
Did you actually run vacuumdb and actually see this? Are you able to
reproduce it?
> In the case when slave node breaks, these are the steps I usually do to
> bring the slave node back:
> 1) Start `pg_backup_start('backup')` on the master node
> 2) rsync the files from master to slave by running the following on the
> slave node:
> ```
> rsync -av --delete master_node:/PROD/datadg/data/ /PROD/datadg/data
> --exclude 'pg_log' --exclude 'pg_replslot'
> ```
> 3. Stop `pg_backup_stop()` on master node
This really isn't good enough as rsync will use timestamp/size by
default and this also doesn't grab and restore the absolutely required
backup_label file that's returned from pg_backup_stop(). Basically,
this is not a valid way to perform a backup/restore of PG. Please go
read the documentation for how to use the low-level API ... or, even
better, use an existing well maintained backup/restore tool for this.
> 4. Start the slave node again and it usually works, even though the
> tablespace files might not still be the same.
If you didn't grab the backup_label from pg_backup_stop() and drop it
into place on the replica, then you've almost certainly got a corrupt
system.
> **Second question**:
>
> * What is the best way to bring the slave node back? Is the `rsync` between
> tablespaces required? And if yes, what is the best method to do it for very
> large databases, something maybe as big as 30TB or more. I don't want to
> `rsync` all the files even if the timestamp on them is different. So is a
> command like this safe to do? Or should an option like `--checksum` be
> used?
> ```
> rsync -av --delete master_node:/PROD/datadg/tablespace/
> /PROD/datadg/tablespace --size-only
> ```
Using --size-only is an absolutely horrid idea and will miss lots of
changes. What you're doing isn't working because you're just creating
corrupt replicas by not following the process correctly for performing a
backup and restore of PG.
This stuff isn't easy. Use a tool for it, especially for larger
systems.
> **Third question:**
>
> * Is it advised to run `vacuumdb` before or after bringing the slave node
> back again?
No, this isn't necessary.
Thanks,
Stephen
This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).
Greetings, * Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote: > > Basically, this is not a valid way to perform a backup/restore of PG. > > Is it not valid only for PG 15 or even for earlier versions? I have always > referred to this https://www.postgresql.org/docs/8.1/backup-online.html For earlier versions too. > > better, use an existing well maintained backup/restore tool for this > > Is there any tool you could recommend? A tool to bring back the standby > node when it was down for a day or so for some OS/firmware upgrade. > I have tried pgBackRest, but it requires a control node. So anything that > doesn't involve a 3rd server for this situation would be helpful. pgBackRest doesn't actually require a control node, though it's certainly recommended to have one. Also, if you use physical replication slots, you can have the primary hold on to the WAL necessary for the replica to catch back up until the replica is back and then it can just fetch the WAL from the primary and catch up without any kind of re-sync being necessary. If you're worried about running out of space on the primary for this (which is an entirely valid concern...) then you could ship the WAL files to an archive system or location using pgBackRest or possibly some other tool and then have the replica fetch the WAL from there once it's back online. pgBackRest also has the ability to do incremental backups and delta restores. An incremental backup will only store the data that's changed since the prior backup, minimizing the storage space in the repo and the time required for the backup. A delta restore will only restore those files on the replica which are different from what was in the backup and that can be done using timestamp/file-size because pgBackRest tracks that information and will set it on restore. pgBackRest does also have an option to do checksum-based restores, which it will automatically use if anything looks odd regarding the timestamps. > > this also doesn't grab and restore the absolutely required > > backup_label file that's returned from pg_backup_stop() > > I tried running pg_backup_start('backup') and pg_backup_stop() on my DB, > but I did not see any backup_label_file being created. > psql (15.1) > Type "help" for help. > > postgres=# select pg_backup_start('backup'); > pg_backup_start > ----------------- > 68/32000028 > (1 row) > > postgres=# select pg_backup_stop(); > NOTICE: all required WAL segments have been archived > pg_backup_stop > ------------------------------------------------------------------------------- > (68/32000100,"START WAL LOCATION: 68/32000028 (file > 000000010000006800000032)+ > CHECKPOINT LOCATION: 68/32000060 > + > BACKUP METHOD: streamed > + > BACKUP FROM: primary > + > START TIME: 2023-07-26 08:51:28 JST > + > LABEL: backup > + > START TIMELINE: 1 > + > ","16724 /PGDATA/datadg/tbs1 > + > 16725 /PGDATA/datadg/tbs2 > + > ") > (1 row) > > I read the documentation on this page > https://www.postgresql.org/docs/current/functions-admin.html What's returned from pg_backup_stop() is the backup_label that needs to be stored with the files which were part of the backup. Note that you should *not* store the backup_label in the data directory of the primary because if the system crashes then it won't come back up without someone going in and removing that file. That's how the old exclusive method worked which was deprecated and then finally removed because of this issue. > > The desired contents of the backup label file and the tablespace map file > > are returned as part of the result of the function and must be written to > > files in the backup area. > > I don't understand, "*must be written to files in the backup area*". Does > it mean we need to manually create a file first on the master node and then > rsync it to the backup node? No, don't do that. You should back up all the files and then store the backup_label with those files. Do *not* put a backup_label into the data directory on the primary. * Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote: > I got the latest documentation > <https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP>and > understood that I was supposed to run > `select * from pg_backup_stop();` and store the labelfile output to a file > in the data directory of the secondary node. The backup_label should be stored with the backed up files and then restored as part of restoring the backup, yes. > I also understood that this Low-level API backup is not really > a recommended way and I need to move to another method. I would recommend that, yes. > However, till I fully set up a tool like pgBackRest (of which you are one > of the contributors - very impressive!) for PROD, can you please tell me: > 1) Are there any other options that are safe to use yet fast? Like you said > rsync --size-only would miss a lot of changes, but anything that would also > not start syncing every data file with only different timestamp, even > though it is exactly the same, including the checksum. Re-syncing a replica due to it being offline for OS patching or such really isn't typically the approach to use- better to have a WAL archive and just let the replica catch up by playing the WAL forward. A re-sync might make sense if the replica becomes very far out of date such that there's a huge backlog of WAL or if you're generating a huge amount of WAL that WAL replay just can't keep up, but I'm guessing neither of those are really the case here. If they are though, increental backups and delta restores with pgBackRest is an approach that has worked for high-throughput systems like those. > 2) While rsyncing from the master node, do I need to exclude `pg_wal` > folder? Along with a bunch of other ones, as discussed in the documentation. > 3) The replica/standby node should have access to the WAL segments archived > during time between start and stop backup, right? To perform any recovery of an online backup, all of the WAL generated during the backup (between the pg_backup_start and pg_backup_stop) must be preserved and made available via the restore_command for replay. > 4) What kind of data corruption would have been introduced due to the > absence of labelfile and if there is a way to identify the corruption? So > for example, if I am able to reindex all the tables, select all the data > and vacuum the db, do I still need to pg_dump and pg_restore those DBs? Any kind. Even taking those steps wouldn't provide a guarantee that there's no corruption. Doing a pg_dump/pg_restore would at least help you find any data that is invalid according to the constraints which are set in the database and that's a good first step but there could still be other corruption. Note that the corruption being discussed here would only be on the replica, because that's where the writes that happened between the backup start and the backup end got lost due to the lack of a backup label. Of course, if you've failed over to the replica, then you'd end up with the now-primary having that corruption. Thanks, Stephen
Attachment
Hi Stephen
Thank you for the very detailed reply. I tried the above method and it works. I'm still setting up pgBackRest.
I have one last question if you can answer that too please.
A delta restore will only restore those
files on the replica which are different from what was in the backup and
that can be done using timestamp/file-size because pgBackRest tracks
that information and will set it on restore.
As per my understanding, primary and standby nodes can have exactly the same data, with no data corruption, but still have different timestamps on the data files in the tablespace folder. Please correct me if I am wrong.
If that's the case, then will the backup (incremental/delta) taken from the primary node on pgBackRest, restore the files on the secondary node, just because they have a different timestamp? Or does pgBackRest have some mechanism to detect this and skip those files? Please assume for this case that we ran out of capacity to store the WAL segments from the primary while the secondary node was down.
Thanks
On Wed, 26 Jul 2023, 22:55 Stephen Frost, <sfrost@snowman.net> wrote:
Greetings,
* Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote:
> > Basically, this is not a valid way to perform a backup/restore of PG.
>
> Is it not valid only for PG 15 or even for earlier versions? I have always
> referred to this https://www.postgresql.org/docs/8.1/backup-online.html
For earlier versions too.
> > better, use an existing well maintained backup/restore tool for this
>
> Is there any tool you could recommend? A tool to bring back the standby
> node when it was down for a day or so for some OS/firmware upgrade.
> I have tried pgBackRest, but it requires a control node. So anything that
> doesn't involve a 3rd server for this situation would be helpful.
pgBackRest doesn't actually require a control node, though it's
certainly recommended to have one.
Also, if you use physical replication slots, you can have the primary
hold on to the WAL necessary for the replica to catch back up until the
replica is back and then it can just fetch the WAL from the primary and
catch up without any kind of re-sync being necessary.
If you're worried about running out of space on the primary for this
(which is an entirely valid concern...) then you could ship the WAL
files to an archive system or location using pgBackRest or possibly some
other tool and then have the replica fetch the WAL from there once it's
back online.
pgBackRest also has the ability to do incremental backups and delta
restores. An incremental backup will only store the data that's changed
since the prior backup, minimizing the storage space in the repo and the
time required for the backup. A delta restore will only restore those
files on the replica which are different from what was in the backup and
that can be done using timestamp/file-size because pgBackRest tracks
that information and will set it on restore. pgBackRest does also have
an option to do checksum-based restores, which it will automatically use
if anything looks odd regarding the timestamps.
> > this also doesn't grab and restore the absolutely required
> > backup_label file that's returned from pg_backup_stop()
>
> I tried running pg_backup_start('backup') and pg_backup_stop() on my DB,
> but I did not see any backup_label_file being created.
> psql (15.1)
> Type "help" for help.
>
> postgres=# select pg_backup_start('backup');
> pg_backup_start
> -----------------
> 68/32000028
> (1 row)
>
> postgres=# select pg_backup_stop();
> NOTICE: all required WAL segments have been archived
> pg_backup_stop
> -------------------------------------------------------------------------------
> (68/32000100,"START WAL LOCATION: 68/32000028 (file
> 000000010000006800000032)+
> CHECKPOINT LOCATION: 68/32000060
> +
> BACKUP METHOD: streamed
> +
> BACKUP FROM: primary
> +
> START TIME: 2023-07-26 08:51:28 JST
> +
> LABEL: backup
> +
> START TIMELINE: 1
> +
> ","16724 /PGDATA/datadg/tbs1
> +
> 16725 /PGDATA/datadg/tbs2
> +
> ")
> (1 row)
>
> I read the documentation on this page
> https://www.postgresql.org/docs/current/functions-admin.html
What's returned from pg_backup_stop() is the backup_label that needs to
be stored with the files which were part of the backup. Note that you
should *not* store the backup_label in the data directory of the primary
because if the system crashes then it won't come back up without someone
going in and removing that file. That's how the old exclusive method
worked which was deprecated and then finally removed because of this
issue.
> > The desired contents of the backup label file and the tablespace map file
> > are returned as part of the result of the function and must be written to
> > files in the backup area.
>
> I don't understand, "*must be written to files in the backup area*". Does
> it mean we need to manually create a file first on the master node and then
> rsync it to the backup node?
No, don't do that. You should back up all the files and then store the
backup_label with those files. Do *not* put a backup_label into the
data directory on the primary.
* Abhishek Bhola (abhishek.bhola@japannext.co.jp) wrote:
> I got the latest documentation
> <https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP>and
> understood that I was supposed to run
> `select * from pg_backup_stop();` and store the labelfile output to a file
> in the data directory of the secondary node.
The backup_label should be stored with the backed up files and then
restored as part of restoring the backup, yes.
> I also understood that this Low-level API backup is not really
> a recommended way and I need to move to another method.
I would recommend that, yes.
> However, till I fully set up a tool like pgBackRest (of which you are one
> of the contributors - very impressive!) for PROD, can you please tell me:
> 1) Are there any other options that are safe to use yet fast? Like you said
> rsync --size-only would miss a lot of changes, but anything that would also
> not start syncing every data file with only different timestamp, even
> though it is exactly the same, including the checksum.
Re-syncing a replica due to it being offline for OS patching or such
really isn't typically the approach to use- better to have a WAL archive
and just let the replica catch up by playing the WAL forward. A re-sync
might make sense if the replica becomes very far out of date such that
there's a huge backlog of WAL or if you're generating a huge amount of
WAL that WAL replay just can't keep up, but I'm guessing neither of
those are really the case here. If they are though, increental backups
and delta restores with pgBackRest is an approach that has worked for
high-throughput systems like those.
> 2) While rsyncing from the master node, do I need to exclude `pg_wal`
> folder?
Along with a bunch of other ones, as discussed in the documentation.
> 3) The replica/standby node should have access to the WAL segments archived
> during time between start and stop backup, right?
To perform any recovery of an online backup, all of the WAL generated
during the backup (between the pg_backup_start and pg_backup_stop) must
be preserved and made available via the restore_command for replay.
> 4) What kind of data corruption would have been introduced due to the
> absence of labelfile and if there is a way to identify the corruption? So
> for example, if I am able to reindex all the tables, select all the data
> and vacuum the db, do I still need to pg_dump and pg_restore those DBs?
Any kind. Even taking those steps wouldn't provide a guarantee that
there's no corruption. Doing a pg_dump/pg_restore would at least help
you find any data that is invalid according to the constraints which are
set in the database and that's a good first step but there could still
be other corruption.
Note that the corruption being discussed here would only be on the
replica, because that's where the writes that happened between the
backup start and the backup end got lost due to the lack of a backup
label. Of course, if you've failed over to the replica, then you'd end
up with the now-primary having that corruption.
Thanks,
Stephen
This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).
Greetings,
On Wed, Jul 26, 2023 at 21:31 Abhishek Bhola <abhishek.bhola@japannext.co.jp> wrote:
Thank you for the very detailed reply. I tried the above method and it works. I'm still setting up pgBackRest.I have one last question if you can answer that too please.A delta restore will only restore those
files on the replica which are different from what was in the backup and
that can be done using timestamp/file-size because pgBackRest tracks
that information and will set it on restore.As per my understanding, primary and standby nodes can have exactly the same data, with no data corruption, but still have different timestamps on the data files in the tablespace folder. Please correct me if I am wrong.If that's the case, then will the backup (incremental/delta) taken from the primary node on pgBackRest, restore the files on the secondary node, just because they have a different timestamp? Or does pgBackRest have some mechanism to detect this and skip those files? Please assume for this case that we ran out of capacity to store the WAL segments from the primary while the secondary node was down.
When pgbackrest does a restore, it will also set the timestamps for the files that it restores to what the timestamp was in the manifest. This allows us to detect if those files were changed since the restore happened. If they’ve not changed since the restore, then we skip checking them to see if they need to be restored from the repository.
Thanks,
Stephen