Re: Backup failure Postgres - Mailing list pgsql-general
From:           Torsten Förtsch
Subject:        Re: Backup failure Postgres
Date:
Msg-id:         CAKkG4_nK1DYU=DTe+hb-TQnS5+05hJ1MXTf5QNKJSfsLrZ-7iA@mail.gmail.com
In response to: Re: Backup failure Postgres (Jethish Jethish <jethish777@gmail.com>)
List:           pgsql-general
Look, you have to compromise somewhere. Let me explain the problem. PG uses MVCC. That means when you update or delete rows, the old row versions are not modified in place or returned to free space right away. They are just marked for later removal. The actual removal is VACUUM's task. The reason for doing it this way is that a concurrent transaction might still need to see the modified or deleted rows. Now vacuum comes along and wants to actually return that space to free space. On the master that works fine because the master knows all concurrent transactions and what they might still need to see. So, vacuum will simply skip those rows.
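Just to make that concrete, here is a small sketch (I'm only borrowing your botsession table from the error message; any table works). The dead row versions pile up in the statistics on the master until vacuum reclaims them:

    -- run on the master: dead row versions accumulate until vacuum removes them
    SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
      FROM pg_stat_user_tables
     WHERE relname = 'botsession';

    VACUUM public.botsession;  -- marks the space of dead row versions as reusable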
However, that knowledge does not extend to the replica. The master does not know which transactions are running on the replica. So a vacuum operation on the master might remove something that is still needed on the replica. Now, that modification made by vacuum also needs to be replayed on the replica. The way that works is that every modification, be it an insert, a vacuum cleanup or any other change, is appended in sequential order to a log (the write-ahead log, or WAL). This log is then simply shipped to the replica and replayed.
It's not difficult to understand that these changes must be replayed in the same sequential order. Otherwise you get chaos. Now imagine a vacuum record arriving at the replica which removes stuff that is still needed by a transaction running on the replica, like your COPY. Now the replica has 2 choices:
- abort the transaction and prefer replaying WAL
- pause replaying WAL and wait for the long running transaction
The 1st choice is obviously bad for the transaction. The 2nd choice is bad for everybody else because WAL can be replayed only in the same order as it was generated. So, nothing that happened after that vacuum can be replayed, which leads to stale data on the replica.
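You can watch that happen on the replica with roughly something like this: when replay is held back, the replay position stops advancing while new WAL keeps arriving.

    -- run on the replica
    SELECT pg_last_wal_receive_lsn() AS received,
           pg_last_wal_replay_lsn()  AS replayed,
           now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;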
One way to mitigate this is hot_standby_feedback. That way the replica tells the master from time to time which old row versions it still needs to see. The drawback is that your tables on the master might accumulate garbage that vacuum would normally have removed earlier. That can affect query performance.
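Turning it on is just a config change on the replica, roughly:

    ALTER SYSTEM SET hot_standby_feedback = on;
    SELECT pg_reload_conf();  -- reloadable, no restart needed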
Then you have the option to pause WAL replay in one way or another. max_standby_streaming_delay, disconnecting from the master, or explicitly pausing replay all fall into that category.
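Rough sketches of that category, all run on the replica (the 30min value is just an example, -1 means wait forever):

    ALTER SYSTEM SET max_standby_streaming_delay = '30min';
    SELECT pg_reload_conf();

    -- or pause replay explicitly around the long-running job
    SELECT pg_wal_replay_pause();
    -- ... run the COPY / pg_dump here ...
    SELECT pg_wal_replay_resume();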
The last option I know of would be to use logical replication. That comes with other problems. DDL becomes a bit finicky. Initial setup can be tricky. The process applying the changes can become a bottleneck.
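For completeness, the bare bones of a logical setup look roughly like this. The names and the connection string are placeholders, and the publisher needs wal_level = logical:

    -- on the master
    CREATE PUBLICATION dump_pub FOR ALL TABLES;

    -- on the logical replica
    CREATE SUBSCRIPTION dump_sub
      CONNECTION 'host=master-host dbname=yourdb user=repl'
      PUBLICATION dump_pub;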
If you are really time-critical and you just want the COPY job to be done and neither lag nor bloat is acceptable, then maybe you create another streaming replica, disconnect it from the master, run your COPY job and destroy the replica. If 3TB is the database size, then that does not look insurmountable. Of course, you need the resources. In my environment I'd estimate 3-4 hours.
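Disconnecting such a throwaway replica can be as simple as this sketch (primary_conninfo is reloadable since PG 13, older versions need a restart):

    -- on the throwaway replica
    ALTER SYSTEM SET primary_conninfo = '';
    SELECT pg_reload_conf();
    -- run the COPY / pg_dump, then simply destroy the instance afterwards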
If you want a simple solution, then try hot_standby_feedback.
On Thu, May 23, 2024 at 12:46 PM Jethish Jethish <jethish777@gmail.com> wrote:
Hi Torsten,

I have tried by increasing the max_standby_streaming_delay but I'm facing lag issues on the replica server. When i increase the max_standby_streaming_delay even if a query runs for 2 minutes I'm facing lag issues for 2 minutes. Please suggest here.

Data size is 3TB

On Thu, May 23, 2024, 3:53 PM Torsten Förtsch <tfoertsch123@gmail.com> wrote:

As the error message says, your query was aborted due to it conflicting with recovery. There are many ways to deal with that. You could enable hot_standby_feedback on the replica. You could disconnect the replica from the master for the time the COPY takes (reset primary_conninfo). You could increase max_standby_streaming_delay. Perhaps you could also wrap the COPY operation in pg_wal_replay_pause() / pg_wal_replay_resume().

On Thu, May 23, 2024 at 11:59 AM Jethish Jethish <jethish777@gmail.com> wrote:

I'm frequently facing the below error while performing backup. Someone please tell how solve this issues.

Failed : pg_dump: error: Dumping the contents of table "botsession" failed: PQgetResult() failed.
pg_dump: error: Error message from server: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
pg_dump: error: The command was: COPY public.botsession (id, userid, data, iscompressed) TO stdout;