BUG #17846: pg_dump doesn't properly dump with paused WAL replay - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17846: pg_dump doesn't properly dump with paused WAL replay
Date
Msg-id 17846-1a0e5ce976f4c01a@postgresql.org
Whole thread Raw
Responses Re: BUG #17846: pg_dump doesn't properly dump with paused WAL replay  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17846
Logged by:          Francisco Reinolds
Email address:      francisco.reinolds@channable.com
PostgreSQL version: 13.8
Operating system:   Ubuntu 22.04.2 LTS
Description:

In our setup, we have primary and secondary databases, taking advantage of
physical streaming replication to keep the secondary up to date, so that we
can perform backups from it, lessening the load on the primary.

For backups, we use pg_dump to perform a full database dump. Before we start
a backup, we pause the WAL replay on the secondary, unpausing it after it is
concluded. This was done since we previously encountered problems with
pg_dump failing when an AccessExclusiveLock was held on a table that pg_dump
was going to dump.

For some time we faced no problems with this setup, but starting some months
ago, we started witnessing sporadic failures when we attempted to restore
the dumps of one of our databases, to verify the dump's integrity. These
restore failures would occur due to a key not being present in a table:


```
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 3279; 2606 203326243 FK CONSTRAINT table_2
table_2_table_1_id_fkey realtime_alembic
pg_restore: error: could not execute query: ERROR:  insert or update on
table "table_2" violates foreign key constraint "table_2_table_1_id_fkey"
DETAIL:  Key (table_1_id)=(34553066) is not present in table "table_1".
Command was: ALTER TABLE ONLY public.table_2
    ADD CONSTRAINT table_2_table_1_id_fkey FOREIGN KEY (table_1_id) REFERENCES
public.table_1(id);
```


This has been puzzling us for some time since:
- this is the only database of our infrastructure where this occurs, despite
this setup being replicated across it;
- the failures started occurring sporadically, but have become more
consistent;

We have managed, with some help from the Postgres IRC channel (special
thanks to user nickb), to work around the problem. The solution was to begin
a transaction, and extract a snapshot that'd be passed as a pg_dump
argument, and only then pause WAL replay. From our understanding, pg_dump
should already implicitly pick a suitable point to start the dump but it
apparently is not the case, hence the bug report.

Thanks for the attention,

Francisco


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #17845: insert into on conflict bug .
Next
From: hubert depesz lubaczewski
Date:
Subject: Re: pg_read_server_files doesn't let me use pg_ls_dir() or pg_read_file?