Thread: 9.0 standby - could not open file global/XXXXX
Hi. There is a large (>5T) database on PostgreSQL 9.0.23. I would like to setup new WAL-shipping standby. https://www.postgresql.org/docs/9.0/warm-standby.html On my way I find unexpected issues. Here's the story, in short: 1. WAL archiving to remote archive is setup & verified 2. base backup is transferred directly to new server using pg_start_backup + rsync + pg_stop_backup. 3. recovery.conf is created 4. Server is started and consumes all the remaining WAL segments accumulated in the archive - finishing with optimistic message LOG: consistent recovery state reached at 9FC1/112BEE10. 5. When I go to postgres on the standby and try to connect system "postgres" database psql: FATAL: could not open file "global/11819": No such file or directory I guessed the OID refereds to the pg_authid, but other system tables might be affected too. What could be wrong here? Thanks!
Greetings, * Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote: > There is a large (>5T) database on PostgreSQL 9.0.23. First off, I hope you understand that 9.0 has been *long* out of support and that you *really* need to upgrade to a supported version of PostgreSQL (9.4 and up these days...). > I would like to setup new WAL-shipping standby. > https://www.postgresql.org/docs/9.0/warm-standby.html > > On my way I find unexpected issues. Here's the story, in short: > > 1. WAL archiving to remote archive is setup & verified > > 2. base backup is transferred directly to new server using > pg_start_backup + rsync + pg_stop_backup. > > 3. recovery.conf is created > > 4. Server is started and consumes all the remaining WAL segments > accumulated in the archive - finishing with optimistic message LOG: > consistent recovery state reached at 9FC1/112BEE10. > > 5. When I go to postgres on the standby and try to connect system > "postgres" database psql: FATAL: could not open file "global/11819": > No such file or directory That seems pretty odd- does that file exist on the existing database system..? Thanks! Stephen
Attachment
Hi, On 2019-02-25 20:06:42 +0100, Filip Rembiałkowski wrote: > There is a large (>5T) database on PostgreSQL 9.0.23. As Stephen said, this is long out of support. > 2. base backup is transferred directly to new server using > pg_start_backup + rsync + pg_stop_backup. Do you exclude any files? Remove any? If so which? Greetings, Andres Freund
On Mon, Feb 25, 2019 at 11:45 PM Stephen Frost <sfrost@snowman.net> wrote: > > Greetings, > > * Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote: > > There is a large (>5T) database on PostgreSQL 9.0.23. > > First off, I hope you understand that 9.0 has been *long* out of > support and that you *really* need to upgrade to a supported version of > PostgreSQL (9.4 and up these days...). Yes I do. Unfortunately it was not possible. >> > I would like to setup new WAL-shipping standby. > > https://www.postgresql.org/docs/9.0/warm-standby.html > > > > On my way I find unexpected issues. Here's the story, in short: > > > > 1. WAL archiving to remote archive is setup & verified > > > > 2. base backup is transferred directly to new server using > > pg_start_backup + rsync + pg_stop_backup. > > > > 3. recovery.conf is created > > > > 4. Server is started and consumes all the remaining WAL segments > > accumulated in the archive - finishing with optimistic message LOG: > > consistent recovery state reached at 9FC1/112BEE10. > > > > 5. When I go to postgres on the standby and try to connect system > > "postgres" database psql: FATAL: could not open file "global/11819": > > No such file or directory > > That seems pretty odd- does that file exist on the existing database > system..? No. "global/11819" does not exist on the primary. That's what makes it a mystery for me. At first, I was thinking - maybe OIDs of some objects in pg_global tablespace were "rotated" (VACUUM FULL, etc) during the rsync. But it seems unlikely.
On Tue, Feb 26, 2019 at 2:39 AM Andres Freund <andres@anarazel.de> wrote: > > > 2. base backup is transferred directly to new server using > > pg_start_backup + rsync + pg_stop_backup. > I excluded contents of pg_xlog only. Exact command was: # start script psql -Xc "select pg_start_backup('mirror to $standby', true)" ( cd $PGDATA \ && find . -type d \( -path ./pg_xlog \) -prune -o -type f -print \ | xargs -n $nargs \ | xargs -P $njobs -I {} bash -c "while ! rsync -a -u -z -R --inplace {} $standby:$PGDATA; do echo retrying; done" ) psql -Xc "select pg_stop_backup()" # end script Actually I am now in the process of repeating this with single-threaded rsync - and I'll see if same issue happens. This is what I'm going to use: rsync -a -z --inplace $PGDATA/ $standby:$PGDATA/ But it's unlikely to change anything. Still no idea what could be the root cause. Thanks!
OK I have it fixed;; just for anyone who's interested - the error was in the base backup procedure.
When switched to plain "rsync -az" - it works like a charm.
Most probably, the fault was I assumed that you can use the rsync --update option when doing base backup.
You cannot, especially when time sync on both servers is not accurate. In my case, destination server clock was few minutes in future.
So the pg_clog was broken due to this. Which means a completely corrupted database.
thanks Stephen & Andres for your responses.
On Mon, Feb 25, 2019 at 8:06 PM Filip Rembiałkowski <filip.rembialkowski@gmail.com> wrote:
Hi.
There is a large (>5T) database on PostgreSQL 9.0.23.
I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html
On my way I find unexpected issues. Here's the story, in short:
1. WAL archiving to remote archive is setup & verified
2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.
3. recovery.conf is created
4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.
5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL: could not open file "global/11819":
No such file or directory
I guessed the OID refereds to the pg_authid, but other system tables
might be affected too.
What could be wrong here?
Thanks!