Thread: 9.0 standby - could not open file global/XXXXX

9.0 standby - could not open file global/XXXXX

From
Filip Rembiałkowski
Date:
Hi.

There is a large (>5T) database on PostgreSQL 9.0.23.

I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL:  could not open file "global/11819":
No such file or directory

I guessed the OID refereds to the pg_authid, but other system tables
might be affected too.

What could be wrong here?

Thanks!


Re: 9.0 standby - could not open file global/XXXXX

From
Stephen Frost
Date:
Greetings,

* Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote:
> There is a large (>5T) database on PostgreSQL 9.0.23.

First off, I hope you understand that 9.0 has been *long* out of
support and that you *really* need to upgrade to a supported version of
PostgreSQL (9.4 and up these days...).

> I would like to setup new WAL-shipping standby.
> https://www.postgresql.org/docs/9.0/warm-standby.html
>
> On my way I find unexpected issues. Here's the story, in short:
>
> 1. WAL archiving to remote archive is setup & verified
>
> 2. base backup is transferred directly to new server using
> pg_start_backup + rsync + pg_stop_backup.
>
> 3. recovery.conf is created
>
> 4. Server is started and consumes all the remaining WAL segments
> accumulated in the archive - finishing with optimistic message LOG:
> consistent recovery state reached at 9FC1/112BEE10.
>
> 5. When I go to postgres on the standby and try to connect system
> "postgres" database psql: FATAL:  could not open file "global/11819":
> No such file or directory

That seems pretty odd- does that file exist on the existing database
system..?

Thanks!

Stephen

Attachment

Re: 9.0 standby - could not open file global/XXXXX

From
Andres Freund
Date:
Hi,

On 2019-02-25 20:06:42 +0100, Filip Rembiałkowski wrote:
> There is a large (>5T) database on PostgreSQL 9.0.23.

As Stephen said, this is long out of support.


> 2. base backup is transferred directly to new server using
> pg_start_backup + rsync + pg_stop_backup.

Do you exclude any files? Remove any? If so which?

Greetings,

Andres Freund


Re: 9.0 standby - could not open file global/XXXXX

From
Filip Rembiałkowski
Date:
On Mon, Feb 25, 2019 at 11:45 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Filip Rembiałkowski (filip.rembialkowski@gmail.com) wrote:
> > There is a large (>5T) database on PostgreSQL 9.0.23.
>
> First off, I hope you understand that 9.0 has been *long* out of
> support and that you *really* need to upgrade to a supported version of
> PostgreSQL (9.4 and up these days...).

Yes I do. Unfortunately it was not possible.

>> > I would like to setup new WAL-shipping standby.
> > https://www.postgresql.org/docs/9.0/warm-standby.html
> >
> > On my way I find unexpected issues. Here's the story, in short:
> >
> > 1. WAL archiving to remote archive is setup & verified
> >
> > 2. base backup is transferred directly to new server using
> > pg_start_backup + rsync + pg_stop_backup.
> >
> > 3. recovery.conf is created
> >
> > 4. Server is started and consumes all the remaining WAL segments
> > accumulated in the archive - finishing with optimistic message LOG:
> > consistent recovery state reached at 9FC1/112BEE10.
> >
> > 5. When I go to postgres on the standby and try to connect system
> > "postgres" database psql: FATAL:  could not open file "global/11819":
> > No such file or directory
>
> That seems pretty odd- does that file exist on the existing database
> system..?

No. "global/11819" does not exist on the primary. That's what makes it
a mystery for me.

At first, I was thinking - maybe OIDs of some objects in pg_global
tablespace were "rotated" (VACUUM FULL, etc) during the rsync.
But it seems unlikely.


Re: 9.0 standby - could not open file global/XXXXX

From
Filip Rembiałkowski
Date:
On Tue, Feb 26, 2019 at 2:39 AM Andres Freund <andres@anarazel.de> wrote:
>
> > 2. base backup is transferred directly to new server using
> > pg_start_backup + rsync + pg_stop_backup.
>

I excluded contents of pg_xlog only. Exact command was:

# start script
psql -Xc "select pg_start_backup('mirror to $standby', true)"
( cd $PGDATA \
  && find . -type d \( -path ./pg_xlog \) -prune -o -type f -print \
  | xargs -n $nargs \
  | xargs -P $njobs -I {} bash -c "while ! rsync -a -u -z -R --inplace
{} $standby:$PGDATA; do echo retrying; done"
)
psql -Xc "select pg_stop_backup()"
# end script

Actually I am now in the process of repeating this with
single-threaded rsync - and I'll see if same issue happens.
This is what I'm going to use:

rsync -a -z --inplace $PGDATA/ $standby:$PGDATA/


But it's unlikely to change anything. Still no idea what could be the
root cause.


Thanks!


Re: 9.0 standby - could not open file global/XXXXX

From
Filip Rembiałkowski
Date:

OK I have it fixed;; just for anyone who's interested - the error was in the base backup procedure.
When switched to plain "rsync -az" - it  works like a charm.

Most probably, the fault was I assumed that you can use the rsync --update option when doing base backup.
You cannot, especially when time sync on both servers is not accurate. In my case, destination server clock was few minutes in future.
So the pg_clog was broken due to this. Which means a completely corrupted database.

thanks Stephen & Andres for your responses.



On Mon, Feb 25, 2019 at 8:06 PM Filip Rembiałkowski <filip.rembialkowski@gmail.com> wrote:
Hi.

There is a large (>5T) database on PostgreSQL 9.0.23.

I would like to setup new WAL-shipping standby.
https://www.postgresql.org/docs/9.0/warm-standby.html

On my way I find unexpected issues. Here's the story, in short:

1. WAL archiving to remote archive is setup & verified

2. base backup is transferred directly to new server using
pg_start_backup + rsync + pg_stop_backup.

3. recovery.conf is created

4. Server is started and consumes all the remaining WAL segments
accumulated in the archive - finishing with optimistic message LOG:
consistent recovery state reached at 9FC1/112BEE10.

5. When I go to postgres on the standby and try to connect system
"postgres" database psql: FATAL:  could not open file "global/11819":
No such file or directory

I guessed the OID refereds to the pg_authid, but other system tables
might be affected too.

What could be wrong here?

Thanks!