Home > mailing lists

Re: Trying to handle db corruption 9.6 - Mailing list pgsql-performance

From	Mariel Cherkassky
Subject	Re: Trying to handle db corruption 9.6
Date	May 21, 2019 13:12:57
Msg-id	CA+t6e1m61Q_0+WwJLc5fDc0z4YwcHamEFTN2oRY+fqS2pxEUyQ@mail.gmail.com Whole thread
In response to	Re: Trying to handle db corruption 9.6 (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List	pgsql-performance

Tree view

Tomas - Well, when I run the restore_command manually it works (archive dir exists on the secondary..). Thank for the explanation on the system catalogs..

Greg - My restore command copy the wals from archive dir in the primary to an archive dir in the secondary(different from the pg_xlog in the secondary). Should I run it manually and see if the archives are copied to the archive dir in the secondary or should I just copy all of them to the xlog dir in the secondary ?

I tried to start the secondary as a primary (I have a backup..) but I still got an error (invalid checkpoint record from primary./ secondary). Does it means that my backup is corrupted ?

‫בתאריך יום ג׳, 21 במאי 2019 ב-16:07 מאת ‪Tomas Vondra‬‏ <‪tomas.vondra@2ndquadrant.com‬‏>:‬

On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
>Tomas :
>
>Well, when you say it does not work, why do you think so? Does it print
>some error, or what? Does it even get executed? It does not seem to be
>the case, judging by the log (there's no archive_command message).
>
>How was the "secondary machine" created? You said you have all the WAL
>since then - how do you know that?
>
>Well, when I start the secondary in recovery mode (the primary is down,
>auto failover is disabled..) it doesnt start recovering the archive wals
>from the primary. The logs of the secondary :
>receiving incremental file list
>rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
>file or directory (2)
>
>sent 8 bytes received 10 bytes 36.00 bytes/sec
>total size is 0 speedup is 0.00
>rsync error: some files/attrs were not transferred (see previous errors)
>(code 23) at main.c(1505) [receiver=3.0.6]
>sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
>2019-05-20 09:41:33 PDT 18558 LOG: entering standby mode
>2019-05-20 09:41:33 PDT 18558 LOG: invalid primary checkpoint record
>2019-05-20 09:41:33 PDT 18558 LOG: invalid secondary checkpoint link in
>control file
>2019-05-20 09:41:33 PDT 18558 PANIC: could not locate a valid checkpoint
>record
>2019-05-20 09:41:33 PDT 18555 LOG: startup process (PID 18558) was
>terminated by signal 6: Aborted
>2019-05-20 09:41:33 PDT 18555 LOG: aborting startup due to startup
>process failure
>2019-05-20 09:41:33 PDT 18555 LOG: database system is shut down
>2019-05-20 09:56:12 PDT 18701 LOG: database system was shut down in
>recovery at 2019-05-01 09:40:02 PDT
>
>As I said, the secondary was down for a month and I have all the archives
>of the wals in my primary. I was hoping that the secondary will use the
>restore_command to restore them :
>restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
>/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>
>my archive_command on the primary was :
>archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
>
>Am I missing something ?
>

First of all, the way you quote message is damn confusing - there's no
clear difference between your message and the message you quote. I don't
know which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from archive, but the rsync
command fails for some reason. You're in the position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because archive uses the same path.

>Another question, If I'll run initdb and initiate a new cluster and i'll
>copy the data files of my old cluster into the new one, is there any chance
>that it will work ?
>I mean right now, my primary is down and cant start up because it is
>missing an offset file in the pg_multixtrans/offset dir.
>

No, because you won't have contents of system catalogs, mapping the data
files to relations (tables, indexes) and containing information about the
structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manualy recovery - say, you might create the same tables in the new
schema, and then guess which data files belong to them. But there are
various caveats e.g. due to dropped columns, etc.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-performance by date:

From: Tomas Vondra
Date: 21 May 2019, 13:07:14
Subject: Re: Trying to handle db corruption 9.6

From: Tomas Vondra
Date: 21 May 2019, 16:37:27
Subject: Re: Trying to handle db corruption 9.6

Re: Trying to handle db corruption 9.6 - Mailing list pgsql-performance

Previous

Next