Thread: Trying to handle db corruption 9.6

Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
Hey,
I'm trying to handle a corruption that one of our customers is facing.
His disk space was full, and as a result he decided to run pg_resetxlog a few times (bad idea..).
When I connected to the machine I saw that the db was down.
When I started the db (service postgresql start) I saw the following error in the logs:

DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.

The pg_multixact/offsets dir contained one file (0025).
The pg_multixact/members dir contained 2 files: 0000 and 0001.

I tried to follow the documentation of pg_resetxlog and ran pg_resetxlog with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
However, it didn't help and the same error appeared.
So I tried to rename the file to 0000, and then the db searched for a file in members that didn't exist.
I followed the documentation and changed the multitransaction offset (-O) and the transaction IDs (-c) based on the doc, and then the db started successfully.
However, after it started I saw the following message in the logs:
Multixact member wraparound protections are disabled because oldest checkpointed MultiXact 65536 does not exist. In addition, no one is able to connect to the db (we keep getting errors that the database doesn't exist or the user doesn't exist, even for the postgresql user).
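For reference, the 9.6 pg_resetxlog documentation derives these values from the SLRU segment file names rather than using fixed numbers. Applied to the directory listings above, the documented arithmetic would look roughly like the sketch below (an illustration only, not what was actually run; the data directory location in $PGDATA is an assumption):

    # -m NEXT,OLDEST  -- from pg_multixact/offsets (file names are hex)
    #    NEXT   = (largest file + 1) * 65536  ->  (0x25 + 1) * 0x10000 = 0x260000
    #    OLDEST =  smallest file     * 65536  ->   0x25      * 0x10000 = 0x250000
    # -O OFFSET       -- from pg_multixact/members
    #    OFFSET = (largest file + 1) * 52352  ->  (0x1 + 1) * 52352 = 104704
    pg_resetxlog -m 0x260000,0x250000 -O 104704 "$PGDATA"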

current relevant rows from the control data:

pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6692952810876880414
Database cluster state:               shut down
pg_control last modified:             Mon 20 May 2019 07:07:30 AM PDT
Latest checkpoint location:           1837/E3000028
Prior checkpoint location:            1837/E2000028
Latest checkpoint's REDO location:    1837/E3000028
Latest checkpoint's REDO WAL file:    0000000100001837000000E3
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  131072
Latest checkpoint's NextMultiOffset:  52352
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   65536
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:4604
Latest checkpoint's newestCommitTsXid:5041




I also checked and saw that the customer has all the WALs (backed up), but without any basebackup..
Any recommendations on how to handle this case?

Re: Trying to handle db corruption 9.6

From
Flo Rance
Date:
Hi,

First of all, as stated in the wiki, you'll need to make a filesystem-level copy of the database files and put it on another drive before attempting anything else!
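A minimal sketch of such a copy, assuming the server is stopped and the stock PGDG 9.6 data directory (adjust the paths to the real layout):

    # copy the whole cluster directory, preserving permissions, to a different drive
    rsync -a /var/lib/pgsql/9.6/data/ /mnt/other_drive/pgdata_copy/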


regards,
Flo

On Mon, May 20, 2019 at 4:40 PM Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:
Hey,
I'm trying to handle a corruption that one of our customers is facing.
His disk space was full, and as a result he decided to run pg_resetxlog a few times (bad idea..).
When I connected to the machine I saw that the db was down.
When I started the db (service postgresql start) I saw the following error in the logs:

DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.

The pg_multixact/offsets dir contained one file (0025).
The pg_multixact/members dir contained 2 files: 0000 and 0001.

I tried to follow the documentation of pg_resetxlog and ran pg_resetxlog with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
However, it didn't help and the same error appeared.
So I tried to rename the file to 0000, and then the db searched for a file in members that didn't exist.
I followed the documentation and changed the multitransaction offset (-O) and the transaction IDs (-c) based on the doc, and then the db started successfully.
However, after it started I saw the following message in the logs:
Multixact member wraparound protections are disabled because oldest checkpointed MultiXact 65536 does not exist. In addition, no one is able to connect to the db (we keep getting errors that the database doesn't exist or the user doesn't exist, even for the postgresql user).

current relevant rows from the control data:

pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6692952810876880414
Database cluster state:               shut down
pg_control last modified:             Mon 20 May 2019 07:07:30 AM PDT
Latest checkpoint location:           1837/E3000028
Prior checkpoint location:            1837/E2000028
Latest checkpoint's REDO location:    1837/E3000028
Latest checkpoint's REDO WAL file:    0000000100001837000000E3
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  131072
Latest checkpoint's NextMultiOffset:  52352
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   65536
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:4604
Latest checkpoint's newestCommitTsXid:5041




I also checked and saw that the customer has all the WALs (backed up), but without any basebackup..
Any recommendations on how to handle this case?

Re: Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
Yes, I understand that. I'm trying to handle it after the backup that I have taken.

On Mon, May 20, 2019, 5:49 PM Flo Rance <trourance@gmail.com> wrote:
Hi,

First of all, as stated in the wiki, you'll need to make a filesystem-level copy of the database files and put it on another drive before attempting anything else!


regards,
Flo

On Mon, May 20, 2019 at 4:40 PM Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:
Hey,
I'm trying to handle a corruption that one of our customers is facing.
His disk space was full, and as a result he decided to run pg_resetxlog a few times (bad idea..).
When I connected to the machine I saw that the db was down.
When I started the db (service postgresql start) I saw the following error in the logs:

DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.

The pg_multixact/offsets dir contained one file (0025).
The pg_multixact/members dir contained 2 files: 0000 and 0001.

I tried to follow the documentation of pg_resetxlog and ran pg_resetxlog with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
However, it didn't help and the same error appeared.
So I tried to rename the file to 0000, and then the db searched for a file in members that didn't exist.
I followed the documentation and changed the multitransaction offset (-O) and the transaction IDs (-c) based on the doc, and then the db started successfully.
However, after it started I saw the following message in the logs:
Multixact member wraparound protections are disabled because oldest checkpointed MultiXact 65536 does not exist. In addition, no one is able to connect to the db (we keep getting errors that the database doesn't exist or the user doesn't exist, even for the postgresql user).

current relevant rows from the control data:

pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6692952810876880414
Database cluster state:               shut down
pg_control last modified:             Mon 20 May 2019 07:07:30 AM PDT
Latest checkpoint location:           1837/E3000028
Prior checkpoint location:            1837/E2000028
Latest checkpoint's REDO location:    1837/E3000028
Latest checkpoint's REDO WAL file:    0000000100001837000000E3
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  131072
Latest checkpoint's NextMultiOffset:  52352
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   65536
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid:4604
Latest checkpoint's newestCommitTsXid:5041




I also checked and saw that the customer has all the WALs (backed up), but without any basebackup..
Any recommendations on how to handle this case?

Re: Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
A backup was made after the corruption appeared, but before I tried using the pg_resetxlog command. Basically I just want to start the database with the data that is available in the files (I'm OK with losing data that was in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..

On Mon, May 20, 2019 at 18:59 Greg Clough <Greg.Clough@ihsmarkit.com> wrote:

> Yes I understand that.. I'm trying to handle it after the backup that I have taken..


IMHO the best option here is to keep a safe copy, as you have already done, and then restore from a backup and replay whatever WAL you have.  The database you have is terminally corrupted, and should never be relied upon going forward.

 

You can try to get it running, and then extract the data with pg_dump... but even then you will need to manually verify it’s OK because you have no idea which dirty blocks from memory have been written to disk and which have not.  Without the WAL you have no way of making it consistent, and if they have been destroyed then you’re out of luck.
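If the instance can be coaxed into accepting connections at all, the extraction step would be something along these lines (the database names here are hypothetical):

    # dump each database to a custom-format archive, then restore it into a freshly initdb'd cluster
    pg_dump -Fc -d somedb -f /mnt/other_drive/somedb.dump
    pg_restore -d somedb_new /mnt/other_drive/somedb.dump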

 

If you don’t have backups and archived WAL then fixing what you’ve got may be your only option, but you should only go down that route if you have to.  If you have to “repair”, then I’d recommend engaging a reputable PostgreSQL consultancy to help you.

 

Regards,

Greg.

P.S.  This conversation should probably be moved to something like pgsql-admin
             

 





Re: Trying to handle db corruption 9.6

From
Bimal
Date:
I ran into the same issue about a year back; luckily I had a standby to quickly promote. But I wish there were better documentation on how to handle the WAL filling up the disk and resetting it.

On Monday, May 20, 2019, 9:08:19 AM PDT, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:


A backup was made after the corruption appeared, but before I tried using the pg_resetxlog command. Basically I just want to start the database with the data that is available in the files (I'm OK with losing data that was in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..

On Mon, May 20, 2019 at 18:59 Greg Clough <Greg.Clough@ihsmarkit.com> wrote:

> Yes I understand that.. I'm trying to handle it after the backup that I have taken..


IMHO the best option here is to keep a safe copy, as you have already done, and then restore from a backup and replay whatever WAL you have.  The database you have is terminally corrupted, and should never be relied upon going forward.

 

You can try to get it running, and then extract the data with pg_dump... but even then you will need to manually verify it’s OK because you have no idea which dirty blocks from memory have been written to disk and which have not.  Without the WAL you have no way of making it consistent, and if they have been destroyed then you’re out of luck.

 

If you don’t have backups and archived WAL then fixing what you’ve got may be your only option, but you should only go down that route if you have to.  If you have to “repair”, then I’d recommend engaging a reputable PostgreSQL consultancy to help you.

 

Regards,

Greg.

P.S.  This conversation should probably be moved to something like pgsql-admin
             

 





Re: Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
Hey Greg,
Basically my backup was made after the first pg_resetxlog, so I was wrong. However, the customer had a secondary machine that wasn't synced for a month. I have all the WALs since the moment the secondary went out of sync. Once I started it, I hoped that it would start recovering the WALs and fill the gap. However, I got an error on the secondary:

2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in control file
2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid checkpoint record
2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was terminated by signal 6: Aborted
2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup process failure
2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down

I checked my secondary archive dir and pg_xlog dir, and it seems that the restore command doesn't work. My restore_command:

restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup /var/lib/pgsql/archive %r'


On Mon, May 20, 2019, 7:20 PM Bimal <internetuser2008@yahoo.com> wrote:
I ran into the same issue about a year back; luckily I had a standby to quickly promote. But I wish there were better documentation on how to handle the WAL filling up the disk and resetting it.

On Monday, May 20, 2019, 9:08:19 AM PDT, Mariel Cherkassky <mariel.cherkassky@gmail.com> wrote:


A backup was made after the corruption appeared, but before I tried using the pg_resetxlog command. Basically I just want to start the database with the data that is available in the files (I'm OK with losing data that was in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..

On Mon, May 20, 2019 at 18:59 Greg Clough <Greg.Clough@ihsmarkit.com> wrote:

> Yes I understand that.. I'm trying to handle it after the backup that I have taken..


IMHO the best option here is to keep a safe copy, as you have already done, and then restore from a backup and replay whatever WAL you have.  The database you have is terminally corrupted, and should never be relied upon going forward.

 

You can try to get it running, and then extract the data with pg_dump... but even then you will need to manually verify it’s OK because you have no idea which dirty blocks from memory have been written to disk and which have not.  Without the WAL you have no way of making it consistent, and if they have been destroyed then you’re out of luck.

 

If you don’t have backups and archived WAL then fixing what you’ve got may be your only option, but you should only go down that route if you have to.  If you have to “repair”, then I’d recommend engaging a reputable PostgreSQL consultancy to help you.

 

Regards,

Greg.

P.S.  This conversation should probably be moved to something like pgsql-admin
             

 





Re: Trying to handle db corruption 9.6

From
Tomas Vondra
Date:
On Mon, May 20, 2019 at 04:20:45PM +0000, Bimal wrote:
>   I had ran into same issue about year back, luckily I had standby to
>   quickly promote.  But, I wish there was better a documentation on how to
>   handle WAL log fill up and resetting them.  

pg_resetxlog is not a tool to deal with "WAL fill up". It's a last
resort option to deal with corrupted WAL, and can easily make matters
worse when used without due consideration. That seems to be the case
here, unfortunately.

On a properly behaving system, running out of disk space for pg_xlog
results in database shutdown. If you also get corrupted WAL, you have
bigger problems, I'm afraid.

Also, data corruption issues are one-off events, mostly unique. That
makes it rather difficult (~impossible) to write docs about recovering
from them. And it's why there are no magic tools.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Trying to handle db corruption 9.6

From
Tomas Vondra
Date:
On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:
>   Hey Greg,
>   Basically my backup was made after the first pg_resetxlog so I was wrong.

Bummer.

>   However, the customer had a secondary machine that wasn't synced for a
>   month. I have all the walls since the moment the secondary went out of
>   sync. Once I started it I hoped that it will start recover the wals and
>   fill the gap. However I got an error in the secondary :         
>    2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in
>   control file
>   2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid
>   checkpoint record
>   2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
>   terminated by signal 6: Aborted
>   2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
>   process failure
>   2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.       
>                       I checked my secondary archive dir and pg_xlog dir and
>   it seems that the restore command doesnt work. My restore_command:      
>   restore_command = 'rsync -avzhe ssh
>   postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
>   gunzip < /var/lib/pgsql/archive/%f > %p'
>   archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
>   /var/lib/pgsql/archive %r'

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
Tomas : 

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?  


Well, when I start the secondary in recovery mode (the primary is down, auto failover is disabled..), it doesn't start recovering the archived WALs from the primary. The logs of the secondary:
receiving incremental file list
rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such file or directory (2)

sent 8 bytes  received 10 bytes  36.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1505) [receiver=3.0.6]
sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in control file
2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint record
2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was terminated by signal 6: Aborted
2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup process failure
2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in recovery at 2019-05-01 09:40:02 PDT

As I said, the secondary was down for a month, and I have all the archived WALs on my primary. I was hoping that the secondary would use the restore_command to restore them:
restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'

My archive_command on the primary was:
archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

Am I missing something?

Another question: if I run initdb to create a new cluster and copy the data files of my old cluster into the new one, is there any chance that it will work?
I mean, right now my primary is down and can't start up because it is missing an offset file in the pg_multixact/offsets dir.

On Tue, May 21, 2019 at 0:04 Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:
>   Hey Greg,
>   Basically my backup was made after the first pg_resetxlog so I was wrong.

Bummer.

>   However, the customer had a secondary machine that wasn't synced for a
>   month. I have all the walls since the moment the secondary went out of
>   sync. Once I started it I hoped that it will start recover the wals and
>   fill the gap. However I got an error in the secondary :         
>    2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in
>   control file
>   2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid
>   checkpoint record
>   2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
>   terminated by signal 6: Aborted
>   2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
>   process failure
>   2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.       
>                       I checked my secondary archive dir and pg_xlog dir and
>   it seems that the restore command doesnt work. My restore_command:      
>   restore_command = 'rsync -avzhe ssh
>   postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
>   gunzip < /var/lib/pgsql/archive/%f > %p'
>   archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
>   /var/lib/pgsql/archive %r'

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Trying to handle db corruption 9.6

From
Tomas Vondra
Date:
On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
>Tomas :
>
>Well, when you say it does not work, why do you think so? Does it print
>some error, or what? Does it even get executed? It does not seem to be
>the case, judging by the log (there's no archive_command message).
>
>How was the "secondary machine" created? You said you have all the WAL
>since then - how do you know that?
>
>Well, when I start the secondary in recovery mode (the primary is down,
>auto failover is disabled..)  it doesnt start recovering the archive wals
>from the primary. The logs of the secondary :
>receiving incremental file list
>rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
>file or directory (2)
>
>sent 8 bytes  received 10 bytes  36.00 bytes/sec
>total size is 0  speedup is 0.00
>rsync error: some files/attrs were not transferred (see previous errors)
>(code 23) at main.c(1505) [receiver=3.0.6]
>sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
>2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
>2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
>2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
>control file
>2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint
>record
>2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
>terminated by signal 6: Aborted
>2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
>process failure
>2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
>2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
>recovery at 2019-05-01 09:40:02 PDT
>
>As I said, the secondary was down for a month and I have all the archives
>of the wals in my primary. I was hoping that the secondary will use the
>restore_command to restore them :
>restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
>/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>
>my archive_command on the primary was :
>archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
>
>Am I missing something ?
>

First of all, the way you quote messages is damn confusing - there's no
clear difference between your message and the message you quote. I don't
know which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from the archive, but the rsync
command fails for some reason. You're in a position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because the archive uses the same path.
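For what it's worth, a minimal sketch of a restore_command that only reports success when a segment was actually fetched and decompressed (so a file missing from the archive fails cleanly instead of producing the "sh: ... No such file or directory" noise above), assuming the same host and paths as in the original command:

    restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f && gunzip < /var/lib/pgsql/archive/%f > %p'

It can also be tested by hand for a single segment (the segment name here is a placeholder):

    rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/<segment> /var/lib/pgsql/archive/ && gunzip -t /var/lib/pgsql/archive/<segment>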


>Another question, If I'll run initdb and initiate a new cluster and i'll
>copy the data files of my old cluster into the new one, is there any chance
>that it will work ?
>I mean right now, my primary is down and cant start up because it is
>missing an offset file in the pg_multixtrans/offset dir.
>

No, because you won't have contents of system catalogs, mapping the data
files to relations (tables, indexes) and containing information about the
structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manual recovery - say, you might create the same tables in the new
schema, and then guess which data files belong to them. But there are
various caveats, e.g. due to dropped columns, etc.
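For illustration, the catalog-to-file mapping described here can be inspected on a working cluster with something like the following (the database and table names are hypothetical):

    psql -d somedb -c "SELECT oid::regclass AS relation, pg_relation_filepath(oid) FROM pg_class WHERE relname = 'my_table';"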

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Trying to handle db corruption 9.6

From
Mariel Cherkassky
Date:
Tomas - Well, when I run the restore_command manually it works (the archive dir exists on the secondary..). Thanks for the explanation on the system catalogs.

Greg - My restore command copies the WALs from the archive dir on the primary to an archive dir on the secondary (different from pg_xlog on the secondary). Should I run it manually and see if the archives are copied to the archive dir on the secondary, or should I just copy all of them to the pg_xlog dir on the secondary?
I tried to start the secondary as a primary (I have a backup..) but I still got an error (invalid primary/secondary checkpoint record). Does it mean that my backup is corrupted?
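A quick way to see which checkpoint the secondary's control file currently points at (assuming the stock PGDG 9.6 data directory) is:

    /usr/pgsql-9.6/bin/pg_controldata /var/lib/pgsql/9.6/data | grep -i checkpoint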

On Tue, May 21, 2019 at 16:07 Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
>Tomas :
>
>Well, when you say it does not work, why do you think so? Does it print
>some error, or what? Does it even get executed? It does not seem to be
>the case, judging by the log (there's no archive_command message).
>
>How was the "secondary machine" created? You said you have all the WAL
>since then - how do you know that?
>
>Well, when I start the secondary in recovery mode (the primary is down,
>auto failover is disabled..)  it doesnt start recovering the archive wals
>from the primary. The logs of the secondary :
>receiving incremental file list
>rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
>file or directory (2)
>
>sent 8 bytes  received 10 bytes  36.00 bytes/sec
>total size is 0  speedup is 0.00
>rsync error: some files/attrs were not transferred (see previous errors)
>(code 23) at main.c(1505) [receiver=3.0.6]
>sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
>2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
>2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
>2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
>control file
>2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint
>record
>2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
>terminated by signal 6: Aborted
>2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
>process failure
>2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
>2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
>recovery at 2019-05-01 09:40:02 PDT
>
>As I said, the secondary was down for a month and I have all the archives
>of the wals in my primary. I was hoping that the secondary will use the
>restore_command to restore them :
>restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
>/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>
>my archive_command on the primary was :
>archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
>
>Am I missing something ?
>

First of all, the way you quote messages is damn confusing - there's no
clear difference between your message and the message you quote. I don't
know which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from the archive, but the rsync
command fails for some reason. You're in a position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because the archive uses the same path.


>Another question, If I'll run initdb and initiate a new cluster and i'll
>copy the data files of my old cluster into the new one, is there any chance
>that it will work ?
>I mean right now, my primary is down and cant start up because it is
>missing an offset file in the pg_multixtrans/offset dir.
>

No, because you won't have contents of system catalogs, mapping the data
files to relations (tables, indexes) and containing information about the
structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manual recovery - say, you might create the same tables in the new
schema, and then guess which data files belong to them. But there are
various caveats, e.g. due to dropped columns, etc.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Trying to handle db corruption 9.6

From
Tomas Vondra
Date:
On Tue, May 21, 2019 at 04:03:52PM +0000, Greg Clough wrote:
>>  My restore command copy the wals from archive dir in the primary to an
>>  archive dir in the secondary(different from the pg_xlog in the
>>  secondary)
>
>I think that you're restore command puts them back into the archive, and
>then uncompresses them into pg_xlog, which is what %p represents.
>
>
>> Should I run it manually and see if the archives are copied to the
>> archive dir in the secondary or should I just copy all of them to the
>> xlog dir in the secondary ?
>
>That would be my first test, but as Thomas mentioned, you don't have any
>hint of WAL archives being restored in the postgresql.log... so it's not
>even trying.  It's not likely that archive_command is your problem at the
>moment.
>
>
>> I tried to start the secondary as a primary (I have a backup..) but I
>> still got an error (invalid checkpoint record from primary./
>> secondary). Does it means that my backup is corrupted ?
>
>I think so, but Thomas could probably confirm if all hope is lost.  Also,
>I'm not sure if there is a terminology difference but a "standby" is
>never considered a "backup".  I realise it's late in the day, but even if
>you have a correctly configured Standby you should also take backups with
>pg_basebackup, Barman, pgBackRest, etc.
>

Well, I have no idea. We still have no information about how the standby
was created, whether it was ever running fine, and so on. Considering it does
not seem to be getting data from the archive, it might be the case that it was
created in some strange way and never really worked. And if there really
are no log messages about the restore_command, it probably fails before
the standby even tries to execute it.

So I don't know.

>Restoring backups is where I would be heading now, as things seem
>terribly broken.
>

Right. But my impression is there are no backups ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services