
From: Oleksii Kliukin
Subject: run pg_rewind on an uncleanly shut down cluster.
Msg-id: 1444059667.1508260.401673937.23F703F1@webmail.messagingengine.com
Hello,

I'm trying to find out how to rewind a cluster that was not shut down
cleanly, in order to implement pg_rewind support in patroni (an
automated failover system, https://github.com/zalando/patroni).

At the moment, pg_rewind only works on cleanly shut down clusters. This
works nicely in the case of a split brain caused by a network
partition. However, it doesn't cover the important case of a suddenly
crashed master: the crashed cluster cannot be rewound to follow the new
master.

One idea to overcome this limitation is to start the former master for a
short time, just to let automatic recovery do its job, and stop it
cleanly afterwards. There are some indications on the list that it
works:
http://www.postgresql.org/message-id/79F6CEB4-F519-40FA-9C72-167DEF1EB3B5@simply.name
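
In shell terms that approach would amount to roughly the following
(assuming pg_ctl is on the PATH and data/postgresql0 is the data
directory of the former master):

$ pg_ctl -D data/postgresql0 -w start          # crash recovery runs automatically on startup
$ pg_ctl -D data/postgresql0 -w -m fast stop   # clean shutdown writes a shutdown checkpoint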

However, in our testing we had an issue with a missing WAL segment on
the former master, which prevented pg_rewind from bringing it up to
date with the current master:

Suppose the current XLOG segment right before we crash the master is:

postgres=# select * from pg_xlogfile_name(pg_current_xlog_location());
     pg_xlogfile_name
--------------------------
 000000010000000000000003
(1 row)

(the master is configured to archive all segments into an external
directory).
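
The archiving setup itself is nothing special, roughly the usual
configuration (the archive path below is just a placeholder):

archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'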

The latest checkpoint location right before the crash is:

Latest checkpoint's REDO location:    0/2000028
Latest checkpoint's REDO WAL file:    000000010000000000000002
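
(These two lines are taken from pg_controldata output, i.e. something
like:

$ pg_controldata data/postgresql0 | grep 'REDO'
)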

and pg_xlog contains the following data:

$ ls -R postgresql0/pg_xlog/
000000010000000000000001
000000010000000000000002
000000010000000000000002.00000028.backup
000000010000000000000003
archive_status

postgresql0/pg_xlog//archive_status:
000000010000000000000001.done
000000010000000000000002.done
000000010000000000000002.00000028.backup.done

Now, if we crash the master by sending it SIGKILL, and then start it
again with:

$ postgres -D data/postgresql0 -c "max_replication_slots=5" \
    -c "wal_level=hot_standby" -c "wal_log_hints=on"
LOG:  database system was interrupted; last known up at 2015-10-05 17:28:04 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/2000028
LOG:  invalid record length at 0/3000060
LOG:  redo done at 0/3000028
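
(The crash itself was simulated by sending the postmaster a plain
SIGKILL, roughly along these lines, taking the PID from the first line
of postmaster.pid:

$ kill -9 $(head -n 1 data/postgresql0/postmaster.pid)
)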

we'll get the following contents of postgresql0/pg_xlog:

$ ls -R postgresql0/pg_xlog/
000000010000000000000002.00000028.backup
000000010000000000000003
000000010000000000000004
000000010000000000000005
archive_status

postgresql0/pg_xlog//archive_status:
000000010000000000000002.00000028.backup.done

Note that at some point the master removed the segment
000000010000000000000002 from its pg_xlog (presumably it was recycled
by the checkpoint at the end of recovery).

In the pg_controldata output, I now get:

Latest checkpoint's REDO location:    0/3000108
Latest checkpoint's REDO WAL file:    000000010000000000000003

When I try to run pg_rewind, I'm getting:

$ pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433 dbname=postgres"
The servers diverged at WAL position 0/3000060 on timeline 1.
could not open file "data/postgresql0/pg_xlog/000000010000000000000002": No such file or directory

Note that this problem does not happen 100% of the time during the
tests; sometimes pg_rewind can actually rewind the former master.

I know I can copy the segment back from the archive, but I'd like to
avoid putting this logic into the failover tool if possible. Is there
anything we can do to avoid the problem described above, or is there a
better way to bring up the former master after the crash with pg_rewind?
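
(The workaround I'd like to avoid would boil down to something like the
following before re-running pg_rewind, with the archive path again
being a placeholder:

$ cp /path/to/wal_archive/000000010000000000000002 data/postgresql0/pg_xlog/
)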

Kind regards,
--
Oleksii Kliukin


