Thread: Review of pg_rewind
While testing pg_rewind I encountered the following problems.
I used the following process to do the testing; please correct me if I am doing it the wrong way.
Problem-1:
pg_rewind gives error (target master must be shut down cleanly.) when master crashed unexpectedly.
1. Setup Streaming Replication (stand alone machine : master server port -5432, standby server port-5433 )
2. Do some operation on master server:
postgres=# create table test(id int);
3. Crash the Postgres process of master:
kill -9 [pid of postgres process of master server]
4. Promote standby server
5. Run pg_rewind:
$ /samrat/postgresql/contrib/pg_rewind/pg_rewind -D /samrat/master-data/ --source-server='host=localhost port=5433 dbname=postgres' -v
connected to remote server
fetched file "global/pg_control", length 8192
target master must be shut down cleanly.
6. Check master's control information:
$ /samrat/postgresql/install/bin/pg_controldata /samrat/master-data/ | grep "Database cluster state"
Database cluster state: in production
IIUC, this is because pg_rewind performs some checks before resynchronizing the PostgreSQL data directories.
But in real-world scenarios, for example if the master crashes due to a hardware failure and its control data shows the state "in production", pg_rewind will fail this check.
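The state check can be approximated from the outside before running pg_rewind. This is only a rough sketch: the function names cluster_state and is_cleanly_shut_down are hypothetical, and it assumes that pg_rewind accepts only the state "shut down", which is what the error message above suggests but which I have not confirmed in its code.

```shell
#!/bin/sh
# Hypothetical helper: read pg_controldata output on stdin and print
# only the value of the "Database cluster state" line.
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Hypothetical pre-check: succeeds only when the reported state is
# exactly "shut down" (assumed to be what pg_rewind requires).
is_cleanly_shut_down() {
    [ "$(cluster_state)" = "shut down" ]
}

# Example usage, with the paths from the steps above:
#   /samrat/postgresql/install/bin/pg_controldata /samrat/master-data/ \
#       | is_cleanly_shut_down \
#       || echo "master was not shut down cleanly" >&2
```

In the crash case above, the state is "in production", so the pre-check fails in the same situation where pg_rewind stops.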
Problem-2:
For a zero-length WAL record, pg_rewind gives an error.
1. Setup Streaming Replication (stand alone machine : master server port -5432, standby server port-5433 )
2. Cleanly shutdown master (Do not add any data on master)
3. Promote standby server
4. Create table on new master (promoted standby)
postgres=# create table test(id int);
5. Run pg_rewind:
$ /samrat/postgresql/contrib/pg_rewind/pg_rewind -D /samrat/master-data/ --source-server='host=localhost port=5433
connected to remote server
connected to remote server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/00000002.history", length 41
Last common WAL position: 0/4000090 on timeline 1
could not previous WAL record at 0/4000090: record with zero length at 0/4000090
Also, as you already listed in the README of pg_rewind, it has a problem with tablespace support.
I will continue with testing it further to help in improving it :)
Hi,

Thanks for the feedback. Btw, pg_rewind is not a project included in
Postgres core as a contrib module or anything, so could you send your
feedback and the issues you find directly on github instead? The URL
of the project is https://github.com/vmware/pg_rewind.

Either way, here are some comments below...

On Wed, Oct 23, 2013 at 6:07 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
> While testing pg_rewind I encountered following problem.
> I used following process to do the testing, Please correct me if I am doing
> it in wrong way.
>
> Problem-1:
> pg_rewind gives error (target master must be shut down cleanly.) when
> master crashed unexpectedly.
>
> 1. Setup Streaming Replication (stand alone machine : master server port
> -5432, standby server port-5433 )
> 2. Do some operation on master server:
> postgres=# create table test(id int);
> 3. Crash the Postgres process of master:
> kill -9 [pid of postgres process of master server]
> 4. Promote standby server
> 5. Run pg_rewind:
> $ /samrat/postgresql/contrib/pg_rewind/pg_rewind -D
> /samrat/master-data/ --source-server='host=localhost port=5433
> dbname=postgres' -v
> connected to remote server
> fetched file "global/pg_control", length 8192
> target master must be shut down cleanly.
> 6. Check masters control information:
> $ /samrat/postgresql/install/bin/pg_controldata
> /samrat/master-data/ | grep "Database cluster state"
> Database cluster state: in production
>
> IIUC It is because pg_rewind does some checks before resynchronizing the
> PostgreSQL data directories.
> But In real time scenarios, for example due to hardware failure if master
> crashed and its controldata shows the state "in production" then pg_rewind
> will fail to pass this check.
Yeah, you could call that a limitation of this module. When I looked
at its code some time ago, I had on top of my mind the addition of an
option of the type --force that could attempt resynchronization of a
master even if it did not shut down correctly.

>
> Problem-2:
> For zero length WAL record pf_rewind gives error.
>
> 1. Setup Streaming Replication (stand alone machine : master server port
> -5432, standby server port-5433 )
> 2. Cleanly shutdown master (Do not add any data on master)
> 3. Promote standby server
> 4. Create table on new master (promoted standby)
> postgres=# create table test(id int);
> 5. Run pg_rewind:
> $ /samrat/postgresql/contrib/pg_rewind/pg_rewind -D
> /samrat/master-data/ --source-server='host=localhost port=5433
> connected to remote server
> connected to remote server
> fetched file "global/pg_control", length 8192
> fetched file "pg_xlog/00000002.history", length 41
> Last common WAL position: 0/4000090 on timeline 1
> could not previous WAL record at 0/4000090: record with zero length
> at 0/4000090
This is rather interesting. When I tested it I did not find this error.

> Also it as you already listed in README of pg_rewind the it has a problem of
> tablespace support.
>
> I will continue with testing it further to help in improving it :)
Thanks!
--
Michael
On Wed, Oct 23, 2013 at 2:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi,
>
> Thanks for the feedback. Btw, pg_rewind is not a project included in
> Postgres core as a contrib module or anything, so could you send your
> feedback and the issues you find directly on github instead? The URL
> of the project is https://github.com/vmware/pg_rewind.

Sure, I will add those issues on github.

> Either way, here are some comments below...
>
> On Wed, Oct 23, 2013 at 6:07 PM, Samrat Revagade
> <revagade.samrat@gmail.com> wrote:
> > While testing pg_rewind I encountered following problem.
> > I used following process to do the testing, Please correct me if I am doing
> > it in wrong way.
> >
> > Problem-1:
> > pg_rewind gives error (target master must be shut down cleanly.) when
> > master crashed unexpectedly.
> >
> > 1. Setup Streaming Replication (stand alone machine : master server port
> > -5432, standby server port-5433 )
> > 2. Do some operation on master server:
> > postgres=# create table test(id int);
> > 3. Crash the Postgres process of master:
> > kill -9 [pid of postgres process of master server]
> > 4. Promote standby server
> > 5. Run pg_rewind:
> > $ /samrat/postgresql/contrib/pg_rewind/pg_rewind -D
> > /samrat/master-data/ --source-server='host=localhost port=5433
> > dbname=postgres' -v
> > connected to remote server
> > fetched file "global/pg_control", length 8192
> > target master must be shut down cleanly.
> > 6. Check masters control information:
> > $ /samrat/postgresql/install/bin/pg_controldata
> > /samrat/master-data/ | grep "Database cluster state"
> > Database cluster state: in production
> >
> > IIUC It is because pg_rewind does some checks before resynchronizing the
> > PostgreSQL data directories.
> > But In real time scenarios, for example due to hardware failure if master
> > crashed and its controldata shows the state "in production" then pg_rewind
> > will fail to pass this check.
> Yeah, you could call that a limitation of this module. When I looked
> at its code some time ago, I had on top of my mind the addition of an
> option of the type --force that could attempt resynchronization of a
> master even if it did not shut down correctly.

This sounds good :)
Greetings,
Samrat Revagade