BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication - Mailing list pgsql-bugs

From skeefe@rdx.com
Subject BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication
Msg-id 20140425174336.2721.61539@wrigleys.postgresql.org
Responses Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      10142
Logged by:          Sean Keefe
Email address:      skeefe@rdx.com
PostgreSQL version: 9.2.8
Operating system:   Redhat 6.4
Description:

The issue we are experiencing is with PostgreSQL 9.2.8 cascading WAL
replication. If the master goes down during a large transaction and we promote
the first standby, the next standby then waits for a WAL segment that never
existed: a segment named for the new timeline but lying before the point where
the timelines split. Below is how to recreate the issue:

1.    Create M using postgresql.conf_M. Start M, then create a test table (a
sketch of the replication-relevant settings follows this step):
CREATE TABLE t_test (id int4);
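
postgresql.conf_M is not attached to this report; for reference, here is a
minimal sketch of the replication-relevant settings for 9.2 cascading
replication (the values and the archive path are assumptions, not the
reporter's actual file):

wal_level = hot_standby                 # needed for streaming to hot standbys
max_wal_senders = 5                     # walsenders for M -> S1 and S1 -> S2
hot_standby = on                        # let S1/S2 serve read-only queries
archive_mode = on                       # archive segments for restore_command.sh
archive_command = 'cp %p /data/postgres/rep_poc/archive/%f'  # assumed archive location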

2.    Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used
rsync). Start S1.

3.    Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used
rsync). Start S2. (A sketch of the rsync-based base backup follows this step.)
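
The base backup procedure is only summarized above; a minimal sketch of the
rsync approach, bracketed by pg_start_backup()/pg_stop_backup() (host names,
paths, and the backup label are assumptions):

-- on M, enter backup mode (second argument requests a fast checkpoint)
SELECT pg_start_backup('s1_base', true);

# copy the data directory to the standby, skipping live WAL and the pid file
rsync -a --exclude=pg_xlog --exclude=postmaster.pid /data/postgres/data/ s1:/data/postgres/data/

-- on M, leave backup mode
SELECT pg_stop_backup();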

4.    Insert data into the t_test table on M:
INSERT INTO t_test SELECT * FROM generate_series(1, 250000);

5.    Important: Do not shut down M. If you want, you can crash M by killing
its pids; I just let it run and immediately proceeded to the next step. The
idea here is to promote S1 before M transmits the last WAL, which carries the
COMMIT of the above INSERT.

6.    Promote S1. S1 will change its timeline. (A sketch of the promotion via
the trigger file follows.)
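
The recovery_end_command shown below removes trigger.cfg, which suggests S1 is
promoted via a trigger_file; a hedged sketch (the trigger_file line is
inferred, not taken from the report):

# recovery.conf_S1 presumably contains:
#   trigger_file = '/data/postgres/rep_poc/trigger.cfg'
# creating that file takes S1 out of recovery and switches its timeline
touch /data/postgres/rep_poc/trigger.cfg
# equivalently, on 9.1+: pg_ctl promote -D <S1 data directory>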

7.    S2 will not recognize the new timeline of its master, S1. PGSTOP S2 and
then PGSTART it. S2 will now change its timeline. However, as you can see in
pg_log, it waits for a WAL segment that will never arrive: it asks for segments
from the previous timeline using the new timeline's file naming. E.g., it will
wait for 0000000A00000026000000F1, while that segment actually exists under the
name 0000000900000026000000F1. So it waits forever, and if you try to connect
to S2 you will see the error "FATAL:  the database system is starting up". (The
segment-name layout is broken down below.)
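
For reference, a WAL segment name is three 8-hex-digit fields, so only the
first field (the timeline ID) differs between the file S2 demands and the file
that actually exists:

0000000A 00000026 000000F1    timeline 0x0A (10): requested by S2, never written
00000009 00000026 000000F1    timeline 0x09 (9):  the segment that really exists
timeline  log id   segment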

recovery.conf for S1:
restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'

recovery_target_timeline = 'latest'
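
restore_command.sh itself is not included; a minimal sketch of what such a
script typically does, with a hypothetical archive path:

#!/bin/sh
# called as: restore_command.sh %f %p %r
# $1 = requested WAL file name, $2 = destination path, $3 = last restartpoint file
ARCHIVE=/data/postgres/rep_poc/archive
cp "$ARCHIVE/$1" "$2" || exit 1
# old segments could be pruned here, e.g.: pg_archivecleanup "$ARCHIVE" "$3"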

recovery.conf for S2:
restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'

recovery_target_timeline = 'latest'
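
Note that for cascading replication, S2's recovery.conf would also need
standby_mode and a primary_conninfo pointing at S1 (those lines are not shown
in the report); a hedged sketch with an assumed host and user:

standby_mode = 'on'
primary_conninfo = 'host=s1.example.com port=5432 user=replicator'   # must point at S1, not M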

If you need any of the other configuration files, let me know and I can send
them to you.
