Thread: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio
BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio
From
skeefe@rdx.com
Date:
The following bug has been logged on the website: Bug reference: 10142 Logged by: Sean Keefe Email address: skeefe@rdx.com PostgreSQL version: 9.2.8 Operating system: Redhat 6.4 Description: The issues that we are experiencing is with Postgres 9.2.8 Cascading WAL Replication. If the master goes down during a massive transaction and we promote the first slave then next slave looks for a WAL log that never existed, New timeline before the split of timelines. Below is how to re create the issue: 1. Create M using postgresql.conf_M. Start M. CREATE TABLE t_test (id int4); 2. Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used rsync). Start S1 3. Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used rsync). Start S2 4. Insert data in t_test table in M INSERT INTO t_test SELECT * FROM generate_series(1, 250000) ; 5. Important: Do not shutdown M. If you want you can crash M by killing pids. I just let it run and immediately proceeded to next step. The idea here is to promote S1 before M transmits the last WAL which has the COMMIT of the above INSERT. 6. Promote S1. S1 will change its timeline. 7. S2 will not recognize the new timeline of its master S1. PGSTOP S2 and then PGSTART. S2 will now change its timeline. However, as you see in the pg_log, it will wait for a WAL that will never arrive. It will look for WALs from previous timeline in new timeline file naming format. E.g it will wait for 0000000A00000026000000F1. You will see that such log exists in the name 0000000900000026000000F1. So it will wait forever and if you try to connect to S2 you will see error âFATAL: the database system is starting upâ Recovery.conf for S1: restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' recovery_target_timeline = 'latest' recovery.conf for S2: restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' recovery_target_timeline = 'latest' If you need any of the other configuration files let me know and i can send them to you.
Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio
From
Heikki Linnakangas
Date:
On 04/25/2014 08:43 PM, skeefe@rdx.com wrote: > The issues that we are experiencing is with Postgres 9.2.8 Cascading WAL > Replication. If the master goes down during a massive transaction and we > promote the first slave then next slave looks for a WAL log that never > existed, New timeline before the split of timelines. I can't reproduce this. Would it be possible to create a self-contained script that reproduces the whole scenario? Something like the attached (which I used to try to reproduce this). > Below is how to recreate the issue: > > 1. Create M using postgresql.conf_M. Start M. > CREATE TABLE t_test (id int4); > > 2. Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used > rsync). Start S1 > > 3. Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used > rsync). Start S2 > > 4. Insert data in t_test table in M > INSERT INTO t_test SELECT * FROM generate_series(1, 250000) ; > 5. Important: Do not shutdown M. If you want you can crash M by killing > pids. I just let it run and immediately proceeded to next step. The idea > here is to promote S1 before M transmits the last WAL which has the COMMIT > of the above INSERT. > > 6. Promote S1. S1 will change its timeline. > > 7. S2 will not recognize the new timeline of its master S1. Yeah, that's expected behavior, or a known issue if you will, which was fixed in 9.3. However, S1 should automatically terminate the connection, with a message in the log like this: LOG: terminating all walsender processes to force cascaded standby(s) to update timeline and reconnect That should allow S2 to find the new timeline, without restarting, as long as you have a WAL archive set up. > PGSTOP S2 and > then PGSTART. S2 will now change its timeline. However, as you see in the > pg_log, it will wait for a WAL that will never arrive. It will look for WALs > from previous timeline in new timeline file naming format. E.g it will wait > for 0000000A00000026000000F1. You will see that such log exists in the name > 0000000900000026000000F1. So it will wait forever and if you try to connect > to S2 you will see error “FATAL: the database system is starting up” This seems to be the crux of this bug report. I just tested this and didn't see this behavior. S2 tries restoring files from the archive first, but then it connects to S1 and catches up. > Recovery.conf for S1: > restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' > recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' > > recovery_target_timeline = 'latest' > > recovery.conf for S2: > restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r' > recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg' > > recovery_target_timeline = 'latest' There are no primary_conninfo lines here, so you're either not showing us the full recovery.conf files used, or you haven't in fact set up cascading replication. - Heikki
Attachment
Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replicatio
From
Heikki Linnakangas
Date:
(cc'ing pgsql-bugs, please keep the mailing list cc'd so that others can join in the discussion) On 04/29/2014 05:50 PM, Sean Keefe wrote: > Also it seems that you were setting up streaming based replication not file > based replication that we were using. I wanted to point this out. Ah, you mentioned "cascading replication" in the subject, and I assumed that you meant the built-in streaming cascading replication feature. But from the scripts, I see that you're using pg_standby, and rsync for the cascade. pg_standby is not the recommended way to set up replication anymore. Since version 9.0, there is a built-in standby feature that can be used in file-based mode too. See http://www.postgresql.org/docs/9.2/static/warm-standby.html#STANDBY-SERVER-SETUP. I don't know if pg_standby can be made to work across timeline switches. - Heikki