Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication
Date
Msg-id 535F538E.9020206@vmware.com
In response to BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL Cascading replication  (skeefe@rdx.com)
List pgsql-bugs
On 04/25/2014 08:43 PM, skeefe@rdx.com wrote:
> The issue we are experiencing is with Postgres 9.2.8 cascading WAL
> replication. If the master goes down during a massive transaction and we
> promote the first slave, the next slave looks for a WAL segment that never
> existed: one named for the new timeline but predating the timeline split.

I can't reproduce this. Would it be possible to create a self-contained
script that reproduces the whole scenario? Something like the attached
(which I used to try to reproduce this).

> Below is how to recreate the issue:
>
> 1.    Create M using postgresql.conf_M. Start M.
> CREATE TABLE t_test (id int4);
>
> 2.    Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used
> rsync). Start S1
>
> 3.    Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used
> rsync). Start S2
>
> 4.    Insert data into the t_test table on M:
> INSERT INTO t_test SELECT * FROM generate_series(1, 250000);
>
> 5.    Important: Do not shut down M. If you want, you can crash M by killing
> pids. I just let it run and immediately proceeded to the next step. The idea
> here is to promote S1 before M transmits the last WAL segment, which carries
> the COMMIT of the above INSERT.
>
> 6.    Promote S1. S1 will change its timeline.
>
> 7.    S2 will not recognize the new timeline of its master S1.

Yeah, that's expected behavior, or a known issue if you will, which was
fixed in 9.3. However, S1 should automatically terminate the connection,
with a message in the log like this:

LOG: terminating all walsender processes to force cascaded standby(s) to
update timeline and reconnect

That should allow S2 to find the new timeline, without restarting, as
long as you have a WAL archive set up.
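For reference, a minimal 9.2-style archive setup looks something like the following; the archive directory path here is a hypothetical example, not taken from the poster's scripts:

```
# postgresql.conf on M (archive directory is a made-up example path)
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'

# recovery.conf on the standbys
restore_command = 'cp /path/to/archive/%f %p'
```

With an archive in place, S2 can fetch the new timeline history file from the archive after S1 forcibly disconnects it.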

> PGSTOP S2 and
> then PGSTART. S2 will now change its timeline. However, as you see in the
> pg_log, it will wait for a WAL segment that will never arrive. It will look
> for WALs from the previous timeline under the new timeline's file naming.
> E.g. it will wait for 0000000A00000026000000F1, while you will see that such
> a log exists under the name 0000000900000026000000F1. So it will wait
> forever, and if you try to connect to S2 you will see the error
> “FATAL:  the database system is starting up”

This seems to be the crux of this bug report. I just tested this and
didn't see this behavior. S2 tries restoring files from the archive
first, but then it connects to S1 and catches up.
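To see why the two file names in the report differ only at the start: a WAL segment file name is three 8-hex-digit fields concatenated (timeline ID, log file number, segment number), so 0000000A00000026000000F1 is the timeline-10 name of a segment that was only ever written under timeline 9's name, 0000000900000026000000F1. A quick sketch of the naming:

```shell
# WAL segment file name = timeline ID, log number, segment number,
# each as eight hex digits.
tli=10          # 0x0A: the timeline S1 switched to at promotion
log=$((0x26))   # log file number from the reported file name
seg=$((0xF1))   # segment number from the reported file name
printf '%08X%08X%08X\n' "$tli" "$log" "$seg"   # prints 0000000A00000026000000F1
```

A standby asking for a pre-split segment under the post-split timeline's name is exactly the pre-9.3 limitation mentioned above.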

> Recovery.conf for S1:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'
>
> recovery.conf for S2:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'

There are no primary_conninfo lines here, so you're either not showing
us the full recovery.conf files used, or you haven't in fact set up
cascading replication.
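For completeness, a cascading standby on 9.2 would need something like this in S2's recovery.conf; the host, port, and user values below are hypothetical, only the restore_command line is taken from the report:

```
# recovery.conf on S2 -- connection values are hypothetical examples
standby_mode = 'on'
primary_conninfo = 'host=s1-host port=5432 user=repl'
restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
recovery_target_timeline = 'latest'
```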

- Heikki

Attachment
