Re: Strange issues with 9.2 pg_basebackup & replication - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Strange issues with 9.2 pg_basebackup & replication
Msg-id CAHGQGwF86Mbp7z9Heuc93hWsrJ46JMvcLikDXc6O0adXG0V=+w@mail.gmail.com
In response to Re: Strange issues with 9.2 pg_basebackup & replication  (Thom Brown <thom@linux.com>)
List pgsql-hackers
On Thu, May 17, 2012 at 1:07 AM, Thom Brown <thom@linux.com> wrote:
> On 16 May 2012 11:36, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, May 16, 2012 at 2:29 AM, Thom Brown <thom@linux.com> wrote:
>>> On 15 May 2012 13:15, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Wed, May 16, 2012 at 1:36 AM, Thom Brown <thom@linux.com> wrote:
>>>>> However, this isn't true when I restart the standby.  I've been
>>>>> informed that this should work fine if a WAL archive has been
>>>>> configured (which should be used anyway).
>>>>
>>>> The WAL archive should be shared by master-replica and replica-replica,
>>>> and recovery_target_timeline should be set to latest in replica-replica.
>>>> If you configure that way, replica-replica would successfully reconnect to
>>>> master-replica with no need to restart it.
>>>
>>> I had set the archive_command on the primary, then produced a base
>>> backup which would have copied the archive settings, but I also added
>>> a corresponding restore_command setting, so everything was pointing
>>> at the same archive.
>>
>> Hmm.. when doing the same, the replica-replica successfully reconnected
>> to the master-replica after I shutdown the master-master and promoted the
>> master-replica. archive_command is the same in three servers,
>> restore_command is the same in two standby servers (i.e., master-replica
>> and replica-replica), and recovery_target_timeline is set to 'latest' in two
>> standby servers.
>
> I didn't shut down the master-master, but I didn't expect to need to.
>
> I also had recovery_target_timeline set to latest.  I also tried
> explicitly setting it to the new timeline, and got an error saying
> there was no such timeline.

What did the replica-replica do after you got that error? Did it keep
repeating the error? Emit a PANIC and exit? Get stuck? Successfully
reconnect to the master-replica? ....

In theory, the timeline gap should be resolved as follows:

1. promote master-replica, which terminates cascade replication.
2. while replica-replica is repeatedly trying to reconnect to master-replica,
   if it finds a new timeline history file in the archive, it adjusts its
   timeline to the new one.
3. as the result of promotion, master-replica increments its timeline,
   creates the timeline history file and archives it.
4. finally replica-replica finds the new timeline history file in the
   archive, adjusts its timeline to the new one, and successfully
   reconnects to the master-replica.

Note that you might see the timeline mismatch error a few times
before replication is successfully restarted, because the
replica-replica can retry before the new timeline history file has
reached the archive.
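
To make the sequence and the timing window concrete, here is a toy
model in Python. It is only a sketch of the logic described above, not
PostgreSQL code: the class and function names (Archive, Server,
try_reconnect) are made up for illustration, and a "reconnect" is
reduced to comparing timeline IDs.

```python
# Toy model of the timeline catch-up described above; NOT PostgreSQL code.
# All names here (Archive, Server, try_reconnect) are hypothetical.

class Archive:
    """Shared WAL archive; tracks only which timeline history files exist."""

    def __init__(self):
        self.history_files = set()

    def latest_timeline(self, current):
        # Return the newest archived timeline above `current`, else `current`.
        newer = [t for t in self.history_files if t > current]
        return max(newer, default=current)


class Server:
    def __init__(self, timeline=1):
        self.timeline = timeline

    def bump_timeline(self):
        # Promotion increments the server's timeline.
        self.timeline += 1

    def archive_history(self, archive):
        # The history file for the new timeline lands in the shared archive.
        archive.history_files.add(self.timeline)


def try_reconnect(standby, upstream, archive):
    """One retry: adopt a newer archived timeline, then compare timelines."""
    standby.timeline = archive.latest_timeline(standby.timeline)
    return standby.timeline == upstream.timeline


archive = Archive()
master_replica = Server()
replica_replica = Server()

# Promotion: the timeline is incremented first ...
master_replica.bump_timeline()
# ... and a retry in this window still sees a timeline mismatch.
first_attempt = try_reconnect(replica_replica, master_replica, archive)

# Once the history file is archived, the next retry succeeds.
master_replica.archive_history(archive)
second_attempt = try_reconnect(replica_replica, master_replica, archive)
```

In this model the first retry fails and the second succeeds, which
corresponds to seeing the mismatch error before replication restarts.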

>
>>> But in any case, shouldn't the replication connection be
>>> terminated when pg_basebackup is terminated?
>>
>> +1 To do this, we would need to define SIGINT signal handler and make it
>> send QueryCancel packet when Ctrl-C is typed.
>
> Also could we provide some feedback when using the -c spread option,
> when there isn't progress within a short period of time?  Something
> like "Waiting for checkpoint.  This can take up to
> %checkpoint_timeout%", or something similar, rather than seeing
> nothing happening and wondering if something has gone wrong.

+1, at least for the case where the -P option is specified in pg_basebackup.

Regards,

--
Fujii Masao

