Re: Strange issues with 9.2 pg_basebackup & replication - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: Strange issues with 9.2 pg_basebackup & replication |
Date | |
Msg-id | CAHGQGwF86Mbp7z9Heuc93hWsrJ46JMvcLikDXc6O0adXG0V=+w@mail.gmail.com |
In response to | Re: Strange issues with 9.2 pg_basebackup & replication (Thom Brown <thom@linux.com>) |
List | pgsql-hackers |
On Thu, May 17, 2012 at 1:07 AM, Thom Brown <thom@linux.com> wrote:
> On 16 May 2012 11:36, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, May 16, 2012 at 2:29 AM, Thom Brown <thom@linux.com> wrote:
>>> On 15 May 2012 13:15, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Wed, May 16, 2012 at 1:36 AM, Thom Brown <thom@linux.com> wrote:
>>>>> However, this isn't true when I restart the standby. I've been
>>>>> informed that this should work fine if a WAL archive has been
>>>>> configured (which should be used anyway).
>>>>
>>>> The WAL archive should be shared by master-replica and replica-replica,
>>>> and recovery_target_timeline should be set to latest in replica-replica.
>>>> If you configure that way, replica-replica would successfully reconnect to
>>>> master-replica with no need to restart it.
>>>
>>> I had set the archive_command on the primary, then produced a base
>>> backup which would have copied the archive settings, but I also added
>>> a corresponding recovery_command setting, so everything was pointing
>>> at the same archive.
>>
>> Hmm.. when doing the same, the replica-replica successfully reconnected
>> to the master-replica after I shutdown the master-master and promoted the
>> master-replica. archive_command is the same in three servers,
>> restore_command is the same in two standby servers (i.e., master-replica
>> and replica-replica), and recovery_target_timeline is set to 'latest' in two
>> standby servers.
>
> I didn't shut down the master-master, but I didn't expect to need to.
>
> I also had recovery_target_timeline set to latest. I also tried
> explicitly setting it to the new timeline, and got an error saying
> there was no such timeline.

What did the replica-replica do after you got such an error? Repeated
such an error? Emit PANIC error and exited? Got stuck? Successfully
reconnected to the master-replica? ....

In theory, the gap of timeline should be resolved as follows:

1. promote master-replica, which terminates cascade replication.
2. while replica-replica is repeating to reconnect to master-replica,
   if it finds new timeline history file in the archive, it adjusts its
   timeline to new one.
3. as the result of promotion, master-replica increments its timeline,
   creates the timeline history file and archives it.
4. finally replica-replica finds new timeline history file in the archive,
   adjusts its timeline to new one, and successfully reconnects to the
   master-replica.

Note that you might see the timeline mismatch error some times before
replication is successfully restarted because of the timing problem.

>>> But in any case, shouldn't the replication connection be
>>> terminated when pg_basebackup is terminated?
>>
>> +1
>>
>> To do this, we would need to define SIGINT signal handler and make it
>> send QueryCancel packet when Ctrl-C is typed.
>
> Also could we provide some feedback when using the -c spread option,
> when there isn't progress within a short period of time? Something
> like "Waiting for checkpoint. This can take up to
> %checkpoint_timeout%", or something similar, rather than seeing
> nothing happening and wondering if something has gone wrong.

+1, at least for the case where -P option is specified in pg_basebackup.

Regards,

-- 
Fujii Masao
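
The four-step timeline resolution described above can be sketched as a toy simulation. This is not PostgreSQL code: the `Archive` and `Server` classes, `try_reconnect`, `simulate`, and the `archive_delay` parameter are all made up for illustration, and the archive is reduced to a set of timeline IDs that have a history file. It only tries to show why a standby with recovery_target_timeline = 'latest' may hit a few timeline mismatch errors before the history file reaches the shared archive.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Hypothetical shared WAL archive; only timeline history files are modelled."""
    history_files: set = field(default_factory=set)

    def latest_timeline(self, current: int) -> int:
        # Highest timeline with an archived history file,
        # never lower than the caller's current timeline.
        return max(self.history_files | {current})


@dataclass
class Server:
    name: str
    timeline: int = 1


def try_reconnect(standby: Server, upstream: Server, archive: Archive) -> bool:
    # Steps 2 and 4: before connecting, the standby adopts the newest
    # timeline it can find in the archive ('latest' behaviour, simplified).
    standby.timeline = archive.latest_timeline(standby.timeline)
    return standby.timeline == upstream.timeline


def simulate(archive_delay: int, max_attempts: int = 10) -> list:
    """Return the standby's log; the history file is archived only after
    `archive_delay` failed attempts (an assumed stand-in for the timing gap)."""
    archive = Archive()
    upstream = Server("master-replica")
    standby = Server("replica-replica")

    # Steps 1 and 3 (first half): promotion moves the upstream to a new timeline.
    upstream.timeline += 1

    log = []
    for attempt in range(max_attempts):
        if attempt == archive_delay:
            # Step 3 (second half): the history file finally lands in the archive.
            archive.history_files.add(upstream.timeline)
        if try_reconnect(standby, upstream, archive):
            log.append(f"connected on timeline {standby.timeline}")
            break
        log.append("timeline mismatch")
    return log


if __name__ == "__main__":
    for line in simulate(archive_delay=2):
        print(line)
```

With `archive_delay=2` the standby logs two mismatches and then connects, which corresponds to the "timing problem" noted above; with `archive_delay=0` it connects on the first attempt.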