Re: Strange issues with 9.2 pg_basebackup & replication - Mailing list pgsql-hackers
From | Thom Brown |
---|---|
Subject | Re: Strange issues with 9.2 pg_basebackup & replication |
Date | |
Msg-id | CAA-aLv7RcQMX+k6eFaqNK8By1CPySNZfp1jWTmnOBb=rcJDZ8A@mail.gmail.com |
In response to | Re: Strange issues with 9.2 pg_basebackup & replication (Josh Berkus <josh@agliodbs.com>) |
Responses | Re: Strange issues with 9.2 pg_basebackup & replication |
List | pgsql-hackers |
On 13 May 2012 16:08, Josh Berkus <josh@agliodbs.com> wrote:
> More issues: promoting intermediate standby breaks replication.
>
> To be a bit blunt here, has anyone tested cascading replication *at all*
> before this?
>
> So, same setup as previous message.
>
> 1. Shut down master-master.
>
> 2. pg_ctl promote master-replica
>
> 3. replication breaks. error message on replica-replica:
>
> FATAL: timeline 2 of the primary does not match recovery target timeline 1
>
> 4. No amount of adjustment on replica-replica will get it replicating
> again.
>
> Note that replica-replica was configured with:
>
> recovery_target_timeline = 'latest'

I can recreate this "issue", although the docs say:

"Promoting a cascading standby terminates the immediate downstream
replication connections which it serves. This is because the timeline
becomes different between standbys, and they can no longer continue
replication. The affected standby(s) may reconnect to reestablish
streaming replication."
(http://www.postgresql.org/docs/9.2/static/warm-standby.html#CASCADING-REPLICATION)

However, this isn't true when I restart the standby. I've been
informed that this should work fine if a WAL archive has been
configured (which should be used anyway).

But one new problem I appear to have is that once I set up archiving
and restart, then try pg_basebackup, it gets stuck and never shows any
progress. If I terminate pg_basebackup in this state and attempt to
restart it more times than max_wal_senders, it can no longer run,
because pg_basebackup didn't disconnect the stream, so it ends up using
all senders. And these show up in pg_stat_replication.

My theory is that if I enable archiving, restart postgres, and then
generate enough WAL that there is a file or two in the archive,
pg_basebackup can't stream anything. Once I restart the server, it's
fine and continues as normal. This has the same symptoms as the
"pg_basebackup from running standby with streaming" issue.
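For anyone following along, the walsender exhaustion described above can be checked from the shell. This is a minimal sketch, not something from the original message: the function names `senders_exhausted` and `report_wal_senders` are hypothetical, and it assumes `psql` can reach the server with its default connection settings (PGHOST, PGPORT, PGUSER). It counts the rows in pg_stat_replication and compares that with max_wal_senders.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and psql is assumed to
# connect with its default settings (PGHOST, PGPORT, PGUSER).

# Succeeds (exit status 0) when every walsender slot is taken.
senders_exhausted() {
    in_use=$1
    limit=$2
    [ "$in_use" -ge "$limit" ]
}

# Ask the server how many walsenders are connected and what the limit is,
# then print a one-line summary.
report_wal_senders() {
    in_use=$(psql -Atc "SELECT count(*) FROM pg_stat_replication") || return 1
    limit=$(psql -Atc "SHOW max_wal_senders") || return 1
    if senders_exhausted "$in_use" "$limit"; then
        echo "walsenders: $in_use/$limit in use, no slot left for pg_basebackup"
    else
        echo "walsenders: $in_use/$limit in use"
    fi
}
```

Running `report_wal_senders` after each aborted pg_basebackup attempt should show whether the count climbs toward the limit, which is the behaviour reported here.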
Steps to recreate:

1) initdb new cluster
2) start new cluster
3) make archive dir (in my case, /tmp/arch) and set the following:

    wal_level = hot_standby
    max_wal_senders = 3
    archive_mode = on
    archive_command = 'cp %p /tmp/arch/%f'

4) Set pg_hba.conf to allow streaming replication connections
5) Restart the cluster
6) Create a table and insert a few hundred thousand rows until
   /tmp/arch shows some WAL files
7) Run: pg_basebackup -x stream -D s1 -Pv

This actually does finish eventually but it appears to need some
encouragement by generating some WAL and issuing a checkpoint:

    thom@swift:~/Development$ time pg_basebackup -x stream -D s1 -Pv
    xlog start point: 0/4000020
    pg_basebackup: starting background WAL receiver
    53951/53951 kB (100%), 1/1 tablespace
    xlog end point: 0/5DE15E0
    pg_basebackup: waiting for background process to finish streaming...
    pg_basebackup: base backup completed

    real 2m37.456s
    user 0m0.016s
    sys 0m0.724s

If I terminate pg_basebackup and restart it without generating
additional WAL, it doesn't appear to release the streaming connection
ever (or not within my patience limit of a few minutes). And I can't
free these connections without restarting the cluster.

But once I get the standby up and running and acting as a hot standby,
and ignore the current issue with it getting stuck creating a standby
from a standby, I still get the mismatched timeline issue, so the
addition of WAL archiving didn't appear to resolve this for me.

-- 
Thom
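The steps to recreate above can be sketched as a script. This is an illustration under stated assumptions rather than the exact commands used in the message: the function names, the data directory `./repro_data`, the table name `repro_t`, the row count and the trust-auth pg_hba.conf line are all hypothetical, and it assumes the 9.2 binaries are on PATH and the default port is free. Only the four settings, the /tmp/arch path and the pg_basebackup invocation come from the steps themselves.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Function names, ./repro_data, repro_t
# and the pg_hba.conf line are assumptions, not taken from the message.

# Step 3: create the archive directory and append the settings.
write_repro_settings() {
    datadir=$1
    archive=$2
    mkdir -p "$archive"
    cat >> "$datadir/postgresql.conf" <<EOF
wal_level = hot_standby
max_wal_senders = 3
archive_mode = on
archive_command = 'cp %p $archive/%f'
EOF
}

# Steps 1-7 end to end. Not run automatically; call run_repro by hand.
run_repro() {
    datadir=${1:-./repro_data}
    archive=${2:-/tmp/arch}
    initdb -D "$datadir" || return 1                    # step 1
    pg_ctl -D "$datadir" -w start || return 1           # step 2
    write_repro_settings "$datadir" "$archive"          # step 3
    # Step 4: allow local replication connections (trust auth, test use only).
    echo "local replication all trust" >> "$datadir/pg_hba.conf"
    pg_ctl -D "$datadir" -w restart || return 1         # step 5
    # Step 6: generate WAL; repeat if the archive is still empty afterwards.
    psql -d postgres -c \
        "CREATE TABLE repro_t AS SELECT generate_series(1, 500000) AS id"
    ls "$archive"
    pg_basebackup -x stream -D s1 -Pv                   # step 7
}
```

Keeping the start in step 2 separate from the restart in step 5 is deliberate, since the theory earlier in the message is that the restart with archiving enabled is part of what triggers the stall.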