Thread: Issue observed in cascade standby setup and analysis for same

Issue observed in cascade standby setup and analysis for same

From

Amit Kapila

Date:

06 September 2012, 17:07:25

<div class="WordSection1"><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Below
mentionedissue might be current behavior of PostgreSQL, but I am not able to find that in documentation so I have done
briefanalysis as well.</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><br
/><b><u>Issue</u></b>:cascade standby server raise errors as “<u>FATAL:  the database system is starting up</u>” when
tryingto connect a client. Cascaded standby server is started on database which is received from Standby node using
pg_basebackuptool. </span><p><b><u><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">Steps to
reproduce</span></u></b><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif"">: </span><p><span
style="font-size:10.0pt;font-family:"Arial","sans-serif"">Mhost is primary, S host is standby and CS host is cascaded
standby.<br /><br />1.        Set up postgresql-9.2beta2 on  all hosts. <br />2.        Execute command initdb on host
Mto create fresh database. <br />3.        Modify the configure file postgresql.conf on host M like this</span><span
lang="JA"style="font-size:10.0pt;font-family:SimSun">：</span><span
style="font-size:10.0pt;font-family:"Arial","sans-serif""><br/>           listen_addresses = 'M' <br />           port
=15210 <br />           wal_level = hot_standby   <br />           max_wal_senders = 4 <br />           hot_standby =
on<br />4.        modify the configure file pg_hba.conf on host M  like this</span><span lang="JA"
style="font-size:10.0pt;font-family:SimSun">：</span><spanstyle="font-size:10.0pt;font-family:"Arial","sans-serif""><br
/>           host    replication     repl             M/24            md5 <br />5.        Start the server on host M as
primary.<br />6.        Connect one client to primary server and create a user ‘repl’ <br />                   create
userrepl superuser password '123'; <br />7.        Use the command pg_basebackup on the host S to retrieve database of
primaryhost <br />                    pg_basebackup  -D /opt/t38917/data -F p -x fetch -c fast -l repl_backup -P -v -h
M-p 15210 -U repl –W <br />8.        Copy one recovery.conf.sample from share folder of package to database folder of
thehost S. Then rename this file to recovery.conf <br />9.        Modify the file recovery.conf on host S as below: <br
/>            standby_mode = on <br />             primary_conninfo = 'host=M port=15210 user=repl password=123' <br
/>10.       Modify the file postgresql.conf on host S as follow: <br />             listen_addresses = 'S' <br />11.  
    Start the server on host S as standby server. <br />12.        Use the command pg_basebackup on the host CS to
retrievedatabase of standby host <br />             pg_basebackup  -D /opt/t38917/data -F p -x fetch -c fast -l
repl_backup-P -v -h M -p 15210 -U repl –W <br />13.        Modify the file recovery.conf on host CS as below: <br
/>                 standby_mode = on <br />                  primary_conninfo = 'host=S port=15210 user=repl
password=123'<br />14.        Modify the file postgresql.conf on host S as follow: <br
/>                  listen_addresses= 'CS' <br />15.        Start the server on host CS as Cascaded standby server
node.<br />16.        Try to connect a client to host CS but it gives error as: <br />                        FATAL:
 thedatabase system is starting up <br /><br /><b><u>Observation based on code analysis:</u></b><br />In the above
scenarioit is observed that Start-up process has read all data (in our defect scenario minRecoveryPoint is 5016220)
tillthe position 5016220 and then it goes and check for recovery consistency by following condition in function
CheckRecoveryConsistency:<br />        if (!reachedConsistency && <br />               
XLByteLE(minRecoveryPoint,EndRecPtr) && <br />               
XLogRecPtrIsInvalid(ControlFile->backupStartPoint))<br /><br />At this point first two conditions are true but last
conditionis not true because still redo has not been applied and hence backupStartPoint has not been reset. So it does
notsignal postmaster regarding consistent stage. After this it goes and applies the redo and then reset
backupStartPointand then it goes to read next set of record. Since all records have been already read, so it starts
waitingfor the new record from the Standby node. But since there is no new record from Standby node coming so it keeps
waitingfor that and it does not get chance to recheck the recovery consistent level. And hence client connection does
notget allowed.</span><p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Arial","sans-serif"">Suggestions?</span><pclass="MsoNormal"><span
style="font-size:10.0pt;font-family:"Arial","sans-serif""> </span><pclass="MsoNormal"><span
style="font-size:10.0pt;font-family:"Arial","sans-serif"">WithRegards,</span><p class="MsoNormal"><span
style="font-size:10.0pt;font-family:"Arial","sans-serif"">AmitKapila.</span></div>

Re: Issue observed in cascade standby setup and analysis for same

From

Josh Berkus

Date:

06 September 2012, 19:28:27

On 9/6/12 7:06 AM, Amit Kapila wrote:
> 1.        Set up postgresql-9.2beta2 on  all hosts.

Did you retest this with 9.2rc1?  Beta2 was a while ago ....

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Issue observed in cascade standby setup and analysis for same

From

Amit Kapila

Date:

07 September 2012, 17:57:18

On Thursday, September 06, 2012 9:58 PM Josh Berkus wrote:
On 9/6/12 7:06 AM, Amit Kapila wrote:
>> 1.        Set up postgresql-9.2beta2 on  all hosts.

> Did you retest this with 9.2rc1?  Beta2 was a while ago ....
 Tested in 9.2rc1, the problem occurs incase I use database and backup of 9.2 Beta2. However when created fresh
databaseand backup, it doesn't occur.  This problem doesn't occur every time, so I will try more to reproduce it on 9.2
RC1database as well. 

With Regards,
Amit Kapila.

Re: Issue observed in cascade standby setup and analysis for same

From

Amit Kapila

Date:

12 September 2012, 14:37:47

On Friday, September 07, 2012 8:24 PM Amit Kapila wrote:
On Thursday, September 06, 2012 9:58 PM Josh Berkus wrote:
On 9/6/12 7:06 AM, Amit Kapila wrote:
>>> 1.        Set up postgresql-9.2beta2 on  all hosts.

>> Did you retest this with 9.2rc1?  Beta2 was a while ago ....

>  Tested in 9.2rc1, the problem occurs incase I use database and backup of 9.2 Beta2. However when created fresh
> database and backup, it doesn't occur.  This problem doesn't occur every time, so I will try more to reproduce it on
> 9.2 RC1 database as well.

According to my analysis, this defect is not apparently visible in 9.2RC1 due to commit
b8b69d89905e04b910bcd65efce1791477b45d35by Tom. 

This can be visible in 9.2RC1 as well if checkpoint interval is long. So the conclusion point is on cascaded standby,
theclients will not be allowed to connect till the Checkpoint interval of master incase there is no activity on master. 

So according to my analysis this defect still holds for 9.2RC1 as well and I shall raise a bug for it.

With Regards,
Amit Kapila.