Re: Switching timeline over streaming replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Switching timeline over streaming replication
Msg-id 001701cda237$723d8db0$56b8a910$@kapila@huawei.com
In response to Re: Switching timeline over streaming replication  (Amit Kapila <amit.kapila@huawei.com>)
Responses Re: Switching timeline over streaming replication  (Amit Kapila <amit.kapila@huawei.com>)
List pgsql-hackers
> On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote:
> On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:
> > Thanks for the thorough review! I committed the xlog.c refactoring
> patch
> > now. Attached is a new version of the main patch, comments on specific
> > points below. I didn't adjust the docs per your comments yet, will do
> > that next.
> 
> I have some doubts regarding the comments fixed by you and some more new
> review comments.
> After this I shall focus majorly towards testing of this Patch.
> 

Testing
-----------

Failed Case
--------------
1. Promote the standby to master, then make a standby follow the new master.
2. Stop the standby and the master. Restart the standby first, then the master.
3. Restarting the standby gives the errors below:
E:\pg_git_code\installation\bin>LOG:  database system was shut down in recovery at 2012-10-04 18:36:00 IST
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/176B800
LOG:  redo starts at 0/176B800
LOG:  record with zero length at 0/176BD68
LOG:  database system is ready to accept read only connections
LOG:  streaming replication successfully connected to primary
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
FATAL:  terminating walreceiver process due to administrator command
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG:  out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0

Once this error occurs, the standby keeps reporting it, no matter in which
order the master and standby are restarted or what operations are run on the
master.
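
For readers decoding the message: a WAL segment file name is 24 hex digits,
and the first eight are the timeline ID, so 000000020000000000000001 is a
timeline-2 segment. As I read the message, the page found in it carries
timeline 1, which is lower than the timeline already seen (2). A minimal
sketch in C of how the name decomposes and of that kind of comparison; the
type and function names are hypothetical and are not the server's own code:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fields encoded in a 24-hex-digit WAL segment file name. */
typedef struct
{
    uint32_t tli;   /* timeline ID: first 8 hex digits */
    uint32_t log;   /* log number: middle 8 hex digits */
    uint32_t seg;   /* segment number: last 8 hex digits */
} WalSegName;

/* Returns 1 on success, 0 if the name is not exactly 24 hex digits. */
static int
parse_wal_segment_name(const char *name, WalSegName *out)
{
    unsigned int tli, log, seg;

    if (strlen(name) != 24 ||
        strspn(name, "0123456789ABCDEFabcdef") != 24)
        return 0;
    if (sscanf(name, "%8X%8X%8X", &tli, &log, &seg) != 3)
        return 0;

    out->tli = tli;
    out->log = log;
    out->seg = seg;
    return 1;
}

/*
 * Hypothetical simplification of the failing check: while reading WAL,
 * the timeline ID found on a page must never be lower than the last
 * timeline ID already seen.
 */
static int
page_tli_in_sequence(uint32_t page_tli, uint32_t last_seen_tli)
{
    return page_tli >= last_seen_tli;
}
```

With the values from the log, page_tli_in_sequence(1, 2) is false, which
corresponds to "out-of-sequence timeline ID 1 (after 2)".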


Passed Cases
-------------
1. After promoting the standby as the new master, try to make the previous
   master (having the same WAL as the new master) a standby.
   In this case recovery_target_timeline in recovery.conf is set to latest.
   It is able to connect to the new master and starts streaming as expected.
   - As per expected behavior.
2. After promoting the standby as the new master, try to make the previous
   master (having more WAL than the new master) a standby.
   An error is displayed.
   - As per expected behavior.
3. After promoting the standby as the new master, try to make the previous
   master (having the same WAL as the new master) a standby.
   In this case recovery_target_timeline in recovery.conf is not set.
   The following LOG is displayed:
   LOG:  fetching timeline history file for timeline 2 from primary server
   LOG:  replication terminated by primary server
   DETAIL:  End of WAL reached on timeline 1
   LOG:  walreceiver ended streaming and awaits new instructions
   LOG:  re-handshaking at position 0/1000000 on tli 1
   LOG:  replication terminated by primary server
   DETAIL:  End of WAL reached on timeline 1
   LOG:  walreceiver ended streaming and awaits new instructions
   LOG:  re-handshaking at position 0/1000000 on tli 1
   LOG:  replication terminated by primary server
   DETAIL:  End of WAL reached on timeline 1
   - As per expected behavior.
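
Cases 1 and 3 differ only in whether recovery_target_timeline is set. For
reference, a recovery.conf for the previous master being restarted as a
standby might look roughly like this. It is a sketch only: the connection
string is a placeholder and is not taken from the test setup.

```conf
# recovery.conf on the previous master (illustrative values only)
standby_mode = 'on'
primary_conninfo = 'host=new-master port=5432'   # placeholder host and port
recovery_target_timeline = 'latest'              # set in case 1; left out in case 3
```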
 


Pending cases which need to be tested (these are scenarios; I will do some
more testing based on them)
---------------------------------------
1. a. Master M-1
   b. Standby S-1 follows M-1
   c. Standby S-2 follows M-1
   d. Promote S-1 as master
   e. Try to make S-2 follow S-1 -- the operation should succeed.

2. a. Master M-1
   b. Standby S-1 follows M-1
   c. Stop S-1 and M-1
   d. Do PITR on M-1 two times. This is to increment the timeline on M-1.
   e. Try to make standby S-1 follow M-1 -- it should succeed.

3. a. Master M-1
   b. Standbys S-1 and S-2 follow M-1
   c. Standbys S-3 and S-4 follow S-1
   d. Promote the standby which has the highest WAL.
   e. Make all standbys follow the new master.

4. a. Master M-1
   b. Synchronous standbys S-1 and S-2
   c. Promote S-1
   d. Make M-1 and S-2 follow S-1 -- this operation should succeed.

Concurrent Operations
---------------------------
1. a. Master M-1, standby S-1 follows M-1, standby S-2 follows M-1
   b. Many concurrent operations on master M-1
   c. During the concurrent operations, promote S-1
   d. Try to make S-2 follow S-1 -- it should happen successfully.
2. During promotion, call pg_basebackup.
3. During promotion, try to connect a client.

Resource Testing
------------------
1. a. Make the standby follow a master which is many timelines ahead
   b. Observe if there is any resource leak
   c. Allow streaming replication to run for 30 mins
   d. Observe if there is any resource leak
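
For pending case 3, picking the standby with the highest WAL means comparing
WAL positions of the form shown in the logs (for example 0/176B800), where the
part before the slash is the high 32 bits and the part after it the low 32
bits. A minimal sketch in C of that comparison, assuming the positions have
already been collected from each standby as text (for instance with
pg_last_xlog_receive_location()); the helper names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Parse a WAL position written as "hi/lo" in hex. Returns 1 on success. */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi, lo;
    char         trailing;

    /* Exactly two fields must match; a third means trailing garbage. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return 0;

    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}

/*
 * Return the index of the entry with the highest WAL position,
 * or -1 if no entry could be parsed. Ties keep the earliest entry.
 */
static int
most_advanced_standby(const char *const positions[], int n)
{
    int      best = -1;
    uint64_t best_lsn = 0;

    for (int i = 0; i < n; i++)
    {
        uint64_t lsn;

        if (!parse_lsn(positions[i], &lsn))
            continue;           /* skip unparsable entries */
        if (best < 0 || lsn > best_lsn)
        {
            best = i;
            best_lsn = lsn;
        }
    }
    return best;
}
```

This only orders positions within one timeline; it says nothing about whether
the standbys have diverged.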
 

Code Review
-------------
Libpqrcv_readtimelinehistoryfile()
{
..
..
+       if (PQnfields(res) != 2 || PQntuples(res) != 1)
+       {
+               int                     ntuples = PQntuples(res);
+               int                     nfields = PQnfields(res);
+
+               PQclear(res);
+               ereport(ERROR,
+                               (errmsg("invalid response from primary server"),
+                                errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+                                                  ntuples, nfields)));
+       }

..
}

The error message says that 3 fields are expected in the timeline history
response, but the check is done for 2 fields.
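
One way to keep the check and the message from drifting apart is to derive
both from the same constants. A minimal sketch in C, assuming the check
against 2 fields is the intended one (that is for the patch author to
confirm); the function and macro names are hypothetical, and the sketch
deliberately avoids libpq and ereport so that it stands alone:

```c
#include <stdio.h>

/* Assumption: the check (2 fields) is right and the message is stale. */
#define EXPECTED_TUPLES 1
#define EXPECTED_FIELDS 2

/*
 * Returns 1 if the result has the expected shape. Otherwise writes a
 * detail message built from the same constants into detail and returns 0.
 */
static int
check_result_shape(int ntuples, int nfields, char *detail, size_t detail_len)
{
    if (ntuples == EXPECTED_TUPLES && nfields == EXPECTED_FIELDS)
        return 1;

    snprintf(detail, detail_len,
             "Expected %d tuple with %d fields, got %d tuples with %d fields.",
             EXPECTED_TUPLES, EXPECTED_FIELDS, ntuples, nfields);
    return 0;
}
```

In the patch itself the same effect could be had by passing the expected
counts as arguments to errdetail rather than hard-coding them in the string.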


Kindly let me know if you want me to focus on any other areas for testing
this feature.

With Regards,
Amit Kapila.



