Re: A failure of standby to follow timeline switch - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: A failure of standby to follow timeline switch
Date
Msg-id 697adab0-a3fe-e1cb-436b-3a8eaa9a2266@oss.nttdata.com
In response to A failure of standby to follow timeline switch  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: A failure of standby to follow timeline switch
List pgsql-hackers

On 2020/12/09 17:43, Kyotaro Horiguchi wrote:
> Hello.
> 
> We found a behavioral change (which seems to be a bug) in recovery at
> PG13.
> 
> The following steps might seem somewhat strange, but the replication
> code deliberately copes with this case.  This is a sequence seen while
> operating an HA cluster using Pacemaker.
> 
> - Run initdb to create a primary.
> - Set archive_mode=on on the primary.
> - Start the primary.
> 
> - Create a standby using pg_basebackup from the primary.
> - Stop the standby.
> - Stop the primary.
> 
> - Put standby.signal in the primary's data directory, then start it.
> - Promote the primary.
> 
> - Start the standby.
> 
> 
> Until PG12, the primary signals end-of-timeline to the standby and
> switches to the next timeline.  Since PG13, that no longer happens and
> the standby keeps requesting the segment of the older timeline, which
> no longer exists.
> 
> FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been removed
> 
> It is because WalSndSegmentOpen() can fail to detect a timeline switch
> on a historic timeline, due to the use of the wrong variable in the
> check.  It uses state->seg.ws_segno, which seems to be a thinko
> introduced when the surrounding code was refactored in 709d003fbd.
> 
> The first patch detects the wrong behavior.  The second small patch
> fixes it.

Thanks for reporting this! This looks like a bug.
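
For reference, the reported sequence can be scripted roughly as follows. This is only a sketch of the steps quoted above; the data directories, ports, and the no-op archive_command are illustrative assumptions, not taken from the report:

```shell
# Sketch of the reported reproduction steps. Paths, ports, and
# archive_command are assumed for illustration.
PRIMARY=/tmp/primary
STANDBY=/tmp/standby

# Create and start the primary with archiving enabled.
initdb -D "$PRIMARY"
cat >> "$PRIMARY/postgresql.conf" <<EOF
archive_mode = on
archive_command = '/bin/true'
port = 5432
EOF
pg_ctl -D "$PRIMARY" -l "$PRIMARY.log" start

# Create a standby from the primary, then stop the standby and the primary.
pg_basebackup -D "$STANDBY" -R -d "port=5432"
echo "port = 5433" >> "$STANDBY/postgresql.conf"
pg_ctl -D "$STANDBY" -l "$STANDBY.log" start
pg_ctl -D "$STANDBY" stop
pg_ctl -D "$PRIMARY" stop

# Restart the old primary as a standby, then promote it,
# which creates a new timeline.
touch "$PRIMARY/standby.signal"
pg_ctl -D "$PRIMARY" start
pg_ctl -D "$PRIMARY" promote

# Start the standby; it should follow the timeline switch via streaming,
# but per the report it keeps requesting the old timeline's segment.
pg_ctl -D "$STANDBY" start
```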

When I applied the two patches to the master branch and
ran "make check-world", I got the following error.

============== creating database "contrib_regression" ==============
# Looks like you planned 37 tests but ran 36.
# Looks like your test exited with 255 just after 36.
t/001_stream_rep.pl ..................
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/37 subtests
...
Test Summary Report
-------------------
t/001_stream_rep.pl                (Wstat: 65280 Tests: 36 Failed: 0)
   Non-zero exit status: 255
   Parse errors: Bad plan.  You planned 37 tests but ran 36.
Files=21, Tests=239, 302 wallclock secs ( 0.10 usr  0.05 sys + 41.69 cusr 39.84 csys = 81.68 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-recovery-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
t/070_dropuser.pl ......... ok


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


