Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Fair enough. But I'd still like an explanation of why only about
>> half of the population is showing a failure here. Seems like every
>> machine should be seeing the LSN as moving backwards in this test.
>> So (a) why aren't they all failing, and (b) should we change the
>> test to make sure every platform sees that happening?
> Every machine sees the LSN moving backwards, but the code path that
> had the assertion only reached if it decides to interpolate, which is
> timing dependent: there needs to be a future sample in the lag
> tracking buffer, which I guess is not the case in those runs.
I'm dissatisfied with this explanation because if it's just timing,
it doesn't seem very likely that some machines would reproduce the
failure every single time while others never would. Maybe that can be
blamed on kernel scheduler vagaries + different numbers of cores, but
I can't escape the feeling that there's something here we've not
fully understood.
While chasing after this earlier today, I turned on some debug logging
and noted that the standby's reports look like
2017-04-23 15:46:46.206 EDT [34829] LOG: database system is ready to accept read only connections
2017-04-23 15:46:46.212 EDT [34834] LOG: fetching timeline history file for timeline 2 from primary server
2017-04-23 15:46:46.212 EDT [34834] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-23 15:46:46.213 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34834] LOG: replication terminated by primary server
2017-04-23 15:46:46.214 EDT [34834] DETAIL: End of WAL reached on timeline 1 at 0/3028470.
2017-04-23 15:46:46.214 EDT [34834] LOG: sending write 0/3028470 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34830] LOG: new target timeline is 2
2017-04-23 15:46:46.214 EDT [34834] LOG: restarted WAL streaming at 0/3000000 on timeline 2
2017-04-23 15:46:46.228 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470
So you're right that the standby's reported "write" position can go
backward, but it seems pretty darn odd that the flush and apply
positions didn't go backward too. Is there a bug there?
I remain of the opinion that if we can't tell from the transmitted
data whether a timeline switch has caused the position to go backward,
then that's a protocol shortcoming that ought to be fixed.
regards, tom lane