RE: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Introduce XID age and inactive timeout based replication slot invalidation
Msg-id OS0PR01MB571666018400F782BD1FDD1C940D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (vignesh C <vignesh21@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
On Tuesday, December 24, 2024 8:57 PM Michail Nikolaev <michail.nikolaev@gmail.com>  wrote:

Hi,

> Yesterday I got a strange set of test errors, probably somehow related to
> that patch. It happened on changed master branch (based on
> d96d1d5152f30d15678e08e75b42756101b7cab6) but I don't think my changes were
> affecting it.
> 
> My setup is a little bit tricky: Windows 11 running WSL2 with Ubuntu, meson.
> 
> So, the `recovery` suite started failing on:
> 
> 1) at /src/test/recovery/t/019_replslot_limit.pl line 530.
> 2) at /src/test/recovery/t/040_standby_failover_slots_sync.pl line 198.
> 
> It was failing almost every run, one test or another. I was lurking around
> for about 10 min, and..... it just stopped failing. And I can't reproduce it
> anymore.
> 
> But I have logs of two fails. I am not sure if it is helpful, but decided to
> mail them here just in case.

Thanks for reporting the issue.

After checking the log, I think the failure is caused by the unexpected
behavior of the local system clock.

It's clear from '019_replslot_limit_primary4.log' [1] that the clock went
backwards (the log timestamps drop from 01:37:20.025 to 01:37:19.388), which
made the slot's inactive_since go backwards as well. That's why the last test
case didn't pass.

And for 040_standby_failover_slots_sync, we can see that the standby's clock
lagged behind that of the primary, which caused the inactive_since of the newly
synced slot on the standby to be earlier than the one on the primary.

So, I think it's not a bug in the committed patch but an issue in the testing
environment. Besides, since we have not seen such failures on the buildfarm
(BF), I think it may not be necessary to improve the test cases.

[1]
2024-12-24 01:37:19.967 CET [161409] sub STATEMENT:  START_REPLICATION SLOT "lsub4_slot" LOGICAL 0/0 (proto_version '4',streaming 'parallel', origin 'any', publication_names '"pub"')
...
2024-12-24 01:37:20.025 CET [161447] 019_replslot_limit.pl LOG:  statement: SELECT '0/30003D8' <= replay_lsn AND state ='streaming'
...
2024-12-24 01:37:19.388 CET [161097] LOG:  received fast shutdown request
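
For anyone who wants to check a log for this kind of clock step without reading
it by eye, here is a minimal sketch in Python. It is not part of the test suite
or the patch; the function name find_backward_steps and the shortened excerpt
lines are my own illustration. It assumes each log line starts with a
"YYYY-MM-DD HH:MM:SS.mmm" timestamp (which depends on log_line_prefix) and that
lines appear in the order they were written. A small inversion between
different processes may not always mean the clock moved, so treat a hit as a
hint rather than proof.

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD HH:MM:SS.mmm" timestamp, as seen in the excerpt above.
# Assumption: the server's log_line_prefix puts this timestamp first.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def find_backward_steps(lines):
    """Return (previous, current) timestamp pairs where log time decreases."""
    steps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue  # "..." and continuation lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and ts < prev:
            steps.append((prev, ts))
        prev = ts
    return steps


# Shortened copies of the three timestamped lines quoted in [1].
excerpt = [
    "2024-12-24 01:37:19.967 CET [161409] sub STATEMENT:  START_REPLICATION SLOT",
    "...",
    "2024-12-24 01:37:20.025 CET [161447] 019_replslot_limit.pl LOG:  statement: SELECT",
    "...",
    "2024-12-24 01:37:19.388 CET [161097] LOG:  received fast shutdown request",
]

for before, after in find_backward_steps(excerpt):
    delta = (before - after).total_seconds()
    print(f"clock went back {delta:.3f}s: {before} -> {after}")
```

Run against the excerpt, this reports a single backward step, between the
01:37:20.025 line and the fast shutdown request.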

Best Regards,
Hou zj
