Re: Fix 035_standby_logical_decoding.pl race conditions - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Fix 035_standby_logical_decoding.pl race conditions |
Date | |
Msg-id | CAA4eK1J5FmEbcuioPgywYx3aBsNkYfFg_xs4kk3QWfk-CcdCXg@mail.gmail.com Whole thread Raw |
Responses |
Re: Fix 035_standby_logical_decoding.pl race conditions
|
List | pgsql-hackers |
On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Please find attached a patch to $SUBJECT. > > In rare circumstances (and on slow machines) it is possible that a xl_running_xacts > is emitted and that the catalog_xmin of a logical slot on the standby advances > past the conflict point. In that case, no conflict is reported and the test > fails. It has been observed several times and the last discussion can be found > in [1]. > Is my understanding correct that bgwriter on primary node has created a xl_running_xacts, then that record is replicated to standby, and while decoding it (xl_running_xacts) on standby via active_slot, we advanced the catalog_xmin of active_slot? If this happens then the replay of vacuum record on standby won't be able to invalidate the active slot, right? So, if the above is correct, the reason for generating extra xl_running_xacts on primary is Vacuum followed by Insert on primary via below part of test: $node_primary->safe_psql( 'testdb', qq[VACUUM $vac_option verbose $to_vac; INSERT INTO flush_wal DEFAULT VALUES;]); > To avoid the race condition to occur this commit adds an injection point to prevent > the catalog_xmin of a logical slot to advance past the conflict point. > > While working on this patch, some adjustements have been needed for injection > points (they are proposed in 0001): > > - Adds the ability to wakeup() and detach() while ensuring that no process can > wait in between. It's done thanks to a new injection_points_wakeup_detach() > function that is holding the spinlock during the whole duration. > > - If the walsender is waiting on the injection point and that the logical slot > is conflicting, then the walsender process is killed and so it is not able to > "empty" it's injection slot. So the next injection_wait() should reuse this slot > (instead of using an empty one). injection_wait() has been modified that way > in 0001. > > With 0001 in place, then we can make use of an injection point in > LogicalConfirmReceivedLocation() and update 035_standby_logical_decoding.pl to > prevent the catalog_xmin of a logical slot to advance past the conflict point. > > Remarks: > > R1. The issue still remains in v16 though (as injection points are available since > v17). > This is not idle case because the test would still keep failing intermittently on 16. I am wondering what if we start a transaction before vacuum and do some DML in it but didn't commit that xact till the active_slot test is finished then even the extra logging of xl_running_xacts shouldn't advance xmin during decoding. This is because reorder buffer may point to the xmin before vacuum. See following code: SnapBuildProcessRunningXacts() .... xmin = ReorderBufferGetOldestXmin(builder->reorder); if (xmin == InvalidTransactionId) xmin = running->oldestRunningXid; elog(DEBUG3, "xmin: %u, xmax: %u, oldest running: %u, oldest xmin: %u", builder->xmin, builder->xmax, running->oldestRunningXid, xmin); LogicalIncreaseXminForSlot(lsn, xmin); ... Note that I have not tested this case, so I could be wrong. But if possible, we should try to find some solution which could be backpatched to 16 as well. -- With Regards, Amit Kapila.
pgsql-hackers by date: