Thread: Re: [HACKERS] make async slave to wait for lsn to be replayed
Intro
==========
The main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application. We cannot store this LSN inside the database, since reads are distributed across all replicas and the primary.
https://www.postgresql.org/message-id/195e2d07ead315b1620f1a053313f490%40postgrespro.ru

Suggestions
==========
Many proposals have been made for how this feature might look. I aggregate them into the following four types.

1) Classic (wait_classic_v1.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple events, standalone WAIT
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
where event is one of:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v1.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar, standalone AFTER
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==========
advantages: no new words in grammar; works like pg_last_wal_replay_lsn; no snapshot needed
disadvantages: a little harder to remember the names

SELECT pg_waitlsn('LSN', timeout);
SELECT pg_waitlsn_infinite('LSN');
SELECT pg_waitlsn_no_wait('LSN');

4) Brackets style: Kondratov
https://www.postgresql.org/message-id/a8bff0350a27e0a87a6eaf0905d6737f%40postgrespro.ru
==========
advantages: only one new word in grammar; works like VACUUM and REINDEX; parameters can be extended without grammar changes
disadvantages: none noted

WAIT (LSN '16/B374D848', TIMEOUT 100);
BEGIN WAIT (LSN '16/B374D848' [, etc_options]);
...
COMMIT;

Consequence
==========
Below I provide the implementation of patches for the first three types. I propose to discuss this feature again.

Regards

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
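To make the goal concrete: a minimal sketch of the read-your-writes flow, assuming the option-1 syntax above (the table name and LSN value are placeholders):

    -- on the primary: perform the write, then capture its LSN in the application
    INSERT INTO tbl VALUES (42);
    SELECT pg_current_wal_lsn();              -- say it returns '0/306EE20'

    -- on an async replica: wait for that LSN to be replayed before reading
    BEGIN WAIT FOR ANY LSN '0/306EE20', TIMEOUT 1000;
    SELECT * FROM tbl;                        -- now sees the insert above
    COMMIT;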
On Tue, 28 Feb 2023 at 05:13, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
>
> Below I provide the implementation of patches for the first three types.
> I propose to discuss this feature again.

Oof, that doesn't really work with the cfbot. It tries to apply all three patches, and of course the second and third fail to apply.

In any case, this seems like a lot of effort to me. I would suggest you just pick one avenue, provide that patch for discussion, and ask whether people would prefer any of the alternative syntaxes.

Fwiw I prefer the functions approach. I do like me some nice syntax, but I don't see any particular advantage of the special syntax in this case. It doesn't seem to provide any additional expressiveness.

That said, I'm not a fan of the specific function names. Remember that we have polymorphic functions, so you could probably just have an optional argument:

pg_lsn_wait('LSN', [timeout]) returns boolean
(just call it with a timeout of 0 to do a no-wait)

I'll set the patch to "Waiting on Author" for now. If you feel you're still looking for more opinions from others, maybe set it back to Needs Review. But honestly there are a lot of patches, so you probably won't see much this commitfest unless you have a patch that shows in cfbot.cputube.org as applying and which looks ready to commit.

--
greg
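A sketch of Greg's single polymorphic function in use (pg_lsn_wait() is the name proposed above, not an existing function; the LSN value is a placeholder):

    SELECT pg_lsn_wait('0/306EE20');          -- timeout omitted: wait until replayed
    SELECT pg_lsn_wait('0/306EE20', 60000);   -- wait at most 60 s; returns false on timeout
    SELECT pg_lsn_wait('0/306EE20', 0);       -- timeout 0: no-wait, just report whether replayed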
On Wed, Mar 01, 2023 at 03:31:06PM -0500, Greg Stark wrote:
> Fwiw I prefer the functions approach. I do like me some nice syntax
> but I don't see any particular advantage of the special syntax in this
> case. They don't seem to provide any additional expressiveness.

So do I, even though I see the point that sticking to a function or procedure approach ties the wait to more MVCC rules, like the fact that the wait may hold a snapshot for longer than necessary. The grammar can be made extensible without more keywords by using DefElems; still, I'd like to think that we should not introduce more restrictions in the parser if we have ways to work around them. Using a procedure or function approach is more extensible in its own way, and it also depends on the data being waited for (there could be more than one field for a single wait pattern?).

> I'll set the patch to "Waiting on Author" for now. If you feel you're
> still looking for more opinions from others maybe set it back to Needs
> Review but honestly there are a lot of patches so you probably won't
> see much this commitfest unless you have a patch that shows in
> cfbot.cputube.org as applying and which looks ready to commit.

While looking at all the patches proposed, I have noticed that all the approaches force a wakeup of the waiters in the redo loop of the startup process for each record, before reading the next record. It strikes me that there is some interaction with custom resource managers here, where it would be possible to poke at the waiters not for each record, but after reading some specific records. Something out-of-core would not be as responsive as the per-record approach, but still responsive enough that the waiters wait for an acceptable amount of time, depending on the frequency of the records generated by a primary to wake them up. Just something that popped into my mind while looking a bit at the threads.

--
Michael
Attachment
On 28.02.23 11:10, Kartyshov Ivan wrote:
> 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
> https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
> https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
> ==========
> advantages: no new words in grammar; works like pg_last_wal_replay_lsn;
> no snapshot needed
> disadvantages: a little harder to remember the names
> SELECT pg_waitlsn('LSN', timeout);
> SELECT pg_waitlsn_infinite('LSN');
> SELECT pg_waitlsn_no_wait('LSN');

Of the presented options, I prefer this one. (Maybe with a "_" between "wait" and "lsn".)

But I wonder how a client is going to get the LSN. How would all of this be used by a client?

I can think of a scenario where you have an application that issues a bunch of SQL commands and you have some kind of pooler in the middle that redirects those commands to different hosts, and what you really want is to have it transparently behave as if it's just a single host. Do we want to inject a bunch of "SELECT pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?

I'm tempted to think this could be a protocol-layer facility. Every query automatically returns the current LSN, and every query can also send along an LSN to wait for, and the client library would just keep track of the LSN for (what it thinks of as) the connection. So you get some automatic serialization without having to modify your client code.

That said, exposing this functionality using functions could be a valid step in that direction, so that you can at least build out the actual internals of the functionality and test it out.
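For concreteness, the injection Peter describes might look like this (pg_get_lsn() and pg_wait_lsn() are hypothetical names from his message, not existing functions; the LSN value is a placeholder):

    -- pooler, on the write connection (primary), after each write:
    SELECT pg_get_lsn();                      -- remember this session's last-write LSN

    -- pooler, on a read connection (replica), before forwarding the next read:
    SELECT pg_wait_lsn('0/306EE20');          -- block until that LSN is replayed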
Here is a new patch for the feature discussed above.

WAIT FOR procedure - waits for a certain LSN to be replayed
==========
Synopsis
==========
SELECT pg_wait_lsn('LSN', timeout) returns boolean
With timeout = 0 it waits indefinitely, and with timeout = 1 it just checks whether the LSN has been replayed.

How to use it
==========
See the usage sketch following this message.

Greg Stark wrote:
> That said, I'm not a fan of the specific function names. Remember that
> we have polymorphic functions so you could probably just have an
> optional argument:

If you have any examples, I will be glad to see them. My searches have not been fruitful.

Michael Paquier wrote:
> While looking at all the patches proposed, I have noticed that all the
> approaches proposed force a wakeup of the waiters in the redo loop of
> the startup process for each record, before reading the next record.
> It strikes me that there is some interaction with custom resource
> managers here, where it is possible to poke at the waiters not for
> each record, but after reading some specific records. Something
> out-of-core would not be as responsive as the per-record approach,
> still responsive enough that the waiters wait on input for an
> acceptable amount of time, depending on the frequency of the records
> generated by a primary to wake them up. Just something that popped
> into my mind while looking a bit at the threads.

I'll work on this idea to have less impact on the redo system.

On 2023-03-02 13:33, Peter Eisentraut wrote:
> But I wonder how a client is going to get the LSN. How would all of
> this be used by a client?

As I wrote earlier, the main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application.

> I'm tempted to think this could be a protocol-layer facility. Every
> query automatically returns the current LSN, and every query can also
> send along an LSN to wait for, and the client library would just keep
> track of the LSN for (what it thinks of as) the connection. So you
> get some automatic serialization without having to modify your client
> code.

Yes, it sounds very tempting, but I think the community would be against it.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
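A usage sketch for this version of pg_wait_lsn(), under the timeout semantics described above (the LSN value is a placeholder):

    SELECT pg_wait_lsn('0/306EE20', 10000);   -- wait at most 10 s; returns false on timeout
    SELECT pg_wait_lsn('0/306EE20', 0);       -- wait indefinitely until the LSN is replayed
    SELECT pg_wait_lsn('0/306EE20', 1);       -- just check whether the LSN is already replayed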
Updated the patch to fix a conflict with master.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Fixed meson.build troubles.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
All rebased and tested
--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi, Ivan!
On Fri, Jun 30, 2023 at 11:32 AM Картышов Иван <i.kartyshov@postgrespro.ru> wrote:
> All rebased and tested
Thank you for continuing to work on this patch.
I see you're concentrating on the procedural version of this feature. But when you're calling a procedure within a normal SQL statement, the executor gets a snapshot and holds it until the procedure finishes. If the WAL record conflicts with this snapshot, the query will be canceled. Alternatively, when hot_standby_feedback = on, the query and the WAL replayer will deadlock (the WAL replayer will wait for the query to finish, and the query will wait for the WAL to be replayed). Do you see this issue? Or do you think I'm missing something?
XLogRecPtr
GetMinWaitedLSN(void)
{
    return state->min_lsn.value;
}
You definitely shouldn't access the fields inside pg_atomic_uint64 directly. In this particular case, you should use pg_atomic_read_u64().
Also, I think there is a race condition: if the target LSN gets replayed between the check below and AddWaitedLSN(), the backend registers itself too late and may keep sleeping even though its LSN has already been replayed.

/* Check if we already reached the needed LSN */
if (cur_lsn >= target_lsn)
    return true;

AddWaitedLSN(target_lsn);
------
Regards,
Alexander Korotkov
Hi!

On Wed, Oct 4, 2023 at 1:22 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I see you're concentrating on the procedural version of this feature. But when you're calling a procedure within a normal SQL statement, the executor gets a snapshot and holds it until the procedure finishes. If the WAL record conflicts with this snapshot, the query will be canceled. Alternatively, when hot_standby_feedback = on, the query and the WAL replayer will deadlock (the WAL replayer will wait for the query to finish, and the query will wait for the WAL to be replayed). Do you see this issue? Or do you think I'm missing something?

I'm sorry, I actually meant hot_standby_feedback = off (hot_standby_feedback = on actually avoids query conflicts). I managed to reproduce this problem.

master: create table test as (select i from generate_series(1,10000) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Needless to say, this is very undesirable behavior. This happens because pg_wait_lsn() has to run within a snapshot like any other function. This is why I think this functionality should be implemented as a separate statement.

Another issue I found is that pg_wait_lsn() hangs on the primary. I think an error should be reported instead.

------
Regards,
Alexander Korotkov
Alexander, thank you for your review and for pointing out these issues. According to them I made some fixes and rebased the whole patch.

But I can't repeat your ERROR, neither with hot_standby_feedback = on nor with hot_standby_feedback = off.

> master: create table test as (select i from generate_series(1,10000) i);
> slave conn1: select pg_wal_replay_pause();
> master: delete from test;
> master: vacuum test;
> master: select pg_current_wal_lsn();
> slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
> slave conn1: select pg_wal_replay_resume();
> slave conn2: ERROR: canceling statement due to conflict with recovery
> DETAIL: User query might have needed to see row versions that must be removed.

Also, I use a little hack to work outside of the snapshot, similar to SnapshotResetXmin.

Patch rebased and ready for review.
Attachment
Hi,
I used the latest code and found some conflicts while applying the patch. Which PG version did you rebase against?
Regards
Bowen Shi
On Thu, Nov 23, 2023 at 5:52 AM Bowen Shi <zxwsbg12138@gmail.com> wrote:
> I used the latest code and found some conflicts while applying the patch. Which PG version did you rebase against?

I've successfully applied the patch on bc3c8db8ae. But I used "patch -p1 < wait_proc_v6.patch"; git am doesn't work.

------
Regards,
Alexander Korotkov
On Mon, Nov 20, 2023 at 1:10 PM Картышов Иван <i.kartyshov@postgrespro.ru> wrote:
> Alexander, thank you for your review and for pointing out these issues.
> According to them I made some fixes and rebased the whole patch.
>
> But I can't repeat your ERROR, neither with hot_standby_feedback = on
> nor with hot_standby_feedback = off.
>
> [reproduction steps trimmed]
>
> Also, I use a little hack to work outside of the snapshot, similar to
> SnapshotResetXmin.
>
> Patch rebased and ready for review.

I've retried my case with v6 and it doesn't fail anymore. But I wonder how safe it is to reset xmin within a user-visible function? We have no guarantee that the function is not called inside a complex query. Then how will the rest of the query work with xmin reset? A separate utility statement still looks like the safer option to me.

------
Regards,
Alexander Korotkov
On 2023-11-27 03:08, Alexander Korotkov wrote:
> I've retried my case with v6 and it doesn't fail anymore. But I
> wonder how safe it is to reset xmin within a user-visible function?
> We have no guarantee that the function is not called inside a
> complex query. Then how will the rest of the query work with xmin
> reset? A separate utility statement still looks like the safer
> option to me.

As you mentioned, we can't guarantee that the function is not called inside a complex query, but we can return the xmin after waiting. Still, you are right that a separate utility statement looks safer. So I want to bring up the discussion of a separate utility statement again.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
We should either raise the discussion of a separate utility statement or find a case where the procedure version fails.

1) Classic (wait_classic_v3.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==========
advantages: no new words in grammar; works like pg_last_wal_replay_lsn
disadvantages: uses the snapshot xmin trick

SELECT pg_waitlsn('LSN', timeout);
SELECT pg_waitlsn_infinite('LSN');
SELECT pg_waitlsn_no_wait('LSN');

Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Dec 8, 2023 at 11:20 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> As you mentioned, we can't guarantee that the function is not called
> inside a complex query, but we can return the xmin after waiting.

Returning the xmin back isn't safe, especially after potentially long waiting. The snapshot could be no longer valid, because the corresponding tuples could have been VACUUM'ed.

------
Regards,
Alexander Korotkov
On Fri, Dec 8, 2023 at 11:46 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> We should either raise the discussion of a separate utility statement
> or find a case where the procedure version fails.
>
> 1) Classic (wait_classic_v3.patch)
> [...]

Nice, but as you stated, it requires new keywords.

> 2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
> [...]

+1 from me.

> 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
> [...]

Nice because of its simplicity. But it is only safe if called within a simple query containing nothing else, and validating this from the function kills the simplicity.

------
Regards,
Alexander Korotkov
On Fri, 8 Dec 2023 at 15:17, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> We should either raise the discussion of a separate utility statement
> or find a case where the procedure version fails.
> [...]

A few of the tests have aborted in CFBot at [1] with:

00000058`9c7ff550 00007ff6`5bdff1f4 postgres!pg_atomic_compare_exchange_u64_impl(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008,
    unsigned int64 * expected = 0x00000058`9c7ff5a0,
    unsigned int64 newval = 0)+0x34
    [c:\cirrus\src\include\port\atomics\generic-msvc.h @ 83]
00000058`9c7ff580 00007ff6`5bdff256 postgres!pg_atomic_read_u64_impl(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x24
    [c:\cirrus\src\include\port\atomics\generic.h @ 323]
00000058`9c7ff5c0 00007ff6`5bdfef67 postgres!pg_atomic_read_u64(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x46
    [c:\cirrus\src\include\port\atomics.h @ 430]
00000058`9c7ff5f0 00007ff6`5bc98fc3 postgres!GetMinWaitedLSN(void)+0x17
    [c:\cirrus\src\backend\commands\wait.c @ 176]
00000058`9c7ff620 00007ff6`5bc82fb9 postgres!PerformWalRecovery(void)+0x4c3
    [c:\cirrus\src\backend\access\transam\xlogrecovery.c @ 1788]
00000058`9c7ff6e0 00007ff6`5bffc651 postgres!StartupXLOG(void)+0x989
    [c:\cirrus\src\backend\access\transam\xlog.c @ 5562]
00000058`9c7ff870 00007ff6`5bfed38b postgres!StartupProcessMain(void)+0xd1
    [c:\cirrus\src\backend\postmaster\startup.c @ 288]
00000058`9c7ff8a0 00007ff6`5bff49fd postgres!AuxiliaryProcessMain(
    AuxProcType auxtype = StartupProcess (0n0))+0x1fb
    [c:\cirrus\src\backend\postmaster\auxprocess.c @ 139]
00000058`9c7ff8e0 00007ff6`5beb7674 postgres!SubPostmasterMain(

More details are available at [2].

[1] - https://cirrus-ci.com/task/5618308515364864
[2] - https://api.cirrus-ci.com/v1/artifact/task/5618308515364864/crashlog/crashlog-postgres.exe_0008_2023-12-08_07-48-37-722.txt

Regards,
Vignesh
Rebased and ready for review. I kept only the following versions (the procedure version was dropped due to irreparable problems).

1) Classic (wait_classic_v4.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v3.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
Added some fixes and rebased.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
2024-01 Commitfest.

Hi, this patch has a CF status of "Needs Review" [1], but it seems there was a CFbot test failure the last time it was run [2]. Please have a look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4221/
[2] https://cirrus-ci.com/task/5618308515364864

Kind Regards,
Peter Smith.
On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
>
> Add some fixes and rebase.

While quickly looking into the patch, I understood the idea of what we are trying to achieve here, and I feel it is a useful feature. But while looking at both the patches I could not quickly differentiate between the two approaches. I believe that internally, at the core, both implement similar wait logic but provide different syntaxes. So if we want to keep both these approaches open for the sake of discussion, it would be better to first create a patch that implements the core approach, i.e. the waiting logic and the other common parts, and then add top-up patches with the two different approaches; that would make review easier.

I also see in v4 that there is no documentation for the syntax part, which makes it even harder to understand. I think this thread is implementing a useful feature, so my suggestion is to add some documentation in v4 and also make it more readable w.r.t. what the clear differences between these two approaches are; maybe adding a commit message will also help.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Updated, rebased, fixed CI, and added documentation. We are left with two different solutions; please help me choose the best one.

1) Classic (wait_classic_v6.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v5.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait for

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
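For comparison, the same wait expressed in both proposed syntaxes, as a sketch (the LSN value and timeout are placeholders):

    -- option 1 (WAIT FOR):
    BEGIN WAIT FOR ANY LSN '0/3002AE8', TIMEOUT 1000;

    -- option 2 (AFTER ... WITHIN):
    BEGIN AFTER '0/3002AE8' WITHIN 1000;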
Intro
==========
The main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application. We cannot store this LSN inside the database, since reads are distributed across all replicas and the primary.

Two implementations of one feature
==========
We are left with two different solutions; please help me choose the best one.

1) Classic (wait_classic_v7.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
Synopsis
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v6.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
Synopsis
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait for

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Examples
==========
standby, postgresql.conf:
  recovery_min_apply_delay = 10s

primary:
  CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
  INSERT INTO tbl VALUES (generate_series(11, 20));
  SELECT pg_current_wal_lsn();

standby:
  BEGIN WAIT FOR LSN '0/3002AE8';
  SELECT * FROM tbl;  -- read fresh insertions
  COMMIT;

Rebased and ready for review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Thu, Mar 7, 2024 at 5:14 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> Two implementations of one feature
> ==========
> We are left with two different solutions; please help me choose the
> best one.
> [...]

+1 for the second one, not only because it avoids new words in the grammar but also because it seems to convey the meaning. I think you can explain in the docs how this feature can be used, basically how one will get the correct LSN value to specify.

As suggested previously, pick one of the approaches (I would advocate the second one) and keep the other as an option by mentioning it in the commit message. I hope to see more reviews/discussion at this stage on the core logic of the feature and on usage, such as how users will get the LSN value to be specified. If possible, state how real-world applications could leverage this feature.

--
With Regards,
Amit Kapila.
Hi!

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> +1 for the second one, not only because it avoids new words in the
> grammar but also because it seems to convey the meaning. I think you
> can explain in the docs how this feature can be used, basically how
> one will get the correct LSN value to specify.

I picked the second option and left only the AFTER clause for the BEGIN statement. I think this should be enough for the beginning.

> As suggested previously, pick one of the approaches (I would advocate
> the second one) and keep the other as an option by mentioning it in
> the commit message. [...]

I've added a paragraph to the docs about the usage. After you make some changes on the primary, you run pg_current_wal_insert_lsn(). Then you connect to the replica and run 'BEGIN AFTER lsn' with the just-obtained LSN. Now you're guaranteed to see the changes made on the primary.

Also, I've significantly reworked other aspects of the patch. The most significant changes are:
1) Waiters are now stored in an array sorted by LSN. This saves us from scanning the whole per-backend array.
2) Waiters are removed from the array immediately once their LSNs are replayed. Otherwise, the WAL replayer would keep scanning the shared memory array until the waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend shmem exit. I think this is preferable over the previous approach of removing from the queue before ProcessInterrupts().
4) There is now a condition to recheck whether the LSN is replayed after adding to the shared memory array. This should protect against race conditions.
5) I've renamed overly generic names for functions and files.

------
Regards,
Alexander Korotkov
Attachment
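A sketch of the documented usage pattern for the reworked patch (the table name and LSN value are placeholders):

    -- on the primary, after your writes:
    SELECT pg_current_wal_insert_lsn();       -- say it returns '0/3F0FF791'

    -- on the replica, before your reads:
    BEGIN AFTER '0/3F0FF791' WITHIN 1000;     -- wait up to 1000 ms for replay
    SELECT * FROM tbl;                        -- guaranteed to see the primary's changes
    COMMIT;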
On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I've decided to put my hands on this patch.
> [...]

I went through this patch another time and made some minor adjustments. Now it looks good; I'm going to push it if there are no objections.

------
Regards,
Alexander Korotkov
Attachment
On Fri, Mar 15, 2024 at 4:20 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [...]
Here is a revised patch version with cosmetic fixes proposed by Alexander Lakhin.
Regards,
Alexander Korotkov
Attachment
On 2024-03-11 13:44, Alexander Korotkov wrote:
> I picked the second option and left only the AFTER clause for the
> BEGIN statement. I think this should be enough for the beginning.

Thank you for your rework of the patch. Here I made some fixes:
0) autocomplete
1) fewer jumps
2) more description, and added cases to the docs

I think it will be useful to have a stand-alone statement. Why would you like to see only the AFTER clause for the BEGIN statement?

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On 2024-03-15 22:59, Kartyshov Ivan wrote:
> On 2024-03-11 13:44, Alexander Korotkov wrote:
>> I picked the second option and left only the AFTER clause for the
>> BEGIN statement. I think this should be enough for the beginning.
>
> Thank you for your rework of the patch. Here I made some fixes:
> 0) autocomplete
> 1) fewer jumps
> 2) more description, and added cases to the docs
>
> I think it will be useful to have a stand-alone statement.
> Why would you like to see only the AFTER clause for the BEGIN statement?

Rebased and updated the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Mar 15, 2024 at 10:32 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> On 2024-03-15 22:59, Kartyshov Ivan wrote:
> > Thank you for your rework of the patch. Here I made some fixes:
> > 0) autocomplete
> > 1) fewer jumps
> > 2) more description, and added cases to the docs

Thank you!

> > I think it will be useful to have a stand-alone statement.
> > Why would you like to see only the AFTER clause for the BEGIN statement?

Yes, stand-alone statements might also be useful. But I think the best way for this feature to get into core would be to commit a minimal version first. The BEGIN clause is minimally invasive for the syntax and, I believe, covers the most typical use cases. Once we figure out it's OK and have positive feedback from users, we can add more enhancements incrementally.

> Rebased and updated the patch.

Cool, I was just about to ask you to do this.

------
Regards,
Alexander Korotkov
On Sat, Mar 16, 2024 at 2:04 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Rebased and updated the patch.

Thanks for working on this. I took a quick look at the v11 patch. Here are some comments:

1.
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"

Please place executor/spi.h in alphabetical order. Also, look at all the other header files and place them in order.

2. It seems like pgindent is not happy with src/backend/access/transam/xlogrecovery.c and src/backend/commands/waitlsn.c. Please run it to keep BF member koel happy post-commit.

3. This patch changes the SQL explicit transaction statement syntax; is this deviation okay from the SQL standard's perspective?

4. I think some more checks are needed for better input validation.

4.1 With an invalid LSN this succeeds; shouldn't it error out? Or at least add a fast path/quick exit to WaitForLSN()?

BEGIN AFTER '0/0';

4.2 With an unreasonably high future LSN, the BEGIN command waits unboundedly; shouldn't we error out if the specified LSN is beyond pg_last_wal_receive_lsn()?

BEGIN AFTER '0/FFFFFFFF';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

4.3 With an unreasonably high wait time, the BEGIN command waits unboundedly; shouldn't we restrict the wait time to some max value, say a day or so?

SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

5.
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"

postgres.h must be included first, then the system header files, and then all the postgres header files, just like below. See a very recent commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=97d85be365443eb4bf84373a7468624762382059.

+#include "postgres.h"
+#include <float.h>
+#include <math.h>
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"

6.
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path after applying every WAL record, which IMO is a hot path. Has the impact of this change on recovery been measured, perhaps using https://github.com/macdice/redo-bench or similar tools?

7. In continuation of comment #6, why not use condition variables instead of proc latches to make the waiters in WaitLSNSetLatches sleep and wait?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> I went through this patch another time and made some minor
> adjustments. Now it looks good; I'm going to push it if there are
> no objections.

I have a question related to usability: if regular reads (say a SELECT statement, or reads via a function/procedure) need a similar guarantee to see the changes on the standby, do they also always need to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? In other words, shouldn't we think of something for implicit transactions?

In general, it seems this patch has been stuck for a long time on the decision to choose an appropriate UI (syntax), and we thought of moving it forward so that the other parts of the patch could be reviewed/discussed. So, I feel that before pushing this we should see comments from a few (at least two) other senior members who earlier shared their opinions on the syntax. I know we don't have much time left, but OTOH pushing such a change (where we didn't have consensus on the syntax) without much discussion at this point in time could lead to discussions after commit.

--
With Regards,
Amit Kapila.
On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have a question related to usability: if regular reads (say a
> SELECT statement, or reads via a function/procedure) need a similar
> guarantee to see the changes on the standby, do they also always need
> to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"?
> In other words, shouldn't we think of something for implicit
> transactions?

+1 to have support for implicit txns. A strawman solution I can think of is to let the primary send its current insert LSN to the standby every time it sends a bunch of WAL, and have the standby automatically wait for that LSN to be replayed at the start of every implicit txn.

The new BEGIN syntax requires application code changes. This led me to think about how one can achieve read-after-write consistency today in a primary-standby setup. All the logic of this patch, that is, waiting for the standby to pass a given primary LSN, needs to be done in the application code (or in a proxy or a load balancer?). I believe someone might be doing this already; it would be good to hear from them.

> In general, it seems this patch has been stuck for a long time on the
> decision to choose an appropriate UI (syntax) [...]

+1 to gain consensus first on the syntax changes. With this, we might be violating the SQL standard for explicit txn commands (I stand to be corrected about the SQL standard, though).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Amit, Hi Bharath,

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> +1 to gain consensus first on the syntax changes. With this, we might
> be violating the SQL standard for explicit txn commands (I stand to be
> corrected about the SQL standard, though).

Thank you for your feedback. Generally, I agree it's correct to get consensus on the syntax first. And yes, this patch has been here since 2016. We didn't get consensus on the syntax for 8 years. Frankly speaking, I don't see a reason why this shouldn't take another 8 years. At the same time, the ability to wait on a standby until a given LSN is replayed seems like pretty basic and simple functionality. Thus, it's quite frustrating that it has already taken this long, and it is still unclear when/how this could be finished.

My current attempt was to commit a minimal implementation, as little invasive as possible. A new clause for BEGIN doesn't require additional keywords and doesn't introduce additional statements. But yes, this is still new syntax. And yes, Amit, you're right that even if I had committed that, there would still be a high risk of further debates and a revert.

Given my skepticism about agreement over syntax, I'd like to check one more time whether we could go without new syntax at all. There was an attempt to implement waiting for an LSN as a function. But a function holds a snapshot, which could prevent WAL records from being replayed. Releasing the snapshot could break the parent query. But now we have procedures, which need a dedicated statement for the call and can even control transactions. Could we implement a waitlsn procedure that:

1. First checks that it was called with a non-atomic context (that is, it's not called within a transaction). Trigger an error if called in an atomic context.
2. Releases the snapshot to be able to wait without risk of WAL replay getting stuck. The procedure is still called within a snapshot. It's a bit of a hack to release a snapshot, but VACUUM statements already do so.

Amit, Bharath, what do you think about this approach? Is this the way to go?

------
Regards,
Alexander Korotkov
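A sketch of how this procedure-based approach might be used, assuming the pg_wait_lsn() draft that follows in this thread (the LSN value and timeout are placeholders):

    -- allowed: a top-level CALL runs in a non-atomic context
    CALL pg_wait_lsn('0/3F0FF791', 1000);
    BEGIN;
    SELECT * FROM tbl;                        -- sees changes up to the waited LSN
    COMMIT;

    -- rejected: inside a transaction block the CALL runs in an atomic
    -- context, so under the draft's check the procedure raises an error
    BEGIN;
    CALL pg_wait_lsn('0/3F0FF791', 1000);     -- ERROR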
On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [...]
> Could we implement a waitlsn procedure that:
>
> 1. First checks that it was called with a non-atomic context (that is,
> it's not called within a transaction). Trigger an error if called in
> an atomic context.
> 2. Releases the snapshot to be able to wait without risk of WAL replay
> getting stuck. The procedure is still called within a snapshot. It's
> a bit of a hack to release a snapshot, but VACUUM statements already
> do so.

Can you please provide a bit more details, with some example, of the existing problem with functions and how using procedures will resolve it? How will this address the implicit transaction case, or do we have any other workaround for those cases?

--
With Regards,
Amit Kapila.
On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
> Can you please provide a bit more details, with some example, of
> the existing problem with functions and how using procedures will
> resolve it? How will this address the implicit transaction case,
> or do we have any other workaround for those cases?
Please check [1] and [2] for the explanation of the problem with functions.
Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.
We first ensure we're in a non-atomic context, then pop the active snapshot (tricky, but ExecVacuum() does the same). Then we have no active snapshot and it's safe to wait for LSN replay.
if (context->atomic)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
    PopActiveSnapshot();
Assert(!ActiveSnapshotSet());
The function call could be added either before the BEGIN statement or before the implicit transaction.
CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > In general, it seems this patch has been stuck for a long time on the
> > > > decision to choose an appropriate UI (syntax), and we thought of
> > > > moving it further so that the other parts of the patch can be
> > > > reviewed/discussed. So, I feel before pushing this we should see
> > > > comments from a few (at least two) other senior members who earlier
> > > > shared their opinion on the syntax. I know we don't have much time
> > > > left but OTOH pushing such a change (where we didn't have a consensus
> > > > on syntax) without much discussion at this point of time could lead to
> > > > discussions after commit.
> > >
> > > +1 to gain consensus first on the syntax changes. With this, we might
> > > be violating the SQL standard for explicit txn commands (I stand for
> > > correction about the SQL standard though).
> >
> > Thank you for your feedback. Generally, I agree it's correct to get
> > consensus on syntax first. And yes, this patch has been here since
> > 2016. We didn't get consensus for syntax for 8 years. Frankly
> > speaking, I don't see a reason why this shouldn't take another 8
> > years. At the same time the ability to wait on standby given LSN is
> > replayed seems like pretty basic and simple functionality. Thus, it's
> > quite frustrating it already took that long and still unclear when/how
> > this could be finished.
> >
> > My current attempt was to commit minimal implementation as less
> > invasive as possible. A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements. But
> > yes, this is still a new qual. And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
> >
> > Given my skepticism about agreement over syntax, I'd like to check
> > another time if we could go without new syntax at all. There was an
> > attempt to implement waiting for lsn as a function. But function
> > holds a snapshot, which could prevent WAL records from being replayed.
> > Releasing a snapshot could break the parent query. But now we have
> > procedures, which need a dedicated statement for the call and can even
> > control transactions. Could we implement a waitlsn in a procedure
> > that:
> >
> > 1. First, check that it was called with non-atomic context (that is,
> > it's not called within a transaction). Trigger error if called with
> > atomic context.
> > 2. Release a snapshot to be able to wait without risk of WAL replay
> > stuck. Procedure is still called within the snapshot. It's a bit of
> > a hack to release a snapshot, but Vacuum statements already do so.
> >
>
> Can you please provide a bit more detail, with some example, of what
> the existing problem with functions is and how using procedures will
> resolve it? How will this address the implicit transaction case,
> or do we have any other workaround for those cases?
Please check [1] and [2] for the explanation of the problem with functions.
Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.
We first ensure we're in a non-atomic context, then pop an active snapshot (tricky, but ExecuteVacuum() does the same). Then we should have no active snapshot and it's safe to wait for lsn replay.
if (context->atomic)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
    PopActiveSnapshot();
Assert(!ActiveSnapshotSet());
The function call could be added either before the BEGIN statement or before the implicit transaction.
CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
Attachment
On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > 1. First, check that it was called with non-atomic context (that is,
> > > it's not called within a transaction). Trigger error if called with
> > > atomic context.
> > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > stuck. Procedure is still called within the snapshot. It's a bit of
> > > a hack to release a snapshot, but Vacuum statements already do so.
> > >
> >
> > Can you please provide a bit more detail, with some example, of what
> > the existing problem with functions is and how using procedures will
> > resolve it? How will this address the implicit transaction case,
> > or do we have any other workaround for those cases?
>
> Please check [1] and [2] for the explanation of the problem with functions.
>
> Also, please find a draft patch implementing the procedure. The issue
> with the snapshot is addressed with the following lines.
>
> We first ensure we're in a non-atomic context, then pop an active
> snapshot (tricky, but ExecuteVacuum() does the same). Then we should
> have no active snapshot and it's safe to wait for LSN replay.
>
> if (context->atomic)
>     ereport(ERROR,
>             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>              errmsg("pg_wait_lsn() must be only called in non-atomic context")));
>
> if (ActiveSnapshotSet())
>     PopActiveSnapshot();
> Assert(!ActiveSnapshotSet());
>
> The function call could be added either before the BEGIN statement or
> before the implicit transaction.
>
> CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

I haven't thought in detail about whether there are any other problems
with this idea, but it sounds like it should solve the problems you
shared with the function call approach. BTW, if the application has to
know the LSN till where the replica needs to wait anyway, why can't
they simply monitor the pg_last_wal_replay_lsn() value?

--
With Regards,
Amit Kapila.
On Tue, Mar 19, 2024 at 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > 1. First, check that it was called with non-atomic context (that is,
> > > > it's not called within a transaction). Trigger error if called with
> > > > atomic context.
> > > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > > stuck. Procedure is still called within the snapshot. It's a bit of
> > > > a hack to release a snapshot, but Vacuum statements already do so.
> > > >
> > >
> > > Can you please provide a bit more detail, with some example, of what
> > > the existing problem with functions is and how using procedures will
> > > resolve it? How will this address the implicit transaction case,
> > > or do we have any other workaround for those cases?
> >
> > Please check [1] and [2] for the explanation of the problem with functions.
> >
> > Also, please find a draft patch implementing the procedure. The issue
> > with the snapshot is addressed with the following lines.
> >
> > We first ensure we're in a non-atomic context, then pop an active
> > snapshot (tricky, but ExecuteVacuum() does the same). Then we should
> > have no active snapshot and it's safe to wait for LSN replay.
> >
> > if (context->atomic)
> >     ereport(ERROR,
> >             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >              errmsg("pg_wait_lsn() must be only called in non-atomic context")));
> >
> > if (ActiveSnapshotSet())
> >     PopActiveSnapshot();
> > Assert(!ActiveSnapshotSet());
> >
> > The function call could be added either before the BEGIN statement or
> > before the implicit transaction.
> >
> > CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> > CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
>
> I haven't thought in detail about whether there are any other problems
> with this idea, but it sounds like it should solve the problems you
> shared with the function call approach. BTW, if the application has to
> know the LSN till where the replica needs to wait anyway, why can't
> they simply monitor the pg_last_wal_replay_lsn() value?

Amit, thank you for your feedback. Yes, the application can monitor the
pg_last_wal_replay_lsn() value; that's our state-of-the-art solution.
But that's rather inconvenient and adds extra latency and network
traffic. And it can't be wrapped into a server-side function in a
procedural language for the same reasons we can't implement it as a
built-in function.

------
Regards,
Alexander Korotkov
Intro
==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.

Procedure style implementation
==========
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com

CALL pg_wait_lsn('LSN', timeout);

Examples
==========
primary                                   standby
-------                                   --------
                                          postgresql.conf
                                          recovery_min_apply_delay = 10s
CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
INSERT INTO tbl VALUES (generate_series(11, 20));
SELECT pg_current_wal_lsn();
                                          CALL pg_wait_lsn('0/3002AE8', 10000);
                                          BEGIN;
                                          SELECT * FROM tbl; -- read fresh insertions
                                          COMMIT;

Fixed and ready to review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
Bharath Rupireddy, thank you for your review. Here are some points.

On 2024-03-16 10:02, Bharath Rupireddy wrote:
> 4.1 With invalid LSN succeeds, shouldn't it error out? Or at least,
> add a fast path/quick exit to WaitForLSN()?
> BEGIN AFTER '0/0';

In PostgreSQL '0/0' is a valid pg_lsn, but it is always reached.

> 4.2 With an unreasonably high future LSN, BEGIN command waits
> unboundedly, shouldn't we check if the specified LSN is more than
> pg_last_wal_receive_lsn() error out?
> BEGIN AFTER '0/FFFFFFFF';
> SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> BEGIN AFTER :'future_receive_lsn';

This case will give an ERROR because '0/FFFFFFFF' + 1 is an invalid pg_lsn.

> 4.3 With an unreasonably high wait time, BEGIN command waits
> unboundedly, shouldn't we restrict the wait time to some max value,
> say a day or so?
> SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it at 1 day. But we should discuss this limit.

> 6.
> +/* Set all latches in shared memory to signal that new LSN has been
> replayed */
> +void
> +WaitLSNSetLatches(XLogRecPtr curLSN)
> +{
>
> I see this patch is waking up all the waiters in the recovery path
> after applying every WAL record, which IMO is a hot path. Is the
> impact of this change on recovery measured, perhaps using
> https://github.com/macdice/redo-bench or similar tools?
>
> 7. In continuation to comment #6, why not use Conditional Variables
> instead of proc latches to sleep and wait for all the waiters in
> WaitLSNSetLatches?

Waiters are stored in the array sorted by LSN. This helps us to wake
only PIDs with replayed LSN. This saves us from scanning the whole
array. So it's not so hot a path.

Add some fixes:

1) make waiting timeout more simple (as pg_terminate_backend())
2) removed the 1 minute wait because INTERRUPTS don't arrive for a
long time, changed it to 0.5 seconds
3) add more tests
4) added and expanded sections in the documentation
5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equals pg_wait_lsn('0/31B1B60', 0)
6) now a big timeout will be restricted to 1 day (86400000 ms)
CALL pg_wait_lsn('0/34FB5A1', 10000000000);
WARNING: Timeout for pg_wait_lsn() restricted to 1 day

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> > unboundedly, shouldn't we check if the specified LSN is more than
> > pg_last_wal_receive_lsn() error out?
I think limiting wait lsn by current received lsn would destroy the whole value of this feature. The value is to wait till given LSN is replayed, whether it's already received or not.
> > BEGIN AFTER '0/FFFFFFFF';
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn';
>
> This case will give ERROR cause '0/FFFFFFFF' + 1 is invalid pg_lsn
FWIW,
# SELECT '0/FFFFFFFF'::pg_lsn + 1;
?column?
----------
1/0
(1 row)
But I don't see a problem here. On the replica, it's out of our control to check which lsn is good and which is not. We can't check whether the lsn, which is in future for the replica, is already issued by primary.
For the case of wrong lsn, which could cause potentially infinite wait, there is the timeout and the manual query cancel.
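For illustration, a minimal sketch with the draft pg_wait_lsn()
procedure (the LSN here is deliberately unreachable; the error text
follows the draft's ereport() wording at this point in the thread):

    -- Wait at most 1 second for an LSN the standby may never see.
    CALL pg_wait_lsn('FFFFFFFF/FFFFFFFF', 1000);
    -- ERROR:  canceling waiting for LSN due to timeout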
> > 4.3 With an unreasonably high wait time, BEGIN command waits
> > unboundedly, shouldn't we restrict the wait time to some max value,
> > say a day or so?
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>
> Good idea, I put it at 1 day. But we should discuss this limit.
Do you think that specifying the timeout in milliseconds is suitable? I would prefer to switch to seconds (with the ability to specify fractions of a second). This was expressed before by Alexander Lakhin.
> > 6.
> > +/* Set all latches in shared memory to signal that new LSN has been
> > replayed */
> > +void
> > +WaitLSNSetLatches(XLogRecPtr curLSN)
> > +{
> >
> > I see this patch is waking up all the waiters in the recovery path
> > after applying every WAL record, which IMO is a hot path. Is the
> > impact of this change on recovery measured, perhaps using
> > https://github.com/macdice/redo-bench or similar tools?
Ivan, could you do this?
> > 7. In continuation to comment #6, why not use Conditional Variables
> > instead of proc latches to sleep and wait for all the waiters in
> > WaitLSNSetLatches?
>
> Waiters are stored in the array sorted by LSN. This helps us to wake
> only PIDs with replayed LSN. This saves us from scanning the whole
> array. So it's not so hot a path.
+1
This saves us from ConditionVariableBroadcast() every time we replay the WAL record.
> Add some fixes
>
> 1) make waiting timeout more simple (as pg_terminate_backend())
> 2) removed the 1 minute wait because INTERRUPTS don’t arrive for a
> long time, changed it to 0.5 seconds
I don't see this change in the patch. Normally if a process gets a signal, that causes WaitLatch() to exit immediately. It also exits immediately on query cancel. IIRC, this 1 minute timeout is needed to handle some extreme cases when an interrupt is missing. Other places have it equal to 1 minute. I don't see why we should have it different.
> 3) add more tests
> 4) added and expanded sections in the documentation
I don't see this in the patch. I see only a short description in func.sgml, which is definitely not sufficient. We need at least everything we had in the docs before, adjusted to the current procedure-based approach.
> 5) add default variant of timeout
> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)
Does zero here mean no timeout? I think this should be documented. Also, I would prefer to see a timeout by default. Probably one minute would be a good default.
> 6) now big timeout will be restricted to 1 day (86400000ms)
> CALL pg_wait_lsn('0/34FB5A1',10000000000);
> WARNING: Timeout for pg_wait_lsn() restricted to 1 day
I don't think we need to mention individuals who made proposals in the source code comments. Otherwise, our source code would be a crazy mess of names. Also, if this is a restriction, it has to be an error. And it should be a proper full ereport().
------
Regards,
Alexander Korotkov
On 19.03.24 18:38, Kartyshov Ivan wrote:
> CALL pg_wait_lsn('0/3002AE8', 10000);
> BEGIN;
> SELECT * FROM tbl; -- read fresh insertions
> COMMIT;

I'm not endorsing this or any other approach, but I think the timeout
parameter should be of type interval, not an integer with a unit that is
hidden in the documentation.
On 17.03.24 15:09, Alexander Korotkov wrote:
> My current attempt was to commit minimal implementation as less
> invasive as possible. A new clause for BEGIN doesn't require
> additional keywords and doesn't introduce additional statements. But
> yes, this is still a new qual. And, yes, Amit you're right that even
> if I had committed that, there was still a high risk of further
> debates and revert.

I had written in [0] about my questions related to using this with
connection poolers.  I don't think this was addressed at all.  I haven't
seen any discussion about how to make this kind of facility usable in a
full system.  You have to manually query and send LSNs; that seems
pretty cumbersome.  Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

[0]: https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com
Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:
> On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> <i.kartyshov@postgrespro.ru> wrote:
>> > 4.2 With an unreasonably high future LSN, BEGIN command waits
>> > unboundedly, shouldn't we check if the specified LSN is more than
>> > pg_last_wal_receive_lsn() error out?
>
> I think limiting wait lsn by current received lsn would destroy the
> whole value of this feature. The value is to wait till given LSN is
> replayed, whether it's already received or not.

OK, sounds reasonable, I'll roll back the changes.

> But I don't see a problem here. On the replica, it's out of our
> control to check which lsn is good and which is not. We can't check
> whether the lsn, which is in future for the replica, is already issued
> by primary.
>
> For the case of wrong lsn, which could cause potentially infinite
> wait, there is the timeout and the manual query cancel.

Fully agree with this take.

>> > 4.3 With an unreasonably high wait time, BEGIN command waits
>> > unboundedly, shouldn't we restrict the wait time to some max value,
>> > say a day or so?
>> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
>> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>>
>> Good idea, I put it at 1 day. But we should discuss this limit.
>
> Do you think that specifying the timeout in milliseconds is suitable? I
> would prefer to switch to seconds (with the ability to specify
> fractions of a second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

>> > https://github.com/macdice/redo-bench or similar tools?
>
> Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh

This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior: the patched version is faster than REL_13-STABLE and
master.

> I don't see this change in the patch. Normally if a process gets a
> signal, that causes WaitLatch() to exit immediately. It also exits
> immediately on query cancel. IIRC, this 1 minute timeout is needed to
> handle some extreme cases when an interrupt is missing. Other places
> have it equal to 1 minute. I don't see why we should have it
> different.

OK, I'll roll back my changes.

>> 4) added and expanded sections in the documentation
>
> I don't see this in the patch. I see only a short description in
> func.sgml, which is definitely not sufficient. We need at least
> everything we had in the docs before, adjusted to the current
> procedure-based approach.

I didn't find another section where to add the description of
pg_wait_lsn(), so I extended the description at the bottom of the
table.

>> 5) add default variant of timeout
>> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
>> example: pg_wait_lsn('0/31B1B60') equals pg_wait_lsn('0/31B1B60', 0)
>
> Does zero here mean no timeout? I think this should be documented.
> Also, I would prefer to see a timeout by default. Probably one minute
> would be a good default.

Let's discuss this point. The loop in the function WaitForLSN is made
in such a way that only with delay=0 can we wait for the LSN
indefinitely, without a timeout. So the default must be 0.

Please take one more look at the patch.

PS Sorry, a strange bug threw my mails out of the thread.
https://www.postgresql.org/message-id/flat/f2ff071aa9141405bb8efee67558a058%40postgrespro.ru

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?
>
> [0]:
> https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com

I share the same concern as yours and had proposed something upthread
[1]. The idea is something like how each query takes a snapshot at the
beginning of txn/query (depending on isolation level); the same way,
the standby can wait for the primary's current LSN as of the moment of
taking the snapshot. And the primary keeps sending its current LSN as
part of regular WAL to standbys, so that the standbys don't have to
make connections to the primary to know its current LSN every time.
Perhaps this may not even fully guarantee the read-after-write
consistency on standbys unless there's a way for the application to
tell the wait LSN. Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 22, 2024 at 12:50 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 19.03.24 18:38, Kartyshov Ivan wrote:
> > CALL pg_wait_lsn('0/3002AE8', 10000);
> > BEGIN;
> > SELECT * FROM tbl; -- read fresh insertions
> > COMMIT;
>
> I'm not endorsing this or any other approach, but I think the timeout
> parameter should be of type interval, not an integer with a unit that is
> hidden in the documentation.

I'm not sure a timeout needs to deal with the complexity of our
interval datatype.  At the same time, the integer number of
milliseconds looks a bit weird.  Could a float8 number of seconds be an
option?

------
Regards,
Alexander Korotkov
On Fri, Mar 22, 2024 at 12:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 17.03.24 15:09, Alexander Korotkov wrote:
> > My current attempt was to commit minimal implementation as less
> > invasive as possible. A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements. But
> > yes, this is still a new qual. And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?

The current usage pattern of this functionality is the following.

1. Do the write transaction on primary.
2. Query pg_current_wal_insert_lsn() on primary.
3. Call pg_wait_lsn() on replica with the value obtained in the
previous step.
4. Do the read transaction on replica.

This usage pattern could be implemented either at the application level
or at the pooler level. For the application level, it would require a
somewhat advanced level of database-aware programming, but this is
still a valid usage. Regarding poolers, if some poolers manage to
automatically distinguish reading and writing queries, dealing with
LSNs wouldn't be too complex for them.

Having this functionality at the protocol level would be ideal, but
let's do this step-by-step. The built-in procedure isn't very invasive,
and it could give us some adoption and open the way forward.

------
Regards,
Alexander Korotkov
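To make the pattern concrete, a psql-style sketch assuming the draft
pg_wait_lsn() procedure (the LSN value shown is illustrative):

    -- On the primary: do the write, then capture its LSN.
    INSERT INTO tbl VALUES (21);
    SELECT pg_current_wal_insert_lsn();   -- say it returns '0/3002AE8'

    -- On the replica: wait (here up to 10 s) until that LSN is
    -- replayed, then read; the insert above is then visible.
    CALL pg_wait_lsn('0/3002AE8', 10000);
    SELECT * FROM tbl;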
On Sun, Mar 24, 2024 at 4:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> I share the same concern as yours and had proposed something upthread
> [1]. The idea is something like how each query takes a snapshot at the
> beginning of txn/query (depending on isolation level); the same way,
> the standby can wait for the primary's current LSN as of the moment of
> taking the snapshot. And the primary keeps sending its current LSN as
> part of regular WAL to standbys, so that the standbys don't have to
> make connections to the primary to know its current LSN every time.
> Perhaps this may not even fully guarantee the read-after-write
> consistency on standbys unless there's a way for the application to
> tell the wait LSN.

Oh, no.  Please, check [1].  The idea is to wait for a particular
transaction to become visible.  The one who made a change on primary
brings the lsn value from there to the replica.  For instance, an
application made a change on primary and then is willing to run some
report on a replica, and the report should be guaranteed to contain the
change just made.  So, the application queries the LSN from primary
after making the write transaction, then calls pg_wait_lsn() on the
replica before running the report.

This is quite simple server functionality, which could be used at the
application level, ORM level or pooler level.  And it unlocks the way
forward for an in-protocol implementation as proposed by Peter
Eisentraut.

Links.
1. https://www.postgresql.org/message-id/CAPpHfdtny81end69PzEdRsROKnsybsj%3DOs8DUM-6HeKGKnCuQQ%40mail.gmail.com

------
Regards,
Alexander Korotkov
On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your interest in the patch.
> I understand your questions, but I fully support Alexander Korotkov's
> idea to commit the minimal required functionality, and then keep
> working on other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn().  This makes the name look consistent
with other WAL-related functions.  Also it clearly states that we're
waiting for an lsn to be replayed (not received, written or flushed).

Also, I did improvements in the docs, commit message and some minor
code fixes.

I'm continuing to look at this patch.

------
Regards,
Alexander Korotkov
On Thu, Mar 28, 2024 at 2:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
> <i.kartyshov@postgrespro.ru> wrote:
> > Thank you for your interest in the patch.
> > I understand your questions, but I fully support Alexander Korotkov's
> > idea to commit the minimal required functionality, and then keep
> > working on other improvements.
>
> I did further improvements in the patch.
>
> Notably, I decided to rename the procedure to
> pg_wait_for_wal_replay_lsn().  This makes the name look consistent
> with other WAL-related functions.  Also it clearly states that we're
> waiting for an lsn to be replayed (not received, written or flushed).
>
> Also, I did improvements in the docs, commit message and some minor
> code fixes.
>
> I'm continuing to look at this patch.

Sorry, I forgot the attachment.

------
Regards,
Alexander Korotkov
Attachment
> v12

Hi all,

I didn't review the patch but one thing jumped out: I don't think it's
OK to hold a spinlock while (1) looping over an array of backends and
(2) making system calls (SetLatch()).
On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > v12
>
> Hi all,
>
> I didn't review the patch but one thing jumped out: I don't think it's
> OK to hold a spinlock while (1) looping over an array of backends and
> (2) making system calls (SetLatch()).

Good catch, thank you.  Fixed along with other issues spotted by
Alexander Lakhin.

------
Regards,
Alexander Korotkov
Attachment
On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
Fixed along with other issues spotted by Alexander Lakhin.
[I didn't read the whole thread. I'm sorry if I missed something ...]
You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]. I think "for" doesn't add much for the function name and "lsn" is
used in functions that return an LSN (that's not the case here).
postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_pause
Result data type | void
Argument data types |
Type | func
-[ RECORD 2 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_resume
Result data type | void
Argument data types |
Type | func
Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint.
postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]-------+--------------------------------------
Schema | pg_catalog
Name | pg_terminate_backend
Result data type | boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type | func
I also suggest that the timeout unit should be milliseconds, hence, using
bigint is perfectly fine for the timeout argument.
+ <para>
+ Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+ on standby within given timeout. Parameter <parameter>timeout</parameter>
+ is the time in seconds to wait for the <parameter>target_lsn</parameter>
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.
+ </para></entry>
Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]-------+---------------------
> Schema              | pg_catalog
> Name                | pg_wal_replay_pause
> Result data type    | void
> Argument data types |
> Type                | func
> -[ RECORD 2 ]-------+---------------------
> Schema              | pg_catalog
> Name                | pg_wal_replay_resume
> Result data type    | void
> Argument data types |
> Type                | func

Makes sense to me.  I tried to make the new procedure name consistent
with the functions acquiring various WAL positions.  But you're right,
it's better to be consistent with the other functions controlling wal
replay.

> Regarding the arguments, I think the timeout should be bigint. There is at least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]-------+--------------------------------------
> Schema              | pg_catalog
> Name                | pg_terminate_backend
> Result data type    | boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type                | func
>
> I also suggest that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> + <para>
> + Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
> + on standby within given timeout. Parameter <parameter>timeout</parameter>
> + is the time in seconds to wait for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
> + </para></entry>

This generally makes sense, but I'm not sure about this.  The
milliseconds timeout was used initially but received criticism in [1].

Links.
1. https://www.postgresql.org/message-id/b45ff979-9d12-4828-a22a-e4cb327e115c%40eisentraut.org

------
Regards,
Alexander Korotkov
Attachment
Hi, hackers!
On Fri, 29 Mar 2024 at 16:45, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Euler!
On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]-------+---------------------
> Schema | pg_catalog
> Name | pg_wal_replay_pause
> Result data type | void
> Argument data types |
> Type | func
> -[ RECORD 2 ]-------+---------------------
> Schema | pg_catalog
> Name | pg_wal_replay_resume
> Result data type | void
> Argument data types |
> Type | func
Makes sense to me. I tried to make a new procedure name consistent
with functions acquiring various WAL positions. But you're right,
it's better to be consistent with other functions controlling wal
replay.
> Regarding the arguments, I think the timeout should be bigint. There is at least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]-------+--------------------------------------
> Schema | pg_catalog
> Name | pg_terminate_backend
> Result data type | boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type | func
>
> I also suggest that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> + <para>
> + Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
> + on standby within given timeout. Parameter <parameter>timeout</parameter>
> + is the time in seconds to wait for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
> + </para></entry>
This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received criticism in [1].
I see in Postgres we already have different units for timeouts:
e.g in guc's:
wal_receiver_status_interval in seconds
tcp_keepalives_idle in seconds
commit_delay in microseconds
max_standby_archive_delay in milliseconds
vacuum_cost_delay in milliseconds
autovacuum_vacuum_cost_delay in milliseconds
etc..
I haven't counted precisely, but I feel that milliseconds are the most often used in both guc's and functions. So I'd propose using milliseconds for the patch as it was proposed originally.
Regards,
Pavel Borisov
Supabase.
On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
This generally makes sense, but I'm not sure about this. The milliseconds timeout was used initially but received criticism in [1].
Alexander, I see why you changed the patch.
Peter suggested to use an interval but you proposed another data type:
float. The advantage of the interval data type is that you don't need to
carefully think about the unit, however, if you use the integer data
type you have to propose one. (If that's the case, milliseconds is a
good granularity for this feature.) I don't have a strong preference
between integer and interval data types but I don't like the float for
this case. The 2 main reasons are (a) that we treat time units (hours,
minutes, seconds, ...) as integers so it seems natural for a human being
to use a unit time as integer and (b) depending on the number of digits
after the decimal separator you still don't have an integer in the
internal unit, hence, you have to round it to integer.
We already have functions that use integer (such as pg_terminate_backend)
and interval (such as pg_sleep_for), and if I searched correctly it would
be the first timeout argument as a float.
On Fri, Mar 29, 2024 at 6:50 PM Euler Taveira <euler@eulerto.com> wrote:
> On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
>
> This generally makes sense, but I'm not sure about this. The
> milliseconds timeout was used initially but received criticism in [1].
>
>
> Alexander, I see why you changed the patch.
>
> Peter suggested to use an interval but you proposed another data type:
> float. The advantage of the interval data type is that you don't need to
> carefully think about the unit, however, if you use the integer data
> type you have to propose one. (If that's the case, milliseconds is a
> good granularity for this feature.) I don't have a strong preference
> between integer and interval data types but I don't like the float for
> this case. The 2 main reasons are (a) that we treat time units (hours,
> minutes, seconds, ...) as integers so it seems natural for a human being
> to use a unit time as integer and (b) depending on the number of digits
> after the decimal separator you still don't have an integer in the
> internal unit, hence, you have to round it to integer.
>
> We already have functions that use integer (such as pg_terminate_backend)
> and interval (such as pg_sleep_for), and if I searched correctly it would
> be the first timeout argument as a float.

Thank you for the detailed explanation.  Float seconds are used in
pg_sleep(), just similar to the interval in pg_sleep_for().  However,
that's a delay, not exactly a timeout.  Given the precedent of a
milliseconds timeout in pg_terminate_backend(), and your and Pavel's
points, I've switched back to an integer milliseconds timeout.

Some fixes spotted off-list by Alexander Lakhin.
1) We don't need an explicit check for the postmaster being alive as
soon as we pass WL_EXIT_ON_PM_DEATH to WaitLatch().
2) When testing for unreachable LSN, we need to select the LSN well in
advance so that autovacuum couldn't affect that.

I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Attachment
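For reference, a minimal sketch of the interface as it stands at this
point in the thread (bigint millisecond timeout, 0 meaning no timeout;
the LSN value is illustrative):

    -- Wait up to one minute for the standby to replay up to the given
    -- LSN; an error is raised if the timeout elapses first.
    CALL pg_wal_replay_wait('0/306EE20', 60000);

    -- With the default timeout of 0, wait with no time limit.
    CALL pg_wal_replay_wait('0/306EE20');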
Thank you Alexander for working on the patch.  Maybe we should change
some names:
1) test 043_wait_lsn.pl -> 043_waitlsn.pl, like waitlsn.c and waitlsn.h
In waitlsn.c and waitlsn.h variables:
2) target_lsn -> trgLSN, like curLSN
3) lsn -> trgLSN, like curLSN

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Hi!

On Sat, Mar 30, 2024 at 6:14 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you Alexander for working on the patch.  Maybe we should change
> some names:
> 1) test 043_wait_lsn.pl -> 043_waitlsn.pl, like waitlsn.c and waitlsn.h

I renamed that to 043_wal_replay_wait.pl to match the name of the SQL
procedure.

> In waitlsn.c and waitlsn.h variables:
> 2) target_lsn -> trgLSN, like curLSN

I prefer this to match the SQL procedure parameter name.

> 3) lsn -> trgLSN, like curLSN

Done.

Also I implemented termination of wal replay waits on standby
promotion (with test).  In the test I changed recovery_min_apply_delay
to 1s in order to make the test pass faster.

------
Regards,
Alexander Korotkov
Attachment
On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!

Thanks for the patch. I have a few comments on the v16 patch.

1. Is there any specific reason for pg_wal_replay_wait() being a
procedure rather than a function? I haven't read the full thread, but
I vaguely noticed the discussion on the new wait mechanism holding up
a snapshot or some other resource. Is that the reason to use a stored
procedure over a function? If yes, can we specify it somewhere in the
commit message and just before the procedure definition in
system_functions.sql?

2. Is pg_wal_replay_wait the first procedure that postgres provides
out of the box?

3. Defining a procedure for the first time in system_functions.sql,
which is supposed to be for functions, seems a bit unusual to me.

4.
+
+    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }

Why is endtime calculated even for timeout <= 0 only to just skip it
later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

5.
Parameter
+ <parameter>timeout</parameter> is the time in milliseconds to wait
+ for the <parameter>target_lsn</parameter>
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.

It turns out to be "When timeout value equals to zero no timeout is
applied." I guess, we can just say something like the following, which
I picked up from the pg_terminate_backend timeout parameter description.

       <para>
        If <parameter>timeout</parameter> is not specified or zero, this
        function returns if the WAL upto
        <literal>target_lsn</literal> is replayed.
        If the <parameter>timeout</parameter> is specified (in
        milliseconds) and greater than zero, the function waits until the
        server actually replays the WAL upto <literal>target_lsn</literal> or
        until the given time has passed. On timeout, an error is emitted.
       </para></entry>

6.
+        ereport(ERROR,
+                (errcode(ERRCODE_QUERY_CANCELED),
+                 errmsg("canceling waiting for LSN due to timeout")));

We can be a bit more informative here and say targetLSN and currentLSN,
something like - "timed out while waiting for target LSN %X/%X to be
replayed; current LSN %X/%X"?

7.
+    if (context->atomic)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("pg_wal_replay_wait() must be only called in
non-atomic context")));
+

Can we say what a "non-atomic context" is in a user-understandable way,
like explicit txn or whatever that might be? "non-atomic context" might
not sound great to the end user.

8.
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+   could be set to <literal>off</literal>.

Can we just mention to run pg_current_wal_insert_lsn on the primary?

9. To me the following query blocks even though I didn't mention a timeout.
CALL pg_wal_replay_wait('0/fffffff');

10. Can't we do some input validation on the timeout parameter and
emit an error for negative values, just like pg_terminate_backend?
CALL pg_wal_replay_wait('0/ffffff', -100);

11.
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }
+

Can we avoid calling GetCurrentTimestamp in a for loop, which can be
costly at times, especially when pg_wal_replay_wait is called with
larger timeouts on multiple backends? Can't we reuse
pg_terminate_backend's timeout logic in pg_wait_until_termination,
perhaps reducing waittime to 1 msec or so?

12. Why should we let every user run the pg_wal_replay_wait procedure?
Can't we revoke execute from the public in system_functions.sql so
that one can decide who gets to run this function? Per comment #11, one
can easily cause a lot of activity by running this function on hundreds
of sessions.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Bharath,

Thank you for your feedback.

On Sun, Mar 31, 2024 at 8:44 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thanks for the patch. I have a few comments on the v16 patch.
>
> 1. Is there any specific reason for pg_wal_replay_wait() being a
> procedure rather than a function? I haven't read the full thread, but
> I vaguely noticed the discussion on the new wait mechanism holding up
> a snapshot or some other resource. Is that the reason to use a stored
> procedure over a function? If yes, can we specify it somewhere in the
> commit message and just before the procedure definition in
> system_functions.sql?

Surely, there is a reason.  A function should be executed in a
snapshot, which can prevent WAL records from being replayed.  See [1]
for a particular test scenario.  In a procedure we may enforce a
non-atomic context and release the snapshot.  I've mentioned that in
the commit message and in the procedure code.  I don't think
system_functions.sql is the place for this type of comment.  We only
use system_functions.sql to push the default values.

> 2. Is pg_wal_replay_wait the first procedure that postgres provides
> out of the box?

Yes, it appears first.  I see nothing wrong about that.

> 3. Defining a procedure for the first time in system_functions.sql,
> which is supposed to be for functions, seems a bit unusual to me.

From the scope of DDL and the system catalogue, a procedure is just
another kind of function (prokind == 'p').  So, I don't feel wrong
about that.

> 4.
> +
> +    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
> +
> +        if (timeout > 0)
> +        {
> +            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> +            latch_events |= WL_TIMEOUT;
> +            if (delay_ms <= 0)
> +                break;
> +        }
>
> Why is endtime calculated even for timeout <= 0 only to just skip it
> later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

OK, fixed.

> 5.
> Parameter
> + <parameter>timeout</parameter> is the time in milliseconds to wait
> + for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
>
> It turns out to be "When timeout value equals to zero no timeout is
> applied." I guess, we can just say something like the following, which
> I picked up from the pg_terminate_backend timeout parameter description.
>
> <para>
> If <parameter>timeout</parameter> is not specified or zero, this
> function returns if the WAL upto
> <literal>target_lsn</literal> is replayed.
> If the <parameter>timeout</parameter> is specified (in
> milliseconds) and greater than zero, the function waits until the
> server actually replays the WAL upto <literal>target_lsn</literal> or
> until the given time has passed. On timeout, an error is emitted.
> </para></entry>

Applied as you suggested with some edits from me.

> 6.
> +        ereport(ERROR,
> +                (errcode(ERRCODE_QUERY_CANCELED),
> +                 errmsg("canceling waiting for LSN due to timeout")));
>
> We can be a bit more informative here and say targetLSN and currentLSN,
> something like - "timed out while waiting for target LSN %X/%X to be
> replayed; current LSN %X/%X"?

Done this way.  Adjusted other ereport()'s as well.

> 7.
> +    if (context->atomic)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("pg_wal_replay_wait() must be only called in
> non-atomic context")));
> +
>
> Can we say what a "non-atomic context" is in a user-understandable way,
> like explicit txn or whatever that might be? "non-atomic context" might
> not sound great to the end user.

Added errdetail() to this ereport().

> 8.
> +   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
> +   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
> +   to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
> +   could be set to <literal>off</literal>.
>
> Can we just mention to run pg_current_wal_insert_lsn on the primary?

The mention is added.

> 9. To me the following query blocks even though I didn't mention a timeout.
> CALL pg_wal_replay_wait('0/fffffff');

If your primary server is freshly initialized, you need to do quite a
lot of data modifications to reach this LSN.

> 10. Can't we do some input validation on the timeout parameter and
> emit an error for negative values, just like pg_terminate_backend?
> CALL pg_wal_replay_wait('0/ffffff', -100);

Reasonable, added.

> 11.
> +
> +        if (timeout > 0)
> +        {
> +            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> +            latch_events |= WL_TIMEOUT;
> +            if (delay_ms <= 0)
> +                break;
> +        }
> +
>
> Can we avoid calling GetCurrentTimestamp in a for loop, which can be
> costly at times, especially when pg_wal_replay_wait is called with
> larger timeouts on multiple backends? Can't we reuse
> pg_terminate_backend's timeout logic in pg_wait_until_termination,
> perhaps reducing waittime to 1 msec or so?

Normally there shouldn't be many loops; they only happen on spurious
wakeups.  For instance, some process was going to set our latch before
and for another reason, but due to kernel scheduling it does so only
now.  So, normally there is only one wakeup.
pg_wait_until_termination() may sacrifice timeout accuracy due to
possible spurious wakeups and time spent outside of WaitLatch().  I
don't feel it's reasonable to repeat this logic in WaitForLSN(),
especially given that we don't need frequent wakeups here.

> 12. Why should we let every user run the pg_wal_replay_wait procedure?
> Can't we revoke execute from the public in system_functions.sql so
> that one can decide who gets to run this function? Per comment #11, one
> can easily cause a lot of activity by running this function on hundreds
> of sessions.

Generally, if a user can make many connections, then this user can make
them busy and can consume resources.  Given my explanation above,
pg_wal_replay_wait() wouldn't even make the connection busy; it would
just wait on the latch.  I don't see why pg_wal_replay_wait() could do
more harm than pg_sleep().  So, I would leave pg_wal_replay_wait()
public.

Links.
1. https://www.postgresql.org/message-id/CAPpHfdtiGgn0iS1KbW2HTam-1%2BoK%2BvhXZDAcnX9hKaA7Oe%3DF-A%40mail.gmail.com

------
Regards,
Alexander Korotkov
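A sketch of the atomic-context restriction from point 7 above, assuming
the v17 behavior (the LSN is illustrative; any errdetail wording would
come from the patch):

    BEGIN;
    CALL pg_wal_replay_wait('0/306EE20');
    -- ERROR:  pg_wal_replay_wait() must be only called in non-atomic context

    -- Outside an explicit transaction block the call is allowed:
    CALL pg_wal_replay_wait('0/306EE20');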
On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > 9. To me the following query blocks even though I didn't mention a timeout.
> > CALL pg_wal_replay_wait('0/fffffff');
>
> If your primary server is freshly initialized, you need to do quite
> a lot of data modifications to reach this LSN.

Right, but why does pg_wal_replay_wait block without a timeout? It must
return an error saying it can't reach the target LSN, no?

Did you forget to attach the new patch?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 1, 2024 at 5:25 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > 9. To me the following query blocks even though I didn't mention timeout.
> > CALL pg_wal_replay_wait('0/fffffff');
>
> If your primary server is freshly initialized, you need to do quite
> a lot of data modifications to reach this LSN.
Right, but why does pg_wal_replay_wait block without a timeout? It must
return an error saying it can't reach the target LSN, no?
How can the replica know this? It doesn't look feasible to distinguish this situation from the situation when the connection between primary and replica became slow. I'd keep this simple. We have pg_sleep_for(), which waits for the specified time whatever it is. And we have pg_wal_replay_wait(), which waits till the replay LSN grows to the target whatever it is. It's up to the user to specify the correct value.
Did you forget to attach the new patch?
Yes, here it is.
Regards,
Alexander Korotkov
Attachment
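As a side note on bounding waits from the client side: since the wait
exits on query cancel (as stated upthread), a statement_timeout should
also cut it short; a sketch under that assumption (the LSN is
deliberately unreachable):

    SET statement_timeout = '5s';
    CALL pg_wal_replay_wait('FFFFFFFF/FFFFFFFF');  -- cancelled after 5 seconds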
Alexander Korotkov <aekorotkov@gmail.com> writes:

Hello,

> > Did you forget to attach the new patch?
>
> Yes, here it is.
>
> ------
> Regards,
> Alexander Korotkov
>
> [4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+      </indexterm>
+       <function>pg_wal_replay_wait</function> (
+         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+       <returnvalue>void</returnvalue>
+      </para>

Should we return the milliseconds of waiting time? I think this
information may be useful for customers if they want to know how long
it waited, for monitoring purposes.

--
Best Regards
Andy Fan
Hi, Andy!

On Tue, Apr 2, 2024 at 6:29 AM Andy Fan <zhihuifan1213@163.com> wrote:
> > Did you forget to attach the new patch?
> >
> > Yes, here it is.
> >
> > [4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...
>
> +      </indexterm>
> +       <function>pg_wal_replay_wait</function> (
> +         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
> +         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
> +       <returnvalue>void</returnvalue>
> +      </para>
>
> Should we return the milliseconds of waiting time? I think this
> information may be useful for customers if they want to know how long
> it waited, for monitoring purposes.

Please, check it more carefully.  In v17 the timeout is in integer
milliseconds.

------
Regards,
Alexander Korotkov
Hi Alexander!

>> +      </indexterm>
>> +       <function>pg_wal_replay_wait</function> (
>> +         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
>> +         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
>> +       <returnvalue>void</returnvalue>
>> +      </para>
>>
>> Should we return the milliseconds of waiting time? I think this
>> information may be useful for customers if they want to know how long
>> it waited, for monitoring purposes.
>
> Please, check it more carefully.  In v17 the timeout is in integer
> milliseconds.

I guess one of us misunderstood the other :( and I admit I didn't check
the code very carefully.  Actually I meant the "return value" rather
than the function argument.  IIUC the current return value is void per
the statement below.

>> + <returnvalue>void</returnvalue>

If so, when users call pg_wal_replay_wait, they are informed when the
WAL is replayed to the target_lsn, but they can't know how long it
waited unless they check on the application side.  I think such
information would be useful for monitoring purposes sometimes.

select pg_wal_replay_wait(lsn, 1000); may take just 1ms in fact; in
this case, I want this function to return 1.

--
Best Regards
Andy Fan
On 2024-04-02 11:14, Andy Fan wrote: > If so, when users call pg_wal_replay_wait, they get informed when > the WAL is replayed to the target_lsn, but they can't know how long > it waited unless they measure it on the application side. I think such > information would sometimes be useful for monitoring purposes. > > select pg_wal_replay_wait(lsn, 1000); may in fact take just 1 ms; in > this case, I want this function to return 1. Hi Andy, to get the timing we can use \timing in psql. Here is an example. postgres=# \timing Timing is on. postgres=# select 1; ?column? ---------- 1 (1 row) Time: 0.536 ms > <returnvalue>void</returnvalue> And returning VOID is the best option, rather than returning TRUE|FALSE or the timing. It keeps the logic of the procedure very simple: we get an error if the LSN is not reached. For 8 years we have tried to add this feature, and now we suggest that the best way forward is to commit the minimal version first. Let's discuss further improvements in future versions. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > > For 8 years we have tried to add this feature, and now we suggest that > the best way forward is to commit the minimal version first. Just curious, do you or anyone else have an immediate use for this function? If yes, how are they achieving read-after-write consistency on streaming standbys in their application right now without a function like this? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: >> >> For 8 years we have tried to add this feature, and now we suggest that >> the best way forward is to commit the minimal version first. > > Just curious, do you or anyone else have an immediate use for this > function? If yes, how are they achieving read-after-write consistency > on streaming standbys in their application right now without a > function like this? The link [1] may be helpful, and the reasoning there sounds reasonable to me. Actually, we also discussed internally how to ensure "read your writes consistency", and the solution here looks good to me. Glad to know that this patch will be committed very soon. [1] https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com -- Best Regards Andy Fan
On Tue, Apr 2, 2024 at 1:11 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > On 2024-04-02 11:14, Andy Fan wrote: > > If so, when users call pg_wal_replay_wait, they get informed when > > the WAL is replayed to the target_lsn, but they can't know how long > > it waited unless they measure it on the application side. I think such > > information would sometimes be useful for monitoring purposes. > > > > select pg_wal_replay_wait(lsn, 1000); may in fact take just 1 ms; in > > this case, I want this function to return 1. > > Hi Andy, to get the timing we can use \timing in psql. > Here is an example. > postgres=# \timing > Timing is on. > postgres=# select 1; > ?column? > ---------- > 1 > (1 row) > > Time: 0.536 ms > > > > <returnvalue>void</returnvalue> > And returning VOID is the best option, rather than returning TRUE|FALSE > or the timing. It keeps the logic of the procedure very simple: we get an > error if the LSN is not reached. > > For 8 years we have tried to add this feature, and now we suggest that > the best way forward is to commit the minimal version first. > > Let's discuss further improvements in future versions. +1, it seems there is not yet a precedent of a built-in PostgreSQL function returning its own duration, and I don't think we need to introduce such a precedent, at least for now. That would place the responsibility for monitoring resource usage on the application. I'd also like to note that pg_wal_replay_wait() comes with a dedicated wait event. So, one could monitor the average duration of these waits using sampling of wait events. ------ Regards, Alexander Korotkov
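Such sampling could be as simple as periodically running a count over pg_stat_activity. In the sketch below, the wait event name 'WaitForWALReplay' is an assumption about what the dedicated event is called; verify it against pg_stat_activity on an actual build:

#include <stdlib.h>
#include <libpq-fe.h>

/*
 * Count backends currently inside pg_wal_replay_wait() by sampling
 * pg_stat_activity.  The wait event name 'WaitForWALReplay' is an
 * assumption; verify it against your build.  Returns -1 on error.
 */
static int
sample_lsn_waiters(PGconn *conn)
{
    PGresult   *res = PQexec(conn,
                             "SELECT count(*) FROM pg_stat_activity "
                             "WHERE wait_event = 'WaitForWALReplay'");
    int         n = -1;

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        n = atoi(PQgetvalue(res, 0, 0));
    PQclear(res);
    return n;
}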
On Tue, Apr 2, 2024 at 2:47 PM Andy Fan <zhihuifan1213@163.com> wrote: > Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > > > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > > <i.kartyshov@postgrespro.ru> wrote: > >> > >> For 8 years we have tried to add this feature, and now we suggest that > >> the best way forward is to commit the minimal version first. > > > > Just curious, do you or anyone else have an immediate use for this > > function? If yes, how are they achieving read-after-write consistency > > on streaming standbys in their application right now without a > > function like this? > > The link [1] may be helpful, and the reasoning there sounds reasonable > to me. > > Actually, we also discussed internally how to ensure "read your writes > consistency", and the solution here looks good to me. > > Glad to know that this patch will be committed very soon. > > [1] > https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com Thank you for your feedback. I can also confirm that a lot of users would be very happy to have "read your writes consistency" and are ready to do something to achieve it at the application level. However, they typically don't know what exactly they need. So, blogging about pg_wal_replay_wait() and spreading the word about it at conferences would be highly appreciated. ------ Regards, Alexander Korotkov
On 2024-04-02 13:15, Bharath Rupireddy wrote: > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: >> >> For 8 years we have tried to add this feature, and now we suggest that >> the best way forward is to commit the minimal version first. > > Just curious, do you or anyone else have an immediate use for this > function? If yes, how are they achieving read-after-write consistency > on streaming standbys in their application right now without a > function like this? Right now, the application runs pg_current_wal_lsn() on the primary after the update and then loops on the standby, polling pg_last_wal_replay_lsn() until it catches up. Or it uses slow synchronous_commit. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
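For concreteness, that polling workaround might look roughly like the following sketch on the standby side; wait_for_replay, the 10 ms sleep, and the retry cap are illustrative, not from the thread:

#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

/*
 * Sketch of the status-quo workaround: poll pg_last_wal_replay_lsn()
 * on the standby until it reaches the LSN captured on the primary.
 */
static bool
wait_for_replay(PGconn *standby, const char *target_lsn, int max_tries)
{
    const char *values[1] = {target_lsn};

    for (int i = 0; i < max_tries; i++)
    {
        PGresult   *res = PQexecParams(standby,
                                       "SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn",
                                       1, NULL, values, NULL, NULL, 0);
        bool        done = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                            strcmp(PQgetvalue(res, 0, 0), "t") == 0);

        PQclear(res);
        if (done)
            return true;
        usleep(10000);          /* 10 ms between polls */
    }
    return false;
}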
Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a spinlock) to protect the contents of waitLSN -- and it's used to walk an arbitrarily long list of processes waiting ... and also, an arbitrary number of processes could be calling this code. I think using a spinlock for this is unwise, as it'll cause busy-waiting whenever there's contention. Wouldn't it be better to use an LWLock for this? Then the processes would sleep until the lock is freed. While nosing about the code, other things struck me: I think there should be more comments about WaitLSNProcInfo and WaitLSNState in waitlsn.h. In addLSNWaiter it'd be better to assign 'cur' before acquiring the lock. Is a plain array really the most efficient data structure for this, considering that you have to reorder it each time you add an element? Maybe it is, but wouldn't it make sense to use memmove() when adding one element rather than iterating over all the remaining elements to the end of the queue? I think the include list in waitlsn.c could be tightened a bit: @@ -18,28 +18,18 @@ #include <math.h> #include "pgstat.h" -#include "fmgr.h" -#include "access/transam.h" -#include "access/xact.h" #include "access/xlog.h" -#include "access/xlogdefs.h" #include "access/xlogrecovery.h" -#include "catalog/pg_type.h" #include "commands/waitlsn.h" -#include "executor/spi.h" #include "funcapi.h" #include "miscadmin.h" -#include "storage/ipc.h" #include "storage/latch.h" -#include "storage/pmsignal.h" #include "storage/proc.h" #include "storage/shmem.h" -#include "storage/sinvaladt.h" -#include "utils/builtins.h" #include "utils/pg_lsn.h" #include "utils/snapmgr.h" -#include "utils/timestamp.h" #include "utils/fmgrprotos.h" +#include "utils/wait_event_types.h" -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Things are good or bad according to how our opinion makes them" (Lisias)
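For illustration, the memmove() variant suggested above could look like this generic sketch, with uint64_t keys standing in for the real waiter entries keyed by LSN:

#include <stdint.h>
#include <string.h>

/*
 * Generic sketch of the suggestion: locate the slot with a binary
 * search, then shift the tail with a single memmove() instead of
 * walking the remaining elements one by one.  The caller must ensure
 * the array has room for one more element.
 */
static void
sorted_insert(uint64_t *arr, int *nitems, uint64_t key)
{
    int         lo = 0,
                hi = *nitems;

    while (lo < hi)
    {
        int         mid = (lo + hi) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* Shift everything at and after the slot up by one element. */
    memmove(&arr[lo + 1], &arr[lo], (*nitems - lo) * sizeof(uint64_t));
    arr[lo] = key;
    (*nitems)++;
}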
Buildfarm animal mamba (NetBSD 10.0 on macppc) started failing on this with what seems like a bogus compiler warning: waitlsn.c: In function 'WaitForLSN': waitlsn.c:275:24: error: 'endtime' may be used uninitialized in this function [-Werror=maybe-uninitialized] 275 | delay_ms = (endtime - GetCurrentTimestamp()) / 1000; | ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors endtime is indeed initialized further up, but initializing endtime at declaration seems innocent enough and should make this compiler happy and the buildfarm greener. diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c index 6679378156..17ad0057ad 100644 --- a/src/backend/commands/waitlsn.c +++ b/src/backend/commands/waitlsn.c @@ -226,7 +226,7 @@ void WaitForLSN(XLogRecPtr targetLSN, int64 timeout) { XLogRecPtr currentLSN; - TimestampTz endtime; + TimestampTz endtime = 0; /* Shouldn't be called when shmem isn't initialized */ Assert(waitLSN); -- Daniel Gustafsson
> I can also confirm that a lot of users would be very happy to have > "read your writes consistency" and are ready to do something to achieve > it at the application level. However, they typically don't know what > exactly they need. > > So, blogging about pg_wal_replay_wait() and spreading the word about it > at conferences would be highly appreciated. Sure, once it is committed, I promise to do a knowledge-sharing session in our organization and write an article in Chinese. -- Best Regards Andy Fan
Hi, Alvaro! Thank you for your feedback. On Wed, Apr 3, 2024 at 9:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a > spinlock) to protect the contents of waitLSN -- and it's used to walk an > arbitrarily long list of processes waiting ... and also, an arbitrary > number of processes could be calling this code. I think using a > spinlock for this is unwise, as it'll cause busy-waiting whenever > there's contention. Wouldn't it be better to use an LWLock for this? > Then the processes would sleep until the lock is freed. > > While nosing about the code, other things struck me: > > I think there should be more comments about WaitLSNProcInfo and > WaitLSNState in waitlsn.h. > > In addLSNWaiter it'd be better to assign 'cur' before acquiring the > lock. > > Is a plain array really the most efficient data structure for this, > considering that you have to reorder it each time you add an element? > Maybe it is, but wouldn't it make sense to use memmove() when adding one > element rather than iterating over all the remaining elements to the end > of the queue? > > I think the include list in waitlsn.c could be tightened a bit: I've just pushed a commit, which shortens the include list and fixes the order of assigning 'cur' and taking the spinlock in addLSNWaiter(). Regarding the shmem data structure for LSN waiters: I didn't pick LWLock or ConditionVariable, because I needed the ability to wake up only those waiters whose LSN is already replayed. In my experience, waking up a process is way slower than scanning a short flat array. However, I agree that when the number of waiters is very high, a flat array may become a problem. It seems that the pairing heap is not hard to use for shmem structures. The only memory allocation call in pairingheap.c is in pairingheap_allocate(). So, it's only needed to be able to initialize the pairing heap in-place, and it will be fine for shmem. I'll come back shortly with a switch to the pairing heap. ------ Regards, Alexander Korotkov
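For what it's worth, since a pairingheap is just a comparator, an argument pointer, and a root pointer, such an in-place initializer could plausibly be as small as the following sketch; it is written against lib/pairingheap.h and was not existing API at the time of this message:

#include "postgres.h"
#include "lib/pairingheap.h"

/*
 * Sketch of an in-place initializer for a pairing heap living in shared
 * memory: the same effect as pairingheap_allocate(), minus the palloc().
 * Field names follow lib/pairingheap.h; not existing API at the time of
 * writing.
 */
static void
pairingheap_initialize_inplace(pairingheap *heap,
                               pairingheap_comparator compare,
                               void *arg)
{
    heap->ph_compare = compare;
    heap->ph_arg = arg;
    heap->ph_root = NULL;
}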
Hello Alexander, On 2024-Apr-03, Alexander Korotkov wrote: > Regarding the shmem data structure for LSN waiters: I didn't pick > LWLock or ConditionVariable, because I needed the ability to wake up > only those waiters whose LSN is already replayed. In my experience, > waking up a process is way slower than scanning a short flat array. I agree, but I think that's unrelated to what I was saying, which is just the patch I attach here. > However, I agree that when the number of waiters is very high, a flat > array may become a problem. It seems that the pairing heap is not > hard to use for shmem structures. The only memory allocation call in > pairingheap.c is in pairingheap_allocate(). So, it's only needed to > be able to initialize the pairing heap in-place, and it will be fine > for shmem. Ok. With the code as it stands today, everything in WaitLSNState apart from the pairing heap is accessed without any locking. I think this is at least partly OK because each backend only accesses its own entry; but it deserves a comment. Or maybe something more, because WaitLSNSetLatches does modify the entry for other backends. (Admittedly, this could only happen for backends that are already sleeping, and it only happens with the lock acquired, so it's probably okay. But clearly it deserves a comment.) Don't we need to call WaitLSNCleanup() during error recovery or something? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "World domination is proceeding according to plan" (Andrew Morton)
Attachment
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Apr-03, Alexander Korotkov wrote: > > > Regarding the shmem data structure for LSN waiters: I didn't pick > > LWLock or ConditionVariable, because I needed the ability to wake up > > only those waiters whose LSN is already replayed. In my experience, > > waking up a process is way slower than scanning a short flat array. > > I agree, but I think that's unrelated to what I was saying, which is > just the patch I attach here. Oh, sorry for the confusion. I've re-read your message; indeed, you stated this very clearly! I'm good with the patch. The attached revision contains a bit of a commit message. > > However, I agree that when the number of waiters is very high, a flat > > array may become a problem. It seems that the pairing heap is not > > hard to use for shmem structures. The only memory allocation call in > > pairingheap.c is in pairingheap_allocate(). So, it's only needed to > > be able to initialize the pairing heap in-place, and it will be fine > > for shmem. > > Ok. > > With the code as it stands today, everything in WaitLSNState apart from > the pairing heap is accessed without any locking. I think this is at > least partly OK because each backend only accesses its own entry; but it > deserves a comment. Or maybe something more, because WaitLSNSetLatches > does modify the entry for other backends. (Admittedly, this could only > happen for backends that are already sleeping, and it only happens > with the lock acquired, so it's probably okay. But clearly it deserves > a comment.) Please check the attached 0002 patch. I found it easier to move the two assignments we previously moved out of the lock back under the lock; then claim that WaitLSNState.procInfos is also protected by WaitLSNLock. > Don't we need to call WaitLSNCleanup() during error recovery or something? Yes, there is WaitLSNCleanup(). It's currently only called from one place, given that waiting for an LSN can't be invoked from background workers or inside a transaction. ------ Regards, Alexander Korotkov
Hi, Alexander!
On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Apr-03, Alexander Korotkov wrote:
>
> > Regarding the shmem data structure for LSN waiters. I didn't pick
> > LWLock or ConditionVariable, because I needed the ability to wake up
> > only those waiters whose LSN is already replayed. In my experience
> > waking up a process is way slower than scanning a short flat array.
>
> I agree, but I think that's unrelated to what I was saying, which is
> just the patch I attach here.
Oh, sorry for the confusion. I've re-read your message; indeed, you
stated this very clearly!
I'm good with the patch. The attached revision contains a bit of a commit message.
> > However, I agree that when the number of waiters is very high, a flat
> > array may become a problem. It seems that the pairing heap is not
> > hard to use for shmem structures. The only memory allocation call in
> > pairingheap.c is in pairingheap_allocate(). So, it's only needed to
> > be able to initialize the pairing heap in-place, and it will be fine
> > for shmem.
>
> Ok.
>
> With the code as it stands today, everything in WaitLSNState apart from
> the pairing heap is accessed without any locking. I think this is at
> least partly OK because each backend only accesses its own entry; but it
> deserves a comment. Or maybe something more, because WaitLSNSetLatches
> does modify the entry for other backends. (Admittedly, this could only
> happen for backends that are already sleeping, and it only happens
> with the lock acquired, so it's probably okay. But clearly it deserves
> a comment.)
Please check the attached 0002 patch. I found it easier to move the two
assignments we previously moved out of the lock back under the lock; then
claim that WaitLSNState.procInfos is also protected by WaitLSNLock.
Could you re-attach 0002? It seems it failed to attach to the previous message.
Regards,
Pavel
On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com> wrote: >> >> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> > >> > On 2024-Apr-03, Alexander Korotkov wrote: >> > >> > > Regarding the shmem data structure for LSN waiters. I didn't pick >> > > LWLock or ConditionVariable, because I needed the ability to wake up >> > > only those waiters whose LSN is already replayed. In my experience >> > > waking up a process is way slower than scanning a short flat array. >> > >> > I agree, but I think that's unrelated to what I was saying, which is >> > just the patch I attach here. >> >> Oh, sorry for the confusion. I'd re-read your message. Indeed you >> meant this very clearly! >> >> I'm good with the patch. Attached revision contains a bit of a commit message. >> >> > > However, I agree that when the number of waiters is very high and flat >> > > array may become a problem. It seems that the pairing heap is not >> > > hard to use for shmem structures. The only memory allocation call in >> > > paritingheap.c is in pairingheap_allocate(). So, it's only needed to >> > > be able to initialize the pairing heap in-place, and it will be fine >> > > for shmem. >> > >> > Ok. >> > >> > With the code as it stands today, everything in WaitLSNState apart from >> > the pairing heap is accessed without any locking. I think this is at >> > least partly OK because each backend only accesses its own entry; but it >> > deserves a comment. Or maybe something more, because WaitLSNSetLatches >> > does modify the entry for other backends. (Admittedly, this could only >> > happens for backends that are already sleeping, and it only happens >> > with the lock acquired, so it's probably okay. But clearly it deserves >> > a comment.) >> >> Please, check 0002 patch attached. I found it easier to move two >> assignments we previously moved out of lock, into the lock; then claim >> WaitLSNState.procInfos is also protected by WaitLSNLock. > > Could you re-attach 0002. Seems it failed to attach to the previous message. I actually forgot both! ------ Regards, Alexander Korotkov
Attachment
Hello, BTW I noticed that https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html says that lsn_cmp is not covered by the tests. This probably indicates that the tests are a little too light, but I'm not sure how much extra effort we want to spend. I'm still concerned that WaitLSNCleanup is only called in ProcKill. Does this mean that if a process throws an error while waiting, it'll not get cleaned up until it exits? Maybe this is not a big deal, but it seems odd. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Now I have my system running, not a byte was off the shelf; It rarely breaks and when it does I fix the code myself. It's stable, clean and elegant, and lightning fast as well, And it doesn't cost a nickel, so Bill Gates can go to hell."
Hi, Alvaro! Thank you for your attention to this matter. On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > BTW I noticed that > https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html > says that lsn_cmp is not covered by the tests. This probably indicates > that the tests are a little too light, but I'm not sure how much extra > effort we want to spend. I'm aware of this. Ivan promised to send a patch to improve the test. If he doesn't, I'll take care of it. > I'm still concerned that WaitLSNCleanup is only called in ProcKill. > Does this mean that if a process throws an error while waiting, it'll > not get cleaned up until it exits? Maybe this is not a big deal, but it > seems odd. I've added a WaitLSNCleanup() call to AbortTransaction(). Just pushed that together with the improvements upthread. ------ Regards, Alexander Korotkov
I did some experiments with synchronous replication and found that cascade replication can't be synchronous. And pg_wal_replay_wait() gives us read-your-writes consistency on cascade replication. Beyond that, I added more tests for multi-standby replication and cascade replication. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
On 07/04/2024 00:52, Alexander Korotkov wrote: > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> I'm still concerned that WaitLSNCleanup is only called in ProcKill. >> Does this mean that if a process throws an error while waiting, it'll >> not get cleaned up until it exits? Maybe this is not a big deal, but it >> seems odd. > > I've added a WaitLSNCleanup() call to AbortTransaction(). Just pushed > that together with the improvements upthread. Race condition: 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend process to the LSN heap and returns 3. replay: rm_redo record '0/1234' 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr() 5. backend: current replay LSN location is '0/1234', so we exit the loop 6. replay: calls WaitLSNSetLatches() In a nutshell, it's possible for the loop in WaitForLSN to exit without cleaning up the process from the heap. I was able to hit that by adding a delay after the addLSNWaiter() call: > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] I think there's a similar race condition if the timeout is reached at the same time that the startup process wakes up the process. > * At first, we check that pg_wal_replay_wait() is called in a non-atomic > * context. That is, a procedure call isn't wrapped into a transaction, > * another procedure call, or a function call. > * It's pretty unfortunate to have all these restrictions. It would be nice to do: select pg_wal_replay_wait('0/1234'); select * from foo; in a single multi-query call, to avoid the round-trip to the client. You can avoid it with libpq or protocol level pipelining, too, but it's more complicated. > * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic > * context, CallStmt is processed with a snapshot. Thankfully, we can pop > * this snapshot, because PortalRunUtility() can tolerate this. This assumption that PortalRunUtility() can tolerate us popping the snapshot sounds very fishy. I haven't looked at what's going on there, but it doesn't sound like a great assumption. If recovery ends while a process is waiting for an LSN to arrive, does it ever get woken up? The docs could use some copy-editing, but just to point out one issue: > There are also procedures to control the progress of recovery. That's copy-pasted from an earlier sentence in the table that lists functions like pg_promote(), pg_wal_replay_pause(), and pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the progress of recovery like those functions do; it only causes the calling backend to wait. Overall, this feature doesn't feel quite ready for v17, and IMHO should be reverted. It's a nice feature, so I'd love to have it fixed and reviewed early in the v18 cycle. -- Heikki Linnakangas Neon (https://neon.tech)
Hi, Heikki! Thank you for your interest in the subject. On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 07/04/2024 00:52, Alexander Korotkov wrote: > > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> I'm still concerned that WaitLSNCleanup is only called in ProcKill. > >> Does this mean that if a process throws an error while waiting, it'll > >> not get cleaned up until it exits? Maybe this is not a big deal, but it > >> seems odd. > > > > I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed > > that together with the improvements upthread. > > Race condition: > > 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN > 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend > process to the LSN heap and returns > 3. replay: rm_redo record '0/1234' > 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr() > 5. backend: current replay LSN location is '0/1234', so we exit the loop > 6. replay: calls WaitLSNSetLatches() > > In a nutshell, it's possible for the loop in WaitForLSN to exit without > cleaning up the process from the heap. I was able to hit that by adding > a delay after the addLSNWaiter() call: > > > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > > I think there's a similar race condition if the timeout is reached at > the same time that the startup process wakes up the process. Thank you for catching this. I think WaitForLSN() just needs to call deleteLSNWaiter() unconditionally after exit from the loop. > > * At first, we check that pg_wal_replay_wait() is called in a non-atomic > > * context. That is, a procedure call isn't wrapped into a transaction, > > * another procedure call, or a function call. > > * > > It's pretty unfortunate to have all these restrictions. It would be nice > to do: > > select pg_wal_replay_wait('0/1234'); select * from foo; This works for me, except that it needs "call" not "select". # call pg_wal_replay_wait('0/1234'); select * from foo; CALL i --- (0 rows) > in a single multi-query call, to avoid the round-trip to the client. You > can avoid it with libpq or protocol level pipelining, too, but it's more > complicated. > > > * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic > > * context, CallStmt is processed with a snapshot. Thankfully, we can pop > > * this snapshot, because PortalRunUtility() can tolerate this. > > This assumption that PortalRunUtility() can tolerate us popping the > snapshot sounds very fishy. I haven't looked at what's going on there, > but doesn't sound like a great assumption. This is what PortalRunUtility() says about this. /* * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from * under us, so don't complain if it's now empty. Otherwise, our snapshot * should be the top one; pop it. Note that this could be a different * snapshot from the one we made above; see EnsurePortalSnapshotExists. 
*/ So, if VACUUM pops a snapshot when it needs to run without one, then it's probably OK for other utilities. But I agree this decision needs some consensus. > If recovery ends while a process is waiting for an LSN to arrive, does > it ever get woken up? If the recovery target is promote, then the user will get an error. If the recovery target is shutdown, then the connection will get interrupted. If the recovery target is pause, then waiting will continue during the pause. Not sure about the latter case. > The docs could use some copy-editing, but just to point out one issue: > > > There are also procedures to control the progress of recovery. > > That's copy-pasted from an earlier sentence in the table that lists > functions like pg_promote(), pg_wal_replay_pause(), and > pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the > progress of recovery like those functions do; it only causes the calling > backend to wait. > > Overall, this feature doesn't feel quite ready for v17, and IMHO should > be reverted. It's a nice feature, so I'd love to have it fixed and > reviewed early in the v18 cycle. Thank you for your review. I've reverted this. Will repost this for early v18. ------ Regards, Alexander Korotkov
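To make the fix concrete, the reworked wait loop might look like the sketch below. This is only an illustration: it is a fragment of WaitForLSN(), the helper names addLSNWaiter()/deleteLSNWaiter() follow waitlsn.c as quoted in this thread, and wake_events/wait_event_info stand in for values computed earlier in the function.

/*
 * Fragment of WaitForLSN(), sketched: however we leave the loop (LSN
 * reached, timeout, or latch set), we unconditionally remove ourselves
 * from the waiters heap afterwards, so the race described above cannot
 * leave a stale entry behind.
 */
addLSNWaiter(targetLSN);
endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);

for (;;)
{
    long        delay_ms = -1;  /* wait indefinitely by default */
    int         rc;

    /* Recheck after registering, not only before; this closes the race. */
    if (GetXLogReplayRecPtr(NULL) >= targetLSN)
        break;

    if (timeout > 0)
    {
        delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
        if (delay_ms <= 0)
            break;              /* timed out */
    }

    CHECK_FOR_INTERRUPTS();
    rc = WaitLatch(MyLatch, wake_events, delay_ms, wait_event_info);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);
}

/* Always deregister, even if the startup process won the race. */
deleteLSNWaiter();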
On 11/04/2024 18:09, Alexander Korotkov wrote: > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> On 07/04/2024 00:52, Alexander Korotkov wrote: >>> * At first, we check that pg_wal_replay_wait() is called in a non-atomic >>> * context. That is, a procedure call isn't wrapped into a transaction, >>> * another procedure call, or a function call. >>> * >> >> It's pretty unfortunate to have all these restrictions. It would be nice >> to do: >> >> select pg_wal_replay_wait('0/1234'); select * from foo; > > This works for me, except that it needs "call" not "select". > > # call pg_wal_replay_wait('0/1234'); select * from foo; > CALL > i > --- > (0 rows) If you do that from the psql prompt, it works because psql parses and sends it as two separate round-trips. Try: psql postgres -p 5433 -c "call pg_wal_replay_wait('0/4101BBB8'); select 1" ERROR: pg_wal_replay_wait() must be only called in non-atomic context DETAIL: Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function. >> This assumption that PortalRunUtility() can tolerate us popping the >> snapshot sounds very fishy. I haven't looked at what's going on there, >> but it doesn't sound like a great assumption. > > This is what PortalRunUtility() says about this. > > /* > * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from > * under us, so don't complain if it's now empty. Otherwise, our snapshot > * should be the top one; pop it. Note that this could be a different * snapshot from the one we made above; see EnsurePortalSnapshotExists. > */ > > So, if VACUUM pops a snapshot when it needs to run without > one, then it's probably OK for other utilities. But I agree this > decision needs some consensus. Ok, so it's at least somewhat documented that it's fine. >> Overall, this feature doesn't feel quite ready for v17, and IMHO should >> be reverted. It's a nice feature, so I'd love to have it fixed and >> reviewed early in the v18 cycle. > > Thank you for your review. I've reverted this. Will repost this for early v18. Thanks Alexander for working on this. -- Heikki Linnakangas Neon (https://neon.tech)
Hi, Alexander, here I made some improvements according to your discussion with Heikki. On 2024-04-11 18:09, Alexander Korotkov wrote: > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > wrote: >> In a nutshell, it's possible for the loop in WaitForLSN to exit >> without >> cleaning up the process from the heap. I was able to hit that by >> adding >> a delay after the addLSNWaiter() call: >> >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] >> >> I think there's a similar race condition if the timeout is reached at >> the same time that the startup process wakes up the process. > > Thank you for catching this. I think WaitForLSN() just needs to call > deleteLSNWaiter() unconditionally after exit from the loop. Fixed this and added an injection point test for this race condition. > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > wrote: >> The docs could use some copy-editing, but just to point out one issue: >> >> > There are also procedures to control the progress of recovery. >> >> That's copy-pasted from an earlier sentence in the table that lists >> functions like pg_promote(), pg_wal_replay_pause(), and >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control >> the >> progress of recovery like those functions do; it only causes the >> calling >> backend to wait. Fixed the documentation and added extra tests for multi-standby replication and cascade replication. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
Hi, Ivan! On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > > Hi, Alexander, here I made some improvements according to your > discussion with Heikki. > > On 2024-04-11 18:09, Alexander Korotkov wrote: > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > wrote: > >> In a nutshell, it's possible for the loop in WaitForLSN to exit > >> without > >> cleaning up the process from the heap. I was able to hit that by > >> adding > >> a delay after the addLSNWaiter() call: > >> > >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > >> > >> I think there's a similar race condition if the timeout is reached at > >> the same time that the startup process wakes up the process. > > > > Thank you for catching this. I think WaitForLSN() just needs to call > > deleteLSNWaiter() unconditionally after exit from the loop. > > Fixed this and added an injection point test for this race condition. > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > wrote: > >> The docs could use some copy-editing, but just to point out one issue: > >> > >> > There are also procedures to control the progress of recovery. > >> > >> That's copy-pasted from an earlier sentence in the table that lists > >> functions like pg_promote(), pg_wal_replay_pause(), and > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control > >> the > >> progress of recovery like those functions do; it only causes the > >> calling > >> backend to wait. > > Fixed the documentation and added extra tests for multi-standby replication > and cascade replication. Thank you for the revised patch. I see a couple of items which are not addressed in this revision. * As Heikki pointed out, it's currently not possible in one round trip to call pg_wal_replay_wait() and do other work. The attached patch addresses this. It relaxes the requirement: now it's not necessary to be in a non-atomic context; it's only necessary to have no active snapshot. This new requirement works for me so far. I'd appreciate feedback on this. * As Alvaro pointed out, the multiple-waiters case isn't covered by the test suite. That leads to no coverage of some code paths. The attached patch addresses this by adding a test case with multiple waiters. The rest looks good to me. Links. 1. https://www.postgresql.org/message-id/d1303584-b763-446c-9409-f4516118219f%40iki.fi 2. https://www.postgresql.org/message-id/202404051815.eri4u5q6oj26%40alvherre.pgsql ------ Regards, Alexander Korotkov Supabase
Attachment
On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: > > > > Hi, Alexander, Here, I made some improvements according to your > > discussion with Heikki. > > > > On 2024-04-11 18:09, Alexander Korotkov wrote: > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > > wrote: > > >> In a nutshell, it's possible for the loop in WaitForLSN to exit > > >> without > > >> cleaning up the process from the heap. I was able to hit that by > > >> adding > > >> a delay after the addLSNWaiter() call: > > >> > > >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > > >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > > >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > > >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > > >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > > >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > > >> > > >> I think there's a similar race condition if the timeout is reached at > > >> the same time that the startup process wakes up the process. > > > > > > Thank you for catching this. I think WaitForLSN() just needs to call > > > deleteLSNWaiter() unconditionally after exit from the loop. > > > > Fix and add injection point test on this race condition. > > > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > > wrote: > > >> The docs could use some-copy-editing, but just to point out one issue: > > >> > > >> > There are also procedures to control the progress of recovery. > > >> > > >> That's copy-pasted from an earlier sentence at the table that lists > > >> functions like pg_promote(), pg_wal_replay_pause(), and > > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control > > >> the > > >> progress of recovery like those functions do, it only causes the > > >> calling > > >> backend to wait. > > > > Fix documentation and add extra tests on multi-standby replication > > and cascade replication. > > Thank you for the revised patch. > > I see couple of items which are not addressed in this revision. > * As Heikki pointed, that it's currently not possible in one round > trip to call call pg_wal_replay_wait() and do other job. The attached > patch addresses this. It milds the requirement. Now, it's not > necessary to be in atomic context. It's only necessary to have no > active snapshot. This new requirement works for me so far. I > appreciate a feedback on this. > * As Alvaro pointed, multiple waiters case isn't covered by the test > suite. That leads to no coverage of some code paths. The attached > patch addresses this by adding a test case with multiple waiters. > > The rest looks good to me. Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. 1. It's not clear why this test needs node_standby2 at all. It seems useless. 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This location seems to be unachievable in this test. So, it's not clear what race condition this test could potentially detect. 3. I think it would make sense to check for the race condition reported by Heikki. That is to insert the injection point at the beginning of WaitLSNSetLatches(). Links. 
1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 ------ Regards, Alexander Korotkov Supabase
Hi, I looked through the patch and have some comments. ====== L68: + <title>Recovery Procedures</title> It looks somewhat confusing and appears as if the section is intended to explain how to perform recovery. Since this is the first built-in procedure, I'm not sure how this section should be written. However, the section immediately above is named "Recovery Control Functions", so "Recovery Synchronization Functions" would align better with the naming of the upper section. (I don't believe we need to be so strict about the distinction between functions and procedures here.) It looks strange that the procedure signature includes the return type. ====== L93: + If <parameter>timeout</parameter> is not specified or zero, this + procedure returns once WAL is replayed upto + <literal>target_lsn</literal>. + If the <parameter>timeout</parameter> is specified (in + milliseconds) and greater than zero, the procedure waits until the + server actually replays the WAL upto <literal>target_lsn</literal> or + until the given time has passed. On timeout, an error is emitted. The first sentence should mention the main functionality. Following precedents, it might be better to use something like "Waits until recovery surpasses the specified LSN. If no timeout is specified or it is set to zero, this procedure waits indefinitely for the LSN. If the timeout is specified (in milliseconds) and is greater than zero, the procedure waits until the LSN is reached or the specified time has elapsed. On timeout, or if the server is promoted before the LSN is reached, an error is emitted." The detailed explanation that follows the above seems somewhat too verbose to me, as other functions don't have such detailed examples. ====== L484 /* + * Set latches for processes, whose waited LSNs are already replayed. This + * involves spinlocks. So, we shouldn't do this under a spinlock. + */ Here, I'm not quite sure what spinlock (or mutex?) this is specifically referring to. However, more importantly, shouldn't we explain that it is okay not to use any locks at all, rather than mentioning that spinlocks should not be used here? I found a similar case around freelist.c:238, which is written as follows. > * Not acquiring ProcArrayLock here which is slightly icky. It's > * actually fine because procLatch isn't ever freed, so we just can > * potentially set the wrong process' (or no process') latch. > */ > SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch); ===== L518 +void +WaitForLSN(XLogRecPtr targetLSN, int64 timeout) This function is only called within the same module. I'm not sure if we need to expose it. If we do, the name should probably be more specific. I'm not quite sure if the division of functionality between this function and its only caller is appropriate. As a possible refactoring, we could have WaitForLSN() just return the result as [reached, timedout, promoted] and delegate prerequisite checks and error reporting to the SQL function. ===== L524 + /* Shouldn't be called when shmem isn't initialized */ + Assert(waitLSN); Seeing this assertion, I feel that the name "waitLSN" is a bit obscure. How about renaming it to "waitLSNStates"? ===== L527 + /* Should be only called by a backend */ + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends); This is somewhat excessive, causing a server crash when ending with an error would suffice. By the way, if I execute "CALL pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes. The condition doesn't seem appropriate.
===== L561 + errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.", I don't think we need "the" before "target" in the above message. ===== L565 + if (timeout > 0) + { + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; + latch_events |= WL_TIMEOUT; + if (delay_ms <= 0) + break; "timeout" is immutable in the function. Therefore, we can calculate "latch_events" before entering the loop. By the way, the name 'latch_events' seems a bit off. A latch is just one kind of event the function can wait for. Therefore, something like wait_events might be more appropriate. ===== L567 + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; We can use TimestampDifferenceMilliseconds() here. ==== L578 + if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); I think we usually reset latches unconditionally after returning from WaitLatch(), even when waiting for timeouts. ===== L756 +{ oid => '16387', descr => 'wait for LSN with timeout', The description seems a bit off. While timeout is mentioned, the more important characteristic that the LSN is the replay LSN is not included. ===== L791 + * WaitLSNProcInfo [e2 80 93] the shared memory structure representing information This line contains a non-ASCII character EN DASH (U+2013). ===== L798 + * A process number, same as the index of this item in waitLSN->procInfos. + * Stored for convenience. + */ + int procnum; It is described as "(just) for convenience". However, it is referenced by Startup to fetch the PGPROC entry for the waiter, which is necessary for Startup. That aside, why don't we hold (the pointer to) procLatch instead of procnum? It makes things simpler and I believe it is our standard practice. ===== L809 + /* A flag indicating that this item is added to waitLSN->waitersHeap */ + bool inHeap; The name "inHeap" seems too literal and might be hard to understand in most occurrences. How about using "waiting" instead? ===== L920 +# I +# Make sure that pg_wal_replay_wait() works: add new content to Hmm. I feel that Arabic numerals look nicer than Roman ones here. ===== L940 +# Check that new data is visible after calling pg_wal_replay_wait() On the other hand, the comment for the check for this test states that > +# Make sure the current LSN on standby and is the same as primary's > LSN +ok($output eq 30, "standby reached the same LSN as primary"); I think the first comment and the second should be consistent. > Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. > > 1. It's not clear why this test needs node_standby2 at all. It seems useless. I agree with you. What we would need is a second *waiter client* connecting to the same standby rather than a second standby. I'd like to have a test where the first waiter is removed while multiple waiters are waiting, as well as a test where promotion occurs under the same circumstances. > 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This > location seems to be unachievable in this test. So, it's not clear > what race condition this test could potentially detect. > 3. I think it would make sense to check for the race condition > reported by Heikki. That is to insert the injection point at the > beginning of WaitLSNSetLatches(). I think the race condition you mentioned refers to the inconsistency between the inHeap flag and the pairing heap caused by a race condition between timeout and wakeup (or perhaps other combinations? I'm not sure which version of the patch the mentioned race condition refers to).
However, I imagine it is difficult to reliably reproduce this condition. In that regard, in the latest patch, the coherence between the inHeap flag and the pairing heap is protected by an LWLock, so I believe we no longer need that test. regards. > Links. > 1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 -- Kyotaro Horiguchi NTT Open Source Software Center
Thank you for your interest in the patch. On 2024-06-20 11:30, Kyotaro Horiguchi wrote: > Hi, I looked through the patch and have some comments. > > > ====== L68: > + <title>Recovery Procedures</title> > > It looks somewhat confusing and appears as if the section is intended > to explain how to perform recovery. Since this is the first built-in > procedure, I'm not sure how this section should be written. However, > the section immediately above is named "Recovery Control Functions", > so "Recovery Synchronization Functions" would align better with the > naming of the upper section. (I don't believe we need to be so strict > about the distinction between functions and procedures here.) > > It looks strange that the procedure signature includes the return type. Good point, changed Recovery Procedures -> Recovery Synchronization Procedures. > ====== L93: > + If <parameter>timeout</parameter> is not specified or zero, > this > + procedure returns once WAL is replayed upto > + <literal>target_lsn</literal>. > + If the <parameter>timeout</parameter> is specified (in > + milliseconds) and greater than zero, the procedure waits until > the > + server actually replays the WAL upto > <literal>target_lsn</literal> or > + until the given time has passed. On timeout, an error is > emitted. > > The first sentence should mention the main functionality. Following > precedents, it might be better to use something like "Waits until > recovery surpasses the specified LSN. If no timeout is specified or it > is set to zero, this procedure waits indefinitely for the LSN. If the > timeout is specified (in milliseconds) and is greater than zero, the > procedure waits until the LSN is reached or the specified time has > elapsed. On timeout, or if the server is promoted before the LSN is > reached, an error is emitted." > > The detailed explanation that follows the above seems somewhat too > verbose to me, as other functions don't have such detailed examples. Please offer your version of the description; I think it would be better. > ====== L484 > /* > + * Set latches for processes, whose waited LSNs are already replayed. > This > + * involves spinlocks. So, we shouldn't do this under a spinlock. > + */ > > Here, I'm not quite sure what spinlock (or mutex?) this is specifically > referring to. However, more importantly, shouldn't we explain that it > is okay not to use any locks at all, rather than mentioning that > spinlocks should not be used here? I found a similar case around > freelist.c:238, which is written as follows. > >> * Not acquiring ProcArrayLock here which is slightly icky. It's >> * actually fine because procLatch isn't ever freed, so we just can >> * potentially set the wrong process' (or no process') latch. >> */ >> SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch); ??? > ===== L518 > +void > +WaitForLSN(XLogRecPtr targetLSN, int64 timeout) > > This function is only called within the same module. I'm not sure if > we need to expose it. If we do, the name should probably be more > specific. I'm not quite sure if the division of functionality between > this function and its only caller is appropriate. As a > possible refactoring, we could have WaitForLSN() just return the > result as [reached, timedout, promoted] and delegate prerequisite > checks and error reporting to the SQL function. waitLSN -> waitLSNStates No, waitLSNStates is not the best name, because waitLSNState is a state, and waitLSN is not the array of waitLSNStates. We can think about another name; what do you think?
> ===== L524 > + /* Shouldn't be called when shmem isn't initialized */ > + Assert(waitLSN); > > Seeing this assertion, I feel that the name "waitLSN" is a bit > obscure. How about renaming it to "waitLSNStates"? > ===== L527 > + /* Should be only called by a backend */ > + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends); > > This is somewhat excessive, causing a server crash when ending with an > error would suffice. By the way, if I execute "CALL > pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes. > The condition doesn't seem appropriate. Can you give more information on your server crashes, so I can reproduce them? > ===== L565 > + if (timeout > 0) > + { > + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; > + latch_events |= WL_TIMEOUT; > + if (delay_ms <= 0) > + break; > > "timeout" is immutable in the function. Therefore, we can calculate > "latch_events" before entering the loop. By the way, the name > 'latch_events' seems a bit off. A latch is just one kind of event the function > can wait for. Therefore, something like wait_events might be more > appropriate. "wait_events" can't be it, because in the latch declaration these events are responsible for wake-up and not for waiting: int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32 wait_event_info) > ==== L578 > + if (rc & WL_LATCH_SET) > + ResetLatch(MyLatch); > > I think we usually reset latches unconditionally after returning from > WaitLatch(), even when waiting for timeouts. No, it depends on your logic: when you have several wake events, you may want to choose which event set your latch. Check applyparallelworker.c:813. > ===== L798 > + * A process number, same as the index of this item in > waitLSN->procInfos. > + * Stored for convenience. > + */ > + int procnum; > > It is described as "(just) for convenience". However, it is referenced > by Startup to fetch the PGPROC entry for the waiter, which is necessary > for Startup. That aside, why don't we hold (the pointer to) procLatch > instead of procnum? It makes things simpler and I believe it is our > standard practice. Can you explain more deeply what you mean and give an example? > ===== L809 > + /* A flag indicating that this item is added to waitLSN->waitersHeap > */ > + bool inHeap; > > The name "inHeap" seems too literal and might be hard to understand in > most occurrences. How about using "waiting" instead? No, inHeap literally means "in heap"; check the comment. Please suggest a more suitable name. > ===== L940 > +# Check that new data is visible after calling pg_wal_replay_wait() On the other hand, the comment for the check for this test states that >> +# Make sure the current LSN on standby and is the same as primary's >> LSN +ok($output eq 30, "standby reached the same LSN as primary"); > > I think the first comment and the second should be consistent. Thanks, I'll rephrase this comment. >> Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. >> >> 1. It's not clear why this test needs node_standby2 at all. It seems >> useless. > > I agree with you. What we would need is a second *waiter client* > connecting to the same standby rather than a second standby. I'd > like to have a test where the first waiter is removed while multiple > waiters are waiting, as well as a test where promotion occurs under > the same circumstances. Can you give more information about these cases step by step, and what "removed" and "promotion" mean here? >> 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000.
This >> location seems to be unachievable in this test. So, it's not clear >> what race condition this test could potentially detect. >> 3. I think it would make sense to check for the race condition >> reported by Heikki. That is to insert the injection point at the >> beginning of WaitLSNSetLatches(). > > I think the race condition you mentioned refers to the inconsistency > between the inHeap flag and the pairing heap caused by a race > condition between timeout and wakeup (or perhaps other combinations? > I'm not sure which version of the patch the mentioned race condition > refers to). However, I imagine it is difficult to reliably reproduce > this condition. In that regard, in the latest patch, the coherence > between the inHeap flag and the pairing heap is protected by an LWLock, > so I believe we no longer need that test. No, Alexander means that Heikki pointed at the race condition just before the LWLock. We can inject an injection point and step in on the backend, but WaitLSNSetLatches() is called from the recovery process. And I have trouble waking up the injection point on the server. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
>> I think the race condition you mentioned refers to the inconsistency >> between the inHeap flag and the pairing heap caused by a race >> condition between timeout and wakeup (or perhaps other combinations? >> I'm not sure which version of the patch the mentioned race condition >> refers to). However, I imagine it is difficult to reliably reproduce >> this condition. In that regard, in the latest patch, the coherence >> between the inHeap flag and the pairing heap is protected by an LWLock, >> so I believe we no longer need that test. > > No, Alexander means that Heikki pointed at the race condition just before > the LWLock. We can inject an injection point and step in on the backend, > but WaitLSNSetLatches() is called from the recovery process. And I have > trouble waking up the injection point on the server. One more thing I want to point out: when you move the injection point to the contrib dir and do the same test (044_wal_replay_wait_injection_test.pl) step by step by hand, wakeup works well. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Thanks to Kyotaro for the review. And thanks to Ivan for the patch
revision. I made another revision of the patch.

On Tue, Jul 9, 2024 at 12:51 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your interest in the patch.
>
> On 2024-06-20 11:30, Kyotaro Horiguchi wrote:
> > Hi, I looked through the patch and have some comments.
> >
> > ====== L68:
> > + <title>Recovery Procedures</title>
> >
> > It looks somewhat confusing and appears as if the section is intended
> > to explain how to perform recovery. Since this is the first built-in
> > procedure, I'm not sure how this section should be written. However,
> > the section immediately above is named "Recovery Control Functions",
> > so "Recovery Synchronization Functions" would align better with the
> > naming of the upper section. (I don't believe we need to be so strict
> > about the distinction between functions and procedures here.)
> >
> > It looks strange that the procedure signature includes the return type.
>
> Good point, change
> Recovery Procedures -> Recovery Synchronization Procedures

Thank you, looks good to me.

> > ====== L93:
> > + If <parameter>timeout</parameter> is not specified or zero, this
> > + procedure returns once WAL is replayed upto
> > + <literal>target_lsn</literal>.
> > + If the <parameter>timeout</parameter> is specified (in
> > + milliseconds) and greater than zero, the procedure waits until the
> > + server actually replays the WAL upto <literal>target_lsn</literal> or
> > + until the given time has passed. On timeout, an error is emitted.
> >
> > The first sentence should mention the main functionality. Following
> > precedents, it might be better to use something like "Waits until
> > recovery surpasses the specified LSN. If no timeout is specified or it
> > is set to zero, this procedure waits indefinitely for the LSN. If the
> > timeout is specified (in milliseconds) and is greater than zero, the
> > procedure waits until the LSN is reached or the specified time has
> > elapsed. On timeout, or if the server is promoted before the LSN is
> > reached, an error is emitted."
> >
> > The detailed explanation that follows the above seems somewhat too
> > verbose to me, as other functions don't have such detailed examples.
>
> Please offer your description. I think it would be better.

Kyotaro actually provided a paragraph in his message. I've integrated
it into the patch.

> > ====== L484
> > /*
> > + * Set latches for processes, whose waited LSNs are already replayed. This
> > + * involves spinlocks. So, we shouldn't do this under a spinlock.
> > + */
> >
> > Here, I'm not quite sure what spinlock (or mutex?) this is
> > specifically referring to. However, more importantly, shouldn't we
> > explain that it is okay not to use any locks at all, rather than
> > mentioning that spinlocks should not be used here? I found a similar
> > case around freelist.c:238, which is written as follows.
> >
> >> * Not acquiring ProcArrayLock here which is slightly icky. It's
> >> * actually fine because procLatch isn't ever freed, so we just can
> >> * potentially set the wrong process' (or no process') latch.
> >> */
> >> SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
>
> ???

I've revised the comment.

> > ===== L518
> > +void
> > +WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
> >
> > This function is only called within the same module. I'm not sure if
> > we need to expose it. If we do, the name should probably be more
> > specific. I'm not quite sure if the division of functionality between
> > this function and its only caller function is appropriate. As a
> > possible refactoring, we could have WaitForLSN() just return the
> > result as [reached, timedout, promoted] and delegate prerequisite
> > checks and error reporting to the SQL function.

I think WaitForLSNReplay() is a better name. And it's not clear what
API we would need, so I would prefer to keep it static for now.

> waitLSN -> waitLSNStates
> No, waitLSNStates is not the best name, because waitLSNState is a state,
> and waitLSN is not an array of waitLSNStates. We can think about
> another name. What do you think?
>
> > ===== L524
> > + /* Shouldn't be called when shmem isn't initialized */
> > + Assert(waitLSN);
> >
> > Seeing this assertion, I feel that the name "waitLSN" is a bit
> > obscure. How about renaming it to "waitLSNStates"?

I agree that waitLSN is too generic. I've renamed it to waitLSNState.

> > ===== L527
> > + /* Should be only called by a backend */
> > + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
> >
> > This is somewhat excessive, causing a server crash when ending with an
> > error would suffice. By the way, if I execute "CALL
> > pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes.
> > The condition doesn't seem appropriate.
>
> Could you give more information about the server crashes, so that I
> can reproduce them?

I've rechecked this. We don't need to assert that the process is a
backend, given that MaxBackends includes background workers too.
Replaced this with an assert that MyProcNumber is valid.

> > ===== L565
> > + if (timeout > 0)
> > + {
> > + delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> > + latch_events |= WL_TIMEOUT;
> > + if (delay_ms <= 0)
> > + break;
> >
> > "timeout" is immutable in the function. Therefore, we can calculate
> > "latch_events" before entering the loop. By the way, the name
> > 'latch_events' seems a bit off. Latch is a kind of event the function
> > can wait for. Therefore, something like wait_events might be more
> > appropriate.
>
> It can't be "wait_events", because in the latch declaration these
> events are responsible for waking up, not for waiting:
> int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32
> wait_event_info)

I agree with the change "latch_events" => "wake_events" by Ivan. I also
made a change to calculate wake_events in advance, as proposed by
Kyotaro.

> > ==== L578
> > + if (rc & WL_LATCH_SET)
> > + ResetLatch(MyLatch);
> >
> > I think we usually reset latches unconditionally after returning from
> > WaitLatch(), even when waiting for timeouts.
>
> No, it depends on your logic: when you wait on several wake events and
> want to know which event set your latch, you reset it conditionally.
> See applyparallelworker.c:813.

+1, it's not necessary to reset the latch if it hasn't been set. There
are a lot of places where we do that conditionally.

> > ===== L798
> > + * A process number, same as the index of this item in
> > waitLSN->procInfos.
> > + * Stored for convenience.
> > + */
> > + int procnum;
> >
> > It is described as "(just) for convenience". However, it is referenced
> > by Startup to fetch the PGPROC entry for the waiter, which is necessary
> > for Startup. That aside, why don't we hold (the pointer to) procLatch
> > instead of procnum? It makes things simpler and I believe it is our
> > standard practice.
>
> Could you explain in more depth what you mean and give an example?

I wrote the comment that we store procnum for convenience, meaning that
alternatively we could calculate it as the offset in the
waitLSN->procInfos array. But I like your idea to keep the latch
instead. This should give more flexibility for future advancements.

> > ===== L809
> > + /* A flag indicating that this item is added to waitLSN->waitersHeap
> > */
> > + bool inHeap;
> >
> > The name "inHeap" seems too literal and might be hard to understand in
> > most occurrences. How about using "waiting" instead?
>
> No, "inHeap" literally means "in the heap"; see the comment. Please
> suggest a more suitable name.

+1, we use this flag for consistency during operations with the heap.
I don't see a reason to rename it.

> > ===== L940
> > +# Check that new data is visible after calling pg_wal_replay_wait()
> >
> > On the other hand, the comment for the check for this test states that
> >
> >> +# Make sure the current LSN on standby and is the same as primary's
> >> LSN +ok($output eq 30, "standby reached the same LSN as primary");
> >
> > I think the first comment and the second should be consistent.
>
> Thanks, I'll rephrase this comment.

I've revised this comment further.

> >> Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.
> >>
> >> 1. It's not clear why this test needs node_standby2 at all. It seems
> >> useless.
> >
> > I agree with you. What we would need is a second *waiter client*
> > connecting to the same standby rather than a second standby. I feel
> > like having a test where the first waiter is removed while multiple
> > waiters are waiting, as well as a test where promotion occurs under
> > the same circumstances.
>
> Could you describe these cases step by step, and explain what
> "removed" and "promotion" mean here?
>
> >> 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
> >> location seems to be unachievable in this test. So, it's not clear
> >> what race condition this test could potentially detect.
> >> 3. I think it would make sense to check for the race condition
> >> reported by Heikki. That is to insert the injection point at the
> >> beginning of WaitLSNSetLatches().
> >
> > I think the race condition you mentioned refers to the inconsistency
> > between the inHeap flag and the pairing heap caused by a race
> > condition between timeout and wakeup (or perhaps other combinations?
> > I'm not sure which version of the patch the mentioned race condition
> > refers to). However, I imagine it is difficult to reliably reproduce
> > this condition. In that regard, in the latest patch, the coherence
> > between the inHeap flag and the pairing heap is protected by LWLock,
> > so I believe we no longer need that test.
>
> No, Alexander means the race condition Heikki pointed out just before
> the LWLock. We can inject an injection point and step into it in a
> backend, but WaitLSNSetLatches() is called from the recovery process,
> and I have trouble waking up the injection point on the server.

My initial hope was to add a test elegant enough to check for
concurrency issues between the startup process and an LSN-waiting
process. That test should check the issue reported by Heikki and
hopefully more. I've tried to revise
044_wal_replay_wait_injection_test.pl, but it became cumbersome, and in
the end it was unclear what it was intended to test. This is why I
finally decided to remove it. As for tests with multiple waiters,
043_wal_replay_wait.pl already has some.

------
Regards,
Alexander Korotkov
Supabase
Attachment
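[To illustrate the direction just described, holding the latch pointer
rather than a process number, here is a hedged sketch. The type, field,
and function names are illustrative assumptions, not the committed
ones; the heap is assumed to be a min-heap ordered by waitLSN via an
appropriate comparator.]

#include "postgres.h"
#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
#include "storage/latch.h"

/*
 * Illustrative per-waiter shared state.  Holding the latch pointer
 * directly lets the startup process wake a waiter without translating
 * a process number back into a PGPROC entry.
 */
typedef struct WaitLSNProcInfo
{
	XLogRecPtr	waitLSN;		/* LSN this process is waiting for */
	Latch	   *latch;			/* latch to set once waitLSN is replayed */
	pairingheap_node phNode;	/* node in the heap ordered by waitLSN */
	bool		inHeap;			/* is this entry currently in the heap? */
} WaitLSNProcInfo;

/*
 * Startup-process side (sketch): wake every waiter whose target LSN
 * has been replayed.  Assumes the caller holds the lock protecting
 * the heap and the inHeap flags.
 */
static void
WakeupWaiters(pairingheap *waitersHeap, XLogRecPtr currentLSN)
{
	while (!pairingheap_is_empty(waitersHeap))
	{
		pairingheap_node *node = pairingheap_first(waitersHeap);
		WaitLSNProcInfo *cur = pairingheap_container(WaitLSNProcInfo,
													 phNode, node);

		if (cur->waitLSN > currentLSN)
			break;				/* remaining waiters want a later LSN */

		(void) pairingheap_remove_first(waitersHeap);
		cur->inHeap = false;
		SetLatch(cur->latch);
	}
}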
On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thanks to Kyotaro for the review. And thanks to Ivan for the patch
> revision. I made another revision of the patch.

I've noticed failures on cfbot. The attached revision addresses the
docs build failure. It also adds some "quits" for background psql
sessions in the tests. This will probably address the test hangs on
Windows.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Mon, Jul 15, 2024 at 2:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Thanks to Kyotaro for the review. And thanks to Ivan for the patch
> > revision. I made another revision of the patch.
>
> I've noticed failures on cfbot. The attached revision addresses the
> docs build failure. It also adds some "quits" for background psql
> sessions in the tests. This will probably address the test hangs on
> Windows.

I made the following changes to the patch.

1) I've changed the check for a snapshot in pg_wal_replay_wait(). Now
it checks that GetOldestSnapshot() returns NULL. That happens when both
ActiveSnapshot is NULL and the RegisteredSnapshots pairing heap is
empty. This is the same condition under which SnapshotResetXmin() sets
our xmin to invalid, so we are not preventing WAL from being replayed.
The condition is satisfied when pg_wal_replay_wait() isn't called
within a transaction with an isolation level higher than READ
COMMITTED, another procedure, or a function. Documented it this way.

2) Explicitly documented in the PortalRunUtility() comment that
pg_wal_replay_wait() is another case where the active snapshot gets
released.

3) I've removed the tests with cascading replication. It's rather
unclear what problem those tests could potentially spot.

4) Made some improvements to the docs, comments, and commit message to
make them consistent with the patch contents.

The earlier commit to pg17 was inconsiderate. But I feel this patch is
in much better shape now. In particular, it's now clear when the
pg_wal_replay_wait() procedure can be used. So, I dare commit this to
pg18 if nobody objects.

------
Regards,
Alexander Korotkov
Supabase
Attachment
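[As a rough illustration of change 1 above, the prerequisite check
could look something like the following. This is a minimal sketch: the
wrapper function name and the exact error wording are assumptions, not
a verbatim excerpt from the patch.]

#include "postgres.h"
#include "utils/snapmgr.h"

/*
 * Sketch of the prerequisite check: if any snapshot is still active or
 * registered, our xmin would hold back WAL replay, so waiting for an
 * LSN could block indefinitely.
 */
static void
CheckNoSnapshotHeld(void)
{
	if (GetOldestSnapshot() != NULL)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("pg_wal_replay_wait() must be only called without an active or registered snapshot")));
}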
I noticed the commit and had a question and a comment.
There is a small problem in func.sgml in the sentence "After that the changes made of primary". Should be "on primary".
In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.
Thanks,
Kevin.
On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> I noticed the commit and had a question and a comment.
> There is a small problem in func.sgml in the sentence "After that the
> changes made of primary". Should be "on primary".

Thank you for spotting this. Will fix.

> In the for loop in WaitForLSNReplay, shouldn't the check for
> in-recovery be moved up above the call to GetXLogReplayRecPtr?
> If we get promoted while waiting for the timeout we could call
> GetXLogReplayRecPtr while not in recovery.

This is intentional. After the standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position replayed while it
was still a standby. So, if the standby reached the target LSN before
being promoted, we don't have to throw an error.

But this gave me an idea: before the loop, we probably need to put the
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

------
Regards,
Alexander Korotkov
Supabase
On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > In the for loop in WaitForLSNReplay, shouldn't the check for
> > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > If we get promoted while waiting for the timeout we could call
> > GetXLogReplayRecPtr while not in recovery.
>
> This is intentional. After the standby gets promoted,
> GetXLogReplayRecPtr() returns the last WAL position replayed while it
> was still a standby. So, if the standby reached the target LSN before
> being promoted, we don't have to throw an error.
>
> But this gave me an idea: before the loop, we probably need to put the
> RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
> recheck that.

The attached patchset comprises assorted improvements for
pg_wal_replay_wait().

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
because more WAL could be replayed between those calls. Instead, we
need to recheck GetXLogReplayRecPtr() after getting a negative result
from RecoveryInProgress(), because the WAL replay position cannot be
updated after that.

The 0002 patch fixes the header comment of the WaitLSNSetLatches()
function.

The 0003 patch adds tests for pg_wal_replay_wait() errors.

------
Regards,
Alexander Korotkov
Supabase
Attachment
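[A hedged sketch of the 0001 ordering fix described above. The wrapper
function name, variable names, and error wording are illustrative
assumptions; only the ordering argument is taken from the discussion.]

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlogrecovery.h"

static void
CheckPromotionRace(XLogRecPtr targetLSN)
{
	if (!RecoveryInProgress())
	{
		/*
		 * Recovery has ended, but GetXLogReplayRecPtr() still returns
		 * the last position replayed before promotion, and that value
		 * can no longer advance; rechecking it here closes the race.
		 */
		if (targetLSN <= GetXLogReplayRecPtr(NULL))
			return;				/* target reached before promotion */

		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("recovery is not in progress"),
				 errdetail("Recovery ended before replaying target LSN %X/%X.",
						   LSN_FORMAT_ARGS(targetLSN))));
	}
}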
On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> The 0001 patch is intended to improve this situation. Actually, it's
> not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> because more WAL could be replayed between those calls. Instead, we
> need to recheck GetXLogReplayRecPtr() after getting a negative result
> from RecoveryInProgress(), because the WAL replay position cannot be
> updated after that.
>
> The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> function.
>
> The 0003 patch adds tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

--
Michael
Attachment
On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> > because more WAL could be replayed between those calls. Instead, we
> > need to recheck GetXLogReplayRecPtr() after getting a negative result
> > from RecoveryInProgress(), because the WAL replay position cannot be
> > updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Before adding more tests, could it be possible to stabilize what's in
> the tree? drongo has reported one failure with the recovery test
> 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

Thank you for pointing this out!
Surely, I'll fix this first.

------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 11:18 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > > The 0001 patch is intended to improve this situation. Actually, it's
> > > not right to just put RecoveryInProgress() after
> > > GetXLogReplayRecPtr(), because more WAL could be replayed between
> > > those calls. Instead, we need to recheck GetXLogReplayRecPtr() after
> > > getting a negative result from RecoveryInProgress(), because the WAL
> > > replay position cannot be updated after that.
> > > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > > function.
> > > The 0003 patch adds tests for pg_wal_replay_wait() errors.
> >
> > Before adding more tests, could it be possible to stabilize what's in
> > the tree? drongo has reported one failure with the recovery test
> > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54
>
> Thank you for pointing!
> Surely, I'll fix this before.
Something breaks in these lines during the second iteration of the loop: "SELECT pg_current_wal_insert_lsn()" has been queried on the primary, but the standby didn't receive "CALL pg_wal_replay_wait('...');".
for (my $i = 0; $i < 5; $i++)
{
	print($i);
	$node_primary->safe_psql('postgres',
		"INSERT INTO wait_test VALUES (${i});");
	my $lsn =
	  $node_primary->safe_psql('postgres',
		"SELECT pg_current_wal_insert_lsn()");
	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
	$psql_sessions[$i]->query_until(
		qr/start/, qq[
		\\echo start
		CALL pg_wal_replay_wait('${lsn}');
		SELECT log_count(${i});
	]);
}
I wonder what it could be. Probably something hangs while launching the background psql... I'll investigate this more.
------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> > because more WAL could be replayed between those calls. Instead, we
> > need to recheck GetXLogReplayRecPtr() after getting a negative result
> > from RecoveryInProgress(), because the WAL replay position cannot be
> > updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Before adding more tests, could it be possible to stabilize what's in
> the tree? drongo has reported one failure with the recovery test
> 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

I'm currently running the 043_wal_replay_wait test in a loop on drongo.
No failures during more than 10 hours. As I pointed out in [1], it
seems that the test got stuck somewhere while launching BackgroundPsql.
Given that drongo has some strange failures from time to time (for
instance [2] or [3]), I doubt there is something specifically wrong in
the 043_wal_replay_wait test that caused the subject failure.

Therefore, while I'm going to continue looking into the reason for the
drongo failure in the background, I'm going to go ahead with my
improvements for pg_wal_replay_wait().

Links.
1. https://www.postgresql.org/message-id/CAPpHfduYkve0sw-qy4aCCmJv_MXfuuAQ7wyRQsX8NjaLVKDE1Q%40mail.gmail.com
2. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-02%2010%3A34%3A45
3. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-06-06%2012%3A36%3A11

------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > > In the for loop in WaitForLSNReplay, shouldn't the check for
> > > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > > If we get promoted while waiting for the timeout we could call
> > > GetXLogReplayRecPtr while not in recovery.
> >
> > This is intentional. After the standby gets promoted,
> > GetXLogReplayRecPtr() returns the last WAL position replayed while it
> > was still a standby. So, if the standby reached the target LSN before
> > being promoted, we don't have to throw an error.
> >
> > But this gave me an idea: before the loop, we probably need to put
> > the RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
> > recheck that.
>
> The attached patchset comprises assorted improvements for
> pg_wal_replay_wait().
>
> The 0001 patch is intended to improve this situation. Actually, it's
> not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> because more WAL could be replayed between those calls. Instead, we
> need to recheck GetXLogReplayRecPtr() after getting a negative result
> from RecoveryInProgress(), because the WAL replay position cannot be
> updated after that.
> The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> function.
> The 0003 patch adds tests for pg_wal_replay_wait() errors.

Here is a revised version of the patchset. I've fixed some typos,
indentation, etc. I'm going to push this once it passes cfbot.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > > > In the for loop in WaitForLSNReplay, shouldn't the check for
> > > > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > > > If we get promoted while waiting for the timeout we could call
> > > > GetXLogReplayRecPtr while not in recovery.
> > >
> > > This is intentional. After the standby gets promoted,
> > > GetXLogReplayRecPtr() returns the last WAL position replayed while
> > > it was still a standby. So, if the standby reached the target LSN
> > > before being promoted, we don't have to throw an error.
> > >
> > > But this gave me an idea: before the loop, we probably need to put
> > > the RecoveryInProgress() check after GetXLogReplayRecPtr() too.
> > > I'll recheck that.
> >
> > The attached patchset comprises assorted improvements for
> > pg_wal_replay_wait().
> >
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after
> > GetXLogReplayRecPtr(), because more WAL could be replayed between
> > those calls. Instead, we need to recheck GetXLogReplayRecPtr() after
> > getting a negative result from RecoveryInProgress(), because the WAL
> > replay position cannot be updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Here is a revised version of the patchset. I've fixed some typos,
> indentation, etc. I'm going to push this once it passes cfbot.

The next revision of the patchset fixes an uninitialized variable usage
spotted by cfbot.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Sat, Aug 10, 2024 at 6:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Before adding more tests, could it be possible to stabilize what's in
> > the tree? drongo has reported one failure with the recovery test
> > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54
>
> I'm currently running the 043_wal_replay_wait test in a loop on drongo.
> No failures during more than 10 hours. As I pointed out in [1], it
> seems that the test got stuck somewhere while launching BackgroundPsql.
> Given that drongo has some strange failures from time to time (for
> instance [2] or [3]), I doubt there is something specifically wrong in
> the 043_wal_replay_wait test that caused the subject failure.

With the help of Andrew Dunstan, I've run 043_wal_replay_wait.pl in a
loop for two days, then the whole test suite also for two days. I
haven't seen any failures. I don't see the point in running more
experiments, because Andrew needs to bring drongo back online as a
buildfarm member. It might be that something exceptional happened on
drongo (like an inability to launch a new process). For now, I think
the reasonable strategy is to wait and see if something similar repeats
on the buildfarm.

------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,

10.08.2024 20:18, Alexander Korotkov wrote:
> On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> ...
>> Here is a revised version of the patchset. I've fixed some typos,
>> indentation, etc. I'm going to push this once it passes cfbot.
> The next revision of the patchset fixes an uninitialized variable
> usage spotted by cfbot.

When running check-world on a rather slow armv7 device, I came across
the 043_wal_replay_wait.pl test failure:

t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.

regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l .../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
waiting for server to promote.... done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL pg_wal_replay_wait('0/300D0E8');'
at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG: connection authenticated: user="android" method=trust (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG: connection authorized: user=android database=postgres application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG: received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL: terminating walreceiver process due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG: invalid record length at 0/300D0B0: expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG: redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 14.23 s
2024-09-12 21:58:56.537 UTC [15123:10] LOG: last completed transaction was at log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG: selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG: archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN 2/570CB4E8; last replay LSN 0/300D0B0.
2024-09-12 21:58:56.591 UTC [15121:1] LOG: checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android database=postgres host=[local]
2024-09-12 21:58:56.595 UTC [15120:4] LOG: database system is ready to accept connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG: connection authenticated: user="android" method=trust (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG: connection authorized: user=android database=postgres application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android database=postgres host=[local]

Note that the last replay LSN is 300D0B0, but the latter
pg_wal_replay_wait call wants to wait for 300D0E8.

pg_waldump -p src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
shows:
rmgr: Heap        len (rec/tot): 59/ 59, tx: 748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35, flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT 2024-09-12 21:58:55.322831 UTC
rmgr: Standby     len (rec/tot): 50/ 50, tx:   0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I could reproduce this failure on my workstation with bgwriter modified
as below:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -69 +69 @@ int BgWriterDelay = 200;
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 15
@@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
- BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+ 1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

When looking at the test, I noticed a probable typo in the test
message:
"wait for already replayed LSN exists immediately"
Shouldn't it be "exits" there (though maybe the whole phrase could be
improved)?

I also suspect that "achieve" is not a suitable word in the context of
LSNs and timeouts. Maybe you would find it appropriate to replace it
with "reach"?

Best regards,
Alexander
Hi Alexander,

16.09.2024 21:55, Alexander Korotkov wrote:
> Please find two patches attached. The first one does minor cleanup,
> including the misuse of words you've pointed out. The second one adds
> a missing wait_for_catchup(). That should fix the test failure you've
> spotted. Please check if it fixes the issue for you.

Thank you for looking at that! Unfortunately, the issue is still here:
the test failed for me in 6 out of 10 runs, as below:

[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL pg_wal_replay_wait('0/300F248');'
at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN 2/570CD648; last replay LSN 0/300F210.
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT: CALL pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG: checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG: database system is ready to accept connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT 2024-09-17 05:14:01.654874 UTC
rmgr: Standby     len (rec/tot): 50/ 50, tx:   0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I wonder, can you reproduce this with that bgwriter modification?

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml: It may also happen that target <acronym>lsn</acronym> is not achieved
src/backend/access/transam/xlog.c- * recovery was ended before achieving the target LSN.

Best regards,
Alexander
17.09.2024 10:47, Alexander Korotkov wrote:
> Yes, now I did reproduce it. I got that the problem could be that the
> insert LSN is not yet written at the primary, thus wait_for_catchup
> doesn't wait for it. I've worked around that using pg_switch_wal().
> The revised patchset is attached.

Thank you for the revised patch! The improved test works reliably for
me (100 out of 100 runs passed).

Best regards,
Alexander