Thread: Re: [HACKERS] make async slave to wait for lsn to be replayed
Intro
==========
The main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application. We cannot store this LSN inside the database, since reads are distributed across all replicas and the primary.
https://www.postgresql.org/message-id/195e2d07ead315b1620f1a053313f490%40postgrespro.ru

Suggestions
==========
Many proposals have been made for how this feature might look. I aggregate them into the following four types.

1) Classic (wait_classic_v1.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple events, standalone WAIT
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
where event is one of:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v1.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar, standalone AFTER
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==========
advantages: no new words in grammar; works like pg_last_wal_replay_lsn; no snapshot needed
disadvantages: a little harder to remember the names

SELECT pg_waitlsn('LSN', timeout);
SELECT pg_waitlsn_infinite('LSN');
SELECT pg_waitlsn_no_wait('LSN');

4) Brackets style: Kondratov
https://www.postgresql.org/message-id/a8bff0350a27e0a87a6eaf0905d6737f%40postgrespro.ru
==========
advantages: only one new word in grammar; works like VACUUM and REINDEX; parameters can be extended without grammar changes
disadvantages: none noted

WAIT (LSN '16/B374D848', TIMEOUT 100);
BEGIN WAIT (LSN '16/B374D848' [, etc_options]);
...
COMMIT;

Consequence
==========
Below I provide the implementation of patches for the first three types. I propose to discuss this feature again.

Regards

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
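To make the goal concrete: a minimal sketch of the read-your-writes flow, assuming the option-1 syntax above (the table name and LSN value are placeholders):

    -- on the primary: perform the write, then capture its LSN in the application
    INSERT INTO tbl VALUES (42);
    SELECT pg_current_wal_lsn();              -- say it returns '0/306EE20'

    -- on an async replica: wait for that LSN to be replayed before reading
    BEGIN WAIT FOR ANY LSN '0/306EE20', TIMEOUT 1000;
    SELECT * FROM tbl;                        -- now sees the insert above
    COMMIT;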
On Tue, 28 Feb 2023 at 05:13, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
>
> Below I provide the implementation of patches for the first three types.
> I propose to discuss this feature again.

Oof, that doesn't really work with the cfbot. It tries to apply all three patches, and of course the second and third fail to apply.

In any case, this seems like a lot of effort to me. I would suggest you just pick one avenue, provide that patch for discussion, and ask whether people would prefer any of the alternative syntaxes.

Fwiw I prefer the functions approach. I do like me some nice syntax, but I don't see any particular advantage of the special syntax in this case. It doesn't seem to provide any additional expressiveness.

That said, I'm not a fan of the specific function names. Remember that we have polymorphic functions, so you could probably just have an optional argument:

pg_lsn_wait('LSN', [timeout]) returns boolean
(just call it with a timeout of 0 to do a no-wait)

I'll set the patch to "Waiting on Author" for now. If you feel you're still looking for more opinions from others, maybe set it back to Needs Review. But honestly there are a lot of patches, so you probably won't see much this commitfest unless you have a patch that shows in cfbot.cputube.org as applying and which looks ready to commit.

--
greg
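A sketch of Greg's single polymorphic function in use (pg_lsn_wait() is the name proposed above, not an existing function; the LSN value is a placeholder):

    SELECT pg_lsn_wait('0/306EE20');          -- timeout omitted: wait until replayed
    SELECT pg_lsn_wait('0/306EE20', 60000);   -- wait at most 60 s; returns false on timeout
    SELECT pg_lsn_wait('0/306EE20', 0);       -- timeout 0: no-wait, just report whether replayed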
On Wed, Mar 01, 2023 at 03:31:06PM -0500, Greg Stark wrote:
> Fwiw I prefer the functions approach. I do like me some nice syntax
> but I don't see any particular advantage of the special syntax in this
> case. They don't seem to provide any additional expressiveness.

So do I, even though I see the point that sticking to a function or procedure approach ties the wait to more MVCC rules, like the fact that the wait may hold a snapshot for longer than necessary. The grammar can be made extensible without more keywords by using DefElems; still, I'd like to think that we should not introduce more restrictions in the parser if we have ways to work around them. Using a procedure or function approach is more extensible in its own way, and it also depends on the data being waited for (there could be more than one field for a single wait pattern?).

> I'll set the patch to "Waiting on Author" for now. If you feel you're
> still looking for more opinions from others maybe set it back to Needs
> Review but honestly there are a lot of patches so you probably won't
> see much this commitfest unless you have a patch that shows in
> cfbot.cputube.org as applying and which looks ready to commit.

While looking at all the patches proposed, I have noticed that all the approaches force a wakeup of the waiters in the redo loop of the startup process for each record, before reading the next record. It strikes me that there is some interaction with custom resource managers here, where it would be possible to poke at the waiters not for each record, but after reading some specific records. Something out-of-core would not be as responsive as the per-record approach, but still responsive enough that the waiters wait for an acceptable amount of time, depending on the frequency of the records generated by a primary to wake them up. Just something that popped into my mind while looking a bit at the threads.

--
Michael
Attachment
On 28.02.23 11:10, Kartyshov Ivan wrote:
> 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
> https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
> https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
> ==========
> advantages: no new words in grammar; works like pg_last_wal_replay_lsn;
> no snapshot needed
> disadvantages: a little harder to remember the names
> SELECT pg_waitlsn('LSN', timeout);
> SELECT pg_waitlsn_infinite('LSN');
> SELECT pg_waitlsn_no_wait('LSN');

Of the presented options, I prefer this one. (Maybe with a "_" between "wait" and "lsn".)

But I wonder how a client is going to get the LSN. How would all of this be used by a client?

I can think of a scenario where you have an application that issues a bunch of SQL commands and you have some kind of pooler in the middle that redirects those commands to different hosts, and what you really want is to have it transparently behave as if it's just a single host. Do we want to inject a bunch of "SELECT pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?

I'm tempted to think this could be a protocol-layer facility. Every query automatically returns the current LSN, and every query can also send along an LSN to wait for, and the client library would just keep track of the LSN for (what it thinks of as) the connection. So you get some automatic serialization without having to modify your client code.

That said, exposing this functionality using functions could be a valid step in that direction, so that you can at least build out the actual internals of the functionality and test it out.
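For concreteness, the injection Peter describes might look like this (pg_get_lsn() and pg_wait_lsn() are hypothetical names from his message, not existing functions; the LSN value is a placeholder):

    -- pooler, on the write connection (primary), after each write:
    SELECT pg_get_lsn();                      -- remember this session's last-write LSN

    -- pooler, on a read connection (replica), before forwarding the next read:
    SELECT pg_wait_lsn('0/306EE20');          -- block until that LSN is replayed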
Here is a new patch for the feature discussed above.

WAIT FOR procedure - waits for a certain LSN to be replayed
==========
Synopsis
==========
SELECT pg_wait_lsn('LSN', timeout) returns boolean
With timeout = 0 it waits indefinitely, and with timeout = 1 it just checks whether the LSN has been replayed.

How to use it
==========
See the usage sketch following this message.

Greg Stark wrote:
> That said, I'm not a fan of the specific function names. Remember that
> we have polymorphic functions so you could probably just have an
> optional argument:

If you have any examples, I will be glad to see them. My searches have not been fruitful.

Michael Paquier wrote:
> While looking at all the patches proposed, I have noticed that all the
> approaches proposed force a wakeup of the waiters in the redo loop of
> the startup process for each record, before reading the next record.
> It strikes me that there is some interaction with custom resource
> managers here, where it is possible to poke at the waiters not for
> each record, but after reading some specific records. Something
> out-of-core would not be as responsive as the per-record approach,
> still responsive enough that the waiters wait on input for an
> acceptable amount of time, depending on the frequency of the records
> generated by a primary to wake them up. Just something that popped
> into my mind while looking a bit at the threads.

I'll work on this idea to have less impact on the redo system.

On 2023-03-02 13:33, Peter Eisentraut wrote:
> But I wonder how a client is going to get the LSN. How would all of
> this be used by a client?

As I wrote earlier, the main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application.

> I'm tempted to think this could be a protocol-layer facility. Every
> query automatically returns the current LSN, and every query can also
> send along an LSN to wait for, and the client library would just keep
> track of the LSN for (what it thinks of as) the connection. So you
> get some automatic serialization without having to modify your client
> code.

Yes, it sounds very tempting, but I think the community would be against it.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
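A usage sketch for this version of pg_wait_lsn(), under the timeout semantics described above (the LSN value is a placeholder):

    SELECT pg_wait_lsn('0/306EE20', 10000);   -- wait at most 10 s; returns false on timeout
    SELECT pg_wait_lsn('0/306EE20', 0);       -- wait indefinitely until the LSN is replayed
    SELECT pg_wait_lsn('0/306EE20', 1);       -- just check whether the LSN is already replayed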
Updated the patch to fix a conflict with master.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Fixed meson.build troubles.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
All rebased and tested
--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi, Ivan!
On Fri, Jun 30, 2023 at 11:32 AM Картышов Иван <i.kartyshov@postgrespro.ru> wrote:
> All rebased and tested
Thank you for continuing to work on this patch.
I see you're concentrating on the procedural version of this feature. But when you're calling a procedure within a normal SQL statement, the executor gets a snapshot and holds it until the procedure finishes. If the WAL record conflicts with this snapshot, the query will be canceled. Alternatively, when hot_standby_feedback = on, the query and the WAL replayer will deadlock (the WAL replayer will wait for the query to finish, and the query will wait for the WAL to be replayed). Do you see this issue? Or do you think I'm missing something?
XLogRecPtr
GetMinWaitedLSN(void)
{
    return state->min_lsn.value;
}
You definitely shouldn't access the fields inside pg_atomic_uint64 directly. In this particular case, you should use pg_atomic_read_u64().
Also, I think there is a race condition: if the target LSN gets replayed between the check below and AddWaitedLSN(), the backend registers itself too late and may keep sleeping even though its LSN has already been replayed.

/* Check if we already reached the needed LSN */
if (cur_lsn >= target_lsn)
    return true;

AddWaitedLSN(target_lsn);
------
Regards,
Alexander Korotkov
Hi!

On Wed, Oct 4, 2023 at 1:22 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I see you're concentrating on the procedural version of this feature. But when you're calling a procedure within a normal SQL statement, the executor gets a snapshot and holds it until the procedure finishes. If the WAL record conflicts with this snapshot, the query will be canceled. Alternatively, when hot_standby_feedback = on, the query and the WAL replayer will deadlock (the WAL replayer will wait for the query to finish, and the query will wait for the WAL to be replayed). Do you see this issue? Or do you think I'm missing something?

I'm sorry, I actually meant hot_standby_feedback = off (hot_standby_feedback = on actually avoids query conflicts). I managed to reproduce this problem.

master: create table test as (select i from generate_series(1,10000) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Needless to say, this is very undesirable behavior. This happens because pg_wait_lsn() has to run within a snapshot like any other function. This is why I think this functionality should be implemented as a separate statement.

Another issue I found is that pg_wait_lsn() hangs on the primary. I think an error should be reported instead.

------
Regards,
Alexander Korotkov
Alexander, thank you for your review and for pointing out these issues. According to them I made some fixes and rebased the whole patch.

But I can't repeat your ERROR, neither with hot_standby_feedback = on nor with hot_standby_feedback = off.

> master: create table test as (select i from generate_series(1,10000) i);
> slave conn1: select pg_wal_replay_pause();
> master: delete from test;
> master: vacuum test;
> master: select pg_current_wal_lsn();
> slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
> slave conn1: select pg_wal_replay_resume();
> slave conn2: ERROR: canceling statement due to conflict with recovery
> DETAIL: User query might have needed to see row versions that must be removed.

Also, I use a little hack to work outside of the snapshot, similar to SnapshotResetXmin.

Patch rebased and ready for review.
Attachment
Hi,
I used the latest code and found some conflicts while applying the patch. Which PG version did you rebase against?
Regards
Bowen Shi
On Thu, Nov 23, 2023 at 5:52 AM Bowen Shi <zxwsbg12138@gmail.com> wrote:
> I used the latest code and found some conflicts while applying the patch. Which PG version did you rebase against?

I've successfully applied the patch on bc3c8db8ae. But I used "patch -p1 < wait_proc_v6.patch"; git am doesn't work.

------
Regards,
Alexander Korotkov
On Mon, Nov 20, 2023 at 1:10 PM Картышов Иван <i.kartyshov@postgrespro.ru> wrote:
> Alexander, thank you for your review and for pointing out these issues.
> According to them I made some fixes and rebased the whole patch.
>
> But I can't repeat your ERROR, neither with hot_standby_feedback = on
> nor with hot_standby_feedback = off.
>
> [reproduction steps trimmed]
>
> Also, I use a little hack to work outside of the snapshot, similar to
> SnapshotResetXmin.
>
> Patch rebased and ready for review.

I've retried my case with v6 and it doesn't fail anymore. But I wonder how safe it is to reset xmin within a user-visible function? We have no guarantee that the function is not called inside a complex query. Then how will the rest of the query work with xmin reset? A separate utility statement still looks like the safer option to me.

------
Regards,
Alexander Korotkov
On 2023-11-27 03:08, Alexander Korotkov wrote:
> I've retried my case with v6 and it doesn't fail anymore. But I
> wonder how safe it is to reset xmin within a user-visible function?
> We have no guarantee that the function is not called inside a
> complex query. Then how will the rest of the query work with xmin
> reset? A separate utility statement still looks like the safer
> option to me.

As you mentioned, we can't guarantee that the function is not called inside a complex query, but we can return the xmin after waiting. Still, you are right that a separate utility statement looks safer. So I want to bring up the discussion of a separate utility statement again.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
We should either raise the discussion of a separate utility statement or find a case where the procedure version fails.

1) Classic (wait_classic_v3.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==========
advantages: no new words in grammar; works like pg_last_wal_replay_lsn
disadvantages: uses the snapshot xmin trick

SELECT pg_waitlsn('LSN', timeout);
SELECT pg_waitlsn_infinite('LSN');
SELECT pg_waitlsn_no_wait('LSN');

Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Dec 8, 2023 at 11:20 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> As you mentioned, we can't guarantee that the function is not called
> inside a complex query, but we can return the xmin after waiting.

Returning the xmin back isn't safe, especially after potentially long waiting. The snapshot could be no longer valid, because the corresponding tuples could have been VACUUM'ed.

------
Regards,
Alexander Korotkov
On Fri, Dec 8, 2023 at 11:46 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> We should either raise the discussion of a separate utility statement
> or find a case where the procedure version fails.
>
> 1) Classic (wait_classic_v3.patch)
> [...]

Nice, but as you stated, it requires new keywords.

> 2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
> [...]

+1 from me.

> 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
> [...]

Nice because of its simplicity. But it is only safe if called within a simple query containing nothing else, and validating this from the function kills the simplicity.

------
Regards,
Alexander Korotkov
On Fri, 8 Dec 2023 at 15:17, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> We should either raise the discussion of a separate utility statement
> or find a case where the procedure version fails.
> [...]

A few of the tests have aborted in CFBot at [1] with:

00000058`9c7ff550 00007ff6`5bdff1f4 postgres!pg_atomic_compare_exchange_u64_impl(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008,
    unsigned int64 * expected = 0x00000058`9c7ff5a0,
    unsigned int64 newval = 0)+0x34
    [c:\cirrus\src\include\port\atomics\generic-msvc.h @ 83]
00000058`9c7ff580 00007ff6`5bdff256 postgres!pg_atomic_read_u64_impl(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x24
    [c:\cirrus\src\include\port\atomics\generic.h @ 323]
00000058`9c7ff5c0 00007ff6`5bdfef67 postgres!pg_atomic_read_u64(
    struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x46
    [c:\cirrus\src\include\port\atomics.h @ 430]
00000058`9c7ff5f0 00007ff6`5bc98fc3 postgres!GetMinWaitedLSN(void)+0x17
    [c:\cirrus\src\backend\commands\wait.c @ 176]
00000058`9c7ff620 00007ff6`5bc82fb9 postgres!PerformWalRecovery(void)+0x4c3
    [c:\cirrus\src\backend\access\transam\xlogrecovery.c @ 1788]
00000058`9c7ff6e0 00007ff6`5bffc651 postgres!StartupXLOG(void)+0x989
    [c:\cirrus\src\backend\access\transam\xlog.c @ 5562]
00000058`9c7ff870 00007ff6`5bfed38b postgres!StartupProcessMain(void)+0xd1
    [c:\cirrus\src\backend\postmaster\startup.c @ 288]
00000058`9c7ff8a0 00007ff6`5bff49fd postgres!AuxiliaryProcessMain(
    AuxProcType auxtype = StartupProcess (0n0))+0x1fb
    [c:\cirrus\src\backend\postmaster\auxprocess.c @ 139]
00000058`9c7ff8e0 00007ff6`5beb7674 postgres!SubPostmasterMain(

More details are available at [2].

[1] - https://cirrus-ci.com/task/5618308515364864
[2] - https://api.cirrus-ci.com/v1/artifact/task/5618308515364864/crashlog/crashlog-postgres.exe_0008_2023-12-08_07-48-37-722.txt

Regards,
Vignesh
Rebased and ready for review. I kept only the following versions (the procedure version was dropped due to irreparable problems).

1) Classic (wait_classic_v4.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v3.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
Added some fixes and rebased.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
2024-01 Commitfest.

Hi, this patch has a CF status of "Needs Review" [1], but it seems there was a CFbot test failure the last time it was run [2]. Please have a look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4221/
[2] https://cirrus-ci.com/task/5618308515364864

Kind Regards,
Peter Smith.
On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
>
> Add some fixes and rebase.

While quickly looking into the patch, I understood the idea of what we are trying to achieve here, and I feel it is a useful feature. But while looking at both the patches I could not quickly differentiate between the two approaches. I believe that internally, at the core, both implement similar wait logic but provide different syntaxes. So if we want to keep both these approaches open for the sake of discussion, it would be better to first create a patch that implements the core approach, i.e. the waiting logic and the other common parts, and then add top-up patches with the two different approaches; that would make review easier.

I also see in v4 that there is no documentation for the syntax part, which makes it even harder to understand. I think this thread is implementing a useful feature, so my suggestion is to add some documentation in v4 and also make it more readable w.r.t. what the clear differences between these two approaches are; maybe adding a commit message will also help.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Updated, rebased, fixed CI, and added documentation. We are left with two different solutions; please help me choose the best one.

1) Classic (wait_classic_v6.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v5.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait for

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
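For comparison, the same wait expressed in both proposed syntaxes, as a sketch (the LSN value and timeout are placeholders):

    -- option 1 (WAIT FOR):
    BEGIN WAIT FOR ANY LSN '0/3002AE8', TIMEOUT 1000;

    -- option 2 (AFTER ... WITHIN):
    BEGIN AFTER '0/3002AE8' WITHIN 1000;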
Intro
==========
The main purpose of the feature is to achieve read-your-writes consistency while using an async replica for reads and the primary for writes. In that case the LSN of the last modification is stored inside the application. We cannot store this LSN inside the database, since reads are distributed across all replicas and the primary.

Two implementations of one feature
==========
We are left with two different solutions; please help me choose the best one.

1) Classic (wait_classic_v7.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
Synopsis
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ WAIT FOR [ANY | ALL] event [, ...]]
event:
  LSN value
  TIMEOUT number_of_milliseconds
  timestamp

2) After style: Kyotaro and Freund (wait_after_within_v6.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
Synopsis
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait for

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Examples
==========
standby, postgresql.conf:
  recovery_min_apply_delay = 10s

primary:
  CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
  INSERT INTO tbl VALUES (generate_series(11, 20));
  SELECT pg_current_wal_lsn();

standby:
  BEGIN WAIT FOR LSN '0/3002AE8';
  SELECT * FROM tbl;  -- read fresh insertions
  COMMIT;

Rebased and ready for review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Thu, Mar 7, 2024 at 5:14 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> Two implementations of one feature
> ==========
> We are left with two different solutions; please help me choose the
> best one.
> [...]

+1 for the second one, not only because it avoids new words in the grammar but also because it seems to convey the meaning. I think you can explain in the docs how this feature can be used, basically how one will get the correct LSN value to specify.

As suggested previously, pick one of the approaches (I would advocate the second one) and keep the other as an option by mentioning it in the commit message. I hope to see more reviews/discussion at this stage on the core logic of the feature and on usage, such as how users will get the LSN value to be specified. If possible, state how real-world applications could leverage this feature.

--
With Regards,
Amit Kapila.
Hi!

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> +1 for the second one, not only because it avoids new words in the
> grammar but also because it seems to convey the meaning. I think you
> can explain in the docs how this feature can be used, basically how
> one will get the correct LSN value to specify.

I picked the second option and left only the AFTER clause for the BEGIN statement. I think this should be enough for the beginning.

> As suggested previously, pick one of the approaches (I would advocate
> the second one) and keep the other as an option by mentioning it in
> the commit message. [...]

I've added a paragraph to the docs about the usage. After you make some changes on the primary, you run pg_current_wal_insert_lsn(). Then you connect to the replica and run 'BEGIN AFTER lsn' with the just-obtained LSN. Now you're guaranteed to see the changes made on the primary.

Also, I've significantly reworked other aspects of the patch. The most significant changes are:
1) Waiters are now stored in an array sorted by LSN. This saves us from scanning the whole per-backend array.
2) Waiters are removed from the array immediately once their LSNs are replayed. Otherwise, the WAL replayer would keep scanning the shared memory array until the waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend shmem exit. I think this is preferable over the previous approach of removing from the queue before ProcessInterrupts().
4) There is now a condition to recheck whether the LSN is replayed after adding to the shared memory array. This should protect against race conditions.
5) I've renamed overly generic names for functions and files.

------
Regards,
Alexander Korotkov
Attachment
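A sketch of the documented usage pattern for the reworked patch (the table name and LSN value are placeholders):

    -- on the primary, after your writes:
    SELECT pg_current_wal_insert_lsn();       -- say it returns '0/3F0FF791'

    -- on the replica, before your reads:
    BEGIN AFTER '0/3F0FF791' WITHIN 1000;     -- wait up to 1000 ms for replay
    SELECT * FROM tbl;                        -- guaranteed to see the primary's changes
    COMMIT;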
On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I've decided to put my hands on this patch.
> [...]

I went through this patch another time and made some minor adjustments. Now it looks good; I'm going to push it if there are no objections.

------
Regards,
Alexander Korotkov
Attachment
On Fri, Mar 15, 2024 at 4:20 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [...]
Here is a revised patch version with cosmetic fixes proposed by Alexander Lakhin.
Regards,
Alexander Korotkov
Attachment
On 2024-03-11 13:44, Alexander Korotkov wrote:
> I picked the second option and left only the AFTER clause for the
> BEGIN statement. I think this should be enough for the beginning.

Thank you for your rework of the patch. Here I made some fixes:
0) autocomplete
1) fewer jumps
2) more description, and added cases to the docs

I think it will be useful to have a stand-alone statement. Why would you like to see only the AFTER clause for the BEGIN statement?

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On 2024-03-15 22:59, Kartyshov Ivan wrote:
> On 2024-03-11 13:44, Alexander Korotkov wrote:
>> I picked the second option and left only the AFTER clause for the
>> BEGIN statement. I think this should be enough for the beginning.
>
> Thank you for your rework of the patch. Here I made some fixes:
> 0) autocomplete
> 1) fewer jumps
> 2) more description, and added cases to the docs
>
> I think it will be useful to have a stand-alone statement.
> Why would you like to see only the AFTER clause for the BEGIN statement?

Rebased and updated the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Mar 15, 2024 at 10:32 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> On 2024-03-15 22:59, Kartyshov Ivan wrote:
> > Thank you for your rework of the patch. Here I made some fixes:
> > 0) autocomplete
> > 1) fewer jumps
> > 2) more description, and added cases to the docs

Thank you!

> > I think it will be useful to have a stand-alone statement.
> > Why would you like to see only the AFTER clause for the BEGIN statement?

Yes, stand-alone statements might also be useful. But I think the best way for this feature to get into core would be to commit a minimal version first. The BEGIN clause is minimally invasive for the syntax and, I believe, covers the most typical use cases. Once we figure out it's OK and have positive feedback from users, we can add more enhancements incrementally.

> Rebased and updated the patch.

Cool, I was just about to ask you to do this.

------
Regards,
Alexander Korotkov
On Sat, Mar 16, 2024 at 2:04 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Rebased and updated the patch.

Thanks for working on this. I took a quick look at the v11 patch. Here are some comments:

1.
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"

Please place executor/spi.h in alphabetical order. Also, look at all the other header files and place them in order.

2. It seems like pgindent is not happy with src/backend/access/transam/xlogrecovery.c and src/backend/commands/waitlsn.c. Please run it to keep BF member koel happy post-commit.

3. This patch changes the SQL explicit transaction statement syntax; is this deviation okay from the SQL standard's perspective?

4. I think some more checks are needed for better input validation.

4.1 With an invalid LSN this succeeds; shouldn't it error out? Or at least add a fast path/quick exit to WaitForLSN()?

BEGIN AFTER '0/0';

4.2 With an unreasonably high future LSN, the BEGIN command waits unboundedly; shouldn't we error out if the specified LSN is beyond pg_last_wal_receive_lsn()?

BEGIN AFTER '0/FFFFFFFF';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

4.3 With an unreasonably high wait time, the BEGIN command waits unboundedly; shouldn't we restrict the wait time to some max value, say a day or so?

SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

5.
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"

postgres.h must be included first, then the system header files, and then all the postgres header files, just like below. See a very recent commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=97d85be365443eb4bf84373a7468624762382059.

+#include "postgres.h"
+#include <float.h>
+#include <math.h>
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"

6.
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path after applying every WAL record, which IMO is a hot path. Has the impact of this change on recovery been measured, perhaps using https://github.com/macdice/redo-bench or similar tools?

7. In continuation of comment #6, why not use condition variables instead of proc latches to make the waiters in WaitLSNSetLatches sleep and wait?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> I went through this patch another time and made some minor
> adjustments. Now it looks good; I'm going to push it if there are
> no objections.

I have a question related to usability: if regular reads (say a SELECT statement, or reads via a function/procedure) need a similar guarantee to see the changes on the standby, do they also always need to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? In other words, shouldn't we think of something for implicit transactions?

In general, it seems this patch has been stuck for a long time on the decision to choose an appropriate UI (syntax), and we thought of moving it forward so that the other parts of the patch could be reviewed/discussed. So, I feel that before pushing this we should see comments from a few (at least two) other senior members who earlier shared their opinions on the syntax. I know we don't have much time left, but OTOH pushing such a change (where we didn't have consensus on the syntax) without much discussion at this point in time could lead to discussions after commit.

--
With Regards,
Amit Kapila.
On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have a question related to usability: if regular reads (say a
> SELECT statement, or reads via a function/procedure) need a similar
> guarantee to see the changes on the standby, do they also always need
> to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"?
> In other words, shouldn't we think of something for implicit
> transactions?

+1 to have support for implicit txns. A strawman solution I can think of is to let the primary send its current insert LSN to the standby every time it sends a bunch of WAL, and have the standby automatically wait for that LSN to be replayed at the start of every implicit txn.

The new BEGIN syntax requires application code changes. This led me to think about how one can achieve read-after-write consistency today in a primary-standby setup. All the logic of this patch, that is, waiting for the standby to pass a given primary LSN, needs to be done in the application code (or in a proxy or a load balancer?). I believe someone might be doing this already; it would be good to hear from them.

> In general, it seems this patch has been stuck for a long time on the
> decision to choose an appropriate UI (syntax) [...]

+1 to gain consensus first on the syntax changes. With this, we might be violating the SQL standard for explicit txn commands (I stand to be corrected about the SQL standard, though).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Amit, Hi Bharath,

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> +1 to gain consensus first on the syntax changes. With this, we might
> be violating the SQL standard for explicit txn commands (I stand to be
> corrected about the SQL standard, though).

Thank you for your feedback. Generally, I agree it's correct to get consensus on the syntax first. And yes, this patch has been here since 2016. We didn't get consensus on the syntax for 8 years. Frankly speaking, I don't see a reason why this shouldn't take another 8 years. At the same time, the ability to wait on a standby until a given LSN is replayed seems like pretty basic and simple functionality. Thus, it's quite frustrating that it has already taken this long, and it is still unclear when/how this could be finished.

My current attempt was to commit a minimal implementation, as little invasive as possible. A new clause for BEGIN doesn't require additional keywords and doesn't introduce additional statements. But yes, this is still new syntax. And yes, Amit, you're right that even if I had committed that, there would still be a high risk of further debates and a revert.

Given my skepticism about agreement over syntax, I'd like to check one more time whether we could go without new syntax at all. There was an attempt to implement waiting for an LSN as a function. But a function holds a snapshot, which could prevent WAL records from being replayed. Releasing the snapshot could break the parent query. But now we have procedures, which need a dedicated statement for the call and can even control transactions. Could we implement a waitlsn procedure that:

1. First checks that it was called with a non-atomic context (that is, it's not called within a transaction). Trigger an error if called in an atomic context.
2. Releases the snapshot to be able to wait without risk of WAL replay getting stuck. The procedure is still called within a snapshot. It's a bit of a hack to release a snapshot, but VACUUM statements already do so.

Amit, Bharath, what do you think about this approach? Is this the way to go?

------
Regards,
Alexander Korotkov
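A sketch of how this procedure-based approach might be used, assuming the pg_wait_lsn() draft that follows in this thread (the LSN value and timeout are placeholders):

    -- allowed: a top-level CALL runs in a non-atomic context
    CALL pg_wait_lsn('0/3F0FF791', 1000);
    BEGIN;
    SELECT * FROM tbl;                        -- sees changes up to the waited LSN
    COMMIT;

    -- rejected: inside a transaction block the CALL runs in an atomic
    -- context, so under the draft's check the procedure raises an error
    BEGIN;
    CALL pg_wait_lsn('0/3F0FF791', 1000);     -- ERROR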
On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [...]
> Could we implement a waitlsn procedure that:
>
> 1. First checks that it was called with a non-atomic context (that is,
> it's not called within a transaction). Trigger an error if called in
> an atomic context.
> 2. Releases the snapshot to be able to wait without risk of WAL replay
> getting stuck. The procedure is still called within a snapshot. It's
> a bit of a hack to release a snapshot, but VACUUM statements already
> do so.

Can you please provide a bit more details, with some example, of the existing problem with functions and how using procedures will resolve it? How will this address the implicit transaction case, or do we have any other workaround for those cases?

--
With Regards,
Amit Kapila.
On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
> Can you please provide a bit more details, with some example, of
> the existing problem with functions and how using procedures will
> resolve it? How will this address the implicit transaction case,
> or do we have any other workaround for those cases?
Please check [1] and [2] for the explanation of the problem with functions.
Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.
We first ensure we're in a non-atomic context, then pop the active snapshot (tricky, but ExecVacuum() does the same). Then we have no active snapshot and it's safe to wait for LSN replay.
if (context->atomic)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
    PopActiveSnapshot();
Assert(!ActiveSnapshotSet());
The function call could be added either before the BEGIN statement or before the implicit transaction.
CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > In general, it seems this patch has been stuck for a long time on the
> > > > decision to choose an appropriate UI (syntax), and we thought of
> > > > moving it further so that the other parts of the patch can be
> > > > reviewed/discussed. So, I feel before pushing this we should see
> > > > comments from a few (at least two) other senior members who earlier
> > > > shared their opinion on the syntax. I know we don't have much time
> > > > left but OTOH pushing such a change (where we didn't have a consensus
> > > > on syntax) without much discussion at this point of time could lead to
> > > > discussions after commit.
> > >
> > > +1 to gain consensus first on the syntax changes. With this, we might
> > > be violating the SQL standard for explicit txn commands (I stand for
> > > correction about the SQL standard though).
> >
> > Thank you for your feedback. Generally, I agree it's correct to get
> > consensus on syntax first. And yes, this patch has been here since
> > 2016. We didn't get consensus for syntax for 8 years. Frankly
> > speaking, I don't see a reason why this shouldn't take another 8
> > years. At the same time the ability to wait on standby given LSN is
> > replayed seems like pretty basic and simple functionality. Thus, it's
> > quite frustrating it already took that long and still unclear when/how
> > this could be finished.
> >
> > My current attempt was to commit minimal implementation as less
> > invasive as possible. A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements. But
> > yes, this is still a new qual. And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
> >
> > Given my skepticism about agreement over syntax, I'd like to check
> > another time if we could go without new syntax at all. There was an
> > attempt to implement waiting for lsn as a function. But function
> > holds a snapshot, which could prevent WAL records from being replayed.
> > Releasing a snapshot could break the parent query. But now we have
> > procedures, which need a dedicated statement for the call and can even
> > control transactions. Could we implement a waitlsn in a procedure
> > that:
> >
> > 1. First, check that it was called with non-atomic context (that is,
> > it's not called within a transaction). Trigger error if called with
> > atomic context.
> > 2. Release a snapshot to be able to wait without risk of WAL replay
> > stuck. Procedure is still called within the snapshot. It's a bit of
> > a hack to release a snapshot, but Vacuum statements already do so.
> >
>
> Can you please provide a bit more detail, with some example, of what
> the existing problem with functions is and how using procedures will
> resolve it? How will this address the implicit transaction case,
> or do we have any other workaround for those cases?
Please check [1] and [2] for the explanation of the problem with functions.
Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.
We first ensure we're in a non-atomic context, then pop an active snapshot (tricky, but ExecuteVacuum() does the same). Then we should have no active snapshot and it's safe to wait for lsn replay.
if (context->atomic)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
    PopActiveSnapshot();
Assert(!ActiveSnapshotSet());
The function call could be added either before the BEGIN statement or before the implicit transaction.
CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
Attachment
On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > 1. First, check that it was called with non-atomic context (that is,
> > > it's not called within a transaction). Trigger error if called with
> > > atomic context.
> > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > stuck. Procedure is still called within the snapshot. It's a bit of
> > > a hack to release a snapshot, but Vacuum statements already do so.
> > >
> >
> > Can you please provide a bit more detail, with some example, of what
> > the existing problem with functions is and how using procedures will
> > resolve it? How will this address the implicit transaction case,
> > or do we have any other workaround for those cases?
>
> Please check [1] and [2] for the explanation of the problem with functions.
>
> Also, please find a draft patch implementing the procedure. The issue
> with the snapshot is addressed with the following lines.
>
> We first ensure we're in a non-atomic context, then pop an active
> snapshot (tricky, but ExecuteVacuum() does the same). Then we should
> have no active snapshot and it's safe to wait for LSN replay.
>
> if (context->atomic)
>     ereport(ERROR,
>             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>              errmsg("pg_wait_lsn() must be only called in non-atomic context")));
>
> if (ActiveSnapshotSet())
>     PopActiveSnapshot();
> Assert(!ActiveSnapshotSet());
>
> The function call could be added either before the BEGIN statement or
> before the implicit transaction.
>
> CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

I haven't thought in detail about whether there are any other problems
with this idea, but it sounds like it should solve the problems you
shared with the function call approach. BTW, if the application has to
know the LSN till where the replica needs to wait anyway, why can't
they simply monitor the pg_last_wal_replay_lsn() value?

--
With Regards,
Amit Kapila.
On Tue, Mar 19, 2024 at 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > 1. First, check that it was called with non-atomic context (that is,
> > > > it's not called within a transaction). Trigger error if called with
> > > > atomic context.
> > > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > > stuck. Procedure is still called within the snapshot. It's a bit of
> > > > a hack to release a snapshot, but Vacuum statements already do so.
> > > >
> > >
> > > Can you please provide a bit more detail, with some example, of what
> > > the existing problem with functions is and how using procedures will
> > > resolve it? How will this address the implicit transaction case,
> > > or do we have any other workaround for those cases?
> >
> > Please check [1] and [2] for the explanation of the problem with functions.
> >
> > Also, please find a draft patch implementing the procedure. The issue
> > with the snapshot is addressed with the following lines.
> >
> > We first ensure we're in a non-atomic context, then pop an active
> > snapshot (tricky, but ExecuteVacuum() does the same). Then we should
> > have no active snapshot and it's safe to wait for LSN replay.
> >
> > if (context->atomic)
> >     ereport(ERROR,
> >             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >              errmsg("pg_wait_lsn() must be only called in non-atomic context")));
> >
> > if (ActiveSnapshotSet())
> >     PopActiveSnapshot();
> > Assert(!ActiveSnapshotSet());
> >
> > The function call could be added either before the BEGIN statement or
> > before the implicit transaction.
> >
> > CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> > CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
>
> I haven't thought in detail about whether there are any other problems
> with this idea, but it sounds like it should solve the problems you
> shared with the function call approach. BTW, if the application has to
> know the LSN till where the replica needs to wait anyway, why can't
> they simply monitor the pg_last_wal_replay_lsn() value?

Amit, thank you for your feedback. Yes, the application can monitor the
pg_last_wal_replay_lsn() value; that's our state-of-the-art solution.
But that's rather inconvenient and adds extra latency and network
traffic. And it can't be wrapped into a server-side function in a
procedural language for the same reasons we can't implement it as a
built-in function.

------
Regards,
Alexander Korotkov
Intro
==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.

Procedure style implementation
==========
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com

CALL pg_wait_lsn('LSN', timeout);

Examples
==========
primary                                   standby
-------                                   --------
                                          postgresql.conf
                                          recovery_min_apply_delay = 10s
CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
INSERT INTO tbl VALUES (generate_series(11, 20));
SELECT pg_current_wal_lsn();
                                          CALL pg_wait_lsn('0/3002AE8', 10000);
                                          BEGIN;
                                          SELECT * FROM tbl; -- read fresh insertions
                                          COMMIT;

Fixed and ready to review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
Bharath Rupireddy, thank you for your review. Here are some points.

On 2024-03-16 10:02, Bharath Rupireddy wrote:
> 4.1 With invalid LSN succeeds, shouldn't it error out? Or at least,
> add a fast path/quick exit to WaitForLSN()?
> BEGIN AFTER '0/0';

In PostgreSQL '0/0' is a valid pg_lsn, but it is always reached.

> 4.2 With an unreasonably high future LSN, BEGIN command waits
> unboundedly, shouldn't we check if the specified LSN is more than
> pg_last_wal_receive_lsn() error out?
> BEGIN AFTER '0/FFFFFFFF';
> SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> BEGIN AFTER :'future_receive_lsn';

This case will give an ERROR because '0/FFFFFFFF' + 1 is an invalid pg_lsn.

> 4.3 With an unreasonably high wait time, BEGIN command waits
> unboundedly, shouldn't we restrict the wait time to some max value,
> say a day or so?
> SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it at 1 day. But we should discuss this limit.

> 6.
> +/* Set all latches in shared memory to signal that new LSN has been
> replayed */
> +void
> +WaitLSNSetLatches(XLogRecPtr curLSN)
> +{
>
> I see this patch is waking up all the waiters in the recovery path
> after applying every WAL record, which IMO is a hot path. Is the
> impact of this change on recovery measured, perhaps using
> https://github.com/macdice/redo-bench or similar tools?
>
> 7. In continuation to comment #6, why not use Conditional Variables
> instead of proc latches to sleep and wait for all the waiters in
> WaitLSNSetLatches?

Waiters are stored in the array sorted by LSN. This helps us to wake
only PIDs with replayed LSN. This saves us from scanning the whole
array. So it's not so hot a path.

Add some fixes:

1) make waiting timeout more simple (as pg_terminate_backend())
2) removed the 1 minute wait because INTERRUPTS don't arrive for a
long time, changed it to 0.5 seconds
3) add more tests
4) added and expanded sections in the documentation
5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equals pg_wait_lsn('0/31B1B60', 0)
6) now a big timeout will be restricted to 1 day (86400000 ms)
CALL pg_wait_lsn('0/34FB5A1', 10000000000);
WARNING: Timeout for pg_wait_lsn() restricted to 1 day

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:
> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> > unboundedly, shouldn't we check if the specified LSN is more than
> > pg_last_wal_receive_lsn() error out?
I think limiting wait lsn by current received lsn would destroy the whole value of this feature. The value is to wait till given LSN is replayed, whether it's already received or not.
> > BEGIN AFTER '0/FFFFFFFF';
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn';
>
> This case will give ERROR cause '0/FFFFFFFF' + 1 is invalid pg_lsn
FWIW,
# SELECT '0/FFFFFFFF'::pg_lsn + 1;
?column?
----------
1/0
(1 row)
But I don't see a problem here. On the replica, it's out of our control to check which lsn is good and which is not. We can't check whether the lsn, which is in future for the replica, is already issued by primary.
For the case of wrong lsn, which could cause potentially infinite wait, there is the timeout and the manual query cancel.
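For illustration, a minimal sketch with the draft pg_wait_lsn()
procedure (the LSN here is deliberately unreachable; the error text
follows the draft's ereport() wording at this point in the thread):

    -- Wait at most 1 second for an LSN the standby may never see.
    CALL pg_wait_lsn('FFFFFFFF/FFFFFFFF', 1000);
    -- ERROR:  canceling waiting for LSN due to timeout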
> > 4.3 With an unreasonably high wait time, BEGIN command waits
> > unboundedly, shouldn't we restrict the wait time to some max value,
> > say a day or so?
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>
> Good idea, I put it at 1 day. But we should discuss this limit.
Do you think that specifying the timeout in milliseconds is suitable? I would prefer to switch to seconds (with the ability to specify fractions of a second). This was expressed before by Alexander Lakhin.
> > 6.
> > +/* Set all latches in shared memory to signal that new LSN has been
> > replayed */
> > +void
> > +WaitLSNSetLatches(XLogRecPtr curLSN)
> > +{
> >
> > I see this patch is waking up all the waiters in the recovery path
> > after applying every WAL record, which IMO is a hot path. Is the
> > impact of this change on recovery measured, perhaps using
> > https://github.com/macdice/redo-bench or similar tools?
Ivan, could you do this?
> > 7. In continuation to comment #6, why not use Conditional Variables
> > instead of proc latches to sleep and wait for all the waiters in
> > WaitLSNSetLatches?
>
> Waiters are stored in the array sorted by LSN. This helps us to wake
> only PIDs with replayed LSN. This saves us from scanning the whole
> array. So it's not so hot a path.
+1
This saves us from ConditionVariableBroadcast() every time we replay the WAL record.
> Add some fixes
>
> 1) make waiting timeout more simple (as pg_terminate_backend())
> 2) removed the 1 minute wait because INTERRUPTS don’t arrive for a
> long time, changed it to 0.5 seconds
I don't see this change in the patch. Normally if a process gets a signal, that causes WaitLatch() to exit immediately. It also exits immediately on query cancel. IIRC, this 1 minute timeout is needed to handle some extreme cases when an interrupt is missing. Other places have it equal to 1 minute. I don't see why we should have it different.
> 3) add more tests
> 4) added and expanded sections in the documentation
I don't see this in the patch. I see only a short description in func.sgml, which is definitely not sufficient. We need at least everything we had in the docs before, adjusted to the current procedure-based approach.
> 5) add default variant of timeout
> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)
Does zero here mean no timeout? I think this should be documented. Also, I would prefer to see a timeout by default. Probably one minute would be a good default.
> 6) now big timeout will be restricted to 1 day (86400000ms)
> CALL pg_wait_lsn('0/34FB5A1',10000000000);
> WARNING: Timeout for pg_wait_lsn() restricted to 1 day
I don't think we need to mention individuals who made proposals in the source code comments. Otherwise, our source code would be a crazy mess of names. Also, if this is a restriction, it has to be an error. And it should be a proper full ereport().
------
Regards,
Alexander Korotkov
On 19.03.24 18:38, Kartyshov Ivan wrote:
> CALL pg_wait_lsn('0/3002AE8', 10000);
> BEGIN;
> SELECT * FROM tbl; -- read fresh insertions
> COMMIT;

I'm not endorsing this or any other approach, but I think the timeout
parameter should be of type interval, not an integer with a unit that is
hidden in the documentation.
On 17.03.24 15:09, Alexander Korotkov wrote:
> My current attempt was to commit minimal implementation as less
> invasive as possible. A new clause for BEGIN doesn't require
> additional keywords and doesn't introduce additional statements. But
> yes, this is still a new qual. And, yes, Amit you're right that even
> if I had committed that, there was still a high risk of further
> debates and revert.

I had written in [0] about my questions related to using this with
connection poolers.  I don't think this was addressed at all.  I haven't
seen any discussion about how to make this kind of facility usable in a
full system.  You have to manually query and send LSNs; that seems
pretty cumbersome.  Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

[0]: https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com
Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:
> On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> <i.kartyshov@postgrespro.ru> wrote:
>> > 4.2 With an unreasonably high future LSN, BEGIN command waits
>> > unboundedly, shouldn't we check if the specified LSN is more than
>> > pg_last_wal_receive_lsn() error out?
>
> I think limiting wait lsn by current received lsn would destroy the
> whole value of this feature. The value is to wait till given LSN is
> replayed, whether it's already received or not.

OK, sounds reasonable, I'll roll back the changes.

> But I don't see a problem here. On the replica, it's out of our
> control to check which lsn is good and which is not. We can't check
> whether the lsn, which is in future for the replica, is already issued
> by primary.
>
> For the case of wrong lsn, which could cause potentially infinite
> wait, there is the timeout and the manual query cancel.

Fully agree with this take.

>> > 4.3 With an unreasonably high wait time, BEGIN command waits
>> > unboundedly, shouldn't we restrict the wait time to some max value,
>> > say a day or so?
>> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
>> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>>
>> Good idea, I put it at 1 day. But we should discuss this limit.
>
> Do you think that specifying the timeout in milliseconds is suitable? I
> would prefer to switch to seconds (with the ability to specify
> fractions of a second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

>> > https://github.com/macdice/redo-bench or similar tools?
>
> Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh

This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior: the patched version is faster than REL_13-STABLE and
master.

> I don't see this change in the patch. Normally if a process gets a
> signal, that causes WaitLatch() to exit immediately. It also exits
> immediately on query cancel. IIRC, this 1 minute timeout is needed to
> handle some extreme cases when an interrupt is missing. Other places
> have it equal to 1 minute. I don't see why we should have it
> different.

OK, I'll roll back my changes.

>> 4) added and expanded sections in the documentation
>
> I don't see this in the patch. I see only a short description in
> func.sgml, which is definitely not sufficient. We need at least
> everything we had in the docs before, adjusted to the current
> procedure-based approach.

I didn't find another section where to add the description of
pg_wait_lsn(), so I extended the description at the bottom of the
table.

>> 5) add default variant of timeout
>> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
>> example: pg_wait_lsn('0/31B1B60') equals pg_wait_lsn('0/31B1B60', 0)
>
> Does zero here mean no timeout? I think this should be documented.
> Also, I would prefer to see a timeout by default. Probably one minute
> would be a good default.

Let's discuss this point. The loop in the function WaitForLSN is made
in such a way that only with delay=0 can we wait for the LSN
indefinitely, without a timeout. So the default must be 0.

Please take one more look at the patch.

PS Sorry, a strange bug threw my mails out of the thread.
https://www.postgresql.org/message-id/flat/f2ff071aa9141405bb8efee67558a058%40postgrespro.ru

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Attachment
On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?
>
> [0]:
> https://www.postgresql.org/message-id/8b5b172f-0ae7-d644-8358-e2851dded43b%40enterprisedb.com

I share the same concern as yours and had proposed something upthread
[1]. The idea is something like how each query takes a snapshot at the
beginning of txn/query (depending on isolation level); the same way,
the standby can wait for the primary's current LSN as of the moment of
taking the snapshot. And the primary keeps sending its current LSN as
part of regular WAL to standbys, so that the standbys don't have to
make connections to the primary to know its current LSN every time.
Perhaps this may not even fully guarantee the read-after-write
consistency on standbys unless there's a way for the application to
tell the wait LSN. Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 22, 2024 at 12:50 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 19.03.24 18:38, Kartyshov Ivan wrote:
> > CALL pg_wait_lsn('0/3002AE8', 10000);
> > BEGIN;
> > SELECT * FROM tbl; -- read fresh insertions
> > COMMIT;
>
> I'm not endorsing this or any other approach, but I think the timeout
> parameter should be of type interval, not an integer with a unit that is
> hidden in the documentation.

I'm not sure a timeout needs to deal with the complexity of our
interval datatype.  At the same time, the integer number of
milliseconds looks a bit weird.  Could a float8 number of seconds be an
option?

------
Regards,
Alexander Korotkov
On Fri, Mar 22, 2024 at 12:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 17.03.24 15:09, Alexander Korotkov wrote:
> > My current attempt was to commit minimal implementation as less
> > invasive as possible. A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements. But
> > yes, this is still a new qual. And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?

The current usage pattern of this functionality is the following.

1. Do the write transaction on primary.
2. Query pg_current_wal_insert_lsn() on primary.
3. Call pg_wait_lsn() on replica with the value obtained in the
previous step.
4. Do the read transaction on replica.

This usage pattern could be implemented either at the application level
or at the pooler level. For the application level, it would require a
somewhat advanced level of database-aware programming, but this is
still a valid usage. Regarding poolers, if some poolers manage to
automatically distinguish reading and writing queries, dealing with
LSNs wouldn't be too complex for them.

Having this functionality at the protocol level would be ideal, but
let's do this step-by-step. The built-in procedure isn't very invasive,
and it could give us some adoption and open the way forward.

------
Regards,
Alexander Korotkov
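To make the pattern concrete, a psql-style sketch assuming the draft
pg_wait_lsn() procedure (the LSN value shown is illustrative):

    -- On the primary: do the write, then capture its LSN.
    INSERT INTO tbl VALUES (21);
    SELECT pg_current_wal_insert_lsn();   -- say it returns '0/3002AE8'

    -- On the replica: wait (here up to 10 s) until that LSN is
    -- replayed, then read; the insert above is then visible.
    CALL pg_wait_lsn('0/3002AE8', 10000);
    SELECT * FROM tbl;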
On Sun, Mar 24, 2024 at 4:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> I share the same concern as yours and had proposed something upthread
> [1]. The idea is something like how each query takes a snapshot at the
> beginning of txn/query (depending on isolation level); the same way,
> the standby can wait for the primary's current LSN as of the moment of
> taking the snapshot. And the primary keeps sending its current LSN as
> part of regular WAL to standbys, so that the standbys don't have to
> make connections to the primary to know its current LSN every time.
> Perhaps this may not even fully guarantee the read-after-write
> consistency on standbys unless there's a way for the application to
> tell the wait LSN.

Oh, no.  Please, check [1].  The idea is to wait for a particular
transaction to become visible.  The one who made a change on primary
brings the lsn value from there to the replica.  For instance, an
application made a change on primary and then is willing to run some
report on a replica, and the report should be guaranteed to contain the
change just made.  So, the application queries the LSN from primary
after making the write transaction, then calls pg_wait_lsn() on the
replica before running the report.

This is quite simple server functionality, which could be used at the
application level, ORM level or pooler level.  And it unlocks the way
forward for an in-protocol implementation as proposed by Peter
Eisentraut.

Links.
1. https://www.postgresql.org/message-id/CAPpHfdtny81end69PzEdRsROKnsybsj%3DOs8DUM-6HeKGKnCuQQ%40mail.gmail.com

------
Regards,
Alexander Korotkov
On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your interest in the patch.
> I understand your questions, but I fully support Alexander Korotkov's
> idea to commit the minimal required functionality, and then keep
> working on other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn().  This makes the name look consistent
with other WAL-related functions.  Also it clearly states that we're
waiting for an lsn to be replayed (not received, written or flushed).

Also, I did improvements in the docs, commit message and some minor
code fixes.

I'm continuing to look at this patch.

------
Regards,
Alexander Korotkov
On Thu, Mar 28, 2024 at 2:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
> <i.kartyshov@postgrespro.ru> wrote:
> > Thank you for your interest in the patch.
> > I understand your questions, but I fully support Alexander Korotkov's
> > idea to commit the minimal required functionality, and then keep
> > working on other improvements.
>
> I did further improvements in the patch.
>
> Notably, I decided to rename the procedure to
> pg_wait_for_wal_replay_lsn().  This makes the name look consistent
> with other WAL-related functions.  Also it clearly states that we're
> waiting for an lsn to be replayed (not received, written or flushed).
>
> Also, I did improvements in the docs, commit message and some minor
> code fixes.
>
> I'm continuing to look at this patch.

Sorry, I forgot the attachment.

------
Regards,
Alexander Korotkov
Attachment
> v12

Hi all,

I didn't review the patch but one thing jumped out: I don't think it's
OK to hold a spinlock while (1) looping over an array of backends and
(2) making system calls (SetLatch()).
On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > v12
>
> Hi all,
>
> I didn't review the patch but one thing jumped out: I don't think it's
> OK to hold a spinlock while (1) looping over an array of backends and
> (2) making system calls (SetLatch()).

Good catch, thank you.  Fixed along with other issues spotted by
Alexander Lakhin.

------
Regards,
Alexander Korotkov
Attachment
On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
Fixed along with other issues spotted by Alexander Lakhin.
[I didn't read the whole thread. I'm sorry if I missed something ...]
You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]. I think "for" doesn't add much for the function name and "lsn" is
used in functions that return an LSN (that's not the case here).
postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_pause
Result data type | void
Argument data types |
Type | func
-[ RECORD 2 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_resume
Result data type | void
Argument data types |
Type | func
Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint.
postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]-------+--------------------------------------
Schema | pg_catalog
Name | pg_terminate_backend
Result data type | boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type | func
I also suggest that the timeout unit should be milliseconds, hence, using
bigint is perfectly fine for the timeout argument.
+ <para>
+ Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+ on standby within given timeout. Parameter <parameter>timeout</parameter>
+ is the time in seconds to wait for the <parameter>target_lsn</parameter>
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.
+ </para></entry>
Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]-------+---------------------
> Schema              | pg_catalog
> Name                | pg_wal_replay_pause
> Result data type    | void
> Argument data types |
> Type                | func
> -[ RECORD 2 ]-------+---------------------
> Schema              | pg_catalog
> Name                | pg_wal_replay_resume
> Result data type    | void
> Argument data types |
> Type                | func

Makes sense to me.  I tried to make the new procedure name consistent
with the functions acquiring various WAL positions.  But you're right,
it's better to be consistent with the other functions controlling wal
replay.

> Regarding the arguments, I think the timeout should be bigint. There is at least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]-------+--------------------------------------
> Schema              | pg_catalog
> Name                | pg_terminate_backend
> Result data type    | boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type                | func
>
> I also suggest that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> + <para>
> + Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
> + on standby within given timeout. Parameter <parameter>timeout</parameter>
> + is the time in seconds to wait for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
> + </para></entry>

This generally makes sense, but I'm not sure about this.  The
milliseconds timeout was used initially but received criticism in [1].

Links.
1. https://www.postgresql.org/message-id/b45ff979-9d12-4828-a22a-e4cb327e115c%40eisentraut.org

------
Regards,
Alexander Korotkov
Attachment
Hi, hackers!
On Fri, 29 Mar 2024 at 16:45, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Euler!
On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]-------+---------------------
> Schema | pg_catalog
> Name | pg_wal_replay_pause
> Result data type | void
> Argument data types |
> Type | func
> -[ RECORD 2 ]-------+---------------------
> Schema | pg_catalog
> Name | pg_wal_replay_resume
> Result data type | void
> Argument data types |
> Type | func
Makes sense to me. I tried to make a new procedure name consistent
with functions acquiring various WAL positions. But you're right,
it's better to be consistent with other functions controlling wal
replay.
> Regarding the arguments, I think the timeout should be bigint. There is at least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]-------+--------------------------------------
> Schema | pg_catalog
> Name | pg_terminate_backend
> Result data type | boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type | func
>
> I also suggest that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> + <para>
> + Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
> + on standby within given timeout. Parameter <parameter>timeout</parameter>
> + is the time in seconds to wait for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
> + </para></entry>
This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received criticism in [1].
I see in Postgres we already have different units for timeouts:
e.g in guc's:
wal_receiver_status_interval in seconds
tcp_keepalives_idle in seconds
commit_delay in microseconds
max_standby_archive_delay in milliseconds
vacuum_cost_delay in milliseconds
autovacuum_vacuum_cost_delay in milliseconds
etc..
I haven't counted precisely, but I feel that milliseconds are the most often used in both guc's and functions. So I'd propose using milliseconds for the patch as it was proposed originally.
Regards,
Pavel Borisov
Supabase.
On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
This generally makes sense, but I'm not sure about this. The milliseconds timeout was used initially but received criticism in [1].
Alexander, I see why you changed the patch.
Peter suggested to use an interval but you proposed another data type:
float. The advantage of the interval data type is that you don't need to
carefully think about the unit, however, if you use the integer data
type you have to propose one. (If that's the case, milliseconds is a
good granularity for this feature.) I don't have a strong preference
between integer and interval data types but I don't like the float for
this case. The 2 main reasons are (a) that we treat time units (hours,
minutes, seconds, ...) as integers so it seems natural for a human being
to use a unit time as integer and (b) depending on the number of digits
after the decimal separator you still don't have an integer in the
internal unit, hence, you have to round it to integer.
We already have functions that use integer (such as pg_terminate_backend)
and interval (such as pg_sleep_for), and if I searched correctly it would
be the first timeout argument as a float.
On Fri, Mar 29, 2024 at 6:50 PM Euler Taveira <euler@eulerto.com> wrote:
> On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
>
> This generally makes sense, but I'm not sure about this. The
> milliseconds timeout was used initially but received criticism in [1].
>
>
> Alexander, I see why you changed the patch.
>
> Peter suggested to use an interval but you proposed another data type:
> float. The advantage of the interval data type is that you don't need to
> carefully think about the unit, however, if you use the integer data
> type you have to propose one. (If that's the case, milliseconds is a
> good granularity for this feature.) I don't have a strong preference
> between integer and interval data types but I don't like the float for
> this case. The 2 main reasons are (a) that we treat time units (hours,
> minutes, seconds, ...) as integers so it seems natural for a human being
> to use a unit time as integer and (b) depending on the number of digits
> after the decimal separator you still don't have an integer in the
> internal unit, hence, you have to round it to integer.
>
> We already have functions that use integer (such as pg_terminate_backend)
> and interval (such as pg_sleep_for), and if I searched correctly it would
> be the first timeout argument as a float.

Thank you for the detailed explanation.  Float seconds are used in
pg_sleep(), just similar to the interval in pg_sleep_for().  However,
that's a delay, not exactly a timeout.  Given the precedent of a
milliseconds timeout in pg_terminate_backend(), and your and Pavel's
points, I've switched back to an integer milliseconds timeout.

Some fixes spotted off-list by Alexander Lakhin.
1) We don't need an explicit check for the postmaster being alive as
soon as we pass WL_EXIT_ON_PM_DEATH to WaitLatch().
2) When testing for unreachable LSN, we need to select the LSN well in
advance so that autovacuum couldn't affect that.

I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Attachment
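For reference, a minimal sketch of the interface as it stands at this
point in the thread (bigint millisecond timeout, 0 meaning no timeout;
the LSN value is illustrative):

    -- Wait up to one minute for the standby to replay up to the given
    -- LSN; an error is raised if the timeout elapses first.
    CALL pg_wal_replay_wait('0/306EE20', 60000);

    -- With the default timeout of 0, wait with no time limit.
    CALL pg_wal_replay_wait('0/306EE20');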
Thank you Alexander for working on the patch.  Maybe we should change
some names:
1) test 043_wait_lsn.pl -> 043_waitlsn.pl, like waitlsn.c and waitlsn.h
In waitlsn.c and waitlsn.h variables:
2) target_lsn -> trgLSN, like curLSN
3) lsn -> trgLSN, like curLSN

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com
Hi!

On Sat, Mar 30, 2024 at 6:14 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you Alexander for working on the patch.  Maybe we should change
> some names:
> 1) test 043_wait_lsn.pl -> 043_waitlsn.pl, like waitlsn.c and waitlsn.h

I renamed that to 043_wal_replay_wait.pl to match the name of the SQL
procedure.

> In waitlsn.c and waitlsn.h variables:
> 2) target_lsn -> trgLSN, like curLSN

I prefer this to match the SQL procedure parameter name.

> 3) lsn -> trgLSN, like curLSN

Done.

Also I implemented termination of wal replay waits on standby
promotion (with test).  In the test I changed recovery_min_apply_delay
to 1s in order to make the test pass faster.

------
Regards,
Alexander Korotkov
Attachment
On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!

Thanks for the patch. I have a few comments on the v16 patch.

1. Is there any specific reason for pg_wal_replay_wait() being a
procedure rather than a function? I haven't read the full thread, but
I vaguely noticed the discussion on the new wait mechanism holding up
a snapshot or some other resource. Is that the reason to use a stored
procedure over a function? If yes, can we specify it somewhere in the
commit message and just before the procedure definition in
system_functions.sql?

2. Is pg_wal_replay_wait the first procedure that postgres provides
out of the box?

3. Defining a procedure for the first time in system_functions.sql,
which is supposed to be for functions, seems a bit unusual to me.

4.
+
+    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }

Why is endtime calculated even for timeout <= 0 only to just skip it
later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

5.
Parameter
+ <parameter>timeout</parameter> is the time in milliseconds to wait
+ for the <parameter>target_lsn</parameter>
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.
+ replay. When <parameter>timeout</parameter> value equals to zero no
+ timeout is applied.

It turns out to be "When timeout value equals to zero no timeout is
applied." I guess, we can just say something like the following, which
I picked up from the pg_terminate_backend timeout parameter description.

       <para>
        If <parameter>timeout</parameter> is not specified or zero, this
        function returns if the WAL upto
        <literal>target_lsn</literal> is replayed.
        If the <parameter>timeout</parameter> is specified (in
        milliseconds) and greater than zero, the function waits until the
        server actually replays the WAL upto <literal>target_lsn</literal> or
        until the given time has passed. On timeout, an error is emitted.
       </para></entry>

6.
+        ereport(ERROR,
+                (errcode(ERRCODE_QUERY_CANCELED),
+                 errmsg("canceling waiting for LSN due to timeout")));

We can be a bit more informative here and say targetLSN and currentLSN,
something like - "timed out while waiting for target LSN %X/%X to be
replayed; current LSN %X/%X"?

7.
+    if (context->atomic)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("pg_wal_replay_wait() must be only called in
non-atomic context")));
+

Can we say what a "non-atomic context" is in a user-understandable way,
like explicit txn or whatever that might be? "non-atomic context" might
not sound great to the end user.

8.
+   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+   to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+   could be set to <literal>off</literal>.

Can we just mention to run pg_current_wal_insert_lsn on the primary?

9. To me the following query blocks even though I didn't mention a timeout.
CALL pg_wal_replay_wait('0/fffffff');

10. Can't we do some input validation on the timeout parameter and
emit an error for negative values, just like pg_terminate_backend?
CALL pg_wal_replay_wait('0/ffffff', -100);

11.
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }
+

Can we avoid calling GetCurrentTimestamp in a for loop, which can be
costly at times, especially when pg_wal_replay_wait is called with
larger timeouts on multiple backends? Can't we reuse
pg_terminate_backend's timeout logic in pg_wait_until_termination,
perhaps reducing waittime to 1 msec or so?

12. Why should we let every user run the pg_wal_replay_wait procedure?
Can't we revoke execute from the public in system_functions.sql so
that one can decide who gets to run this function? Per comment #11, one
can easily cause a lot of activity by running this function on hundreds
of sessions.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi Bharath,

Thank you for your feedback.

On Sun, Mar 31, 2024 at 8:44 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thanks for the patch. I have a few comments on the v16 patch.
>
> 1. Is there any specific reason for pg_wal_replay_wait() being a
> procedure rather than a function? I haven't read the full thread, but
> I vaguely noticed the discussion on the new wait mechanism holding up
> a snapshot or some other resource. Is that the reason to use a stored
> procedure over a function? If yes, can we specify it somewhere in the
> commit message and just before the procedure definition in
> system_functions.sql?

Surely, there is a reason.  A function should be executed in a
snapshot, which can prevent WAL records from being replayed.  See [1]
for a particular test scenario.  In a procedure we may enforce a
non-atomic context and release the snapshot.  I've mentioned that in
the commit message and in the procedure code.  I don't think
system_functions.sql is the place for this type of comment.  We only
use system_functions.sql to push the default values.

> 2. Is pg_wal_replay_wait the first procedure that postgres provides
> out of the box?

Yes, it appears first.  I see nothing wrong about that.

> 3. Defining a procedure for the first time in system_functions.sql,
> which is supposed to be for functions, seems a bit unusual to me.

From the scope of DDL and the system catalogue, a procedure is just
another kind of function (prokind == 'p').  So, I don't feel wrong
about that.

> 4.
> +
> +    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
> +
> +        if (timeout > 0)
> +        {
> +            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> +            latch_events |= WL_TIMEOUT;
> +            if (delay_ms <= 0)
> +                break;
> +        }
>
> Why is endtime calculated even for timeout <= 0 only to just skip it
> later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

OK, fixed.

> 5.
> Parameter
> + <parameter>timeout</parameter> is the time in milliseconds to wait
> + for the <parameter>target_lsn</parameter>
> + replay. When <parameter>timeout</parameter> value equals to zero no
> + timeout is applied.
>
> It turns out to be "When timeout value equals to zero no timeout is
> applied." I guess, we can just say something like the following, which
> I picked up from the pg_terminate_backend timeout parameter description.
>
> <para>
> If <parameter>timeout</parameter> is not specified or zero, this
> function returns if the WAL upto
> <literal>target_lsn</literal> is replayed.
> If the <parameter>timeout</parameter> is specified (in
> milliseconds) and greater than zero, the function waits until the
> server actually replays the WAL upto <literal>target_lsn</literal> or
> until the given time has passed. On timeout, an error is emitted.
> </para></entry>

Applied as you suggested with some edits from me.

> 6.
> +        ereport(ERROR,
> +                (errcode(ERRCODE_QUERY_CANCELED),
> +                 errmsg("canceling waiting for LSN due to timeout")));
>
> We can be a bit more informative here and say targetLSN and currentLSN,
> something like - "timed out while waiting for target LSN %X/%X to be
> replayed; current LSN %X/%X"?

Done this way.  Adjusted other ereport()'s as well.

> 7.
> +    if (context->atomic)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("pg_wal_replay_wait() must be only called in
> non-atomic context")));
> +
>
> Can we say what a "non-atomic context" is in a user-understandable way,
> like explicit txn or whatever that might be? "non-atomic context" might
> not sound great to the end user.

Added errdetail() to this ereport().

> 8.
> +   the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
> +   changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
> +   to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
> +   could be set to <literal>off</literal>.
>
> Can we just mention to run pg_current_wal_insert_lsn on the primary?

The mention is added.

> 9. To me the following query blocks even though I didn't mention a timeout.
> CALL pg_wal_replay_wait('0/fffffff');

If your primary server is freshly initialized, you need to do quite a
lot of data modifications to reach this LSN.

> 10. Can't we do some input validation on the timeout parameter and
> emit an error for negative values, just like pg_terminate_backend?
> CALL pg_wal_replay_wait('0/ffffff', -100);

Reasonable, added.

> 11.
> +
> +        if (timeout > 0)
> +        {
> +            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> +            latch_events |= WL_TIMEOUT;
> +            if (delay_ms <= 0)
> +                break;
> +        }
> +
>
> Can we avoid calling GetCurrentTimestamp in a for loop, which can be
> costly at times, especially when pg_wal_replay_wait is called with
> larger timeouts on multiple backends? Can't we reuse
> pg_terminate_backend's timeout logic in pg_wait_until_termination,
> perhaps reducing waittime to 1 msec or so?

Normally there shouldn't be many loops; they only happen on spurious
wakeups.  For instance, some process was going to set our latch before
and for another reason, but due to kernel scheduling it does so only
now.  So, normally there is only one wakeup.
pg_wait_until_termination() may sacrifice timeout accuracy due to
possible spurious wakeups and time spent outside of WaitLatch().  I
don't feel it's reasonable to repeat this logic in WaitForLSN(),
especially given that we don't need frequent wakeups here.

> 12. Why should we let every user run the pg_wal_replay_wait procedure?
> Can't we revoke execute from the public in system_functions.sql so
> that one can decide who gets to run this function? Per comment #11, one
> can easily cause a lot of activity by running this function on hundreds
> of sessions.

Generally, if a user can make many connections, then this user can make
them busy and can consume resources.  Given my explanation above,
pg_wal_replay_wait() wouldn't even make the connection busy; it would
just wait on the latch.  I don't see why pg_wal_replay_wait() could do
more harm than pg_sleep().  So, I would leave pg_wal_replay_wait()
public.

Links.
1. https://www.postgresql.org/message-id/CAPpHfdtiGgn0iS1KbW2HTam-1%2BoK%2BvhXZDAcnX9hKaA7Oe%3DF-A%40mail.gmail.com

------
Regards,
Alexander Korotkov
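A sketch of the atomic-context restriction from point 7 above, assuming
the v17 behavior (the LSN is illustrative; any errdetail wording would
come from the patch):

    BEGIN;
    CALL pg_wal_replay_wait('0/306EE20');
    -- ERROR:  pg_wal_replay_wait() must be only called in non-atomic context

    -- Outside an explicit transaction block the call is allowed:
    CALL pg_wal_replay_wait('0/306EE20');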
On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > 9. To me the following query blocks even though I didn't mention a timeout.
> > CALL pg_wal_replay_wait('0/fffffff');
>
> If your primary server is freshly initialized, you need to do quite
> a lot of data modifications to reach this LSN.

Right, but why does pg_wal_replay_wait block without a timeout? It must
return an error saying it can't reach the target LSN, no?

Did you forget to attach the new patch?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Apr 1, 2024 at 5:25 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > 9. To me the following query blocks even though I didn't mention timeout.
> > CALL pg_wal_replay_wait('0/fffffff');
>
> If your primary server is freshly initialized, you need to do quite
> a lot of data modifications to reach this LSN.
Right, but why does pg_wal_replay_wait block without a timeout? It must
return an error saying it can't reach the target LSN, no?
How can the replica know this? It doesn't look feasible to distinguish this situation from the situation when the connection between primary and replica became slow. I'd keep this simple. We have pg_sleep_for(), which waits for the specified time whatever it is. And we have pg_wal_replay_wait(), which waits till the replay LSN grows to the target whatever it is. It's up to the user to specify the correct value.
Did you forget to attach the new patch?
Yes, here it is.
Regards,
Alexander Korotkov
Attachment
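As a side note on bounding waits from the client side: since the wait
exits on query cancel (as stated upthread), a statement_timeout should
also cut it short; a sketch under that assumption (the LSN is
deliberately unreachable):

    SET statement_timeout = '5s';
    CALL pg_wal_replay_wait('FFFFFFFF/FFFFFFFF');  -- cancelled after 5 seconds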
Alexander Korotkov <aekorotkov@gmail.com> writes:

Hello,

> > Did you forget to attach the new patch?
>
> Yes, here it is.
>
> ------
> Regards,
> Alexander Korotkov
>
> [4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+      </indexterm>
+       <function>pg_wal_replay_wait</function> (
+         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+       <returnvalue>void</returnvalue>
+      </para>

Should we return the milliseconds of waiting time? I think this
information may be useful for customers if they want to know how long
it waited, for monitoring purposes.

--
Best Regards
Andy Fan
Hi, Andy!

On Tue, Apr 2, 2024 at 6:29 AM Andy Fan <zhihuifan1213@163.com> wrote:
> > Did you forget to attach the new patch?
> >
> > Yes, here it is.
> >
> > [4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...
>
> +      </indexterm>
> +       <function>pg_wal_replay_wait</function> (
> +         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
> +         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
> +       <returnvalue>void</returnvalue>
> +      </para>
>
> Should we return the milliseconds of waiting time? I think this
> information may be useful for customers if they want to know how long
> it waited, for monitoring purposes.

Please, check it more carefully.  In v17 the timeout is in integer
milliseconds.

------
Regards,
Alexander Korotkov
Hi Alexander!

>> +      </indexterm>
>> +       <function>pg_wal_replay_wait</function> (
>> +         <parameter>target_lsn</parameter> <type>pg_lsn</type>,
>> +         <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
>> +       <returnvalue>void</returnvalue>
>> +      </para>
>>
>> Should we return the milliseconds of waiting time? I think this
>> information may be useful for customers if they want to know how long
>> it waited, for monitoring purposes.
>
> Please, check it more carefully.  In v17 the timeout is in integer
> milliseconds.

I guess one of us misunderstood the other :( and I admit I didn't check
the code very carefully.  Actually I meant the "return value" rather
than the function argument.  IIUC the current return value is void per
the statement below.

>> + <returnvalue>void</returnvalue>

If so, when users call pg_wal_replay_wait, they are informed when the
WAL is replayed to the target_lsn, but they can't know how long it
waited unless they check on the application side.  I think such
information would be useful for monitoring purposes sometimes.

select pg_wal_replay_wait(lsn, 1000); may take just 1ms in fact; in
this case, I want this function to return 1.

--
Best Regards
Andy Fan
On 2024-04-02 11:14, Andy Fan wrote: > If so, when users call pg_wal_replay_wait, they get informed when > the WAL is replayed to the target_lsn, but they can't know how long > it waited unless they measure it on the application side. I think such > information would sometimes be useful for monitoring purposes. > > select pg_wal_replay_wait(lsn, 1000); may in fact take just 1 ms; in > this case, I want this function to return 1. Hi Andy, to get the timing we can use \timing in psql. Here is an example. postgres=# \timing Timing is on. postgres=# select 1; ?column? ---------- 1 (1 row) Time: 0.536 ms > <returnvalue>void</returnvalue> And returning VOID is the best option, rather than returning TRUE|FALSE or the timing. It keeps the logic of the procedure very simple: we get an error if the LSN is not reached. For 8 years we have tried to add this feature, and now we suggest that the best way forward is to commit the minimal version first. Let's discuss further improvements in future versions. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > > For 8 years we have tried to add this feature, and now we suggest that > the best way forward is to commit the minimal version first. Just curious, do you or anyone else have an immediate use for this function? If yes, how are they achieving read-after-write consistency on streaming standbys in their application right now without a function like this? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: >> >> For 8 years we have tried to add this feature, and now we suggest that >> the best way forward is to commit the minimal version first. > > Just curious, do you or anyone else have an immediate use for this > function? If yes, how are they achieving read-after-write consistency > on streaming standbys in their application right now without a > function like this? The link [1] may be helpful, and the reasoning there sounds reasonable to me. Actually, we also discussed internally how to ensure "read your writes consistency", and the solution here looks good to me. Glad to know that this patch will be committed very soon. [1] https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com -- Best Regards Andy Fan
On Tue, Apr 2, 2024 at 1:11 PM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > On 2024-04-02 11:14, Andy Fan wrote: > > If so, when users call pg_wal_replay_wait, they get informed when > > the WAL is replayed to the target_lsn, but they can't know how long > > it waited unless they measure it on the application side. I think such > > information would sometimes be useful for monitoring purposes. > > > > select pg_wal_replay_wait(lsn, 1000); may in fact take just 1 ms; in > > this case, I want this function to return 1. > > Hi Andy, to get the timing we can use \timing in psql. > Here is an example. > postgres=# \timing > Timing is on. > postgres=# select 1; > ?column? > ---------- > 1 > (1 row) > > Time: 0.536 ms > > > > <returnvalue>void</returnvalue> > And returning VOID is the best option, rather than returning TRUE|FALSE > or the timing. It keeps the logic of the procedure very simple: we get an > error if the LSN is not reached. > > For 8 years we have tried to add this feature, and now we suggest that > the best way forward is to commit the minimal version first. > > Let's discuss further improvements in future versions. +1, it seems there is not yet a precedent of a built-in PostgreSQL function returning its own duration, and I don't think we need to introduce such a precedent, at least for now. That would place the responsibility for monitoring resource usage on the application. I'd also like to note that pg_wal_replay_wait() comes with a dedicated wait event. So, one could monitor the average duration of these waits using sampling of wait events. ------ Regards, Alexander Korotkov
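Such sampling could be as simple as periodically running a count over pg_stat_activity. In the sketch below, the wait event name 'WaitForWALReplay' is an assumption about what the dedicated event is called; verify it against pg_stat_activity on an actual build:

#include <stdlib.h>
#include <libpq-fe.h>

/*
 * Count backends currently inside pg_wal_replay_wait() by sampling
 * pg_stat_activity.  The wait event name 'WaitForWALReplay' is an
 * assumption; verify it against your build.  Returns -1 on error.
 */
static int
sample_lsn_waiters(PGconn *conn)
{
    PGresult   *res = PQexec(conn,
                             "SELECT count(*) FROM pg_stat_activity "
                             "WHERE wait_event = 'WaitForWALReplay'");
    int         n = -1;

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        n = atoi(PQgetvalue(res, 0, 0));
    PQclear(res);
    return n;
}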
On Tue, Apr 2, 2024 at 2:47 PM Andy Fan <zhihuifan1213@163.com> wrote: > Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > > > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > > <i.kartyshov@postgrespro.ru> wrote: > >> > >> For 8 years we have tried to add this feature, and now we suggest that > >> the best way forward is to commit the minimal version first. > > > > Just curious, do you or anyone else have an immediate use for this > > function? If yes, how are they achieving read-after-write consistency > > on streaming standbys in their application right now without a > > function like this? > > The link [1] may be helpful, and the reasoning there sounds reasonable > to me. > > Actually, we also discussed internally how to ensure "read your writes > consistency", and the solution here looks good to me. > > Glad to know that this patch will be committed very soon. > > [1] > https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com Thank you for your feedback. I can also confirm that a lot of users would be very happy to have "read your writes consistency" and are ready to do something to achieve it at the application level. However, they typically don't know what exactly they need. So, blogging about pg_wal_replay_wait() and spreading the word about it at conferences would be highly appreciated. ------ Regards, Alexander Korotkov
On 2024-04-02 13:15, Bharath Rupireddy wrote: > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: >> >> For 8 years we have tried to add this feature, and now we suggest that >> the best way forward is to commit the minimal version first. > > Just curious, do you or anyone else have an immediate use for this > function? If yes, how are they achieving read-after-write consistency > on streaming standbys in their application right now without a > function like this? Right now, the application runs pg_current_wal_lsn() on the primary after the update and then loops on the standby, polling pg_last_wal_replay_lsn() until it catches up. Or it uses slow synchronous_commit. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
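For concreteness, that polling workaround might look roughly like the following sketch on the standby side; wait_for_replay, the 10 ms sleep, and the retry cap are illustrative, not from the thread:

#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <libpq-fe.h>

/*
 * Sketch of the status-quo workaround: poll pg_last_wal_replay_lsn()
 * on the standby until it reaches the LSN captured on the primary.
 */
static bool
wait_for_replay(PGconn *standby, const char *target_lsn, int max_tries)
{
    const char *values[1] = {target_lsn};

    for (int i = 0; i < max_tries; i++)
    {
        PGresult   *res = PQexecParams(standby,
                                       "SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn",
                                       1, NULL, values, NULL, NULL, 0);
        bool        done = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                            strcmp(PQgetvalue(res, 0, 0), "t") == 0);

        PQclear(res);
        if (done)
            return true;
        usleep(10000);          /* 10 ms between polls */
    }
    return false;
}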
Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a spinlock) to protect the contents of waitLSN -- and it's used to walk an arbitrarily long list of processes waiting ... and also, an arbitrary number of processes could be calling this code. I think using a spinlock for this is unwise, as it'll cause busy-waiting whenever there's contention. Wouldn't it be better to use an LWLock for this? Then the processes would sleep until the lock is freed. While nosing about the code, other things struck me: I think there should be more comments about WaitLSNProcInfo and WaitLSNState in waitlsn.h. In addLSNWaiter it'd be better to assign 'cur' before acquiring the lock. Is a plain array really the most efficient data structure for this, considering that you have to reorder it each time you add an element? Maybe it is, but wouldn't it make sense to use memmove() when adding one element rather than iterating over all the remaining elements to the end of the queue? I think the include list in waitlsn.c could be tightened a bit: @@ -18,28 +18,18 @@ #include <math.h> #include "pgstat.h" -#include "fmgr.h" -#include "access/transam.h" -#include "access/xact.h" #include "access/xlog.h" -#include "access/xlogdefs.h" #include "access/xlogrecovery.h" -#include "catalog/pg_type.h" #include "commands/waitlsn.h" -#include "executor/spi.h" #include "funcapi.h" #include "miscadmin.h" -#include "storage/ipc.h" #include "storage/latch.h" -#include "storage/pmsignal.h" #include "storage/proc.h" #include "storage/shmem.h" -#include "storage/sinvaladt.h" -#include "utils/builtins.h" #include "utils/pg_lsn.h" #include "utils/snapmgr.h" -#include "utils/timestamp.h" #include "utils/fmgrprotos.h" +#include "utils/wait_event_types.h" -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Things are good or bad according to how our opinion makes them" (Lisias)
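For illustration, the memmove() variant suggested above could look like this generic sketch, with uint64_t keys standing in for the real waiter entries keyed by LSN:

#include <stdint.h>
#include <string.h>

/*
 * Generic sketch of the suggestion: locate the slot with a binary
 * search, then shift the tail with a single memmove() instead of
 * walking the remaining elements one by one.  The caller must ensure
 * the array has room for one more element.
 */
static void
sorted_insert(uint64_t *arr, int *nitems, uint64_t key)
{
    int         lo = 0,
                hi = *nitems;

    while (lo < hi)
    {
        int         mid = (lo + hi) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* Shift everything at and after the slot up by one element. */
    memmove(&arr[lo + 1], &arr[lo], (*nitems - lo) * sizeof(uint64_t));
    arr[lo] = key;
    (*nitems)++;
}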
Buildfarm animal mamba (NetBSD 10.0 on macppc) started failing on this with what seems like a bogus compiler warning: waitlsn.c: In function 'WaitForLSN': waitlsn.c:275:24: error: 'endtime' may be used uninitialized in this function [-Werror=maybe-uninitialized] 275 | delay_ms = (endtime - GetCurrentTimestamp()) / 1000; | ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors endtime is indeed initialized further up, but initializing endtime at declaration seems innocent enough and should make this compiler happy and the buildfarm greener. diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c index 6679378156..17ad0057ad 100644 --- a/src/backend/commands/waitlsn.c +++ b/src/backend/commands/waitlsn.c @@ -226,7 +226,7 @@ void WaitForLSN(XLogRecPtr targetLSN, int64 timeout) { XLogRecPtr currentLSN; - TimestampTz endtime; + TimestampTz endtime = 0; /* Shouldn't be called when shmem isn't initialized */ Assert(waitLSN); -- Daniel Gustafsson
> I can also confirm that a lot of users would be very happy to have > "read your writes consistency" and are ready to do something to achieve > it at the application level. However, they typically don't know what > exactly they need. > > So, blogging about pg_wal_replay_wait() and spreading the word about it > at conferences would be highly appreciated. Sure, once it is committed, I promise to do a knowledge-sharing session in our organization and write an article in Chinese. -- Best Regards Andy Fan
Hi, Alvaro! Thank you for your feedback. On Wed, Apr 3, 2024 at 9:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a > spinlock) to protect the contents of waitLSN -- and it's used to walk an > arbitrarily long list of processes waiting ... and also, an arbitrary > number of processes could be calling this code. I think using a > spinlock for this is unwise, as it'll cause busy-waiting whenever > there's contention. Wouldn't it be better to use an LWLock for this? > Then the processes would sleep until the lock is freed. > > While nosing about the code, other things struck me: > > I think there should be more comments about WaitLSNProcInfo and > WaitLSNState in waitlsn.h. > > In addLSNWaiter it'd be better to assign 'cur' before acquiring the > lock. > > Is a plain array really the most efficient data structure for this, > considering that you have to reorder it each time you add an element? > Maybe it is, but wouldn't it make sense to use memmove() when adding one > element rather than iterating over all the remaining elements to the end > of the queue? > > I think the include list in waitlsn.c could be tightened a bit: I've just pushed a commit, which shortens the include list and fixes the order of assigning 'cur' and taking the spinlock in addLSNWaiter(). Regarding the shmem data structure for LSN waiters: I didn't pick LWLock or ConditionVariable, because I needed the ability to wake up only those waiters whose LSN is already replayed. In my experience, waking up a process is way slower than scanning a short flat array. However, I agree that when the number of waiters is very high, a flat array may become a problem. It seems that the pairing heap is not hard to use for shmem structures. The only memory allocation call in pairingheap.c is in pairingheap_allocate(). So, it's only needed to be able to initialize the pairing heap in-place, and it will be fine for shmem. I'll come back shortly with a switch to the pairing heap. ------ Regards, Alexander Korotkov
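For what it's worth, since a pairingheap is just a comparator, an argument pointer, and a root pointer, such an in-place initializer could plausibly be as small as the following sketch; it is written against lib/pairingheap.h and was not existing API at the time of this message:

#include "postgres.h"
#include "lib/pairingheap.h"

/*
 * Sketch of an in-place initializer for a pairing heap living in shared
 * memory: the same effect as pairingheap_allocate(), minus the palloc().
 * Field names follow lib/pairingheap.h; not existing API at the time of
 * writing.
 */
static void
pairingheap_initialize_inplace(pairingheap *heap,
                               pairingheap_comparator compare,
                               void *arg)
{
    heap->ph_compare = compare;
    heap->ph_arg = arg;
    heap->ph_root = NULL;
}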
Hello Alexander, On 2024-Apr-03, Alexander Korotkov wrote: > Regarding the shmem data structure for LSN waiters: I didn't pick > LWLock or ConditionVariable, because I needed the ability to wake up > only those waiters whose LSN is already replayed. In my experience, > waking up a process is way slower than scanning a short flat array. I agree, but I think that's unrelated to what I was saying, which is just the patch I attach here. > However, I agree that when the number of waiters is very high, a flat > array may become a problem. It seems that the pairing heap is not > hard to use for shmem structures. The only memory allocation call in > pairingheap.c is in pairingheap_allocate(). So, it's only needed to > be able to initialize the pairing heap in-place, and it will be fine > for shmem. Ok. With the code as it stands today, everything in WaitLSNState apart from the pairing heap is accessed without any locking. I think this is at least partly OK because each backend only accesses its own entry; but it deserves a comment. Or maybe something more, because WaitLSNSetLatches does modify the entry for other backends. (Admittedly, this could only happen for backends that are already sleeping, and it only happens with the lock acquired, so it's probably okay. But clearly it deserves a comment.) Don't we need to call WaitLSNCleanup() during error recovery or something? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "World domination is proceeding according to plan" (Andrew Morton)
Attachment
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Apr-03, Alexander Korotkov wrote: > > > Regarding the shmem data structure for LSN waiters: I didn't pick > > LWLock or ConditionVariable, because I needed the ability to wake up > > only those waiters whose LSN is already replayed. In my experience, > > waking up a process is way slower than scanning a short flat array. > > I agree, but I think that's unrelated to what I was saying, which is > just the patch I attach here. Oh, sorry for the confusion. I've re-read your message; indeed, you stated this very clearly! I'm good with the patch. The attached revision contains a bit of a commit message. > > However, I agree that when the number of waiters is very high, a flat > > array may become a problem. It seems that the pairing heap is not > > hard to use for shmem structures. The only memory allocation call in > > pairingheap.c is in pairingheap_allocate(). So, it's only needed to > > be able to initialize the pairing heap in-place, and it will be fine > > for shmem. > > Ok. > > With the code as it stands today, everything in WaitLSNState apart from > the pairing heap is accessed without any locking. I think this is at > least partly OK because each backend only accesses its own entry; but it > deserves a comment. Or maybe something more, because WaitLSNSetLatches > does modify the entry for other backends. (Admittedly, this could only > happen for backends that are already sleeping, and it only happens > with the lock acquired, so it's probably okay. But clearly it deserves > a comment.) Please check the attached 0002 patch. I found it easier to move the two assignments we previously moved out of the lock back under the lock; then claim that WaitLSNState.procInfos is also protected by WaitLSNLock. > Don't we need to call WaitLSNCleanup() during error recovery or something? Yes, there is WaitLSNCleanup(). It's currently only called from one place, given that waiting for an LSN can't be invoked from background workers or inside a transaction. ------ Regards, Alexander Korotkov
Hi, Alexander!
On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Apr-03, Alexander Korotkov wrote:
>
> > Regarding the shmem data structure for LSN waiters. I didn't pick
> > LWLock or ConditionVariable, because I needed the ability to wake up
> > only those waiters whose LSN is already replayed. In my experience
> > waking up a process is way slower than scanning a short flat array.
>
> I agree, but I think that's unrelated to what I was saying, which is
> just the patch I attach here.
Oh, sorry for the confusion. I've re-read your message; indeed, you
stated this very clearly!
I'm good with the patch. The attached revision contains a bit of a commit message.
> > However, I agree that when the number of waiters is very high, a flat
> > array may become a problem. It seems that the pairing heap is not
> > hard to use for shmem structures. The only memory allocation call in
> > pairingheap.c is in pairingheap_allocate(). So, it's only needed to
> > be able to initialize the pairing heap in-place, and it will be fine
> > for shmem.
>
> Ok.
>
> With the code as it stands today, everything in WaitLSNState apart from
> the pairing heap is accessed without any locking. I think this is at
> least partly OK because each backend only accesses its own entry; but it
> deserves a comment. Or maybe something more, because WaitLSNSetLatches
> does modify the entry for other backends. (Admittedly, this could only
> happen for backends that are already sleeping, and it only happens
> with the lock acquired, so it's probably okay. But clearly it deserves
> a comment.)
Please check the attached 0002 patch. I found it easier to move the two
assignments we previously moved out of the lock back under the lock; then
claim that WaitLSNState.procInfos is also protected by WaitLSNLock.
Could you re-attach 0002? It seems it failed to attach to the previous message.
Regards,
Pavel
On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com> wrote: >> >> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> > >> > On 2024-Apr-03, Alexander Korotkov wrote: >> > >> > > Regarding the shmem data structure for LSN waiters. I didn't pick >> > > LWLock or ConditionVariable, because I needed the ability to wake up >> > > only those waiters whose LSN is already replayed. In my experience >> > > waking up a process is way slower than scanning a short flat array. >> > >> > I agree, but I think that's unrelated to what I was saying, which is >> > just the patch I attach here. >> >> Oh, sorry for the confusion. I'd re-read your message. Indeed you >> meant this very clearly! >> >> I'm good with the patch. Attached revision contains a bit of a commit message. >> >> > > However, I agree that when the number of waiters is very high and flat >> > > array may become a problem. It seems that the pairing heap is not >> > > hard to use for shmem structures. The only memory allocation call in >> > > paritingheap.c is in pairingheap_allocate(). So, it's only needed to >> > > be able to initialize the pairing heap in-place, and it will be fine >> > > for shmem. >> > >> > Ok. >> > >> > With the code as it stands today, everything in WaitLSNState apart from >> > the pairing heap is accessed without any locking. I think this is at >> > least partly OK because each backend only accesses its own entry; but it >> > deserves a comment. Or maybe something more, because WaitLSNSetLatches >> > does modify the entry for other backends. (Admittedly, this could only >> > happens for backends that are already sleeping, and it only happens >> > with the lock acquired, so it's probably okay. But clearly it deserves >> > a comment.) >> >> Please, check 0002 patch attached. I found it easier to move two >> assignments we previously moved out of lock, into the lock; then claim >> WaitLSNState.procInfos is also protected by WaitLSNLock. > > Could you re-attach 0002. Seems it failed to attach to the previous message. I actually forgot both! ------ Regards, Alexander Korotkov
Attachment
Hello, BTW I noticed that https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html says that lsn_cmp is not covered by the tests. This probably indicates that the tests are a little too light, but I'm not sure how much extra effort we want to spend. I'm still concerned that WaitLSNCleanup is only called in ProcKill. Does this mean that if a process throws an error while waiting, it'll not get cleaned up until it exits? Maybe this is not a big deal, but it seems odd. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Now I have my system running, not a byte was off the shelf; It rarely breaks and when it does I fix the code myself. It's stable, clean and elegant, and lightning fast as well, And it doesn't cost a nickel, so Bill Gates can go to hell."
Hi, Alvaro! Thank you for your attention to this matter. On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > BTW I noticed that > https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html > says that lsn_cmp is not covered by the tests. This probably indicates > that the tests are a little too light, but I'm not sure how much extra > effort we want to spend. I'm aware of this. Ivan promised to send a patch to improve the test. If he doesn't, I'll take care of it. > I'm still concerned that WaitLSNCleanup is only called in ProcKill. > Does this mean that if a process throws an error while waiting, it'll > not get cleaned up until it exits? Maybe this is not a big deal, but it > seems odd. I've added a WaitLSNCleanup() call to AbortTransaction(). Just pushed that together with the improvements upthread. ------ Regards, Alexander Korotkov
I did some experiments with synchronous replication and found that cascade replication can't be synchronous. And pg_wal_replay_wait() gives us read-your-writes consistency on cascade replication. Beyond that, I added more tests for multi-standby replication and cascade replication. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
On 07/04/2024 00:52, Alexander Korotkov wrote: > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> I'm still concerned that WaitLSNCleanup is only called in ProcKill. >> Does this mean that if a process throws an error while waiting, it'll >> not get cleaned up until it exits? Maybe this is not a big deal, but it >> seems odd. > > I've added a WaitLSNCleanup() call to AbortTransaction(). Just pushed > that together with the improvements upthread. Race condition: 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend process to the LSN heap and returns 3. replay: rm_redo record '0/1234' 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr() 5. backend: current replay LSN location is '0/1234', so we exit the loop 6. replay: calls WaitLSNSetLatches() In a nutshell, it's possible for the loop in WaitForLSN to exit without cleaning up the process from the heap. I was able to hit that by adding a delay after the addLSNWaiter() call: > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] I think there's a similar race condition if the timeout is reached at the same time that the startup process wakes up the process. > * At first, we check that pg_wal_replay_wait() is called in a non-atomic > * context. That is, a procedure call isn't wrapped into a transaction, > * another procedure call, or a function call. > * It's pretty unfortunate to have all these restrictions. It would be nice to do: select pg_wal_replay_wait('0/1234'); select * from foo; in a single multi-query call, to avoid the round-trip to the client. You can avoid it with libpq or protocol level pipelining, too, but it's more complicated. > * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic > * context, CallStmt is processed with a snapshot. Thankfully, we can pop > * this snapshot, because PortalRunUtility() can tolerate this. This assumption that PortalRunUtility() can tolerate us popping the snapshot sounds very fishy. I haven't looked at what's going on there, but it doesn't sound like a great assumption. If recovery ends while a process is waiting for an LSN to arrive, does it ever get woken up? The docs could use some copy-editing, but just to point out one issue: > There are also procedures to control the progress of recovery. That's copy-pasted from an earlier sentence in the table that lists functions like pg_promote(), pg_wal_replay_pause(), and pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the progress of recovery like those functions do; it only causes the calling backend to wait. Overall, this feature doesn't feel quite ready for v17, and IMHO should be reverted. It's a nice feature, so I'd love to have it fixed and reviewed early in the v18 cycle. -- Heikki Linnakangas Neon (https://neon.tech)
Hi, Heikki! Thank you for your interest in the subject. On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 07/04/2024 00:52, Alexander Korotkov wrote: > > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> I'm still concerned that WaitLSNCleanup is only called in ProcKill. > >> Does this mean that if a process throws an error while waiting, it'll > >> not get cleaned up until it exits? Maybe this is not a big deal, but it > >> seems odd. > > > > I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed > > that together with the improvements upthread. > > Race condition: > > 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN > 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend > process to the LSN heap and returns > 3. replay: rm_redo record '0/1234' > 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr() > 5. backend: current replay LSN location is '0/1234', so we exit the loop > 6. replay: calls WaitLSNSetLatches() > > In a nutshell, it's possible for the loop in WaitForLSN to exit without > cleaning up the process from the heap. I was able to hit that by adding > a delay after the addLSNWaiter() call: > > > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > > I think there's a similar race condition if the timeout is reached at > the same time that the startup process wakes up the process. Thank you for catching this. I think WaitForLSN() just needs to call deleteLSNWaiter() unconditionally after exit from the loop. > > * At first, we check that pg_wal_replay_wait() is called in a non-atomic > > * context. That is, a procedure call isn't wrapped into a transaction, > > * another procedure call, or a function call. > > * > > It's pretty unfortunate to have all these restrictions. It would be nice > to do: > > select pg_wal_replay_wait('0/1234'); select * from foo; This works for me, except that it needs "call" not "select". # call pg_wal_replay_wait('0/1234'); select * from foo; CALL i --- (0 rows) > in a single multi-query call, to avoid the round-trip to the client. You > can avoid it with libpq or protocol level pipelining, too, but it's more > complicated. > > > * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic > > * context, CallStmt is processed with a snapshot. Thankfully, we can pop > > * this snapshot, because PortalRunUtility() can tolerate this. > > This assumption that PortalRunUtility() can tolerate us popping the > snapshot sounds very fishy. I haven't looked at what's going on there, > but doesn't sound like a great assumption. This is what PortalRunUtility() says about this. /* * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from * under us, so don't complain if it's now empty. Otherwise, our snapshot * should be the top one; pop it. Note that this could be a different * snapshot from the one we made above; see EnsurePortalSnapshotExists. 
*/ So, if VACUUM pops a snapshot when it needs to run without one, then it's probably OK for other utilities. But I agree this decision needs some consensus. > If recovery ends while a process is waiting for an LSN to arrive, does > it ever get woken up? If the recovery target is promote, then the user will get an error. If the recovery target is shutdown, then the connection will get interrupted. If the recovery target is pause, then waiting will continue during the pause. Not sure about the latter case. > The docs could use some copy-editing, but just to point out one issue: > > > There are also procedures to control the progress of recovery. > > That's copy-pasted from an earlier sentence in the table that lists > functions like pg_promote(), pg_wal_replay_pause(), and > pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the > progress of recovery like those functions do; it only causes the calling > backend to wait. > > Overall, this feature doesn't feel quite ready for v17, and IMHO should > be reverted. It's a nice feature, so I'd love to have it fixed and > reviewed early in the v18 cycle. Thank you for your review. I've reverted this. Will repost this for early v18. ------ Regards, Alexander Korotkov
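To make the fix concrete, the reworked wait loop might look like the sketch below. This is only an illustration: it is a fragment of WaitForLSN(), the helper names addLSNWaiter()/deleteLSNWaiter() follow waitlsn.c as quoted in this thread, and wake_events/wait_event_info stand in for values computed earlier in the function.

/*
 * Fragment of WaitForLSN(), sketched: however we leave the loop (LSN
 * reached, timeout, or latch set), we unconditionally remove ourselves
 * from the waiters heap afterwards, so the race described above cannot
 * leave a stale entry behind.
 */
addLSNWaiter(targetLSN);
endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);

for (;;)
{
    long        delay_ms = -1;  /* wait indefinitely by default */
    int         rc;

    /* Recheck after registering, not only before; this closes the race. */
    if (GetXLogReplayRecPtr(NULL) >= targetLSN)
        break;

    if (timeout > 0)
    {
        delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
        if (delay_ms <= 0)
            break;              /* timed out */
    }

    CHECK_FOR_INTERRUPTS();
    rc = WaitLatch(MyLatch, wake_events, delay_ms, wait_event_info);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);
}

/* Always deregister, even if the startup process won the race. */
deleteLSNWaiter();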
On 11/04/2024 18:09, Alexander Korotkov wrote: > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> On 07/04/2024 00:52, Alexander Korotkov wrote: >>> * At first, we check that pg_wal_replay_wait() is called in a non-atomic >>> * context. That is, a procedure call isn't wrapped into a transaction, >>> * another procedure call, or a function call. >>> * >> >> It's pretty unfortunate to have all these restrictions. It would be nice >> to do: >> >> select pg_wal_replay_wait('0/1234'); select * from foo; > > This works for me, except that it needs "call" not "select". > > # call pg_wal_replay_wait('0/1234'); select * from foo; > CALL > i > --- > (0 rows) If you do that from the psql prompt, it works because psql parses and sends it as two separate round-trips. Try: psql postgres -p 5433 -c "call pg_wal_replay_wait('0/4101BBB8'); select 1" ERROR: pg_wal_replay_wait() must be only called in non-atomic context DETAIL: Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function. >> This assumption that PortalRunUtility() can tolerate us popping the >> snapshot sounds very fishy. I haven't looked at what's going on there, >> but it doesn't sound like a great assumption. > > This is what PortalRunUtility() says about this. > > /* > * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from > * under us, so don't complain if it's now empty. Otherwise, our snapshot > * should be the top one; pop it. Note that this could be a different * snapshot from the one we made above; see EnsurePortalSnapshotExists. > */ > > So, if VACUUM pops a snapshot when it needs to run without > one, then it's probably OK for other utilities. But I agree this > decision needs some consensus. Ok, so it's at least somewhat documented that it's fine. >> Overall, this feature doesn't feel quite ready for v17, and IMHO should >> be reverted. It's a nice feature, so I'd love to have it fixed and >> reviewed early in the v18 cycle. > > Thank you for your review. I've reverted this. Will repost this for early v18. Thanks Alexander for working on this. -- Heikki Linnakangas Neon (https://neon.tech)
Hi, Alexander, here I made some improvements according to your discussion with Heikki. On 2024-04-11 18:09, Alexander Korotkov wrote: > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > wrote: >> In a nutshell, it's possible for the loop in WaitForLSN to exit >> without >> cleaning up the process from the heap. I was able to hit that by >> adding >> a delay after the addLSNWaiter() call: >> >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] >> >> I think there's a similar race condition if the timeout is reached at >> the same time that the startup process wakes up the process. > > Thank you for catching this. I think WaitForLSN() just needs to call > deleteLSNWaiter() unconditionally after exit from the loop. Fixed this and added an injection point test for this race condition. > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > wrote: >> The docs could use some copy-editing, but just to point out one issue: >> >> > There are also procedures to control the progress of recovery. >> >> That's copy-pasted from an earlier sentence in the table that lists >> functions like pg_promote(), pg_wal_replay_pause(), and >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control >> the >> progress of recovery like those functions do; it only causes the >> calling >> backend to wait. Fixed the documentation and added extra tests for multi-standby replication and cascade replication. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
Hi, Ivan! On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote: > > Hi, Alexander, here I made some improvements according to your > discussion with Heikki. > > On 2024-04-11 18:09, Alexander Korotkov wrote: > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > wrote: > >> In a nutshell, it's possible for the loop in WaitForLSN to exit > >> without > >> cleaning up the process from the heap. I was able to hit that by > >> adding > >> a delay after the addLSNWaiter() call: > >> > >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > >> > >> I think there's a similar race condition if the timeout is reached at > >> the same time that the startup process wakes up the process. > > > > Thank you for catching this. I think WaitForLSN() just needs to call > > deleteLSNWaiter() unconditionally after exit from the loop. > > Fixed this and added an injection point test for this race condition. > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > wrote: > >> The docs could use some copy-editing, but just to point out one issue: > >> > >> > There are also procedures to control the progress of recovery. > >> > >> That's copy-pasted from an earlier sentence in the table that lists > >> functions like pg_promote(), pg_wal_replay_pause(), and > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control > >> the > >> progress of recovery like those functions do; it only causes the > >> calling > >> backend to wait. > > Fixed the documentation and added extra tests for multi-standby replication > and cascade replication. Thank you for the revised patch. I see a couple of items which are not addressed in this revision. * As Heikki pointed out, it's currently not possible in one round trip to call pg_wal_replay_wait() and do other work. The attached patch addresses this. It relaxes the requirement: now it's not necessary to be in a non-atomic context; it's only necessary to have no active snapshot. This new requirement works for me so far. I'd appreciate feedback on this. * As Alvaro pointed out, the multiple-waiters case isn't covered by the test suite. That leads to no coverage of some code paths. The attached patch addresses this by adding a test case with multiple waiters. The rest looks good to me. Links. 1. https://www.postgresql.org/message-id/d1303584-b763-446c-9409-f4516118219f%40iki.fi 2. https://www.postgresql.org/message-id/202404051815.eri4u5q6oj26%40alvherre.pgsql ------ Regards, Alexander Korotkov Supabase
Attachment
On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan > <i.kartyshov@postgrespro.ru> wrote: > > > > Hi, Alexander, Here, I made some improvements according to your > > discussion with Heikki. > > > > On 2024-04-11 18:09, Alexander Korotkov wrote: > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > > wrote: > > >> In a nutshell, it's possible for the loop in WaitForLSN to exit > > >> without > > >> cleaning up the process from the heap. I was able to hit that by > > >> adding > > >> a delay after the addLSNWaiter() call: > > >> > > >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152 > > >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b] > > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8] > > >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc] > > >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5] > > >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a] > > >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9] > > >> > > >> I think there's a similar race condition if the timeout is reached at > > >> the same time that the startup process wakes up the process. > > > > > > Thank you for catching this. I think WaitForLSN() just needs to call > > > deleteLSNWaiter() unconditionally after exit from the loop. > > > > Fix and add injection point test on this race condition. > > > > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> > > > wrote: > > >> The docs could use some-copy-editing, but just to point out one issue: > > >> > > >> > There are also procedures to control the progress of recovery. > > >> > > >> That's copy-pasted from an earlier sentence at the table that lists > > >> functions like pg_promote(), pg_wal_replay_pause(), and > > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control > > >> the > > >> progress of recovery like those functions do, it only causes the > > >> calling > > >> backend to wait. > > > > Fix documentation and add extra tests on multi-standby replication > > and cascade replication. > > Thank you for the revised patch. > > I see couple of items which are not addressed in this revision. > * As Heikki pointed, that it's currently not possible in one round > trip to call call pg_wal_replay_wait() and do other job. The attached > patch addresses this. It milds the requirement. Now, it's not > necessary to be in atomic context. It's only necessary to have no > active snapshot. This new requirement works for me so far. I > appreciate a feedback on this. > * As Alvaro pointed, multiple waiters case isn't covered by the test > suite. That leads to no coverage of some code paths. The attached > patch addresses this by adding a test case with multiple waiters. > > The rest looks good to me. Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. 1. It's not clear why this test needs node_standby2 at all. It seems useless. 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This location seems to be unachievable in this test. So, it's not clear what race condition this test could potentially detect. 3. I think it would make sense to check for the race condition reported by Heikki. That is to insert the injection point at the beginning of WaitLSNSetLatches(). Links. 
1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 ------ Regards, Alexander Korotkov Supabase
Hi, I looked through the patch and have some comments. ====== L68: + <title>Recovery Procedures</title> It looks somewhat confusing and appears as if the section is intended to explain how to perform recovery. Since this is the first built-in procedure, I'm not sure how this section should be written. However, the section immediately above is named "Recovery Control Functions", so "Recovery Synchronization Functions" would align better with the naming of the upper section. (I don't believe we need to be so strict about the distinction between functions and procedures here.) It looks strange that the procedure signature includes the return type. ====== L93: + If <parameter>timeout</parameter> is not specified or zero, this + procedure returns once WAL is replayed upto + <literal>target_lsn</literal>. + If the <parameter>timeout</parameter> is specified (in + milliseconds) and greater than zero, the procedure waits until the + server actually replays the WAL upto <literal>target_lsn</literal> or + until the given time has passed. On timeout, an error is emitted. The first sentence should mention the main functionality. Following precedents, it might be better to use something like "Waits until recovery surpasses the specified LSN. If no timeout is specified or it is set to zero, this procedure waits indefinitely for the LSN. If the timeout is specified (in milliseconds) and is greater than zero, the procedure waits until the LSN is reached or the specified time has elapsed. On timeout, or if the server is promoted before the LSN is reached, an error is emitted." The detailed explanation that follows the above seems somewhat too verbose to me, as other functions don't have such detailed examples. ====== L484 /* + * Set latches for processes, whose waited LSNs are already replayed. This + * involves spinlocks. So, we shouldn't do this under a spinlock. + */ Here, I'm not quite sure what spinlock (or mutex?) this is specifically referring to. However, more importantly, shouldn't we explain that it is okay not to use any locks at all, rather than mentioning that spinlocks should not be used here? I found a similar case around freelist.c:238, which is written as follows. > * Not acquiring ProcArrayLock here which is slightly icky. It's > * actually fine because procLatch isn't ever freed, so we just can > * potentially set the wrong process' (or no process') latch. > */ > SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch); ===== L518 +void +WaitForLSN(XLogRecPtr targetLSN, int64 timeout) This function is only called within the same module. I'm not sure if we need to expose it. If we do, the name should probably be more specific. I'm not quite sure if the division of functionality between this function and its only caller is appropriate. As a possible refactoring, we could have WaitForLSN() just return the result as [reached, timedout, promoted] and delegate prerequisite checks and error reporting to the SQL function. ===== L524 + /* Shouldn't be called when shmem isn't initialized */ + Assert(waitLSN); Seeing this assertion, I feel that the name "waitLSN" is a bit obscure. How about renaming it to "waitLSNStates"? ===== L527 + /* Should be only called by a backend */ + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends); This is somewhat excessive, causing a server crash when ending with an error would suffice. By the way, if I execute "CALL pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes. The condition doesn't seem appropriate.
===== L561 + errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.", I don't think we need "the" before "target" in the above message. ===== L565 + if (timeout > 0) + { + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; + latch_events |= WL_TIMEOUT; + if (delay_ms <= 0) + break; "timeout" is immutable in the function. Therefore, we can calculate "latch_events" before entering the loop. By the way, the name 'latch_events' seems a bit off. A latch is just one kind of event the function can wait for. Therefore, something like wait_events might be more appropriate. ===== L567 + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; We can use TimestampDifferenceMilliseconds() here. ==== L578 + if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); I think we usually reset latches unconditionally after returning from WaitLatch(), even when waiting for timeouts. ===== L756 +{ oid => '16387', descr => 'wait for LSN with timeout', The description seems a bit off. While timeout is mentioned, the more important characteristic that the LSN is the replay LSN is not included. ===== L791 + * WaitLSNProcInfo [e2 80 93] the shared memory structure representing information This line contains a non-ASCII character EN DASH (U+2013). ===== L798 + * A process number, same as the index of this item in waitLSN->procInfos. + * Stored for convenience. + */ + int procnum; It is described as "(just) for convenience". However, it is referenced by Startup to fetch the PGPROC entry for the waiter, which is necessary for Startup. That aside, why don't we hold (the pointer to) procLatch instead of procnum? It makes things simpler and I believe it is our standard practice. ===== L809 + /* A flag indicating that this item is added to waitLSN->waitersHeap */ + bool inHeap; The name "inHeap" seems too literal and might be hard to understand in most occurrences. How about using "waiting" instead? ===== L920 +# I +# Make sure that pg_wal_replay_wait() works: add new content to Hmm. I feel that Arabic numerals look nicer than Roman ones here. ===== L940 +# Check that new data is visible after calling pg_wal_replay_wait() On the other hand, the comment for the check for this test states that > +# Make sure the current LSN on standby and is the same as primary's > LSN +ok($output eq 30, "standby reached the same LSN as primary"); I think the first comment and the second should be consistent. > Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. > > 1. It's not clear why this test needs node_standby2 at all. It seems useless. I agree with you. What we would need is a second *waiter client* connecting to the same standby rather than a second standby. I'd like to have a test where the first waiter is removed while multiple waiters are waiting, as well as a test where promotion occurs under the same circumstances. > 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This > location seems to be unachievable in this test. So, it's not clear > what race condition this test could potentially detect. > 3. I think it would make sense to check for the race condition > reported by Heikki. That is to insert the injection point at the > beginning of WaitLSNSetLatches(). I think the race condition you mentioned refers to the inconsistency between the inHeap flag and the pairing heap caused by a race condition between timeout and wakeup (or perhaps other combinations? I'm not sure which version of the patch the mentioned race condition refers to).
However, I imagine it is difficult to reliably reproduce this condition. In that regard, in the latest patch, the coherence between the inHeap flag and the pairing heap is protected by an LWLock, so I believe we no longer need that test. regards. > Links. > 1. https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 -- Kyotaro Horiguchi NTT Open Source Software Center
Thank you for your interest in the patch. On 2024-06-20 11:30, Kyotaro Horiguchi wrote: > Hi, I looked through the patch and have some comments. > > > ====== L68: > + <title>Recovery Procedures</title> > > It looks somewhat confusing and appears as if the section is intended > to explain how to perform recovery. Since this is the first built-in > procedure, I'm not sure how this section should be written. However, > the section immediately above is named "Recovery Control Functions", > so "Recovery Synchronization Functions" would align better with the > naming of the upper section. (I don't believe we need to be so strict > about the distinction between functions and procedures here.) > > It looks strange that the procedure signature includes the return type. Good point, changed Recovery Procedures -> Recovery Synchronization Procedures. > ====== L93: > + If <parameter>timeout</parameter> is not specified or zero, > this > + procedure returns once WAL is replayed upto > + <literal>target_lsn</literal>. > + If the <parameter>timeout</parameter> is specified (in > + milliseconds) and greater than zero, the procedure waits until > the > + server actually replays the WAL upto > <literal>target_lsn</literal> or > + until the given time has passed. On timeout, an error is > emitted. > > The first sentence should mention the main functionality. Following > precedents, it might be better to use something like "Waits until > recovery surpasses the specified LSN. If no timeout is specified or it > is set to zero, this procedure waits indefinitely for the LSN. If the > timeout is specified (in milliseconds) and is greater than zero, the > procedure waits until the LSN is reached or the specified time has > elapsed. On timeout, or if the server is promoted before the LSN is > reached, an error is emitted." > > The detailed explanation that follows the above seems somewhat too > verbose to me, as other functions don't have such detailed examples. Please offer your version of the description; I think it would be better. > ====== L484 > /* > + * Set latches for processes, whose waited LSNs are already replayed. > This > + * involves spinlocks. So, we shouldn't do this under a spinlock. > + */ > > Here, I'm not quite sure what spinlock (or mutex?) this is specifically > referring to. However, more importantly, shouldn't we explain that it > is okay not to use any locks at all, rather than mentioning that > spinlocks should not be used here? I found a similar case around > freelist.c:238, which is written as follows. > >> * Not acquiring ProcArrayLock here which is slightly icky. It's >> * actually fine because procLatch isn't ever freed, so we just can >> * potentially set the wrong process' (or no process') latch. >> */ >> SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch); ??? > ===== L518 > +void > +WaitForLSN(XLogRecPtr targetLSN, int64 timeout) > > This function is only called within the same module. I'm not sure if > we need to expose it. If we do, the name should probably be more > specific. I'm not quite sure if the division of functionality between > this function and its only caller is appropriate. As a > possible refactoring, we could have WaitForLSN() just return the > result as [reached, timedout, promoted] and delegate prerequisite > checks and error reporting to the SQL function. waitLSN -> waitLSNStates No, waitLSNStates is not the best name, because waitLSNState is a state, and waitLSN is not the array of waitLSNStates. We can think about another name; what do you think?
> ===== L524 > + /* Shouldn't be called when shmem isn't initialized */ > + Assert(waitLSN); > > Seeing this assertion, I feel that the name "waitLSN" is a bit > obscure. How about renaming it to "waitLSNStates"? > ===== L527 > + /* Should be only called by a backend */ > + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends); > > This is somewhat excessive, causing a server crash when ending with an > error would suffice. By the way, if I execute "CALL > pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes. > The condition doesn't seem appropriate. Can you give more information on your server crashes, so I can reproduce them? > ===== L565 > + if (timeout > 0) > + { > + delay_ms = (endtime - GetCurrentTimestamp()) / 1000; > + latch_events |= WL_TIMEOUT; > + if (delay_ms <= 0) > + break; > > "timeout" is immutable in the function. Therefore, we can calculate > "latch_events" before entering the loop. By the way, the name > 'latch_events' seems a bit off. A latch is just one kind of event the function > can wait for. Therefore, something like wait_events might be more > appropriate. "wait_events" can't be it, because in the latch declaration these events are responsible for wake-up and not for waiting: int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32 wait_event_info) > ==== L578 > + if (rc & WL_LATCH_SET) > + ResetLatch(MyLatch); > > I think we usually reset latches unconditionally after returning from > WaitLatch(), even when waiting for timeouts. No, it depends on your logic: when you have several wake events, you may want to choose which event set your latch. Check applyparallelworker.c:813. > ===== L798 > + * A process number, same as the index of this item in > waitLSN->procInfos. > + * Stored for convenience. > + */ > + int procnum; > > It is described as "(just) for convenience". However, it is referenced > by Startup to fetch the PGPROC entry for the waiter, which is necessary > for Startup. That aside, why don't we hold (the pointer to) procLatch > instead of procnum? It makes things simpler and I believe it is our > standard practice. Can you explain more deeply what you mean and give an example? > ===== L809 > + /* A flag indicating that this item is added to waitLSN->waitersHeap > */ > + bool inHeap; > > The name "inHeap" seems too literal and might be hard to understand in > most occurrences. How about using "waiting" instead? No, inHeap literally means "in heap"; check the comment. Please suggest a more suitable name. > ===== L940 > +# Check that new data is visible after calling pg_wal_replay_wait() On the other hand, the comment for the check for this test states that >> +# Make sure the current LSN on standby and is the same as primary's >> LSN +ok($output eq 30, "standby reached the same LSN as primary"); > > I think the first comment and the second should be consistent. Thanks, I'll rephrase this comment. >> Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. >> >> 1. It's not clear why this test needs node_standby2 at all. It seems >> useless. > > I agree with you. What we would need is a second *waiter client* > connecting to the same standby rather than a second standby. I'd > like to have a test where the first waiter is removed while multiple > waiters are waiting, as well as a test where promotion occurs under > the same circumstances. Can you give more information about these cases step by step, and what "removed" and "promotion" mean here? >> 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000.
This >> location seems to be unachievable in this test. So, it's not clear >> what race condition this test could potentially detect. >> 3. I think it would make sense to check for the race condition >> reported by Heikki. That is to insert the injection point at the >> beginning of WaitLSNSetLatches(). > > I think the race condition you mentioned refers to the inconsistency > between the inHeap flag and the pairing heap caused by a race > condition between timeout and wakeup (or perhaps other combinations? > I'm not sure which version of the patch the mentioned race condition > refers to). However, I imagine it is difficult to reliably reproduce > this condition. In that regard, in the latest patch, the coherence > between the inHeap flag and the pairing heap is protected by an LWLock, > so I believe we no longer need that test. No, Alexander means that Heikki pointed at the race condition just before the LWLock. We can inject an injection point and step in on the backend, but WaitLSNSetLatches() is called from the recovery process. And I have trouble waking up the injection point on the server. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
>> I think the race condition you mentioned refers to the inconsistency >> between the inHeap flag and the pairing heap caused by a race >> condition between timeout and wakeup (or perhaps other combinations? >> I'm not sure which version of the patch the mentioned race condition >> refers to). However, I imagine it is difficult to reliably reproduce >> this condition. In that regard, in the latest patch, the coherence >> between the inHeap flag and the pairing heap is protected by an LWLock, >> so I believe we no longer need that test. > > No, Alexander means that Heikki pointed at the race condition just before > the LWLock. We can inject an injection point and step in on the backend, > but WaitLSNSetLatches() is called from the recovery process. And I have > trouble waking up the injection point on the server. One more thing I want to point out: when you move the injection point to the contrib dir and do the same test (044_wal_replay_wait_injection_test.pl) step by step by hand, wakeup works well. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Thanks to Kyotaro for the review. And thanks to Ivan for the patch
revision. I made another revision of the patch.

On Tue, Jul 9, 2024 at 12:51 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
> Thank you for your interest in the patch.
>
> On 2024-06-20 11:30, Kyotaro Horiguchi wrote:
> > Hi, I looked through the patch and have some comments.
> >
> > ====== L68:
> > + <title>Recovery Procedures</title>
> >
> > It looks somewhat confusing and appears as if the section is intended
> > to explain how to perform recovery. Since this is the first built-in
> > procedure, I'm not sure how this section should be written. However,
> > the section immediately above is named "Recovery Control Functions",
> > so "Recovery Synchronization Functions" would align better with the
> > naming of the upper section. (I don't believe we need to be so strict
> > about the distinction between functions and procedures here.)
> >
> > It looks strange that the procedure signature includes the return type.
>
> Good point, change
> Recovery Procedures -> Recovery Synchronization Procedures

Thank you, looks good to me.

> > ====== L93:
> > + If <parameter>timeout</parameter> is not specified or zero, this
> > + procedure returns once WAL is replayed upto
> > + <literal>target_lsn</literal>.
> > + If the <parameter>timeout</parameter> is specified (in
> > + milliseconds) and greater than zero, the procedure waits until the
> > + server actually replays the WAL upto <literal>target_lsn</literal> or
> > + until the given time has passed. On timeout, an error is emitted.
> >
> > The first sentence should mention the main functionality. Following
> > precedents, it might be better to use something like "Waits until
> > recovery surpasses the specified LSN. If no timeout is specified or it
> > is set to zero, this procedure waits indefinitely for the LSN. If the
> > timeout is specified (in milliseconds) and is greater than zero, the
> > procedure waits until the LSN is reached or the specified time has
> > elapsed. On timeout, or if the server is promoted before the LSN is
> > reached, an error is emitted."
> >
> > The detailed explanation that follows the above seems somewhat too
> > verbose to me, as other functions don't have such detailed examples.
>
> Please offer your description. I think it would be better.

Kyotaro actually provided a paragraph in his message. I've integrated
it into the patch.

> > ====== L484
> > /*
> > + * Set latches for processes, whose waited LSNs are already replayed. This
> > + * involves spinlocks. So, we shouldn't do this under a spinlock.
> > + */
> >
> > Here, I'm not quite sure what spinlock (or mutex?) this is
> > specifically referring to. However, more importantly, shouldn't we
> > explain that it is okay not to use any locks at all, rather than
> > mentioning that spinlocks should not be used here? I found a similar
> > case around freelist.c:238, which is written as follows.
> >
> >> * Not acquiring ProcArrayLock here which is slightly icky. It's
> >> * actually fine because procLatch isn't ever freed, so we just can
> >> * potentially set the wrong process' (or no process') latch.
> >> */
> >> SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
>
> ???

I've revised the comment.

> > ===== L518
> > +void
> > +WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
> >
> > This function is only called within the same module. I'm not sure if
> > we need to expose it. If we do, the name should probably be more
> > specific. I'm not quite sure if the division of functionality between
> > this function and its only caller function is appropriate. As a
> > possible refactoring, we could have WaitForLSN() just return the
> > result as [reached, timedout, promoted] and delegate prerequisite
> > checks and error reporting to the SQL function.

I think WaitForLSNReplay() is a better name. And it's not clear what
API we would need, so I would prefer to keep it static for now.

> waitLSN -> waitLSNStates
> No, waitLSNStates is not the best name, because waitLSNState is a state,
> and waitLSN is not an array of waitLSNStates. We can think about
> another name. What do you think?
>
> > ===== L524
> > + /* Shouldn't be called when shmem isn't initialized */
> > + Assert(waitLSN);
> >
> > Seeing this assertion, I feel that the name "waitLSN" is a bit
> > obscure. How about renaming it to "waitLSNStates"?

I agree that waitLSN is too generic. I've renamed it to waitLSNState.

> > ===== L527
> > + /* Should be only called by a backend */
> > + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
> >
> > This is somewhat excessive, causing a server crash when ending with an
> > error would suffice. By the way, if I execute "CALL
> > pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes.
> > The condition doesn't seem appropriate.
>
> Could you give more information about the server crashes, so that I
> can reproduce them?

I've rechecked this. We don't need to assert that the process is a
backend, given that MaxBackends includes background workers too.
Replaced this with an assert that MyProcNumber is valid.

> > ===== L565
> > + if (timeout > 0)
> > + {
> > + delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> > + latch_events |= WL_TIMEOUT;
> > + if (delay_ms <= 0)
> > + break;
> >
> > "timeout" is immutable in the function. Therefore, we can calculate
> > "latch_events" before entering the loop. By the way, the name
> > 'latch_events' seems a bit off. Latch is a kind of event the function
> > can wait for. Therefore, something like wait_events might be more
> > appropriate.
>
> It can't be "wait_events", because in the latch declaration these
> events are responsible for waking up, not for waiting:
> int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32
> wait_event_info)

I agree with the change "latch_events" => "wake_events" by Ivan. I also
made a change to calculate wake_events in advance, as proposed by
Kyotaro.

> > ==== L578
> > + if (rc & WL_LATCH_SET)
> > + ResetLatch(MyLatch);
> >
> > I think we usually reset latches unconditionally after returning from
> > WaitLatch(), even when waiting for timeouts.
>
> No, it depends on your logic: when you wait on several wake events and
> want to know which event set your latch, you reset it conditionally.
> See applyparallelworker.c:813.

+1, it's not necessary to reset the latch if it hasn't been set. There
are a lot of places where we do that conditionally.

> > ===== L798
> > + * A process number, same as the index of this item in
> > waitLSN->procInfos.
> > + * Stored for convenience.
> > + */
> > + int procnum;
> >
> > It is described as "(just) for convenience". However, it is referenced
> > by Startup to fetch the PGPROC entry for the waiter, which is necessary
> > for Startup. That aside, why don't we hold (the pointer to) procLatch
> > instead of procnum? It makes things simpler and I believe it is our
> > standard practice.
>
> Could you explain in more depth what you mean and give an example?

I wrote the comment that we store procnum for convenience, meaning that
alternatively we could calculate it as the offset in the
waitLSN->procInfos array. But I like your idea to keep the latch
instead. This should give more flexibility for future advancements.

> > ===== L809
> > + /* A flag indicating that this item is added to waitLSN->waitersHeap
> > */
> > + bool inHeap;
> >
> > The name "inHeap" seems too literal and might be hard to understand in
> > most occurrences. How about using "waiting" instead?
>
> No, "inHeap" literally means "in the heap"; see the comment. Please
> suggest a more suitable name.

+1, we use this flag for consistency during operations with the heap.
I don't see a reason to rename it.

> > ===== L940
> > +# Check that new data is visible after calling pg_wal_replay_wait()
> >
> > On the other hand, the comment for the check for this test states that
> >
> >> +# Make sure the current LSN on standby and is the same as primary's
> >> LSN +ok($output eq 30, "standby reached the same LSN as primary");
> >
> > I think the first comment and the second should be consistent.
>
> Thanks, I'll rephrase this comment.

I've revised this comment further.

> >> Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.
> >>
> >> 1. It's not clear why this test needs node_standby2 at all. It seems
> >> useless.
> >
> > I agree with you. What we would need is a second *waiter client*
> > connecting to the same standby rather than a second standby. I feel
> > like having a test where the first waiter is removed while multiple
> > waiters are waiting, as well as a test where promotion occurs under
> > the same circumstances.
>
> Could you describe these cases step by step, and explain what
> "removed" and "promotion" mean here?
>
> >> 2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
> >> location seems to be unachievable in this test. So, it's not clear
> >> what race condition this test could potentially detect.
> >> 3. I think it would make sense to check for the race condition
> >> reported by Heikki. That is to insert the injection point at the
> >> beginning of WaitLSNSetLatches().
> >
> > I think the race condition you mentioned refers to the inconsistency
> > between the inHeap flag and the pairing heap caused by a race
> > condition between timeout and wakeup (or perhaps other combinations?
> > I'm not sure which version of the patch the mentioned race condition
> > refers to). However, I imagine it is difficult to reliably reproduce
> > this condition. In that regard, in the latest patch, the coherence
> > between the inHeap flag and the pairing heap is protected by LWLock,
> > so I believe we no longer need that test.
>
> No, Alexander means the race condition Heikki pointed out just before
> the LWLock. We can inject an injection point and step into it in a
> backend, but WaitLSNSetLatches() is called from the recovery process,
> and I have trouble waking up the injection point on the server.

My initial hope was to add a test elegant enough to check for
concurrency issues between the startup process and an LSN-waiting
process. That test should check the issue reported by Heikki and
hopefully more. I've tried to revise
044_wal_replay_wait_injection_test.pl, but it became cumbersome, and in
the end it was unclear what it was intended to test. This is why I
finally decided to remove it. As for tests with multiple waiters,
043_wal_replay_wait.pl already has some.

------
Regards,
Alexander Korotkov
Supabase
Attachment
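[To illustrate the direction just described, holding the latch pointer
rather than a process number, here is a hedged sketch. The type, field,
and function names are illustrative assumptions, not the committed
ones; the heap is assumed to be a min-heap ordered by waitLSN via an
appropriate comparator.]

#include "postgres.h"
#include "access/xlogdefs.h"
#include "lib/pairingheap.h"
#include "storage/latch.h"

/*
 * Illustrative per-waiter shared state.  Holding the latch pointer
 * directly lets the startup process wake a waiter without translating
 * a process number back into a PGPROC entry.
 */
typedef struct WaitLSNProcInfo
{
	XLogRecPtr	waitLSN;		/* LSN this process is waiting for */
	Latch	   *latch;			/* latch to set once waitLSN is replayed */
	pairingheap_node phNode;	/* node in the heap ordered by waitLSN */
	bool		inHeap;			/* is this entry currently in the heap? */
} WaitLSNProcInfo;

/*
 * Startup-process side (sketch): wake every waiter whose target LSN
 * has been replayed.  Assumes the caller holds the lock protecting
 * the heap and the inHeap flags.
 */
static void
WakeupWaiters(pairingheap *waitersHeap, XLogRecPtr currentLSN)
{
	while (!pairingheap_is_empty(waitersHeap))
	{
		pairingheap_node *node = pairingheap_first(waitersHeap);
		WaitLSNProcInfo *cur = pairingheap_container(WaitLSNProcInfo,
													 phNode, node);

		if (cur->waitLSN > currentLSN)
			break;				/* remaining waiters want a later LSN */

		(void) pairingheap_remove_first(waitersHeap);
		cur->inHeap = false;
		SetLatch(cur->latch);
	}
}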
On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thanks to Kyotaro for the review. And thanks to Ivan for the patch
> revision. I made another revision of the patch.

I've noticed failures on cfbot. The attached revision addresses the
docs build failure. It also adds some "quits" for background psql
sessions in the tests. This will probably address the test hangs on
Windows.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Mon, Jul 15, 2024 at 2:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > Thanks to Kyotaro for the review. And thanks to Ivan for the patch
> > revision. I made another revision of the patch.
>
> I've noticed failures on cfbot. The attached revision addresses the
> docs build failure. It also adds some "quits" for background psql
> sessions in the tests. This will probably address the test hangs on
> Windows.

I made the following changes to the patch.

1) I've changed the check for a snapshot in pg_wal_replay_wait(). Now
it checks that GetOldestSnapshot() returns NULL. That happens when both
ActiveSnapshot is NULL and the RegisteredSnapshots pairing heap is
empty. This is the same condition under which SnapshotResetXmin() sets
our xmin to invalid, so we are not preventing WAL from being replayed.
The condition is satisfied when pg_wal_replay_wait() isn't called
within a transaction with an isolation level higher than READ
COMMITTED, another procedure, or a function. Documented it this way.

2) Explicitly documented in the PortalRunUtility() comment that
pg_wal_replay_wait() is another case where the active snapshot gets
released.

3) I've removed the tests with cascading replication. It's rather
unclear what problem those tests could potentially spot.

4) Made some improvements to the docs, comments, and commit message to
make them consistent with the patch contents.

The earlier commit to pg17 was inconsiderate. But I feel this patch is
in much better shape now. In particular, it's now clear when the
pg_wal_replay_wait() procedure can be used. So, I dare commit this to
pg18 if nobody objects.

------
Regards,
Alexander Korotkov
Supabase
Attachment
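[As a rough illustration of change 1 above, the prerequisite check
could look something like the following. This is a minimal sketch: the
wrapper function name and the exact error wording are assumptions, not
a verbatim excerpt from the patch.]

#include "postgres.h"
#include "utils/snapmgr.h"

/*
 * Sketch of the prerequisite check: if any snapshot is still active or
 * registered, our xmin would hold back WAL replay, so waiting for an
 * LSN could block indefinitely.
 */
static void
CheckNoSnapshotHeld(void)
{
	if (GetOldestSnapshot() != NULL)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("pg_wal_replay_wait() must be only called without an active or registered snapshot")));
}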
I noticed the commit and had a question and a comment.
There is a small problem in func.sgml in the sentence "After that the changes made of primary". Should be "on primary".
In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.
Thanks,
Kevin.
On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> I noticed the commit and had a question and a comment.
> There is a small problem in func.sgml in the sentence "After that the
> changes made of primary". Should be "on primary".

Thank you for spotting this. Will fix.

> In the for loop in WaitForLSNReplay, shouldn't the check for
> in-recovery be moved up above the call to GetXLogReplayRecPtr?
> If we get promoted while waiting for the timeout we could call
> GetXLogReplayRecPtr while not in recovery.

This is intentional. After the standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position replayed while it
was still a standby. So, if the standby reached the target LSN before
being promoted, we don't have to throw an error.

But this gave me an idea: before the loop, we probably need to put the
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

------
Regards,
Alexander Korotkov
Supabase
On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > In the for loop in WaitForLSNReplay, shouldn't the check for
> > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > If we get promoted while waiting for the timeout we could call
> > GetXLogReplayRecPtr while not in recovery.
>
> This is intentional. After the standby gets promoted,
> GetXLogReplayRecPtr() returns the last WAL position replayed while it
> was still a standby. So, if the standby reached the target LSN before
> being promoted, we don't have to throw an error.
>
> But this gave me an idea: before the loop, we probably need to put the
> RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
> recheck that.

The attached patchset comprises assorted improvements for
pg_wal_replay_wait().

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
because more WAL could be replayed between those calls. Instead, we
need to recheck GetXLogReplayRecPtr() after getting a negative result
from RecoveryInProgress(), because the WAL replay position cannot be
updated after that.

The 0002 patch fixes the header comment of the WaitLSNSetLatches()
function.

The 0003 patch adds tests for pg_wal_replay_wait() errors.

------
Regards,
Alexander Korotkov
Supabase
Attachment
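[A hedged sketch of the 0001 ordering fix described above. The wrapper
function name, variable names, and error wording are illustrative
assumptions; only the ordering argument is taken from the discussion.]

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlogrecovery.h"

static void
CheckPromotionRace(XLogRecPtr targetLSN)
{
	if (!RecoveryInProgress())
	{
		/*
		 * Recovery has ended, but GetXLogReplayRecPtr() still returns
		 * the last position replayed before promotion, and that value
		 * can no longer advance; rechecking it here closes the race.
		 */
		if (targetLSN <= GetXLogReplayRecPtr(NULL))
			return;				/* target reached before promotion */

		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("recovery is not in progress"),
				 errdetail("Recovery ended before replaying target LSN %X/%X.",
						   LSN_FORMAT_ARGS(targetLSN))));
	}
}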
On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> The 0001 patch is intended to improve this situation. Actually, it's
> not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> because more WAL could be replayed between those calls. Instead, we
> need to recheck GetXLogReplayRecPtr() after getting a negative result
> from RecoveryInProgress(), because the WAL replay position cannot be
> updated after that.
>
> The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> function.
>
> The 0003 patch adds tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

--
Michael
Attachment
On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> > because more WAL could be replayed between those calls. Instead, we
> > need to recheck GetXLogReplayRecPtr() after getting a negative result
> > from RecoveryInProgress(), because the WAL replay position cannot be
> > updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Before adding more tests, could it be possible to stabilize what's in
> the tree? drongo has reported one failure with the recovery test
> 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

Thank you for pointing this out!
Surely, I'll fix this first.

------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 11:18 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > > The 0001 patch is intended to improve this situation. Actually, it's
> > > not right to just put RecoveryInProgress() after
> > > GetXLogReplayRecPtr(), because more WAL could be replayed between
> > > those calls. Instead, we need to recheck GetXLogReplayRecPtr() after
> > > getting a negative result from RecoveryInProgress(), because the WAL
> > > replay position cannot be updated after that.
> > > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > > function.
> > > The 0003 patch adds tests for pg_wal_replay_wait() errors.
> >
> > Before adding more tests, could it be possible to stabilize what's in
> > the tree? drongo has reported one failure with the recovery test
> > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54
>
> Thank you for pointing!
> Surely, I'll fix this before.
Something breaks in these lines during the second iteration of the loop: "SELECT pg_current_wal_insert_lsn()" has been queried on the primary, but the standby didn't receive "CALL pg_wal_replay_wait('...');".
for (my $i = 0; $i < 5; $i++)
{
	print($i);
	$node_primary->safe_psql('postgres',
		"INSERT INTO wait_test VALUES (${i});");
	my $lsn =
	  $node_primary->safe_psql('postgres',
		"SELECT pg_current_wal_insert_lsn()");
	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
	$psql_sessions[$i]->query_until(
		qr/start/, qq[
		\\echo start
		CALL pg_wal_replay_wait('${lsn}');
		SELECT log_count(${i});
	]);
}
I wonder what it could be. Probably something hangs while launching the background psql... I'll investigate this more.
------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> > because more WAL could be replayed between those calls. Instead, we
> > need to recheck GetXLogReplayRecPtr() after getting a negative result
> > from RecoveryInProgress(), because the WAL replay position cannot be
> > updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Before adding more tests, could it be possible to stabilize what's in
> the tree? drongo has reported one failure with the recovery test
> 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54

I'm currently running the 043_wal_replay_wait test in a loop on drongo.
No failures during more than 10 hours. As I pointed out in [1], it
seems that the test got stuck somewhere while launching BackgroundPsql.
Given that drongo has some strange failures from time to time (for
instance [2] or [3]), I doubt there is something specifically wrong in
the 043_wal_replay_wait test that caused the subject failure.

Therefore, while I'm going to continue looking into the reason for the
drongo failure in the background, I'm going to go ahead with my
improvements for pg_wal_replay_wait().

Links.
1. https://www.postgresql.org/message-id/CAPpHfduYkve0sw-qy4aCCmJv_MXfuuAQ7wyRQsX8NjaLVKDE1Q%40mail.gmail.com
2. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-02%2010%3A34%3A45
3. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-06-06%2012%3A36%3A11

------
Regards,
Alexander Korotkov
Supabase
On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > > In the for loop in WaitForLSNReplay, shouldn't the check for
> > > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > > If we get promoted while waiting for the timeout we could call
> > > GetXLogReplayRecPtr while not in recovery.
> >
> > This is intentional. After the standby gets promoted,
> > GetXLogReplayRecPtr() returns the last WAL position replayed while it
> > was still a standby. So, if the standby reached the target LSN before
> > being promoted, we don't have to throw an error.
> >
> > But this gave me an idea: before the loop, we probably need to put
> > the RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
> > recheck that.
>
> The attached patchset comprises assorted improvements for
> pg_wal_replay_wait().
>
> The 0001 patch is intended to improve this situation. Actually, it's
> not right to just put RecoveryInProgress() after GetXLogReplayRecPtr(),
> because more WAL could be replayed between those calls. Instead, we
> need to recheck GetXLogReplayRecPtr() after getting a negative result
> from RecoveryInProgress(), because the WAL replay position cannot be
> updated after that.
> The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> function.
> The 0003 patch adds tests for pg_wal_replay_wait() errors.

Here is a revised version of the patchset. I've fixed some typos,
indentation, etc. I'm going to push this once it passes cfbot.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:
> > > > In the for loop in WaitForLSNReplay, shouldn't the check for
> > > > in-recovery be moved up above the call to GetXLogReplayRecPtr?
> > > > If we get promoted while waiting for the timeout we could call
> > > > GetXLogReplayRecPtr while not in recovery.
> > >
> > > This is intentional. After the standby gets promoted,
> > > GetXLogReplayRecPtr() returns the last WAL position replayed while
> > > it was still a standby. So, if the standby reached the target LSN
> > > before being promoted, we don't have to throw an error.
> > >
> > > But this gave me an idea: before the loop, we probably need to put
> > > the RecoveryInProgress() check after GetXLogReplayRecPtr() too.
> > > I'll recheck that.
> >
> > The attached patchset comprises assorted improvements for
> > pg_wal_replay_wait().
> >
> > The 0001 patch is intended to improve this situation. Actually, it's
> > not right to just put RecoveryInProgress() after
> > GetXLogReplayRecPtr(), because more WAL could be replayed between
> > those calls. Instead, we need to recheck GetXLogReplayRecPtr() after
> > getting a negative result from RecoveryInProgress(), because the WAL
> > replay position cannot be updated after that.
> > The 0002 patch fixes the header comment of the WaitLSNSetLatches()
> > function.
> > The 0003 patch adds tests for pg_wal_replay_wait() errors.
>
> Here is a revised version of the patchset. I've fixed some typos,
> indentation, etc. I'm going to push this once it passes cfbot.

The next revision of the patchset fixes an uninitialized variable usage
spotted by cfbot.

------
Regards,
Alexander Korotkov
Supabase
Attachment
On Sat, Aug 10, 2024 at 6:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Before adding more tests, could it be possible to stabilize what's in
> > the tree? drongo has reported one failure with the recovery test
> > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54
>
> I'm currently running the 043_wal_replay_wait test in a loop on drongo.
> No failures during more than 10 hours. As I pointed out in [1], it
> seems that the test got stuck somewhere while launching BackgroundPsql.
> Given that drongo has some strange failures from time to time (for
> instance [2] or [3]), I doubt there is something specifically wrong in
> the 043_wal_replay_wait test that caused the subject failure.

With the help of Andrew Dunstan, I've run 043_wal_replay_wait.pl in a
loop for two days, then the whole test suite also for two days. I
haven't seen any failures. I don't see the point in running more
experiments, because Andrew needs to bring drongo back online as a
buildfarm member. It might be that something exceptional happened on
drongo (like an inability to launch a new process). For now, I think
the reasonable strategy is to wait and see if something similar repeats
on the buildfarm.

------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,

10.08.2024 20:18, Alexander Korotkov wrote:
> On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> ...
>> Here is a revised version of the patchset. I've fixed some typos,
>> indentation, etc. I'm going to push this once it passes cfbot.
> The next revision of the patchset fixes an uninitialized variable
> usage spotted by cfbot.

When running check-world on a rather slow armv7 device, I came across
the 043_wal_replay_wait.pl test failure:

t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.

regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l .../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
waiting for server to promote.... done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL pg_wal_replay_wait('0/300D0E8');'
at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG: connection authenticated: user="android" method=trust (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG: connection authorized: user=android database=postgres application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG: received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL: terminating walreceiver process due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG: invalid record length at 0/300D0B0: expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG: redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 14.23 s
2024-09-12 21:58:56.537 UTC [15123:10] LOG: last completed transaction was at log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG: selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG: archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN 2/570CB4E8; last replay LSN 0/300D0B0.
2024-09-12 21:58:56.591 UTC [15121:1] LOG: checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android database=postgres host=[local]
2024-09-12 21:58:56.595 UTC [15120:4] LOG: database system is ready to accept connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG: connection authenticated: user="android" method=trust (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG: connection authorized: user=android database=postgres application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android database=postgres host=[local]

Note that the last replay LSN is 300D0B0, but the latter
pg_wal_replay_wait call wants to wait for 300D0E8.

pg_waldump -p src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
shows:
rmgr: Heap        len (rec/tot): 59/ 59, tx: 748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35, flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT 2024-09-12 21:58:55.322831 UTC
rmgr: Standby     len (rec/tot): 50/ 50, tx:   0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I could reproduce this failure on my workstation with bgwriter modified
as below:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -69 +69 @@ int BgWriterDelay = 200;
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 15
@@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
- BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+ 1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

When looking at the test, I noticed a probable typo in the test
message:
"wait for already replayed LSN exists immediately"
Shouldn't it be "exits" there (though maybe the whole phrase could be
improved)?

I also suspect that "achieve" is not a suitable word in the context of
LSNs and timeouts. Maybe you would find it appropriate to replace it
with "reach"?

Best regards,
Alexander
Hi Alexander,

16.09.2024 21:55, Alexander Korotkov wrote:
> Please find two patches attached. The first one does minor cleanup,
> including the misuse of words you've pointed out. The second one adds
> a missing wait_for_catchup(). That should fix the test failure you've
> spotted. Please check if it fixes the issue for you.

Thank you for looking at that! Unfortunately, the issue is still here:
the test failed for me in 6 out of 10 runs, as below:

[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL pg_wal_replay_wait('0/300F248');'
at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN 2/570CD648; last replay LSN 0/300F210.
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT: CALL pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG: checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG: database system is ready to accept connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT 2024-09-17 05:14:01.654874 UTC
rmgr: Standby     len (rec/tot): 50/ 50, tx:   0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I wonder, can you reproduce this with that bgwriter modification?

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml: It may also happen that target <acronym>lsn</acronym> is not achieved
src/backend/access/transam/xlog.c- * recovery was ended before achieving the target LSN.

Best regards,
Alexander
17.09.2024 10:47, Alexander Korotkov wrote:
> Yes, now I did reproduce it. I got that the problem could be that the
> insert LSN is not yet written at the primary, thus wait_for_catchup
> doesn't wait for it. I've worked around that using pg_switch_wal().
> The revised patchset is attached.

Thank you for the revised patch! The improved test works reliably for
me (100 out of 100 runs passed).

Best regards,
Alexander