Thread: Strange decreasing value of pg_last_wal_receive_lsn()

Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
PG12
Steps to reproduce on a 3-node cluster with quorum commit:
1. Cut off the network between the master and everything else.
2. pkill -9 PostgreSQL on each node.
3. Start PostgreSQL on each node.

What was strange?
I checked pg_last_wal_replay_lsn() and pg_last_wal_receive_lsn() every second.

They were always the same.
For example, the last output before killing PostgreSQL:
2020-05-08 08:31:44,495 DEBUG: Last replayed LSN: 0/502E3D8
2020-05-08 08:31:44,497 DEBUG: Last received LSN: 0/502E3D8

After starting PostgreSQL I got this:
2020-05-08 08:31:55,324 DEBUG: Last replayed LSN: 0/502E3D8
2020-05-08 08:31:55,326 DEBUG: Last received LSN: 0/5000000
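
For reference, the per-second check that produced the numbers above boils down
to something like this (a minimal sketch of an assumed monitoring query; the
actual script is not shown in this thread):

    -- run on each standby
    SELECT pg_last_wal_receive_lsn() AS last_received,
           pg_last_wal_replay_lsn()  AS last_replayed;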

Why could this happen, and is it expected behaviour?

Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Sergei Kornilov
Date:
Hello

Yes, this is expected. The walreceiver always starts streaming from the beginning of the WAL segment.
See ./src/backend/replication/walreceiverfuncs.c, in RequestXLogStreaming:

     * We always start at the beginning of the segment. That prevents a broken
     * segment (i.e., with no records in the first half of a segment) from
     * being created by XLOG streaming, which might cause trouble later on if
     * the segment is e.g archived.
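
To tie this to the numbers above: 0/502E3D8 lies inside the WAL segment that
starts at 0/5000000, which is exactly where streaming restarted. A quick way
to see which segment an LSN belongs to (a sketch; note that pg_walfile_name()
cannot be executed while a server is in recovery, so check it on a node that
is not a standby):

    -- with the default 16MB segment size and timeline 1 this returns
    -- 000000010000000000000005, i.e. the segment starting at 0/5000000
    SELECT pg_walfile_name('0/502E3D8'::pg_lsn);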

regards, Sergei



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
I got it, thank you.
Can you recommend what to use to determine which quorum standby should be promoted in such a case?
We planned to use pg_last_wal_receive_lsn() to determine which standby has the freshest data, but if it returns the beginning of the segment on both replicas we can’t determine which standby confirmed that it wrote the transaction to disk.

On 8 May 2020, at 14:50, Sergei Kornilov <sk@zsrv.org> wrote:

Hello

Yes, this is expected. Walreceiver always start streaming from beginning of the wal segment.
./src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming:

* We always start at the beginning of the segment. That prevents a broken
* segment (i.e., with no records in the first half of a segment) from
* being created by XLOG streaming, which might cause trouble later on if
* the segment is e.g archived.

regards, Sergei

Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Michael Paquier
Date:
On Fri, May 08, 2020 at 03:02:26PM +0500, godjan • wrote:
> Can you recommend what to use to determine which quorum standby
> should be promoted in such case?
> We planned to use pg_last_wal_receive_lsn() to determine which has
> fresh data but if it returns the beginning of the segment on both
> replicas we can’t determine which standby confirmed that write
> transaction to disk.

If you want to preserve transaction-level consistency across those
nodes, what is your configuration for synchronous_standby_names and
synchronous_commit on the primary?  Can't you rely on that?
--
Michael


Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
synchronous_standby_names=ANY 1(host1, host2)
synchronous_commit=on

So, to understand which standby wrote the last data to disk, I should know its receive LSN or write LSN.
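
For the record, while the primary is still reachable these per-standby
positions are visible from its side in pg_stat_replication. A minimal sketch,
which of course does not help once the primary is gone:

    -- run on the primary: write/flush/replay position of each standby
    SELECT application_name, write_lsn, flush_lsn, replay_lsn
      FROM pg_stat_replication;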


> On 9 May 2020, at 13:48, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, May 08, 2020 at 03:02:26PM +0500, godjan • wrote:
>> Can you recommend what to use to determine which quorum standby
>> should be promoted in such case?
>> We planned to use pg_last_wal_receive_lsn() to determine which has
>> fresh data but if it returns the beginning of the segment on both
>> replicas we can’t determine which standby confirmed that write
>> transaction to disk.
>
> If you want to preserve transaction-level consistency across those
> notes, what is your configuration for synchronous_standby_names and
> synchronous_commit on the primary?  Cannot you rely on that?
> --
> Michael



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Michael Paquier
Date:
On Sun, May 10, 2020 at 06:58:50PM +0500, godjan • wrote:
> synchronous_standby_names=ANY 1(host1, host2)
> synchronous_commit=on

Thanks for the details.  I was not sure based on your previous
messages.

> So to understand which standby wrote last data to disk I should know
> receive_lsn or write_lsn.

If you just have access to the standbys, using
pg_last_wal_replay_lsn() should be enough, no?  One tricky point is to
make sure that each standby has no more WAL left to replay, though
you can do that by looking at the wait event called
RecoveryRetrieveRetryInterval for the startup process.
Note that when a standby starts and has primary_conninfo set, it would
request streaming to start again at the beginning of the segment as
mentioned, but it does not change the point up to which the startup
process replays the WAL available locally, as that's what takes
priority as WAL source (second choice is a WAL archive and third is
streaming if all options are set in the recovery configuration).
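
A sketch of what such a check could look like on each standby, combining the
replay position with the startup process's wait event from pg_stat_activity
(column names as of PostgreSQL 12; which wait event value to look for is the
one mentioned above):

    -- last replayed LSN plus what the startup process is waiting on
    SELECT pg_last_wal_replay_lsn() AS last_replayed,
           wait_event_type,
           wait_event
      FROM pg_stat_activity
     WHERE backend_type = 'startup';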

There are several HA solutions floating around in the community, and I
got to wonder as well if some of them don't just scan the local
pg_wal/ of each standby in this case, even if it is simpler to just
let the nodes start and replay up to their latest available point.
--
Michael


Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Jehan-Guillaume de Rorthais
Date:
(too bad the quoted history has been removed, it would have kept some context)

On Fri, 8 May 2020 15:02:26 +0500
godjan • <g0dj4n@gmail.com> wrote:

> I got it, thank you.
> Can you recommend what to use to determine which quorum standby should be
> promoted in such case? We planned to use pg_last_wal_receive_lsn() to
> determine which has fresh data but if it returns the beginning of the segment
> on both replicas we can’t determine which standby confirmed that write
> transaction to disk.

Wait, pg_last_wal_receive_lsn() only decreased because you killed your standby.

pg_last_wal_receive_lsn() returns the value of walrcv->flushedUpto. The latter
is set to the beginning of the requested segment only during the first
walreceiver startup or on a timeline fork:

    /*
     * If this is the first startup of walreceiver (on this timeline),
     * initialize flushedUpto and latestChunkStart to the starting point.
     */
    if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
    {
        walrcv->flushedUpto = recptr;
        walrcv->receivedTLI = tli;
        walrcv->latestChunkStart = recptr;
    }
    walrcv->receiveStart = recptr;
    walrcv->receiveStartTLI = tli;

After a primary loss, as long as the standbys are up and running, it is fine
to use pg_last_wal_receive_lsn().
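
For what it's worth, once each surviving standby reports its value, comparing
two LSNs is simple (a sketch using the numbers from the first message;
pg_wal_lsn_diff() returns the difference in bytes and works during recovery):

    -- positive result: the first LSN is ahead of the second
    SELECT pg_wal_lsn_diff('0/502E3D8', '0/5000000') AS bytes_ahead;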

Why do you kill -9 your standby? What am I missing? Could you explain the
use case you are working on to justify this?

Regards,



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Jehan-Guillaume de Rorthais
Date:
On Mon, 11 May 2020 15:54:02 +0900
Michael Paquier <michael@paquier.xyz> wrote:
[...]
> There are several HA solutions floating around in the community, and I
> got to wonder as well if some of them don't just scan the local
> pg_wal/ of each standby in this case, even if that's more simple to
> let the nodes start and replay up to their latest point available.

PAF relies on pg_last_wal_receive_lsn(). Relying on pg_last_wal_replay_lsn()
might be possible. As you explained, it would require comparing the current
replay LSN with the last one received on disk, though. This could probably be
done, e.g. with pg_waldump and a waiting loop.

However, such a waiting loop might be dangerous. If standbys are lagging far
behind and/or have read-only sessions and/or load slowing down the replay, the
waiting loop might be very long, maybe longer than the required RTO. The
automatic HA operator might even take curative action because of some recovery
timeout, making things worse.

Regards,



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
-> Why do you kill -9 your standby?
Hi, it’s a Jepsen test for our HA solution. It checks that we don’t lose data in such a situation.

So, we have now updated the logic as Michael said. All alive HA standbys now wait until
they have replayed all the WAL they have, and after that we use pg_last_wal_replay_lsn()
to choose which standby will be promoted in a failover.

It fixed our trouble, but there is another one. Now we have to wait until all alive HA
hosts finish replaying WAL before failing over. It might take a while (for example if
the WAL contains a wal_record about splitting a b-tree).

We are looking for options that will allow us to find a standby that contains all the
data and replay all WAL only on that standby before failover.

Maybe you have ideas on how to keep the last actual value of pg_last_wal_receive_lsn()?
As I understand it, the WAL receiver doesn’t write walrcv->flushedUpto to disk.

> On 13 May 2020, at 19:52, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
>
>
> (too bad the history has been removed to keep context)
>
> On Fri, 8 May 2020 15:02:26 +0500
> godjan • <g0dj4n@gmail.com> wrote:
>
>> I got it, thank you.
>> Can you recommend what to use to determine which quorum standby should be
>> promoted in such case? We planned to use pg_last_wal_receive_lsn() to
>> determine which has fresh data but if it returns the beginning of the segment
>> on both replicas we can’t determine which standby confirmed that write
>> transaction to disk.
>
> Wait, pg_last_wal_receive_lsn() only decrease because you killed your standby.
>
> pg_last_wal_receive_lsn() returns the value of walrcv->flushedUpto. The later
> is set to the beginning of the segment requested only during the first
> walreceiver startup or a timeline fork:
>
>     /*
>      * If this is the first startup of walreceiver (on this timeline),
>      * initialize flushedUpto and latestChunkStart to the starting point.
>      */
>     if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
>     {
>         walrcv->flushedUpto = recptr;
>         walrcv->receivedTLI = tli;
>         walrcv->latestChunkStart = recptr;
>     }
>     walrcv->receiveStart = recptr;
>     walrcv->receiveStartTLI = tli;
>
> After a primary loss, as far as the standby are up and running, it is fine
> to use pg_last_wal_receive_lsn().
>
> Why do you kill -9 your standby? Whay am I missing? Could you explain the
> usecase you are working on to justify this?
>
> Regards,




Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Jehan-Guillaume de Rorthais
Date:
(please, the list policy is bottom-posting to keep history clean, thanks).

On Thu, 14 May 2020 07:18:33 +0500
godjan • <g0dj4n@gmail.com> wrote:

> -> Why do you kill -9 your standby?
> Hi, it’s Jepsen test for our HA solution. It checks that we don’t lose data
> in such situation.

OK. This test is highly useful to stress data high availability and durability,
of course. However, how useful is this test in the context of auto failover for
**service** high availability?  If all your nodes are killed in the same
disaster, how/why should an automatic cluster manager take care of starting all
nodes again and picking the right node to promote?

> So, now we update logic as Michael said. All ha alive standbys now waiting
> for replaying all WAL that they have and after we use pg_last_replay_lsn() to
> choose which standby will be promoted in failover.
>
> It fixed out trouble, but there is one another. Now we should wait when all
> ha alive hosts finish replaying WAL to failover. It might take a while(for
> example WAL contains wal_record about splitting b-tree).

Indeed, this is the concern I wrote about yesterday in a second mail on this
thread.

> We are looking for options that will allow us to find a standby that contains
> all data and replay all WAL only for this standby before failover.

Note that when you promote a node, it first replays available WALs before
acting as a primary. So you can safely signal the promotion to the node and
wait for it to finish the replay and promote.
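
A minimal sketch of that on PostgreSQL 12, where promotion can be triggered
from SQL and the call can wait for it to finish (pg_promote() exists since
version 12; the 60-second timeout is only an example value):

    -- run on the chosen standby: it replays whatever WAL is still
    -- available locally, then promotes; returns true on success
    SELECT pg_promote(true, 60);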

> Maybe you have ideas on how to keep the last actual value of
> pg_last_wal_receive_lsn()?

Nope, no clean and elegant idea. Once your instances are killed, maybe you can
force flush the system cache (to secure in-memory-only data) and read the latest
received WAL using pg_waldump?

But, what if some more data are available from archives, but not received from
streaming rep because of a high lag?

> As I understand WAL receiver doesn’t write to disk walrcv->flushedUpto.

I'm not sure I understand what you mean here.
pg_last_wal_receive_lsn() reports the actual value of walrcv->flushedUpto.
walrcv->flushedUpto reports the latest LSN force-flushed to disk.


> > On 13 May 2020, at 19:52, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> > wrote:
> >
> >
> > (too bad the history has been removed to keep context)
> >
> > On Fri, 8 May 2020 15:02:26 +0500
> > godjan • <g0dj4n@gmail.com> wrote:
> >
> >> I got it, thank you.
> >> Can you recommend what to use to determine which quorum standby should be
> >> promoted in such case? We planned to use pg_last_wal_receive_lsn() to
> >> determine which has fresh data but if it returns the beginning of the
> >> segment on both replicas we can’t determine which standby confirmed that
> >> write transaction to disk.
> >
> > Wait, pg_last_wal_receive_lsn() only decrease because you killed your
> > standby.
> >
> > pg_last_wal_receive_lsn() returns the value of walrcv->flushedUpto. The
> > later is set to the beginning of the segment requested only during the first
> > walreceiver startup or a timeline fork:
> >
> >     /*
> >      * If this is the first startup of walreceiver (on this timeline),
> >      * initialize flushedUpto and latestChunkStart to the starting
> > point. */
> >     if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
> >     {
> >         walrcv->flushedUpto = recptr;
> >         walrcv->receivedTLI = tli;
> >         walrcv->latestChunkStart = recptr;
> >     }
> >     walrcv->receiveStart = recptr;
> >     walrcv->receiveStartTLI = tli;
> >
> > After a primary loss, as far as the standby are up and running, it is fine
> > to use pg_last_wal_receive_lsn().
> >
> > Why do you kill -9 your standby? Whay am I missing? Could you explain the
> > usecase you are working on to justify this?
> >
> > Regards,
>
>
>



--
Jehan-Guillaume de Rorthais
Dalibo



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
Hi, sorry for 2 weeks latency in answer :)

>> It fixed out trouble, but there is one another. Now we should wait when all
>> ha alive hosts finish replaying WAL to failover. It might take a while(for
>> example WAL contains wal_record about splitting b-tree).
>
> Indeed, this is the concern I wrote about yesterday in a second mail on this
> thread.

Actually, I found out that we were using the wrong heuristic to determine whether a standby is still replaying WAL.
We compared the values of pg_last_wal_replay_lsn() before and after sleeping.
If a standby is replaying a huge wal_record (e.g. splitting a b-tree), that gives us the wrong result.


> Note that when you promote a node, it first replays available WALs before
> acting as a primary.

Do you know how Postgres understands that a standby is still replaying available WAL?
I didn’t get it from the promotion code.


> However, how useful is this test in a context of auto failover for
> **service** high availability?

Such a situation has a really low probability in our service, and we thought it would be okay to resolve it with on-call participation.

> Nope, no clean and elegant idea. One your instances are killed, maybe you can
> force flush the system cache (secure in-memory-only data)?

Does "force flush the system cache” mean invoking this command https://linux.die.net/man/8/sync on the standby?

> and read the latest received WAL using pg_waldump?

I did an experiment with pg_waldump without sync:
- write data on primary 
- kill primary
- read the latest received WAL using pg_waldump:
0/1D019F38
- pg_last_wal_replay_lsn():
0/1D019F68

So it’s wrong to use pg_waldump to find out what the latest received LSN was, at least without “forcing a flush of the system cache”.

> If all your nodes are killed in the same
> disaster, how/why an automatic cluster manager should take care of starting all
> nodes again and pick the right node to promote?

1. How?
The automatic cluster manager will restart the standbys in such a situation.
If the primary’s lock in ZK is released, the automatic cluster manager starts the election of a new primary.
To understand which node should be promoted, the automatic cluster manager should get the LSN of the last wal_record written to disk by each potential new primary.
We used pg_last_wal_receive_lsn() for that, but it was a mistake, because after "kill -9” on a standby pg_last_wal_receive_lsn() reports the first LSN of the last segment.

2. Why?
- a sleepy on-call engineer at night can do something bad in such a situation
- pg_waldump didn’t give the last LSN written to disk (at least without forcing a flush of the system cache), so I don’t know how an on-call engineer could figure out which standby should be promoted
- the automatic cluster manager has successfully resolved such situations in clusters with one designated synchronous standby for years, and we hope it’s possible to do the same in clusters with quorum replication



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
Jehan-Guillaume de Rorthais
Date:
On Mon, 1 Jun 2020 12:44:26 +0500
godjan • <g0dj4n@gmail.com> wrote:

> Hi, sorry for 2 weeks latency in answer :)
>
> >> It fixed out trouble, but there is one another. Now we should wait when all
> >> ha alive hosts finish replaying WAL to failover. It might take a while(for
> >> example WAL contains wal_record about splitting b-tree).
> >
> > Indeed, this is the concern I wrote about yesterday in a second mail on this
> > thread.
>
> Actually, I found out that we use the wrong heuristic to understand that
> standby still replaying WAL. We compare values of pg_last_wal_replay_lsn()
> after and before sleeping. If standby replaying huge wal_record(e.g.
> splitting b-tree) it gave us the wrong result.

It could, yes.

> > Note that when you promote a node, it first replays available WALs before
> > acting as a primary.
>
> Do you know how Postgres understand that standby still replays available WAL?
> I didn’t get it from the code of promotion.

See chapter "26.2.2. Standby Server Operation" in official doc:

«
  Standby mode is exited and the server switches to normal operation when
  pg_ctl promote is run or a trigger file is found (promote_trigger_file).
  Before failover, any WAL immediately available in the archive or in pg_wal
  will be restored, but no attempt is made to connect to the master.
»

In the source code, dig around the following chain if interested: StartupXLOG ->
ReadRecord -> XLogReadRecord -> XLogPageRead -> WaitForWALToBecomeAvailable.

[...]

> > Nope, no clean and elegant idea. One your instances are killed, maybe you
> > can force flush the system cache (secure in-memory-only data)?
>
> Do "force flush the system cache” means invoke this command
> https://linux.die.net/man/8/sync <https://linux.die.net/man/8/sync> on the
> standby?

Yes, just for safety.

> > and read the latest received WAL using pg_waldump?
>
> I did an experiment with pg_waldump without sync:
> - write data on primary
> - kill primary
> - read the latest received WAL using pg_waldump:
> 0/1D019F38
> - pg_last_wal_replay_lsn():
> 0/1D019F68

Normal. pg_waldump gives you the starting LSN of the record.
pg_last_wal_replay_lsn() returns lastReplayedEndRecPtr, which is the end of the
record:

    /*
     * lastReplayedEndRecPtr points to end+1 of the last record successfully
     * replayed.

So I suppose your last xlogrecord spans 0x30 (48) bytes there, including
alignment padding. If I remember correctly, the minimal xlogrecord length is
24 bytes, so I bet there's only one xlogrecord there, starting at 0/1D019F38
with its last byte at 0/1D019F67.

> So it’s wrong to use pg_waldump to understand what was latest received LSN.
> At least without “forcing flush system cache”.

Nope, just sum the xlogrecord length.
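
To illustrate with the figures above (a sketch; pg_wal_lsn_diff() simply
subtracts two LSNs and returns the difference in bytes):

    -- end-of-record LSN minus the start LSN reported by pg_waldump
    SELECT pg_wal_lsn_diff('0/1D019F68', '0/1D019F38');  -- 48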



Re: Strange decreasing value of pg_last_wal_receive_lsn()

From
godjan •
Date:
I got it, it should be LSN + MAXALIGN(xlogrecord length) 👍
Thanks a lot.

On 2 Jun 2020, at 19:11, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Nope, just sum the xlogrecord length.