Re: Bug on update timing of walrcv->flushedUpto variable - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Bug on update timing of walrcv->flushedUpto variable
Msg-id 20210329.105441.1978082841561262877.horikyota.ntt@gmail.com
In response to Bug on update timing of walrcv->flushedUpto variable  ("蔡梦娟(玊于)" <mengjuan.cmj@alibaba-inc.com>)
Responses Re: Bug on update timing of walrcv->flushedUpto variable
List pgsql-hackers
Hi.

(Added Nathan, Andrey and Heikki in Cc:)

At Fri, 26 Mar 2021 23:44:21 +0800, "蔡梦娟(玊于)" <mengjuan.cmj@alibaba-inc.com> wrote in 
> Hi, all
> 
> Recently, I found a bug in the update timing of the walrcv->flushedUpto
> variable. Consider the following scenario: there is one Primary node and
> one Standby node streaming from the Primary.
>
> A large number of SQL statements are running on the Primary, and an xlog
> record generated by these statements may be longer than the space left in
> the current page, so it has to be written across pages. As shown below,
> last_xlog of wal_1 is longer than the space left in last_page, so it has
> to continue in wal_2. If the Primary crashes after flushing last_page of
> wal_1 to disk but before the remaining content of last_xlog has been
> flushed, then last_xlog in wal_1 will be incomplete. The Standby has also
> received wal_1 by WAL streaming in this case.
 

This seems to be the same issue as the one discussed in [1].

There are two symptoms of the issue. One is that the archive ends with
a segment that ends with an immature (incomplete) WAL record, which
causes an inconsistency between the archive and the pg_wal directory.
The other is, as you saw, that walreceiver receives an immature record
at the end of a segment, which prevents recovery from proceeding.

In that thread, we are trying to solve this by preventing such an
immature record at a segment boundary from being archived or sent to
a standby.

> [日志1.png]

It doesn't seem to be attached.

> The confusing point is: why is walrcv->flushedUpto updated only at the
> first startup of walreceiver on a specific timeline, and not each time
> xlog streaming is requested? In the above case, it is also reasonable to
> update walrcv->flushedUpto to wal_1 when the Standby re-receives wal_1.
> So I changed it to update walrcv->flushedUpto each time xlog streaming is
> requested, which is the patch I want to share with you, based on
> postgresql-13.2. What do you think of this change?
> 
> By the way, I also want to know why the pgstat_reset_all function is called during the recovery process.

We shouldn't move flushedUpto backward. The variable tells how far
recovery (that is, the startup process) can safely read WAL. Once the
startup process has read the beginning of a record, XLogReadRecord
tries to continue fetching *only the rest* of the record, which in
this scenario is inconsistent with the first part. So this fix alone
doesn't work. We also need to fix the archive inconsistency, perhaps
as part of the fix for this issue.
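
To illustrate the point about not moving the flush position backward,
here is a minimal sketch. It is not PostgreSQL source: WalPos,
ReceiverState and advance_flushed_upto are hypothetical names that only
model a reader-visible position that is allowed to grow but never to
shrink.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL position. */
typedef uint64_t WalPos;

/* Hypothetical model of the shared state; not the real WalRcvData. */
typedef struct
{
    WalPos flushed_upto;    /* WAL before this position is safe to read */
} ReceiverState;

/*
 * Advance the flushed position, never moving it backward.
 * Returns 1 if the position advanced, 0 if the request was ignored.
 */
static int
advance_flushed_upto(ReceiverState *st, WalPos new_pos)
{
    assert(st != NULL);

    /* A rewind could invalidate data the reader has already consumed. */
    if (new_pos <= st->flushed_upto)
        return 0;

    st->flushed_upto = new_pos;
    return 1;
}
```

In this model, a request to set the position to an earlier value is
simply ignored, so a reader that has already consumed the first part of
a record is never told that part is no longer valid.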

We are trying to fix this by refraining from archiving (or streaming)
until a record crossing a segment boundary is completely flushed.
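
A rough sketch of that idea, under stated assumptions: this is not the
patch from [1], the names release_threshold and segment_releasable are
hypothetical, and SEG_SIZE assumes the default 16 MiB segment size. A
record is modelled as the byte range [rec_start, rec_end).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL position. */
typedef uint64_t WalPos;

/* Assumed segment size: 16 MiB (the default wal_segment_size). */
#define SEG_SIZE ((WalPos) 16 * 1024 * 1024)

/*
 * Position up to which WAL must be flushed before the segment holding
 * rec_start may be archived or streamed, where [rec_start, rec_end) is
 * the last record that starts in that segment.
 */
static WalPos
release_threshold(WalPos rec_start, WalPos rec_end)
{
    WalPos seg_end = (rec_start / SEG_SIZE + 1) * SEG_SIZE;

    assert(rec_end > rec_start);

    /* If the record spills into the next segment, wait for its end. */
    return rec_end > seg_end ? rec_end : seg_end;
}

/* Returns 1 if the segment may be released, 0 if it must be held back. */
static int
segment_releasable(WalPos rec_start, WalPos rec_end, WalPos flushed)
{
    return flushed >= release_threshold(rec_start, rec_end);
}
```

For a record that starts 100 bytes before a segment boundary and ends
50 bytes after it, the segment is held back while only the first
segment is flushed, and released once the flush position reaches the
end of the record.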

regards.


[1] https://www.postgresql.org/message-id/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
