Re: prevent immature WAL streaming - Mailing list pgsql-hackers

From Andres Freund
Subject Re: prevent immature WAL streaming
Msg-id 20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
In response to prevent immature WAL streaming  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: prevent immature WAL streaming  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
>
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.
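
For reference, a minimal sketch of the rule quoted above, as it reads from the description rather than from the patch itself. The function name sketch_streamable_upto and the convention of passing InvalidXLogRecPtr for "no registration exists" are assumptions made for illustration; XLogRecPtr is redefined locally so the fragment stands alone.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles outside the PostgreSQL tree. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * HYPOTHETICAL helper: given the flush pointer and the start LSN of a
 * registered segment-spanning record (InvalidXLogRecPtr if none is
 * registered), return how far WAL may be streamed.
 */
XLogRecPtr
sketch_streamable_upto(XLogRecPtr flush, XLogRecPtr registered_start)
{
    XLogRecPtr result = flush;

    /* A registration before the flush point pulls the limit back. */
    if (registered_start != InvalidXLogRecPtr && registered_start < flush)
        result = registered_start;

    /* The reported limit never exceeds what has been flushed. */
    assert(result <= flush);
    return result;
}
```

With no registration the flush pointer is returned unchanged; with one, the limit retreats to the start of the registered record.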

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case,
  rather than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider
  what happens if we need to promote with the end of local WAL spanning
  a segment boundary, and what happens to cascading replicas). We have
  some logic to try to deal with that, but it's pretty grotty and I
  think incomplete.

- It seems to make some future optimizations harder - we should work
  towards replicating data sooner, rather than the opposite. Right now
  that's a major bottleneck around syncrep.

- Once XLogFlush() for some LSN has returned, we can write that LSN to
  disk. The LSN doesn't necessarily have to correspond to a specific
  on-disk location (it could e.g. be the return value from
  GetFlushRecPtr()). But "rewinding" to before the last record makes that
  problematic.

- I suspect that schemes with heuristic knowledge of records spanning
  segment boundaries have deadlock or at least latency spike issues. What
  if synchronous commit needs to flush up to a certain record boundary,
  but streaming rep doesn't replicate it out because there are
  segment-spanning records both before and after?



I think a better approach might be to handle this on the WAL layout
level. What if we never overwrote partial records but instead just
skipped over them during decoding?

Of course there are some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.

But we could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page. If that flag is set,
xlp_rem_len could contain the checksum of the partial record.
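
A rough sketch of how a reader could act on such a flag, under stated assumptions. The struct is a cut-down stand-in for XLogPageHeaderData, not the real definition. SKETCH_XLP_PREV_PARTIAL, its bit value, the SketchAction enum and sketch_next_page_action() are all hypothetical names invented here. Only XLP_FIRST_IS_CONTRECORD is an existing flag. How the checksum of the partial record would be computed is left open; the sketch just compares two values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for XLogPageHeaderData (only the fields used here). */
typedef struct SketchPageHeader
{
    uint16_t xlp_info;    /* flag bits */
    uint32_t xlp_rem_len; /* normally: remaining bytes of a continued record */
} SketchPageHeader;

#define XLP_FIRST_IS_CONTRECORD 0x0001 /* existing flag */
#define SKETCH_XLP_PREV_PARTIAL 0x0100 /* HYPOTHETICAL flag, arbitrary bit */

typedef enum
{
    SKETCH_NO_CONTINUATION, /* page starts with a fresh record */
    SKETCH_CONTINUE_RECORD, /* page continues the record being read */
    SKETCH_SKIP_PARTIAL,    /* earlier partial record was abandoned: skip it */
    SKETCH_PARTIAL_MISMATCH /* flag set, but checksum does not match */
} SketchAction;

/*
 * Decide what to do when decoding reaches the page that follows an
 * incomplete record. partial_crc is the checksum the reader computed over
 * the partial record bytes it has seen so far.
 */
SketchAction
sketch_next_page_action(const SketchPageHeader *hdr, uint32_t partial_crc)
{
    assert(hdr != NULL);

    if (hdr->xlp_info & SKETCH_XLP_PREV_PARTIAL)
    {
        /* With the flag set, xlp_rem_len is reinterpreted as a checksum. */
        if (hdr->xlp_rem_len == partial_crc)
            return SKETCH_SKIP_PARTIAL;
        return SKETCH_PARTIAL_MISMATCH;
    }

    if (hdr->xlp_info & XLP_FIRST_IS_CONTRECORD)
        return SKETCH_CONTINUE_RECORD;

    return SKETCH_NO_CONTINUATION;
}
```

The point of the sketch is only that the decision can be made from the following page's header, without the record header's own length and checksum being valid.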

Greetings,

Andres Freund


