RE: Newly created replication slot may be invalidated by checkpoint - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Newly created replication slot may be invalidated by checkpoint
Date
Msg-id OSCPR01MB14966C191ECEC11D051148D6DF5E0A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: Newly created replication slot may be invalidated by checkpoint  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List pgsql-hackers
Dear Vitaly,

> > Per my understanding, this happened because there is a lag that restart_lsn of
> > the slot is set, and it is protected by the system. Your idea is to ensure the
> > restart_lsn is protected by the system before obtaining on-memory LSN, right?
> 
> Not sure what you mean by on-memory LSN, but, the issue happens because we
> have
> a lag between restart_lsn assignment and update of
> XLogCtl->replicationSlotsMinLSN which is used to protect the WAL.

Sorry I should say "before obtaining replicationSlotMinLSN".

> Yes, I
> propose
> to ensure that the protection happens when we assign restart_lsn. It seems to be
> wrong that we invalidate slots by its restart_lsn but protect the wal for
> slots using XLogCtl->replicationSlotsMinLSN.

Seems valid. There is another corner case that another restart_lsn can be set in-between,
but they have larger LSN than RedoRecPtr, right?

> Below I tried to write some summary and propose the patch which fixes the
> problem.

Sorry but it is too long to understand properly for me :-(.

> There is one subtle thing. Once, the operation of restart_lsn assignment is not
> an atomic, the following scenario may happen theoretically:
> 1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
> 2. Assign a new redo LSN in the checkpointer
> 3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
> 3. Assign the old redo LSN to restart_lsn
> 
> In this scenario, the restart_lsn will point to a previous redo LSN and it will
> be not protected by the new redo LSN. This scenario is unlikely, but it can
> happen theoretically. I have no ideas how to deal with it, except of assigning
> restart_lsn under XLogCtl->info_lck lock to avoid concurrent modification of
> XLogCtl->RecoRecPtr until it is assigned to restart_lsn of a creating slot.

Oh, your point is there is another race condition, right? Do you have the reproducer for it?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE
Next
From: Viktor Holmberg
Date:
Subject: ON CONFLICT DO SELECT (take 3)