Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From Vitaly Davydov
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id 2c1d-68344100-b7-3411b8c0@256938178
Whole thread Raw
In response to Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
List pgsql-hackers
Dear Alexander, Amit, All

> Amit wrote:
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes.  Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>
> How can that happen, if we first write the updated value to disk and
> then update the shared memory as we do in
> LogicalConfirmReceivedLocation?

I guess, that the problem with logical slots still exist. Please, see the tap
test: src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch.
A race condition may happen when logical slot's restart_lsn was changed but not
yet written to the disk. Imagine, there is another physical slot which is
advanced at this moment. It recomputes oldest min LSN and takes into account
changed but not saved to disk restart_lsn of the logical slot. We come to the
situation when the WAL segment for the logical slot's restart_lsn may be
truncated after immediate restart.

I'm not sure what may happen with two checkpoints which execute in parallel, but
I would say that the patch 0001 guarantees that every checkpoint run will trim
the WAL segments based on the already saved on disk restart LSNs of slots. The
rule to trim the WAL by saved slot's restart_lsn will not be violated.

> Amit wrote:
> As per my understanding, neither the xmin nor the LSN problem exists
> for logical slots. I am pointing this out to indicate we may need to
> think of a different solution for physical slots, if these are
> problems only for physical slots.

As I've already told, it indirectly affects the logical slots as well.

> Alexander wrote:
> I spend more time on this.  The next revision is attached.  It
> contains revised comments and other cosmetic changes.  I'm going to
> backpatch 0001 to all supported branches, and 0002 to 17 where
> injection points were introduced.

Alexander, thank you for polishing the patch. Just my opinion, I would prefer
to put tests before the fix due to reason that you can reproduce the problem
when simply checkout the commit with tests. Once, the tests are after the fix
you are not able to do this way. Anyway, I'm ok with your changes. Thank you!

I did some changes in the patch (v7 is attached):
* Removed modules/test_replslot_required_lsn directory. It is not needed anymore,
once you've moved test files to another directory.
* Renamed tests to 046_checkpoint_logical_slot.pl, 047_checkpoint_physical_slot.pl.
I believe, such names are more descriptive.

Please, consider these changes.

With best regards,
Vitaly

Attachment

pgsql-hackers by date:

Previous
From: jian he
Date:
Subject: Re: MERGE issues around inheritance
Next
From: Tomas Vondra
Date:
Subject: Re: Amcheck verification of GiST and GIN