Thread: Lack of PageSetLSN in heap_xlog_visible

Lack of PageSetLSN in heap_xlog_visible

From

Konstantin Knizhnik

Date:

13 October 2022, 09:50:37

Hi hackers!

heap_xlog_visible is not bumping heap page LSN when setting all-visible 
flag in it.
There is long comment explaining it:

         /*
          * We don't bump the LSN of the heap page when setting the 
visibility
          * map bit (unless checksums or wal_hint_bits is enabled, in which
          * case we must), because that would generate an unworkable 
volume of
          * full-page writes.  This exposes us to torn page hazards, but 
since
          * we're not inspecting the existing page contents in any way, we
          * don't care.
          *
          * However, all operations that clear the visibility map bit 
*do* bump
          * the LSN, and those operations will only be replayed if the 
XLOG LSN
          * follows the page LSN.  Thus, if the page LSN has advanced 
past our
          * XLOG record's LSN, we mustn't mark the page all-visible, because
          * the subsequent update won't be replayed to clear the flag.
          */

But it still not clear for me that not bumping LSN in this place is 
correct if wal_log_hints is set.
In this case we will have VM page with larger LSN than heap page, 
because visibilitymap_set
bumps LSN of VM page. It means that in theory after recovery we may have 
page marked as all-visible in VM,
but not having PD_ALL_VISIBLE  in page header. And it violates VM 
constraint:

  * When we *set* a visibility map during VACUUM, we must write WAL. 
This may
  * seem counterintuitive, since the bit is basically a hint: if it is 
clear,
  * it may still be the case that every tuple on the page is visible to all
  * transactions; we just don't know that for certain.  The difficulty 
is that
  * there are two bits which are typically set together: the 
PD_ALL_VISIBLE bit
  * on the page itself, and the visibility map bit.  If a crash occurs 
after the
  * visibility map page makes it to disk and before the updated heap 
page makes
  * it to disk, redo must set the bit on the heap page.  Otherwise, the next
  * insert, update, or delete on the heap page will fail to realize that the
  * visibility map bit must be cleared, possibly causing index-only scans to
  * return wrong answers.

Re: Lack of PageSetLSN in heap_xlog_visible

From

Jeff Davis

Date:

13 October 2022, 19:49:35

On Thu, 2022-10-13 at 12:50 +0300, Konstantin Knizhnik wrote:
>          /*
>           * We don't bump the LSN of the heap page when setting the
> visibility
>           * map bit (unless checksums or wal_hint_bits is enabled, in
> which
>           * case we must), because that would generate an unworkable
> volume of
>           * full-page writes.

It clearly says there that it must set the page LSN, but I don't see
where that's happening. It seems to go all the way back to the original
checksums commit, 96ef3b8ff1.

I can reproduce a case where a replica ends up with a different page
header than the primary (checksums enabled):

  Primary:
    create extension pageinspect;
    create table t(i int) with (autovacuum_enabled=off);
    insert into t values(0);

  Shut down and restart primary and replica.

  Primary:
    insert into t values(1);
    vacuum t;

  Crash replica and let it recover.

  Shut down and restart primary and replica.

  Primary:
    select * from page_header(get_raw_page('t', 0));

  Replica:
    select * from page_header(get_raw_page('t', 0));

The LSN on the replica is lower, but the flags are the same
(PD_ALL_VISIBLE set). That's a problem, right? The checksums are valid
on both, though.

It may violate our torn page protections for checksums, as well, but I
couldn't construct a scenario for that because recovery can only create
restartpoints at certain times.

> But it still not clear for me that not bumping LSN in this place is
> correct if wal_log_hints is set.
> In this case we will have VM page with larger LSN than heap page,
> because visibilitymap_set
> bumps LSN of VM page. It means that in theory after recovery we may
> have
> page marked as all-visible in VM,
> but not having PD_ALL_VISIBLE  in page header. And it violates VM
> constraint:

I'm not quite following this scenario. If the heap page has a lower LSN
than the VM page, how could we recover to a point where the VM bit is
set but the heap flag isn't? And what does it have to do with
wal_log_hints/checksums?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Lack of PageSetLSN in heap_xlog_visible

From

Jeff Davis

Date:

11 November 2022, 01:20:35

On Thu, 2022-10-13 at 12:49 -0700, Jeff Davis wrote:

> It may violate our torn page protections for checksums, as well...

I could not reproduce a problem here, but I believe one exists when
checksums are enabled, because it bypasses the protections of
UpdateMinRecoveryPoint(). By not updating the page's LSN, it would
allow the page to be flushed (and torn) without updating the minimum
recovery point. That would allow the administrator to subsequently do a
PITR or standby promotion while there's a torn page on disk.

I'm considering this a theoretical risk because for a page tear to be
consequential in this case, it would need to happen between the
checksum and the flags; and I doubt that's a real possibility. But
until we formalize that declaration, then this problem should be fixed.

Patch attached. I also fixed:

  * The comment in heap_xlog_visible() says that not updating the page
LSN avoids full page writes; but the causation is backwards: avoiding
full page images requires that we don't update the page's LSN.
  * Also in heap_xlog_visible(), there are comments and a branch
leftover from before commit f8f4227976 which don't seem to be necessary
any more.
  * In visibilitymap_set(), I clarified that we *must* not set the page
LSN of the heap page if no full page image was taken.

> > But it still not clear for me that not bumping LSN in this place is
> > correct if wal_log_hints is set.
> > In this case we will have VM page with larger LSN than heap page,
> > because visibilitymap_set
> > bumps LSN of VM page. It means that in theory after recovery we may
> > have
> > page marked as all-visible in VM,
> > but not having PD_ALL_VISIBLE  in page header. And it violates VM
> > constraint:
>
> I'm not quite following this scenario. If the heap page has a lower
> LSN
> than the VM page, how could we recover to a point where the VM bit is
> set but the heap flag isn't?

I still don't understand this problem scenario.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment

v1-0001-Fix-theoretical-torn-page-hazard.patch

Re: Lack of PageSetLSN in heap_xlog_visible

From

Konstantin Knizhnik

Date:

11 November 2022, 10:43:13

On 11.11.2022 03:20, Jeff Davis wrote:
> On Thu, 2022-10-13 at 12:49 -0700, Jeff Davis wrote:
>
>> It may violate our torn page protections for checksums, as well...
> I could not reproduce a problem here, but I believe one exists when
> checksums are enabled, because it bypasses the protections of
> UpdateMinRecoveryPoint(). By not updating the page's LSN, it would
> allow the page to be flushed (and torn) without updating the minimum
> recovery point. That would allow the administrator to subsequently do a
> PITR or standby promotion while there's a torn page on disk.
>
> I'm considering this a theoretical risk because for a page tear to be
> consequential in this case, it would need to happen between the
> checksum and the flags; and I doubt that's a real possibility. But
> until we formalize that declaration, then this problem should be fixed.
>
> Patch attached. I also fixed:
>
>    * The comment in heap_xlog_visible() says that not updating the page
> LSN avoids full page writes; but the causation is backwards: avoiding
> full page images requires that we don't update the page's LSN.
>    * Also in heap_xlog_visible(), there are comments and a branch
> leftover from before commit f8f4227976 which don't seem to be necessary
> any more.
>    * In visibilitymap_set(), I clarified that we *must* not set the page
> LSN of the heap page if no full page image was taken.
>
>
>>> But it still not clear for me that not bumping LSN in this place is
>>> correct if wal_log_hints is set.
>>> In this case we will have VM page with larger LSN than heap page,
>>> because visibilitymap_set
>>> bumps LSN of VM page. It means that in theory after recovery we may
>>> have
>>> page marked as all-visible in VM,
>>> but not having PD_ALL_VISIBLE  in page header. And it violates VM
>>> constraint:
>> I'm not quite following this scenario. If the heap page has a lower
>> LSN
>> than the VM page, how could we recover to a point where the VM bit is
>> set but the heap flag isn't?
> I still don't understand this problem scenario.
>
>

I am sorry that I have reported the issue and then didn't reply.
Yes, you are right: my original concerns that it may cause problems with 
recovery at replica are not correct.
Not updating page LSN may just force it's reconstruction.
I also not sure that it can cause problems with checksums - page is 
marked as dirty in any case.
Yes, content and checksum of the page will be different at master and 
replica. It may be a problem for recovery pages from replica.

When it really be critical - is incremental backup (like pg_probackup) 
whch looks at page LSN to determine moment of page modification.

And definitely it is critical for Neon, because LSN of page 
reconstructed by pageserver is different from one expected by compute node.

May be I missing something, but I do not see any penalty from setting 
page LSN here.
So even if there is not obvious scenario of failure, I still thing that 
it should be fixed. Breaking invariants is always bad thing
and there are should be very strong arguments for doing it. And I do not 
see them here.

So it will be great if your patch can be committed.

Re: Lack of PageSetLSN in heap_xlog_visible

From

Jeff Davis

Date:

12 November 2022, 18:26:19

On Fri, 2022-11-11 at 12:43 +0200, Konstantin Knizhnik wrote:
> Yes, you are right: my original concerns that it may cause problems
> with
> recovery at replica are not correct.

Great, thank you for following up.

> I also not sure that it can cause problems with checksums - page is
> marked as dirty in any case.
> Yes, content and checksum of the page will be different at master and
> replica. It may be a problem for recovery pages from replica.

It could cause a theoretical problem: during recovery, the page could
be flushed out before minRecoveryPoint is updated, and while that's
happening, a crash could tear it. Then, a subsequent partial recovery
(e.g. PITR) could recover without fixing the torn page.

That would not be a practical problem, because it would require a tear
between two fields of the page header, which I don't think is possible.

> When it really be critical - is incremental backup (like
> pg_probackup)
> whch looks at page LSN to determine moment of page modification.
>
> And definitely it is critical for Neon, because LSN of page
> reconstructed by pageserver is different from one expected by compute
> node.
>
> May be I missing something, but I do not see any penalty from setting
> page LSN here.
> So even if there is not obvious scenario of failure, I still thing
> that
> it should be fixed. Breaking invariants is always bad thing
> and there are should be very strong arguments for doing it. And I do
> not
> see them here.

Agreed, thank you for the report!

Committed d6a3dbe14f and backpatched through version 11.

Also committed an unrelated cleanup patch in the same area (3eb8eeccbe)
and a README patch (97c61f70d1) to master. The README patch doesn't
guarantee that things won't change in the future, but the behavior it
describes has been true for quite a while now.

--
Jeff Davis
PostgreSQL Contributor Team - AWS