Re: Lowering the ever-growing heap->pd_lower - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Lowering the ever-growing heap->pd_lower
Date
Msg-id CAEze2Wi7yp2t2r-WpXNhUJnQtm+FE8uy3a5VcEgs7j6s4cNt2w@mail.gmail.com
Whole thread Raw
In response to Re: Lowering the ever-growing heap->pd_lower  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Lowering the ever-growing heap->pd_lower  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Wed, 31 Mar 2021 at 05:35, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > > The case I was concerned about back when is that there are various bits of
> > > code that may visit a page with a predetermined TID in mind to look at.
> > > An index lookup is an obvious example, and another one is chasing an
> > > update chain's t_ctid link.  You might argue that if the tuple was dead
> > > enough to be removed, there should be no such in-flight references to
> > > worry about, but I think such an assumption is unsafe.  There is not, for
> > > example, any interlock that ensures that a process that has obtained a TID
> > > from an index will have completed its heap visit before a VACUUM that
> > > subsequently removed that index entry also removes the heap tuple.
> >
> > I am aware of this problem. I will admit that I did not detected all
> > potentially problematic accesses, so I'll show you my work.
>
> Attached is a trivial rebase of your v3, which I've called v4. I am
> interested in getting this patch into Postgres 14.

Thanks, and that'd be great! PFA v5, which fixes 1 issue later
mentioned, and adds some comments on existing checks that are now in a
critical path.

> > In my search for problematic accesses, I make the following assumptions:
> > * PageRepairFragmentation as found in bufpage is only applicable to
> > heap pages; this function is not applied to other pages in core
> > postgres. So, any problems that occur are with due to access with an
> > OffsetNumber > PageGetMaxOffsetNumber.
> > * Items [on heap pages] are only extracted after using PageGetItemId
> > for that item on the page, whilst holding a lock.
>
> > The 3 problem cases were classified based on the origin of the
> > potentially invalid pointer.
> >
> > Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup
>
> I think that it boils down to this: 100% of the cases where this could
> be a problem all either involve old TIDs, or old line pointer -- in
> principle these could be invalidated in some way, like your backwards
> scan example. But that's it. Bear in mind that we always call
> PageRepairFragmentation() with a super-exclusive lock.

Yeah, that's the gist of what I found out. All accesses using old line
pointers need revalidation, and there were some cases in which this
was not yet done correctly.

> > This is in the replay of transaction logs, in heap_xlog_freeze_page.
> > As I am unaware whether or not pages to which these transaction logs
> > are applied can contain changes from the xlog-generating instance, I
> > flagged this as a potential problem. The application of the xlogs is
> > probably safe (it assumes the existence of a HeapTupleHeader for that
> > ItemId), but my limited knowledge put this on the 'potential problem'
> > list.
> >
> > Please advise on this; I cannot find the right documentation
>
> Are you aware of wal_consistency_checking?

I was vaguely aware that an option with that name exists, but that was
about the extent. Thanks for pointing me in that direction.

> I think that this should be fine. There are differences that are
> possible between a replica and primary, but they're very minor and
> don't seem relevant.

OK, then I'll assume that WAL replay shouldn't cause problems here.

> > Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> > heap_get_root_tuples
> >
> > The heap pruning mechanism currently assumes that all redirects are
> > valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> > out of the loop, but doesn't actually fail. This code is already
> > potentially problematic because it has no bounds or sanity checking
> > for the nextoffnum, but with shrinking pd_linp it would also add the
> > failure mode of HOT tuples pointing into what is now arbitrary data.
>
> heap_prune_chain() is less trusting than heap_get_root_tuples(),
> though -- it doesn't trust redirects (because there is a generic
> offnum sanity check at the start of its loop). I think that the
> inconsistency between these two functions probably isn't hugely
> significant.
>
> Ideally it would be 100% clear which of the defenses in code like this
> is merely extra hardening. The assumptions should be formalized. There
> is nothing wrong with hardening, but we should know it when we see it.

I realized one of my Assert()s was incorrectly asserting an actually
valid page state, so I've updated and documented that case.

> > This failure mode is now accompanied by an Assert, which fails when
> > the redirect is to an invalid OffsetNumber. This is not enough to not
> > exhibit arbitrary behaviour when accessing corrupted data with
> > non-assert builds, but I think that that is fine; we already do not
> > have a guaranteed behaviour for this failure mode.
>
> What about my "set would-be LP_UNUSED space to all-0x7F bytes in
> CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

I had implemented it locally, but was waiting for some more feedback
before posting that and got busy with other stuff since, it's now
attached.

I've also played around with marking the free space on the page as
undefined for valgrind, but later realized that that would make the
test for defined memory in PageAddItemExtended fail. This is
documented in the commit message of the attached patch 0002.


With regards,

Matthias van de Meent

Attachment

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Shared buffers advice for windows in the wiki
Next
From: Etsuro Fujita
Date:
Subject: Re: Asynchronous Append on postgres_fdw nodes.