Re: Exceptional md.c paths for recovery and zero_damaged_pages - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Exceptional md.c paths for recovery and zero_damaged_pages
Date
Msg-id 55bjljxk6ijolkeir5l2zb2kd7muktgbxpeq4y5fmkwyjqh2gt@cpcue5vkaro6
Whole thread Raw
In response to Re: Exceptional md.c paths for recovery and zero_damaged_pages  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Exceptional md.c paths for recovery and zero_damaged_pages
List pgsql-hackers
Hi,

On 2024-12-17 17:07:34 -0500, Peter Geoghegan wrote:
> On Tue, Dec 17, 2024 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> > ISTM that we could do better with some fairly simple cooperation between index
> > and table AM. It should be rather rare to look up a TID that was removed
> > between finding the index entry and fetching the table entry, a small amount
> > of extra work in that case ought to be ok.
> 
> Maybe, but I just want to be clear: as far as I know, as things stand
> we're very permissive about what can happen around concurrent TID
> recycling. We need to be because bitmap index scans can build a bitmap
> based on the index as it was some time ago (among other reasons),
> which cannot prevent concurrent TID recycling for TIDs that point to
> dead-to-everybody tuples (or point to LP_DEAD heap page stubs).

Bitmap heapscans go through a distinct path, so it'd be trivial to accept tids
pointing into the void for bitmap scans but not index [only] scans.


> > Could we e.g., for logged tables, track the LSN of the leaf index page in the
> > IndexScanDesc and pass that to table_index_fetch_tuple() and only error out in
> > if the table relation has a newer LSN?  That leaves a small window for a
> > false-negative, but shouldn't have any false-positives?
> 
> Technically that would work, but it might not be very useful.
> 
> It's very typical for a heap page LSN to be older than a corresponding
> index leaf page LSN, since inserts start with the heap tuple insertion
> (the index tuple insertion must happen afterwards). Plus index pages
> almost always store more tuples than their corresponding heap pages
> and so are presumably more likely to be modified.

I think the case where we'd need to *not* error out would be if the heap page
LSN is *newer* than the index leaf page LSN, as that could be due to
lazy_vacuum_heap() marking items unused between getting the tid from the index
scan and the heap access.

For it to be legal to remove items from the heap page between the scan of the
leaf page and the heap fetch, the btree page would need to have been processed
by btvacuumpage() (with a cleanup lock) and the heap page by
lazy_vacuum_heap_page().

If the heap page has an LSN that's older than the leaf page's LSN when we had
it pinned/locked, it could not have been processed by lazy_vacuum_heap_page()
since.  But if the heap page's LSN is newer, it might have. Obviously there
might be other reasons for the LSN to increase, leading to false negatives,
but it's a quite tight window.


> > I've seen enough bugs / corruption leading to indexes pointing to wrong and
> > nonexisting tuples to make me think it's worth being a bit more proactive
> > about raising errors for such cases.  Of course what I described above can
> > only detect index entries pointing to non-existing tuples, but that's still
> > a good bit better than nothing.
> 
> OTOH we've had index_delete_check_htid for several years now, and I've
> yet to hear a report involving one of its errors. That's not
> conclusive, but it does suggest that this might not be a huge problem.

Not sure that tells us that much, there's lots of workloads where you'd never
end up in heap_index_delete_tuples().

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Exceptional md.c paths for recovery and zero_damaged_pages
Next
From: Peter Geoghegan
Date:
Subject: Re: Exceptional md.c paths for recovery and zero_damaged_pages