Re: heapgetpage() and ->takenDuringRecovery - Mailing list pgsql-hackers

From Andres Freund
Subject Re: heapgetpage() and ->takenDuringRecovery
Date
Msg-id 20140303133338.GD23352@awork2.anarazel.de
Whole thread Raw
In response to Re: heapgetpage() and ->takenDuringRecovery  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: heapgetpage() and ->takenDuringRecovery  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2014-03-03 06:57:00 -0500, Robert Haas wrote:
> On Sun, Mar 2, 2014 at 8:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I don't think this is neccessary >= 9.2. The are two only "interestings" place
> > where PD_ALL_VISIBLE is set:
> > a) lazy_vacuum_page() where a xl_heap_clean is logged *before*
> >    PD_ALL_VISIBLE/the vm is touched and that causes recovery
> >    conflicts. The heap page is locked for cleanup at that point. As the
> >    logging of xl_heap_clean sets the page's LSN there's no way the page
> >    can appear on the standby too early.
> > b) empty pages in lazy_scan_heap(). If they always were empty, there's
> >    no need for conflicts. The only other way I can see to end up there
> >    is a previous heap_page_prune() that repaired fragmentation. But that
> >    logs a WAL record with conflict information.
> 
> I don't think there's any reason to believe that lazy_scan_heap() can
> only hit pages that are empty or have just been defragged.  Suppose
> that there's a tuple on the page which was recently inserted; the
> inserting transaction has committed but there are some backends that
> still have older snapshots.  The page won't be marked all-visible
> because it isn't.  Now, eventually those older snapshots will go away,
> and sometime after that the relation will get vacuumed again, and
> we'll once again look the page.  But this time we notice that it is
> all-visible, and mark it so.

Right now I am missing how this isn't an actual correctness problem
after a crash. Without an LSN interlock we could crash *after* the heap
page has been written out, but *before* the vm WAL record has been
flushed to disk. Combined with synchronous_commit=off there could be
transactions that appeared as safely committed for vacuum (i.e. are
below GetOldestXmin()), but which are actually aborted after the
commit.
Normal hint bits circumvent that by checking XLogNeedsFlush(commitLSN),
but that doesn't work here.

Am I missing something?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Securing "make check" (CVE-2014-0067)
Next
From: Kohei KaiGai
Date:
Subject: Re: Triggers on foreign tables