Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries. - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date
Msg-id 201209170844.43077.andres@2ndquadrant.com
Whole thread Raw
In response to Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Monday, September 17, 2012 07:35:06 AM Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sep 15, 2012, at 11:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Right, but we do a shutdown checkpoint at the end of crash recovery.
(as noted somewhere else and tackled by Simon, a END_OF_RECOVERY didn't sync 
those before)

> > Yes, but that only writes the buffers that are dirty. It doesn't fix the
> > lack of a BM_PERMANENT flag on a buffer that ought to have had one. So
> > that page can now get modified AGAIN, after recovery, and not
> > checkpointed.
> 
> Ugh.  Yeah, we need to fix this ASAP.  I've notified pgsql-packagers to
> expect a release this week.
Btw, I played with this some more on Saturday and I think, while definitely a 
bad bug, the actual consequences aren't as bad as at least I initially feared.

Fake relcache entries are currently set in 3 scenarios during recovery:
1. removal of ALL_VISIBLE in heapam.c
2. incomplete splits and incomplete deletions in nbtxlog.c
3. incomplete splits in ginxlog.c

As Jeff nicely showed its easy to corrupt the visibilitymap with this. 
Fortunately in < 9.2 that doesn't have too bad consequences and will be fixed 
by the next vacuum. In 9.2 that obviously can result in wrong query results but 
will still be fixed by a later (probably full table) vacuum.

To hit 2) and 3) the server needs to have crashed (or a strange 
recovery_target_* set) while doing a multilevel operation. I have only 
cursorily looked at gin but it looks to me like in both, nbtree and gin, the 
window during logging the multiple steps in a split/deletion is fairly short 
and its not likely that we crashed exactly during that. Unless we haven't read 
the complete multistep operation during recovery we won't ever create fake 
relcache entries for those relations/buffers! And even if that edge case is hit 
it seems somewhat likely that the few pages that are read with the fake entry 
are still in the cache (the incomplete operation has to have been soon before) 
and thus won't get the bad relpersistence flag set.

So I think while that bug had the possibility of being really bad we were 
pretty lucky...

Greetings,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Next
From: Amit Kapila
Date:
Subject: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown