Re: crash-safe visibility map, take three - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: crash-safe visibility map, take three
Date
Msg-id 4CF51BB3.10207@enterprisedb.com
Whole thread Raw
In response to Re: crash-safe visibility map, take three  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: crash-safe visibility map, take three
List pgsql-hackers
On 30.11.2010 17:32, Robert Haas wrote:
> On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Some care is needed with checkpoints. Setting visibility map bits in step 2
>> is safe because crash recovery will replay the intent XLOG record and clear
>> any incorrectly set bits. But if a checkpoint has happened after the intent
>> XLOG record was written, that's not true. This can be avoided by checking
>> RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
>> since the last intent XLOG record was written.
>
> It seems like you'll need to hold some kind of lock between the time
> you examine RedoRecPtr and the time you actually examine the bit.
> WALInsertLock in shared mode, maybe?

It's enough to hold an exclusive lock on the visibility map page. You 
have to set the bit first, and then check RedoRecPtr, and if it changed, 
write the XLOG record before releasing the lock. If RedoRecPtr changes 
any time before we check RedoRecPtr, we'll write the XLOG record so 
we're safe. If it changes after that, we're safe because the checkpoint 
will flush the updated heap page and visibility map page.

>> There's a small race condition in the way a visibility map bit is currently
>> cleared. When a heap page is updated, it is locked, the update is
>> WAL-logged, and the lock is released. The visibility map page is updated
>> only after that. If the final vacuum XLOG record is written just after
>> updating the heap page, but before the visibility map bit is cleared,
>> replaying the final XLOG record will set a bit that should not have been
>> set.
>
> Well, if that final XLOG record isn't necessary for correctness
> anyway, the obvious thing to do seems to be - don't write it.  Crashes
> are not so common that loss of even a full hour's visibility map bits
> in the event that we have one seems worth killing ourselves over.  And
> not everybody sets checkpoint_timeout to an hour, and not all
> checkpoints are triggered by checkpoint_timeout, and not all crashes
> happen just before it expires.  Seems like we might be better off
> writing that much less WAL.

Yeah, possibly. It also means that the set bits will not propagate to 
standby servers, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: DELETE with LIMIT (or my first hack)
Next
From: Robert Haas
Date:
Subject: Re: crash-safe visibility map, take three