Thread: Crash-safe visibility map vs hint bits
There has been a lot of discussion about index-only scans and how to make the visibility map crash-safe, followed by a good discussion about hint bits.

The main concern seems to be the added WAL volume, and it makes me wonder if there is an in-between approach that looks more like hint bits.

How about lazily WAL-logging the complete visibility map, say every X minutes or every N tuple updates, and making it the WAL recovery job to recheck the visibility of pages touched by the WAL stream during recovery? This seems a lot like the checkpoint mechanism, but I can't see whether it can just follow the same pattern directly. This may also just demonstrate my total lack of understanding of PG's internals.

Jesper
On 04.12.2010 09:14, Jesper@Krogh.cc wrote:
> There has been a lot of discussion about index-only scans and how to make the visibility map crash-safe, followed by a good discussion about hint bits.
>
> The main concern seems to be the added WAL volume, and it makes me wonder if there is an in-between approach that looks more like hint bits.
>
> How about lazily WAL-logging the complete visibility map, say every X minutes or every N tuple updates, and making it the WAL recovery job to recheck the visibility of pages touched by the WAL stream during recovery?

If you WAL-log the visibility map changes after-the-fact, it doesn't solve the race condition we're struggling with: the visibility map change might hit the disk before the PD_ALL_VISIBLE change to the heap page. If you crash, you can end up with a situation where the PD_ALL_VISIBLE flag on the heap page is not set, but the bit in the visibility map is. Which causes serious issues later on.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On 4 Dec 2010, at 08:48, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 04.12.2010 09:14, Jesper@Krogh.cc wrote:
>> There has been a lot of discussion about index-only scans and how to make the visibility map crash-safe, followed by a good discussion about hint bits.
>>
>> The main concern seems to be the added WAL volume, and it makes me wonder if there is an in-between approach that looks more like hint bits.
>>
>> How about lazily WAL-logging the complete visibility map, say every X minutes or every N tuple updates, and making it the WAL recovery job to recheck the visibility of pages touched by the WAL stream during recovery?
>
> If you WAL-log the visibility map changes after-the-fact, it doesn't solve the race condition we're struggling with: the visibility map change might hit the disk before the PD_ALL_VISIBLE change to the heap page. If you crash, you can end up with a situation where the PD_ALL_VISIBLE flag on the heap page is not set, but the bit in the visibility map is. Which causes serious issues later on.

My imagination is probably not as good, but suppose that at time A you WAL-log the complete map, and at A+1 you update a tuple so the visibility bit is cleared, but the map bit change does not hit disk due to a crash. Then at WAL replay time you restore the map from time A, and if the tuple change at A+1 is represented in the WAL stream, you also update the visibility map. Is this the situation where the heap tuple hit disk but the map is left in a broken state? Or is it a different, similar-looking situation?

The tuple change in the WAL stream will require the system to reinspect the page anyway, so there shouldn't be any additional disk I/O on replay due to this.

Jesper
On 04.12.2010 10:22, Jesper@Krogh.cc wrote:
> On 4 Dec 2010, at 08:48, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> If you WAL-log the visibility map changes after-the-fact, it doesn't solve the race condition we're struggling with: the visibility map change might hit the disk before the PD_ALL_VISIBLE change to the heap page. If you crash, you can end up with a situation where the PD_ALL_VISIBLE flag on the heap page is not set, but the bit in the visibility map is. Which causes serious issues later on.
>
> My imagination is probably not as good, but suppose that at time A you WAL-log the complete map, and at A+1 you update a tuple so the visibility bit is cleared, but the map bit change does not hit disk due to a crash. Then at WAL replay time you restore the map from time A, and if the tuple change at A+1 is represented in the WAL stream, you also update the visibility map. Is this the situation where the heap tuple hit disk but the map is left in a broken state? Or is it a different, similar-looking situation?

The problem is when a bit is *set* in the visibility map. Clearing a bit is not a problem; we already handle that reliably. If you set the flag on the heap page and set the bit on the visibility map page, and you don't emit a WAL record for either of those operations, the VM page might be flushed to disk before the heap page.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
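To make the ordering hazard concrete, here is a toy simulation (my own sketch, not PostgreSQL code) of the race Heikki describes: VACUUM sets both bits in memory with no WAL record, the background writer happens to flush the VM page first, and a crash loses the heap page update.

```python
# In-memory (buffer cache) state after VACUUM sets both flags,
# with no WAL record emitted for either change.
heap_mem = {"pd_all_visible": True}
vm_mem = {"all_visible_bit": True}

# On-disk state before any page is flushed.
heap_disk = {"pd_all_visible": False}
vm_disk = {"all_visible_bit": False}

# The background writer happens to flush the VM page first...
vm_disk = dict(vm_mem)

# ...and we crash before the heap page is written. After recovery,
# only the on-disk copies survive; there is no WAL record to replay.
heap_after, vm_after = heap_disk, vm_disk

# Inconsistent: the map claims all-visible, but PD_ALL_VISIBLE is
# clear, so a later writer touching the heap page would not know it
# must clear the VM bit, and index-only scans would return wrong rows.
assert vm_after["all_visible_bit"] and not heap_after["pd_all_visible"]
print("after crash:", heap_after, vm_after)
```

Note that the reverse flush order is harmless: if only the heap page reaches disk, the VM bit is merely (safely) unset.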
>> My imagination is probably not as good, but suppose that at time A you WAL-log the complete map, and at A+1 you update a tuple so the visibility bit is cleared, but the map bit change does not hit disk due to a crash. Then at WAL replay time you restore the map from time A, and if the tuple change at A+1 is represented in the WAL stream, you also update the visibility map. Is this the situation where the heap tuple hit disk but the map is left in a broken state? Or is it a different, similar-looking situation?
>
> The problem is when a bit is *set* in the visibility map. Clearing a bit is not a problem; we already handle that reliably. If you set the flag on the heap page and set the bit on the visibility map page, and you don't emit a WAL record for either of those operations, the VM page might be flushed to disk before the heap page.

Ah, got it. I thought there was an implicit WAL stream representing the change, which there isn't.

My initial suggestion was actually to trash the map on recovery, write it out safely on shutdown, and let it be lazily created/updated on reads. But I can see that the different performance patterns between normal operation and fresh recovery can be hard to accept, although it would be sufficient/acceptable for many of us.

It is nice that a recovery brings the database into the same state as before the crash in all respects, but in reality the application would still see a huge performance drop due to cold shared buffers and a cold page cache on the OS. A database-wide SELECT would then be needed to bring the map up to date. Would it be OK to update the in-memory bitmap as a side effect of selects?

Jesper
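For illustration only, here is a minimal sketch of the lazy-rebuild idea floated above (a proposal, not anything PostgreSQL actually does): treat the map as empty after a crash and repopulate it as a side effect of scans, which already inspect every tuple on the pages they read.

```python
# Hypothetical lazy visibility-map rebuild: the map starts empty
# after recovery and is filled in by ordinary heap scans.
visibility_map = {}  # heap block number -> all-visible bit

def heap_scan_page(blkno, tuples, is_visible):
    """Scan one heap page; as a side effect, set the in-memory VM bit
    if every tuple on the page is visible to all transactions. The
    scan reads the tuples anyway, so this costs no extra I/O."""
    if all(is_visible(t) for t in tuples):
        visibility_map[blkno] = True
    return tuples

# A database-wide scan would bring the whole map up to date.
pages = {0: ["t1", "t2"], 1: ["t3"]}
for blk, tups in pages.items():
    heap_scan_page(blk, tups, lambda t: True)  # pretend all visible

assert visibility_map == {0: True, 1: True}
```

The objection in the thread still applies to any scheme like this: setting the bit durably without a WAL record reintroduces the flush-ordering race, which is why keeping the rebuilt bits in memory only (as asked here) is the interesting variant.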
Heikki Linnakangas wrote:
> On 04.12.2010 09:14, Jesper@Krogh.cc wrote:
>> There has been a lot of discussion about index-only scans and how to make the visibility map crash-safe, followed by a good discussion about hint bits.
>>
>> The main concern seems to be the added WAL volume, and it makes me wonder if there is an in-between approach that looks more like hint bits.
>>
>> How about lazily WAL-logging the complete visibility map, say every X minutes or every N tuple updates, and making it the WAL recovery job to recheck the visibility of pages touched by the WAL stream during recovery?
>
> If you WAL-log the visibility map changes after-the-fact, it doesn't
> solve the race condition we're struggling with: the visibility map
> change might hit the disk before the PD_ALL_VISIBLE change to the heap
> page. If you crash, you can end up with a situation where the
> PD_ALL_VISIBLE flag on the heap page is not set, but the bit in the
> visibility map is. Which causes serious issues later on.

Based on hacker emails and a discussion I had with Heikki while we were in Germany, I have updated the index-only scans wiki to document a known solution to making the visibility map crash-safe for use by index-only scans:

http://wiki.postgresql.org/wiki/Index-only_scans#Making_the_Visibility_Map_Crash-Safe

    Making the Visibility Map Crash-Safe

    Currently, a heap page that has all-visible tuples is marked by vacuum
    as PD_ALL_VISIBLE and the visibility map (VM) bit is set. This is
    currently unlogged, and a crash could require these to be set again.
    The complexity is that for index-only scans, the VM bit has meaning,
    and cannot be incorrectly set (though it can be incorrectly cleared,
    because that would just result in additional heap access).

    If both PD_ALL_VISIBLE and the VM bit were to be set, and a crash
    resulted in the VM bit being written to disk but not the
    PD_ALL_VISIBLE bit, a later heap access that wrote a
    conditionally-visible row would not know to clear the VM bit, causing
    incorrect results for index-only scans.

    The solution is to WAL-log the VM set-bit activity. This will cause
    full-page writes for the VM page, but this is much less than
    WAL-logging each heap page, because a VM page represents many heap
    pages. This requires that the VM page not be written to disk until its
    VM-set WAL record is fsynced to disk. Also, during crash recovery,
    reading the VM-set WAL record would cause both the VM bit and the heap
    PD_ALL_VISIBLE flag to be set.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
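The rule in the wiki text ("the VM page not be written to disk until its VM-set WAL record is fsynced") is the standard WAL-before-data discipline. A rough sketch of that mechanism, with toy LSNs standing in for real WAL positions (my own model, not PostgreSQL code):

```python
wal_records = []  # WAL records in order of insertion
wal_flushed = 0   # toy LSN: how many records are durable (fsynced)

def xlog_insert(rec):
    """Insert a WAL record; return its LSN (here, its 1-based position)."""
    wal_records.append(rec)
    return len(wal_records)

def xlog_flush(upto):
    """Fsync WAL at least up to the given LSN."""
    global wal_flushed
    wal_flushed = max(wal_flushed, upto)

class Page:
    def __init__(self):
        self.lsn = 0        # LSN of the last WAL record touching this page
        self.on_disk = False

def flush_page(page):
    """WAL-before-data: never write a page until WAL covers its LSN."""
    if page.lsn > wal_flushed:
        xlog_flush(page.lsn)
    assert page.lsn <= wal_flushed
    page.on_disk = True

# WAL-log the VM-set; stamp both the heap and VM pages with its LSN.
heap, vm = Page(), Page()
lsn = xlog_insert(("vm_set", "heap_block", "vm_block"))
heap.lsn = vm.lsn = lsn

flush_page(vm)  # even if the VM page is written first...
# ...the vm_set record is durable, so if we crash before the heap page
# is written, replaying it can redo both PD_ALL_VISIBLE and the VM bit.
assert wal_flushed >= vm.lsn and not heap.on_disk
```

With this rule, the flush order of the two pages no longer matters: any on-disk VM bit is always backed by a durable WAL record that recovery can replay against the heap page.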