Visibility map thoughts - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Visibility map thoughts
Msg-id: 472EE7F6.4070005@enterprisedb.com
List: pgsql-hackers
Though 8.3 isn't out of the oven just yet, I've been thinking about the 
dead space map a lot, and decided I have to start writing down those 
thoughts.

Reducing VACUUM time is important, but the real big promise is the 
ability to do index-only-scans. Because that's the main focus of this 
exercise, I'm calling it the Visibility Map from now on, because 
it's not about tracking dead space, but tuple visibility in general. 
Don't worry, reduced VACUUM times on read-mostly tables with hot spots 
will still fall out of it.



What to store in the Visibility Map?
------------------------------------

Each potential user of the visibility map needs slightly different 
information:

a) VACUUM needs to know if a page has enough dead tuples that it's 
interesting to visit.
b) VACUUM FREEZE needs to know if a page has any non-frozen tuples that 
can be frozen.
c) Index-only-scans need to know if all tuples on a page are visible to 
all transactions. HOT updates are OK, though.
From this point on, this document concentrates on a map suited for 
index-only-scans.

The basic structure is a bitmap with one bit per heap page. A set bit 
means "all tuples on heap page are visible to all transactions". A 
cleared bit means that we don't know if that's true or not.
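
To make the addressing concrete, here's a minimal sketch of how a heap
block number could be mapped to a map page, byte and bit, assuming 8k
map pages with a 24-byte standard page header (the names below are made
up for illustration only):

#include <stdint.h>

#define BLCKSZ            8192
#define PAGEHEADERSIZE    24                  /* standard page header */
#define MAPSIZE           (BLCKSZ - PAGEHEADERSIZE)
#define BITS_PER_MAPPAGE  (MAPSIZE * 8)       /* = 65344, i.e. ~65000 heap pages */

/* Which map page, byte and bit cover a given heap block? */
static inline uint32_t
heapblk_to_mapblock(uint32_t heapBlk)
{
    return heapBlk / BITS_PER_MAPPAGE;
}

static inline uint32_t
heapblk_to_mapbyte(uint32_t heapBlk)
{
    return (heapBlk % BITS_PER_MAPPAGE) / 8;
}

static inline uint8_t
heapblk_to_mapbit(uint32_t heapBlk)
{
    return (uint8_t) (heapBlk % 8);
}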

This kind of map is useful for VACUUM as well, though VACUUM could get 
away with less strict semantics, and we might not want to visit a page 
in VACUUM if there are only a few dead tuples on it. What this 
implementation gives us is a conservative VACUUM that will always find 
the same dead tuples as a regular VACUUM (except for HOT-updated 
tuples, which can be pruned without scanning indexes). A VACUUM using 
the visibility map can't update stats, so we'd still want regular 
vacuums every now and then, or rely on ANALYZE to do that.
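
As a rough sketch of that conservative behaviour (treating the map as a
plain in-memory bit array here; the helper names are hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* Is the "all tuples visible to all transactions" bit set for heapBlk? */
static bool
vm_all_visible(const uint8_t *map, uint32_t heapBlk)
{
    return (map[heapBlk / 8] >> (heapBlk % 8)) & 1;
}

/* Conservative map-aware vacuum pass: pages with the bit set can't
 * contain dead tuples, so skip them; scan the rest exactly as today. */
static void
vacuum_with_map(const uint8_t *map, uint32_t nblocks)
{
    for (uint32_t blkno = 0; blkno < nblocks; blkno++)
    {
        if (vm_all_visible(map, blkno))
            continue;
        /* ... collect dead tuples from heap page blkno as usual ... */
    }
}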

It's not useful for VACUUM FREEZE, unless we're willing to freeze much 
more aggressively, and change the meaning of a set bit to "all tuples on 
heap page are frozen".

Other kinds of visibility maps could be useful in some narrow scenarios. 
For example, if we had a map of pages that have no live tuples, we could 
skip those pages in sequential scans. We could also use more than one 
bit per page to store more information, but let's keep it simple for now.


Where to store the visibility map?
----------------------------------
a) In a fixed size shared memory struct. Not acceptable.

b) In the special area of every nth heap page. I played a bit with this 
back in February 2006, because it's simple and requires few changes.

c) As a new relation kind, like toast tables.

d) As an auxiliary smgr relation, attached to the heap.

I'm leaning towards D at the moment. It requires a few changes to the 
buffer manager API and elsewhere, but not too many.

B is not good because we might want to have other kinds of auxiliary 
files in the future (FSM in particular). And it is a bit hacky anyway. 
Let's do something more generic.

C doesn't seem like the right approach, because a visibility map doesn't 
really have much in common with "real" relations.

In any case, the bitmap is divided into pages, and the pages live in 
shared_buffers, and are managed by the buffer manager like any other page.
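
To make D a bit more concrete, one hypothetical shape for the buffer
manager change would be to let a relation have more than one underlying
file, and have ReadBuffer-style calls say which one they mean. All of
the names below are invented for illustration, not a proposal for the
actual interface:

#include <stdint.h>

typedef int      Buffer;        /* stub typedefs so the sketch stands alone */
typedef uint32_t BlockNumber;
typedef struct RelationData *Relation;

/* Which physical file of the relation a buffer belongs to. */
typedef enum AuxFileKind
{
    AUXFILE_MAIN,            /* the heap itself */
    AUXFILE_VISIBILITY_MAP   /* the auxiliary bitmap attached to it */
    /* later perhaps AUXFILE_FSM, etc. */
} AuxFileKind;

/* ReadBuffer variant that also names the file; existing callers would
 * pass AUXFILE_MAIN. */
extern Buffer ReadAuxBuffer(Relation rel, AuxFileKind kind,
                            BlockNumber blkno);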

Updating the visibility map
---------------------------

A single page in the visibility map covers a wide range of heap pages 
((8192 - page header) * 8 = ~65000 heap pages). It can easily become 
a locking bottleneck if every update on any page in that range needs to 
lock the visibility map page.

Setting a bit is just a hint. It's ok to lose it on crash. However, 
setting a bit mustn't hit the disk too early. What might otherwise 
happen is that the change that made all tuples on a page visible, like 
committing an inserting transaction, isn't replayed after a crash, but the 
set bit is already on disk. In other words, setting a bit must set the 
LSN of the visibility map page.

Clearing a bit is the opposite. Clearing a bit mustn't be lost, but it's 
ok if it hits the disk before the operation that cleared it. Therefore 
we don't need to set the LSN, but the clear must be redone at WAL replay 
of the heap insert/update/delete.
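
Here's a toy sketch of that asymmetry (the struct and helpers are
illustrative only; the real bits live on ordinary buffer pages, and the
page would be locked as discussed below):

#include <stdint.h>

typedef struct MapPage
{
    uint64_t lsn;                 /* stand-in for the page LSN */
    uint8_t  bits[8192 - 24];     /* the bitmap itself */
} MapPage;

/* Setting a bit is only a hint, but the map page must not reach disk
 * before the WAL record that made the heap page all-visible, so bump
 * the page LSN to that record's LSN. */
static void
vm_set_bit(MapPage *page, uint32_t bitno, uint64_t makingVisibleLsn)
{
    page->bits[bitno / 8] |= (uint8_t) (1 << (bitno % 8));
    if (page->lsn < makingVisibleLsn)
        page->lsn = makingVisibleLsn;
}

/* Clearing a bit must never be lost, so it is also redone at WAL replay
 * of the heap insert/update/delete; it's fine if it reaches disk before
 * that record, so no LSN bump is needed. */
static void
vm_clear_bit(MapPage *page, uint32_t bitno)
{
    page->bits[bitno / 8] &= (uint8_t) ~(1 << (bitno % 8));
}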

Because a concurrent update on the same word might be lost if we didn't 
lock the page at all, we need to lock the visibility map page to set or 
clear a bit. Since these are quick operations, especially clearing a 
bit, perhaps we could use just the buffer header spinlock.

To further reduce contention, we can have a copy of the bit in the page 
header of the heap page itself. That way we'd only need to access the 
visibility map on the first update on a page that clears a bit.

When to update the visibility map?
----------------------------------
- Clear bit on every update/insert/delete.
- Set bit in vacuum or prune, if all tuples on page are now visible to 
all transactions.
- If prune sees that all tuples are LIVE or INSERT_IN_PROGRESS, we can't 
set the bit yet, but as soon as the maximum xmin on the page < 
OldestXmin, we can. Call PageSetPrunable accordingly, to prune again 
after that happens.
- We don't need to clear the bit on HOT updates, because by definition 
none of the indexed columns changed.
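
Putting those rules together, the prune-time decision might look roughly
like this (the flags are assumed to be computed by prune's existing
visibility checks, and set_bit()/set_prunable() stand in for the map
update and PageSetPrunable()):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

static void
prune_update_map(bool all_visible_to_all,             /* every tuple visible to everyone */
                 bool all_live_or_insert_in_progress, /* only LIVE/INSERT_IN_PROGRESS seen */
                 TransactionId max_xmin,
                 void (*set_bit)(void),
                 void (*set_prunable)(TransactionId))
{
    if (all_visible_to_all)
        set_bit();
    else if (all_live_or_insert_in_progress)
        set_prunable(max_xmin);   /* prune again once max_xmin < OldestXmin */
}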




Next we need to figure out how to actually do the index-only-scans using 
the visibility map, and how to use it for VACUUMs; Itagaki's dsm patch 
addressed that, and assuming it does what we want, that part of the 
patch is still usable.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

