Thread: Visibility map, partial vacuums

From: Heikki Linnakangas
Here's finally my attempt at the visibility map, a.k.a. the dead space
map. It's still work in progress, but it's time to discuss the design
details. Patch attached, anyway, for reference.

The visibility map is basically a bitmap with one bit per heap page:
'1' for pages that are known to contain only tuples that are visible to
everyone. Such pages don't need vacuuming, because there are no dead
tuples, and the information can also be used to skip visibility checks.
It should allow index-only scans in the future, 8.5 perhaps, but that's
not part of this patch. The visibility map is stored in a new relation
fork, alongside the main data and the FSM.
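
For reference, the map is addressed with a handful of macros (these are
from the attached visibilitymap.c; the size arithmetic assumes the
default 8 kB BLCKSZ and a 24-byte page header):

/* One byte of the map covers 8 heap pages, and one map page covers
 * (BLCKSZ - SizeOfPageHeaderData) * 8 heap pages; with 8 kB blocks
 * that's (8192 - 24) * 8 = 65344 heap pages, roughly 510 MB of heap. */
#define HEAPBLOCKS_PER_BYTE 8
#define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) * HEAPBLOCKS_PER_BYTE)

#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)  (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)   ((x) % HEAPBLOCKS_PER_BYTE)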

Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
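
In pseudo-C, the skip logic boils down to this (condensed from the
lazy_scan_heap() changes in the patch; vmbuffer keeps the relevant map
page pinned across iterations):

for (blkno = 0; blkno < nblocks; blkno++)
{
    /* cheap check against the pinned map page; no heap I/O yet */
    all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer);
    if (!scan_all && all_visible_according_to_vm)
    {
        vacrelstats->scanned_all = false;   /* we're skipping something */
        continue;
    }

    /* ... read the heap page and vacuum it as before ... */
}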

To avoid having to update the visibility map every time a heap page is
updated, I have added a new flag to the heap page header,
PD_ALL_VISIBLE, which indicates the same thing as a set bit in the
visibility map: all tuples on the page are known to be visible to
everyone. When a page is modified, the visibility map only needs to be
updated if PD_ALL_VISIBLE was set. That should make the impact
unnoticeable for use cases with lots of updates, where the visibility
map doesn't help, as only the first update on a page after a vacuum needs
to update the visibility map.

As a bonus, I'm using the PD_ALL_VISIBLE flag to skip visibility checks
in sequential scans. That seems to give a small 5-10% speedup on my
laptop for a simple "SELECT COUNT(*) FROM foo" query, where foo is a
narrow table with just a single integer column that fits in RAM.

The critical part of this patch is to keep the PD_ALL_VISIBLE flag and
the visibility map up-to-date, avoiding race conditions. An invariant is
maintained: if the PD_ALL_VISIBLE flag is *not* set, the corresponding
bit in the visibility map must also not be set. If the PD_ALL_VISIBLE
flag is set, the bit in the visibility map may or may not be set.
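
Stated as an implication, a set bit in the map implies that
PD_ALL_VISIBLE is set on the heap page. In assertion form, just for
illustration (this check is not in the patch):

/* Hypothetical sanity check: the map may lag behind the page flag,
 * but must never claim more than the flag does. */
Assert(!visibilitymap_test(rel, blkno, &vmbuf) || PageIsAllVisible(page));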

To modify a page:
If the PD_ALL_VISIBLE flag is set, the bit in the visibility map is
cleared first. The heap page is kept pinned, but not locked, while the
visibility map is updated. We want to avoid holding a lock across I/O,
even though the visibility map is likely to stay in cache. After the
visibility map has been updated, the page is exclusively locked and
modified as usual, and the PD_ALL_VISIBLE flag is cleared before
releasing the lock.
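
Condensed from the heap_delete() changes in the patch, the sequence is
(blkno stands for the target block here):

buffer = ReadBuffer(relation, blkno);           /* pin only, no lock yet */

/* Clear the bit in the visibility map if necessary; may do I/O */
if (PageIsAllVisible(BufferGetPage(buffer)))
    visibilitymap_clear(relation, blkno);

LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/* ... modify the page and write the WAL record as usual ... */
if (PageIsAllVisible(BufferGetPage(buffer)))
    PageClearAllVisible(BufferGetPage(buffer)); /* before releasing the lock */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);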

To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
page, while you observe that all tuples on the page are visible to everyone.

To set the bit in the visibility map, you need to hold a cleanup lock on
the heap page. That keeps away other backends trying to clear the bit in
the visibility map at the same time. Note that you need to hold a lock
on the heap page to examine PD_ALL_VISIBLE, otherwise the cleanup lock
doesn't protect from the race condition.
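
In VACUUM, that looks like this (condensed from the lazy_scan_heap()
hunk below); visibilitymap_set_opt() only touches the map page that's
already pinned in vmbuffer, so no I/O is done while the heap page is
locked:

/* cleanup lock on the heap page is held at this point */
if (PageIsAllVisible(page))
    visibilitymap_set_opt(onerel, blkno, PageGetLSN(page), &vmbuffer);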


That's how the patch works right now. However, there's a small
performance problem with the current approach: setting the
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
1. All tuples on a page become visible to everyone. The inserting
transaction committed, for example. A backend sees that and sets
PD_ALL_VISIBLE.
2. Vacuum comes along, and sees that there's no work to be done on the
page. It sets the bit in the visibility map.
3. The visibility map page is flushed to disk. The heap page is not, yet.
4. Crash

The bit in the visibility map is now set, but the corresponding
PD_ALL_VISIBLE flag is not, because it never made it to disk.

I'm avoiding that at the moment by only setting PD_ALL_VISIBLE as part
of a page prune operation, and forcing a WAL record to be written even
if no other work is done on the page. The downside of that is that it
can lead to a big increase in WAL traffic after a bulk load, for
example. The first vacuum after the bulk load would have to write a WAL
record for every heap page, even though there are no dead tuples.

One option would be to just ignore that problem for now, and not
WAL-log. As long as we don't use the visibility map for anything like
index-only scans, it doesn't matter much if there are some bits set that
shouldn't be. It just means that VACUUM will skip some pages that need
vacuuming, but VACUUM FREEZE will eventually catch those. Given how
little time we have until commitfest and feature freeze, that's probably
the most reasonable thing to do. I'll follow up with other solutions to
that problem, but mainly for discussion for 8.5.


Another thing that does need to be fixed is the way the extension and
truncation of the visibility map are handled; that's broken in the
current patch. I started working on the patch a long time ago, before
the FSM rewrite was finished, and haven't gotten around to fixing that
part yet. We already solved it for the FSM, so we could just follow that
pattern. The way we solved truncation in the FSM was to write a separate
WAL record with the new heap size, but perhaps we want to revisit that
decision instead of again adding new code to write a third WAL record
for truncation of the visibility map. smgrtruncate() writes a WAL record
of its own if any full blocks are truncated away from the FSM, but we
needed a WAL record even if no full blocks are truncated from the FSM
file, because the "tail" of the last remaining FSM page, representing
the truncated-away heap pages, still needs to be cleared. The visibility
map has the same problem.

One proposal was to piggyback on the smgrtruncate() WAL record, and call
FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a
modularity point of view; smgr.c shouldn't be calling higher-level
functions. But maybe it wouldn't be that bad, after all. Or we could
remove WAL-logging from smgrtruncate() altogether, move it to
RelationTruncate() or another higher-level function, and handle the
WAL-logging and replay there.
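
To illustrate the latter alternative, the call sequence might look
roughly like this (a sketch only; the wrapper name and the
single-record idea are assumptions, not something in this patch):

/* Hypothetical wrapper: truncate all forks in one place, WAL-log once */
static void
RelationTruncateAllForks(Relation rel, BlockNumber nblocks)
{
    FreeSpaceMapTruncateRel(rel, nblocks); /* clears tail of last FSM page */
    visibilitymap_truncate(rel, nblocks);  /* clears tail of last map page */
    RelationTruncate(rel, nblocks);        /* truncates the main fork */

    /*
     * Write a single WAL record with the new heap size here, instead of
     * in smgrtruncate(); replay would redo all three truncations from it.
     */
}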


There are some side effects of partial vacuums that also need to be
fixed. First of all, the tuple count stored in pg_class is now wrong: it
only includes tuples from the pages that were visited. VACUUM VERBOSE
output needs to be changed as well to reflect that only some pages were
scanned.
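
For the tuple count, one option would be to extrapolate from the scanned
pages rather than storing the raw sum (a sketch with assumed variable
names; not part of the patch):

/* Hypothetical extrapolation: scanned_pages, num_tuples and rel_pages
 * are assumed counters, not fields in the current patch. */
if (scanned_pages > 0 && scanned_pages < rel_pages)
    new_rel_tuples = num_tuples * ((double) rel_pages / scanned_pages);
else
    new_rel_tuples = num_tuples;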

Other TODOs
- performance testing, to ensure that there's no significant performance
penalty.
- should add a specialized version of visibilitymap_clear() for WAL
replay, so that we wouldn't have to rely so much on the fake relcache
entries.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 **** subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o

  include $(top_srcdir)/src/backend/common.mk
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 194,199 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 195,201 ----
      int            ntup;
      OffsetNumber lineoff;
      ItemId        lpp;
+     bool        all_visible;

      Assert(page < scan->rs_nblocks);

***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
-             HeapTupleData loctup;
              bool        valid;

!             loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!             loctup.t_len = ItemIdGetLength(lpp);
!             ItemPointerSet(&(loctup.t_self), page, lineoff);

!             valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
--- 235,266 ----
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

+     /*
+      * If the all-visible flag indicates that all tuples on the page are
+      * visible to everyone, we can skip the per-tuple visibility tests.
+      */
+     all_visible = PageIsAllVisible(dp);
+
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
              bool        valid;

!             if (all_visible)
!                 valid = true;
!             else
!             {
!                 HeapTupleData loctup;
!
!                 loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!                 loctup.t_len = ItemIdGetLength(lpp);
!                 ItemPointerSet(&(loctup.t_self), page, lineoff);

!                 valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
!             }
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
***************
*** 1914,1919 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1928,1934 ----
          Page        page = BufferGetPage(buffer);
          uint8        info = XLOG_HEAP_INSERT;

+         xlrec.all_visible_cleared = PageIsAllVisible(page);
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = heaptup->t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 1961,1966 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1976,1991 ----
          PageSetTLI(page, ThisTimeLineID);
      }

+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         /*
+          * The bit in the visibility map was already cleared by
+          * RelationGetBufferForTuple
+          */
+         /* visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); */
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+
      END_CRIT_SECTION();

      UnlockReleaseBuffer(buffer);
***************
*** 2045,2050 **** heap_delete(Relation relation, ItemPointer tid,
--- 2070,2080 ----
      Assert(ItemPointerIsValid(tid));

      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+
+     /* Clear the bit in the visibility map if necessary */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
      LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

      page = BufferGetPage(buffer);
***************
*** 2208,2213 **** l1:
--- 2238,2244 ----
          XLogRecPtr    recptr;
          XLogRecData rdata[2];

+         xlrec.all_visible_cleared = PageIsAllVisible(page);
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = tp.t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 2229,2234 **** l1:
--- 2260,2268 ----

      END_CRIT_SECTION();

+     if (PageIsAllVisible(page))
+         PageClearAllVisible(page);
+
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

      /*
***************
*** 2627,2632 **** l2:
--- 2661,2670 ----
          }
          else
          {
+             /* Clear bit in visibility map */
+             if (PageIsAllVisible(page))
+                 visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
              /* Re-acquire the lock on the old tuple's page. */
              LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
              /* Re-check using the up-to-date free space */
***************
*** 2750,2755 **** l2:
--- 2788,2799 ----
          PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
      }

+     /* The bits in visibility map were already cleared */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+         PageClearAllVisible(BufferGetPage(buffer));
+     if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
+         PageClearAllVisible(BufferGetPage(newbuf));
+
      END_CRIT_SECTION();

      if (newbuf != buffer)
***************
*** 3381,3386 **** l3:
--- 3425,3436 ----

      END_CRIT_SECTION();

+     /*
+      * Don't update the visibility map here. Locking a tuple doesn't
+      * change visibility info.
+      */
+     /* visibilitymap_clear(relation, tuple->t_self); */
+
      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);

      /*
***************
*** 3727,3733 **** log_heap_clean(Relation reln, Buffer buffer,
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move)
  {
      xl_heap_clean xlrec;
      uint8        info;
--- 3777,3783 ----
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move, bool all_visible_set)
  {
      xl_heap_clean xlrec;
      uint8        info;
***************
*** 3741,3746 **** log_heap_clean(Relation reln, Buffer buffer,
--- 3791,3797 ----
      xlrec.block = BufferGetBlockNumber(buffer);
      xlrec.nredirected = nredirected;
      xlrec.ndead = ndead;
+     xlrec.all_visible_set = all_visible_set;

      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapClean;
***************
*** 3892,3900 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 3943,3953 ----
      else
          info = XLOG_HEAP_UPDATE;

+     xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf));
      xlrec.target.node = reln->rd_node;
      xlrec.target.tid = from;
      xlrec.newtid = newtup->t_self;
+     xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf));

      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapUpdate;
***************
*** 4029,4034 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
--- 4082,4088 ----
      int            nredirected;
      int            ndead;
      int            nunused;
+     bool        all_visible_set;

      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;
***************
*** 4046,4051 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
--- 4100,4106 ----

      nredirected = xlrec->nredirected;
      ndead = xlrec->ndead;
+     all_visible_set = xlrec->all_visible_set;
      end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
      redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean);
      nowdead = redirected + (nredirected * 2);
***************
*** 4058,4064 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
                              redirected, nredirected,
                              nowdead, ndead,
                              nowunused, nunused,
!                             clean_move);

      /*
       * Note: we don't worry about updating the page's prunability hints.
--- 4113,4119 ----
                              redirected, nredirected,
                              nowdead, ndead,
                              nowunused, nunused,
!                             clean_move, all_visible_set);

      /*
       * Note: we don't worry about updating the page's prunability hints.
***************
*** 4152,4157 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4207,4224 ----
      ItemId        lp = NULL;
      HeapTupleHeader htup;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&(xlrec->target.tid)));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

***************
*** 4189,4194 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4256,4264 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /* Make sure there is no forward chain link in t_ctid */
      htup->t_ctid = xlrec->target.tid;
      PageSetLSN(page, lsn);
***************
*** 4213,4218 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4283,4299 ----
      xl_heap_header xlhdr;
      uint32        newlen;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

***************
*** 4270,4275 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4351,4360 ----
          elog(PANIC, "heap_insert_redo: failed to add tuple");
      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
+
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      MarkBufferDirty(buffer);
      UnlockReleaseBuffer(buffer);
  }
***************
*** 4297,4302 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4382,4398 ----
      int            hsize;
      uint32        newlen;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
      {
          if (samepage)
***************
*** 4361,4366 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4457,4465 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /*
       * this test is ugly, but necessary to avoid thinking that insert change
       * is already applied
***************
*** 4376,4381 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4475,4491 ----

  newt:;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->new_all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_2)
          return;

***************
*** 4453,4458 **** newsame:;
--- 4563,4572 ----
      offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
      if (offnum == InvalidOffsetNumber)
          elog(PANIC, "heap_update_redo: failed to add tuple");
+
+     if (xlrec->new_all_visible_cleared)
+         PageClearAllVisible(page);
+
      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
      MarkBufferDirty(buffer);
*** src/backend/access/heap/hio.c
--- src/backend/access/heap/hio.c
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"

  #include "access/hio.h"
+ #include "access/visibilitymap.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/lmgr.h"
***************
*** 221,229 **** RelationGetBufferForTuple(Relation relation, Size len,
          pageFreeSpace = PageGetHeapFreeSpace(page);
          if (len + saveFreeSpace <= pageFreeSpace)
          {
!             /* use this page as future insert target, too */
!             relation->rd_targblock = targetBlock;
!             return buffer;
          }

          /*
--- 222,278 ----
          pageFreeSpace = PageGetHeapFreeSpace(page);
          if (len + saveFreeSpace <= pageFreeSpace)
          {
!             if (PageIsAllVisible(page))
!             {
!                 /*
!                  * Need to update the visibility map first. Let's drop the
!                  * locks while we do that.
!                  */
!                 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
!                 if (otherBlock != targetBlock && BufferIsValid(otherBuffer))
!                     LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
!
!                 visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
!
!                 /* relock */
!                 if (otherBuffer == InvalidBuffer)
!                 {
!                     /* easy case */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else if (otherBlock == targetBlock)
!                 {
!                     /* also easy case */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else if (otherBlock < targetBlock)
!                 {
!                     /* lock other buffer first */
!                     LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else
!                 {
!                     /* lock target buffer first */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                     LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!
!                 /* Check if it still has enough space */
!                 pageFreeSpace = PageGetHeapFreeSpace(page);
!                 if (len + saveFreeSpace <= pageFreeSpace)
!                 {
!                     /* use this page as future insert target, too */
!                     relation->rd_targblock = targetBlock;
!                     return buffer;
!                 }
!             }
!             else
!             {
!                 /* use this page as future insert target, too */
!                 relation->rd_targblock = targetBlock;
!                 return buffer;
!             }
          }

          /*
***************
*** 276,281 **** RelationGetBufferForTuple(Relation relation, Size len,
--- 325,332 ----
       */
      buffer = ReadBuffer(relation, P_NEW);

+     visibilitymap_extend(relation, BufferGetBlockNumber(buffer) + 1);
+
      /*
       * We can be certain that locking the otherBuffer first is OK, since it
       * must have a lower page number.
*** src/backend/access/heap/pruneheap.c
--- src/backend/access/heap/pruneheap.c
***************
*** 17,22 ****
--- 17,24 ----
  #include "access/heapam.h"
  #include "access/htup.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
+ #include "access/xlogdefs.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
***************
*** 37,42 **** typedef struct
--- 39,45 ----
      OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
      OffsetNumber nowdead[MaxHeapTuplesPerPage];
      OffsetNumber nowunused[MaxHeapTuplesPerPage];
+     bool        all_visible_set;
      /* marked[i] is TRUE if item i is entered in one of the above arrays */
      bool        marked[MaxHeapTuplesPerPage + 1];
  } PruneState;
***************
*** 156,161 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 159,166 ----
      OffsetNumber offnum,
                  maxoff;
      PruneState    prstate;
+     bool        all_visible, all_visible_in_future;
+     TransactionId newest_xid;

      /*
       * Our strategy is to scan the page and make lists of items to change,
***************
*** 177,182 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 182,188 ----
       */
      prstate.new_prune_xid = InvalidTransactionId;
      prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+     prstate.all_visible_set = false;
      memset(prstate.marked, 0, sizeof(prstate.marked));

      /* Scan the page */
***************
*** 215,220 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 221,317 ----
      if (redirect_move)
          EndNonTransactionalInvalidation();

+     /* Update the visibility map */
+     all_visible = true;
+     all_visible_in_future = true;
+     newest_xid = InvalidTransactionId;
+     maxoff = PageGetMaxOffsetNumber(page);
+     for (offnum = FirstOffsetNumber;
+          offnum <= maxoff;
+          offnum = OffsetNumberNext(offnum))
+     {
+         ItemId itemid = PageGetItemId(page, offnum);
+         HeapTupleHeader htup;
+         HTSV_Result status;
+
+         if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+             continue;
+
+         if (ItemIdIsDead(itemid))
+         {
+             all_visible = false;
+             all_visible_in_future = false;
+             break;
+         }
+
+         htup = (HeapTupleHeader) PageGetItem(page, itemid);
+         status = HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer);
+         switch(status)
+         {
+             case HEAPTUPLE_DEAD:
+                 /*
+                  * There shouldn't be any dead tuples left on the page, since
+                  * we just pruned. They should've been truncated to just dead
+                  * line pointers.
+                  */
+                 Assert(false);
+             case HEAPTUPLE_RECENTLY_DEAD:
+                 /*
+                  * This tuple is not visible to all, and it won't become
+                  * so in the future
+                  */
+                 all_visible = false;
+                 all_visible_in_future = false;
+                 break;
+             case HEAPTUPLE_INSERT_IN_PROGRESS:
+                 /*
+                  * This tuple is not visible to all. But it might become
+                  * so in the future, if the inserter commits.
+                  */
+                 all_visible = false;
+                 if (TransactionIdFollows(HeapTupleHeaderGetXmin(htup), newest_xid))
+                     newest_xid = HeapTupleHeaderGetXmin(htup);
+                 break;
+             case HEAPTUPLE_DELETE_IN_PROGRESS:
+                 /*
+                  * This tuple is not visible to all. But it might become
+                  * so in the future, if the deleter aborts.
+                  */
+                 all_visible = false;
+                 if (TransactionIdFollows(HeapTupleHeaderGetXmax(htup), newest_xid))
+                     newest_xid = HeapTupleHeaderGetXmax(htup);
+                 break;
+             case HEAPTUPLE_LIVE:
+                 /*
+                  * Check if the inserter is old enough that this tuple is
+                  * visible to all
+                  */
+                 if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(htup), OldestXmin))
+                 {
+                     /*
+                      * Nope. But as OldestXmin advances beyond xmin, this
+                      * will become visible to all
+                      */
+                     all_visible = false;
+                     if (TransactionIdFollows(HeapTupleHeaderGetXmin(htup), newest_xid))
+                         newest_xid = HeapTupleHeaderGetXmin(htup);
+                 }
+         }
+     }
+     if (all_visible)
+     {
+         if (!PageIsAllVisible(page))
+             prstate.all_visible_set = true;
+     }
+     else if (all_visible_in_future && TransactionIdIsValid(newest_xid))
+     {
+         /*
+          * We still have hope that all tuples will become visible
+          * in the future
+          */
+         heap_prune_record_prunable(&prstate, newest_xid);
+     }
+
      /* Any error while applying the changes is critical */
      START_CRIT_SECTION();

***************
*** 230,236 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
                                  prstate.redirected, prstate.nredirected,
                                  prstate.nowdead, prstate.ndead,
                                  prstate.nowunused, prstate.nunused,
!                                 redirect_move);

          /*
           * Update the page's pd_prune_xid field to either zero, or the lowest
--- 327,333 ----
                                  prstate.redirected, prstate.nredirected,
                                  prstate.nowdead, prstate.ndead,
                                  prstate.nowunused, prstate.nunused,
!                                 redirect_move, prstate.all_visible_set);

          /*
           * Update the page's pd_prune_xid field to either zero, or the lowest
***************
*** 253,264 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
          if (!relation->rd_istemp)
          {
              XLogRecPtr    recptr;
-
              recptr = log_heap_clean(relation, buffer,
                                      prstate.redirected, prstate.nredirected,
                                      prstate.nowdead, prstate.ndead,
                                      prstate.nowunused, prstate.nunused,
!                                     redirect_move);

              PageSetLSN(BufferGetPage(buffer), recptr);
              PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
--- 350,360 ----
          if (!relation->rd_istemp)
          {
              XLogRecPtr    recptr;
              recptr = log_heap_clean(relation, buffer,
                                      prstate.redirected, prstate.nredirected,
                                      prstate.nowdead, prstate.ndead,
                                      prstate.nowunused, prstate.nunused,
!                                     redirect_move, prstate.all_visible_set);

              PageSetLSN(BufferGetPage(buffer), recptr);
              PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
***************
*** 701,707 **** heap_page_prune_execute(Buffer buffer,
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move)
  {
      Page        page = (Page) BufferGetPage(buffer);
      OffsetNumber *offnum;
--- 797,803 ----
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move, bool all_visible)
  {
      Page        page = (Page) BufferGetPage(buffer);
      OffsetNumber *offnum;
***************
*** 766,771 **** heap_page_prune_execute(Buffer buffer,
--- 862,875 ----
       * whether it has free pointers.
       */
      PageRepairFragmentation(page);
+
+     /*
+      * We don't want to poke the visibility map from here, as that might mean
+      * physical I/O; just set the flag on the heap page. The caller can
+      * update the visibility map afterwards if it wants to.
+      */
+     if (all_visible)
+         PageSetAllVisible(page);
  }


*** /dev/null
--- src/backend/access/heap/visibilitymap.c
***************
*** 0 ****
--- 1,312 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.c
+  *      Visibility map
+  *
+  * Portions Copyright (c) 2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      $PostgreSQL$
+  *
+  * NOTES
+  *
+  * The visibility map is a bitmap with one bit per heap page. A set bit means
+  * that all tuples on the page are visible to all transactions. The
+  * map is conservative in the sense that we make sure that whenever a bit is
+  * set, we know the condition is true, but if a bit is not set, it might
+  * or might not be.
+  *
+  * From that it follows that when a bit is set, we need to update the LSN
+  * of the page to make sure that it doesn't get written to disk before the
+  * WAL record of the changes that made it possible to set the bit is flushed.
+  * But when a bit is cleared, we don't have to do that because if the page is
+  * flushed early, it's ok.
+  *
+  * There's no explicit WAL logging in the functions in this file. The callers
+  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+  * replay of the updating operation as well. XXX: the WAL-logging of setting
+  * bit needs more thought.
+  *
+  * LOCKING
+  *
+  * To clear a bit for a heap page, caller must hold an exclusive lock
+  * on the heap page. To set a bit, a clean up lock on the heap page is
+  * needed.
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+
+ #include "access/visibilitymap.h"
+ #include "storage/bufmgr.h"
+ #include "storage/bufpage.h"
+ #include "storage/smgr.h"
+
+ /* #define TRACE_VISIBILITYMAP */
+
+ /* Number of bits allocated for each heap block. */
+ #define BITS_PER_HEAPBLOCK 1
+
+ /* Number of heap blocks we can represent in one byte. */
+ #define HEAPBLOCKS_PER_BYTE 8
+
+ /* Number of heap blocks we can represent in one visibility map page */
+ #define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) * HEAPBLOCKS_PER_BYTE )
+
+ /* Mapping from heap block number to the right bit in the visibility map */
+ #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+ #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+
+ static Buffer ReadVMBuffer(Relation rel, BlockNumber blkno);
+ static Buffer ReleaseAndReadVMBuffer(Relation rel, BlockNumber blkno, Buffer oldbuf);
+
+ static Buffer
+ ReadVMBuffer(Relation rel, BlockNumber blkno)
+ {
+     if (blkno == P_NEW)
+         return ReadBufferWithFork(rel, VISIBILITYMAP_FORKNUM, P_NEW);
+
+     if (rel->rd_vm_nblocks_cache == InvalidBlockNumber ||
+         rel->rd_vm_nblocks_cache <= blkno)
+         rel->rd_vm_nblocks_cache = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+
+     if (blkno >= rel->rd_vm_nblocks_cache)
+         return InvalidBuffer;
+     else
+         return ReadBufferWithFork(rel, VISIBILITYMAP_FORKNUM, blkno);
+ }
+
+ static Buffer
+ ReleaseAndReadVMBuffer(Relation rel, BlockNumber blkno, Buffer oldbuf)
+ {
+     if (BufferIsValid(oldbuf))
+     {
+         if (BufferGetBlockNumber(oldbuf) == blkno)
+             return oldbuf;
+         else
+             ReleaseBuffer(oldbuf);
+     }
+
+     return ReadVMBuffer(rel, blkno);
+ }
+
+ void
+ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     uint32        mapByte  = HEAPBLK_TO_MAPBYTE(nheapblocks);
+     uint8        mapBit   = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(LOG, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     /* Truncate away pages that are no longer needed */
+     if (mapBlock == 0 && mapBit == 0)
+         smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, mapBlock,
+                      rel->rd_istemp);
+     else
+     {
+         Buffer mapBuffer;
+         Page page;
+         char *mappage;
+         int len;
+
+         smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, mapBlock + 1,
+                      rel->rd_istemp);
+
+         /*
+          * Clear all bits in the last map page, that represent the truncated
+          * heap blocks. This is not only tidy, but also necessary because
+          * we don't clear the bits on extension.
+          */
+         mapBuffer = ReadVMBuffer(rel, mapBlock);
+         if (BufferIsValid(mapBuffer))
+         {
+             page = BufferGetPage(mapBuffer);
+             mappage = PageGetContents(page);
+
+             LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+             /*
+              * Clear out the unwanted bytes.
+              */
+             len = HEAPBLOCKS_PER_PAGE/HEAPBLOCKS_PER_BYTE - (mapByte + 1);
+             MemSet(&mappage[mapByte + 1], 0, len);
+
+             /*
+              * Mask out the unwanted bits of the last remaining byte
+              *
+              * ((1 << 0) - 1) = 00000000
+              * ((1 << 1) - 1) = 00000001
+              * ...
+              * ((1 << 6) - 1) = 00111111
+              * ((1 << 7) - 1) = 01111111
+              */
+             mappage[mapByte] &= (1 << mapBit) - 1;
+
+             /*
+              * This needs to be WAL-logged. Although the now-unused bits shouldn't
+              * be accessed anymore, they had better be zero if we extend again.
+              */
+
+             MarkBufferDirty(mapBuffer);
+             UnlockReleaseBuffer(mapBuffer);
+         }
+     }
+ }
+
+ void
+ visibilitymap_extend(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     BlockNumber size;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(LOG, "vm_extend %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     Assert(nheapblocks > 0);
+
+     /* Open it at the smgr level if not already done */
+     RelationOpenSmgr(rel);
+
+     size = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+     for(; size < mapBlock + 1; size++)
+     {
+         Buffer mapBuffer = ReadVMBuffer(rel, P_NEW);
+
+         LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+         PageInit(BufferGetPage(mapBuffer), BLCKSZ, 0);
+         MarkBufferDirty(mapBuffer);
+         UnlockReleaseBuffer(mapBuffer);
+     }
+ }
+
+ /*
+  * Marks that all tuples on a heap page are visible to all.
+  *
+  * *buf is a buffer, previously returned by visibilitymap_test(). This is
+  * an opportunistic function; if *buf doesn't contain the bit for heapBlk,
+  * we do nothing. We don't want to do any I/O, because the caller is holding
+  * a cleanup lock on the heap page.
+  */
+ void
+ visibilitymap_set_opt(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
+                       Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Page        page;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
+         return;
+
+     page = BufferGetPage(*buf);
+     mappage = PageGetContents(page);
+     LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+
+     if (!(mappage[mapByte] & (1 << mapBit)))
+     {
+         mappage[mapByte] |= (1 << mapBit);
+
+         if (XLByteLT(PageGetLSN(page), recptr))
+             PageSetLSN(page, recptr);
+         PageSetTLI(page, ThisTimeLineID);
+         MarkBufferDirty(*buf);
+     }
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+  * Are all tuples on heap page visible to all?
+  *
+  * The page containing the bit for the heap block is (kept) pinned,
+  * and *buf is set to that buffer. If *buf is valid on entry, it should
+  * be a buffer previously returned by this function, for the same relation,
+  * and unless the new heap block is on the same page, it is released.
+  */
+ bool
+ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     bool        val;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     *buf = ReleaseAndReadVMBuffer(rel, mapBlock, *buf);
+     if (!BufferIsValid(*buf))
+         return false;
+
+     /* XXX: Can we get away without locking? */
+     LockBuffer(*buf, BUFFER_LOCK_SHARE);
+
+     mappage = PageGetContents(BufferGetPage(*buf));
+
+     val = (mappage[mapByte] & (1 << mapBit)) ? true : false;
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+
+     return val;
+ }
+
+ /*
+  * Mark that not all tuples are visible to all.
+  */
+ void
+ visibilitymap_clear(Relation rel, BlockNumber heapBlk)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Buffer        mapBuffer;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     mapBuffer = ReadVMBuffer(rel, mapBlock);
+     if (!BufferIsValid(mapBuffer))
+         return; /* nothing to do */
+
+     /* XXX: Can we get away without locking here?
+      *
+      * We mustn't re-set a bit that was just cleared, so it doesn't seem
+      * safe. Clearing the bit is really "load; and; store", so without
+      * the lock we might store back a bit that's just being cleared
+      * by a concurrent updater.
+      *
+      * We could use the buffer header spinlock here, but the API to do
+      * that is intended to be internal to buffer manager. We'd still need
+      * to get a shared lock to mark the buffer as dirty, though.
+      */
+     LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+     mappage = PageGetContents(BufferGetPage(mapBuffer));
+
+     if (mappage[mapByte] & (1 << mapBit))
+     {
+         mappage[mapByte] &= ~(1 << mapBit);
+
+         MarkBufferDirty(mapBuffer);
+     }
+
+     LockBuffer(mapBuffer, BUFFER_LOCK_UNLOCK);
+     ReleaseBuffer(mapBuffer);
+ }
*** src/backend/access/transam/xlogutils.c
--- src/backend/access/transam/xlogutils.c
***************
*** 360,365 **** CreateFakeRelcacheEntry(RelFileNode rnode)
--- 360,366 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     rel->rd_vm_nblocks_cache = InvalidBlockNumber;
      rel->rd_smgr = NULL;

      return rel;
*** src/backend/catalog/heap.c
--- src/backend/catalog/heap.c
***************
*** 33,38 ****
--- 33,39 ----
  #include "access/heapam.h"
  #include "access/sysattr.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "catalog/dependency.h"
***************
*** 306,316 **** heap_create(const char *relname,
          smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false);

          /*
!          * For a real heap, create FSM fork as well. Indexams are
!          * responsible for creating any extra forks themselves.
           */
          if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE)
              smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false);
      }

      return rel;
--- 307,320 ----
          smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false);

          /*
!          * For a real heap, create FSM and visibility map as well. Indexams
!          * are responsible for creating any extra forks themselves.
           */
          if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE)
+         {
              smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false);
+             smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, rel->rd_istemp, false);
+         }
      }

      return rel;
***************
*** 2324,2329 **** heap_truncate(List *relids)
--- 2328,2334 ----

          /* Truncate the FSM and actual file (and discard buffers) */
          FreeSpaceMapTruncateRel(rel, 0);
+         visibilitymap_truncate(rel, 0);
          RelationTruncate(rel, 0);

          /* If this relation has indexes, truncate the indexes too */
*** src/backend/catalog/index.c
--- src/backend/catalog/index.c
***************
*** 1343,1354 **** setNewRelfilenode(Relation relation, TransactionId freezeXid)
      smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false);

      /*
!      * For a heap, create FSM fork as well. Indexams are responsible for
!      * creating any extra forks themselves.
       */
      if (relation->rd_rel->relkind == RELKIND_RELATION ||
          relation->rd_rel->relkind == RELKIND_TOASTVALUE)
          smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false);

      /* schedule unlinking old files */
      for (i = 0; i <= MAX_FORKNUM; i++)
--- 1343,1357 ----
      smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false);

      /*
!      * For a heap, create FSM and visibility map as well. Indexams are
!      * responsible for creating any extra forks themselves.
       */
      if (relation->rd_rel->relkind == RELKIND_RELATION ||
          relation->rd_rel->relkind == RELKIND_TOASTVALUE)
+     {
          smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false);
+         smgrcreate(srel, VISIBILITYMAP_FORKNUM, relation->rd_istemp, false);
+     }

      /* schedule unlinking old files */
      for (i = 0; i <= MAX_FORKNUM; i++)
*** src/backend/commands/vacuum.c
--- src/backend/commands/vacuum.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlog.h"
  #include "catalog/namespace.h"
***************
*** 1327,1332 **** scan_heap(VRelStats *vacrelstats, Relation onerel,
--- 1328,1336 ----

      nblocks = RelationGetNumberOfBlocks(onerel);

+     if (nblocks > 0)
+         visibilitymap_extend(onerel, nblocks);
+
      /*
       * We initially create each VacPage item in a maximal-sized workspace,
       * then copy the workspace into a just-large-enough copy.
***************
*** 2822,2828 **** repair_frag(VRelStats *vacrelstats, Relation onerel,
                  recptr = log_heap_clean(onerel, buf,
                                          NULL, 0, NULL, 0,
                                          unused, uncnt,
!                                         false);
                  PageSetLSN(page, recptr);
                  PageSetTLI(page, ThisTimeLineID);
              }
--- 2826,2832 ----
                  recptr = log_heap_clean(onerel, buf,
                                          NULL, 0, NULL, 0,
                                          unused, uncnt,
!                                         false, false);
                  PageSetLSN(page, recptr);
                  PageSetTLI(page, ThisTimeLineID);
              }
***************
*** 2843,2848 **** repair_frag(VRelStats *vacrelstats, Relation onerel,
--- 2847,2853 ----
      if (blkno < nblocks)
      {
          FreeSpaceMapTruncateRel(onerel, blkno);
+         visibilitymap_truncate(onerel, blkno);
          RelationTruncate(onerel, blkno);
          vacrelstats->rel_pages = blkno; /* set new number of blocks */
      }
***************
*** 2881,2886 **** move_chain_tuple(Relation rel,
--- 2886,2899 ----
      Size        tuple_len = old_tup->t_len;

      /*
+      * we don't need to bother with the usual locking protocol for updating
+      * the visibility map, since we're holding an AccessExclusiveLock on the
+      * relation anyway.
+      */
+     visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+
+     /*
       * make a modifiable copy of the source tuple.
       */
      heap_copytuple_with_tuple(old_tup, &newtup);
***************
*** 3020,3025 **** move_plain_tuple(Relation rel,
--- 3033,3046 ----
      ItemId        newitemid;
      Size        tuple_len = old_tup->t_len;

+     /*
+      * we don't need to bother with the usual locking protocol for updating
+      * the visibility map, since we're holding an AccessExclusiveLock on the
+      * relation anyway.
+      */
+     visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+
      /* copy tuple */
      heap_copytuple_with_tuple(old_tup, &newtup);

***************
*** 3238,3243 **** vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
--- 3259,3265 ----
                          RelationGetRelationName(onerel),
                          vacrelstats->rel_pages, relblocks)));
          FreeSpaceMapTruncateRel(onerel, relblocks);
+         visibilitymap_truncate(onerel, relblocks);
          RelationTruncate(onerel, relblocks);
          vacrelstats->rel_pages = relblocks;        /* set new number of blocks */
      }
***************
*** 3279,3285 **** vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage)
          recptr = log_heap_clean(onerel, buffer,
                                  NULL, 0, NULL, 0,
                                  vacpage->offsets, vacpage->offsets_free,
!                                 false);
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
--- 3301,3307 ----
          recptr = log_heap_clean(onerel, buffer,
                                  NULL, 0, NULL, 0,
                                  vacpage->offsets, vacpage->offsets_free,
!                                 false, false);
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
*** src/backend/commands/vacuumlazy.c
--- src/backend/commands/vacuumlazy.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "commands/dbcommands.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
***************
*** 87,92 **** typedef struct LVRelStats
--- 88,94 ----
      int            max_dead_tuples;    /* # slots allocated in array */
      ItemPointer dead_tuples;    /* array of ItemPointerData */
      int            num_index_scans;
+     bool        scanned_all;    /* have we scanned all pages (so far) in the rel? */
  } LVRelStats;


***************
*** 101,111 **** static BufferAccessStrategy vac_strategy;

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
                    LVRelStats *vacrelstats);
  static void lazy_cleanup_index(Relation indrel,
                     IndexBulkDeleteResult *stats,
                     LVRelStats *vacrelstats);
--- 103,114 ----

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
                    LVRelStats *vacrelstats);
+
  static void lazy_cleanup_index(Relation indrel,
                     IndexBulkDeleteResult *stats,
                     LVRelStats *vacrelstats);
***************
*** 140,145 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
--- 143,149 ----
      BlockNumber possibly_freeable;
      PGRUsage    ru0;
      TimestampTz starttime = 0;
+     bool        scan_all;

      pg_rusage_init(&ru0);

***************
*** 165,172 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
--- 169,187 ----
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

+     /* Should we use the visibility map or scan all pages? */
+     if (vacstmt->freeze_min_age != -1)
+         scan_all = true;
+     else if (vacstmt->analyze)
+         scan_all = true;
+     else
+         scan_all = false;
+
+     /* initialize this variable */
+     vacrelstats->scanned_all = true;
+
      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
***************
*** 231,237 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes)
  {
      BlockNumber nblocks,
                  blkno;
--- 246,252 ----
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all)
  {
      BlockNumber nblocks,
                  blkno;
***************
*** 246,251 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 261,267 ----
      IndexBulkDeleteResult **indstats;
      int            i;
      PGRUsage    ru0;
+     Buffer        vmbuffer = InvalidBuffer;

      pg_rusage_init(&ru0);

***************
*** 267,272 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 283,291 ----

      lazy_space_alloc(vacrelstats, nblocks);

+     if (nblocks > 0)
+         visibilitymap_extend(onerel, nblocks);
+
      for (blkno = 0; blkno < nblocks; blkno++)
      {
          Buffer        buf;
***************
*** 279,284 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 298,320 ----
          OffsetNumber frozen[MaxOffsetNumber];
          int            nfrozen;
          Size        freespace;
+         bool        all_visible_according_to_vm;
+
+         /*
+          * If all tuples on page are visible to all, there's no
+          * need to visit that page.
+          *
+          * Note that we test the visibility map even if we're scanning all
+          * pages, to pin the visibility map page. We might set the bit there,
+          * and we don't want to do the I/O while we're holding the heap page
+          * locked.
+          */
+         all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer);
+         if (!scan_all && all_visible_according_to_vm)
+         {
+             vacrelstats->scanned_all = false;
+             continue;
+         }

          vacuum_delay_point();

***************
*** 525,530 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 561,570 ----

          freespace = PageGetHeapFreeSpace(page);

+         /* Update the visibility map */
+         if (PageIsAllVisible(page))
+             visibilitymap_set_opt(onerel, blkno, PageGetLSN(page), &vmbuffer);
+
          /* Remember the location of the last page with nonremovable tuples */
          if (hastup)
              vacrelstats->nonempty_pages = blkno + 1;
***************
*** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 600,611 ----
          vacrelstats->num_index_scans++;
      }

+     if (BufferIsValid(vmbuffer))
+     {
+         ReleaseBuffer(vmbuffer);
+         vmbuffer = InvalidBuffer;
+     }
+
      /* Do post-vacuum cleanup and statistics update for each index */
      for (i = 0; i < nindexes; i++)
          lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
***************
*** 622,627 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
--- 668,682 ----
          LockBufferForCleanup(buf);
          tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);

+         /*
+          * Before we let the page go, prune it. The primary reason is to
+          * update the visibility map in the common special case that we just
+          * vacuumed away the last tuple on the page that wasn't visible to
+          * everyone.
+          */
+         vacrelstats->tuples_deleted +=
+             heap_page_prune(onerel, buf, OldestXmin, false, false);
+
          /* Now that we've compacted the page, record its available space */
          page = BufferGetPage(buf);
          freespace = PageGetHeapFreeSpace(page);
***************
*** 686,692 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
          recptr = log_heap_clean(onerel, buffer,
                                  NULL, 0, NULL, 0,
                                  unused, uncnt,
!                                 false);
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
--- 741,747 ----
          recptr = log_heap_clean(onerel, buffer,
                                  NULL, 0, NULL, 0,
                                  unused, uncnt,
!                                 false, false);
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
***************
*** 829,834 **** lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
--- 884,890 ----
       * Okay to truncate.
       */
      FreeSpaceMapTruncateRel(onerel, new_rel_pages);
+     visibilitymap_truncate(onerel, new_rel_pages);
      RelationTruncate(onerel, new_rel_pages);

      /*
*** src/backend/utils/cache/relcache.c
--- src/backend/utils/cache/relcache.c
***************
*** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp)
--- 305,311 ----
      MemSet(relation, 0, sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1366,1371 **** formrdesc(const char *relationName, Oid relationReltype,
--- 1367,1373 ----
      relation = (Relation) palloc0(sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1654,1662 **** RelationReloadIndexInfo(Relation relation)
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /* Must reset targblock and fsm_nblocks_cache in case rel was truncated */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
--- 1656,1665 ----
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /* Must reset targblock and fsm_nblocks_cache and vm_nblocks_cache in case rel was truncated */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
***************
*** 1740,1745 **** RelationClearRelation(Relation relation, bool rebuild)
--- 1743,1749 ----
      {
          relation->rd_targblock = InvalidBlockNumber;
          relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+         relation->rd_vm_nblocks_cache = InvalidBlockNumber;
          if (relation->rd_rel->relkind == RELKIND_INDEX)
          {
              relation->rd_isvalid = false;        /* needs to be revalidated */
***************
*** 2335,2340 **** RelationBuildLocalRelation(const char *relname,
--- 2339,2345 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     rel->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      rel->rd_smgr = NULL;
***************
*** 3592,3597 **** load_relcache_init_file(void)
--- 3597,3603 ----
          rel->rd_smgr = NULL;
          rel->rd_targblock = InvalidBlockNumber;
          rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+         rel->rd_vm_nblocks_cache = InvalidBlockNumber;
          if (rel->rd_isnailed)
              rel->rd_refcnt = 1;
          else
*** src/include/access/heapam.h
--- src/include/access/heapam.h
***************
*** 125,131 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move);
  extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
                  TransactionId cutoff_xid,
                  OffsetNumber *offsets, int offcnt);
--- 125,131 ----
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move, bool all_visible);
  extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
                  TransactionId cutoff_xid,
                  OffsetNumber *offsets, int offcnt);
***************
*** 142,148 **** extern void heap_page_prune_execute(Buffer buffer,
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move);
  extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);

  /* in heap/syncscan.c */
--- 142,148 ----
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move, bool all_visible);
  extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);

  /* in heap/syncscan.c */
*** src/include/access/htup.h
--- src/include/access/htup.h
***************
*** 595,600 **** typedef struct xl_heaptid
--- 595,601 ----
  typedef struct xl_heap_delete
  {
      xl_heaptid    target;            /* deleted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
  } xl_heap_delete;

  #define SizeOfHeapDelete    (offsetof(xl_heap_delete, target) + SizeOfHeapTid)
***************
*** 620,635 **** typedef struct xl_heap_header
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, target) + SizeOfHeapTid)

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
--- 621,639 ----
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
+     bool new_all_visible_cleared; /* same for the page of newtid */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
***************
*** 660,665 **** typedef struct xl_heap_clean
--- 664,670 ----
      BlockNumber block;
      uint16        nredirected;
      uint16        ndead;
+     bool        all_visible_set;
      /* OFFSET NUMBERS FOLLOW */
  } xl_heap_clean;

*** /dev/null
--- src/include/access/visibilitymap.h
***************
*** 0 ****
--- 1,28 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.h
+  *      visibility map interface
+  *
+  *
+  * Portions Copyright (c) 2007, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef VISIBILITYMAP_H
+ #define VISIBILITYMAP_H
+
+ #include "utils/rel.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "access/xlogdefs.h"
+
+ extern void visibilitymap_set_opt(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr, Buffer *vmbuf);
+ extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk);
+ extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+ extern void visibilitymap_extend(Relation rel, BlockNumber heapblk);
+ extern void visibilitymap_truncate(Relation rel, BlockNumber heapblk);
+
+ #endif   /* VISIBILITYMAP_H */
*** src/include/storage/bufpage.h
--- src/include/storage/bufpage.h
***************
*** 152,159 **** typedef PageHeaderData *PageHeader;
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */

! #define PD_VALID_FLAG_BITS    0x0003        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
--- 152,161 ----
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */
+ #define PD_ALL_VISIBLE        0x0004        /* all tuples on page are visible to
+                                          * everyone */

! #define PD_VALID_FLAG_BITS    0x0007        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 336,341 **** typedef PageHeaderData *PageHeader;
--- 338,350 ----
  #define PageClearFull(page) \
      (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL)

+ #define PageIsAllVisible(page) \
+     (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE)
+ #define PageSetAllVisible(page) \
+     (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
+ #define PageClearAllVisible(page) \
+     (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+
  #define PageIsPrunable(page, oldestxmin) \
  ( \
      AssertMacro(TransactionIdIsNormal(oldestxmin)), \
*** src/include/storage/relfilenode.h
--- src/include/storage/relfilenode.h
***************
*** 24,37 **** typedef enum ForkNumber
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;

! #define MAX_FORKNUM        FSM_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
--- 24,38 ----
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM,
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
+     VISIBILITYMAP_FORKNUM
  } ForkNumber;

! #define MAX_FORKNUM        VISIBILITYMAP_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
*** src/include/utils/rel.h
--- src/include/utils/rel.h
***************
*** 195,202 **** typedef struct RelationData
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /* Cached last-seen size of the FSM */
      BlockNumber    rd_fsm_nblocks_cache;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */
--- 195,203 ----
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /* Cached last-seen size of the FSM and visibility map */
      BlockNumber    rd_fsm_nblocks_cache;
+     BlockNumber    rd_vm_nblocks_cache;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */

Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> To modify a page:
> If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared 
> first. The heap page is kept pinned, but not locked, while the 
> visibility map is updated. We want to avoid holding a lock across I/O, 
> even though the visibility map is likely to stay in cache. After the 
> visibility map has been updated, the page is exclusively locked and 
> modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing 
> the lock.

So after having determined that you will modify a page, you release the
ex lock on the buffer and then try to regain it later?  Seems like a
really bad idea from here.  What if it's no longer possible to do the
modification you intended?

> To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the 
> page, while you observe that all tuples on the page are visible to everyone.

That doesn't sound too good from a concurrency standpoint...

> That's how the patch works right now. However, there's a small 
> performance problem with the current approach: setting the 
> PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:

I'm more concerned about *clearing* the bit being WAL-logged.  That's
necessary for correctness.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Simon Riggs
Date:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> One option would be to just ignore that problem for now, and not 
> WAL-log.

Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby? 

But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.

Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should make the visibility check speed gain more generally
available.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>> One option would be to just ignore that problem for now, and not 
>> WAL-log.
> 
> Probably worth skipping for now, since it will cause patch conflicts if
> you do. Are there any other interactions with Hot Standby? 
> 
> But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
> to say "page is now all visible", without too much work.

Hmm. Even if a tuple is visible to everyone on the master, it's not 
necessarily yet visible to all the read-only transactions in the slave.

> Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
> the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
> to set the block flag (unlogged), but just not update VM. Separating the
> two concepts should make the visibility check speed gain more generally
> available.

Yes, that should be possible in theory. There's no version of 
ConditionalLockBuffer() for conditionally upgrading a shared lock to 
exclusive, but one could be added. I'm not sure if it would 
be safe to set the PD_ALL_VISIBLE flag while holding just a shared lock, 
though. If it is, then we could do just that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> ... I'm not sure if it would 
> be safe to set the PD_ALL_VISIBLE flag while holding just a shared lock, 
> though. If it is, then we could do just that.

Seems like it must be safe.  If you have shared lock on a page then no
one else could be modifying the page in a way that would falsify
PD_ALL_VISIBLE.  You might have several processes concurrently try to
set the bit but that is safe (same situation as for hint bits).
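
A minimal sketch of that hint-bit-style pattern, using the
PageIsAllVisible/PageSetAllVisible macros from the patch; the all_visible
local is assumed to have been computed while walking the page under the
shared content lock:

    if (all_visible && !PageIsAllVisible(page))
    {
        PageSetAllVisible(page);
        /* dirty the buffer the same way a hint-bit update does; no WAL */
        SetBufferCommitInfoNeedsSave(buffer);
    }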

The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?
        regards, tom lane


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> The harder part is propagating the bit to the visibility map, but I
> gather you intend to only allow VACUUM to do that?

Yep.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> To modify a page:
>> If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared 
>> first. The heap page is kept pinned, but not locked, while the 
>> visibility map is updated. We want to avoid holding a lock across I/O, 
>> even though the visibility map is likely to stay in cache. After the 
>> visibility map has been updated, the page is exclusively locked and 
>> modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing 
>> the lock.
> 
> So after having determined that you will modify a page, you release the
> ex lock on the buffer and then try to regain it later?  Seems like a
> really bad idea from here.  What if it's no longer possible to do the
> modification you intended?

In the case of insert/update, you have to find a new target page. I put the 
logic in RelationGetBufferForTuple(). In the case of delete and update (old 
page), the flag is checked and the bit cleared just after pinning the 
buffer, before doing anything else. (I note that that's not actually 
what the patch is doing for heap_update; will fix.)

If we give up on the strict requirement that the bit in the visibility 
map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set, 
then we could just update the visibility map after releasing the locks 
on the heap pages. I think I'll do that for now, for simplicity.

>> To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the 
>> page, while you observe that all tuples on the page are visible to everyone.
> 
> That doesn't sound too good from a concurrency standpoint...

Well, no, but it's only done in VACUUM. And pruning. I implemented it as 
a new loop that calls HeapTupleSatisfiesVacuum on each tuple, checking 
that xmin is old enough for live tuples, but come to think of 
it, we're already calling HeapTupleSatisfiesVacuum for every tuple on 
the page during VACUUM, so it should be possible to piggyback on that by 
restructuring the code.
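
Roughly, the per-tuple check folds into that existing loop like this
(a sketch only; the all_visible local and the OldestXmin comparison are
the parts added here):

    /* inside lazy_scan_heap()'s per-tuple loop; all_visible starts out true */
    switch (HeapTupleSatisfiesVacuum(tuple.t_data, OldestXmin, buf))
    {
        case HEAPTUPLE_LIVE:
            /* live, but only counts as all-visible if xmin is old enough */
            if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(tuple.t_data),
                                       OldestXmin))
                all_visible = false;
            break;
        default:
            /* dead, recently dead, or insert/delete still in progress */
            all_visible = false;
            break;
    }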

>> That's how the patch works right now. However, there's a small 
>> performance problem with the current approach: setting the 
>> PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
> 
> I'm more concerned about *clearing* the bit being WAL-logged.  That's
> necessary for correctness.

Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged. 
There's a new boolean field in xl_heap_insert/update/delete records 
indicating if the operation cleared the flag. On replay, if the flag was 
cleared, the bit in the visibility map is also cleared.
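
In sketch form, the delete path and its replay look something like this
(the field names are from the patch; how the redo side obtains a relation
handle for visibilitymap_clear() is glossed over here):

    /* heap_delete(), with the buffer exclusively locked; tp is the tuple */
    xlrec.target.node = relation->rd_node;
    xlrec.target.tid = tp.t_self;
    xlrec.all_visible_cleared = PageIsAllVisible(page);
    if (xlrec.all_visible_cleared)
        PageClearAllVisible(page);
    /* ... followed by the usual XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata) */

    /* heap_xlog_delete(), during replay; 'reln' stands for whatever
     * relation handle the redo code uses */
    if (xlrec->all_visible_cleared)
        visibilitymap_clear(reln,
                            ItemPointerGetBlockNumber(&(xlrec->target.tid)));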


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Simon Riggs
Date:
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> >> One option would be to just ignore that problem for now, and not 
> >> WAL-log.
> > 
> > Probably worth skipping for now, since it will cause patch conflicts if
> > you do. Are there any other interactions with Hot Standby? 
> > 
> > But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
> > to say "page is now all visible", without too much work.
> 
> Hmm. Even if a tuple is visible to everyone on the master, it's not 
> necessarily yet visible to all the read-only transactions in the slave.

Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.

> > Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
> > the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
> > to set the block flag (unlogged), but just not update VM. Separating the
> > two concepts should make the visibility check speed gain more generally
> > available.
> 
> Yes, that should be possible in theory. There's no version of 
> ConditionalLockBuffer() for conditionally upgrading a shared lock to 
> exclusive, but one could be added. I'm not sure if it would 
> be safe to set the PD_ALL_VISIBLE flag while holding just a shared lock, 
> though. If it is, then we could do just that.

To be honest, I'm more excited about your perf results for that than I
am about speeding up some VACUUMs.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>>>> One option would be to just ignore that problem for now, and not 
>>>> WAL-log.
>>> Probably worth skipping for now, since it will cause patch conflicts if
>>> you do. Are there any other interactions with Hot Standby? 
>>>
>>> But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
>>> to say "page is now all visible", without too much work.
>> Hmm. Even if a tuple is visible to everyone on the master, it's not 
>> necessarily yet visible to all the read-only transactions in the slave.
> 
> Never a problem. No query can ever see the rows removed by a cleanup
> record, enforced by the recovery system.

Yes, but there's a problem with recently inserted tuples:

1. A query begins in the slave, taking a snapshot with xmax = 100. So 
the effects of anything more recent should not be seen.
2. Transaction 100 inserts a tuple in the master, and commits
3. A vacuum comes along. There are no other transactions running in the 
master. Vacuum sees that all tuples on the page, including the one just 
inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page, sees 
that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
checks, and erroneously returns the inserted tuple.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Simon Riggs
Date:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:

> Lazy VACUUM only needs to visit pages that are '0' in the visibility 
> map. This allows partial vacuums, where we only need to scan those parts 
> of the table that need vacuuming, plus all indexes.

Just realised that this means we still have to visit each block of a
btree index with a cleanup lock.

That means the earlier idea of skipping the cleanup lock when the
page is not in memory makes a lot more sense with a partial vacuum.

1. Scan all blocks in memory for the index (and so, don't do this unless
the index is larger than a certain % of shared buffers), 
2. Start reading in new blocks until you've removed the correct number
of tuples
3. Work through the rest of the blocks checking that they are either in
shared buffers and we can get a cleanup lock, or they aren't in shared
buffers and so nobody has them pinned.

If you perform step (2) intelligently with regard to index correlation, you might
not need to do much I/O at all, if any.

(1) has a good hit ratio because mostly only active tables will be
vacuumed so are fairly likely to be in memory.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Visibility map, partial vacuums

From
Simon Riggs
Date:
On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote:

> Yes, but there's a problem with recently inserted tuples:
> 
> 1. A query begins in the slave, taking a snapshot with xmax = 100. So 
> the effects of anything more recent should not be seen.
> 2. Transaction 100 inserts a tuple in the master, and commits
> 3. A vacuum comes along. There are no other transactions running in the 
> master. Vacuum sees that all tuples on the page, including the one just 
> inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
> 4. The change is replicated to the slave.
> 5. The query in the slave that began at step 1 looks at the page, sees 
> that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
> checks, and erroneously returns the inserted tuple.

Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be
separately settable on the standby. Another reason why it should be
possible to set it without a VACUUM, since there will never be one on the standby.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>> Lazy VACUUM only needs to visit pages that are '0' in the visibility 
>> map. This allows partial vacuums, where we only need to scan those parts 
>> of the table that need vacuuming, plus all indexes.

> Just realised that this means we still have to visit each block of a
> btree index with a cleanup lock.

Yes, and your proposal cannot fix that.  Read "The Deletion Algorithm"
in nbtree/README, particularly the second paragraph.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Yes, but there's a problem with recently inserted tuples:

> 1. A query begins in the slave, taking a snapshot with xmax = 100. So 
> the effects of anything more recent should not be seen.
> 2. Transaction 100 inserts a tuple in the master, and commits
> 3. A vacuum comes along. There are no other transactions running in the 
> master. Vacuum sees that all tuples on the page, including the one just 
> inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
> 4. The change is replicated to the slave.
> 5. The query in the slave that began at step 1 looks at the page, sees 
> that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
> checks, and erroneously returns the inserted tuple.

But this is exactly equivalent to the problem with recently deleted
tuples: vacuum on the master might take actions that are premature with
respect to the status on the slave.  Whatever solution we adopt for that
will work for this too.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Simon Riggs
Date:
On Tue, 2008-10-28 at 13:58 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> >> Lazy VACUUM only needs to visit pages that are '0' in the visibility 
> >> map. This allows partial vacuums, where we only need to scan those parts 
> >> of the table that need vacuuming, plus all indexes.
> 
> > Just realised that this means we still have to visit each block of a
> > btree index with a cleanup lock.
> 
> Yes, and your proposal cannot fix that.  Read "The Deletion Algorithm"
> in nbtree/README, particularly the second paragraph.

Yes, understood. Please read the algorithm again. It does guarantee that
each block in the index has been checked to see if nobody is pinning it;
it just avoids performing I/O to prove that.

-- Simon Riggs           www.2ndQuadrant.com   PostgreSQL Training, Services and Support



Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Another thing that does need to be fixed is the way that the extension 
> and truncation of the visibility map is handled; that's broken in the 
> current patch. I started working on the patch a long time ago, before 
> the FSM rewrite was finished, and haven't gotten around to fixing that part 
> yet. We already solved it for the FSM, so we could just follow that 
> pattern. The way we solved truncation in the FSM was to write a separate 
> WAL record with the new heap size, but perhaps we want to revisit that 
> decision, instead of again adding new code to write a third WAL record 
> for truncation of the visibility map. smgrtruncate() writes a WAL record 
> of its own, if any full blocks are truncated away of the FSM, but we 
> needed a WAL record even if no full blocks are truncated from the FSM 
> file, because the "tail" of the last remaining FSM page, representing 
> the truncated-away heap pages, still needs to be cleared. The visibility map 
> has the same problem.
> 
> One proposal was to piggyback on the smgrtruncate() WAL-record, and call 
> FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a 
> modularity point of view; smgr.c shouldn't be calling higher-level 
> functions. But maybe it wouldn't be that bad, after all. Or, we could 
> remove WAL-logging from smgrtruncate() altogether, and move it to 
> RelationTruncate() or another higher-level function, and handle the 
> WAL-logging and replay there.

In preparation for the visibility map patch, I revisited the truncation 
issue, and hacked together a patch to piggyback the FSM truncation on 
the main fork smgr truncation WAL record. I moved the WAL-logging from 
smgrtruncate() to RelationTruncate(). There's a new flag to 
RelationTruncate indicating whether the FSM should be truncated too, and 
only one truncation WAL record is written for the operation.

That does seem cleaner than the current approach where the FSM writes a 
separate WAL record just to clear the bits of the last remaining FSM 
page. I had to move RelationTruncate() to smgr.c, because I don't think 
a function in bufmgr.c should be doing WAL-logging. However, 
RelationTruncate really doesn't belong in smgr.c either. Also, now that 
smgrtruncate doesn't write its own WAL record, it doesn't seem right for 
smgrcreate to be doing that either.

So, I think I'll take this one step forward, and move RelationTruncate() 
to a new higher level file, e.g. src/backend/catalog/storage.c, and also 
create a new RelationCreateStorage() function that calls smgrcreate(), 
and move the WAL-logging from smgrcreate() to RelationCreateStorage().

So, we'll have two functions in a new file:

/* Create physical storage for a relation. If 'fsm' is true, an FSM fork 
is also created */
RelationCreateStorage(Relation rel, bool fsm)
/* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM 
is also truncated */
RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
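
Under those signatures, a vacuum-side caller such as lazy_truncate_heap()
would collapse its separate FSM and heap truncation calls into one; a
sketch, where the bool argument is the proposed "also truncate the FSM"
flag:

    /* currently: two calls, and the FSM truncation writes its own WAL record */
    FreeSpaceMapTruncateRel(onerel, new_rel_pages);
    RelationTruncate(onerel, new_rel_pages);

    /* proposed: one call, one truncation WAL record */
    RelationTruncate(onerel, new_rel_pages, true);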

The next question is whether the "pending rel deletion" stuff in smgr.c 
should be moved to the new file too. It seems like it would belong there 
better. That would leave smgr.c as a very thin wrapper around md.c

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> The next question is whether the "pending rel deletion" stuff in smgr.c should
> be moved to the new file too. It seems like it would belong there better. That
> would leave smgr.c as a very thin wrapper around md.c

Well it's just a switch, albeit with only one case, so I wouldn't expect it to
be much more than a thin wrapper.

If we had more storage systems it might be clearer what features were common
to all of them and could be hoisted up from md.c. I'm not clear there are any
though.

Actually I wonder if an entirely in-memory storage system would help with the
"temporary table" problem on systems where the kernel is too aggressive about
flushing file buffers or metadata.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> So, I think I'll take this one step forward, and move RelationTruncate()
> to a new higher level file, e.g. src/backend/catalog/storage.c, and also
> create a new RelationCreateStorage() function that calls smgrcreate(),
> and move the WAL-logging from smgrcreate() to RelationCreateStorage().
>
> So, we'll have two functions in a new file:
>
> /* Create physical storage for a relation. If 'fsm' is true, an FSM fork
> is also created */
> RelationCreateStorage(Relation rel, bool fsm)
> /* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM
> is also truncated */
> RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
>
> The next question is whether the "pending rel deletion" stuff in smgr.c
> should be moved to the new file too. It seems like it would belong there
> better. That would leave smgr.c as a very thin wrapper around md.c

This new approach feels pretty good to me; attached is a patch to do
just that. Many of the functions formerly in smgr.c are now in
src/backend/catalog/storage.c, including all the WAL-logging and pending
rel deletion stuff. I kept their old names for now, though perhaps they
should be renamed now that they're above smgr level.

I also implemented Tom's idea of delaying creation of the FSM until it's
needed, not because of performance, but because it started to get quite
hairy to keep track of which relations should have an FSM and which
shouldn't. Creation of the FSM fork is now treated more like extending a
relation, as a non-WAL-logged operation, and it's up to freespace.c to
create the file when it's needed. There's no operation to explicitly
delete an individual fork of a relation: RelationCreateStorage only
creates the main fork, RelationDropStorage drops all forks, and
RelationTruncate truncates the FSM if and only if the FSM fork exists.
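
A minimal sketch of that on-demand creation, assuming the three-argument
smgrcreate() from this patch; the helper name fsm_ensure_fork() is made
up here for illustration:

    /* freespace.c would do something like this before touching the FSM fork */
    static void
    fsm_ensure_fork(Relation rel)
    {
        RelationOpenSmgr(rel);
        if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
            smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);   /* isRedo = false; no WAL record */
    }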

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/gin/gininsert.c
--- src/backend/access/gin/gininsert.c
***************
*** 284,292 **** ginbuild(PG_FUNCTION_ARGS)
          elog(ERROR, "index \"%s\" already contains data",
               RelationGetRelationName(index));

-     /* Initialize FSM */
-     InitIndexFreeSpaceMap(index);
-
      initGinState(&buildstate.ginstate, index);

      /* initialize the root page */
--- 284,289 ----
*** src/backend/access/gin/ginvacuum.c
--- src/backend/access/gin/ginvacuum.c
***************
*** 16,21 ****
--- 16,22 ----

  #include "access/genam.h"
  #include "access/gin.h"
+ #include "catalog/storage.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
***************
*** 757,763 **** ginvacuumcleanup(PG_FUNCTION_ARGS)
      if (info->vacuum_full && lastBlock > lastFilledBlock)
      {
          /* try to truncate index */
-         FreeSpaceMapTruncateRel(index, lastFilledBlock + 1);
          RelationTruncate(index, lastFilledBlock + 1);

          stats->pages_removed = lastBlock - lastFilledBlock;
--- 758,763 ----
*** src/backend/access/gist/gist.c
--- src/backend/access/gist/gist.c
***************
*** 103,111 **** gistbuild(PG_FUNCTION_ARGS)
          elog(ERROR, "index \"%s\" already contains data",
               RelationGetRelationName(index));

-     /* Initialize FSM */
-     InitIndexFreeSpaceMap(index);
-
      /* no locking is needed */
      initGISTstate(&buildstate.giststate, index);

--- 103,108 ----
*** src/backend/access/gist/gistvacuum.c
--- src/backend/access/gist/gistvacuum.c
***************
*** 16,21 ****
--- 16,22 ----

  #include "access/genam.h"
  #include "access/gist_private.h"
+ #include "catalog/storage.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
***************
*** 603,609 **** gistvacuumcleanup(PG_FUNCTION_ARGS)

      if (info->vacuum_full && lastFilledBlock < lastBlock)
      {                            /* try to truncate index */
-         FreeSpaceMapTruncateRel(rel, lastFilledBlock + 1);
          RelationTruncate(rel, lastFilledBlock + 1);

          stats->std.pages_removed = lastBlock - lastFilledBlock;
--- 604,609 ----
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 4863,4870 **** heap_sync(Relation rel)
      /* FlushRelationBuffers will have opened rd_smgr */
      smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);

!     /* sync FSM as well */
!     smgrimmedsync(rel->rd_smgr, FSM_FORKNUM);

      /* toast heap, if any */
      if (OidIsValid(rel->rd_rel->reltoastrelid))
--- 4863,4869 ----
      /* FlushRelationBuffers will have opened rd_smgr */
      smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);

!     /* FSM is not critical, don't bother syncing it */

      /* toast heap, if any */
      if (OidIsValid(rel->rd_rel->reltoastrelid))
***************
*** 4874,4880 **** heap_sync(Relation rel)
          toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
          FlushRelationBuffers(toastrel);
          smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-         smgrimmedsync(toastrel->rd_smgr, FSM_FORKNUM);
          heap_close(toastrel, AccessShareLock);
      }
  }
--- 4873,4878 ----
*** src/backend/access/nbtree/nbtree.c
--- src/backend/access/nbtree/nbtree.c
***************
*** 22,27 ****
--- 22,28 ----
  #include "access/nbtree.h"
  #include "access/relscan.h"
  #include "catalog/index.h"
+ #include "catalog/storage.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
***************
*** 109,117 **** btbuild(PG_FUNCTION_ARGS)
          elog(ERROR, "index \"%s\" already contains data",
               RelationGetRelationName(index));

-     /* Initialize FSM */
-     InitIndexFreeSpaceMap(index);
-
      buildstate.spool = _bt_spoolinit(index, indexInfo->ii_Unique, false);

      /*
--- 110,115 ----
***************
*** 696,702 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
          /*
           * Okay to truncate.
           */
-         FreeSpaceMapTruncateRel(rel, new_pages);
          RelationTruncate(rel, new_pages);

          /* update statistics */
--- 694,699 ----
*** src/backend/access/transam/rmgr.c
--- src/backend/access/transam/rmgr.c
***************
*** 31,37 **** const RmgrData RmgrTable[RM_MAX_ID + 1] = {
      {"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
      {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
      {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
!     {"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL},
      {"Reserved 8", NULL, NULL, NULL, NULL, NULL},
      {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
      {"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
--- 31,37 ----
      {"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
      {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
      {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
!     {"Reserved 7", NULL, NULL, NULL, NULL, NULL},
      {"Reserved 8", NULL, NULL, NULL, NULL, NULL},
      {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
      {"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
*** src/backend/access/transam/twophase.c
--- src/backend/access/transam/twophase.c
***************
*** 48,54 ****
--- 48,56 ----
  #include "access/twophase.h"
  #include "access/twophase_rmgr.h"
  #include "access/xact.h"
+ #include "access/xlogutils.h"
  #include "catalog/pg_type.h"
+ #include "catalog/storage.h"
  #include "funcapi.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
***************
*** 141,152 **** static void RecordTransactionCommitPrepared(TransactionId xid,
                                  int nchildren,
                                  TransactionId *children,
                                  int nrels,
!                                 RelFileFork *rels);
  static void RecordTransactionAbortPrepared(TransactionId xid,
                                 int nchildren,
                                 TransactionId *children,
                                 int nrels,
!                                RelFileFork *rels);
  static void ProcessRecords(char *bufptr, TransactionId xid,
                 const TwoPhaseCallback callbacks[]);

--- 143,154 ----
                                  int nchildren,
                                  TransactionId *children,
                                  int nrels,
!                                 RelFileNode *rels);
  static void RecordTransactionAbortPrepared(TransactionId xid,
                                 int nchildren,
                                 TransactionId *children,
                                 int nrels,
!                                RelFileNode *rels);
  static void ProcessRecords(char *bufptr, TransactionId xid,
                 const TwoPhaseCallback callbacks[]);

***************
*** 793,800 **** StartPrepare(GlobalTransaction gxact)
      TransactionId xid = gxact->proc.xid;
      TwoPhaseFileHeader hdr;
      TransactionId *children;
!     RelFileFork *commitrels;
!     RelFileFork *abortrels;

      /* Initialize linked list */
      records.head = palloc0(sizeof(XLogRecData));
--- 795,802 ----
      TransactionId xid = gxact->proc.xid;
      TwoPhaseFileHeader hdr;
      TransactionId *children;
!     RelFileNode *commitrels;
!     RelFileNode *abortrels;

      /* Initialize linked list */
      records.head = palloc0(sizeof(XLogRecData));
***************
*** 832,843 **** StartPrepare(GlobalTransaction gxact)
      }
      if (hdr.ncommitrels > 0)
      {
!         save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileFork));
          pfree(commitrels);
      }
      if (hdr.nabortrels > 0)
      {
!         save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileFork));
          pfree(abortrels);
      }
  }
--- 834,845 ----
      }
      if (hdr.ncommitrels > 0)
      {
!         save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileNode));
          pfree(commitrels);
      }
      if (hdr.nabortrels > 0)
      {
!         save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode));
          pfree(abortrels);
      }
  }
***************
*** 1140,1147 **** FinishPreparedTransaction(const char *gid, bool isCommit)
      TwoPhaseFileHeader *hdr;
      TransactionId latestXid;
      TransactionId *children;
!     RelFileFork *commitrels;
!     RelFileFork *abortrels;
      int            i;

      /*
--- 1142,1151 ----
      TwoPhaseFileHeader *hdr;
      TransactionId latestXid;
      TransactionId *children;
!     RelFileNode *commitrels;
!     RelFileNode *abortrels;
!     RelFileNode *delrels;
!     int            ndelrels;
      int            i;

      /*
***************
*** 1169,1178 **** FinishPreparedTransaction(const char *gid, bool isCommit)
      bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
      children = (TransactionId *) bufptr;
      bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
!     commitrels = (RelFileFork *) bufptr;
!     bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileFork));
!     abortrels = (RelFileFork *) bufptr;
!     bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileFork));

      /* compute latestXid among all children */
      latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
--- 1173,1182 ----
      bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
      children = (TransactionId *) bufptr;
      bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
!     commitrels = (RelFileNode *) bufptr;
!     bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
!     abortrels = (RelFileNode *) bufptr;
!     bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));

      /* compute latestXid among all children */
      latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
***************
*** 1214,1234 **** FinishPreparedTransaction(const char *gid, bool isCommit)
       */
      if (isCommit)
      {
!         for (i = 0; i < hdr->ncommitrels; i++)
!         {
!             SMgrRelation srel = smgropen(commitrels[i].rnode);
!             smgrdounlink(srel, commitrels[i].forknum, false, false);
!             smgrclose(srel);
!         }
      }
      else
      {
!         for (i = 0; i < hdr->nabortrels; i++)
          {
!             SMgrRelation srel = smgropen(abortrels[i].rnode);
!             smgrdounlink(srel, abortrels[i].forknum, false, false);
!             smgrclose(srel);
          }
      }

      /* And now do the callbacks */
--- 1218,1245 ----
       */
      if (isCommit)
      {
!         delrels = commitrels;
!         ndelrels = hdr->ncommitrels;
      }
      else
      {
!         delrels = abortrels;
!         ndelrels = hdr->nabortrels;
!     }
!     for (i = 0; i < ndelrels; i++)
!     {
!         SMgrRelation srel = smgropen(delrels[i]);
!         ForkNumber    fork;
!
!         for (fork = 0; fork <= MAX_FORKNUM; fork++)
          {
!             if (smgrexists(srel, fork))
!             {
!                 XLogDropRelation(delrels[i], fork);
!                 smgrdounlink(srel, fork, false, true);
!             }
          }
+         smgrclose(srel);
      }

      /* And now do the callbacks */
***************
*** 1639,1646 **** RecoverPreparedTransactions(void)
              bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
              subxids = (TransactionId *) bufptr;
              bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
!             bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileFork));
!             bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileFork));

              /*
               * Reconstruct subtrans state for the transaction --- needed
--- 1650,1657 ----
              bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
              subxids = (TransactionId *) bufptr;
              bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
!             bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
!             bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));

              /*
               * Reconstruct subtrans state for the transaction --- needed
***************
*** 1693,1699 **** RecordTransactionCommitPrepared(TransactionId xid,
                                  int nchildren,
                                  TransactionId *children,
                                  int nrels,
!                                 RelFileFork *rels)
  {
      XLogRecData rdata[3];
      int            lastrdata = 0;
--- 1704,1710 ----
                                  int nchildren,
                                  TransactionId *children,
                                  int nrels,
!                                 RelFileNode *rels)
  {
      XLogRecData rdata[3];
      int            lastrdata = 0;
***************
*** 1718,1724 **** RecordTransactionCommitPrepared(TransactionId xid,
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileFork);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
--- 1729,1735 ----
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileNode);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
***************
*** 1766,1772 **** RecordTransactionAbortPrepared(TransactionId xid,
                                 int nchildren,
                                 TransactionId *children,
                                 int nrels,
!                                RelFileFork *rels)
  {
      XLogRecData rdata[3];
      int            lastrdata = 0;
--- 1777,1783 ----
                                 int nchildren,
                                 TransactionId *children,
                                 int nrels,
!                                RelFileNode *rels)
  {
      XLogRecData rdata[3];
      int            lastrdata = 0;
***************
*** 1796,1802 **** RecordTransactionAbortPrepared(TransactionId xid,
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileFork);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
--- 1807,1813 ----
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileNode);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
*** src/backend/access/transam/xact.c
--- src/backend/access/transam/xact.c
***************
*** 28,33 ****
--- 28,34 ----
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
  #include "catalog/namespace.h"
+ #include "catalog/storage.h"
  #include "commands/async.h"
  #include "commands/tablecmds.h"
  #include "commands/trigger.h"
***************
*** 819,825 **** RecordTransactionCommit(void)
      bool        markXidCommitted = TransactionIdIsValid(xid);
      TransactionId latestXid = InvalidTransactionId;
      int            nrels;
!     RelFileFork *rels;
      bool        haveNonTemp;
      int            nchildren;
      TransactionId *children;
--- 820,826 ----
      bool        markXidCommitted = TransactionIdIsValid(xid);
      TransactionId latestXid = InvalidTransactionId;
      int            nrels;
!     RelFileNode *rels;
      bool        haveNonTemp;
      int            nchildren;
      TransactionId *children;
***************
*** 900,906 **** RecordTransactionCommit(void)
          {
              rdata[0].next = &(rdata[1]);
              rdata[1].data = (char *) rels;
!             rdata[1].len = nrels * sizeof(RelFileFork);
              rdata[1].buffer = InvalidBuffer;
              lastrdata = 1;
          }
--- 901,907 ----
          {
              rdata[0].next = &(rdata[1]);
              rdata[1].data = (char *) rels;
!             rdata[1].len = nrels * sizeof(RelFileNode);
              rdata[1].buffer = InvalidBuffer;
              lastrdata = 1;
          }
***************
*** 1165,1171 **** RecordTransactionAbort(bool isSubXact)
      TransactionId xid = GetCurrentTransactionIdIfAny();
      TransactionId latestXid;
      int            nrels;
!     RelFileFork *rels;
      int            nchildren;
      TransactionId *children;
      XLogRecData rdata[3];
--- 1166,1172 ----
      TransactionId xid = GetCurrentTransactionIdIfAny();
      TransactionId latestXid;
      int            nrels;
!     RelFileNode *rels;
      int            nchildren;
      TransactionId *children;
      XLogRecData rdata[3];
***************
*** 1226,1232 **** RecordTransactionAbort(bool isSubXact)
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileFork);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
--- 1227,1233 ----
      {
          rdata[0].next = &(rdata[1]);
          rdata[1].data = (char *) rels;
!         rdata[1].len = nrels * sizeof(RelFileNode);
          rdata[1].buffer = InvalidBuffer;
          lastrdata = 1;
      }
***************
*** 2078,2084 **** AbortTransaction(void)
      AtEOXact_xml();
      AtEOXact_on_commit_actions(false);
      AtEOXact_Namespace(false);
-     smgrabort();
      AtEOXact_Files();
      AtEOXact_ComboCid();
      AtEOXact_HashTables(false);
--- 2079,2084 ----
***************
*** 4239,4250 **** xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid)
      /* Make sure files supposed to be dropped are dropped */
      for (i = 0; i < xlrec->nrels; i++)
      {
!         SMgrRelation srel;

!         XLogDropRelation(xlrec->xnodes[i].rnode, xlrec->xnodes[i].forknum);
!
!         srel = smgropen(xlrec->xnodes[i].rnode);
!         smgrdounlink(srel, xlrec->xnodes[i].forknum, false, true);
          smgrclose(srel);
      }
  }
--- 4239,4255 ----
      /* Make sure files supposed to be dropped are dropped */
      for (i = 0; i < xlrec->nrels; i++)
      {
!         SMgrRelation srel = smgropen(xlrec->xnodes[i]);
!         ForkNumber fork;

!         for (fork = 0; fork <= MAX_FORKNUM; fork++)
!         {
!             if (smgrexists(srel, fork))
!             {
!                 XLogDropRelation(xlrec->xnodes[i], fork);
!                 smgrdounlink(srel, fork, false, true);
!             }
!         }
          smgrclose(srel);
      }
  }
***************
*** 4277,4288 **** xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
      /* Make sure files supposed to be dropped are dropped */
      for (i = 0; i < xlrec->nrels; i++)
      {
!         SMgrRelation srel;

!         XLogDropRelation(xlrec->xnodes[i].rnode, xlrec->xnodes[i].forknum);
!
!         srel = smgropen(xlrec->xnodes[i].rnode);
!         smgrdounlink(srel, xlrec->xnodes[i].forknum, false, true);
          smgrclose(srel);
      }
  }
--- 4282,4298 ----
      /* Make sure files supposed to be dropped are dropped */
      for (i = 0; i < xlrec->nrels; i++)
      {
!         SMgrRelation srel = smgropen(xlrec->xnodes[i]);
!         ForkNumber fork;

!         for (fork = 0; fork <= MAX_FORKNUM; fork++)
!         {
!             if (smgrexists(srel, fork))
!             {
!                 XLogDropRelation(xlrec->xnodes[i], fork);
!                 smgrdounlink(srel, fork, false, true);
!             }
!         }
          smgrclose(srel);
      }
  }
***************
*** 4339,4346 **** xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
          appendStringInfo(buf, "; rels:");
          for (i = 0; i < xlrec->nrels; i++)
          {
!             char *path = relpath(xlrec->xnodes[i].rnode,
!                                  xlrec->xnodes[i].forknum);
              appendStringInfo(buf, " %s", path);
              pfree(path);
          }
--- 4349,4355 ----
          appendStringInfo(buf, "; rels:");
          for (i = 0; i < xlrec->nrels; i++)
          {
!             char *path = relpath(xlrec->xnodes[i], MAIN_FORKNUM);
              appendStringInfo(buf, " %s", path);
              pfree(path);
          }
***************
*** 4367,4374 **** xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec)
          appendStringInfo(buf, "; rels:");
          for (i = 0; i < xlrec->nrels; i++)
          {
!             char *path = relpath(xlrec->xnodes[i].rnode,
!                                  xlrec->xnodes[i].forknum);
              appendStringInfo(buf, " %s", path);
              pfree(path);
          }
--- 4376,4382 ----
          appendStringInfo(buf, "; rels:");
          for (i = 0; i < xlrec->nrels; i++)
          {
!             char *path = relpath(xlrec->xnodes[i], MAIN_FORKNUM);
              appendStringInfo(buf, " %s", path);
              pfree(path);
          }
*** src/backend/access/transam/xlogutils.c
--- src/backend/access/transam/xlogutils.c
***************
*** 273,279 **** XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
       * filesystem loses an inode during a crash.  Better to write the data
       * until we are actually told to delete the file.)
       */
!     smgrcreate(smgr, forknum, false, true);

      lastblock = smgrnblocks(smgr, forknum);

--- 273,279 ----
       * filesystem loses an inode during a crash.  Better to write the data
       * until we are actually told to delete the file.)
       */
!     smgrcreate(smgr, forknum, true);

      lastblock = smgrnblocks(smgr, forknum);

*** src/backend/catalog/Makefile
--- src/backend/catalog/Makefile
***************
*** 13,19 **** include $(top_builddir)/src/Makefile.global
  OBJS = catalog.o dependency.o heap.o index.o indexing.o namespace.o aclchk.o \
         pg_aggregate.o pg_constraint.o pg_conversion.o pg_depend.o pg_enum.o \
         pg_largeobject.o pg_namespace.o pg_operator.o pg_proc.o pg_shdepend.o \
!        pg_type.o toasting.o

  BKIFILES = postgres.bki postgres.description postgres.shdescription

--- 13,19 ----
  OBJS = catalog.o dependency.o heap.o index.o indexing.o namespace.o aclchk.o \
         pg_aggregate.o pg_constraint.o pg_conversion.o pg_depend.o pg_enum.o \
         pg_largeobject.o pg_namespace.o pg_operator.o pg_proc.o pg_shdepend.o \
!        pg_type.o storage.o toasting.o

  BKIFILES = postgres.bki postgres.description postgres.shdescription

*** src/backend/catalog/heap.c
--- src/backend/catalog/heap.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "catalog/pg_tablespace.h"
  #include "catalog/pg_type.h"
  #include "catalog/pg_type_fn.h"
+ #include "catalog/storage.h"
  #include "commands/tablecmds.h"
  #include "commands/typecmds.h"
  #include "miscadmin.h"
***************
*** 295,317 **** heap_create(const char *relname,
      /*
       * Have the storage manager create the relation's disk file, if needed.
       *
!      * We create storage for the main fork here, and also for the FSM for a
!      * heap or toast relation. The caller is responsible for creating any
!      * additional forks if needed.
       */
      if (create_storage)
!     {
!         Assert(rel->rd_smgr == NULL);
!         RelationOpenSmgr(rel);
!         smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false);
!
!         /*
!          * For a real heap, create FSM fork as well. Indexams are
!          * responsible for creating any extra forks themselves.
!          */
!         if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE)
!             smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false);
!     }

      return rel;
  }
--- 296,306 ----
      /*
       * Have the storage manager create the relation's disk file, if needed.
       *
!      * We only create the main fork here; the other forks are created
!      * on demand.
       */
      if (create_storage)
!         RelationCreateStorage(rel->rd_node, rel->rd_istemp);

      return rel;
  }
***************
*** 1426,1438 **** heap_drop_with_catalog(Oid relid)
      if (rel->rd_rel->relkind != RELKIND_VIEW &&
          rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE)
      {
!         ForkNumber forknum;
!
!         RelationOpenSmgr(rel);
!         for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
!             if (smgrexists(rel->rd_smgr, forknum))
!                 smgrscheduleunlink(rel->rd_smgr, forknum, rel->rd_istemp);
!         RelationCloseSmgr(rel);
      }

      /*
--- 1415,1421 ----
      if (rel->rd_rel->relkind != RELKIND_VIEW &&
          rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE)
      {
!         RelationDropStorage(rel);
      }

      /*
***************
*** 2348,2354 **** heap_truncate(List *relids)
          Relation    rel = lfirst(cell);

          /* Truncate the FSM and actual file (and discard buffers) */
-         FreeSpaceMapTruncateRel(rel, 0);
          RelationTruncate(rel, 0);

          /* If this relation has indexes, truncate the indexes too */
--- 2331,2336 ----
*** src/backend/catalog/index.c
--- src/backend/catalog/index.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "catalog/pg_opclass.h"
  #include "catalog/pg_tablespace.h"
  #include "catalog/pg_type.h"
+ #include "catalog/storage.h"
  #include "commands/tablecmds.h"
  #include "executor/executor.h"
  #include "miscadmin.h"
***************
*** 897,903 **** index_drop(Oid indexId)
      Relation    indexRelation;
      HeapTuple    tuple;
      bool        hasexprs;
-     ForkNumber    forknum;

      /*
       * To drop an index safely, we must grab exclusive lock on its parent
--- 898,903 ----
***************
*** 918,929 **** index_drop(Oid indexId)
      /*
       * Schedule physical removal of the files
       */
!     RelationOpenSmgr(userIndexRelation);
!     for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
!         if (smgrexists(userIndexRelation->rd_smgr, forknum))
!             smgrscheduleunlink(userIndexRelation->rd_smgr, forknum,
!                                userIndexRelation->rd_istemp);
!     RelationCloseSmgr(userIndexRelation);

      /*
       * Close and flush the index's relcache entry, to ensure relcache doesn't
--- 918,924 ----
      /*
       * Schedule physical removal of the files
       */
!     RelationDropStorage(userIndexRelation);

      /*
       * Close and flush the index's relcache entry, to ensure relcache doesn't
***************
*** 1283,1293 **** setNewRelfilenode(Relation relation, TransactionId freezeXid)
  {
      Oid            newrelfilenode;
      RelFileNode newrnode;
-     SMgrRelation srel;
      Relation    pg_class;
      HeapTuple    tuple;
      Form_pg_class rd_rel;
-     ForkNumber    i;

      /* Can't change relfilenode for nailed tables (indexes ok though) */
      Assert(!relation->rd_isnailed ||
--- 1278,1286 ----
***************
*** 1318,1325 **** setNewRelfilenode(Relation relation, TransactionId freezeXid)
               RelationGetRelid(relation));
      rd_rel = (Form_pg_class) GETSTRUCT(tuple);

-     RelationOpenSmgr(relation);
-
      /*
       * ... and create storage for corresponding forks in the new relfilenode.
       *
--- 1311,1316 ----
***************
*** 1327,1354 **** setNewRelfilenode(Relation relation, TransactionId freezeXid)
       */
      newrnode = relation->rd_node;
      newrnode.relNode = newrelfilenode;
-     srel = smgropen(newrnode);
-
-     /* Create the main fork, like heap_create() does */
-     smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false);

      /*
!      * For a heap, create FSM fork as well. Indexams are responsible for
!      * creating any extra forks themselves.
       */
!     if (relation->rd_rel->relkind == RELKIND_RELATION ||
!         relation->rd_rel->relkind == RELKIND_TOASTVALUE)
!         smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false);
!
!     /* schedule unlinking old files */
!     for (i = 0; i <= MAX_FORKNUM; i++)
!     {
!         if (smgrexists(relation->rd_smgr, i))
!             smgrscheduleunlink(relation->rd_smgr, i, relation->rd_istemp);
!     }
!
!     smgrclose(srel);
!     RelationCloseSmgr(relation);

      /* update the pg_class row */
      rd_rel->relfilenode = newrelfilenode;
--- 1318,1330 ----
       */
      newrnode = relation->rd_node;
      newrnode.relNode = newrelfilenode;

      /*
!      * Create the main fork, like heap_create() does, and drop the old
!      * storage.
       */
!     RelationCreateStorage(newrnode, relation->rd_istemp);
!     RelationDropStorage(relation);

      /* update the pg_class row */
      rd_rel->relfilenode = newrelfilenode;
***************
*** 2326,2333 **** reindex_index(Oid indexId)
          if (inplace)
          {
              /*
!              * Truncate the actual file (and discard buffers). The indexam
!              * is responsible for truncating the FSM, if applicable
               */
              RelationTruncate(iRel, 0);
          }
--- 2302,2308 ----
          if (inplace)
          {
              /*
!              * Truncate the actual file (and discard buffers).
               */
              RelationTruncate(iRel, 0);
          }
*** /dev/null
--- src/backend/catalog/storage.c
***************
*** 0 ****
--- 1,460 ----
+ /*-------------------------------------------------------------------------
+  *
+  * storage.c
+  *      code to create and destroy physical storage for relations
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ #include "postgres.h"
+
+ #include "access/xact.h"
+ #include "access/xlogutils.h"
+ #include "catalog/catalog.h"
+ #include "catalog/storage.h"
+ #include "storage/freespace.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+ #include "utils/rel.h"
+
+ /*
+  * We keep a list of all relations (represented as RelFileNode values)
+  * that have been created or deleted in the current transaction.  When
+  * a relation is created, we create the physical file immediately, but
+  * remember it so that we can delete the file again if the current
+  * transaction is aborted.    Conversely, a deletion request is NOT
+  * executed immediately, but is just entered in the list.  When and if
+  * the transaction commits, we can delete the physical file.
+  *
+  * To handle subtransactions, every entry is marked with its transaction
+  * nesting level.  At subtransaction commit, we reassign the subtransaction's
+  * entries to the parent nesting level.  At subtransaction abort, we can
+  * immediately execute the abort-time actions for all entries of the current
+  * nesting level.
+  *
+  * NOTE: the list is kept in TopMemoryContext to be sure it won't disappear
+  * unbetimes.  It'd probably be OK to keep it in TopTransactionContext,
+  * but I'm being paranoid.
+  */
+
+ typedef struct PendingRelDelete
+ {
+     RelFileNode relnode;        /* relation that may need to be deleted */
+     bool        isTemp;            /* is it a temporary relation? */
+     bool        atCommit;        /* T=delete at commit; F=delete at abort */
+     int            nestLevel;        /* xact nesting level of request */
+     struct PendingRelDelete *next;        /* linked-list link */
+ } PendingRelDelete;
+
+ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+
+ /*
+  * Declarations for smgr-related XLOG records
+  *
+  * Note: we log file creation and truncation here, but logging of deletion
+  * actions is handled by xact.c, because it is part of transaction commit.
+  */
+
+ /* XLOG gives us high 4 bits */
+ #define XLOG_SMGR_CREATE    0x10
+ #define XLOG_SMGR_TRUNCATE    0x20
+
+ typedef struct xl_smgr_create
+ {
+     RelFileNode rnode;
+ } xl_smgr_create;
+
+ typedef struct xl_smgr_truncate
+ {
+     BlockNumber blkno;
+     RelFileNode rnode;
+ } xl_smgr_truncate;
+
+
+ /*
+  * RelationCreateStorage
+  *        Create physical storage for a relation.
+  *
+  * Create the underlying disk file storage for the relation. This only
+  * creates the main fork; additional forks are created lazily by the
+  * modules that need them.
+  *
+  * This function is transactional. The creation is WAL-logged, and if the
+  * transaction aborts later on, the storage will be destroyed.
+  */
+ void
+ RelationCreateStorage(RelFileNode rnode, bool istemp)
+ {
+     PendingRelDelete *pending;
+
+     XLogRecPtr    lsn;
+     XLogRecData rdata;
+     xl_smgr_create xlrec;
+     SMgrRelation srel;
+
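+     /* Create the main fork right away (isRedo = false, we're not in WAL replay) */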
+     srel = smgropen(rnode);
+     smgrcreate(srel, MAIN_FORKNUM, false);
+
+     smgrclose(srel);
+
+     if (!istemp)
+     {
+         /*
+          * Make an XLOG entry showing the file creation.  If we abort, the file
+          * will be dropped at abort time.
+          */
+         xlrec.rnode = rnode;
+
+         rdata.data = (char *) &xlrec;
+         rdata.len = sizeof(xlrec);
+         rdata.buffer = InvalidBuffer;
+         rdata.next = NULL;
+
+         lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE, &rdata);
+     }
+
+     /* Add the relation to the list of stuff to delete at abort */
+     pending = (PendingRelDelete *)
+         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+     pending->relnode = rnode;
+     pending->isTemp = istemp;
+     pending->atCommit = false;    /* delete if abort */
+     pending->nestLevel = GetCurrentTransactionNestLevel();
+     pending->next = pendingDeletes;
+     pendingDeletes = pending;
+ }
+
+ /*
+  * RelationDropStorage
+  *        Schedule unlinking of physical storage at transaction commit.
+  */
+ void
+ RelationDropStorage(Relation rel)
+ {
+     PendingRelDelete *pending;
+
+     /* Add the relation to the list of stuff to delete at commit */
+     pending = (PendingRelDelete *)
+         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+     pending->relnode = rel->rd_node;
+     pending->isTemp = rel->rd_istemp;
+     pending->atCommit = true;    /* delete if commit */
+     pending->nestLevel = GetCurrentTransactionNestLevel();
+     pending->next = pendingDeletes;
+     pendingDeletes = pending;
+
+     /*
+      * NOTE: if the relation was created in this transaction, it will now be
+      * present in the pending-delete list twice, once with atCommit true and
+      * once with atCommit false.  Hence, it will be physically deleted at end
+      * of xact in either case (and the other entry will be ignored by
+      * smgrDoPendingDeletes, so no error will occur).  We could instead remove
+      * the existing list entry and delete the physical file immediately, but
+      * for now I'll keep the logic simple.
+      */
+
+     RelationCloseSmgr(rel);
+ }
+
+ /*
+  * RelationTruncate
+  *        Physically truncate a relation to the specified number of blocks.
+  *
+  * This includes getting rid of any buffers for the blocks that are to be
+  * dropped. The relation's FSM fork, if it exists, is truncated as well.
+  */
+ void
+ RelationTruncate(Relation rel, BlockNumber nblocks)
+ {
+     bool fsm;
+
+     /* Open it at the smgr level if not already done */
+     RelationOpenSmgr(rel);
+
+     /* Make sure rd_targblock isn't pointing somewhere past end */
+     rel->rd_targblock = InvalidBlockNumber;
+
+     /* Truncate the FSM too if it exists. */
+     fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
+     if (fsm)
+         FreeSpaceMapTruncateRel(rel, nblocks);
+
+     /*
+      * We WAL-log the truncation before actually truncating, which means
+      * trouble if the truncation fails. If we then crash, the WAL replay
+      * likely isn't going to succeed in the truncation either, causing a
+      * PANIC. It's tempting to put a critical section here, but that cure
+      * would be worse than the disease: it would turn a usually harmless
+      * failure to truncate, which could spell trouble at WAL replay, into
+      * a certain PANIC.
+      */
+     if (!rel->rd_istemp)
+     {
+         /*
+          * Make an XLOG entry showing the file truncation.
+          */
+         XLogRecPtr    lsn;
+         XLogRecData rdata;
+         xl_smgr_truncate xlrec;
+
+         xlrec.blkno = nblocks;
+         xlrec.rnode = rel->rd_node;
+
+         rdata.data = (char *) &xlrec;
+         rdata.len = sizeof(xlrec);
+         rdata.buffer = InvalidBuffer;
+         rdata.next = NULL;
+
+         lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata);
+
+         /*
+          * Flush, because otherwise the truncation of the main relation
+          * might hit the disk before the WAL record of truncating the
+          * FSM is flushed. If we crashed during that window, we'd be
+          * left with a truncated heap, without a truncated FSM.
+          */
+         if (fsm)
+             XLogFlush(lsn);
+     }
+
+     /* Do the real work */
+     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks, rel->rd_istemp);
+ }
+
+ /*
+  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+  *
+  * This also runs when aborting a subxact; we want to clean up a failed
+  * subxact immediately.
+  */
+ void
+ smgrDoPendingDeletes(bool isCommit)
+ {
+     int            nestLevel = GetCurrentTransactionNestLevel();
+     PendingRelDelete *pending;
+     PendingRelDelete *prev;
+     PendingRelDelete *next;
+
+     prev = NULL;
+     for (pending = pendingDeletes; pending != NULL; pending = next)
+     {
+         next = pending->next;
+         if (pending->nestLevel < nestLevel)
+         {
+             /* outer-level entries should not be processed yet */
+             prev = pending;
+         }
+         else
+         {
+             /* unlink list entry first, so we don't retry on failure */
+             if (prev)
+                 prev->next = next;
+             else
+                 pendingDeletes = next;
+             /* do deletion if called for */
+             if (pending->atCommit == isCommit)
+             {
+                 int i;
+                 SMgrRelation srel;
+
+                 /* unlink all existing forks of the relation */
+                 srel = smgropen(pending->relnode);
+                 for (i = 0; i <= MAX_FORKNUM; i++)
+                 {
+                     if (smgrexists(srel, i))
+                         smgrdounlink(srel,
+                                      i,
+                                      pending->isTemp,
+                                      false);
+                 }
+                 smgrclose(srel);
+             }
+             /* must explicitly free the list entry */
+             pfree(pending);
+             /* prev does not change */
+         }
+     }
+ }
+
+ /*
+  * smgrGetPendingDeletes() -- Get a list of relations to be deleted.
+  *
+  * The return value is the number of relations scheduled for termination.
+  * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
+  * If there are no relations to be deleted, *ptr is set to NULL.
+  *
+  * If haveNonTemp isn't NULL, the bool it points to gets set to true if
+  * there is any non-temp table pending to be deleted; false if not.
+  *
+  * Note that the list does not include anything scheduled for termination
+  * by upper-level transactions.
+  */
+ int
+ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp)
+ {
+     int            nestLevel = GetCurrentTransactionNestLevel();
+     int            nrels;
+     RelFileNode *rptr;
+     PendingRelDelete *pending;
+
+     nrels = 0;
+     if (haveNonTemp)
+         *haveNonTemp = false;
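+     /* First pass: count the matching entries */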
+     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+     {
+         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
+             nrels++;
+     }
+     if (nrels == 0)
+     {
+         *ptr = NULL;
+         return 0;
+     }
+     rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
+     *ptr = rptr;
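+     /* Second pass: copy the matching entries into the output array */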
+     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+     {
+         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
+         {
+             *rptr = pending->relnode;
+             rptr++;
+         }
+         if (haveNonTemp && !pending->isTemp)
+             *haveNonTemp = true;
+     }
+     return nrels;
+ }
+
+ /*
+  *    PostPrepare_smgr -- Clean up after a successful PREPARE
+  *
+  * What we have to do here is throw away the in-memory state about pending
+  * relation deletes.  It's all been recorded in the 2PC state file and
+  * it's no longer smgr's job to worry about it.
+  */
+ void
+ PostPrepare_smgr(void)
+ {
+     PendingRelDelete *pending;
+     PendingRelDelete *next;
+
+     for (pending = pendingDeletes; pending != NULL; pending = next)
+     {
+         next = pending->next;
+         pendingDeletes = next;
+         /* must explicitly free the list entry */
+         pfree(pending);
+     }
+ }
+
+
+ /*
+  * AtSubCommit_smgr() --- Take care of subtransaction commit.
+  *
+  * Reassign all items in the pending-deletes list to the parent transaction.
+  */
+ void
+ AtSubCommit_smgr(void)
+ {
+     int            nestLevel = GetCurrentTransactionNestLevel();
+     PendingRelDelete *pending;
+
+     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+     {
+         if (pending->nestLevel >= nestLevel)
+             pending->nestLevel = nestLevel - 1;
+     }
+ }
+
+ /*
+  * AtSubAbort_smgr() --- Take care of subtransaction abort.
+  *
+  * Delete created relations and forget about deleted relations.
+  * We can execute these operations immediately because we know this
+  * subtransaction will not commit.
+  */
+ void
+ AtSubAbort_smgr(void)
+ {
+     smgrDoPendingDeletes(false);
+ }
+
+ void
+ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+     uint8        info = record->xl_info & ~XLR_INFO_MASK;
+
+     if (info == XLOG_SMGR_CREATE)
+     {
+         xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
+         SMgrRelation reln;
+
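+         /* Re-create the main fork (isRedo = true, so an existing file is OK) */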
+         reln = smgropen(xlrec->rnode);
+         smgrcreate(reln, MAIN_FORKNUM, true);
+     }
+     else if (info == XLOG_SMGR_TRUNCATE)
+     {
+         xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
+         SMgrRelation reln;
+
+         reln = smgropen(xlrec->rnode);
+
+         /*
+          * Forcibly create relation if it doesn't exist (which suggests that
+          * it was dropped somewhere later in the WAL sequence).  As in
+          * XLogOpenRelation, we prefer to recreate the rel and replay the log
+          * as best we can until the drop is seen.
+          */
+         smgrcreate(reln, MAIN_FORKNUM, true);
+
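+         /* isTemp = false: truncations of temp relations are never WAL-logged */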
+         smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno, false);
+
+         /* Also tell xlogutils.c about it */
+         XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
+
+         /* Truncate FSM too */
+         if (smgrexists(reln, FSM_FORKNUM))
+         {
+             Relation rel = CreateFakeRelcacheEntry(xlrec->rnode);
+             FreeSpaceMapTruncateRel(rel, xlrec->blkno);
+             FreeFakeRelcacheEntry(rel);
+         }
+
+     }
+     else
+         elog(PANIC, "smgr_redo: unknown op code %u", info);
+ }
+
+ void
+ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+     uint8        info = xl_info & ~XLR_INFO_MASK;
+
+     if (info == XLOG_SMGR_CREATE)
+     {
+         xl_smgr_create *xlrec = (xl_smgr_create *) rec;
+         char *path = relpath(xlrec->rnode, MAIN_FORKNUM);
+
+         appendStringInfo(buf, "file create: %s", path);
+         pfree(path);
+     }
+     else if (info == XLOG_SMGR_TRUNCATE)
+     {
+         xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec;
+         char *path = relpath(xlrec->rnode, MAIN_FORKNUM);
+
+         appendStringInfo(buf, "file truncate: %s to %u blocks", path,
+                          xlrec->blkno);
+         pfree(path);
+     }
+     else
+         appendStringInfo(buf, "UNKNOWN");
+ }
*** src/backend/commands/tablecmds.c
--- src/backend/commands/tablecmds.c
***************
*** 35,40 ****
--- 35,41 ----
  #include "catalog/pg_trigger.h"
  #include "catalog/pg_type.h"
  #include "catalog/pg_type_fn.h"
+ #include "catalog/storage.h"
  #include "catalog/toasting.h"
  #include "commands/cluster.h"
  #include "commands/defrem.h"
***************
*** 6482,6488 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace)
      Relation    pg_class;
      HeapTuple    tuple;
      Form_pg_class rd_rel;
!     ForkNumber    forkNum;

      /*
       * Need lock here in case we are recursing to toast table or index
--- 6483,6489 ----
      Relation    pg_class;
      HeapTuple    tuple;
      Form_pg_class rd_rel;
!     ForkNumber      forkNum;

      /*
       * Need lock here in case we are recursing to toast table or index
***************
*** 6558,6564 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace)
      newrnode = rel->rd_node;
      newrnode.relNode = newrelfilenode;
      newrnode.spcNode = newTableSpace;
-     dstrel = smgropen(newrnode);

      RelationOpenSmgr(rel);

--- 6559,6564 ----
***************
*** 6567,6588 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace)
       * of old physical files.
       *
       * NOTE: any conflict in relfilenode value will be caught in
!      *         smgrcreate() below.
       */
!     for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
      {
          if (smgrexists(rel->rd_smgr, forkNum))
          {
!             smgrcreate(dstrel, forkNum, rel->rd_istemp, false);
              copy_relation_data(rel->rd_smgr, dstrel, forkNum, rel->rd_istemp);
-
-             smgrscheduleunlink(rel->rd_smgr, forkNum, rel->rd_istemp);
          }
      }

      /* Close old and new relation */
      smgrclose(dstrel);
-     RelationCloseSmgr(rel);

      /* update the pg_class row */
      rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace;
--- 6567,6592 ----
       * of old physical files.
       *
       * NOTE: any conflict in relfilenode value will be caught in
!      *         RelationCreateStorage().
       */
!     RelationCreateStorage(newrnode, rel->rd_istemp);
!
!     dstrel = smgropen(newrnode);
!
!     copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, rel->rd_istemp);
!     for (forkNum = MAIN_FORKNUM + 1; forkNum <= MAX_FORKNUM; forkNum++)
      {
          if (smgrexists(rel->rd_smgr, forkNum))
          {
!             smgrcreate(dstrel, forkNum, false);
              copy_relation_data(rel->rd_smgr, dstrel, forkNum, rel->rd_istemp);
          }
      }

+     RelationDropStorage(rel);
+
      /* Close old and new relation */
      smgrclose(dstrel);

      /* update the pg_class row */
      rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace;
*** src/backend/commands/vacuum.c
--- src/backend/commands/vacuum.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "catalog/namespace.h"
  #include "catalog/pg_database.h"
  #include "catalog/pg_namespace.h"
+ #include "catalog/storage.h"
  #include "commands/dbcommands.h"
  #include "commands/vacuum.h"
  #include "executor/executor.h"
***************
*** 2863,2869 **** repair_frag(VRelStats *vacrelstats, Relation onerel,
      /* Truncate relation, if needed */
      if (blkno < nblocks)
      {
-         FreeSpaceMapTruncateRel(onerel, blkno);
          RelationTruncate(onerel, blkno);
          vacrelstats->rel_pages = blkno; /* set new number of blocks */
      }
--- 2864,2869 ----
***************
*** 3258,3264 **** vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages)
                  (errmsg("\"%s\": truncated %u to %u pages",
                          RelationGetRelationName(onerel),
                          vacrelstats->rel_pages, relblocks)));
-         FreeSpaceMapTruncateRel(onerel, relblocks);
          RelationTruncate(onerel, relblocks);
          vacrelstats->rel_pages = relblocks;        /* set new number of blocks */
      }
--- 3258,3263 ----
*** src/backend/commands/vacuumlazy.c
--- src/backend/commands/vacuumlazy.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "catalog/storage.h"
  #include "commands/dbcommands.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
***************
*** 827,833 **** lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
      /*
       * Okay to truncate.
       */
-     FreeSpaceMapTruncateRel(onerel, new_rel_pages);
      RelationTruncate(onerel, new_rel_pages);

      /*
--- 828,833 ----
*** src/backend/rewrite/rewriteDefine.c
--- src/backend/rewrite/rewriteDefine.c
***************
*** 19,31 ****
  #include "catalog/indexing.h"
  #include "catalog/namespace.h"
  #include "catalog/pg_rewrite.h"
  #include "miscadmin.h"
  #include "nodes/nodeFuncs.h"
  #include "parser/parse_utilcmd.h"
  #include "rewrite/rewriteDefine.h"
  #include "rewrite/rewriteManip.h"
  #include "rewrite/rewriteSupport.h"
- #include "storage/smgr.h"
  #include "utils/acl.h"
  #include "utils/builtins.h"
  #include "utils/inval.h"
--- 19,31 ----
  #include "catalog/indexing.h"
  #include "catalog/namespace.h"
  #include "catalog/pg_rewrite.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "nodes/nodeFuncs.h"
  #include "parser/parse_utilcmd.h"
  #include "rewrite/rewriteDefine.h"
  #include "rewrite/rewriteManip.h"
  #include "rewrite/rewriteSupport.h"
  #include "utils/acl.h"
  #include "utils/builtins.h"
  #include "utils/inval.h"
***************
*** 484,499 **** DefineQueryRewrite(char *rulename,
       * XXX what about getting rid of its TOAST table?  For now, we don't.
       */
      if (RelisBecomingView)
!     {
!         ForkNumber forknum;
!
!         RelationOpenSmgr(event_relation);
!         for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
!             if (smgrexists(event_relation->rd_smgr, forknum))
!                 smgrscheduleunlink(event_relation->rd_smgr, forknum,
!                                    event_relation->rd_istemp);
!         RelationCloseSmgr(event_relation);
!     }

      /* Close rel, but keep lock till commit... */
      heap_close(event_relation, NoLock);
--- 484,490 ----
       * XXX what about getting rid of its TOAST table?  For now, we don't.
       */
      if (RelisBecomingView)
!         RelationDropStorage(event_relation);

      /* Close rel, but keep lock till commit... */
      heap_close(event_relation, NoLock);
*** src/backend/storage/buffer/bufmgr.c
--- src/backend/storage/buffer/bufmgr.c
***************
*** 1695,1702 **** void
  BufmgrCommit(void)
  {
      /* Nothing to do in bufmgr anymore... */
-
-     smgrcommit();
  }

  /*
--- 1695,1700 ----
***************
*** 1848,1873 **** RelationGetNumberOfBlocks(Relation relation)
      return smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
  }

- /*
-  * RelationTruncate
-  *        Physically truncate a relation to the specified number of blocks.
-  *
-  * As of Postgres 8.1, this includes getting rid of any buffers for the
-  * blocks that are to be dropped; previously, callers had to do that.
-  */
- void
- RelationTruncate(Relation rel, BlockNumber nblocks)
- {
-     /* Open it at the smgr level if not already done */
-     RelationOpenSmgr(rel);
-
-     /* Make sure rd_targblock isn't pointing somewhere past end */
-     rel->rd_targblock = InvalidBlockNumber;
-
-     /* Do the real work */
-     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks, rel->rd_istemp);
- }
-
  /* ---------------------------------------------------------------------
   *        DropRelFileNodeBuffers
   *
--- 1846,1851 ----
*** src/backend/storage/freespace/freespace.c
--- src/backend/storage/freespace/freespace.c
***************
*** 47,53 ****
   * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming
   * default 8k BLCKSZ, and that MaxFSMRequestSize is 24 bytes, the categories
   * look like this
!  *
   *
   * Range     Category
   * 0    - 31   0
--- 47,53 ----
   * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming
   * default 8k BLCKSZ, and that MaxFSMRequestSize is 24 bytes, the categories
   * look like this
!  *
   *
   * Range     Category
   * 0    - 31   0
***************
*** 93,107 **** typedef struct
  /* Address of the root page. */
  static const FSMAddress FSM_ROOT_ADDRESS = { FSM_ROOT_LEVEL, 0 };

- /* XLOG record types */
- #define XLOG_FSM_TRUNCATE     0x00    /* truncate */
-
- typedef struct
- {
-     RelFileNode node;            /* truncated relation */
-     BlockNumber nheapblocks;    /* new number of blocks in the heap */
- } xl_fsm_truncate;
-
  /* functions to navigate the tree */
  static FSMAddress fsm_get_child(FSMAddress parent, uint16 slot);
  static FSMAddress fsm_get_parent(FSMAddress child, uint16 *slot);
--- 93,98 ----
***************
*** 110,116 **** static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
  static BlockNumber fsm_logical_to_physical(FSMAddress addr);

  static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
! static void fsm_extend(Relation rel, BlockNumber nfsmblocks);

  /* functions to convert amount of free space to a FSM category */
  static uint8 fsm_space_avail_to_cat(Size avail);
--- 101,107 ----
  static BlockNumber fsm_logical_to_physical(FSMAddress addr);

  static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
! static void fsm_extend(Relation rel, BlockNumber nfsmblocks, bool createstorage);

  /* functions to convert amount of free space to a FSM category */
  static uint8 fsm_space_avail_to_cat(Size avail);
***************
*** 123,130 **** static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
  static BlockNumber fsm_search(Relation rel, uint8 min_cat);
  static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);

- static void fsm_redo_truncate(xl_fsm_truncate *xlrec);
-

  /******** Public API ********/

--- 114,119 ----
***************
*** 275,280 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
--- 264,276 ----

      RelationOpenSmgr(rel);

+     /*
+      * If no FSM has been created yet for this relation, there's nothing to
+      * truncate.
+      */
+     if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
+         return;
+
      /* Get the location in the FSM of the first removed heap block */
      first_removed_address = fsm_get_location(nblocks, &first_removed_slot);

***************
*** 307,348 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
      smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks, rel->rd_istemp);

      /*
-      * FSM truncations are WAL-logged, because we must never return a block
-      * that doesn't exist in the heap, not even if we crash before the FSM
-      * truncation has made it to disk. smgrtruncate() writes its own WAL
-      * record, but that's not enough to zero out the last remaining FSM page.
-      * (if we didn't need to zero out anything above, we can skip this)
-      */
-     if (!rel->rd_istemp && first_removed_slot != 0)
-     {
-         xl_fsm_truncate xlrec;
-         XLogRecData        rdata;
-         XLogRecPtr        recptr;
-
-         xlrec.node = rel->rd_node;
-         xlrec.nheapblocks = nblocks;
-
-         rdata.data = (char *) &xlrec;
-         rdata.len = sizeof(xl_fsm_truncate);
-         rdata.buffer = InvalidBuffer;
-         rdata.next = NULL;
-
-         recptr = XLogInsert(RM_FREESPACE_ID, XLOG_FSM_TRUNCATE, &rdata);
-
-         /*
-          * Flush, because otherwise the truncation of the main relation
-          * might hit the disk before the WAL record of truncating the
-          * FSM is flushed. If we crashed during that window, we'd be
-          * left with a truncated heap, without a truncated FSM.
-          */
-         XLogFlush(recptr);
-     }
-
-     /*
       * Need to invalidate the relcache entry, because rd_fsm_nblocks_cache
       * seen by other backends is no longer valid.
       */
!     CacheInvalidateRelcache(rel);

      rel->rd_fsm_nblocks_cache = new_nfsmblocks;
  }
--- 303,313 ----
      smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks, rel->rd_istemp);

      /*
       * Need to invalidate the relcache entry, because rd_fsm_nblocks_cache
       * seen by other backends is no longer valid.
       */
!     if (!InRecovery)
!         CacheInvalidateRelcache(rel);

      rel->rd_fsm_nblocks_cache = new_nfsmblocks;
  }
***************
*** 538,551 **** fsm_readbuf(Relation rel, FSMAddress addr, bool extend)

      RelationOpenSmgr(rel);

!     if (rel->rd_fsm_nblocks_cache == InvalidBlockNumber ||
          rel->rd_fsm_nblocks_cache <= blkno)
!         rel->rd_fsm_nblocks_cache = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

      if (blkno >= rel->rd_fsm_nblocks_cache)
      {
          if (extend)
!             fsm_extend(rel, blkno + 1);
          else
              return InvalidBuffer;
      }
--- 503,521 ----

      RelationOpenSmgr(rel);

!     if (rel->rd_fsm_nblocks_cache == InvalidBlockNumber ||
          rel->rd_fsm_nblocks_cache <= blkno)
!     {
!         if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
!             fsm_extend(rel, blkno + 1, true);
!         else
!             rel->rd_fsm_nblocks_cache = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
!     }

      if (blkno >= rel->rd_fsm_nblocks_cache)
      {
          if (extend)
!             fsm_extend(rel, blkno + 1, false);
          else
              return InvalidBuffer;
      }
***************
*** 566,575 **** fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
  /*
   * Ensure that the FSM fork is at least n_fsmblocks long, extending
   * it if necessary with empty pages. And by empty, I mean pages filled
!  * with zeros, meaning there's no free space.
   */
  static void
! fsm_extend(Relation rel, BlockNumber n_fsmblocks)
  {
      BlockNumber n_fsmblocks_now;
      Page pg;
--- 536,546 ----
  /*
   * Ensure that the FSM fork is at least n_fsmblocks long, extending
   * it if necessary with empty pages. And by empty, I mean pages filled
!  * with zeros, meaning there's no free space. If createstorage is true,
!  * the FSM file might need to be created first.
   */
  static void
! fsm_extend(Relation rel, BlockNumber n_fsmblocks, bool createstorage)
  {
      BlockNumber n_fsmblocks_now;
      Page pg;
***************
*** 584,595 **** fsm_extend(Relation rel, BlockNumber n_fsmblocks)
       * FSM happens seldom enough that it doesn't seem worthwhile to
       * have a separate lock tag type for it.
       *
!      * Note that another backend might have extended the relation
!      * before we get the lock.
       */
      LockRelationForExtension(rel, ExclusiveLock);

!     n_fsmblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
      while (n_fsmblocks_now < n_fsmblocks)
      {
          smgrextend(rel->rd_smgr, FSM_FORKNUM, n_fsmblocks_now,
--- 555,574 ----
       * FSM happens seldom enough that it doesn't seem worthwhile to
       * have a separate lock tag type for it.
       *
!      * Note that another backend might have extended or created the
!      * relation before we get the lock.
       */
      LockRelationForExtension(rel, ExclusiveLock);

!     /* Create the FSM file first if it doesn't exist */
!     if (createstorage && !smgrexists(rel->rd_smgr, FSM_FORKNUM))
!     {
!         smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
!         n_fsmblocks_now = 0;
!     }
!     else
!         n_fsmblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
!
      while (n_fsmblocks_now < n_fsmblocks)
      {
          smgrextend(rel->rd_smgr, FSM_FORKNUM, n_fsmblocks_now,
***************
*** 799,873 **** fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)

      return max_avail;
  }
-
-
- /****** WAL-logging ******/
-
- static void
- fsm_redo_truncate(xl_fsm_truncate *xlrec)
- {
-     FSMAddress    first_removed_address;
-     uint16        first_removed_slot;
-     BlockNumber fsmblk;
-     Buffer        buf;
-
-     /* Get the location in the FSM of the first removed heap block */
-     first_removed_address = fsm_get_location(xlrec->nheapblocks,
-                                              &first_removed_slot);
-     fsmblk = fsm_logical_to_physical(first_removed_address);
-
-     /*
-      * Zero out the tail of the last remaining FSM page. We rely on the
-      * replay of the smgr truncation record to remove completely unused
-      * pages.
-      */
-     buf = XLogReadBufferExtended(xlrec->node, FSM_FORKNUM, fsmblk,
-                                  RBM_ZERO_ON_ERROR);
-     if (BufferIsValid(buf))
-     {
-         Page page = BufferGetPage(buf);
-
-         if (PageIsNew(page))
-             PageInit(page, BLCKSZ, 0);
-         fsm_truncate_avail(page, first_removed_slot);
-         MarkBufferDirty(buf);
-         UnlockReleaseBuffer(buf);
-     }
- }
-
- void
- fsm_redo(XLogRecPtr lsn, XLogRecord *record)
- {
-     uint8        info = record->xl_info & ~XLR_INFO_MASK;
-
-     switch (info)
-     {
-         case XLOG_FSM_TRUNCATE:
-             fsm_redo_truncate((xl_fsm_truncate *) XLogRecGetData(record));
-             break;
-         default:
-             elog(PANIC, "fsm_redo: unknown op code %u", info);
-     }
- }
-
- void
- fsm_desc(StringInfo buf, uint8 xl_info, char *rec)
- {
-     uint8           info = xl_info & ~XLR_INFO_MASK;
-
-     switch (info)
-     {
-         case XLOG_FSM_TRUNCATE:
-         {
-             xl_fsm_truncate *xlrec = (xl_fsm_truncate *) rec;
-
-             appendStringInfo(buf, "truncate: rel %u/%u/%u; nheapblocks %u;",
-                              xlrec->node.spcNode, xlrec->node.dbNode,
-                              xlrec->node.relNode, xlrec->nheapblocks);
-             break;
-         }
-         default:
-             appendStringInfo(buf, "UNKNOWN");
-             break;
-     }
- }
--- 778,780 ----
*** src/backend/storage/freespace/indexfsm.c
--- src/backend/storage/freespace/indexfsm.c
***************
*** 31,50 ****
   */

  /*
-  * InitIndexFreeSpaceMap - Create or reset the FSM fork for relation.
-  */
- void
- InitIndexFreeSpaceMap(Relation rel)
- {
-     /* Create FSM fork if it doesn't exist yet, or truncate it if it does */
-     RelationOpenSmgr(rel);
-     if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
-         smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false);
-     else
-         smgrtruncate(rel->rd_smgr, FSM_FORKNUM, 0, rel->rd_istemp);
- }
-
- /*
   * GetFreeIndexPage - return a free page from the FSM
   *
   * As a side effect, the page is marked as used in the FSM.
--- 31,36 ----
*** src/backend/storage/smgr/smgr.c
--- src/backend/storage/smgr/smgr.c
***************
*** 17,31 ****
   */
  #include "postgres.h"

- #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
  #include "commands/tablespace.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
  #include "storage/smgr.h"
  #include "utils/hsearch.h"
- #include "utils/memutils.h"


  /*
--- 17,30 ----
   */
  #include "postgres.h"

  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
  #include "commands/tablespace.h"
  #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
  #include "storage/ipc.h"
  #include "storage/smgr.h"
  #include "utils/hsearch.h"


  /*
***************
*** 58,65 **** typedef struct f_smgr
      void        (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
                                    BlockNumber nblocks, bool isTemp);
      void        (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
-     void        (*smgr_commit) (void);    /* may be NULL */
-     void        (*smgr_abort) (void);    /* may be NULL */
      void        (*smgr_pre_ckpt) (void);        /* may be NULL */
      void        (*smgr_sync) (void);    /* may be NULL */
      void        (*smgr_post_ckpt) (void);        /* may be NULL */
--- 57,62 ----
***************
*** 70,76 **** static const f_smgr smgrsw[] = {
      /* magnetic disk */
      {mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
          mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
!         NULL, NULL, mdpreckpt, mdsync, mdpostckpt
      }
  };

--- 67,73 ----
      /* magnetic disk */
      {mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
          mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
!         mdpreckpt, mdsync, mdpostckpt
      }
  };

***************
*** 82,146 **** static const int NSmgr = lengthof(smgrsw);
   */
  static HTAB *SMgrRelationHash = NULL;

- /*
-  * We keep a list of all relations (represented as RelFileNode values)
-  * that have been created or deleted in the current transaction.  When
-  * a relation is created, we create the physical file immediately, but
-  * remember it so that we can delete the file again if the current
-  * transaction is aborted.    Conversely, a deletion request is NOT
-  * executed immediately, but is just entered in the list.  When and if
-  * the transaction commits, we can delete the physical file.
-  *
-  * To handle subtransactions, every entry is marked with its transaction
-  * nesting level.  At subtransaction commit, we reassign the subtransaction's
-  * entries to the parent nesting level.  At subtransaction abort, we can
-  * immediately execute the abort-time actions for all entries of the current
-  * nesting level.
-  *
-  * NOTE: the list is kept in TopMemoryContext to be sure it won't disappear
-  * unbetimes.  It'd probably be OK to keep it in TopTransactionContext,
-  * but I'm being paranoid.
-  */
-
- typedef struct PendingRelDelete
- {
-     RelFileNode relnode;        /* relation that may need to be deleted */
-     ForkNumber    forknum;        /* fork number that may need to be deleted */
-     int            which;            /* which storage manager? */
-     bool        isTemp;            /* is it a temporary relation? */
-     bool        atCommit;        /* T=delete at commit; F=delete at abort */
-     int            nestLevel;        /* xact nesting level of request */
-     struct PendingRelDelete *next;        /* linked-list link */
- } PendingRelDelete;
-
- static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
-
-
- /*
-  * Declarations for smgr-related XLOG records
-  *
-  * Note: we log file creation and truncation here, but logging of deletion
-  * actions is handled by xact.c, because it is part of transaction commit.
-  */
-
- /* XLOG gives us high 4 bits */
- #define XLOG_SMGR_CREATE    0x10
- #define XLOG_SMGR_TRUNCATE    0x20
-
- typedef struct xl_smgr_create
- {
-     RelFileNode rnode;
-     ForkNumber    forknum;
- } xl_smgr_create;
-
- typedef struct xl_smgr_truncate
- {
-     BlockNumber blkno;
-     RelFileNode rnode;
-     ForkNumber forknum;
- } xl_smgr_truncate;
-
-
  /* local function prototypes */
  static void smgrshutdown(int code, Datum arg);
  static void smgr_internal_unlink(RelFileNode rnode, ForkNumber forknum,
--- 79,84 ----
***************
*** 341,358 **** smgrclosenode(RelFileNode rnode)
   *        to be created.
   *
   *        If isRedo is true, it is okay for the underlying file to exist
!  *        already because we are in a WAL replay sequence.  In this case
!  *        we should make no PendingRelDelete entry; the WAL sequence will
!  *        tell whether to drop the file.
   */
  void
! smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo)
  {
-     XLogRecPtr    lsn;
-     XLogRecData rdata;
-     xl_smgr_create xlrec;
-     PendingRelDelete *pending;
-
      /*
       * Exit quickly in WAL replay mode if we've already opened the file.
       * If it's open, it surely must exist.
--- 279,289 ----
   *        to be created.
   *
   *        If isRedo is true, it is okay for the underlying file to exist
!  *        already because we are in a WAL replay sequence.
   */
  void
! smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
  {
      /*
       * Exit quickly in WAL replay mode if we've already opened the file.
       * If it's open, it surely must exist.
***************
*** 374,442 **** smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo)
                              isRedo);

      (*(smgrsw[reln->smgr_which].smgr_create)) (reln, forknum, isRedo);
-
-     if (isRedo)
-         return;
-
-     /*
-      * Make an XLOG entry showing the file creation.  If we abort, the file
-      * will be dropped at abort time.
-      */
-     xlrec.rnode = reln->smgr_rnode;
-     xlrec.forknum = forknum;
-
-     rdata.data = (char *) &xlrec;
-     rdata.len = sizeof(xlrec);
-     rdata.buffer = InvalidBuffer;
-     rdata.next = NULL;
-
-     lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE, &rdata);
-
-     /* Add the relation to the list of stuff to delete at abort */
-     pending = (PendingRelDelete *)
-         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-     pending->relnode = reln->smgr_rnode;
-     pending->forknum = forknum;
-     pending->which = reln->smgr_which;
-     pending->isTemp = isTemp;
-     pending->atCommit = false;    /* delete if abort */
-     pending->nestLevel = GetCurrentTransactionNestLevel();
-     pending->next = pendingDeletes;
-     pendingDeletes = pending;
- }
-
- /*
-  *    smgrscheduleunlink() -- Schedule unlinking a relation at xact commit.
-  *
-  *        The fork is marked to be removed from the store if we successfully
-  *        commit the current transaction.
-  */
- void
- smgrscheduleunlink(SMgrRelation reln, ForkNumber forknum, bool isTemp)
- {
-     PendingRelDelete *pending;
-
-     /* Add the relation to the list of stuff to delete at commit */
-     pending = (PendingRelDelete *)
-         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-     pending->relnode = reln->smgr_rnode;
-     pending->forknum = forknum;
-     pending->which = reln->smgr_which;
-     pending->isTemp = isTemp;
-     pending->atCommit = true;    /* delete if commit */
-     pending->nestLevel = GetCurrentTransactionNestLevel();
-     pending->next = pendingDeletes;
-     pendingDeletes = pending;
-
-     /*
-      * NOTE: if the relation was created in this transaction, it will now be
-      * present in the pending-delete list twice, once with atCommit true and
-      * once with atCommit false.  Hence, it will be physically deleted at end
-      * of xact in either case (and the other entry will be ignored by
-      * smgrDoPendingDeletes, so no error will occur).  We could instead remove
-      * the existing list entry and delete the physical file immediately, but
-      * for now I'll keep the logic simple.
-      */
  }

  /*
--- 305,310 ----
***************
*** 573,599 **** smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
      /* Do the truncation */
      (*(smgrsw[reln->smgr_which].smgr_truncate)) (reln, forknum, nblocks,
                                                   isTemp);
-
-     if (!isTemp)
-     {
-         /*
-          * Make an XLOG entry showing the file truncation.
-          */
-         XLogRecPtr    lsn;
-         XLogRecData rdata;
-         xl_smgr_truncate xlrec;
-
-         xlrec.blkno = nblocks;
-         xlrec.rnode = reln->smgr_rnode;
-         xlrec.forknum = forknum;
-
-         rdata.data = (char *) &xlrec;
-         rdata.len = sizeof(xlrec);
-         rdata.buffer = InvalidBuffer;
-         rdata.next = NULL;
-
-         lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata);
-     }
  }

  /*
--- 441,446 ----
***************
*** 627,813 **** smgrimmedsync(SMgrRelation reln, ForkNumber forknum)


  /*
-  *    PostPrepare_smgr -- Clean up after a successful PREPARE
-  *
-  * What we have to do here is throw away the in-memory state about pending
-  * relation deletes.  It's all been recorded in the 2PC state file and
-  * it's no longer smgr's job to worry about it.
-  */
- void
- PostPrepare_smgr(void)
- {
-     PendingRelDelete *pending;
-     PendingRelDelete *next;
-
-     for (pending = pendingDeletes; pending != NULL; pending = next)
-     {
-         next = pending->next;
-         pendingDeletes = next;
-         /* must explicitly free the list entry */
-         pfree(pending);
-     }
- }
-
-
- /*
-  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
-  *
-  * This also runs when aborting a subxact; we want to clean up a failed
-  * subxact immediately.
-  */
- void
- smgrDoPendingDeletes(bool isCommit)
- {
-     int            nestLevel = GetCurrentTransactionNestLevel();
-     PendingRelDelete *pending;
-     PendingRelDelete *prev;
-     PendingRelDelete *next;
-
-     prev = NULL;
-     for (pending = pendingDeletes; pending != NULL; pending = next)
-     {
-         next = pending->next;
-         if (pending->nestLevel < nestLevel)
-         {
-             /* outer-level entries should not be processed yet */
-             prev = pending;
-         }
-         else
-         {
-             /* unlink list entry first, so we don't retry on failure */
-             if (prev)
-                 prev->next = next;
-             else
-                 pendingDeletes = next;
-             /* do deletion if called for */
-             if (pending->atCommit == isCommit)
-                 smgr_internal_unlink(pending->relnode,
-                                      pending->forknum,
-                                      pending->which,
-                                      pending->isTemp,
-                                      false);
-             /* must explicitly free the list entry */
-             pfree(pending);
-             /* prev does not change */
-         }
-     }
- }
-
- /*
-  * smgrGetPendingDeletes() -- Get a list of relations to be deleted.
-  *
-  * The return value is the number of relations scheduled for termination.
-  * *ptr is set to point to a freshly-palloc'd array of RelFileForks.
-  * If there are no relations to be deleted, *ptr is set to NULL.
-  *
-  * If haveNonTemp isn't NULL, the bool it points to gets set to true if
-  * there is any non-temp table pending to be deleted; false if not.
-  *
-  * Note that the list does not include anything scheduled for termination
-  * by upper-level transactions.
-  */
- int
- smgrGetPendingDeletes(bool forCommit, RelFileFork **ptr, bool *haveNonTemp)
- {
-     int            nestLevel = GetCurrentTransactionNestLevel();
-     int            nrels;
-     RelFileFork *rptr;
-     PendingRelDelete *pending;
-
-     nrels = 0;
-     if (haveNonTemp)
-         *haveNonTemp = false;
-     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
-     {
-         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
-             nrels++;
-     }
-     if (nrels == 0)
-     {
-         *ptr = NULL;
-         return 0;
-     }
-     rptr = (RelFileFork *) palloc(nrels * sizeof(RelFileFork));
-     *ptr = rptr;
-     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
-     {
-         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
-         {
-             rptr->rnode = pending->relnode;
-             rptr->forknum = pending->forknum;
-             rptr++;
-         }
-         if (haveNonTemp && !pending->isTemp)
-             *haveNonTemp = true;
-     }
-     return nrels;
- }
-
- /*
-  * AtSubCommit_smgr() --- Take care of subtransaction commit.
-  *
-  * Reassign all items in the pending-deletes list to the parent transaction.
-  */
- void
- AtSubCommit_smgr(void)
- {
-     int            nestLevel = GetCurrentTransactionNestLevel();
-     PendingRelDelete *pending;
-
-     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
-     {
-         if (pending->nestLevel >= nestLevel)
-             pending->nestLevel = nestLevel - 1;
-     }
- }
-
- /*
-  * AtSubAbort_smgr() --- Take care of subtransaction abort.
-  *
-  * Delete created relations and forget about deleted relations.
-  * We can execute these operations immediately because we know this
-  * subtransaction will not commit.
-  */
- void
- AtSubAbort_smgr(void)
- {
-     smgrDoPendingDeletes(false);
- }
-
- /*
-  *    smgrcommit() -- Prepare to commit changes made during the current
-  *                    transaction.
-  *
-  *        This is called before we actually commit.
-  */
- void
- smgrcommit(void)
- {
-     int            i;
-
-     for (i = 0; i < NSmgr; i++)
-     {
-         if (smgrsw[i].smgr_commit)
-             (*(smgrsw[i].smgr_commit)) ();
-     }
- }
-
- /*
-  *    smgrabort() -- Clean up after transaction abort.
-  */
- void
- smgrabort(void)
- {
-     int            i;
-
-     for (i = 0; i < NSmgr; i++)
-     {
-         if (smgrsw[i].smgr_abort)
-             (*(smgrsw[i].smgr_abort)) ();
-     }
- }
-
- /*
   *    smgrpreckpt() -- Prepare for checkpoint.
   */
  void
--- 474,479 ----
***************
*** 852,931 **** smgrpostckpt(void)
      }
  }

-
- void
- smgr_redo(XLogRecPtr lsn, XLogRecord *record)
- {
-     uint8        info = record->xl_info & ~XLR_INFO_MASK;
-
-     if (info == XLOG_SMGR_CREATE)
-     {
-         xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
-         SMgrRelation reln;
-
-         reln = smgropen(xlrec->rnode);
-         smgrcreate(reln, xlrec->forknum, false, true);
-     }
-     else if (info == XLOG_SMGR_TRUNCATE)
-     {
-         xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
-         SMgrRelation reln;
-
-         reln = smgropen(xlrec->rnode);
-
-         /*
-          * Forcibly create relation if it doesn't exist (which suggests that
-          * it was dropped somewhere later in the WAL sequence).  As in
-          * XLogOpenRelation, we prefer to recreate the rel and replay the log
-          * as best we can until the drop is seen.
-          */
-         smgrcreate(reln, xlrec->forknum, false, true);
-
-         /* Can't use smgrtruncate because it would try to xlog */
-
-         /*
-          * First, force bufmgr to drop any buffers it has for the to-be-
-          * truncated blocks.  We must do this, else subsequent XLogReadBuffer
-          * operations will not re-extend the file properly.
-          */
-         DropRelFileNodeBuffers(xlrec->rnode, xlrec->forknum, false,
-                                xlrec->blkno);
-
-         /* Do the truncation */
-         (*(smgrsw[reln->smgr_which].smgr_truncate)) (reln,
-                                                      xlrec->forknum,
-                                                      xlrec->blkno,
-                                                      false);
-
-         /* Also tell xlogutils.c about it */
-         XLogTruncateRelation(xlrec->rnode, xlrec->forknum, xlrec->blkno);
-     }
-     else
-         elog(PANIC, "smgr_redo: unknown op code %u", info);
- }
-
- void
- smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
- {
-     uint8        info = xl_info & ~XLR_INFO_MASK;
-
-     if (info == XLOG_SMGR_CREATE)
-     {
-         xl_smgr_create *xlrec = (xl_smgr_create *) rec;
-         char *path = relpath(xlrec->rnode, xlrec->forknum);
-
-         appendStringInfo(buf, "file create: %s", path);
-         pfree(path);
-     }
-     else if (info == XLOG_SMGR_TRUNCATE)
-     {
-         xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec;
-         char *path = relpath(xlrec->rnode, xlrec->forknum);
-
-         appendStringInfo(buf, "file truncate: %s to %u blocks", path,
-                          xlrec->blkno);
-         pfree(path);
-     }
-     else
-         appendStringInfo(buf, "UNKNOWN");
- }
--- 518,520 ----
*** src/include/access/rmgr.h
--- src/include/access/rmgr.h
***************
*** 23,29 **** typedef uint8 RmgrId;
  #define RM_DBASE_ID                4
  #define RM_TBLSPC_ID            5
  #define RM_MULTIXACT_ID            6
- #define RM_FREESPACE_ID            7
  #define RM_HEAP2_ID                9
  #define RM_HEAP_ID                10
  #define RM_BTREE_ID                11
--- 23,28 ----
*** src/include/access/xact.h
--- src/include/access/xact.h
***************
*** 90,97 **** typedef struct xl_xact_commit
      TimestampTz xact_time;        /* time of commit */
      int            nrels;            /* number of RelFileForks */
      int            nsubxacts;        /* number of subtransaction XIDs */
!     /* Array of RelFileFork(s) to drop at commit */
!     RelFileFork    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

--- 90,97 ----
      TimestampTz xact_time;        /* time of commit */
      int            nrels;            /* number of RelFileForks */
      int            nsubxacts;        /* number of subtransaction XIDs */
!     /* Array of RelFileNode(s) to drop at commit */
!     RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

***************
*** 102,109 **** typedef struct xl_xact_abort
      TimestampTz xact_time;        /* time of abort */
      int            nrels;            /* number of RelFileForks */
      int            nsubxacts;        /* number of subtransaction XIDs */
!     /* Array of RelFileFork(s) to drop at abort */
!     RelFileFork    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;

--- 102,109 ----
      TimestampTz xact_time;        /* time of abort */
      int            nrels;            /* number of RelFileForks */
      int            nsubxacts;        /* number of subtransaction XIDs */
!     /* Array of RelFileNode(s) to drop at abort */
!     RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;

*** /dev/null
--- src/include/catalog/storage.h
***************
*** 0 ****
--- 1,32 ----
+ /*-------------------------------------------------------------------------
+  *
+  * storage.h
+  *      prototypes for functions in backend/catalog/storage.c
+  *
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef STORAGE_H
+ #define STORAGE_H
+
+ #include "storage/block.h"
+ #include "storage/relfilenode.h"
+ #include "utils/rel.h"
+
+ extern void RelationCreateStorage(RelFileNode rnode, bool istemp);
+ extern void RelationDropStorage(Relation rel);
+ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
+
+ extern void smgrDoPendingDeletes(bool isCommit);
+ extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr,
+                       bool *haveNonTemp);
+ extern void AtSubCommit_smgr(void);
+ extern void AtSubAbort_smgr(void);
+ extern void PostPrepare_smgr(void);
+
+ #endif   /* STORAGE_H */
*** src/include/storage/bufmgr.h
--- src/include/storage/bufmgr.h
***************
*** 176,182 **** extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
- extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  extern void FlushRelationBuffers(Relation rel);
  extern void FlushDatabaseBuffers(Oid dbid);
  extern void DropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
--- 176,181 ----
*** src/include/storage/freespace.h
--- src/include/storage/freespace.h
***************
*** 33,40 **** extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
  extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
  extern void FreeSpaceMapVacuum(Relation rel);

- /* WAL prototypes */
- extern void fsm_desc(StringInfo buf, uint8 xl_info, char *rec);
- extern void fsm_redo(XLogRecPtr lsn, XLogRecord *record);
-
  #endif   /* FREESPACE_H */
--- 33,36 ----
*** src/include/storage/indexfsm.h
--- src/include/storage/indexfsm.h
***************
*** 20,26 **** extern BlockNumber GetFreeIndexPage(Relation rel);
  extern void RecordFreeIndexPage(Relation rel, BlockNumber page);
  extern void RecordUsedIndexPage(Relation rel, BlockNumber page);

- extern void InitIndexFreeSpaceMap(Relation rel);
  extern void IndexFreeSpaceMapTruncate(Relation rel, BlockNumber nblocks);
  extern void IndexFreeSpaceMapVacuum(Relation rel);

--- 20,25 ----
*** src/include/storage/relfilenode.h
--- src/include/storage/relfilenode.h
***************
*** 78,90 **** typedef struct RelFileNode
       (node1).dbNode == (node2).dbNode && \
       (node1).spcNode == (node2).spcNode)

- /*
-  * RelFileFork identifies a particular fork of a relation.
-  */
- typedef struct RelFileFork
- {
-     RelFileNode rnode;
-     ForkNumber forknum;
- } RelFileFork;
-
  #endif   /* RELFILENODE_H */
--- 78,81 ----
*** src/include/storage/smgr.h
--- src/include/storage/smgr.h
***************
*** 65,74 **** extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
  extern void smgrclose(SMgrRelation reln);
  extern void smgrcloseall(void);
  extern void smgrclosenode(RelFileNode rnode);
! extern void smgrcreate(SMgrRelation reln, ForkNumber forknum,
!                        bool isTemp, bool isRedo);
! extern void smgrscheduleunlink(SMgrRelation reln, ForkNumber forknum,
!                                bool isTemp);
  extern void smgrdounlink(SMgrRelation reln, ForkNumber forknum,
                           bool isTemp, bool isRedo);
  extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--- 65,71 ----
  extern void smgrclose(SMgrRelation reln);
  extern void smgrcloseall(void);
  extern void smgrclosenode(RelFileNode rnode);
! extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
  extern void smgrdounlink(SMgrRelation reln, ForkNumber forknum,
                           bool isTemp, bool isRedo);
  extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
***************
*** 81,94 **** extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
  extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
                           BlockNumber nblocks, bool isTemp);
  extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
- extern void smgrDoPendingDeletes(bool isCommit);
- extern int smgrGetPendingDeletes(bool forCommit, RelFileFork **ptr,
-                       bool *haveNonTemp);
- extern void AtSubCommit_smgr(void);
- extern void AtSubAbort_smgr(void);
- extern void PostPrepare_smgr(void);
- extern void smgrcommit(void);
- extern void smgrabort(void);
  extern void smgrpreckpt(void);
  extern void smgrsync(void);
  extern void smgrpostckpt(void);
--- 78,83 ----

Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
I committed the changes to FSM truncation yesterday; that helps with the
truncation of the visibility map as well. Attached is an updated
visibility map patch.

There are two open issues:

1. The bits in the visibility map are set in the 1st phase of lazy
vacuum. That works, but it means that after a delete or update, it takes
two vacuums until the bit in the visibility map is set. The first vacuum
removes the dead tuple, and only the second sees that there are no dead
tuples left and sets the bit.

2. The output of VACUUM VERBOSE should be modified to say how many pages
were actually scanned. We should also consider what other information is
relevant, or no longer relevant, with partial vacuums; a rough sketch of
that bookkeeping follows below.
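
To make the second point a bit more concrete, here is a minimal sketch (not
part of the attached patch) of the kind of bookkeeping the main scan loop
could do: consult the visibility map for each block, skip all-visible pages
unless a full scan is requested, and count scanned vs. skipped pages so that
VACUUM VERBOSE could report both numbers. The counters scanned_pages and
skipped_pages are made-up names for illustration; visibilitymap_test() and
the vmbuffer pinning convention are the ones from the patch below.

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/*
 * Sketch of the per-block decision in lazy_scan_heap(): consult the
 * visibility map, skip all-visible pages unless scan_all is requested,
 * and keep counters that VACUUM VERBOSE could print afterwards.
 */
static void
count_scanned_pages(Relation onerel, BlockNumber nblocks, bool scan_all)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber blkno;
    BlockNumber scanned_pages = 0;  /* hypothetical counter */
    BlockNumber skipped_pages = 0;  /* hypothetical counter */

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* visibilitymap_test() keeps the map page pinned in vmbuffer */
        if (!scan_all && visibilitymap_test(onerel, blkno, &vmbuffer))
        {
            skipped_pages++;
            continue;           /* page is all-visible, nothing to do here */
        }

        scanned_pages++;
        /* ... the normal per-page vacuum work would go here ... */
    }

    /* release the pin on the last visibility map page we looked at */
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    ereport(INFO,
            (errmsg("scanned %u of %u pages, skipped %u all-visible pages",
                    scanned_pages, nblocks, skipped_pages)));
}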

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 **** subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o

  include $(top_srcdir)/src/backend/common.mk
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 196,202 ----
      int            ntup;
      OffsetNumber lineoff;
      ItemId        lpp;
+     bool        all_visible;

      Assert(page < scan->rs_nblocks);

***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
-             HeapTupleData loctup;
              bool        valid;

!             loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!             loctup.t_len = ItemIdGetLength(lpp);
!             ItemPointerSet(&(loctup.t_self), page, lineoff);

!             valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
--- 235,266 ----
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

+     /*
+      * If the all-visible flag indicates that all tuples on the page are
+      * visible to everyone, we can skip the per-tuple visibility tests.
+      */
+     all_visible = PageIsAllVisible(dp);
+
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
              bool        valid;

!             if (all_visible)
!                 valid = true;
!             else
!             {
!                 HeapTupleData loctup;
!
!                 loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!                 loctup.t_len = ItemIdGetLength(lpp);
!                 ItemPointerSet(&(loctup.t_self), page, lineoff);

!                 valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
!             }
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
***************
*** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1874,1880 ----
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
      Buffer        buffer;
+     bool        all_visible_cleared = false;

      if (relation->rd_rel->relhasoids)
      {
***************
*** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1935,1946 ----

      RelationPutHeapTuple(relation, buffer, heaptup);

+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+
      /*
       * XXX Should we set PageSetPrunable on this page ?
       *
***************
*** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1964,1970 ----
          Page        page = BufferGetPage(buffer);
          uint8        info = XLOG_HEAP_INSERT;

+         xlrec.all_visible_cleared = all_visible_cleared;
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = heaptup->t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 2016,2026 ----

      UnlockReleaseBuffer(buffer);

+     /* Clear the bit in the visibility map if necessary */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation,
+                             ItemPointerGetBlockNumber(&(heaptup->t_self)));
+
      /*
       * If tuple is cachable, mark it for invalidation from the caches in case
       * we abort.  Note it is OK to do this after releasing the buffer, because
***************
*** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid,
--- 2097,2103 ----
      Buffer        buffer;
      bool        have_tuple_lock = false;
      bool        iscombo;
+     bool        all_visible_cleared = false;

      Assert(ItemPointerIsValid(tid));

***************
*** 2216,2221 **** l1:
--- 2244,2255 ----
       */
      PageSetPrunable(page, xid);

+     if (PageIsAllVisible(page))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(page);
+     }
+
      /* store transaction information of xact deleting the tuple */
      tp.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
                                 HEAP_XMAX_INVALID |
***************
*** 2237,2242 **** l1:
--- 2271,2277 ----
          XLogRecPtr    recptr;
          XLogRecData rdata[2];

+         xlrec.all_visible_cleared = all_visible_cleared;
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = tp.t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 2281,2286 **** l1:
--- 2316,2325 ----
       */
      CacheInvalidateHeapTuple(relation, &tp);

+     /* Clear the bit in the visibility map if necessary */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
      /* Now we can release the buffer */
      ReleaseBuffer(buffer);

***************
*** 2388,2393 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2427,2434 ----
      bool        have_tuple_lock = false;
      bool        iscombo;
      bool        use_hot_update = false;
+     bool        all_visible_cleared = false;
+     bool        all_visible_cleared_new = false;

      Assert(ItemPointerIsValid(otid));

***************
*** 2763,2768 **** l2:
--- 2804,2815 ----
          MarkBufferDirty(newbuf);
      MarkBufferDirty(buffer);

+     /*
+      * Note: we mustn't clear the PD_ALL_VISIBLE flags before writing
+      * the WAL record, because log_heap_update looks at those flags and sets
+      * the corresponding flags in the WAL record.
+      */
+
      /* XLOG stuff */
      if (!relation->rd_istemp)
      {
***************
*** 2778,2783 **** l2:
--- 2825,2842 ----
          PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
      }

+     /* Clear PD_ALL_VISIBLE flags */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+     if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
+     {
+         all_visible_cleared_new = true;
+         PageClearAllVisible(BufferGetPage(newbuf));
+     }
+
      END_CRIT_SECTION();

      if (newbuf != buffer)
***************
*** 2791,2796 **** l2:
--- 2850,2861 ----
       */
      CacheInvalidateHeapTuple(relation, &oldtup);

+     /* Clear bits in visibility map */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+     if (all_visible_cleared_new)
+         visibilitymap_clear(relation, BufferGetBlockNumber(newbuf));
+
      /* Now we can release the buffer(s) */
      if (newbuf != buffer)
          ReleaseBuffer(newbuf);
***************
*** 3412,3417 **** l3:
--- 3477,3487 ----
      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);

      /*
+      * Don't update the visibility map here. Locking a tuple doesn't
+      * change visibility info.
+      */
+
+     /*
       * Now that we have successfully marked the tuple as locked, we can
       * release the lmgr tuple lock, if we had it.
       */
***************
*** 3916,3922 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 3986,3994 ----

      xlrec.target.node = reln->rd_node;
      xlrec.target.tid = from;
+     xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf));
      xlrec.newtid = newtup->t_self;
+     xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf));

      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapUpdate;
***************
*** 4186,4191 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4258,4274 ----
      ItemId        lp = NULL;
      HeapTupleHeader htup;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&(xlrec->target.tid)));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

***************
*** 4223,4228 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4306,4314 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /* Make sure there is no forward chain link in t_ctid */
      htup->t_ctid = xlrec->target.tid;
      PageSetLSN(page, lsn);
***************
*** 4249,4254 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4335,4351 ----
      Size        freespace;
      BlockNumber    blkno;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

***************
*** 4307,4312 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4404,4413 ----

      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
+
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      MarkBufferDirty(buffer);
      UnlockReleaseBuffer(buffer);

***************
*** 4347,4352 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4448,4464 ----
      uint32        newlen;
      Size        freespace;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
      {
          if (samepage)
***************
*** 4411,4416 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4523,4531 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /*
       * this test is ugly, but necessary to avoid thinking that insert change
       * is already applied
***************
*** 4426,4431 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4541,4557 ----

  newt:;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->new_all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_2)
          return;

***************
*** 4504,4509 **** newsame:;
--- 4630,4638 ----
      if (offnum == InvalidOffsetNumber)
          elog(PANIC, "heap_update_redo: failed to add tuple");

+     if (xlrec->new_all_visible_cleared)
+         PageClearAllVisible(page);
+
      freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */

      PageSetLSN(page, lsn);
*** /dev/null
--- src/backend/access/heap/visibilitymap.c
***************
*** 0 ****
--- 1,390 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.c
+  *      bitmap for tracking visibility of heap tuples
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      $PostgreSQL$
+  *
+  * NOTES
+  *
+  * The visibility map is a bitmap with one bit per heap page. A set bit means
+  * that all tuples on the page are visible to all transactions, and doesn't
+  * therefore need to be vacuumed.
+  *
+  * The map is conservative in the sense that we make sure that whenever a bit
+  * is set, we know the condition is true, but if a bit is not set, it might
+  * or might not be.
+  *
+  * There's no explicit WAL logging in the functions in this file. The callers
+  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+  * replay of the updating operation as well. Setting bits during recovery
+  * isn't necessary for correctness.
+  *
+  * LOCKING
+  *
+  * In heapam.c, whenever a page is modified so that not all tuples on the
+  * page are visible to everyone anymore, the corresponding bit in the
+  * visibility map is cleared. The bit in the visibility map is cleared
+  * after releasing the lock on the heap page, to avoid holding the lock
+  * over possible I/O to read in the visibility map page.
+  *
+  * To set a bit, you need to hold a lock on the heap page. That prevents
+  * the race condition where VACUUM sees that all tuples on the page are
+  * visible to everyone, but another backend modifies the page before VACUUM
+  * sets the bit in the visibility map.
+  *
+  * When a bit is set, we need to update the LSN of the page to make sure that
+  * the visibility map update doesn't get written to disk before the WAL record
+  * of the changes that made it possible to set the bit is flushed. But when a
+  * bit is cleared, we don't have to do that because it's always OK to clear
+  * a bit in the map from a correctness point of view.
+  *
+  * TODO
+  *
+  * It would be nice to use the visibility map to skip visibility checks in
+  * index scans.
+  *
+  * Currently, the visibility map is not 100% correct all the time.
+  * During updates, the bit in the visibility map is cleared after releasing
+  * the lock on the heap page. During the window between releasing the lock
+  * and clearing the bit in the visibility map, the bit in the visibility map
+  * is still set, but the new insertion or deletion is not yet visible to
+  * other backends.
+  *
+  * That might actually be OK for the index scans, though. The newly inserted
+  * tuple wouldn't have an index pointer yet, so all tuples reachable from an
+  * index would still be visible to all other backends, and deletions wouldn't
+  * be visible to other backends yet.
+  *
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+
+ #include "access/visibilitymap.h"
+ #include "storage/bufmgr.h"
+ #include "storage/bufpage.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+
+ /*#define TRACE_VISIBILITYMAP */
+
+ /* Number of bits allocated for each heap block. */
+ #define BITS_PER_HEAPBLOCK 1
+
+ /* Number of heap blocks we can represent in one byte. */
+ #define HEAPBLOCKS_PER_BYTE 8
+
+ /* Number of heap blocks we can represent in one visibility map page */
+ #define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) * HEAPBLOCKS_PER_BYTE )
+
+ /* Mapping from heap block number to the right bit in the visibility map */
+ #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+ #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+
+ static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
+ static void vm_extend(Relation rel, BlockNumber nvmblocks, bool createstorage);
+
+ /*
+  * Read a visibility map page.
+  *
+  * If the page doesn't exist, InvalidBuffer is returned, unless 'extend' is
+  * true, in which case the visibility map file is extended to cover it.
+  */
+ static Buffer
+ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
+ {
+     Buffer buf;
+
+     RelationOpenSmgr(rel);
+
+     if (rel->rd_vm_nblocks_cache == InvalidBlockNumber ||
+         rel->rd_vm_nblocks_cache <= blkno)
+     {
+         if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+             vm_extend(rel, blkno + 1, true);
+         else
+             rel->rd_vm_nblocks_cache = smgrnblocks(rel->rd_smgr,
+                                                    VISIBILITYMAP_FORKNUM);
+     }
+
+     if (blkno >= rel->rd_vm_nblocks_cache)
+     {
+         if (extend)
+             vm_extend(rel, blkno + 1, false);
+         else
+             return InvalidBuffer;
+     }
+
+     /*
+      * Use ZERO_ON_ERROR mode, and initialize the page if necessary. XXX It's
+      * always safe to clear bits here, so it's better to clear corrupt pages
+      * than error out. Since the visibility map changes are not WAL-logged, the
+      * so-called torn page problem on crash can lead to pages with corrupt
+      * headers, for example.
+      */
+     buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+                              RBM_ZERO_ON_ERROR, NULL);
+     if (PageIsNew(BufferGetPage(buf)))
+         PageInit(BufferGetPage(buf), BLCKSZ, 0);
+     return buf;
+ }
+
+ /*
+  * Ensure that the visibility map fork is at least n_vmblocks long, extending
+  * it if necessary with zero-filled pages, i.e. pages with all bits clear.
+  * If createstorage is true, the physical file might need to be created
+  * first.
+  */
+ static void
+ vm_extend(Relation rel, BlockNumber n_vmblocks, bool createstorage)
+ {
+     BlockNumber n_vmblocks_now;
+     Page pg;
+
+     pg = (Page) palloc(BLCKSZ);
+     PageInit(pg, BLCKSZ, 0);
+
+     /*
+      * We use the relation extension lock to lock out other backends
+      * trying to extend the visibility map at the same time. It also locks out
+      * extension of the main fork, unnecessarily, but extending the
+      * visibility map happens seldom enough that it doesn't seem worthwhile to
+      * have a separate lock tag type for it.
+      *
+      * Note that another backend might have extended or created the
+      * relation before we get the lock.
+      */
+     LockRelationForExtension(rel, ExclusiveLock);
+
+     /* Create the file first if it doesn't exist */
+     if (createstorage && !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+     {
+         smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+         n_vmblocks_now = 0;
+     }
+     else
+         n_vmblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+
+     while (n_vmblocks_now < n_vmblocks)
+     {
+         smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, n_vmblocks_now,
+                    (char *) pg, rel->rd_istemp);
+         n_vmblocks_now++;
+     }
+
+     UnlockRelationForExtension(rel, ExclusiveLock);
+
+     pfree(pg);
+
+     /* update the cache with the up-to-date size */
+     rel->rd_vm_nblocks_cache = n_vmblocks_now;
+ }
+
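+ /*
+  * Truncate the visibility map to match a heap relation that has been
+  * truncated to nheapblocks blocks.
+  */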
+ void
+ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     uint32        truncByte  = HEAPBLK_TO_MAPBYTE(nheapblocks);
+     uint8        truncBit   = HEAPBLK_TO_MAPBIT(nheapblocks);
+     BlockNumber newnblocks;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     /*
+      * If no visibility map has been created yet for this relation, there's
+      * nothing to truncate.
+      */
+     if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+         return;
+
+     /* Truncate away pages that are no longer needed */
+     if (truncByte == 0 && truncBit == 0)
+         newnblocks = truncBlock;
+     else
+     {
+         Buffer mapBuffer;
+         Page page;
+         char *mappage;
+         int len;
+
+         newnblocks = truncBlock + 1;
+
+         /*
+          * Clear all bits in the last map page that represent the truncated
+          * heap blocks. This is not only tidy, but also necessary because
+          * we don't clear the bits on extension.
+          */
+         mapBuffer = vm_readbuf(rel, truncBlock, false);
+         if (BufferIsValid(mapBuffer))
+         {
+             page = BufferGetPage(mapBuffer);
+             mappage = PageGetContents(page);
+
+             LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+             /*
+              * Clear out the unwanted bytes.
+              */
+             len = HEAPBLOCKS_PER_PAGE/HEAPBLOCKS_PER_BYTE - (truncByte + 1);
+             MemSet(&mappage[truncByte + 1], 0, len);
+
+             /*
+              * Mask out the unwanted bits of the last remaining byte
+              *
+              * ((1 << 0) - 1) = 00000000
+              * ((1 << 1) - 1) = 00000001
+              * ...
+              * ((1 << 6) - 1) = 00111111
+              * ((1 << 7) - 1) = 01111111
+              */
+             mappage[truncByte] &= (1 << truncBit) - 1;
+
+             /*
+              * This needs to be WAL-logged. Although the now-unused bits
+              * shouldn't be accessed anymore, they had better be zero if we
+              * extend the map again.
+              */
+
+             MarkBufferDirty(mapBuffer);
+             UnlockReleaseBuffer(mapBuffer);
+         }
+     }
+
+     if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) > newnblocks)
+         smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks,
+                      rel->rd_istemp);
+ }
+
+ /*
+  * Marks that all tuples on a heap page are visible to all.
+  *
+  * recptr is the LSN of the heap page. The LSN of the visibility map
+  * page is advanced to that, to make sure that the visibility map doesn't
+  * get flushed to disk before update to the heap page that made all tuples
+  * visible.
+  *
+  * *buf is a buffer previously returned by visibilitymap_test(). This is
+  * an opportunistic function; if *buf doesn't contain the bit for heapBlk,
+  * we do nothing. We don't want to do any I/O here, because the caller is
+  * holding a cleanup lock on the heap page.
+  */
+ void
+ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
+                   Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Page        page;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
+         return;
+
+     page = BufferGetPage(*buf);
+     mappage = PageGetContents(page);
+     LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+
+     if (!(mappage[mapByte] & (1 << mapBit)))
+     {
+         mappage[mapByte] |= (1 << mapBit);
+
+         if (XLByteLT(PageGetLSN(page), recptr))
+             PageSetLSN(page, recptr);
+         PageSetTLI(page, ThisTimeLineID);
+         MarkBufferDirty(*buf);
+     }
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+  * Are all tuples on heap page visible to all?
+  *
+  * The page containing the bit for the heap block is (kept) pinned,
+  * and *buf is set to that buffer. If *buf is valid on entry, it should
+  * be a buffer previously returned by this function, for the same relation,
+  * and unless the new heap block is on the same page, it is released. On the
+  * first call, InvalidBuffer should be passed, and when the caller doesn't
+  * want to test any more pages, it should release *buf if it's valid.
+  */
+ bool
+ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     bool        val;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     if (BufferIsValid(*buf))
+     {
+         /* Reuse the buffer only if it holds the right map page */
+         if (BufferGetBlockNumber(*buf) != mapBlock)
+         {
+             ReleaseBuffer(*buf);
+             *buf = InvalidBuffer;
+         }
+     }
+
+     if (!BufferIsValid(*buf))
+     {
+         *buf = vm_readbuf(rel, mapBlock, true);
+         if (!BufferIsValid(*buf))
+             return false;
+     }
+
+     mappage = PageGetContents(BufferGetPage(*buf));
+
+     /*
+      * We don't need to lock the page, as we're only looking at a single bit.
+      */
+     val = (mappage[mapByte] & (1 << mapBit)) ? true : false;
+
+     return val;
+ }
+
+ /*
+  * Mark that not all tuples are visible to all.
+  */
+ void
+ visibilitymap_clear(Relation rel, BlockNumber heapBlk)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     int            mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     int            mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     uint8        mask = 1 << mapBit;
+     Buffer        mapBuffer;
+     char       *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     mapBuffer = vm_readbuf(rel, mapBlock, false);
+     if (!BufferIsValid(mapBuffer))
+         return; /* nothing to do */
+
+     LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+     mappage = PageGetContents(BufferGetPage(mapBuffer));
+
+     if (mappage[mapByte] & mask)
+     {
+         mappage[mapByte] &= ~mask;
+
+         MarkBufferDirty(mapBuffer);
+     }
+
+     UnlockReleaseBuffer(mapBuffer);
+ }
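
As a quick sanity check of the addressing scheme described in the NOTES
above, here is a small standalone program, separate from the patch, that
evaluates the same HEAPBLK_TO_* macros with concrete numbers. BLCKSZ = 8192
and a 24-byte page header are assumptions made only to get figures; with
those values one visibility map page covers 65344 heap blocks, i.e. roughly
510 MB of heap.

#include <stdio.h>

/* Assumed values, only to get concrete numbers for the example. */
#define BLCKSZ 8192
#define SIZE_OF_PAGE_HEADER 24

#define HEAPBLOCKS_PER_BYTE 8
#define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SIZE_OF_PAGE_HEADER) * HEAPBLOCKS_PER_BYTE)

/* Same mapping macros as in visibilitymap.c above */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)  (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)   ((x) % HEAPBLOCKS_PER_BYTE)

int
main(void)
{
    unsigned int heapblk = 100000;      /* an arbitrary heap block number */

    printf("heap block %u -> map block %u, byte %u, bit %u\n",
           heapblk,
           HEAPBLK_TO_MAPBLOCK(heapblk),
           HEAPBLK_TO_MAPBYTE(heapblk),
           HEAPBLK_TO_MAPBIT(heapblk));
    printf("one visibility map page covers %u heap blocks\n",
           (unsigned int) HEAPBLOCKS_PER_PAGE);
    return 0;
}
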
*** src/backend/access/transam/xlogutils.c
--- src/backend/access/transam/xlogutils.c
***************
*** 377,382 **** CreateFakeRelcacheEntry(RelFileNode rnode)
--- 377,383 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     rel->rd_vm_nblocks_cache = InvalidBlockNumber;
      rel->rd_smgr = NULL;

      return rel;
*** src/backend/catalog/catalog.c
--- src/backend/catalog/catalog.c
***************
*** 54,60 ****
   */
  const char *forkNames[] = {
      "main", /* MAIN_FORKNUM */
!     "fsm"   /* FSM_FORKNUM */
  };

  /*
--- 54,61 ----
   */
  const char *forkNames[] = {
      "main", /* MAIN_FORKNUM */
!     "fsm",   /* FSM_FORKNUM */
!     "vm"   /* VISIBILITYMAP_FORKNUM */
  };

  /*
*** src/backend/catalog/heap.c
--- src/backend/catalog/heap.c
***************
*** 33,38 ****
--- 33,39 ----
  #include "access/heapam.h"
  #include "access/sysattr.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "catalog/dependency.h"
*** src/backend/catalog/storage.c
--- src/backend/catalog/storage.c
***************
*** 19,24 ****
--- 19,25 ----

  #include "postgres.h"

+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 175,180 **** void
--- 176,182 ----
  RelationTruncate(Relation rel, BlockNumber nblocks)
  {
      bool fsm;
+     bool vm;

      /* Open it at the smgr level if not already done */
      RelationOpenSmgr(rel);
***************
*** 187,192 **** RelationTruncate(Relation rel, BlockNumber nblocks)
--- 189,199 ----
      if (fsm)
          FreeSpaceMapTruncateRel(rel, nblocks);

+     /* Truncate the visibility map too if it exists. */
+     vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+     if (vm)
+         visibilitymap_truncate(rel, nblocks);
+
      /*
       * We WAL-log the truncation before actually truncating, which
       * means trouble if the truncation fails. If we then crash, the WAL
***************
*** 222,228 **** RelationTruncate(Relation rel, BlockNumber nblocks)
           * left with a truncated heap, but the FSM would still contain
           * entries for the non-existent heap pages.
           */
!         if (fsm)
              XLogFlush(lsn);
      }

--- 229,235 ----
           * left with a truncated heap, but the FSM would still contain
           * entries for the non-existent heap pages.
           */
!         if (fsm || vm)
              XLogFlush(lsn);
      }

*** src/backend/commands/vacuum.c
--- src/backend/commands/vacuum.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlog.h"
  #include "catalog/namespace.h"
***************
*** 2902,2907 **** move_chain_tuple(Relation rel,
--- 2903,2914 ----
      Size        tuple_len = old_tup->t_len;

      /*
+      * Clear the bits in the visibility map.
+      */
+     visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+
+     /*
       * make a modifiable copy of the source tuple.
       */
      heap_copytuple_with_tuple(old_tup, &newtup);
***************
*** 3005,3010 **** move_chain_tuple(Relation rel,
--- 3012,3021 ----

      END_CRIT_SECTION();

+     PageClearAllVisible(BufferGetPage(old_buf));
+     if (dst_buf != old_buf)
+         PageClearAllVisible(BufferGetPage(dst_buf));
+
      LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK);
      if (dst_buf != old_buf)
          LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
***************
*** 3107,3112 **** move_plain_tuple(Relation rel,
--- 3118,3140 ----

      END_CRIT_SECTION();

+     /*
+      * Clear the visible-to-all hint bits on the page, and bits in the
+      * visibility map. Normally we'd release the locks on the heap pages
+      * before updating the visibility map, but doesn't really matter here
+      * because we're holding an AccessExclusiveLock on the relation anyway.
+      */
+     if (PageIsAllVisible(dst_page))
+     {
+         PageClearAllVisible(dst_page);
+         visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+     }
+     if (PageIsAllVisible(old_page))
+     {
+         PageClearAllVisible(old_page);
+         visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     }
+
      dst_vacpage->free = PageGetFreeSpaceWithFillFactor(rel, dst_page);
      LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK);
      LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
*** src/backend/commands/vacuumlazy.c
--- src/backend/commands/vacuumlazy.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "catalog/storage.h"
  #include "commands/dbcommands.h"
  #include "commands/vacuum.h"
***************
*** 88,93 **** typedef struct LVRelStats
--- 89,95 ----
      int            max_dead_tuples;    /* # slots allocated in array */
      ItemPointer dead_tuples;    /* array of ItemPointerData */
      int            num_index_scans;
+     bool        scanned_all;    /* have we scanned all pages (so far) in the rel? */
  } LVRelStats;


***************
*** 102,108 **** static BufferAccessStrategy vac_strategy;

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
--- 104,110 ----

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
***************
*** 141,146 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
--- 143,149 ----
      BlockNumber possibly_freeable;
      PGRUsage    ru0;
      TimestampTz starttime = 0;
+     bool        scan_all;

      pg_rusage_init(&ru0);

***************
*** 166,173 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
--- 169,185 ----
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

+     /* Should we use the visibility map or scan all pages? */
+     if (vacstmt->freeze_min_age != -1)
+         scan_all = true;
+     else
+         scan_all = false;
+
+     /* assume we scan every page, until we actually skip one */
+     vacrelstats->scanned_all = true;
+
      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
***************
*** 189,195 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
      /* Update statistics in pg_class */
      vac_update_relstats(onerel,
                          vacrelstats->rel_pages, vacrelstats->rel_tuples,
!                         vacrelstats->hasindex, FreezeLimit);

      /* report results to the stats collector, too */
      pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared,
--- 201,208 ----
      /* Update statistics in pg_class */
      vac_update_relstats(onerel,
                          vacrelstats->rel_pages, vacrelstats->rel_tuples,
!                         vacrelstats->hasindex,
!                         vacrelstats->scanned_all ? FreezeLimit : InvalidTransactionId);

      /* report results to the stats collector, too */
      pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared,
***************
*** 230,236 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes)
  {
      BlockNumber nblocks,
                  blkno;
--- 243,249 ----
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all)
  {
      BlockNumber nblocks,
                  blkno;
***************
*** 245,250 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 258,264 ----
      IndexBulkDeleteResult **indstats;
      int            i;
      PGRUsage    ru0;
+     Buffer        vmbuffer = InvalidBuffer;

      pg_rusage_init(&ru0);

***************
*** 278,283 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 292,315 ----
          OffsetNumber frozen[MaxOffsetNumber];
          int            nfrozen;
          Size        freespace;
+         bool        all_visible_according_to_vm;
+         bool        all_visible;
+
+         /*
+          * If all tuples on the page are visible to everyone, there's no
+          * need to visit the page at all.
+          *
+          * Note that we test the visibility map even if we're scanning all
+          * pages, to pin the visibility map page. We might set the bit there,
+          * and we don't want to do the I/O while we're holding the heap page
+          * locked.
+          */
+         all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer);
+         if (!scan_all && all_visible_according_to_vm)
+         {
+             vacrelstats->scanned_all = false;
+             continue;
+         }

          vacuum_delay_point();

***************
*** 354,359 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 386,398 ----
          {
              empty_pages++;
              freespace = PageGetHeapFreeSpace(page);
+
+             PageSetAllVisible(page);
+             /* Update the visibility map */
+             if (!all_visible_according_to_vm)
+                 visibilitymap_set(onerel, blkno, PageGetLSN(page),
+                                   &vmbuffer);
+
              UnlockReleaseBuffer(buf);
              RecordPageWithFreeSpace(onerel, blkno, freespace);
              continue;
***************
*** 371,376 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 410,416 ----
           * Now scan the page to collect vacuumable items and check for tuples
           * requiring freezing.
           */
+         all_visible = true;
          nfrozen = 0;
          hastup = false;
          prev_dead_count = vacrelstats->num_dead_tuples;
***************
*** 408,413 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 448,454 ----
              if (ItemIdIsDead(itemid))
              {
                  lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+                 all_visible = false;
                  continue;
              }

***************
*** 442,447 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 483,489 ----
                          nkeep += 1;
                      else
                          tupgone = true; /* we can delete the tuple */
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_LIVE:
                      /* Tuple is good --- but let's do some validity checks */
***************
*** 449,454 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 491,525 ----
                          !OidIsValid(HeapTupleGetOid(&tuple)))
                          elog(WARNING, "relation \"%s\" TID %u/%u: OID is invalid",
                               relname, blkno, offnum);
+
+                     /*
+                      * Definitely visible to all? Note that SetHintBits handles
+                      * async commit correctly
+                      */
+                     if (all_visible)
+                     {
+                         /*
+                          * Is it visible to all transactions? It's important
+                          * Is it visible to all transactions? It's important
+                          * that we look at the hint bit here: only if the
+                          * hint bit is set can we be sure that the tuple is
+                          * indeed live, even if asynchronous commit is in use
+                          * and we crash later.
+                         if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED))
+                         {
+                             all_visible = false;
+                             break;
+                         }
+                         /*
+                          * The inserter definitely committed. But is it
+                          * old enough that everyone sees it as committed?
+                          */
+                         if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(tuple.t_data), OldestXmin))
+                         {
+                             all_visible = false;
+                             break;
+                         }
+                     }
                      break;
                  case HEAPTUPLE_RECENTLY_DEAD:

***************
*** 457,468 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 528,542 ----
                       * from relation.
                       */
                      nkeep += 1;
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_INSERT_IN_PROGRESS:
                      /* This is an expected case during concurrent vacuum */
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_DELETE_IN_PROGRESS:
                      /* This is an expected case during concurrent vacuum */
+                     all_visible = false;
                      break;
                  default:
                      elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
***************
*** 525,530 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 599,621 ----

          freespace = PageGetHeapFreeSpace(page);

+         /* Update the all-visible flag on the page */
+         if (!PageIsAllVisible(page) && all_visible)
+         {
+             SetBufferCommitInfoNeedsSave(buf);
+             PageSetAllVisible(page);
+         }
+         else if (PageIsAllVisible(page) && !all_visible)
+         {
+             elog(WARNING, "all-visible flag was incorrectly set");
+             SetBufferCommitInfoNeedsSave(buf);
+             PageClearAllVisible(page);
+         }
+
+         /* Update the visibility map */
+         if (!all_visible_according_to_vm && all_visible)
+             visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
+
          /* Remember the location of the last page with nonremovable tuples */
          if (hastup)
              vacrelstats->nonempty_pages = blkno + 1;
***************
*** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 651,663 ----
          vacrelstats->num_index_scans++;
      }

+     /* Release the pin on the visibility map page */
+     if (BufferIsValid(vmbuffer))
+     {
+         ReleaseBuffer(vmbuffer);
+         vmbuffer = InvalidBuffer;
+     }
+
      /* Do post-vacuum cleanup and statistics update for each index */
      for (i = 0; i < nindexes; i++)
          lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
***************
*** 623,628 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
--- 721,735 ----
          LockBufferForCleanup(buf);
          tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);

+         /*
+          * Before we let the page go, prune it. The primary reason is to
+          * update the visibility map in the common special case that we just
+          * vacuumed away the last tuple on the page that wasn't visible to
+          * everyone.
+          */
+         vacrelstats->tuples_deleted +=
+             heap_page_prune(onerel, buf, OldestXmin, false, false);
+
          /* Now that we've compacted the page, record its available space */
          page = BufferGetPage(buf);
          freespace = PageGetHeapFreeSpace(page);
*** src/backend/storage/freespace/freespace.c
--- src/backend/storage/freespace/freespace.c
***************
*** 555,562 **** fsm_extend(Relation rel, BlockNumber n_fsmblocks, bool createstorage)
       * FSM happens seldom enough that it doesn't seem worthwhile to
       * have a separate lock tag type for it.
       *
!      * Note that another backend might have extended the relation
!      * before we get the lock.
       */
      LockRelationForExtension(rel, ExclusiveLock);

--- 555,562 ----
       * FSM happens seldom enough that it doesn't seem worthwhile to
       * have a separate lock tag type for it.
       *
!      * Note that another backend might have extended or created the
!      * relation before we get the lock.
       */
      LockRelationForExtension(rel, ExclusiveLock);

*** src/backend/storage/smgr/smgr.c
--- src/backend/storage/smgr/smgr.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "catalog/catalog.h"
  #include "commands/tablespace.h"
  #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
  #include "storage/ipc.h"
  #include "storage/smgr.h"
  #include "utils/hsearch.h"
*** src/backend/utils/cache/relcache.c
--- src/backend/utils/cache/relcache.c
***************
*** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp)
--- 305,311 ----
      MemSet(relation, 0, sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1377,1382 **** formrdesc(const char *relationName, Oid relationReltype,
--- 1378,1384 ----
      relation = (Relation) palloc0(sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1665,1673 **** RelationReloadIndexInfo(Relation relation)
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /* Must reset targblock and fsm_nblocks_cache in case rel was truncated */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
--- 1667,1676 ----
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /* Must reset targblock and fsm_nblocks_cache and vm_nblocks_cache in case rel was truncated */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     relation->rd_vm_nblocks_cache = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
***************
*** 1751,1756 **** RelationClearRelation(Relation relation, bool rebuild)
--- 1754,1760 ----
      {
          relation->rd_targblock = InvalidBlockNumber;
          relation->rd_fsm_nblocks_cache = InvalidBlockNumber;
+         relation->rd_vm_nblocks_cache = InvalidBlockNumber;
          if (relation->rd_rel->relkind == RELKIND_INDEX)
          {
              relation->rd_isvalid = false;        /* needs to be revalidated */
***************
*** 2346,2351 **** RelationBuildLocalRelation(const char *relname,
--- 2350,2356 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+     rel->rd_vm_nblocks_cache = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      rel->rd_smgr = NULL;
***************
*** 3603,3608 **** load_relcache_init_file(void)
--- 3608,3614 ----
          rel->rd_smgr = NULL;
          rel->rd_targblock = InvalidBlockNumber;
          rel->rd_fsm_nblocks_cache = InvalidBlockNumber;
+         rel->rd_vm_nblocks_cache = InvalidBlockNumber;
          if (rel->rd_isnailed)
              rel->rd_refcnt = 1;
          else
*** src/include/access/heapam.h
--- src/include/access/heapam.h
***************
*** 153,158 **** extern void heap_page_prune_execute(Buffer buffer,
--- 153,159 ----
                          OffsetNumber *nowunused, int nunused,
                          bool redirect_move);
  extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
+ extern void heap_page_update_all_visible(Buffer buffer);

  /* in heap/syncscan.c */
  extern void ss_report_location(Relation rel, BlockNumber location);
*** src/include/access/htup.h
--- src/include/access/htup.h
***************
*** 601,606 **** typedef struct xl_heaptid
--- 601,607 ----
  typedef struct xl_heap_delete
  {
      xl_heaptid    target;            /* deleted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
  } xl_heap_delete;

  #define SizeOfHeapDelete    (offsetof(xl_heap_delete, target) + SizeOfHeapTid)
***************
*** 626,641 **** typedef struct xl_heap_header
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, target) + SizeOfHeapTid)

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
--- 627,645 ----
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
+     bool new_all_visible_cleared; /* same for the page of newtid */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
*** /dev/null
--- src/include/access/visibilitymap.h
***************
*** 0 ****
--- 1,28 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.h
+  *      visibility map interface
+  *
+  *
+  * Portions Copyright (c) 2007, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef VISIBILITYMAP_H
+ #define VISIBILITYMAP_H
+
+ #include "utils/rel.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "access/xlogdefs.h"
+
+ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk,
+                               XLogRecPtr recptr, Buffer *vmbuf);
+ extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk);
+ extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+ extern void visibilitymap_truncate(Relation rel, BlockNumber heapblk);
+
+ #endif   /* VISIBILITYMAP_H */
*** src/include/storage/bufpage.h
--- src/include/storage/bufpage.h
***************
*** 152,159 **** typedef PageHeaderData *PageHeader;
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */

! #define PD_VALID_FLAG_BITS    0x0003        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
--- 152,161 ----
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */
+ #define PD_ALL_VISIBLE        0x0004        /* all tuples on page are visible to
+                                          * everyone */

! #define PD_VALID_FLAG_BITS    0x0007        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 336,341 **** typedef PageHeaderData *PageHeader;
--- 338,350 ----
  #define PageClearFull(page) \
      (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL)

+ #define PageIsAllVisible(page) \
+     (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE)
+ #define PageSetAllVisible(page) \
+     (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
+ #define PageClearAllVisible(page) \
+     (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+
  #define PageIsPrunable(page, oldestxmin) \
  ( \
      AssertMacro(TransactionIdIsNormal(oldestxmin)), \
*** src/include/storage/relfilenode.h
--- src/include/storage/relfilenode.h
***************
*** 24,37 **** typedef enum ForkNumber
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;

! #define MAX_FORKNUM        FSM_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
--- 24,38 ----
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM,
!     VISIBILITYMAP_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;

! #define MAX_FORKNUM        VISIBILITYMAP_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
*** src/include/utils/rel.h
--- src/include/utils/rel.h
***************
*** 195,202 **** typedef struct RelationData
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /* Cached last-seen size of the FSM */
      BlockNumber    rd_fsm_nblocks_cache;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */
--- 195,203 ----
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /* Cached last-seen size of the FSM and visibility map */
      BlockNumber    rd_fsm_nblocks_cache;
+     BlockNumber    rd_vm_nblocks_cache;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */

Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I committed the changes to FSM truncation yesterday, that helps with the 
> truncation of the visibility map as well. Attached is an updated 
> visibility map patch.

I looked over this patch a bit ...

> 1. The bits in the visibility map are set in the 1st phase of lazy 
> vacuum. That works, but it means that after a delete or update, it takes 
> two vacuums until the bit in the visibility map is set. The first vacuum 
> removes the dead tuple, and only the second sees that there's no dead 
> tuples and sets the bit.

I think this is probably not a big issue really.  The point of this change
is to optimize things for pages that are static over the long term; one
extra vacuum cycle before the page is deemed static doesn't seem like a
problem.  You could even argue that this saves I/O because we don't set
the bit (and perhaps later have to clear it) until we know that the page
has stayed static across a vacuum cycle and thus has a reasonable
probability of continuing to do so.

A possible problem is that if a relation is filled all in one shot,
autovacuum would trigger a single vacuum cycle on it and then never have
a reason to trigger another; leading to the bits never getting set (or
at least not till an antiwraparound vacuum occurs).  We might want to
tweak autovac so that an extra vacuum cycle occurs in this case.  But
I'm not quite sure what a reasonable heuristic would be.

Some other points:

* ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE
page header flag *must* be correct, but it's really okay if the backing
map bit *isn't* correct --- in particular we don't trust the map bit
when performing antiwraparound vacuums.  This isn't well documented.

* Also, I see that vacuum has a provision for clearing an incorrectly
set PD_ALL_VISIBLE flag, but shouldn't it fix the map too?

* It would be good if the visibility map fork were never created until
there is occasion to set a bit in it; this would for instance typically
mean that temp tables would never have one.  I think that
visibilitymap.c doesn't get this quite right --- in particular
vm_readbuf seems willing to create/extend the fork whether its extend
argument is true or not, so it looks like an inquiry operation would
cause the map fork to be created.  It should be possible to act as
though a nonexistent fork just means "all zeroes".

* heap_insert's all_visible_cleared variable doesn't seem to get
initialized --- didn't your compiler complain?

* You missed updating SizeOfHeapDelete and SizeOfHeapUpdate
        regards, tom lane


Re: Visibility map, partial vacuums

From
Jeff Davis
Date:
On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote:
> A possible problem is that if a relation is filled all in one shot,
> autovacuum would trigger a single vacuum cycle on it and then never have
> a reason to trigger another; leading to the bits never getting set (or
> at least not till an antiwraparound vacuum occurs).  We might want to
> tweak autovac so that an extra vacuum cycle occurs in this case.  But
> I'm not quite sure what a reasonable heuristic would be.
> 

This would only be an issue if using the visibility map for things other
than partial vacuum (e.g. index-only scan), right? If we never do
another VACUUM, we don't need partial vacuum.

Regards,
Jeff Davis



Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote:
>> A possible problem is that if a relation is filled all in one shot,
>> autovacuum would trigger a single vacuum cycle on it and then never have
>> a reason to trigger another; leading to the bits never getting set (or
>> at least not till an antiwraparound vacuum occurs).

> This would only be an issue if using the visibility map for things other
> than partial vacuum (e.g. index-only scan), right? If we never do
> another VACUUM, we don't need partial vacuum.

Well, the patch already uses the page header bits for optimization of
seqscans, and could probably make good use of them for bitmap scans too.
It'd be nice if the page header bits got set even if the map bits
didn't.

Reflecting on it though, maybe Heikki described the behavior too
pessimistically anyway.  If a page contains no dead tuples, it should
get its bits set on first visit anyhow, no?  So for the ordinary bulk
load scenario where there are no failed insertions, the first vacuum
pass should set all the bits ... at least, if enough time has passed
for RecentXmin to be past the inserting transaction.

However, my comment above was too optimistic, because in an insert-only
scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.

So it seems like we do indeed want to rejigger autovac's rules a bit
to account for the possibility of wanting to apply vacuum to get
visibility bits set.
        regards, tom lane


Re: Visibility map, partial vacuums

From
"Matthew T. O'Connor"
Date:
Tom Lane wrote:
> However, my comment above was too optimistic, because in an insert-only
> scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.
>
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.

I'm sure I'm missing something, but I thought the point of this was to 
lessen the impact of VACUUM and now you are suggesting that we have to 
add vacuums to tables that have never needed one before.


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Reflecting on it though, maybe Heikki described the behavior too
> pessimistically anyway.  If a page contains no dead tuples, it should
> get its bits set on first visit anyhow, no?  So for the ordinary bulk
> load scenario where there are no failed insertions, the first vacuum
> pass should set all the bits ... at least, if enough time has passed
> for RecentXmin to be past the inserting transaction.

Right. I did say "... after a delete or update, it takes two vacuums 
until ..." in my mail.

> However, my comment above was too optimistic, because in an insert-only
> scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.
> 
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.

I'm not too excited about triggering an extra vacuum. As Matthew pointed 
out, the point of this patch is to reduce the number of vacuums 
required, not increase it. If you're not going to vacuum a table, you 
don't care if the bits in the visibility map are set or not.

We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs, 
if we want to make the seqscan optimization more effective. For example, 
a seqscan could set the flag too, if it sees that all the tuples were 
visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set.
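
For illustration only, a sketch of that check (not part of the patch; "page" is the heap page just scanned, and the caller is assumed to hold an exclusive lock on it). Note that the hint bits only prove that the inserter committed and the tuple isn't deleted, not that the insertion precedes every open snapshot, so a real version would need an additional xmin test as well:

    /* Sketch: set PD_ALL_VISIBLE if every tuple carries the right hint bits. */
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    bool         all_visible = true;

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId          lp = PageGetItemId(page, offnum);
        HeapTupleHeader htup;

        if (!ItemIdIsUsed(lp) || ItemIdIsRedirected(lp))
            continue;               /* unused slots and HOT redirects are fine */
        if (!ItemIdIsNormal(lp))
        {
            all_visible = false;    /* dead line pointer */
            break;
        }
        htup = (HeapTupleHeader) PageGetItem(page, lp);
        if (!(htup->t_infomask & HEAP_XMIN_COMMITTED) ||
            !(htup->t_infomask & HEAP_XMAX_INVALID))
        {
            all_visible = false;
            break;
        }
    }
    if (all_visible && !PageIsAllVisible(page))
        PageSetAllVisible(page);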

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> * ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE
> page header flag *must* be correct, but it's really okay if the backing
> map bit *isn't* correct --- in particular we don't trust the map bit
> when performing antiwraparound vacuums.  This isn't well documented.

Right. Will add comments.

We can't use the map bit for antiwraparound vacuums, because the bit 
doesn't tell you when the tuples have been frozen. And we can't advance 
relfrozenxid if we've skipped any pages.

I've been thinking that we could add one frozenxid field to each 
visibility map page, for the oldest xid on the heap pages covered by the 
visibility map page. That would allow more fine-grained anti-wraparound 
vacuums as well.
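
Just to illustrate the idea (nothing like this exists in the patch, and the names here are invented), the per-map-page frozenxid could live in the map page's special space:

    /* Hypothetical: reserve special space on each visibility map page for
     * the oldest unfrozen xid among the heap pages that page covers. */
    typedef struct VisibilityMapPageOpaque
    {
        TransactionId frozenxid;
    } VisibilityMapPageOpaque;

    /* vm_extend()/vm_readbuf() would then do
     *     PageInit(pg, BLCKSZ, sizeof(VisibilityMapPageOpaque));
     * and MAPSIZE would shrink by the same amount. */

(As comes up later in the thread, that field could only be advanced after scanning all the heap pages covered by the map page.)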

> * Also, I see that vacuum has a provision for clearing an incorrectly
> set PD_ALL_VISIBLE flag, but shouldn't it fix the map too?

Yes, will fix. Although, as long as we don't trust the visibility map, 
no real damage would be done.

> * It would be good if the visibility map fork were never created until
> there is occasion to set a bit in it; this would for instance typically
> mean that temp tables would never have one.  I think that
> visibilitymap.c doesn't get this quite right --- in particular
> vm_readbuf seems willing to create/extend the fork whether its extend
> argument is true or not, so it looks like an inquiry operation would
> cause the map fork to be created.  It should be possible to act as
> though a nonexistent fork just means "all zeroes".

The visibility map won't be inquired unless you vacuum. This is a bit 
tricky. In vacuum, we only know whether we can set a bit or not, after 
we've acquired a cleanup lock on the page, and scanned all the tuples. 
While we're holding a cleanup lock, we don't want to do I/O, which could 
potentially block out other processes for a long time. So it's too late 
to extend the visibility map at that point.

I agree that vm_readbuf should not create the fork if 'extend' is false, 
that's an oversight, but it won't change the actual behavior because 
visibilitymap_test calls it with 'extend' true, for the reason explained above.

I will add comments about that, though, there's nothing describing that 
currently.

> * heap_insert's all_visible_cleared variable doesn't seem to get
> initialized --- didn't your compiler complain?

Hmph, I must've been compiling with -O0.

> * You missed updating SizeOfHeapDelete and SizeOfHeapUpdate

Thanks.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I've been thinking that we could add one frozenxid field to each 
> visibility map page, for the oldest xid on the heap pages covered by the 
> visibility map page. That would allow more fine-grained anti-wraparound 
> vacuums as well.

This doesn't strike me as a particularly good idea.  Right now the map
is only hints as far as vacuum is concerned --- if you do the above then
the map becomes critical data.  And I don't really think you'll buy
much.

> The visibility map won't be inquired unless you vacuum. This is a bit 
> tricky. In vacuum, we only know whether we can set a bit or not, after 
> we've acquired a cleanup lock on the page, and scanned all the tuples. 
> While we're holding a cleanup lock, we don't want to do I/O, which could 
> potentially block out other processes for a long time. So it's too late 
> to extend the visibility map at that point.

This is no good; I think you've made the wrong tradeoffs.  In
particular, even though only vacuum *currently* uses the map, you want
to extend it to be used by indexscans.  So it's going to uselessly
spring into being even without vacuums.

I'm not convinced that I/O while holding cleanup lock is so bad that we
should break other aspects of the system to avoid it.  However, if you
want to stick to that, how about
    * vacuum page, possibly set its header bit
    * release page lock (but not pin)
    * if we need to set the bit, fetch the corresponding map page
      (I/O might happen here)
    * get share lock on heap page, then recheck its header bit;
      if still set, set the map bit
 
        regards, tom lane


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> So it seems like we do indeed want to rejigger autovac's rules a bit
>> to account for the possibility of wanting to apply vacuum to get
>> visibility bits set.

> I'm not too excited about triggering an extra vacuum. As Matthew pointed 
> out, the point of this patch is to reduce the number of vacuums 
> required, not increase it. If you're not going to vacuum a table, you 
> don't care if the bits in the visibility map are set or not.

But it's already the case that the bits provide a performance increase
to other things besides vacuum.

> We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs, 
> if we want to make the seqscan optimization more effective. For example, 
> a seqscan could set the flag too, if it sees that all the tuples were 
> visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set.

I was wondering whether we could teach heap_page_prune to set the flag
without adding any extra tuple visibility checks.  A seqscan per se
shouldn't be doing this because it doesn't normally call
HeapTupleSatisfiesVacuum.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I've been thinking that we could add one frozenxid field to each 
>> visibility map page, for the oldest xid on the heap pages covered by the 
>> visibility map page. That would allow more fine-grained anti-wraparound 
>> vacuums as well.
>
> This doesn't strike me as a particularly good idea.  Right now the map
> is only hints as far as vacuum is concerned --- if you do the above then
> the map becomes critical data.  And I don't really think you'll buy
> much.

Hm, that depends on how critical the critical data is. It's critical that the
frozenxid that autovacuum sees is no more recent than the actual frozenxid,
but not critical that it be entirely up-to-date otherwise.

So if it's possible for the frozenxid in the visibility map to go backwards
then it's no good, since if that update is lost we might skip a necessary
vacuum freeze. But if we guarantee that we never update the frozenxid in the
visibility map forward ahead of recentglobalxmin then it can't ever go
backwards. (Well, not in a way that matters)

However I'm a bit puzzled how you could possibly maintain this frozenxid. As
soon as you freeze an xid you'll have to visit all the other pages covered by
that visibility map page to see what the new value should be.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> However I'm a bit puzzled how you could possibly maintain this frozenxid. As
> soon as you freeze an xid you'll have to visit all the other pages covered by
> that visibility map page to see what the new value should be.

Right, you could only advance it when you scan all the pages covered by 
the visibility map page. But that's better than having to scan the whole 
relation.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Gregory Stark wrote:
>> However I'm a bit puzzled how you could possibly maintain this frozenxid. As
>> soon as you freeze an xid you'll have to visit all the other pages covered by
>> that visibility map page to see what the new value should be.
>
> Right, you could only advance it when you scan all the pages covered by the
> visibility map page. But that's better than having to scan the whole relation.

Is it? It seems like that would just move around the work. You'll still have
to visit every page once every 2B transactions or so. You'll just do it 64MB at
a time. 

It's nice to smooth the work but it would be much nicer to detect that a
normal vacuum has already processed all of those pages since the last
insert/update/delete on those pages and so avoid the work entirely.

To avoid the work entirely you need some information about the oldest xid on
those pages seen by regular vacuums (and/or prunes). 

We would want to skip any page which:

a) Has been visited by vacuum freeze and not been updated since 

b) Has been visited by a regular vacuum and the oldest xid found was more  recent than freeze_threshold.

c) Has been updated frequently such that no old tuples remain

Ideally (b) should completely obviate the need for anti-wraparound freezes
entirely.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> So if it's possible for the frozenxid in the visibility map to go backwards
> then it's no good, since if that update is lost we might skip a necessary
> vacuum freeze.

Seems like a lost disk write would be enough to make that happen.

Now you might argue that the odds of that are no worse than the odds of
losing an update to one particular heap page, but in this case the
single hiccup could lead to losing half a gigabyte of data (assuming 8K
page size).  The leverage you get for saving vacuum freeze work is
exactly equal to the magnification factor for data loss.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Decibel!
Date:
On Nov 23, 2008, at 3:18 PM, Tom Lane wrote:
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.


That makes the idea of not writing out hint bit updates unless the  
page is already dirty a lot easier to swallow, because now we'd have  
a mechanism in place to ensure that they were set in a reasonable  
timeframe by autovacuum. That actually wouldn't incur much extra  
overhead at all, except in the case of a table that's effectively  
write-only. Actually, that's not even true; you still have to  
eventually freeze a write-mostly table.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> The visibility map won't be inquired unless you vacuum. This is a bit 
>> tricky. In vacuum, we only know whether we can set a bit or not, after 
>> we've acquired a cleanup lock on the page, and scanned all the tuples. 
>> While we're holding a cleanup lock, we don't want to do I/O, which could 
>> potentially block out other processes for a long time. So it's too late 
>> to extend the visibility map at that point.
> 
> This is no good; I think you've made the wrong tradeoffs.  In
> particular, even though only vacuum *currently* uses the map, you want
> to extend it to be used by indexscans.  So it's going to uselessly
> spring into being even without vacuums.
> 
> I'm not convinced that I/O while holding cleanup lock is so bad that we
> should break other aspects of the system to avoid it.  However, if you
> want to stick to that, how about
>     * vacuum page, possibly set its header bit
>     * release page lock (but not pin)
>     * if we need to set the bit, fetch the corresponding map page
>       (I/O might happen here)
>     * get share lock on heap page, then recheck its header bit;
>       if still set, set the map bit

Yeah, could do that.
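
Expressed with the patch's functions, that ordering would look roughly like this (a sketch only; buf, blkno and onerel stand for the heap buffer, block number and relation in lazy vacuum's scan loop, and the vacuum-side bookkeeping is elided):

    Buffer vmbuffer = InvalidBuffer;

    /* ... prune/vacuum the page under the cleanup lock,
     * possibly calling PageSetAllVisible() ... */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* release the lock, keep the pin */

    /* I/O may happen here, with no lock held on the heap page */
    visibilitymap_pin(onerel, blkno, &vmbuffer);

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    if (PageIsAllVisible(BufferGetPage(buf)))
        visibilitymap_set(onerel, blkno,
                          PageGetLSN(BufferGetPage(buf)), &vmbuffer);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);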

There is another problem, though, if the map is frequently probed for 
pages that don't exist in the map, or the map doesn't exist at all. 
Currently, the size of the map file is kept in relcache, in the 
rd_vm_nblocks_cache variable. Whenever a page is accessed that's > 
rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, 
and rd_vm_nblocks_cache is updated. That means that every probe to a 
non-existing page causes an lseek(), which isn't free.
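
Roughly, the probe path in question looks like this (an illustrative sketch following vm_readbuf; the field is called rd_vm_nblocks in the attached version of the patch):

    /* Every probe beyond the cached size re-checks the physical file length. */
    if (blkno >= rel->rd_vm_nblocks_cache)
    {
        if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
            rel->rd_vm_nblocks_cache =
                smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);  /* lseek() */
        else
            rel->rd_vm_nblocks_cache = 0;

        if (blkno >= rel->rd_vm_nblocks_cache)
            return InvalidBuffer;       /* treated as "bit not set" */
    }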

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> There is another problem, though, if the map is frequently probed for 
> pages that don't exist in the map, or the map doesn't exist at all. 
> Currently, the size of the map file is kept in relcache, in the 
> rd_vm_nblocks_cache variable. Whenever a page is accessed that's > 
> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, 
> and rd_vm_nblocks_cache is updated. That means that every probe to a 
> non-existing page causes an lseek(), which isn't free.

Well, considering how seldom new pages will be added to the visibility
map, it seems to me we could afford to send out a relcache inval event
when that happens.  Then rd_vm_nblocks_cache could be treated as
trustworthy.

Maybe it'd be worth doing that for the FSM too.  The frequency of
invals would be higher, but then again the reference frequency is
probably higher too?
        regards, tom lane


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> There is another problem, though, if the map is frequently probed for 
>> pages that don't exist in the map, or the map doesn't exist at all. 
>> Currently, the size of the map file is kept in relcache, in the 
>> rd_vm_nblocks_cache variable. Whenever a page is accessed that's > 
>> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, 
>> and rd_vm_nblocks_cache is updated. That means that every probe to a 
>> non-existing page causes an lseek(), which isn't free.
> 
> Well, considering how seldom new pages will be added to the visibility
> map, it seems to me we could afford to send out a relcache inval event
> when that happens.  Then rd_vm_nblocks_cache could be treated as
> trustworthy.
> 
> Maybe it'd be worth doing that for the FSM too.  The frequency of
> invals would be higher, but then again the reference frequency is
> probably higher too?

A relcache invalidation sounds awfully heavy-weight. Perhaps a 
light-weight invalidation event that doesn't flush the entry altogether, 
but just resets the cached sizes?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Well, considering how seldom new pages will be added to the visibility
>> map, it seems to me we could afford to send out a relcache inval event
>> when that happens.  Then rd_vm_nblocks_cache could be treated as
>> trustworthy.

> A relcache invalidation sounds awfully heavy-weight.

It really isn't.
        regards, tom lane


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Well, considering how seldom new pages will be added to the visibility
>>> map, it seems to me we could afford to send out a relcache inval event
>>> when that happens.  Then rd_vm_nblocks_cache could be treated as
>>> trustworthy.
> 
>> A relcache invalidation sounds awfully heavy-weight.
> 
> It really isn't.

Okay, then. I'll use relcache invalidation for both the FSM and 
visibility map.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> There is another problem, though, if the map is frequently probed for 
>> pages that don't exist in the map, or the map doesn't exist at all. 
>> Currently, the size of the map file is kept in relcache, in the 
>> rd_vm_nblocks_cache variable. Whenever a page is accessed that's > 
>> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, 
>> and rd_vm_nblocks_cache is updated. That means that every probe to a 
>> non-existing page causes an lseek(), which isn't free.
> 
> Well, considering how seldom new pages will be added to the visibility
> map, it seems to me we could afford to send out a relcache inval event
> when that happens.  Then rd_vm_nblocks_cache could be treated as
> trustworthy.

Here's an updated version, with a lot of smaller cleanups, and using 
relcache invalidation to notify other backends when the visibility map 
fork is extended. I already committed the change to FSM to do the same. 
I'm feeling quite satisfied to commit this patch early next week.

I modified the VACUUM VERBOSE output slightly, to print the number of 
pages scanned. The added part emphasized below:

postgres=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  "foo": removed 230 row versions in 10 pages
INFO:  "foo": found 230 removable, 10 nonremovable row versions in *10 
out of* 43 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

That seems OK to me, but maybe others have an opinion on that?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Here's an updated version, ...

And here it is, for real...

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 **** subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o

  include $(top_srcdir)/src/backend/common.mk
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 196,202 ----
      int            ntup;
      OffsetNumber lineoff;
      ItemId        lpp;
+     bool        all_visible;

      Assert(page < scan->rs_nblocks);

***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
-             HeapTupleData loctup;
              bool        valid;

!             loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!             loctup.t_len = ItemIdGetLength(lpp);
!             ItemPointerSet(&(loctup.t_self), page, lineoff);

!             valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
--- 235,266 ----
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;

+     /*
+      * If the all-visible flag indicates that all tuples on the page are
+      * visible to everyone, we can skip the per-tuple visibility tests.
+      */
+     all_visible = PageIsAllVisible(dp);
+
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
              bool        valid;

!             if (all_visible)
!                 valid = true;
!             else
!             {
!                 HeapTupleData loctup;
!
!                 loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!                 loctup.t_len = ItemIdGetLength(lpp);
!                 ItemPointerSet(&(loctup.t_self), page, lineoff);

!                 valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
!             }
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
***************
*** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1874,1880 ----
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
      Buffer        buffer;
+     bool        all_visible_cleared = false;

      if (relation->rd_rel->relhasoids)
      {
***************
*** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1935,1946 ----

      RelationPutHeapTuple(relation, buffer, heaptup);

+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+
      /*
       * XXX Should we set PageSetPrunable on this page ?
       *
***************
*** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1964,1970 ----
          Page        page = BufferGetPage(buffer);
          uint8        info = XLOG_HEAP_INSERT;

+         xlrec.all_visible_cleared = all_visible_cleared;
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = heaptup->t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 2016,2026 ----

      UnlockReleaseBuffer(buffer);

+     /* Clear the bit in the visibility map if necessary */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation,
+                             ItemPointerGetBlockNumber(&(heaptup->t_self)));
+
      /*
       * If tuple is cachable, mark it for invalidation from the caches in case
       * we abort.  Note it is OK to do this after releasing the buffer, because
***************
*** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid,
--- 2097,2103 ----
      Buffer        buffer;
      bool        have_tuple_lock = false;
      bool        iscombo;
+     bool        all_visible_cleared = false;

      Assert(ItemPointerIsValid(tid));

***************
*** 2216,2221 **** l1:
--- 2244,2255 ----
       */
      PageSetPrunable(page, xid);

+     if (PageIsAllVisible(page))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(page);
+     }
+
      /* store transaction information of xact deleting the tuple */
      tp.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
                                 HEAP_XMAX_INVALID |
***************
*** 2237,2242 **** l1:
--- 2271,2277 ----
          XLogRecPtr    recptr;
          XLogRecData rdata[2];

+         xlrec.all_visible_cleared = all_visible_cleared;
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = tp.t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 2281,2286 **** l1:
--- 2316,2325 ----
       */
      CacheInvalidateHeapTuple(relation, &tp);

+     /* Clear the bit in the visibility map if necessary */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
      /* Now we can release the buffer */
      ReleaseBuffer(buffer);

***************
*** 2388,2393 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2427,2434 ----
      bool        have_tuple_lock = false;
      bool        iscombo;
      bool        use_hot_update = false;
+     bool        all_visible_cleared = false;
+     bool        all_visible_cleared_new = false;

      Assert(ItemPointerIsValid(otid));

***************
*** 2763,2768 **** l2:
--- 2804,2815 ----
          MarkBufferDirty(newbuf);
      MarkBufferDirty(buffer);

+     /*
+      * Note: we mustn't clear PD_ALL_VISIBLE flags before writing the WAL
+      * record, because log_heap_update looks at those flags to set the
+      * corresponding flags in the WAL record.
+      */
+
      /* XLOG stuff */
      if (!relation->rd_istemp)
      {
***************
*** 2778,2783 **** l2:
--- 2825,2842 ----
          PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
      }

+     /* Clear PD_ALL_VISIBLE flags */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+     if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
+     {
+         all_visible_cleared_new = true;
+         PageClearAllVisible(BufferGetPage(newbuf));
+     }
+
      END_CRIT_SECTION();

      if (newbuf != buffer)
***************
*** 2791,2796 **** l2:
--- 2850,2861 ----
       */
      CacheInvalidateHeapTuple(relation, &oldtup);

+     /* Clear bits in visibility map */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+     if (all_visible_cleared_new)
+         visibilitymap_clear(relation, BufferGetBlockNumber(newbuf));
+
      /* Now we can release the buffer(s) */
      if (newbuf != buffer)
          ReleaseBuffer(newbuf);
***************
*** 3412,3417 **** l3:
--- 3477,3487 ----
      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);

      /*
+      * Don't update the visibility map here. Locking a tuple doesn't
+      * change visibility info.
+      */
+
+     /*
       * Now that we have successfully marked the tuple as locked, we can
       * release the lmgr tuple lock, if we had it.
       */
***************
*** 3916,3922 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 3986,3994 ----

      xlrec.target.node = reln->rd_node;
      xlrec.target.tid = from;
+     xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf));
      xlrec.newtid = newtup->t_self;
+     xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf));

      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapUpdate;
***************
*** 4185,4197 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
      OffsetNumber offnum;
      ItemId        lp = NULL;
      HeapTupleHeader htup;

      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

!     buffer = XLogReadBuffer(xlrec->target.node,
!                             ItemPointerGetBlockNumber(&(xlrec->target.tid)),
!                             false);
      if (!BufferIsValid(buffer))
          return;
      page = (Page) BufferGetPage(buffer);
--- 4257,4281 ----
      OffsetNumber offnum;
      ItemId        lp = NULL;
      HeapTupleHeader htup;
+     BlockNumber    blkno;
+
+     blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, blkno);
+         FreeFakeRelcacheEntry(reln);
+     }

      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

!     buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
      if (!BufferIsValid(buffer))
          return;
      page = (Page) BufferGetPage(buffer);
***************
*** 4223,4228 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4307,4315 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /* Make sure there is no forward chain link in t_ctid */
      htup->t_ctid = xlrec->target.tid;
      PageSetLSN(page, lsn);
***************
*** 4249,4259 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
      Size        freespace;
      BlockNumber    blkno;

      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

-     blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
-
      if (record->xl_info & XLOG_HEAP_INIT_PAGE)
      {
          buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
--- 4336,4357 ----
      Size        freespace;
      BlockNumber    blkno;

+     blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, blkno);
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;

      if (record->xl_info & XLOG_HEAP_INIT_PAGE)
      {
          buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
***************
*** 4307,4312 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4405,4414 ----

      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
+
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      MarkBufferDirty(buffer);
      UnlockReleaseBuffer(buffer);

***************
*** 4347,4352 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4449,4466 ----
      uint32        newlen;
      Size        freespace;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln,
+                             ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
      {
          if (samepage)
***************
*** 4411,4416 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4525,4533 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);

+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /*
       * this test is ugly, but necessary to avoid thinking that insert change
       * is already applied
***************
*** 4426,4431 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4543,4559 ----

  newt:;

+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->new_all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_2)
          return;

***************
*** 4504,4509 **** newsame:;
--- 4632,4640 ----
      if (offnum == InvalidOffsetNumber)
          elog(PANIC, "heap_update_redo: failed to add tuple");

+     if (xlrec->new_all_visible_cleared)
+         PageClearAllVisible(page);
+
      freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */

      PageSetLSN(page, lsn);
*** /dev/null
--- src/backend/access/heap/visibilitymap.c
***************
*** 0 ****
--- 1,478 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.c
+  *      bitmap for tracking visibility of heap tuples
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      $PostgreSQL$
+  *
+  * INTERFACE ROUTINES
+  *        visibilitymap_clear    - clear a bit in the visibility map
+  *        visibilitymap_pin    - pin a map page for setting a bit
+  *        visibilitymap_set    - set a bit in a previously pinned page
+  *        visibilitymap_test    - test if a bit is set
+  *
+  * NOTES
+  *
+  * The visibility map is a bitmap with one bit per heap page. A set bit means
+  * that all tuples on the page are visible to all transactions, and the page
+  * therefore doesn't need to be vacuumed. The map is conservative in the sense that we
+  * make sure that whenever a bit is set, we know the condition is true, but if
+  * a bit is not set, it might or might not be.
+  *
+  * There's no explicit WAL logging in the functions in this file. The callers
+  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+  * replay of the updating operation as well. Setting bits during recovery
+  * isn't necessary for correctness.
+  *
+  * Currently, the visibility map is only used as a hint, to speed up VACUUM.
+  * A corrupted visibility map won't cause data corruption, although it can
+  * make VACUUM skip pages that need vacuuming, until the next anti-wraparound
+  * vacuum. The visibility map is not used for anti-wraparound vacuums, because
+  * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
+  * present in the table, even on pages that don't have any dead tuples.
+  *
+  * Although the visibility map is just a hint at the moment, the PD_ALL_VISIBLE
+  * flag on heap pages *must* be correct.
+  *
+  * LOCKING
+  *
+  * In heapam.c, whenever a page is modified so that not all tuples on the
+  * page are visible to everyone anymore, the corresponding bit in the
+  * visibility map is cleared. The bit in the visibility map is cleared
+  * after releasing the lock on the heap page, to avoid holding the lock
+  * over possible I/O to read in the visibility map page.
+  *
+  * To set a bit, you need to hold a lock on the heap page. That prevents
+  * the race condition where VACUUM sees that all tuples on the page are
+  * visible to everyone, but another backend modifies the page before VACUUM
+  * sets the bit in the visibility map.
+  *
+  * When a bit is set, the LSN of the visibility map page is updated to make
+  * sure that the visibility map update doesn't get written to disk before the
+  * WAL record of the changes that made it possible to set the bit is flushed.
+  * But when a bit is cleared, we don't have to do that because it's always OK
+  * to clear a bit in the map from a correctness point of view.
+  *
+  * TODO
+  *
+  * It would be nice to use the visibility map to skip visibility checks in
+  * index scans.
+  *
+  * Currently, the visibility map is not 100% correct all the time.
+  * During updates, the bit in the visibility map is cleared after releasing
+  * the lock on the heap page. During the window between releasing the lock
+  * and clearing the bit in the visibility map, the bit in the visibility map
+  * is set, but the new insertion or deletion is not yet visible to other
+  * backends.
+  *
+  * That might actually be OK for the index scans, though. The newly inserted
+  * tuple wouldn't have an index pointer yet, so all tuples reachable from an
+  * index would still be visible to all other backends, and deletions wouldn't
+  * be visible to other backends yet.
+  *
+  * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
+  * vacuum observes that all tuples are visible to all, it sets the flag on
+  * the heap page, and also sets the bit in the visibility map. If we then
+  * crash, and only the visibility map page was flushed to disk, we'll have
+  * a bit set in the visibility map, but the corresponding flag on the heap
+  * page is not set. If the heap page is then updated, the updater won't
+  * know to clear the bit in the visibility map.
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+
+ #include "access/visibilitymap.h"
+ #include "storage/bufmgr.h"
+ #include "storage/bufpage.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+ #include "utils/inval.h"
+
+ /*#define TRACE_VISIBILITYMAP */
+
+ /*
+  * Size of the bitmap on each visibility map page, in bytes. There are no
+  * extra headers, so the whole page except for the standard page header
+  * is used for the bitmap.
+  */
+ #define MAPSIZE (BLCKSZ - SizeOfPageHeaderData)
+
+ /* Number of bits allocated for each heap block. */
+ #define BITS_PER_HEAPBLOCK 1
+
+ /* Number of heap blocks we can represent in one byte. */
+ #define HEAPBLOCKS_PER_BYTE 8
+
+ /* Number of heap blocks we can represent in one visibility map page. */
+ #define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+ /* Mapping from heap block number to the right bit in the visibility map */
+ #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+ #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+
+ /* prototypes for internal routines */
+ static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
+ static void vm_extend(Relation rel, BlockNumber nvmblocks);
+
+
+ /*
+  *    visibilitymap_clear - clear a bit in visibility map
+  *
+  * Clear a bit in the visibility map, marking that not all tuples are
+  * visible to all transactions anymore.
+  */
+ void
+ visibilitymap_clear(Relation rel, BlockNumber heapBlk)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     int            mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     int            mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     uint8        mask = 1 << mapBit;
+     Buffer        mapBuffer;
+     char       *map;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     mapBuffer = vm_readbuf(rel, mapBlock, false);
+     if (!BufferIsValid(mapBuffer))
+         return; /* nothing to do */
+
+     LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+     map = PageGetContents(BufferGetPage(mapBuffer));
+
+     if (map[mapByte] & mask)
+     {
+         map[mapByte] &= ~mask;
+
+         MarkBufferDirty(mapBuffer);
+     }
+
+     UnlockReleaseBuffer(mapBuffer);
+ }
+
+ /*
+  *    visibilitymap_pin - pin a map page for setting a bit
+  *
+  * Setting a bit in the visibility map is a two-phase operation. First, call
+  * visibilitymap_pin, to pin the visibility map page containing the bit for
+  * the heap page. Because that can require I/O to read the map page, you
+  * shouldn't hold a lock on the heap page while doing that. Then, call
+  * visibilitymap_set to actually set the bit.
+  *
+  * On entry, *buf should be InvalidBuffer or a valid buffer returned by
+  * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+  * relation. On return, *buf is a valid buffer with the map page containing
+  * the bit for heapBlk.
+  *
+  * If the page doesn't exist in the map file yet, it is extended.
+  */
+ void
+ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+     /* Reuse the old pinned buffer if possible */
+     if (BufferIsValid(*buf))
+     {
+         if (BufferGetBlockNumber(*buf) == mapBlock)
+             return;
+
+         ReleaseBuffer(*buf);
+     }
+     *buf = vm_readbuf(rel, mapBlock, true);
+ }
+
+ /*
+  *    visibilitymap_set - set a bit on a previously pinned page
+  *
+  * recptr is the LSN of the heap page. The LSN of the visibility map page is
+  * advanced to that, to make sure that the visibility map doesn't get flushed
+  * to disk before the update to the heap page that made all tuples visible.
+  *
+  * This is an opportunistic function. It does nothing, unless *buf
+  * contains the bit for heapBlk. Call visibilitymap_pin first to pin
+  * the right map page. This function doesn't do any I/O.
+  */
+ void
+ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
+                   Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Page        page;
+     char       *map;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     /* Check that we have the right page pinned */
+     if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
+         return;
+
+     page = BufferGetPage(*buf);
+     map = PageGetContents(page);
+     LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+
+     if (!(map[mapByte] & (1 << mapBit)))
+     {
+         map[mapByte] |= (1 << mapBit);
+
+         if (XLByteLT(PageGetLSN(page), recptr))
+             PageSetLSN(page, recptr);
+         PageSetTLI(page, ThisTimeLineID);
+         MarkBufferDirty(*buf);
+     }
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+  *    visibilitymap_test - test if a bit is set
+  *
+  * Are all tuples on heapBlk visible to all, according to the visibility map?
+  *
+  * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
+  * earlier call to visibilitymap_pin or visibilitymap_test on the same
+  * relation. On return, *buf is a valid buffer with the map page containing
+  * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
+  * releasing *buf after it's done testing and setting bits.
+  */
+ bool
+ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8        mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     bool        result;
+     char       *map;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     /* Reuse the old pinned buffer if possible */
+     if (BufferIsValid(*buf))
+     {
+         if (BufferGetBlockNumber(*buf) != mapBlock)
+         {
+             ReleaseBuffer(*buf);
+             *buf = InvalidBuffer;
+         }
+     }
+
+     if (!BufferIsValid(*buf))
+     {
+         *buf = vm_readbuf(rel, mapBlock, false);
+         if (!BufferIsValid(*buf))
+             return false;
+     }
+
+     map = PageGetContents(BufferGetPage(*buf));
+
+     /*
+      * We don't need to lock the page, as we're only looking at a single bit.
+      */
+     result = (map[mapByte] & (1 << mapBit)) ? true : false;
+
+     return result;
+ }
+
+ /*
+  *    visibilitymap_truncate - truncate the visibility map
+  */
+ void
+ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber newnblocks;
+     /* last remaining block, byte, and bit */
+     BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     uint32        truncByte  = HEAPBLK_TO_MAPBYTE(nheapblocks);
+     uint8        truncBit   = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     /*
+      * If no visibility map has been created yet for this relation, there's
+      * nothing to truncate.
+      */
+     if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+         return;
+
+     /*
+      * Unless the new size is exactly at a visibility map page boundary, the
+      * tail bits in the last remaining map page, representing truncated heap
+      * blocks, need to be cleared. This is not only tidy, but also necessary
+      * because we don't get a chance to clear the bits if the heap is
+      * extended again.
+      */
+     if (truncByte != 0 || truncBit != 0)
+     {
+         Buffer mapBuffer;
+         Page page;
+         char *map;
+
+         newnblocks = truncBlock + 1;
+
+         mapBuffer = vm_readbuf(rel, truncBlock, false);
+         if (!BufferIsValid(mapBuffer))
+         {
+             /* nothing to do, the file was already smaller */
+             return;
+         }
+
+         page = BufferGetPage(mapBuffer);
+         map = PageGetContents(page);
+
+         LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+         /* Clear out the unwanted bytes. */
+         MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+
+         /*
+          * Mask out the unwanted bits of the last remaining byte.
+          *
+          * ((1 << 0) - 1) = 00000000
+          * ((1 << 1) - 1) = 00000001
+          * ...
+          * ((1 << 6) - 1) = 00111111
+          * ((1 << 7) - 1) = 01111111
+          */
+         map[truncByte] &= (1 << truncBit) - 1;
+
+         MarkBufferDirty(mapBuffer);
+         UnlockReleaseBuffer(mapBuffer);
+     }
+     else
+         newnblocks = truncBlock;
+
+     if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) < newnblocks)
+     {
+         /* nothing to do, the file was already smaller than requested size */
+         return;
+     }
+
+     smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks,
+                  rel->rd_istemp);
+
+     /*
+      * Need to invalidate the relcache entry, because rd_vm_nblocks
+      * seen by other backends is no longer valid.
+      */
+     if (!InRecovery)
+         CacheInvalidateRelcache(rel);
+
+     rel->rd_vm_nblocks = newnblocks;
+ }
+
+ /*
+  * Read a visibility map page.
+  *
+  * If the page doesn't exist, InvalidBuffer is returned, unless 'extend' is
+  * true, in which case the visibility map file is extended to cover the page.
+  */
+ static Buffer
+ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
+ {
+     Buffer buf;
+
+     RelationOpenSmgr(rel);
+
+     /*
+      * The current size of the visibility map fork is kept in relcache, to
+      * avoid reading beyond EOF. If we haven't cached the size of the map yet,
+      * do that first.
+      */
+     if (rel->rd_vm_nblocks == InvalidBlockNumber)
+     {
+         if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+             rel->rd_vm_nblocks = smgrnblocks(rel->rd_smgr,
+                                              VISIBILITYMAP_FORKNUM);
+         else
+             rel->rd_vm_nblocks = 0;
+     }
+
+     /* Handle requests beyond EOF */
+     if (blkno >= rel->rd_vm_nblocks)
+     {
+         if (extend)
+             vm_extend(rel, blkno + 1);
+         else
+             return InvalidBuffer;
+     }
+
+     /*
+      * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
+      * always safe to clear bits, so it's better to clear corrupt pages than
+      * error out.
+      */
+     buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+                              RBM_ZERO_ON_ERROR, NULL);
+     if (PageIsNew(BufferGetPage(buf)))
+         PageInit(BufferGetPage(buf), BLCKSZ, 0);
+     return buf;
+ }
+
+ /*
+  * Ensure that the visibility map fork is at least vm_nblocks long, extending
+  * it if necessary with zeroed pages.
+  */
+ static void
+ vm_extend(Relation rel, BlockNumber vm_nblocks)
+ {
+     BlockNumber vm_nblocks_now;
+     Page pg;
+
+     pg = (Page) palloc(BLCKSZ);
+     PageInit(pg, BLCKSZ, 0);
+
+     /*
+      * We use the relation extension lock to lock out other backends trying
+      * to extend the visibility map at the same time. It also locks out
+      * extension of the main fork, unnecessarily, but extending the
+      * visibility map happens seldom enough that it doesn't seem worthwhile to
+      * have a separate lock tag type for it.
+      *
+      * Note that another backend might have extended or created the
+      * relation before we get the lock.
+      */
+     LockRelationForExtension(rel, ExclusiveLock);
+
+     /* Create the file first if it doesn't exist */
+     if ((rel->rd_vm_nblocks == 0 || rel->rd_vm_nblocks == InvalidBlockNumber)
+         && !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
+     {
+         smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
+         vm_nblocks_now = 0;
+     }
+     else
+         vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+
+     while (vm_nblocks_now < vm_nblocks)
+     {
+         smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
+                    (char *) pg, rel->rd_istemp);
+         vm_nblocks_now++;
+     }
+
+     UnlockRelationForExtension(rel, ExclusiveLock);
+
+     pfree(pg);
+
+     /* Update the relcache with the up-to-date size */
+     if (!InRecovery)
+         CacheInvalidateRelcache(rel);
+     rel->rd_vm_nblocks = vm_nblocks_now;
+ }
*** src/backend/access/transam/xlogutils.c
--- src/backend/access/transam/xlogutils.c
***************
*** 377,382 **** CreateFakeRelcacheEntry(RelFileNode rnode)
--- 377,383 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks = InvalidBlockNumber;
+     rel->rd_vm_nblocks = InvalidBlockNumber;
      rel->rd_smgr = NULL;

      return rel;
*** src/backend/catalog/catalog.c
--- src/backend/catalog/catalog.c
***************
*** 54,60 ****
   */
  const char *forkNames[] = {
      "main", /* MAIN_FORKNUM */
!     "fsm"   /* FSM_FORKNUM */
  };

  /*
--- 54,61 ----
   */
  const char *forkNames[] = {
      "main", /* MAIN_FORKNUM */
!     "fsm",   /* FSM_FORKNUM */
!     "vm"   /* VISIBILITYMAP_FORKNUM */
  };

  /*
*** src/backend/catalog/storage.c
--- src/backend/catalog/storage.c
***************
*** 19,24 ****
--- 19,25 ----

  #include "postgres.h"

+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 175,180 **** void
--- 176,182 ----
  RelationTruncate(Relation rel, BlockNumber nblocks)
  {
      bool fsm;
+     bool vm;

      /* Open it at the smgr level if not already done */
      RelationOpenSmgr(rel);
***************
*** 187,192 **** RelationTruncate(Relation rel, BlockNumber nblocks)
--- 189,199 ----
      if (fsm)
          FreeSpaceMapTruncateRel(rel, nblocks);

+     /* Truncate the visibility map too if it exists. */
+     vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+     if (vm)
+         visibilitymap_truncate(rel, nblocks);
+
      /*
       * We WAL-log the truncation before actually truncating, which
       * means trouble if the truncation fails. If we then crash, the WAL
***************
*** 217,228 **** RelationTruncate(Relation rel, BlockNumber nblocks)

          /*
           * Flush, because otherwise the truncation of the main relation
!          * might hit the disk before the WAL record of truncating the
!          * FSM is flushed. If we crashed during that window, we'd be
!          * left with a truncated heap, but the FSM would still contain
!          * entries for the non-existent heap pages.
           */
!         if (fsm)
              XLogFlush(lsn);
      }

--- 224,235 ----

          /*
           * Flush, because otherwise the truncation of the main relation
!          * might hit the disk before the WAL record, and the truncation of
!          * the FSM or visibility map. If we crashed during that window, we'd
!          * be left with a truncated heap, but the FSM or visibility map would
!          * still contain entries for the non-existent heap pages.
           */
!         if (fsm || vm)
              XLogFlush(lsn);
      }

*** src/backend/commands/vacuum.c
--- src/backend/commands/vacuum.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlog.h"
  #include "catalog/namespace.h"
***************
*** 2902,2907 **** move_chain_tuple(Relation rel,
--- 2903,2914 ----
      Size        tuple_len = old_tup->t_len;

      /*
+      * Clear the bits in the visibility map.
+      */
+     visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+
+     /*
       * make a modifiable copy of the source tuple.
       */
      heap_copytuple_with_tuple(old_tup, &newtup);
***************
*** 3005,3010 **** move_chain_tuple(Relation rel,
--- 3012,3021 ----

      END_CRIT_SECTION();

+     PageClearAllVisible(BufferGetPage(old_buf));
+     if (dst_buf != old_buf)
+         PageClearAllVisible(BufferGetPage(dst_buf));
+
      LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK);
      if (dst_buf != old_buf)
          LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
***************
*** 3107,3112 **** move_plain_tuple(Relation rel,
--- 3118,3140 ----

      END_CRIT_SECTION();

+     /*
+      * Clear the visible-to-all hint bits on the page, and bits in the
+      * visibility map. Normally we'd release the locks on the heap pages
+      * before updating the visibility map, but it doesn't really matter here
+      * because we're holding an AccessExclusiveLock on the relation anyway.
+      */
+     if (PageIsAllVisible(dst_page))
+     {
+         PageClearAllVisible(dst_page);
+         visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf));
+     }
+     if (PageIsAllVisible(old_page))
+     {
+         PageClearAllVisible(old_page);
+         visibilitymap_clear(rel, BufferGetBlockNumber(old_buf));
+     }
+
      dst_vacpage->free = PageGetFreeSpaceWithFillFactor(rel, dst_page);
      LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK);
      LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
*** src/backend/commands/vacuumlazy.c
--- src/backend/commands/vacuumlazy.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
  #include "catalog/storage.h"
  #include "commands/dbcommands.h"
  #include "commands/vacuum.h"
***************
*** 88,93 **** typedef struct LVRelStats
--- 89,95 ----
      int            max_dead_tuples;    /* # slots allocated in array */
      ItemPointer dead_tuples;    /* array of ItemPointerData */
      int            num_index_scans;
+     bool        scanned_all;    /* have we scanned all pages (so far)? */
  } LVRelStats;


***************
*** 102,108 **** static BufferAccessStrategy vac_strategy;

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
--- 104,110 ----

  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all);
  static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
  static void lazy_vacuum_index(Relation indrel,
                    IndexBulkDeleteResult **stats,
***************
*** 141,146 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
--- 143,149 ----
      BlockNumber possibly_freeable;
      PGRUsage    ru0;
      TimestampTz starttime = 0;
+     bool        scan_all;

      pg_rusage_init(&ru0);

***************
*** 161,173 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
      vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));

      vacrelstats->num_index_scans = 0;

      /* Open all indexes of the relation */
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
--- 164,183 ----
      vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));

      vacrelstats->num_index_scans = 0;
+     vacrelstats->scanned_all = true;

      /* Open all indexes of the relation */
      vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
      vacrelstats->hasindex = (nindexes > 0);

+     /* Should we use the visibility map or scan all pages? */
+     if (vacstmt->freeze_min_age != -1)
+         scan_all = true;
+     else
+         scan_all = false;
+
      /* Do the vacuuming */
!     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);

      /* Done with indexes */
      vac_close_indexes(nindexes, Irel, NoLock);
***************
*** 186,195 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
      /* Vacuum the Free Space Map */
      FreeSpaceMapVacuum(onerel);

!     /* Update statistics in pg_class */
      vac_update_relstats(onerel,
                          vacrelstats->rel_pages, vacrelstats->rel_tuples,
!                         vacrelstats->hasindex, FreezeLimit);

      /* report results to the stats collector, too */
      pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared,
--- 196,209 ----
      /* Vacuum the Free Space Map */
      FreeSpaceMapVacuum(onerel);

!     /*
!      * Update statistics in pg_class. We can only advance relfrozenxid if we
!      * didn't skip any pages.
!      */
      vac_update_relstats(onerel,
                          vacrelstats->rel_pages, vacrelstats->rel_tuples,
!                         vacrelstats->hasindex,
!                         vacrelstats->scanned_all ? FreezeLimit : InvalidOid);

      /* report results to the stats collector, too */
      pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared,
***************
*** 230,242 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes)
  {
      BlockNumber nblocks,
                  blkno;
      HeapTupleData tuple;
      char       *relname;
      BlockNumber empty_pages,
                  vacuumed_pages;
      double        num_tuples,
                  tups_vacuumed,
--- 244,257 ----
   */
  static void
  lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
!                Relation *Irel, int nindexes, bool scan_all)
  {
      BlockNumber nblocks,
                  blkno;
      HeapTupleData tuple;
      char       *relname;
      BlockNumber empty_pages,
+                 scanned_pages,
                  vacuumed_pages;
      double        num_tuples,
                  tups_vacuumed,
***************
*** 245,250 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 260,266 ----
      IndexBulkDeleteResult **indstats;
      int            i;
      PGRUsage    ru0;
+     Buffer        vmbuffer = InvalidBuffer;

      pg_rusage_init(&ru0);

***************
*** 254,260 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
                      get_namespace_name(RelationGetNamespace(onerel)),
                      relname)));

!     empty_pages = vacuumed_pages = 0;
      num_tuples = tups_vacuumed = nkeep = nunused = 0;

      indstats = (IndexBulkDeleteResult **)
--- 270,276 ----
                      get_namespace_name(RelationGetNamespace(onerel)),
                      relname)));

!     empty_pages = vacuumed_pages = scanned_pages = 0;
      num_tuples = tups_vacuumed = nkeep = nunused = 0;

      indstats = (IndexBulkDeleteResult **)
***************
*** 278,286 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 294,321 ----
          OffsetNumber frozen[MaxOffsetNumber];
          int            nfrozen;
          Size        freespace;
+         bool        all_visible_according_to_vm = false;
+         bool        all_visible;
+
+         /*
+          * Skip pages that don't require vacuuming according to the
+          * visibility map.
+          */
+         if (!scan_all)
+         {
+             all_visible_according_to_vm =
+                 visibilitymap_test(onerel, blkno, &vmbuffer);
+             if (all_visible_according_to_vm)
+             {
+                 vacrelstats->scanned_all = false;
+                 continue;
+             }
+         }

          vacuum_delay_point();

+         scanned_pages++;
+
          /*
           * If we are close to overrunning the available space for dead-tuple
           * TIDs, pause and do a cycle of vacuuming before we tackle this page.
***************
*** 354,360 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
          {
              empty_pages++;
              freespace = PageGetHeapFreeSpace(page);
!             UnlockReleaseBuffer(buf);
              RecordPageWithFreeSpace(onerel, blkno, freespace);
              continue;
          }
--- 389,414 ----
          {
              empty_pages++;
              freespace = PageGetHeapFreeSpace(page);
!
!             if (!PageIsAllVisible(page))
!             {
!                 SetBufferCommitInfoNeedsSave(buf);
!                 PageSetAllVisible(page);
!             }
!
!             LockBuffer(buf, BUFFER_LOCK_UNLOCK);
!
!             /* Update the visibility map */
!             if (!all_visible_according_to_vm)
!             {
!                 visibilitymap_pin(onerel, blkno, &vmbuffer);
!                 LockBuffer(buf, BUFFER_LOCK_SHARE);
!                 if (PageIsAllVisible(page))
!                     visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
!                 LockBuffer(buf, BUFFER_LOCK_UNLOCK);
!             }
!
!             ReleaseBuffer(buf);
              RecordPageWithFreeSpace(onerel, blkno, freespace);
              continue;
          }
***************
*** 371,376 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 425,431 ----
           * Now scan the page to collect vacuumable items and check for tuples
           * requiring freezing.
           */
+         all_visible = true;
          nfrozen = 0;
          hastup = false;
          prev_dead_count = vacrelstats->num_dead_tuples;
***************
*** 408,413 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 463,469 ----
              if (ItemIdIsDead(itemid))
              {
                  lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
+                 all_visible = false;
                  continue;
              }

***************
*** 442,447 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 498,504 ----
                          nkeep += 1;
                      else
                          tupgone = true; /* we can delete the tuple */
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_LIVE:
                      /* Tuple is good --- but let's do some validity checks */
***************
*** 449,454 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 506,540 ----
                          !OidIsValid(HeapTupleGetOid(&tuple)))
                          elog(WARNING, "relation \"%s\" TID %u/%u: OID is invalid",
                               relname, blkno, offnum);
+
+                     /*
+                      * Is the tuple definitely visible to all transactions?
+                      *
+                      * NB: Like with per-tuple hint bits, we can't set the
+                      * flag if the inserter committed asynchronously. See
+                      * SetHintBits for more info. Check that the
+                      * HEAP_XMIN_COMMITTED hint bit is set because of that.
+                      */
+                     if (all_visible)
+                     {
+                         TransactionId xmin;
+
+                         if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED))
+                         {
+                             all_visible = false;
+                             break;
+                         }
+                         /*
+                          * The inserter definitely committed. But is it
+                          * old enough that everyone sees it as committed?
+                          */
+                         xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+                         if (!TransactionIdPrecedes(xmin, OldestXmin))
+                         {
+                             all_visible = false;
+                             break;
+                         }
+                     }
                      break;
                  case HEAPTUPLE_RECENTLY_DEAD:

***************
*** 457,468 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 543,557 ----
                       * from relation.
                       */
                      nkeep += 1;
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_INSERT_IN_PROGRESS:
                      /* This is an expected case during concurrent vacuum */
+                     all_visible = false;
                      break;
                  case HEAPTUPLE_DELETE_IN_PROGRESS:
                      /* This is an expected case during concurrent vacuum */
+                     all_visible = false;
                      break;
                  default:
                      elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
***************
*** 525,536 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,

          freespace = PageGetHeapFreeSpace(page);

          /* Remember the location of the last page with nonremovable tuples */
          if (hastup)
              vacrelstats->nonempty_pages = blkno + 1;

-         UnlockReleaseBuffer(buf);
-
          /*
           * If we remembered any tuples for deletion, then the page will be
           * visited again by lazy_vacuum_heap, which will compute and record
--- 614,656 ----

          freespace = PageGetHeapFreeSpace(page);

+         /* Update the all-visible flag on the page */
+         if (!PageIsAllVisible(page) && all_visible)
+         {
+             SetBufferCommitInfoNeedsSave(buf);
+             PageSetAllVisible(page);
+         }
+         else if (PageIsAllVisible(page) && !all_visible)
+         {
+             elog(WARNING, "PD_ALL_VISIBLE flag was incorrectly set");
+             SetBufferCommitInfoNeedsSave(buf);
+             PageClearAllVisible(page);
+
+             /*
+              * XXX: Normally, we would drop the lock on the heap page before
+              * updating the visibility map.
+              */
+             visibilitymap_clear(onerel, blkno);
+         }
+
+         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+         /* Update the visibility map */
+         if (!all_visible_according_to_vm && all_visible)
+         {
+             visibilitymap_pin(onerel, blkno, &vmbuffer);
+             LockBuffer(buf, BUFFER_LOCK_SHARE);
+             if (PageIsAllVisible(page))
+                 visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
+             LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+         }
+
+         ReleaseBuffer(buf);
+
          /* Remember the location of the last page with nonremovable tuples */
          if (hastup)
              vacrelstats->nonempty_pages = blkno + 1;

          /*
           * If we remembered any tuples for deletion, then the page will be
           * visited again by lazy_vacuum_heap, which will compute and record
***************
*** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 680,692 ----
          vacrelstats->num_index_scans++;
      }

+     /* Release the pin on the visibility map page */
+     if (BufferIsValid(vmbuffer))
+     {
+         ReleaseBuffer(vmbuffer);
+         vmbuffer = InvalidBuffer;
+     }
+
      /* Do post-vacuum cleanup and statistics update for each index */
      for (i = 0; i < nindexes; i++)
          lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
***************
*** 572,580 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
                          tups_vacuumed, vacuumed_pages)));

      ereport(elevel,
!             (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u pages",
                      RelationGetRelationName(onerel),
!                     tups_vacuumed, num_tuples, nblocks),
               errdetail("%.0f dead row versions cannot be removed yet.\n"
                         "There were %.0f unused item pointers.\n"
                         "%u pages are entirely empty.\n"
--- 699,707 ----
                          tups_vacuumed, vacuumed_pages)));

      ereport(elevel,
!             (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
                      RelationGetRelationName(onerel),
!                     tups_vacuumed, num_tuples, scanned_pages, nblocks),
               errdetail("%.0f dead row versions cannot be removed yet.\n"
                         "There were %.0f unused item pointers.\n"
                         "%u pages are entirely empty.\n"
***************
*** 623,628 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
--- 750,764 ----
          LockBufferForCleanup(buf);
          tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats);

+         /*
+          * Before we let the page go, prune it. The primary reason is to
+          * update the visibility map in the common special case that we just
+          * vacuumed away the last tuple on the page that wasn't visible to
+          * everyone.
+          */
+         vacrelstats->tuples_deleted +=
+             heap_page_prune(onerel, buf, OldestXmin, false, false);
+
          /* Now that we've compacted the page, record its available space */
          page = BufferGetPage(buf);
          freespace = PageGetHeapFreeSpace(page);
*** src/backend/utils/cache/relcache.c
--- src/backend/utils/cache/relcache.c
***************
*** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp)
--- 305,311 ----
      MemSet(relation, 0, sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks = InvalidBlockNumber;
+     relation->rd_vm_nblocks = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1377,1382 **** formrdesc(const char *relationName, Oid relationReltype,
--- 1378,1384 ----
      relation = (Relation) palloc0(sizeof(RelationData));
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks = InvalidBlockNumber;
+     relation->rd_vm_nblocks = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      relation->rd_smgr = NULL;
***************
*** 1665,1673 **** RelationReloadIndexInfo(Relation relation)
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /* Must reset targblock and fsm_nblocks in case rel was truncated */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
--- 1667,1679 ----
      heap_freetuple(pg_class_tuple);
      /* We must recalculate physical address in case it changed */
      RelationInitPhysicalAddr(relation);
!     /*
!      * Must reset targblock, fsm_nblocks and vm_nblocks in case rel was
!      * truncated
!      */
      relation->rd_targblock = InvalidBlockNumber;
      relation->rd_fsm_nblocks = InvalidBlockNumber;
+     relation->rd_vm_nblocks = InvalidBlockNumber;
      /* Must free any AM cached data, too */
      if (relation->rd_amcache)
          pfree(relation->rd_amcache);
***************
*** 1751,1756 **** RelationClearRelation(Relation relation, bool rebuild)
--- 1757,1763 ----
      {
          relation->rd_targblock = InvalidBlockNumber;
          relation->rd_fsm_nblocks = InvalidBlockNumber;
+         relation->rd_vm_nblocks = InvalidBlockNumber;
          if (relation->rd_rel->relkind == RELKIND_INDEX)
          {
              relation->rd_isvalid = false;        /* needs to be revalidated */
***************
*** 2346,2351 **** RelationBuildLocalRelation(const char *relname,
--- 2353,2359 ----

      rel->rd_targblock = InvalidBlockNumber;
      rel->rd_fsm_nblocks = InvalidBlockNumber;
+     rel->rd_vm_nblocks = InvalidBlockNumber;

      /* make sure relation is marked as having no open file yet */
      rel->rd_smgr = NULL;
***************
*** 3603,3608 **** load_relcache_init_file(void)
--- 3611,3617 ----
          rel->rd_smgr = NULL;
          rel->rd_targblock = InvalidBlockNumber;
          rel->rd_fsm_nblocks = InvalidBlockNumber;
+         rel->rd_vm_nblocks = InvalidBlockNumber;
          if (rel->rd_isnailed)
              rel->rd_refcnt = 1;
          else
*** src/include/access/heapam.h
--- src/include/access/heapam.h
***************
*** 153,158 **** extern void heap_page_prune_execute(Buffer buffer,
--- 153,159 ----
                          OffsetNumber *nowunused, int nunused,
                          bool redirect_move);
  extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
+ extern void heap_page_update_all_visible(Buffer buffer);

  /* in heap/syncscan.c */
  extern void ss_report_location(Relation rel, BlockNumber location);
*** src/include/access/htup.h
--- src/include/access/htup.h
***************
*** 601,609 **** typedef struct xl_heaptid
  typedef struct xl_heap_delete
  {
      xl_heaptid    target;            /* deleted tuple id */
  } xl_heap_delete;

! #define SizeOfHeapDelete    (offsetof(xl_heap_delete, target) + SizeOfHeapTid)

  /*
   * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
--- 601,610 ----
  typedef struct xl_heap_delete
  {
      xl_heaptid    target;            /* deleted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
  } xl_heap_delete;

! #define SizeOfHeapDelete    (offsetof(xl_heap_delete, all_visible_cleared) + sizeof(bool))

  /*
   * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
***************
*** 626,646 **** typedef struct xl_heap_header
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, target) + SizeOfHeapTid)

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;

! #define SizeOfHeapUpdate    (offsetof(xl_heap_update, newtid) + SizeOfIptrData)

  /*
   * This is what we need to know about vacuum page cleanup/redirect
--- 627,650 ----
  typedef struct xl_heap_insert
  {
      xl_heaptid    target;            /* inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
      /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_insert;

! #define SizeOfHeapInsert    (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))

  /* This is what we need to know about update|move|hot_update */
  typedef struct xl_heap_update
  {
      xl_heaptid    target;            /* deleted tuple id */
      ItemPointerData newtid;        /* new inserted tuple id */
+     bool all_visible_cleared;    /* PD_ALL_VISIBLE was cleared */
+     bool new_all_visible_cleared; /* same for the page of newtid */
      /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */
      /* and TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;

! #define SizeOfHeapUpdate    (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))

  /*
   * This is what we need to know about vacuum page cleanup/redirect
*** /dev/null
--- src/include/access/visibilitymap.h
***************
*** 0 ****
--- 1,30 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.h
+  *      visibility map interface
+  *
+  *
+  * Portions Copyright (c) 2007, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef VISIBILITYMAP_H
+ #define VISIBILITYMAP_H
+
+ #include "utils/rel.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "access/xlogdefs.h"
+
+ extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk);
+ extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
+                               Buffer *vmbuf);
+ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk,
+                               XLogRecPtr recptr, Buffer *vmbuf);
+ extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+ extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
+
+ #endif   /* VISIBILITYMAP_H */
*** src/include/storage/bufpage.h
--- src/include/storage/bufpage.h
***************
*** 152,159 **** typedef PageHeaderData *PageHeader;
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */

! #define PD_VALID_FLAG_BITS    0x0003        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
--- 152,161 ----
  #define PD_HAS_FREE_LINES    0x0001        /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002        /* not enough free space for new
                                           * tuple? */
+ #define PD_ALL_VISIBLE        0x0004        /* all tuples on page are visible to
+                                          * everyone */

! #define PD_VALID_FLAG_BITS    0x0007        /* OR of all valid pd_flags bits */

  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 336,341 **** typedef PageHeaderData *PageHeader;
--- 338,350 ----
  #define PageClearFull(page) \
      (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL)

+ #define PageIsAllVisible(page) \
+     (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE)
+ #define PageSetAllVisible(page) \
+     (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
+ #define PageClearAllVisible(page) \
+     (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+
  #define PageIsPrunable(page, oldestxmin) \
  ( \
      AssertMacro(TransactionIdIsNormal(oldestxmin)), \
*** src/include/storage/relfilenode.h
--- src/include/storage/relfilenode.h
***************
*** 24,37 **** typedef enum ForkNumber
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;

! #define MAX_FORKNUM        FSM_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
--- 24,38 ----
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM,
!     VISIBILITYMAP_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;

! #define MAX_FORKNUM        VISIBILITYMAP_FORKNUM

  /*
   * RelFileNode must provide all that we need to know to physically access
*** src/include/utils/rel.h
--- src/include/utils/rel.h
***************
*** 195,202 **** typedef struct RelationData
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /* size of the FSM, or InvalidBlockNumber if not known yet */
      BlockNumber    rd_fsm_nblocks;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */
--- 195,206 ----
      List       *rd_indpred;        /* index predicate tree, if any */
      void       *rd_amcache;        /* available for use by index AM */

!     /*
!      * sizes of the free space and visibility map forks, or InvalidBlockNumber
!      * if not known yet
!      */
      BlockNumber    rd_fsm_nblocks;
+     BlockNumber    rd_vm_nblocks;

      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */

Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Here's an updated version, with a lot of smaller cleanups, and using 
> relcache invalidation to notify other backends when the visibility map 
> fork is extended. I already committed the change to FSM to do the same. 
> I'm feeling quite satisfied to commit this patch early next week.

Committed.

I haven't done any doc changes for this yet. I think a short section in 
the "database internal storage" chapter is probably in order, and the 
fact that plain VACUUM skips pages should be mentioned somewhere. I'll 
skim through references to vacuum and see what needs to be changed.

Hmm. It just occurred to me that I think this circumvented the 
anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid 
anymore. We'll need to disable the skipping when autovacuum is triggered 
to prevent wraparound. VACUUM FREEZE does that already, but it's 
unnecessarily aggressive in freezing.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
> FREEZE does that already, but it's unnecessarily aggressive in freezing.

Having seen how the anti-wraparound vacuums work in the field I think merely
replacing it with a regular vacuum which covers the whole table will not
actually work well.

What will happen is that, because nothing else is advancing the relfrozenxid,
the age of the relfrozenxid for all tables will advance until they all hit
autovacuum_max_freeze_age. Quite often all the tables were created around the
same time so they will all hit autovacuum_max_freeze_age at the same time.

So a database which was operating fine and receiving regular vacuums at a
reasonable pace will suddenly be hit by vacuums for every table all at the
same time, 3 at a time. If you don't have vacuum_cost_delay set that will
cause a major issue. Even if you do have vacuum_cost_delay set it will prevent
the small busy tables from getting vacuumed regularly due to the backlog in
anti-wraparound vacuums.

Worse, vacuum will set the freeze_xid to nearly the same value for all of the
tables. So it will all happen again in another 100M transactions. And again in
another 100M transactions, and again...

I think there are several things which need to happen here.

1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
   means unnecessary full table vacuums long before they accomplish anything.

2) Include a factor which spreads out the anti-wraparound freezes in the
   autovacuum launcher. Some ideas:

   . we could implicitly add random(vacuum_freeze_min_age) to the
     autovacuum_max_freeze_age. That would spread them out evenly over 100M
     transactions. (See the sketch below the list.)

   . we could check if another anti-wraparound vacuum is still running and
     implicitly add a vacuum_freeze_min_age penalty to the
     autovacuum_max_freeze_age for each running anti-wraparound vacuum. That
     would spread them out without introducing non-determinism, which
     seems better.

   . we could leave autovacuum_max_freeze_age and instead pick a semi-random
     vacuum_freeze_min_age. This would mean the first set of anti-wraparound
     vacuums would still be synchronized but subsequent ones might be spread
     out somewhat. There's not as much room to randomize this though and it
     would affect how much i/o vacuum did which makes it seem less palatable
     to me.

3) I also think we need to put a clamp on the vacuum_cost_delay. Too many
   people are setting it to unreasonably high values which results in their
   vacuums never completing. Actually I think what we should do is junk all
   the existing parameters and replace it with a vacuum_nice_level or
   vacuum_bandwidth_cap from which we calculate the cost_limit and hide all
   the other parameters as internal parameters.
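
A minimal sketch of the first idea under point 2, assuming a hypothetical
helper inside the autovacuum launcher. The function and variable names are
made up for illustration and are not part of the patch; only the
TransactionId macros and random() are existing interfaces.

/*
 * Hypothetical sketch only.  Add a per-table random offset to the
 * wraparound threshold so that tables created around the same time
 * don't all hit autovacuum_max_freeze_age in the same window.
 */
static bool
wraparound_vacuum_due(TransactionId relfrozenxid, TransactionId recentXid,
                      int32 freeze_max_age, int32 freeze_min_age)
{
    /* random offset in [0, freeze_min_age), recomputed per table */
    int32           jitter = (freeze_min_age > 0) ?
                             (int32) (random() % freeze_min_age) : 0;
    TransactionId   forceLimit;

    forceLimit = recentXid - (freeze_max_age + jitter);
    if (!TransactionIdIsNormal(forceLimit))
        forceLimit = FirstNormalTransactionId;

    return TransactionIdPrecedes(relfrozenxid, forceLimit);
}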
 

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
 


Re: Visibility map, partial vacuums

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:

> Hmm. It just occurred to me that I think this circumvented the  
> anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid  
> anymore. We'll need to disable the skipping when autovacuum is triggered  
> to prevent wraparound. VACUUM FREEZE does that already, but it's  
> unnecessarily aggressive in freezing.

Heh :-)  Yes, this should be handled sanely, without having to invoke
FREEZE.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Visibility map, partial vacuums

From
Magnus Hagander
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> 
>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
>> FREEZE does that already, but it's unnecessarily aggressive in freezing.
> 
> Having seen how the anti-wraparound vacuums work in the field I think merely
> replacing it with a regular vacuum which covers the whole table will not
> actually work well.
> 
> What will happen is that, because nothing else is advancing the relfrozenxid,
> the age of the relfrozenxid for all tables will advance until they all hit
> autovacuum_max_freeze_age. Quite often all the tables were created around the
> same time so they will all hit autovacuum_max_freeze_age at the same time.
> 
> So a database which was operating fine and receiving regular vacuums at a
> reasonable pace will suddenly be hit by vacuums for every table all at the
> same time, 3 at a time. If you don't have vacuum_cost_delay set that will
> cause a major issue. Even if you do have vacuum_cost_delay set it will prevent
> the small busy tables from getting vacuumed regularly due to the backlog in
> anti-wraparound vacuums.
> 
> Worse, vacuum will set the freeze_xid to nearly the same value for all of the
> tables. So it will all happen again in another 100M transactions. And again in
> another 100M transactions, and again...
> 
> I think there are several things which need to happen here.
> 
> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>    means unnecessary full table vacuums long before they accomplish anything.
> 
> 2) Include a factor which spreads out the anti-wraparound freezes in the
>    autovacuum launcher. Some ideas:
> 
>     . we could implicitly add random(vacuum_freeze_min_age) to the
>       autovacuum_max_freeze_age. That would spread them out evenly over 100M
>       transactions.
> 
>     . we could check if another anti-wraparound vacuum is still running and
>       implicitly add a vacuum_freeze_min_age penalty to the
>       autovacuum_max_freeze_age for each running anti-wraparound vacuum. That
>       would spread them out without being introducing non-determinism which
>       seems better.
> 
>     . we could leave autovacuum_max_freeze_age and instead pick a semi-random
>       vacuum_freeze_min_age. This would mean the first set of anti-wraparound
>       vacuums would still be synchronized but subsequent ones might be spread
>       out somewhat. There's not as much room to randomize this though and it
>       would affect how much i/o vacuum did which makes it seem less palatable
>       to me.

How about a way to say that only one (or a config parameter for <n>) of
the autovac workers can be used for anti-wraparound vacuum? Then the
other slots would still be available for the
small-but-frequently-updated tables.



> 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many
>    people are setting it to unreasonably high values which results in their
>    vacuums never completing. Actually I think what we should do is junk all
>    the existing parameters and replace it with a vacuum_nice_level or
>    vacuum_bandwidth_cap from which we calculate the cost_limit and hide all
>    the other parameters as internal parameters.

It would certainly be helpful if it was just a single parameter - the
arbitrariness of the parameters there now makes them pretty hard to set
properly - or at least easy to set wrong.


//Magnus


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> 
>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
>> FREEZE does that already, but it's unnecessarily aggressive in freezing.

FWIW, it seems the omission is actually the other way 'round. Autovacuum 
always forces a full-scanning vacuum, making the visibility map useless 
for autovacuum. This obviously needs to be fixed.

> What will happen is that, because nothing else is advancing the relfrozenxid,
> the age of the relfrozenxid for all tables will advance until they all hit
> autovacuum_max_freeze_age. Quite often all the tables were created around the
> same time so they will all hit autovacuum_max_freeze_age at the same time.
> 
> So a database which was operating fine and receiving regular vacuums at a
> reasonable pace will suddenly be hit by vacuums for every table all at the
> same time, 3 at a time. If you don't have vacuum_cost_delay set that will
> cause a major issue. Even if you do have vacuum_cost_delay set it will prevent
> the small busy tables from getting vacuumed regularly due to the backlog in
> anti-wraparound vacuums.
> 
> Worse, vacuum will set the freeze_xid to nearly the same value for all of the
> tables. So it will all happen again in another 100M transactions. And again in
> another 100M transactions, and again...

But we already have that problem, don't we? When you initially load your 
database, all tuples will have the same xmin, and all tables will have 
more or less the same relfrozenxid. I guess you can argue that it 
becomes more obvious if vacuums are otherwise cheaper, but I don't think 
the visibility map makes that much difference to suddenly make this 
issue urgent.

Agreed that it would be nice to do something about it, though.

> I think there are several things which need to happen here.
> 
> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>    means unnecessary full table vacuums long before they accomplish anything.

It allows you to truncate clog. If I did my math right, 200M 
transactions amounts to ~50MB of clog. Perhaps we should still raise it, 
disk space is cheap after all.
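
For reference, a back-of-the-envelope check of that figure. This little
standalone program is just an illustration (not code from the tree), assuming
the usual two clog status bits per transaction:

#include <stdio.h>

int
main(void)
{
    long long   xacts = 200000000LL;    /* autovacuum_max_freeze_age */
    long long   bits_per_xact = 2;      /* commit status bits kept in clog */
    long long   bytes = xacts * bits_per_xact / 8;

    printf("clog retained for %lld transactions: %lld bytes (~%.0f MB)\n",
           xacts, bytes, bytes / (1024.0 * 1024.0));
    return 0;
}

It prints roughly 48 MB, which is consistent with the ~50MB estimate.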

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Gregory Stark wrote:
>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>
>>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
>>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
>>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
>>> FREEZE does that already, but it's unnecessarily aggressive in freezing.
>
> FWIW, it seems the omission is actually the other way 'round. Autovacuum always
> forces a full-scanning vacuum, making the visibility map useless for
> autovacuum. This obviously needs to be fixed.

How does it do that? Is there some option in the VacStmt to control this? Do
we just need a syntax to set that option?


How easy is it to tell what percentage of the table needs to be vacuumed? If
it's > 50% perhaps it would make sense to scan the whole table? (Hm. Not
really if it's a contiguous 50% though...)

Another idea: Perhaps each page of the visibility map should have a frozenxid
(or multiple frozenxids?). Then if an individual page of the visibility map is
old we could force scanning all the heap pages covered by that map page and
update it. I'm not sure we can do that safely though without locking issues --
or is it ok because it's vacuum doing the updating?

>> Worse, vacuum will set the freeze_xid to nearly the same value for all of the
>> tables. So it will all happen again in another 100M transactions. And again in
>> another 100M transactions, and again...
>
> But we already have that problem, don't we? When you initially load your
> database, all tuples will have the same xmin, and all tables will have more or
> less the same relfrozenxid. I guess you can argue that it becomes more obvious
> if vacuums are otherwise cheaper, but I don't think the visibility map makes
> that much difference to suddenly make this issue urgent.

We already have that problem but it only bites in a specific case: if you have
no other vacuums being triggered by the regular dead tuple scale factor. The
normal case is intended to be that autovacuum triggers much more frequently
than every 100M transactions to reduce bloat.

However in practice this specific case does seem to arise alarmingly
easily. Most databases do have some large tables which are never deleted from or
updated. Also, the default scale factor of 20% is actually quite easy to never
reach if your tables are also growing quickly -- effectively moving the
goalposts further out as fast as the updates and deletes bloat the table.

The visibility map essentially widens this specific use case to cover *all*
tables. Since the relfrozenxid would never get advanced by regular vacuums the
only time it would get advanced is when they all hit the 200M wall
simultaneously.

> Agreed that it would be nice to do something about it, though.
>
>> I think there are several things which need to happen here.
>>
>> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>>    means unnecessary full table vacuums long before they accomplish anything.
>
> It allows you to truncate clog. If I did my math right, 200M transactions
> amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap
> after all.

Ah. Hm. Then perhaps this belongs in the realm of the config generator people
are working on. They'll need a dial to say how much disk space you expect your
database to take in addition to how much memory your machine has available.
50M is nothing for a 1TB database but it's kind of silly to have to keep
hundreds of megs of clogs on a 1MB database.

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
 


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Gregory Stark wrote:
>>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>>>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
>>>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
>>>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
>>>> FREEZE does that already, but it's unnecessarily aggressive in freezing.
>> FWIW, it seems the omission is actually the other way 'round. Autovacuum always
>> forces a full-scanning vacuum, making the visibility map useless for
>> autovacuum. This obviously needs to be fixed.
>
> How does it do that? Is there some option in the VacStmt to control this? Do
> we just need a syntax to set that option?

The way it works now is that if VacuumStmt->freeze_min_age is not -1
(which means "use the default"), the visibility map is not used and the
whole table is scanned. Autovacuum always sets freeze_min_age, so it's
never using the visibility map. Attached is a patch I'm considering to
fix that.

> How easy is it to tell what percentage of the table needs to be vacuumed? If
> it's > 50% perhaps it would make sense to scan the whole table? (Hm. Not
> really if it's a contiguous 50% though...)

Hmm. You could scan the visibility map to see how much you could skip by
using it. You could account for contiguity.
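
A minimal sketch of what that estimate could look like, using only the
visibilitymap_test() interface from the patch. The function name and the
idea of returning a fraction are made up for illustration:

/*
 * Hypothetical sketch.  Probe the visibility map bit of every heap block
 * and return the fraction of pages a partial vacuum could skip.  A caller
 * could fall back to scanning the whole table when the fraction is small,
 * or when the skippable pages turn out to be badly scattered.
 */
static double
vm_skippable_fraction(Relation onerel, BlockNumber nblocks)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber blkno;
    BlockNumber skippable = 0;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (visibilitymap_test(onerel, blkno, &vmbuffer))
            skippable++;
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return nblocks > 0 ? (double) skippable / nblocks : 0.0;
}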

> Another idea: Perhaps each page of the visibility map should have a frozenxid
> (or multiple frozenxids?). Then if an individual page of the visibility map is
> old we could force scanning all the heap pages covered by that map page and
> update it. I'm not sure we can do that safely though without locking issues --
> or is it ok because it's vacuum doing the updating?

We discussed that a while ago:

http://archives.postgresql.org/message-id/492A6032.6080000@enterprisedb.com

Tom was concerned about making the visibility map not just a hint but
critical data. Rightly so. This is certainly 8.5 stuff; perhaps it would
be more palatable after we get the index-only-scans working using the
visibility map, since the map would be critical data anyway.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fd2429a..3e3cb9d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -171,10 +171,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
     vacrelstats->hasindex = (nindexes > 0);

     /* Should we use the visibility map or scan all pages? */
-    if (vacstmt->freeze_min_age != -1)
-        scan_all = true;
-    else
-        scan_all = false;
+    scan_all = vacstmt->scan_all;

     /* Do the vacuuming */
     lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eb7ab4d..2781f6e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2771,6 +2771,7 @@ _copyVacuumStmt(VacuumStmt *from)
     COPY_SCALAR_FIELD(analyze);
     COPY_SCALAR_FIELD(verbose);
     COPY_SCALAR_FIELD(freeze_min_age);
+    COPY_SCALAR_FIELD(scan_all);
     COPY_NODE_FIELD(relation);
     COPY_NODE_FIELD(va_cols);

diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index d4c57bb..86a032f 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1436,6 +1436,7 @@ _equalVacuumStmt(VacuumStmt *a, VacuumStmt *b)
     COMPARE_SCALAR_FIELD(analyze);
     COMPARE_SCALAR_FIELD(verbose);
     COMPARE_SCALAR_FIELD(freeze_min_age);
+    COMPARE_SCALAR_FIELD(scan_all);
     COMPARE_NODE_FIELD(relation);
     COMPARE_NODE_FIELD(va_cols);

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 85f4616..1aab75c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5837,6 +5837,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
                     n->analyze = false;
                     n->full = $2;
                     n->freeze_min_age = $3 ? 0 : -1;
+                    n->scan_all = $3;
                     n->verbose = $4;
                     n->relation = NULL;
                     n->va_cols = NIL;
@@ -5849,6 +5850,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
                     n->analyze = false;
                     n->full = $2;
                     n->freeze_min_age = $3 ? 0 : -1;
+                    n->scan_all = $3;
                     n->verbose = $4;
                     n->relation = $5;
                     n->va_cols = NIL;
@@ -5860,6 +5862,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
                     n->vacuum = true;
                     n->full = $2;
                     n->freeze_min_age = $3 ? 0 : -1;
+                    n->scan_all = $3;
                     n->verbose |= $4;
                     $$ = (Node *)n;
                 }
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 8d8947f..2c68779 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2649,6 +2649,7 @@ autovacuum_do_vac_analyze(autovac_table *tab,
     vacstmt.full = false;
     vacstmt.analyze = tab->at_doanalyze;
     vacstmt.freeze_min_age = tab->at_freeze_min_age;
+    vacstmt.scan_all = tab->at_wraparound;
     vacstmt.verbose = false;
     vacstmt.relation = NULL;    /* not used since we pass a relid */
     vacstmt.va_cols = NIL;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index bb71ac1..df19f7e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1966,6 +1966,7 @@ typedef struct VacuumStmt
     bool        full;            /* do FULL (non-concurrent) vacuum */
     bool        analyze;        /* do ANALYZE step */
     bool        verbose;        /* print progress info */
+    bool        scan_all;        /* force scan of all pages */
     int            freeze_min_age; /* min freeze age, or -1 to use default */
     RangeVar   *relation;        /* single table to process, or NULL */
     List       *va_cols;        /* list of column names, or NIL for all */
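
For context, here is a rough sketch (not taken from the patch itself) of how
the scan_all flag is meant to interact with the visibility map inside
lazy_scan_heap(); visibilitymap_test() and the scanned_all bookkeeping are
illustrative rather than final:

for (blkno = 0; blkno < nblocks; blkno++)
{
    /*
     * Unless a full scan was requested (VACUUM FREEZE, or an autovacuum
     * launched to prevent wraparound), consult the visibility map and skip
     * pages known to contain only tuples visible to everyone.
     */
    if (!scan_all && visibilitymap_test(onerel, blkno, &vmbuffer))
    {
        /* we skipped pages, so relfrozenxid must not be advanced */
        vacrelstats->scanned_all = false;
        continue;
    }

    /* ... existing per-page pruning and dead-tuple collection ... */
}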

Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Gregory Stark <stark@enterprisedb.com> writes:

> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>
>> Gregory Stark wrote:
>>> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>>>    means unnecessary full table vacuums long before they accomplish anything.
>>
>> It allows you to truncate clog. If I did my math right, 200M transactions
>> amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap
>> after all.

Hm, the more I think about it the more this bothers me. It's another subtle
change from the current behaviour. 

Currently *every* vacuum tries to truncate the clog. So you're constantly
trimming off a little bit.

With the visibility map (assuming you fix it not to do full scans all the
time) you can never truncate the clog, just as you can never advance the
relfrozenxid, unless you do a special full-table vacuum.

I think in practice most people had a read-only table somewhere in their
database which prevented the clog from ever being truncated anyways, so
perhaps this isn't such a big deal.

But the bottom line is that the anti-wraparound vacuums are going to be a lot
more important and much more visible now than they were in the past.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: Visibility map, partial vacuums

From
Bruce Momjian
Date:
Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
when our wraparound limit is around 2B?

Also, is anything being done about the concern about 'vacuum storm'
explained below?

---------------------------------------------------------------------------

Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> 
> > Hmm. It just occurred to me that I think this circumvented the anti-wraparound
> > vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
> > disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
> > FREEZE does that already, but it's unnecessarily aggressive in freezing.
> 
> Having seen how the anti-wraparound vacuums work in the field I think merely
> replacing it with a regular vacuum which covers the whole table will not
> actually work well.
> 
> What will happen is that, because nothing else is advancing the relfrozenxid,
> the age of the relfrozenxid for all tables will advance until they all hit
> autovacuum_max_freeze_age. Quite often all the tables were created around the
> same time so they will all hit autovacuum_max_freeze_age at the same time.
> 
> So a database which was operating fine and receiving regular vacuums at a
> reasonable pace will suddenly be hit by vacuums for every table all at the
> same time, 3 at a time. If you don't have vacuum_cost_delay set that will
> cause a major issue. Even if you do have vacuum_cost_delay set it will prevent
> the small busy tables from getting vacuumed regularly due to the backlog in
> anti-wraparound vacuums.
> 
> Worse, vacuum will set the freeze_xid to nearly the same value for all of the
> tables. So it will all happen again in another 100M transactions. And again in
> another 100M transactions, and again...
> 
> I think there are several things which need to happen here.
> 
> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>    means unnecessary full table vacuums long before they accomplish anything.
> 
> 2) Include a factor which spreads out the anti-wraparound freezes in the
>    autovacuum launcher. Some ideas:
> 
>     . we could implicitly add random(vacuum_freeze_min_age) to the
>       autovacuum_max_freeze_age. That would spread them out evenly over 100M
>       transactions.
> 
>     . we could check if another anti-wraparound vacuum is still running and
>       implicitly add a vacuum_freeze_min_age penalty to the
>       autovacuum_max_freeze_age for each running anti-wraparound vacuum. That
>       would spread them out without introducing non-determinism, which
>       seems better.
> 
>     . we could leave autovacuum_max_freeze_age and instead pick a semi-random
>       vacuum_freeze_min_age. This would mean the first set of anti-wraparound
>       vacuums would still be synchronized but subsequent ones might be spread
>       out somewhat. There's not as much room to randomize this though and it
>       would affect how much i/o vacuum did which makes it seem less palatable
>       to me.
> 
> 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many
>    people are setting it to unreasonably high values which results in their
>    vacuums never completing. Actually I think what we should do is junk all
>    the existing parameters and replace it with a vacuum_nice_level or
>    vacuum_bandwidth_cap from which we calculate the cost_limit and hide all
>    the other parameters as internal parameters.
> 
> -- 
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>   Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
> 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Visibility map, partial vacuums

From
Andrew Dunstan
Date:

Bruce Momjian wrote:
> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> when our wraparound limit is around 2B?
>   

Presumably because of this (from the docs):

"The commit status uses two bits per transaction, so if 
autovacuum_freeze_max_age has its maximum allowed value of a little less 
than two billion, pg_clog can be expected to grow to about half a gigabyte."
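
Spelling out the arithmetic (two status bits per transaction):

  200M xacts * 2 bits  = 400 Mbits ~= 50 MB
  ~2B xacts  * 2 bits ~=   4 Gbits ~= 0.5 GB

which also matches the ~50MB figure Heikki quoted upthread for the 200M
default.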

cheers

andrew



Re: Visibility map, partial vacuums

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
> 
> Bruce Momjian wrote:
> > Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> > when our wraparound limit is around 2B?
> >   
> 
> Presumably because of this (from the docs):
> 
> "The commit status uses two bits per transaction, so if 
> autovacuum_freeze_max_age has its maximum allowed value of a little less 
> than two billion, pg_clog can be expected to grow to about half a gigabyte."

Oh, that's interesting;  thanks.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Visibility map, partial vacuums

From
Gregory Stark
Date:
Bruce Momjian <bruce@momjian.us> writes:

> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> when our wraparound limit is around 2B?

I suggested raising it dramatically in the post you quote, and Heikki pointed
out that it controls the maximum amount of space the clog will take. Raising it to,
say, 800M will mean up to 200MB of space which might be kind of annoying for a
small database.

It would be nice if we could ensure the clog got trimmed frequently enough on
small databases that we could raise the max_age. It's really annoying to see
all these vacuums running 10x more often than necessary.

The rest of the thread is visible at the bottom of:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525

> Also, is anything being done about the concern about 'vacuum storm'
> explained below?

I'm interested too.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> 
>> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
>> when our wraparound limit is around 2B?
> 
> I suggested raising it dramatically in the post you quote, and Heikki pointed
> out that it controls the maximum amount of space the clog will take. Raising it to,
> say, 800M will mean up to 200MB of space which might be kind of annoying for a
> small database.
> 
> It would be nice if we could ensure the clog got trimmed frequently enough on
> small databases that we could raise the max_age. It's really annoying to see
> all these vacuums running 10x more often than necessary.

Well, if it's a small database, you might as well just vacuum it.

> The rest of the thread is visible at the bottom of:
> 
> http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525
> 
>> Also, is anything being done about the concern about 'vacuum storm'
>> explained below?
> 
> I'm interested too.

The additional "vacuum_freeze_table_age" (as I'm now calling it) setting 
I discussed in a later thread should alleviate that somewhat. When a 
table is autovacuumed, the whole table is scanned to freeze tuples if 
it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. 
When different tables reach the autovacuum threshold at different times, 
they will also have their relfrozenxids set to different values. And in 
fact no anti-wraparound vacuum is needed.
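
Roughly, the decision could look like this (just a sketch, names not final):

TransactionId freeze_table_limit;

/*
 * Force a full-table scan, so that relfrozenxid can be advanced, if the
 * caller asked for one or if the table's relfrozenxid is older than
 * vacuum_freeze_table_age.  (Wraparound of the limit itself is ignored
 * here for brevity.)
 */
freeze_table_limit = ReadNewTransactionId() - vacuum_freeze_table_age;
scan_all = vacstmt->scan_all ||
    TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
                                  freeze_table_limit);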

That doesn't help with read-only or insert-only tables, but that's not a 
new problem.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Visibility map, partial vacuums

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> >> Also, is anything being done about the concern about 'vacuum storm'
> >> explained below?
> > 
> > I'm interested too.
> 
> The additional "vacuum_freeze_table_age" (as I'm now calling it) setting 
> I discussed in a later thread should alleviate that somewhat. When a 
> table is autovacuumed, the whole table is scanned to freeze tuples if 
> it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. 
> When different tables reach the autovacuum threshold at different times, 
> they will also have their relfrozenxids set to different values. And in 
> fact no anti-wraparound vacuum is needed.
> 
> That doesn't help with read-only or insert-only tables, but that's not a 
> new problem.

OK, is this targeted for 8.4?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Visibility map, partial vacuums

From
Bruce Momjian
Date:
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> 
> > Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> > when our wraparound limit is around 2B?
> 
> I suggested raising it dramatically in the post you quote, and Heikki pointed
> out that it controls the maximum amount of space the clog will take. Raising it to,
> say, 800M will mean up to 200MB of space which might be kind of annoying for a
> small database.
> 
> It would be nice if we could ensure the clog got trimmed frequently enough on
> small databases that we could raise the max_age. It's really annoying to see
> all these vacuums running 10x more often than necessary.

I always assumed that it was our 4-byte xid that required vacuum freeze,
but I now see our limiting factor is the size of the clog; interesting.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Visibility map, partial vacuums

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>>>> Also, is anything being done about the concern about 'vacuum storm'
>>>> explained below?
>>> I'm interested too.
>> The additional "vacuum_freeze_table_age" (as I'm now calling it) setting 
>> I discussed in a later thread should alleviate that somewhat. When a 
>> table is autovacuumed, the whole table is scanned to freeze tuples if 
>> it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. 
>> When different tables reach the autovacuum threshold at different times, 
>> they will also have their relfrozenxids set to different values. And in 
>> fact no anti-wraparound vacuum is needed.
>>
>> That doesn't help with read-only or insert-only tables, but that's not a 
>> new problem.
> 
> OK, is this targeted for 8.4?

Yes. It's been on my todo list for a long time, and I've also added it 
to the Open Items list so that we don't lose track of it.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com