Thread: Index-only quals

From: Heikki Linnakangas
Here is an updated version of my patch to return data from b-tree
indexes, and use it to satisfy quals.

I added a new column 'amregurgitate' to pg_am, to mark which indexams
can return index tuples. Also, the data type of the index column in
pg_attribute must match the type of the heap column - this catches the
hack that 'name' is stored as cstring, which I had hardcoded before.

As discussed, GiST/GIN would need more infrastructure to mark which
opclasses can return tuples, but as long as GiST/GIN doesn't support
regurgitation at all, I'm not going to complicate the catalogs with that.

There are also some planner fixes: indexes that are only useful because
of index-only quals are not considered for bitmap scans, and volatile
expressions mustn't be used as index-only quals.


This patch comes in two parts: the indexam API changes, which just split
the index_getnext function in two without providing any new
functionality, and the main patch, which applies on top of the indexam
API changes. The patches are also available at
git://git.postgresql.org/git/users/heikki/postgres.git, branches
'indexam-api-changes' and 'indexfilter'.

Barring objections, I'm going to apply the indexam API changes part,
since that simplifies the code in question regardless of the rest of the
work. I'm pretty happy now with the indexfilter patch as well, but want
to do some more testing on that before committing. Some more eyeballs
would be appreciated as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 1490,1497 **** heap_fetch(Relation relation,
   * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
   * of a HOT chain), and buffer is the buffer holding this tuple.  We search
   * for the first chain member satisfying the given snapshot.  If one is
!  * found, we update *tid to reference that tuple's offset number, and
!  * return TRUE.  If no match, return FALSE without modifying *tid.
   *
   * If all_dead is not NULL, we check non-visible tuples to see if they are
   * globally dead; *all_dead is set TRUE if all members of the HOT chain
--- 1490,1503 ----
   * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
   * of a HOT chain), and buffer is the buffer holding this tuple.  We search
   * for the first chain member satisfying the given snapshot.  If one is
!  * found, we update *tid and *heapTuple to reference that tuple, and
!  * return TRUE.  If no match, return FALSE without modifying *tid (*heapTuple
!  * is undefined in that case).
!  *
!  * To fetch the next matching tuple in the chain, call again with 'firstCall'
!  * set to FALSE and *tid still pointing to the tuple returned by previous call.
!  * (normally only one tuple in a chain can be visible at a time, but that does
!  * not hold for special snapshots like SnapshotAny)
   *
   * If all_dead is not NULL, we check non-visible tuples to see if they are
   * globally dead; *all_dead is set TRUE if all members of the HOT chain
***************
*** 1503,1529 **** heap_fetch(Relation relation,
   */
  bool
  heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
!                        bool *all_dead)
  {
      Page        dp = (Page) BufferGetPage(buffer);
      TransactionId prev_xmax = InvalidTransactionId;
      OffsetNumber offnum;
      bool        at_chain_start;

!     if (all_dead)
          *all_dead = true;

      Assert(TransactionIdIsValid(RecentGlobalXmin));

      Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
      offnum = ItemPointerGetOffsetNumber(tid);
!     at_chain_start = true;

      /* Scan through possible multiple members of HOT-chain */
      for (;;)
      {
          ItemId        lp;
-         HeapTupleData heapTuple;

          /* check for bogus TID */
          if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
--- 1509,1537 ----
   */
  bool
  heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
!                        HeapTuple heapTuple, bool *all_dead, bool firstCall)
  {
      Page        dp = (Page) BufferGetPage(buffer);
      TransactionId prev_xmax = InvalidTransactionId;
      OffsetNumber offnum;
      bool        at_chain_start;
+     bool        skip;

!     if (all_dead && firstCall)
          *all_dead = true;

      Assert(TransactionIdIsValid(RecentGlobalXmin));

      Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
      offnum = ItemPointerGetOffsetNumber(tid);
!
!     at_chain_start = firstCall;
!     skip = !firstCall;

      /* Scan through possible multiple members of HOT-chain */
      for (;;)
      {
          ItemId        lp;

          /* check for bogus TID */
          if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
***************
*** 1546,1558 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
              break;
          }

!         heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
!         heapTuple.t_len = ItemIdGetLength(lp);

          /*
           * Shouldn't see a HEAP_ONLY tuple at chain start.
           */
!         if (at_chain_start && HeapTupleIsHeapOnly(&heapTuple))
              break;

          /*
--- 1554,1566 ----
              break;
          }

!         heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
!         heapTuple->t_len = ItemIdGetLength(lp);

          /*
           * Shouldn't see a HEAP_ONLY tuple at chain start.
           */
!         if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
              break;

          /*
***************
*** 1561,1577 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           */
          if (TransactionIdIsValid(prev_xmax) &&
              !TransactionIdEquals(prev_xmax,
!                                  HeapTupleHeaderGetXmin(heapTuple.t_data)))
              break;

          /* If it's visible per the snapshot, we must return it */
!         if (HeapTupleSatisfiesVisibility(&heapTuple, snapshot, buffer))
          {
              ItemPointerSetOffsetNumber(tid, offnum);
              if (all_dead)
                  *all_dead = false;
              return true;
          }

          /*
           * If we can't see it, maybe no one else can either.  At caller
--- 1569,1586 ----
           */
          if (TransactionIdIsValid(prev_xmax) &&
              !TransactionIdEquals(prev_xmax,
!                                  HeapTupleHeaderGetXmin(heapTuple->t_data)))
              break;

          /* If it's visible per the snapshot, we must return it */
!         if (!skip && HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer))
          {
              ItemPointerSetOffsetNumber(tid, offnum);
              if (all_dead)
                  *all_dead = false;
              return true;
          }
+         skip = false;

          /*
           * If we can't see it, maybe no one else can either.  At caller
***************
*** 1579,1585 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           * transactions.
           */
          if (all_dead && *all_dead &&
!             HeapTupleSatisfiesVacuum(heapTuple.t_data, RecentGlobalXmin,
                                       buffer) != HEAPTUPLE_DEAD)
              *all_dead = false;

--- 1588,1594 ----
           * transactions.
           */
          if (all_dead && *all_dead &&
!             HeapTupleSatisfiesVacuum(heapTuple->t_data, RecentGlobalXmin,
                                       buffer) != HEAPTUPLE_DEAD)
              *all_dead = false;

***************
*** 1587,1599 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           * Check to see if HOT chain continues past this tuple; if so fetch
           * the next offnum and loop around.
           */
!         if (HeapTupleIsHotUpdated(&heapTuple))
          {
!             Assert(ItemPointerGetBlockNumber(&heapTuple.t_data->t_ctid) ==
                     ItemPointerGetBlockNumber(tid));
!             offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
              at_chain_start = false;
!             prev_xmax = HeapTupleHeaderGetXmax(heapTuple.t_data);
          }
          else
              break;                /* end of chain */
--- 1596,1608 ----
           * Check to see if HOT chain continues past this tuple; if so fetch
           * the next offnum and loop around.
           */
!         if (HeapTupleIsHotUpdated(heapTuple))
          {
!             Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
                     ItemPointerGetBlockNumber(tid));
!             offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
              at_chain_start = false;
!             prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
          }
          else
              break;                /* end of chain */
***************
*** 1615,1624 **** heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
  {
      bool        result;
      Buffer        buffer;

      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
      LockBuffer(buffer, BUFFER_LOCK_SHARE);
!     result = heap_hot_search_buffer(tid, buffer, snapshot, all_dead);
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buffer);
      return result;
--- 1624,1635 ----
  {
      bool        result;
      Buffer        buffer;
+     HeapTupleData heapTuple;

      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
      LockBuffer(buffer, BUFFER_LOCK_SHARE);
!     result = heap_hot_search_buffer(tid, buffer, snapshot, &heapTuple,
!                                     all_dead, true);
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buffer);
      return result;
*** a/src/backend/access/index/genam.c
--- b/src/backend/access/index/genam.c
***************
*** 99,107 **** RelationGetIndexScan(Relation indexRelation,
      ItemPointerSetInvalid(&scan->xs_ctup.t_self);
      scan->xs_ctup.t_data = NULL;
      scan->xs_cbuf = InvalidBuffer;
-     scan->xs_hot_dead = false;
-     scan->xs_next_hot = InvalidOffsetNumber;
-     scan->xs_prev_xmax = InvalidTransactionId;

      /*
       * Let the AM fill in the key and any opaque data it wants.
--- 99,104 ----
*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
***************
*** 20,26 ****
   *        index_insert    - insert an index tuple into a relation
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
!  *        index_getnext    - get the next tuple from a scan
   *        index_getbitmap - get all tuples from a scan
   *        index_bulk_delete    - bulk deletion of index tuples
   *        index_vacuum_cleanup    - post-deletion cleanup of an index
--- 20,27 ----
   *        index_insert    - insert an index tuple into a relation
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
!  *        index_next        - advance index scan to next match
!  *        index_fetch        - fetch heap tuple for current index scan position
   *        index_getbitmap - get all tuples from a scan
   *        index_bulk_delete    - bulk deletion of index tuples
   *        index_vacuum_cleanup    - post-deletion cleanup of an index
***************
*** 310,317 **** index_rescan(IndexScanDesc scan, ScanKey key)
          scan->xs_cbuf = InvalidBuffer;
      }

-     scan->xs_next_hot = InvalidOffsetNumber;
-
      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall2(procedure,
--- 311,316 ----
***************
*** 389,420 **** index_restrpos(IndexScanDesc scan)
      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amrestrpos);

-     scan->xs_next_hot = InvalidOffsetNumber;
-
      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall1(procedure, PointerGetDatum(scan));
  }

  /* ----------------
!  *        index_getnext - get the next heap tuple from a scan
   *
!  * The result is the next heap tuple satisfying the scan keys and the
!  * snapshot, or NULL if no more matching tuples exist.    On success,
!  * the buffer containing the heap tuple is pinned (the pin will be dropped
!  * at the next index_getnext or index_endscan).
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
   * enough information to do it efficiently in the general case.
   * ----------------
   */
! HeapTuple
! index_getnext(IndexScanDesc scan, ScanDirection direction)
  {
-     HeapTuple    heapTuple = &scan->xs_ctup;
-     ItemPointer tid = &heapTuple->t_self;
      FmgrInfo   *procedure;

      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amgettuple);
--- 388,415 ----
      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amrestrpos);

      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall1(procedure, PointerGetDatum(scan));
  }

  /* ----------------
!  *        index_next - get the next index tuple from a scan
   *
!  * Advances to the next index tuple satisfying the scan keys, returning TRUE
!  * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.self.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
   * enough information to do it efficiently in the general case.
   * ----------------
   */
! bool
! index_next(IndexScanDesc scan, ScanDirection direction)
  {
      FmgrInfo   *procedure;
+     bool         found;

      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amgettuple);
***************
*** 422,641 **** index_getnext(IndexScanDesc scan, ScanDirection direction)
      Assert(TransactionIdIsValid(RecentGlobalXmin));

      /*
!      * We always reset xs_hot_dead; if we are here then either we are just
!      * starting the scan, or we previously returned a visible tuple, and in
!      * either case it's inappropriate to kill the prior index entry.
       */
!     scan->xs_hot_dead = false;

!     for (;;)
!     {
!         OffsetNumber offnum;
!         bool        at_chain_start;
!         Page        dp;

!         if (scan->xs_next_hot != InvalidOffsetNumber)
!         {
!             /*
!              * We are resuming scan of a HOT chain after having returned an
!              * earlier member.    Must still hold pin on current heap page.
!              */
!             Assert(BufferIsValid(scan->xs_cbuf));
!             Assert(ItemPointerGetBlockNumber(tid) ==
!                    BufferGetBlockNumber(scan->xs_cbuf));
!             Assert(TransactionIdIsValid(scan->xs_prev_xmax));
!             offnum = scan->xs_next_hot;
!             at_chain_start = false;
!             scan->xs_next_hot = InvalidOffsetNumber;
!         }
!         else
          {
!             bool        found;
!             Buffer        prev_buf;
!
!             /*
!              * If we scanned a whole HOT chain and found only dead tuples,
!              * tell index AM to kill its entry for that TID.
!              */
!             scan->kill_prior_tuple = scan->xs_hot_dead;
!
!             /*
!              * The AM's gettuple proc finds the next index entry matching the
!              * scan keys, and puts the TID in xs_ctup.t_self (ie, *tid). It
!              * should also set scan->xs_recheck, though we pay no attention to
!              * that here.
!              */
!             found = DatumGetBool(FunctionCall2(procedure,
!                                                PointerGetDatum(scan),
!                                                Int32GetDatum(direction)));
!
!             /* Reset kill flag immediately for safety */
!             scan->kill_prior_tuple = false;
!
!             /* If we're out of index entries, break out of outer loop */
!             if (!found)
!                 break;
!
!             pgstat_count_index_tuples(scan->indexRelation, 1);
!
!             /* Switch to correct buffer if we don't have it already */
!             prev_buf = scan->xs_cbuf;
!             scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
!                                                  scan->heapRelation,
!                                              ItemPointerGetBlockNumber(tid));
!
!             /*
!              * Prune page, but only if we weren't already on this page
!              */
!             if (prev_buf != scan->xs_cbuf)
!                 heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf,
!                                     RecentGlobalXmin);
!
!             /* Prepare to scan HOT chain starting at index-referenced offnum */
!             offnum = ItemPointerGetOffsetNumber(tid);
!             at_chain_start = true;
!
!             /* We don't know what the first tuple's xmin should be */
!             scan->xs_prev_xmax = InvalidTransactionId;
!
!             /* Initialize flag to detect if all entries are dead */
!             scan->xs_hot_dead = true;
          }

!         /* Obtain share-lock on the buffer so we can examine visibility */
!         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);

!         dp = (Page) BufferGetPage(scan->xs_cbuf);

-         /* Scan through possible multiple members of HOT-chain */
-         for (;;)
-         {
-             ItemId        lp;
-             ItemPointer ctid;
-
-             /* check for bogus TID */
-             if (offnum < FirstOffsetNumber ||
-                 offnum > PageGetMaxOffsetNumber(dp))
-                 break;
-
-             lp = PageGetItemId(dp, offnum);
-
-             /* check for unused, dead, or redirected items */
-             if (!ItemIdIsNormal(lp))
-             {
-                 /* We should only see a redirect at start of chain */
-                 if (ItemIdIsRedirected(lp) && at_chain_start)
-                 {
-                     /* Follow the redirect */
-                     offnum = ItemIdGetRedirect(lp);
-                     at_chain_start = false;
-                     continue;
-                 }
-                 /* else must be end of chain */
-                 break;
-             }
-
-             /*
-              * We must initialize all of *heapTuple (ie, scan->xs_ctup) since
-              * it is returned to the executor on success.
-              */
-             heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
-             heapTuple->t_len = ItemIdGetLength(lp);
-             ItemPointerSetOffsetNumber(tid, offnum);
-             heapTuple->t_tableOid = RelationGetRelid(scan->heapRelation);
-             ctid = &heapTuple->t_data->t_ctid;
-
-             /*
-              * Shouldn't see a HEAP_ONLY tuple at chain start.  (This test
-              * should be unnecessary, since the chain root can't be removed
-              * while we have pin on the index entry, but let's make it
-              * anyway.)
-              */
-             if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
-                 break;
-
-             /*
-              * The xmin should match the previous xmax value, else chain is
-              * broken.    (Note: this test is not optional because it protects
-              * us against the case where the prior chain member's xmax aborted
-              * since we looked at it.)
-              */
-             if (TransactionIdIsValid(scan->xs_prev_xmax) &&
-                 !TransactionIdEquals(scan->xs_prev_xmax,
-                                   HeapTupleHeaderGetXmin(heapTuple->t_data)))
-                 break;
-
-             /* If it's visible per the snapshot, we must return it */
-             if (HeapTupleSatisfiesVisibility(heapTuple, scan->xs_snapshot,
-                                              scan->xs_cbuf))
-             {
-                 /*
-                  * If the snapshot is MVCC, we know that it could accept at
-                  * most one member of the HOT chain, so we can skip examining
-                  * any more members.  Otherwise, check for continuation of the
-                  * HOT-chain, and set state for next time.
-                  */
-                 if (IsMVCCSnapshot(scan->xs_snapshot))
-                     scan->xs_next_hot = InvalidOffsetNumber;
-                 else if (HeapTupleIsHotUpdated(heapTuple))
-                 {
-                     Assert(ItemPointerGetBlockNumber(ctid) ==
-                            ItemPointerGetBlockNumber(tid));
-                     scan->xs_next_hot = ItemPointerGetOffsetNumber(ctid);
-                     scan->xs_prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
-                 }
-                 else
-                     scan->xs_next_hot = InvalidOffsetNumber;
-
-                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
-
-                 pgstat_count_heap_fetch(scan->indexRelation);
-
-                 return heapTuple;
-             }
-
-             /*
-              * If we can't see it, maybe no one else can either.  Check to see
-              * if the tuple is dead to all transactions.  If we find that all
-              * the tuples in the HOT chain are dead, we'll signal the index AM
-              * to not return that TID on future indexscans.
-              */
-             if (scan->xs_hot_dead &&
-                 HeapTupleSatisfiesVacuum(heapTuple->t_data, RecentGlobalXmin,
-                                          scan->xs_cbuf) != HEAPTUPLE_DEAD)
-                 scan->xs_hot_dead = false;
-
-             /*
-              * Check to see if HOT chain continues past this tuple; if so
-              * fetch the next offnum (we don't bother storing it into
-              * xs_next_hot, but must store xs_prev_xmax), and loop around.
-              */
-             if (HeapTupleIsHotUpdated(heapTuple))
-             {
-                 Assert(ItemPointerGetBlockNumber(ctid) ==
-                        ItemPointerGetBlockNumber(tid));
-                 offnum = ItemPointerGetOffsetNumber(ctid);
-                 at_chain_start = false;
-                 scan->xs_prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
-             }
-             else
-                 break;            /* end of chain */
-         }                        /* loop over a single HOT chain */
-
-         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
-
-         /* Loop around to ask index AM for another TID */
-         scan->xs_next_hot = InvalidOffsetNumber;
-     }

!     /* Release any held pin on a heap page */
!     if (BufferIsValid(scan->xs_cbuf))
      {
!         ReleaseBuffer(scan->xs_cbuf);
!         scan->xs_cbuf = InvalidBuffer;
      }

!     return NULL;                /* failure exit */
  }

  /* ----------------
--- 417,541 ----
      Assert(TransactionIdIsValid(RecentGlobalXmin));

      /*
!      * The AM's gettuple proc finds the next index entry matching the scan
!      * keys, and puts the TID in xs_ctup.t_self. It should also set
!      * scan->xs_recheck, though we pay no attention to that here.
       */
!     found = DatumGetBool(FunctionCall2(procedure,
!                                        PointerGetDatum(scan),
!                                        Int32GetDatum(direction)));

!     /* Reset kill flag immediately for safety */
!     scan->kill_prior_tuple = false;

!     /* If we're out of index entries, release any held pin on a heap page */
!     if (!found)
!     {
!         if (BufferIsValid(scan->xs_cbuf))
          {
!             ReleaseBuffer(scan->xs_cbuf);
!             scan->xs_cbuf = InvalidBuffer;
          }
+         return false;
+     }

!     pgstat_count_index_tuples(scan->indexRelation, 1);

!     return true;
! }


! /* ----------------
!  *        index_fetch - fetch the heap tuple the current index pointer points to
!  *
!  * The result is the heap tuple the current index tuple points to if it
!  * matches the snapshot, or NULL if it doesn't. On success, the buffer
!  * containing the heap tuple is pinned (the pin will be dropped at the next
!  * index_fetch or index_endscan).
!  * ----------------
!  */
! HeapTuple
! index_fetch(IndexScanDesc scan)
! {
!     ItemPointer tid = &scan->xs_ctup.t_self;
!     bool        found;
!     Buffer        prev_buf;
!
!     /*
!      * We only fetch the first matching tuple in a chain, so we can't support
!      * SnapshotAny. All other snapshots only consider one tuple in a HOT chain
!      * as visible at any given moment.
!      */
!     Assert(scan->xs_snapshot != SnapshotAny);
!
!     /* Switch to correct buffer if we don't have it already */
!     prev_buf = scan->xs_cbuf;
!     scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
!                                          scan->heapRelation,
!                                          ItemPointerGetBlockNumber(tid));
!
!     /*
!      * Prune page, but only if we weren't already on this page
!      */
!     if (prev_buf != scan->xs_cbuf)
!         heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf,
!                             RecentGlobalXmin);
!
!     /* Obtain share-lock on the buffer so we can examine visibility */
!     LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!
!     found = heap_hot_search_buffer(tid, scan->xs_cbuf,
!                                    scan->xs_snapshot, &scan->xs_ctup,
!                                    &scan->kill_prior_tuple, true);
!     LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!
!     if (found)
      {
!         Assert(!scan->kill_prior_tuple);
!
!         /*
!          * We must initialize all of *heapTuple (ie, scan->xs_ctup) since
!          * it is returned to the executor on success.
!          */
!         scan->xs_ctup.t_tableOid = RelationGetRelid(scan->heapRelation);
!
!         pgstat_count_heap_fetch(scan->indexRelation);
!
!         return &scan->xs_ctup;
      }
+     else
+         return NULL;
+ }

! /* ----------------
!  *        index_getnext - get the next heap tuple from a scan
!  *
!  * This is a shorthand for index_next() + index_fetch().
!  *
!  * The result is the next heap tuple satisfying the scan keys and the
!  * snapshot, or NULL if no more matching tuples exist.    On success,
!  * the buffer containing the heap tuple is pinned (the pin will be dropped
!  * at the next index_getnext or index_endscan).
!  *
!  * Note: caller must check scan->xs_recheck, and perform rechecking of the
!  * scan keys if required.  We do not do that here because we don't have
!  * enough information to do it efficiently in the general case.
!  * ----------------
!  */
! HeapTuple
! index_getnext(IndexScanDesc scan, ScanDirection direction)
! {
!     for(;;)
!     {
!         HeapTuple tup;
!
!         if (!index_next(scan, direction))
!             return NULL;
!
!         tup = index_fetch(scan);
!         if (tup != NULL)
!             return tup;
!     }
  }

  /* ----------------
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
***************
*** 772,777 **** copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
--- 772,778 ----
      TransactionId OldestXmin;
      TransactionId FreezeXid;
      RewriteState rwstate;
+     ItemPointerData tid;

      /*
       * Open the relations we need.
***************
*** 829,847 **** copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
      scan = index_beginscan(OldHeap, OldIndex,
                             SnapshotAny, 0, (ScanKey) NULL);

!     while ((tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
      {
          HeapTuple    copiedTuple;
          bool        isdead;
          int            i;

          CHECK_FOR_INTERRUPTS();

!         /* Since we used no scan keys, should never need to recheck */
!         if (scan->xs_recheck)
!             elog(ERROR, "CLUSTER does not support lossy index conditions");

!         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);

          switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
                                           scan->xs_cbuf))
--- 830,885 ----
      scan = index_beginscan(OldHeap, OldIndex,
                             SnapshotAny, 0, (ScanKey) NULL);

!     tuple = NULL;
!     for(;;)
      {
          HeapTuple    copiedTuple;
          bool        isdead;
          int            i;
+         bool        found;

          CHECK_FOR_INTERRUPTS();

!         /* Advance to next index tuple if we're done with current HOT chain */
!         if (tuple == NULL)
!         {
!             if (!index_next(scan, ForwardScanDirection))
!                 break;
!
!             /* Since we used no scan keys, should never need to recheck */
!             if (scan->xs_recheck)
!                 elog(ERROR, "CLUSTER does not support lossy index conditions");
!
!             tid = scan->xs_ctup.t_self;

!             scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf, OldHeap,
!                                                  ItemPointerGetBlockNumber(&tid));
!             /* Obtain share-lock on the buffer so we can examine visibility */
!             LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!             found = heap_hot_search_buffer(&tid, scan->xs_cbuf,
!                                            scan->xs_snapshot, &scan->xs_ctup,
!                                            &scan->kill_prior_tuple, true);
!             if (!found)
!             {
!                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!                 break;
!             }
!         }
!         else
!         {
!             /* Fetch next tuple from current HOT chain */
!             LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!             found = heap_hot_search_buffer(&tid, scan->xs_cbuf,
!                                            scan->xs_snapshot, &scan->xs_ctup,
!                                            &scan->kill_prior_tuple, false);
!             if (!found)
!             {
!                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!                 tuple = NULL;
!                 continue;
!             }
!         }
!         tuple = &scan->xs_ctup;

          switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
                                           scan->xs_cbuf))
*** a/src/backend/executor/nodeBitmapHeapscan.c
--- b/src/backend/executor/nodeBitmapHeapscan.c
***************
*** 382,390 **** bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
          {
              OffsetNumber offnum = tbmres->offsets[curslot];
              ItemPointerData tid;

              ItemPointerSet(&tid, page, offnum);
!             if (heap_hot_search_buffer(&tid, buffer, snapshot, NULL))
                  scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
          }
      }
--- 382,392 ----
          {
              OffsetNumber offnum = tbmres->offsets[curslot];
              ItemPointerData tid;
+             HeapTupleData heapTuple;

              ItemPointerSet(&tid, page, offnum);
!             if (heap_hot_search_buffer(&tid, buffer, snapshot, &heapTuple,
!                                        NULL, true))
                  scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
          }
      }
*** a/src/include/access/genam.h
--- b/src/include/access/genam.h
***************
*** 145,150 **** extern void index_endscan(IndexScanDesc scan);
--- 145,153 ----
  extern void index_markpos(IndexScanDesc scan);
  extern void index_restrpos(IndexScanDesc scan);
  extern HeapTuple index_getnext(IndexScanDesc scan, ScanDirection direction);
+ extern bool index_next(IndexScanDesc scan, ScanDirection direction);
+ extern HeapTuple index_fetch(IndexScanDesc scan);
+
  extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);

  extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 83,89 **** extern bool heap_fetch(Relation relation, Snapshot snapshot,
             HeapTuple tuple, Buffer *userbuf, bool keep_buf,
             Relation stats_relation);
  extern bool heap_hot_search_buffer(ItemPointer tid, Buffer buffer,
!                        Snapshot snapshot, bool *all_dead);
  extern bool heap_hot_search(ItemPointer tid, Relation relation,
                  Snapshot snapshot, bool *all_dead);

--- 83,90 ----
             HeapTuple tuple, Buffer *userbuf, bool keep_buf,
             Relation stats_relation);
  extern bool heap_hot_search_buffer(ItemPointer tid, Buffer buffer,
!                        Snapshot snapshot, HeapTuple tuple,
!                        bool *all_dead, bool firstCall);
  extern bool heap_hot_search(ItemPointer tid, Relation relation,
                  Snapshot snapshot, bool *all_dead);

*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 77,87 **** typedef struct IndexScanDescData
      Buffer        xs_cbuf;        /* current heap buffer in scan, if any */
      /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
      bool        xs_recheck;        /* T means scan keys must be rechecked */
-
-     /* state data for traversing HOT chains in index_getnext */
-     bool        xs_hot_dead;    /* T if all members of HOT chain are dead */
-     OffsetNumber xs_next_hot;    /* next member of HOT chain, if any */
-     TransactionId xs_prev_xmax; /* previous HOT chain member's XMAX, if any */
  } IndexScanDescData;

  /* Struct for heap-or-index scans of system tables */
--- 77,82 ----
*** a/src/backend/access/index/genam.c
--- b/src/backend/access/index/genam.c
***************
*** 93,104 **** RelationGetIndexScan(Relation indexRelation,
--- 93,106 ----

      scan->kill_prior_tuple = false;
      scan->ignore_killed_tuples = true;    /* default setting */
+     scan->needIndexTuple = false;

      scan->opaque = NULL;

      ItemPointerSetInvalid(&scan->xs_ctup.t_self);
      scan->xs_ctup.t_data = NULL;
      scan->xs_cbuf = InvalidBuffer;
+     scan->xs_itup = NULL;

      /*
       * Let the AM fill in the key and any opaque data it wants.
*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
***************
*** 398,404 **** index_restrpos(IndexScanDesc scan)
   *
   * Advances to the next index tuple satisfying the scan keys, returning TRUE
   * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.self.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
--- 398,406 ----
   *
   * Advances to the next index tuple satisfying the scan keys, returning TRUE
   * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.t_self. If scan->needIndexTuple is TRUE and
!  * the indexam succeeded in fetching the index tuple, it is stored in
!  * scan->xs_itup.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 281,286 **** btgettuple(PG_FUNCTION_ARGS)
--- 281,298 ----
      else
          res = _bt_first(scan, dir);

+     /* Fetch the index tuple if needed */
+     if (scan->needIndexTuple)
+     {
+         if (scan->xs_itup != NULL)
+         {
+             pfree(scan->xs_itup);
+             scan->xs_itup = NULL;
+         }
+         if (res)
+             scan->xs_itup = _bt_getindextuple(scan);
+     }
+
      PG_RETURN_BOOL(res);
  }

*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 964,969 **** _bt_next(IndexScanDesc scan, ScanDirection dir)
--- 964,1031 ----
  }

  /*
+  * _bt_getindextuple - fetch index tuple at current position.
+  *
+  * The caller must have pin on so->currPos.buf.
+  *
+  * The tuple returned is a palloc'd copy. This can fail to find the tuple if
+  * new tuples have been inserted into the page since we read its items;
+  * NULL is returned in that case.
+  */
+ IndexTuple
+ _bt_getindextuple(IndexScanDesc scan)
+ {
+     BTScanOpaque so = (BTScanOpaque) scan->opaque;
+     Page        page;
+     BTPageOpaque opaque;
+     OffsetNumber minoff;
+     OffsetNumber maxoff;
+     IndexTuple    ituple, result;
+     int itemIndex;
+     BTScanPosItem *positem;
+     OffsetNumber offnum;
+
+     Assert(BufferIsValid(so->currPos.buf));
+
+     LockBuffer(so->currPos.buf, BT_READ);
+
+     page = BufferGetPage(so->currPos.buf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+     minoff = P_FIRSTDATAKEY(opaque);
+     maxoff = PageGetMaxOffsetNumber(page);
+
+     itemIndex = so->currPos.itemIndex;
+     positem = &so->currPos.items[itemIndex];
+     offnum = positem->indexOffset;
+
+     Assert(itemIndex >= so->currPos.firstItem &&
+            itemIndex <= so->currPos.lastItem);
+     if (offnum < minoff)
+     {
+         /* pure paranoia */
+         LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+         return NULL;
+     }
+
+     ituple = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+     if (ItemPointerEquals(&ituple->t_tid, &scan->xs_ctup.t_self))
+     {
+         /* found the item */
+         Size itupsz = IndexTupleSize(ituple);
+
+         result = palloc(itupsz);
+         memcpy(result, ituple, itupsz);
+     }
+     else
+         result = NULL;
+
+     LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
+     return result;
+ }
+
+ /*
   *    _bt_readpage() -- Load data from current index page into so->currPos
   *
   * Caller must have pinned and read-locked so->currPos.buf; the buffer's state
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***************
*** 951,956 **** ExplainNode(Plan *plan, PlanState *planstate,
--- 951,958 ----
          case T_IndexScan:
              show_scan_qual(((IndexScan *) plan)->indexqualorig,
                             "Index Cond", plan, outer_plan, es);
+             show_scan_qual(((IndexScan *) plan)->indexonlyqualorig,
+                            "Index-only Filter", plan, outer_plan, es);
              show_scan_qual(plan->qual, "Filter", plan, outer_plan, es);
              break;
          case T_BitmapIndexScan:
*** a/src/backend/executor/execTuples.c
--- b/src/backend/executor/execTuples.c
***************
*** 92,97 ****
--- 92,98 ----
  #include "postgres.h"

  #include "funcapi.h"
+ #include "access/itup.h"
  #include "catalog/pg_type.h"
  #include "nodes/nodeFuncs.h"
  #include "storage/bufmgr.h"
***************
*** 578,583 **** ExecStoreVirtualTuple(TupleTableSlot *slot)
--- 579,625 ----
      return slot;
  }

+ TupleTableSlot *
+ ExecStoreIndexTuple(IndexTuple tuple, TupleTableSlot *slot, bool shouldFree)
+ {
+     int attnum;
+
+     /*
+      * sanity checks
+      */
+     Assert(tuple != NULL);
+     Assert(slot != NULL);
+     Assert(slot->tts_tupleDescriptor != NULL);
+
+     /*
+      * Free any old physical tuple belonging to the slot.
+      */
+     if (slot->tts_shouldFree)
+         heap_freetuple(slot->tts_tuple);
+     if (slot->tts_shouldFreeMin)
+         heap_free_minimal_tuple(slot->tts_mintuple);
+
+     /*
+      * Store the new tuple into the specified slot.
+      */
+     slot->tts_isempty = false;
+     slot->tts_shouldFree = shouldFree;
+     slot->tts_shouldFreeMin = false;
+     slot->tts_tuple = NULL;
+     slot->tts_mintuple = NULL;
+
+     for (attnum = 1; attnum <= slot->tts_tupleDescriptor->natts; attnum++)
+     {
+         slot->tts_values[attnum - 1] =
+             index_getattr(tuple, attnum, slot->tts_tupleDescriptor,
+                           &slot->tts_isnull[attnum - 1]);
+     }
+
+     slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+
+     return slot;
+ }
+
  /* --------------------------------
   *        ExecStoreAllNullTuple
   *            Set up the slot to contain a null in every column.
*** a/src/backend/executor/nodeIndexscan.c
--- b/src/backend/executor/nodeIndexscan.c
***************
*** 55,60 **** IndexNext(IndexScanState *node)
--- 55,61 ----
      Index        scanrelid;
      HeapTuple    tuple;
      TupleTableSlot *slot;
+     IndexScan *plan = (IndexScan *) node->ss.ps.plan;

      /*
       * extract necessary information from index scan node
***************
*** 62,68 **** IndexNext(IndexScanState *node)
      estate = node->ss.ps.state;
      direction = estate->es_direction;
      /* flip direction if this is an overall backward scan */
!     if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
      {
          if (ScanDirectionIsForward(direction))
              direction = BackwardScanDirection;
--- 63,69 ----
      estate = node->ss.ps.state;
      direction = estate->es_direction;
      /* flip direction if this is an overall backward scan */
!     if (ScanDirectionIsBackward(plan->indexorderdir))
      {
          if (ScanDirectionIsForward(direction))
              direction = BackwardScanDirection;
***************
*** 72,78 **** IndexNext(IndexScanState *node)
      scandesc = node->iss_ScanDesc;
      econtext = node->ss.ps.ps_ExprContext;
      slot = node->ss.ss_ScanTupleSlot;
!     scanrelid = ((IndexScan *) node->ss.ps.plan)->scan.scanrelid;

      /*
       * Check if we are evaluating PlanQual for tuple of this relation.
--- 73,79 ----
      scandesc = node->iss_ScanDesc;
      econtext = node->ss.ps.ps_ExprContext;
      slot = node->ss.ss_ScanTupleSlot;
!     scanrelid = plan->scan.scanrelid;

      /*
       * Check if we are evaluating PlanQual for tuple of this relation.
***************
*** 94,99 **** IndexNext(IndexScanState *node)
--- 95,103 ----

          ResetExprContext(econtext);

+         if (!ExecQual(node->iss_indexonlyqualorig, econtext, false))
+             ExecClearTuple(slot);        /* would not be returned by scan */
+
          if (!ExecQual(node->indexqualorig, econtext, false))
              ExecClearTuple(slot);        /* would not be returned by scan */

***************
*** 106,116 **** IndexNext(IndexScanState *node)
      /*
       * ok, now that we have what we need, fetch the next tuple.
       */
!     while ((tuple = index_getnext(scandesc, direction)) != NULL)
      {
          /*
           * Store the scanned tuple in the scan tuple slot of the scan state.
!          * Note: we pass 'false' because tuples returned by amgetnext are
           * pointers onto disk pages and must not be pfree()'d.
           */
          ExecStoreTuple(tuple,    /* tuple to store */
--- 110,142 ----
      /*
       * ok, now that we have what we need, fetch the next tuple.
       */
!     while (index_next(scandesc, direction))
      {
+         bool indextuple_fetched = false;
+
+         /*
+          * Before we fetch the heap tuple, see if we can evaluate quals
+          * using the index tuple.
+          */
+         if (node->iss_indexonlyqual && scandesc->xs_itup != NULL)
+         {
+             indextuple_fetched = true;
+
+             ExecStoreIndexTuple(scandesc->xs_itup,
+                                 node->iss_IndexTupleSlot, false);
+
+             econtext->ecxt_scantuple = node->iss_IndexTupleSlot;
+             if (!ExecQual(node->iss_indexonlyqual, econtext, false))
+                 continue;
+         }
+
+         /* Fetch tuple from heap */
+         if ((tuple = index_fetch(scandesc)) == NULL)
+             continue;
+
          /*
           * Store the scanned tuple in the scan tuple slot of the scan state.
!          * Note: we pass 'false' because tuples returned by index_fetch are
           * pointers onto disk pages and must not be pfree()'d.
           */
          ExecStoreTuple(tuple,    /* tuple to store */
***************
*** 118,123 **** IndexNext(IndexScanState *node)
--- 144,158 ----
                         scandesc->xs_cbuf,        /* buffer containing tuple */
                         false);    /* don't pfree */

+         econtext->ecxt_scantuple = slot;
+
+         /*
+          * check index-only quals if we didn't check them already.
+          */
+         if (node->iss_indexonlyqual && !indextuple_fetched &&
+             !ExecQual(node->iss_indexonlyqualorig, econtext, false))
+             continue;
+
          /*
           * If the index was lossy, we have to recheck the index quals using
           * the real tuple.
***************
*** 431,436 **** ExecEndIndexScan(IndexScanState *node)
--- 466,472 ----
       */
      ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
      ExecClearTuple(node->ss.ss_ScanTupleSlot);
+     ExecClearTuple(node->iss_IndexTupleSlot);

      /*
       * close the index relation (no-op if we didn't open it)
***************
*** 519,531 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
          ExecInitExpr((Expr *) node->indexqualorig,
                       (PlanState *) indexstate);

! #define INDEXSCAN_NSLOTS 2

      /*
       * tuple table initialization
       */
      ExecInitResultTupleSlot(estate, &indexstate->ss.ps);
      ExecInitScanTupleSlot(estate, &indexstate->ss);

      /*
       * open the base relation and acquire appropriate lock on it.
--- 555,576 ----
          ExecInitExpr((Expr *) node->indexqualorig,
                       (PlanState *) indexstate);

!     indexstate->iss_indexonlyqual = (List *)
!         ExecInitExpr((Expr *) node->indexonlyqual,
!                      (PlanState *) indexstate);
!
!     indexstate->iss_indexonlyqualorig = (List *)
!         ExecInitExpr((Expr *) node->indexonlyqualorig,
!                      (PlanState *) indexstate);
!
! #define INDEXSCAN_NSLOTS 3

      /*
       * tuple table initialization
       */
      ExecInitResultTupleSlot(estate, &indexstate->ss.ps);
      ExecInitScanTupleSlot(estate, &indexstate->ss);
+     indexstate->iss_IndexTupleSlot = ExecAllocTableSlot(estate->es_tupleTable);

      /*
       * open the base relation and acquire appropriate lock on it.
***************
*** 566,571 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
--- 611,623 ----
                                       relistarget ? NoLock : AccessShareLock);

      /*
+      * Initialize slot to hold index tuple
+      */
+     if (node->indexonlyqual != NIL)
+         ExecSetSlotDescriptor(indexstate->iss_IndexTupleSlot,
+                               RelationGetDescr(indexstate->iss_RelationDesc));
+
+     /*
       * Initialize index-specific scan state
       */
      indexstate->iss_RuntimeKeysReady = false;
***************
*** 611,616 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
--- 663,670 ----
                                                 estate->es_snapshot,
                                                 indexstate->iss_NumScanKeys,
                                                 indexstate->iss_ScanKeys);
+     if (node->indexonlyqual != NIL)
+         indexstate->iss_ScanDesc->needIndexTuple = true;

      /*
       * all done.
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 313,318 **** _copyIndexScan(IndexScan *from)
--- 313,320 ----
      COPY_NODE_FIELD(indexqual);
      COPY_NODE_FIELD(indexqualorig);
      COPY_SCALAR_FIELD(indexorderdir);
+     COPY_NODE_FIELD(indexonlyqual);
+     COPY_NODE_FIELD(indexonlyqualorig);

      return newnode;
  }
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 399,404 **** _outIndexScan(StringInfo str, IndexScan *node)
--- 399,406 ----
      WRITE_NODE_FIELD(indexqual);
      WRITE_NODE_FIELD(indexqualorig);
      WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+     WRITE_NODE_FIELD(indexonlyqual);
+     WRITE_NODE_FIELD(indexonlyqualorig);
  }

  static void
***************
*** 1330,1335 **** _outIndexPath(StringInfo str, IndexPath *node)
--- 1332,1338 ----
      WRITE_ENUM_FIELD(indexscandir, ScanDirection);
      WRITE_FLOAT_FIELD(indextotalcost, "%.2f");
      WRITE_FLOAT_FIELD(indexselectivity, "%.4f");
+     WRITE_INT_FIELD(scantype);
      WRITE_FLOAT_FIELD(rows, "%.0f");
  }

*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 192,197 **** cost_seqscan(Path *path, PlannerInfo *root,
--- 192,199 ----
   *
   * 'index' is the index to be used
   * 'indexQuals' is the list of applicable qual clauses (implicit AND semantics)
+  * 'indexOnlyQuals' is the list of qual clauses that are applied to tuples
+  *        fetched from index, before fetching the heap tuple
   * 'outer_rel' is the outer relation when we are considering using the index
   *        scan as the inside of a nestloop join (hence, some of the indexQuals
   *        are join clauses, and we should expect repeated scans of the index);
***************
*** 213,218 **** void
--- 215,221 ----
  cost_index(IndexPath *path, PlannerInfo *root,
             IndexOptInfo *index,
             List *indexQuals,
+            List *indexOnlyQuals,
             RelOptInfo *outer_rel)
  {
      RelOptInfo *baserel = index->rel;
***************
*** 221,226 **** cost_index(IndexPath *path, PlannerInfo *root,
--- 224,230 ----
      Cost        indexStartupCost;
      Cost        indexTotalCost;
      Selectivity indexSelectivity;
+     Selectivity indexOnlyQualSelectivity;
      double        indexCorrelation,
                  csquared;
      Cost        min_IO_cost,
***************
*** 254,259 **** cost_index(IndexPath *path, PlannerInfo *root,
--- 258,266 ----
                       PointerGetDatum(&indexSelectivity),
                       PointerGetDatum(&indexCorrelation));

+     indexOnlyQualSelectivity =
+         clauselist_selectivity(root, indexOnlyQuals, 0, JOIN_INNER, NULL);
+
      /*
       * Save amcostestimate's results for possible use in bitmap scan planning.
       * We don't bother to save indexStartupCost or indexCorrelation, because a
***************
*** 267,273 **** cost_index(IndexPath *path, PlannerInfo *root,
      run_cost += indexTotalCost - indexStartupCost;

      /* estimate number of main-table tuples fetched */
!     tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);

      /*----------
       * Estimate number of main-table pages fetched, and compute I/O cost.
--- 274,281 ----
      run_cost += indexTotalCost - indexStartupCost;

      /* estimate number of main-table tuples fetched */
!     tuples_fetched = clamp_row_est(
!         indexOnlyQualSelectivity * indexSelectivity * baserel->tuples);

      /*----------
       * Estimate number of main-table pages fetched, and compute I/O cost.
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
***************
*** 27,32 ****
--- 27,33 ----
  #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
+ #include "optimizer/planmain.h"
  #include "optimizer/predtest.h"
  #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
***************
*** 45,59 ****
  #define IsBooleanOpfamily(opfamily) \
      ((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID)

-
- /* Whether we are looking for plain indexscan, bitmap scan, or either */
- typedef enum
- {
-     ST_INDEXSCAN,                /* must support amgettuple */
-     ST_BITMAPSCAN,                /* must support amgetbitmap */
-     ST_ANYSCAN                    /* either is okay */
- } ScanTypeControl;
-
  /* Per-path data used within choose_bitmap_and() */
  typedef struct
  {
--- 46,51 ----
***************
*** 67,73 **** typedef struct
  static List *find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
                      List *clauses, List *outer_clauses,
                      bool istoplevel, RelOptInfo *outer_rel,
!                     SaOpControl saop_control, ScanTypeControl scantype);
  static List *find_saop_paths(PlannerInfo *root, RelOptInfo *rel,
                  List *clauses, List *outer_clauses,
                  bool istoplevel, RelOptInfo *outer_rel);
--- 59,65 ----
  static List *find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
                      List *clauses, List *outer_clauses,
                      bool istoplevel, RelOptInfo *outer_rel,
!                     SaOpControl saop_control, ScanType scantype);
  static List *find_saop_paths(PlannerInfo *root, RelOptInfo *rel,
                  List *clauses, List *outer_clauses,
                  bool istoplevel, RelOptInfo *outer_rel);
***************
*** 194,203 **** create_index_paths(PlannerInfo *root, RelOptInfo *rel)
      {
          IndexPath  *ipath = (IndexPath *) lfirst(l);

!         if (ipath->indexinfo->amhasgettuple)
              add_path(rel, (Path *) ipath);

!         if (ipath->indexinfo->amhasgetbitmap &&
              ipath->indexselectivity < 1.0 &&
              !ScanDirectionIsBackward(ipath->indexscandir))
              bitindexpaths = lappend(bitindexpaths, ipath);
--- 186,195 ----
      {
          IndexPath  *ipath = (IndexPath *) lfirst(l);

!         if (ipath->scantype & ST_INDEXSCAN)
              add_path(rel, (Path *) ipath);

!         if ((ipath->scantype & ST_BITMAPSCAN) &&
              ipath->indexselectivity < 1.0 &&
              !ScanDirectionIsBackward(ipath->indexscandir))
              bitindexpaths = lappend(bitindexpaths, ipath);
***************
*** 278,284 **** static List *
  find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
                      List *clauses, List *outer_clauses,
                      bool istoplevel, RelOptInfo *outer_rel,
!                     SaOpControl saop_control, ScanTypeControl scantype)
  {
      Relids        outer_relids = outer_rel ? outer_rel->relids : NULL;
      bool        possibly_useful_pathkeys = has_useful_pathkeys(root, rel);
--- 270,276 ----
  find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
                      List *clauses, List *outer_clauses,
                      bool istoplevel, RelOptInfo *outer_rel,
!                     SaOpControl saop_control, ScanType scantype)
  {
      Relids        outer_relids = outer_rel ? outer_rel->relids : NULL;
      bool        possibly_useful_pathkeys = has_useful_pathkeys(root, rel);
***************
*** 291,296 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
--- 283,289 ----
          IndexOptInfo *index = (IndexOptInfo *) lfirst(ilist);
          IndexPath  *ipath;
          List       *restrictclauses;
+         List       *indexonlyQuals;
          List       *index_pathkeys;
          List       *useful_pathkeys;
          bool        useful_predicate;
***************
*** 398,422 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
              useful_pathkeys = NIL;

          /*
!          * 3. Generate an indexscan path if there are relevant restriction
!          * clauses in the current clauses, OR the index ordering is
!          * potentially useful for later merging or final output ordering, OR
!          * the index has a predicate that was proven by the current clauses.
           */
!         if (found_clause || useful_pathkeys != NIL || useful_predicate)
          {
              ipath = create_index_path(root, index,
                                        restrictclauses,
                                        useful_pathkeys,
                                        index_is_ordered ?
                                        ForwardScanDirection :
                                        NoMovementScanDirection,
                                        outer_rel);
              result = lappend(result, ipath);
          }

          /*
!          * 4. If the index is ordered, a backwards scan might be interesting.
           * Again, this is only interesting at top level.
           */
          if (index_is_ordered && possibly_useful_pathkeys &&
--- 391,475 ----
              useful_pathkeys = NIL;

          /*
!          * 3. The index might be useful if data from it can be used to
!          * evaluate any restrictions not used as index keys.
           */
!         indexonlyQuals = NIL;
!         if ((scantype & ST_INDEXSCAN) && index->amhasgettuple)
!         {
!             ListCell *l, *l2;
!
!             foreach(l, clauses)
!             {
!                 RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
!                 Expr *indexonlyQual;
!                 bool indexed = false;
!
!                 /*
!                  * Only consider restrictions that are not handled by the
!                  * index
!                  */
!                 foreach(l2, restrictclauses)
!                 {
!                     if (list_member((List *) lfirst(l2), rinfo))
!                     {
!                         indexed = true;
!                         break;
!                     }
!                 }
!                 if (!indexed)
!                 {
!                     indexonlyQual = make_indexonly_expr(root, index,
!                                                         rinfo->clause, NULL);
!                     if (indexonlyQual != NULL)
!                         indexonlyQuals = lappend(indexonlyQuals, rinfo->clause);
!                 }
!             }
!         }
!
!         /*
!          * 4. Generate an indexscan path if
!          *
!          * - there are relevant restriction clauses in the current clauses, OR
!          * - the index ordering is potentially useful for later merging or
!          *   final output ordering, OR
!          * - the index has a predicate that was proven by the current
!          *   clauses, OR
!          * - data from the index can be used to evaluate restriction clauses
!          *   before accessing the heap.
!          */
!         if (found_clause || useful_pathkeys != NIL || useful_predicate ||
!             indexonlyQuals != NIL)
          {
              ipath = create_index_path(root, index,
                                        restrictclauses,
+                                       indexonlyQuals,
                                        useful_pathkeys,
                                        index_is_ordered ?
                                        ForwardScanDirection :
                                        NoMovementScanDirection,
                                        outer_rel);
+
+             /*
+              * If the only reason we're considering this path is index-only
+              * quals, this path is only usable for a regular index scan.
+              */
+             if (!found_clause && useful_pathkeys == NIL && !useful_predicate)
+                 ipath->scantype = ST_INDEXSCAN;
+             else
+             {
+                 ipath->scantype = 0;
+                 if (index->amhasgettuple)
+                     ipath->scantype |= ST_INDEXSCAN;
+                 if (index->amhasgetbitmap)
+                     ipath->scantype |= ST_BITMAPSCAN;
+             }
+
              result = lappend(result, ipath);
          }

          /*
!          * 5. If the index is ordered, a backwards scan might be interesting.
           * Again, this is only interesting at top level.
           */
          if (index_is_ordered && possibly_useful_pathkeys &&
***************
*** 430,435 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
--- 483,489 ----
              {
                  ipath = create_index_path(root, index,
                                            restrictclauses,
+                                           indexonlyQuals,
                                            useful_pathkeys,
                                            BackwardScanDirection,
                                            outer_rel);
***************
*** 1788,1797 **** best_inner_indexscan(PlannerInfo *root, RelOptInfo *rel,
      {
          IndexPath  *ipath = (IndexPath *) lfirst(l);

!         if (ipath->indexinfo->amhasgettuple)
              indexpaths = lappend(indexpaths, ipath);

!         if (ipath->indexinfo->amhasgetbitmap)
              bitindexpaths = lappend(bitindexpaths, ipath);
      }

--- 1842,1851 ----
      {
          IndexPath  *ipath = (IndexPath *) lfirst(l);

!         if (ipath->scantype & ST_INDEXSCAN)
              indexpaths = lappend(indexpaths, ipath);

!         if (ipath->scantype & ST_BITMAPSCAN)
              bitindexpaths = lappend(bitindexpaths, ipath);
      }

*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 871,885 **** create_indexscan_plan(PlannerInfo *root,
--- 871,902 ----
      Index        baserelid = best_path->path.parent->relid;
      Oid            indexoid = best_path->indexinfo->indexoid;
      List       *qpqual;
+     List       *indexonlyqual;
      List       *stripped_indexquals;
+     List       *indexonlyqualorig;
      List       *fixed_indexquals;
      ListCell   *l;
      IndexScan  *scan_plan;
+     RelOptInfo *reloptinfo;
+     IndexOptInfo *indexoptinfo;

      /* it should be a base rel... */
      Assert(baserelid > 0);
      Assert(best_path->path.parent->rtekind == RTE_RELATION);

+     /* Find the index that was used */
+     reloptinfo = root->simple_rel_array[baserelid];
+     foreach(l, reloptinfo->indexlist)
+     {
+         indexoptinfo = (IndexOptInfo *) lfirst(l);
+         if (indexoptinfo->indexoid == indexoid)
+             break;
+         else
+             indexoptinfo = NULL;
+     }
+     if (indexoptinfo == NULL)
+         elog(ERROR, "could not find IndexOptInfo for index scan");
+
      /*
       * Build "stripped" indexquals structure (no RestrictInfos) to pass to
       * executor as indexqualorig
***************
*** 928,936 **** create_indexscan_plan(PlannerInfo *root,
--- 945,957 ----
       * plan so that they'll be properly rechecked by EvalPlanQual testing.
       */
      qpqual = NIL;
+     indexonlyqual = NIL;
+     indexonlyqualorig = NIL;
      foreach(l, scan_clauses)
      {
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+         Expr *indexonlyclause;
+         bool is_indexonly;

          Assert(IsA(rinfo, RestrictInfo));
          if (rinfo->pseudoconstant)
***************
*** 952,958 **** create_indexscan_plan(PlannerInfo *root,
                          continue;
              }
          }
!         qpqual = lappend(qpqual, rinfo);
      }

      /* Sort clauses into best execution order */
--- 973,989 ----
                          continue;
              }
          }
!
!         indexonlyclause = make_indexonly_expr(root, indexoptinfo,
!                                               (Expr *) rinfo->clause,
!                                               &is_indexonly);
!         if (is_indexonly)
!         {
!             indexonlyqual = lappend(indexonlyqual, indexonlyclause);
!             indexonlyqualorig = lappend(indexonlyqualorig, rinfo->clause);
!         }
!         else
!             qpqual = lappend(qpqual, rinfo);
      }

      /* Sort clauses into best execution order */
***************
*** 969,974 **** create_indexscan_plan(PlannerInfo *root,
--- 1000,1007 ----
                                 fixed_indexquals,
                                 stripped_indexquals,
                                 best_path->indexscandir);
+     scan_plan->indexonlyqual = indexonlyqual;
+     scan_plan->indexonlyqualorig = indexonlyqualorig;

      copy_path_costsize(&scan_plan->scan.plan, &best_path->path);
      /* use the indexscan-specific rows estimate, not the parent rel's */
***************
*** 2198,2203 **** fix_indexqual_operand(Node *node, IndexOptInfo *index)
--- 2231,2416 ----
  }

  /*
+  * replace_heapvars_with_indexvars
+  *
+  * Replaces all Vars in an expression with Vars that refer to columns
+  * of an index. The resulting expression can be used in the executor to
+  * evaluate an expression using an IndexTuple from the index, instead
+  * of a heap tuple.
+  *
+  * Returns NULL and sets context->indexonly to FALSE if the expression can't
+  * be evaluated using columns from the index only.
+  */
+ typedef struct
+ {
+     IndexOptInfo *index;    /* index we're dealing with */
+     bool        indexonly;
+ } replace_heapvars_context;
+
+ static Node *
+ replace_heapvars_with_indexvars(Node *node, replace_heapvars_context *context)
+ {
+     /*
+      * We represent index keys by Var nodes having the varno of the base table
+      * but varattno equal to the index's attribute number (index column
+      * position).  This is a bit hokey ... would be cleaner to use a
+      * special-purpose node type that could not be mistaken for a regular Var.
+      * But it will do for now.
+      */
+     Var           *result;
+     int            pos;
+     IndexOptInfo *index = context->index;
+     ListCell   *indexpr_item;
+
+     if (node == NULL || !context->indexonly)
+         return NULL;
+
+     if (IsA(node, Var))
+     {
+         Var *var = (Var *) node;
+
+         if (var->varno == index->rel->relid)
+         {
+             /* Try to match against simple index columns */
+             int            varatt = var->varattno;
+
+             if (varatt != 0)
+             {
+                 for (pos = 0; pos < index->ncolumns; pos++)
+                 {
+                     /*
+                      * If the type stored in the index doesn't match that
+                      * in the heap, we can't use the value from the index.
+                      */
+                     if (index->atttype[pos] != var->vartype)
+                         continue;
+
+                     if (index->indexkeys[pos] == varatt)
+                     {
+
+                         result = (Var *) copyObject(node);
+                         result->varattno = pos + 1;
+
+                         return (Node *) result;
+                     }
+                 }
+                 /* Not in index. The heap tuple is needed for this qual. */
+                 context->indexonly = false;
+                 return NULL;
+             }
+             else
+             {
+                 /* TODO: we don't handle full-row references yet. */
+                 context->indexonly = false;
+                 return NULL;
+             }
+         }
+         else
+         {
+             /* XXX: Can this happen? */
+             context->indexonly = false;
+             return NULL;
+         }
+     }
+
+     /* Try to match against index expressions */
+     if (index->indexprs != NIL)
+     {
+         indexpr_item = list_head(index->indexprs);
+         for (pos = 0; pos < index->ncolumns; pos++)
+         {
+             if (index->indexkeys[pos] == 0)
+             {
+                 Node       *indexkey;
+                 Oid            indexkey_type;
+
+                 if (indexpr_item == NULL)
+                     elog(ERROR, "too few entries in indexprs list");
+
+                 indexkey = (Node *) lfirst(indexpr_item);
+                 if (indexkey && IsA(indexkey, RelabelType))
+                     indexkey = (Node *) ((RelabelType *) indexkey)->arg;
+
+                 /*
+                  * If the type stored in the index doesn't match that of
+                  * the expression, we can't use the value from the index.
+                  */
+                 indexkey_type = exprType(indexkey);
+                 if (index->atttype[pos] != indexkey_type)
+                     continue;
+
+                 if (equal(node, indexkey))
+                 {
+                     /* Found a match */
+                     result = makeVar(index->rel->relid, pos + 1,
+                                      exprType(lfirst(indexpr_item)), -1,
+                                      0);
+                     return (Node *) result;
+                 }
+                 indexpr_item = lnext(indexpr_item);
+             }
+         }
+     }
+
+     /* Not found */
+     return expression_tree_mutator(node, replace_heapvars_with_indexvars,
+                                    context);
+ }
+
+ /*
+  * make_indexonly_expr
+  *
+  * Given an expression, return an equivalent Expr tree that evaluates the
+  * expression using columns from an index. On success, *indexonly is set to
+  * TRUE. If the expression can't be evaluated using only index columns,
+  * *indexonly is set to FALSE and NULL is returned.
+  *
+  * This is similar to fix_indexonly_references, but we return NULL instead
+  * of throwing an error if the expression can't be evaluated using only index
+  * columns, and the input expression doesn't need to be in simple "A op B"
+  * format.
+  */
+ Expr *
+ make_indexonly_expr(PlannerInfo *root, IndexOptInfo *index,
+                     Expr *expr, bool *indexonly)
+ {
+     replace_heapvars_context context;
+
+     /*
+      * If the indexam can't return index tuples, we can forget about it.
+      */
+     if (!index->amregurgitate)
+     {
+         if (indexonly)
+             *indexonly = false;
+         return NULL;
+     }
+
+     /*
+      * Expressions containing volatile functions can't be used as index-only
+      * quals, because the expression would be executed on dead rows as well.
+      */
+     if (contain_volatile_functions((Node *) expr))
+     {
+         if (indexonly)
+             *indexonly = false;
+         return NULL;
+     }
+
+     context.index = index;
+     context.indexonly = true;
+
+     expr = (Expr *) replace_heapvars_with_indexvars((Node *) expr, &context);
+
+     if (indexonly)
+         *indexonly = context.indexonly;
+     if (!context.indexonly)
+         return NULL;
+     else
+         return expr;
+ }
+
+ /*
   * get_switched_clauses
   *      Given a list of merge or hash joinclauses (as RestrictInfo nodes),
   *      extract the bare clauses, and rearrange the elements within the
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 402,407 **** build_minmax_path(PlannerInfo *root, RelOptInfo *rel, MinMaxAggInfo *info)
--- 402,408 ----
          new_path = create_index_path(root, index,
                                       restrictclauses,
                                       NIL,
+                                      NIL,
                                       indexscandir,
                                       NULL);

*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 16,21 ****
--- 16,23 ----
  #include "postgres.h"

  #include "access/transam.h"
+ #include "catalog/pg_am.h"
+ #include "catalog/pg_opfamily.h"
  #include "catalog/pg_type.h"
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
***************
*** 274,279 **** set_plan_refs(PlannerGlobal *glob, Plan *plan, int rtoffset)
--- 276,285 ----
                      fix_scan_list(glob, splan->indexqual, rtoffset);
                  splan->indexqualorig =
                      fix_scan_list(glob, splan->indexqualorig, rtoffset);
+                 splan->indexonlyqual =
+                     fix_scan_list(glob, splan->indexonlyqual, rtoffset);
+                 splan->indexonlyqualorig =
+                     fix_scan_list(glob, splan->indexonlyqualorig, rtoffset);
              }
              break;
          case T_BitmapIndexScan:
***************
*** 925,936 **** set_inner_join_references(PlannerGlobal *glob, Plan *inner_plan,
           */
          IndexScan  *innerscan = (IndexScan *) inner_plan;
          List       *indexqualorig = innerscan->indexqualorig;

          /* No work needed if indexqual refers only to its own rel... */
          if (NumRelids((Node *) indexqualorig) > 1)
          {
-             Index        innerrel = innerscan->scan.scanrelid;
-
              /* only refs to outer vars get changed in the inner qual */
              innerscan->indexqualorig = fix_join_expr(glob,
                                                       indexqualorig,
--- 931,942 ----
           */
          IndexScan  *innerscan = (IndexScan *) inner_plan;
          List       *indexqualorig = innerscan->indexqualorig;
+         List       *indexonlyqualorig = innerscan->indexonlyqualorig;
+         Index        innerrel = innerscan->scan.scanrelid;

          /* No work needed if indexqual refers only to its own rel... */
          if (NumRelids((Node *) indexqualorig) > 1)
          {
              /* only refs to outer vars get changed in the inner qual */
              innerscan->indexqualorig = fix_join_expr(glob,
                                                       indexqualorig,
***************
*** 944,963 **** set_inner_join_references(PlannerGlobal *glob, Plan *inner_plan,
                                                   NULL,
                                                   innerrel,
                                                   0);

!             /*
!              * We must fix the inner qpqual too, if it has join clauses (this
!              * could happen if special operators are involved: some indexquals
!              * may get rechecked as qpquals).
!              */
!             if (NumRelids((Node *) inner_plan->qual) > 1)
!                 inner_plan->qual = fix_join_expr(glob,
!                                                  inner_plan->qual,
                                                   outer_itlist,
                                                   NULL,
                                                   innerrel,
                                                   0);
          }
      }
      else if (IsA(inner_plan, BitmapIndexScan))
      {
--- 950,987 ----
                                                   NULL,
                                                   innerrel,
                                                   0);
+         }

!         /* No work needed if indexqual refers only to its own rel... */
!         if (NumRelids((Node *) indexonlyqualorig) > 1)
!         {
!             /* only refs to outer vars get changed in the inner qual */
!             innerscan->indexonlyqualorig = fix_join_expr(glob,
!                                                      indexonlyqualorig,
!                                                      outer_itlist,
!                                                      NULL,
!                                                      innerrel,
!                                                      0);
!             innerscan->indexonlyqual = fix_join_expr(glob,
!                                                  innerscan->indexonlyqual,
                                                   outer_itlist,
                                                   NULL,
                                                   innerrel,
                                                   0);
          }
+
+         /*
+          * We must fix the inner qpqual too, if it has join clauses (this
+          * could happen if special operators are involved: some indexquals
+          * may get rechecked as qpquals).
+          */
+         if (NumRelids((Node *) inner_plan->qual) > 1)
+             inner_plan->qual = fix_join_expr(glob,
+                                              inner_plan->qual,
+                                              outer_itlist,
+                                              NULL,
+                                              innerrel,
+                                              0);
      }
      else if (IsA(inner_plan, BitmapIndexScan))
      {
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 1865,1874 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params)
          case T_IndexScan:
              finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
                                &context);

              /*
!              * we need not look at indexqualorig, since it will have the same
!              * param references as indexqual.
               */
              break;

--- 1865,1876 ----
          case T_IndexScan:
              finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
                                &context);
+             finalize_primnode((Node *) ((IndexScan *) plan)->indexonlyqual,
+                               &context);

              /*
!              * we need not look at indexqualorig/indexonlyqualorig, since they
!              * will have the same param references as indexqual/indexonlyqual.
               */
              break;

*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 414,419 **** create_seqscan_path(PlannerInfo *root, RelOptInfo *rel)
--- 414,421 ----
   * 'index' is a usable index.
   * 'clause_groups' is a list of lists of RestrictInfo nodes
   *            to be used as index qual conditions in the scan.
+  * 'indexonlyquals' is a list of expressions that can be evaluated using
+  *            data from the index.
   * 'pathkeys' describes the ordering of the path.
   * 'indexscandir' is ForwardScanDirection or BackwardScanDirection
   *            for an ordered index, or NoMovementScanDirection for
***************
*** 427,432 **** IndexPath *
--- 429,435 ----
  create_index_path(PlannerInfo *root,
                    IndexOptInfo *index,
                    List *clause_groups,
+                   List *indexonlyquals,
                    List *pathkeys,
                    ScanDirection indexscandir,
                    RelOptInfo *outer_rel)
***************
*** 504,510 **** create_index_path(PlannerInfo *root,
          pathnode->rows = rel->rows;
      }

!     cost_index(pathnode, root, index, indexquals, outer_rel);

      return pathnode;
  }
--- 507,513 ----
          pathnode->rows = rel->rows;
      }

!     cost_index(pathnode, root, index, indexquals, indexonlyquals, outer_rel);

      return pathnode;
  }
*** a/src/backend/optimizer/util/plancat.c
--- b/src/backend/optimizer/util/plancat.c
***************
*** 193,201 **** get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
               * We pre-zero the ordering info in case the index is unordered.
               */
              info->indexkeys = (int *) palloc(sizeof(int) * ncolumns);
!             info->opfamily = (Oid *) palloc0(sizeof(Oid) * (4 * ncolumns + 1));
              info->opcintype = info->opfamily + (ncolumns + 1);
!             info->fwdsortop = info->opcintype + ncolumns;
              info->revsortop = info->fwdsortop + ncolumns;
              info->nulls_first = (bool *) palloc0(sizeof(bool) * ncolumns);

--- 193,202 ----
               * We pre-zero the ordering info in case the index is unordered.
               */
              info->indexkeys = (int *) palloc(sizeof(int) * ncolumns);
!             info->opfamily = (Oid *) palloc0(sizeof(Oid) * (5 * ncolumns + 1));
              info->opcintype = info->opfamily + (ncolumns + 1);
!             info->atttype = info->opcintype + ncolumns;
!             info->fwdsortop = info->atttype + ncolumns;
              info->revsortop = info->fwdsortop + ncolumns;
              info->nulls_first = (bool *) palloc0(sizeof(bool) * ncolumns);

***************
*** 204,209 **** get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
--- 205,211 ----
                  info->indexkeys[i] = index->indkey.values[i];
                  info->opfamily[i] = indexRelation->rd_opfamily[i];
                  info->opcintype[i] = indexRelation->rd_opcintype[i];
+                 info->atttype[i] = indexRelation->rd_att->attrs[i]->atttypid;
              }

              info->relam = indexRelation->rd_rel->relam;
***************
*** 212,217 **** get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
--- 214,220 ----
              info->amsearchnulls = indexRelation->rd_am->amsearchnulls;
              info->amhasgettuple = OidIsValid(indexRelation->rd_am->amgettuple);
              info->amhasgetbitmap = OidIsValid(indexRelation->rd_am->amgetbitmap);
+             info->amregurgitate = indexRelation->rd_am->amregurgitate;

              /*
               * Fetch the ordering operators associated with the index, if any.
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 556,561 **** extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
--- 556,562 ----
  extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
  extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
  extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
+ extern IndexTuple _bt_getindextuple(IndexScanDesc scan);

  /*
   * prototypes for functions in nbtutils.c
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 16,21 ****
--- 16,22 ----

  #include "access/genam.h"
  #include "access/heapam.h"
+ #include "access/itup.h"


  typedef struct HeapScanDescData
***************
*** 64,69 **** typedef struct IndexScanDescData
--- 65,71 ----
      Snapshot    xs_snapshot;    /* snapshot to see */
      int            numberOfKeys;    /* number of scan keys */
      ScanKey        keyData;        /* array of scan key descriptors */
+     bool        needIndexTuple;    /* should scan store index tuple in xs_itup? */

      /* signaling to index AM about killing index tuples */
      bool        kill_prior_tuple;        /* last-returned tuple is dead */
***************
*** 77,82 **** typedef struct IndexScanDescData
--- 79,86 ----
      Buffer        xs_cbuf;        /* current heap buffer in scan, if any */
      /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
      bool        xs_recheck;        /* T means scan keys must be rechecked */
+
+     IndexTuple    xs_itup;        /* current index tuple, if any */
  } IndexScanDescData;

  /* Struct for heap-or-index scans of system tables */
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 49,54 **** CATALOG(pg_am,2601)
--- 49,55 ----
      bool        amsearchnulls;    /* can AM search for NULL index entries? */
      bool        amstorage;        /* can storage type differ from column type? */
      bool        amclusterable;    /* does AM support cluster command? */
+     bool        amregurgitate;    /* can AM return tuples back from index? */
      Oid            amkeytype;        /* type of data in index, or InvalidOid */
      regproc        aminsert;        /* "insert this tuple" function */
      regproc        ambeginscan;    /* "start new scan" function */
***************
*** 76,82 **** typedef FormData_pg_am *Form_pg_am;
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        26
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
--- 77,83 ----
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        27
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
***************
*** 89,124 **** typedef FormData_pg_am *Form_pg_am;
  #define Anum_pg_am_amsearchnulls        10
  #define Anum_pg_am_amstorage            11
  #define Anum_pg_am_amclusterable        12
! #define Anum_pg_am_amkeytype            13
! #define Anum_pg_am_aminsert                14
! #define Anum_pg_am_ambeginscan            15
! #define Anum_pg_am_amgettuple            16
! #define Anum_pg_am_amgetbitmap            17
! #define Anum_pg_am_amrescan                18
! #define Anum_pg_am_amendscan            19
! #define Anum_pg_am_ammarkpos            20
! #define Anum_pg_am_amrestrpos            21
! #define Anum_pg_am_ambuild                22
! #define Anum_pg_am_ambulkdelete            23
! #define Anum_pg_am_amvacuumcleanup        24
! #define Anum_pg_am_amcostestimate        25
! #define Anum_pg_am_amoptions            26

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 t t t t t t t f t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 f t f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    0 7 f f f t t t t t t 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    0 5 f f f t t f f t f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

--- 90,126 ----
  #define Anum_pg_am_amsearchnulls        10
  #define Anum_pg_am_amstorage            11
  #define Anum_pg_am_amclusterable        12
! #define Anum_pg_am_amregurgitate        13
! #define Anum_pg_am_amkeytype            14
! #define Anum_pg_am_aminsert                15
! #define Anum_pg_am_ambeginscan            16
! #define Anum_pg_am_amgettuple            17
! #define Anum_pg_am_amgetbitmap            18
! #define Anum_pg_am_amrescan                19
! #define Anum_pg_am_amendscan            20
! #define Anum_pg_am_ammarkpos            21
! #define Anum_pg_am_amrestrpos            22
! #define Anum_pg_am_ambuild                23
! #define Anum_pg_am_ambulkdelete            24
! #define Anum_pg_am_amvacuumcleanup        25
! #define Anum_pg_am_amcostestimate        26
! #define Anum_pg_am_amoptions            27

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 t t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    0 7 f f f t t t t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    0 5 f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

*** a/src/include/executor/tuptable.h
--- b/src/include/executor/tuptable.h
***************
*** 15,20 ****
--- 15,21 ----
  #define TUPTABLE_H

  #include "access/htup.h"
+ #include "access/itup.h"
  #include "access/tupdesc.h"
  #include "storage/buf.h"

***************
*** 165,170 **** extern TupleTableSlot *ExecStoreTuple(HeapTuple tuple,
--- 166,174 ----
  extern TupleTableSlot *ExecStoreMinimalTuple(MinimalTuple mtup,
                        TupleTableSlot *slot,
                        bool shouldFree);
+ extern TupleTableSlot *ExecStoreIndexTuple(IndexTuple itup,
+                       TupleTableSlot *slot,
+                       bool shouldFree);
  extern TupleTableSlot *ExecClearTuple(TupleTableSlot *slot);
  extern TupleTableSlot *ExecStoreVirtualTuple(TupleTableSlot *slot);
  extern TupleTableSlot *ExecStoreAllNullTuple(TupleTableSlot *slot);
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1108,1113 **** typedef struct
--- 1108,1115 ----
   *        RuntimeContext       expr context for evaling runtime Skeys
   *        RelationDesc       index relation descriptor
   *        ScanDesc           index scan descriptor
+  *        indexonlyqual       execution state for indexonlyqual expressions
+  *        indexonlyqualorig  execution state for indexonlyqualorig expressions
   * ----------------
   */
  typedef struct IndexScanState
***************
*** 1122,1127 **** typedef struct IndexScanState
--- 1124,1133 ----
      ExprContext *iss_RuntimeContext;
      Relation    iss_RelationDesc;
      IndexScanDesc iss_ScanDesc;
+
+     TupleTableSlot *iss_IndexTupleSlot;    /* slot for storing index tuple */
+     List           *iss_indexonlyqual;
+     List           *iss_indexonlyqualorig;
  } IndexScanState;

  /* ----------------
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 264,269 **** typedef Scan SeqScan;
--- 264,274 ----
   * table).    This is a bit hokey ... would be cleaner to use a special-purpose
   * node type that could not be mistaken for a regular Var.    But it will do
   * for now.
+  *
+  * indexonlyqual and indexonlyqualorig represent expressions that can't be
+  * used as index keys, but can be evaluated without fetching the heap tuple,
+  * using data from the index. The index must be able to regurgitate index
+  * tuples stored in it.
   * ----------------
   */
  typedef struct IndexScan
***************
*** 273,278 **** typedef struct IndexScan
--- 278,285 ----
      List       *indexqual;        /* list of index quals (OpExprs) */
      List       *indexqualorig;    /* the same in original form */
      ScanDirection indexorderdir;    /* forward or backward or don't care */
+     List       *indexonlyqual;    /* quals that can be satisfied from index */
+     List       *indexonlyqualorig;    /* same in original form */
  } IndexScan;

  /* ----------------
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 430,435 **** typedef struct IndexOptInfo
--- 430,436 ----
      Oid           *opfamily;        /* OIDs of operator families for columns */
      int           *indexkeys;        /* column numbers of index's keys, or 0 */
      Oid           *opcintype;        /* OIDs of opclass declared input data types */
+     Oid           *atttype;        /* OIDs of stored key data types */
      Oid           *fwdsortop;        /* OIDs of sort operators for each column */
      Oid           *revsortop;        /* OIDs of sort operators for backward scan */
      bool       *nulls_first;    /* do NULLs come first in the sort order? */
***************
*** 446,451 **** typedef struct IndexOptInfo
--- 447,453 ----
      bool        amsearchnulls;    /* can AM search for NULL index entries? */
      bool        amhasgettuple;    /* does AM have amgettuple interface? */
      bool        amhasgetbitmap; /* does AM have amgetbitmap interface? */
+     bool        amregurgitate;    /* can AM return tuples from index? */
  } IndexOptInfo;


***************
*** 631,636 **** typedef struct Path
--- 633,648 ----
   * rel's restrict clauses alone would do.
   *----------
   */
+
+ /*
+  * ScanType is a bitmask indicating what kinds of scans the path can be
+  * used for.
+  */
+ #define ST_INDEXSCAN    1
+ #define ST_BITMAPSCAN    2
+ #define ST_ANYSCAN        (ST_INDEXSCAN | ST_BITMAPSCAN)
+ typedef int ScanType;
+
  typedef struct IndexPath
  {
      Path        path;
***************
*** 641,646 **** typedef struct IndexPath
--- 653,659 ----
      ScanDirection indexscandir;
      Cost        indextotalcost;
      Selectivity indexselectivity;
+     ScanType    scantype;
      double        rows;            /* estimated number of result tuples */
  } IndexPath;

*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 66,72 **** extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
                      double index_pages, PlannerInfo *root);
  extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_index(IndexPath *path, PlannerInfo *root, IndexOptInfo *index,
!            List *indexQuals, RelOptInfo *outer_rel);
  extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                        Path *bitmapqual, RelOptInfo *outer_rel);
  extern void cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root);
--- 66,72 ----
                      double index_pages, PlannerInfo *root);
  extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_index(IndexPath *path, PlannerInfo *root, IndexOptInfo *index,
!                 List *indexQuals, List *indexOnlyQuals, RelOptInfo *outer_rel);
  extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                        Path *bitmapqual, RelOptInfo *outer_rel);
  extern void cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root);
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 31,36 **** extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel);
--- 31,37 ----
  extern IndexPath *create_index_path(PlannerInfo *root,
                    IndexOptInfo *index,
                    List *clause_groups,
+                   List *indexonlyquals,
                    List *pathkeys,
                    ScanDirection indexscandir,
                    RelOptInfo *outer_rel);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 75,80 **** extern SetOp *make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
--- 75,83 ----
  extern Result *make_result(PlannerInfo *root, List *tlist,
              Node *resconstantqual, Plan *subplan);
  extern bool is_projection_capable_plan(Plan *plan);
+ extern Expr *make_indexonly_expr(PlannerInfo *root, IndexOptInfo *index,
+                                  Expr *expr, bool *indexonly);
+

  /*
   * prototypes for plan/initsplan.c
***************
*** 118,122 **** extern void record_plan_function_dependency(PlannerGlobal *glob, Oid funcid);
  extern void extract_query_dependencies(List *queries,
                             List **relationOids,
                             List **invalItems);
-
  #endif   /* PLANMAIN_H */
--- 121,124 ----

Re: Index-only quals

From
Greg Stark
Date:
On Fri, Aug 21, 2009 at 12:43 PM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
>
> I added a new column 'amregurgitate' to pg_am, to mark which indexams
> can return index tuples.

Very picturesque but uh, perhaps the more mundane "amcanrettuples"
would be clearer?

-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: Index-only quals

From
Greg Stark
Date:
On Fri, Aug 21, 2009 at 12:43 PM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> Here is an updated version of my patch to return data from b-tree
> indexes, and use it to satisfy quals.

+             if (!found_clause && useful_pathkeys == NIL && !useful_predicate)
+                 ipath->scantype = ST_INDEXSCAN;
+             else
+             {
+                 ipath->scantype = 0;
+                 if (index->amhasgettuple)
+                     ipath->scantype |= ST_INDEXSCAN;
+                 if (index->amhasgetbitmap)
+                     ipath->scantype |= ST_BITMAPSCAN;
+             }
+


Does this section need to check amhasgettuple for the index-only scan
case as well? It looks like right now if an indexam has amregurgitate
set but not amhasgettuple then weird things could happen.

-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: Index-only quals

From
Heikki Linnakangas
Date:
Greg Stark wrote:
> On Fri, Aug 21, 2009 at 12:43 PM, Heikki
> Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
>> Here is an updated version of my patch to return data from b-tree
>> indexes, and use it to satisfy quals.
> 
> +             if (!found_clause && useful_pathkeys == NIL && !useful_predicate)
> +                 ipath->scantype = ST_INDEXSCAN;
> +             else
> +             {
> +                 ipath->scantype = 0;
> +                 if (index->amhasgettuple)
> +                     ipath->scantype |= ST_INDEXSCAN;
> +                 if (index->amhasgetbitmap)
> +                     ipath->scantype |= ST_BITMAPSCAN;
> +             }
> +
> 
> 
> Does this section need to check amhasgettuple for the index-only scan
> case as well? It looks like right now if an indexam has amregurgitate
> set but not amhasgettuple then weird things could happen.

We check earlier in the function before we construct indexonlyQuals that
the index has amhasgettuple. Hmm, can you find an easier-to-understand
way to write that?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Index-only quals

From
Heikki Linnakangas
Date:
Greg Stark wrote:
> It looks like right now if an indexam has amregurgitate
> set but not amhasgettuple then weird things could happen.

The combination (amregurgitate && !amhasgettuple) makes no sense, BTW.
If an indexam has no gettuple function, there's no way it can return
data from the index.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Index-only quals

From
Jaime Casanova
Date:
On Fri, Aug 21, 2009 at 7:27 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> Greg Stark wrote:
>> It looks like right now if an indexam has amregurgitate
>> set but not amhasgettuple then weird things could happen.
>
> The combination (amregurgitate && !amhasgettuple) makes no sense, BTW.
> If an indexam has no gettuple function, there's no way it can return
> data from the index.
>

Isn't having two columns that can conflict error-prone? Why not make
amhasgettuple an enum?

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157


Re: Index-only quals

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Barring objections, I'm going to apply the indexam API changes part,
> since that simplifies the code in question regardless of the rest of the
> work. I'm pretty happy now with the indexfilter patch as well, but want
> to do some more testing on that before committing. Some more eyeballs
> would be appreciated as well.

The first patch definitely needs another editorial pass, eg you have
complicated the behavior of heap_hot_search_buffer's all_dead parameter
without adjusting its documentation.  The patch as-submitted is really
quite hard to review, though.  It's hard to convince myself that all the
logic you removed from index_getnext is all accounted for elsewhere.
Could you run through the reasoning on that?

I think the second patch needs a fair amount of work.  The fact that
_bt_getindextuple can fail to get the right tuple seems quite bogus;
what I think it means is that you've attached the functionality in the
wrong place.  nbtree certainly had its hands on the right tuple at some
point, and what you should have done was saved the tuple aside at that
point.  I feel it's important that an indexam either be able to give
back tuples or not; this "maybe I can" semantics will prevent us from
doing many interesting things with the capability.  (More about that
in an upcoming message.)

ExecStoreIndexTuple seems rather confused as well, as it's applying what
apparently is a heap tupdesc to an index tuple.  That might accidentally
fail to fail at the moment, but I think it would be better to keep the
two concepts well separated.  Moreover attr-by-attr extraction is not
what you want, for efficiency reasons.  Use index_deform_tuple(), now
that that's there.  (The internal implementation of that is no better,
but now that there is actually a performance reason to improve it, we
could do that.)

I concur with the objection to "regurgitate" terminology, it seems a
bit yucky.

I haven't had time to go through the planner part in any detail...
        regards, tom lane


Re: Index-only quals

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Barring objections, I'm going to apply the indexam API changes part,
>> since that simplifies the code in question regardless of the rest of the
>> work. I'm pretty happy now with the indexfilter patch as well, but want
>> to do some more testing on that before committing. Some more eyeballs
>> would be appreciated as well.
> 
> The first patch definitely needs another editorial pass, eg you have
> complicated the behavior of heap_hot_search_buffer's all_dead parameter
> without adjusting its documentation.

Ok, I'll take a look.

>  The patch as-submitted is really
> quite hard to review, though.  It's hard to convince myself that all the
> logic you removed from index_getnext is all accounted for elsewhere.
> Could you run through the reasoning on that?

The removed logic in index_getnext() dealt with HOT chains. As the code
stands without the patch, index_getnext() returns all visible tuples
from a HOT chain, and therefore needs bookkeeping to keep track of the
current tuple in the current HOT chain.

With a normal MVCC snapshot, only one tuple in a HOT chain can be
visible, so all that logic is unnecessary for an MVCC snapshot. The same
holds for SnapshotSelf and SnapshotNow, but not for SnapshotAny.
However, there is only one caller of index_getnext() that uses SnapshotAny:
CLUSTER. I modified heap_hot_search_buffer() so that it can be used to
continue searching a HOT chain after the first visible tuple, and
modified cluster.c to use that for the HOT chain following. This allowed
me to dumb down index_next()/index_fetch() so that index_fetch() only
needs to find the first visible tuple in a HOT chain.

> I think the second patch needs a fair amount of work.  The fact that
> _bt_getindextuple can fail to get the right tuple seems quite bogus;
> what I think it means is that you've attached the functionality in the
> wrong place.  nbtree certainly had its hands on the right tuple at some
> point, and what you should have done was saved the tuple aside at that
> point.  I feel it's important that an indexam either be able to give
> back tuples or not; this "maybe I can" semantics will prevent us from
> doing many interesting things with the capability.  (More about that
> in an upcoming message.)

Yeah, I think you're right. That seemed like the quickest way to get it
done, but I'm seeing that it's also pretty bad from a performance
standpoint - the index page needs to be locked and unlocked for each
tuple, which in a quick microbenchmark makes the index scan slower than
a seqscan when the data is in cache. I'm seeing about 10-15% of CPU time
spent on that locking, and the index scan is about that much slower than
the seq scan.

That can certainly be improved. Unfortunately we can't just return
pointers to the pinned index page in shared buffer, because a pin
doesn't stop page splits or moving tuples like it does on heap pages.
But we can copy all matching tuples to local memory along with the
matching TIDs when we step on a page.

> I concur with the objection to "regurgitate" terminology, it seems a
> bit yucky.

Apparently that word evokes stronger images in native speakers :-).

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Index-only quals

From
Bruce Momjian
Date:
I added this URL to the existing TODO item.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Here is an updated version of my patch to return data from b-tree
> indexes, and use it to satisfy quals.
> 
> I added a new column 'amregurgitate' to pg_am, to mark which indexams
> can return index tuples. Also, the data type of the index column in
> pg_attribute must match the type of the heap column - this catches the
> hack that 'name' is stored as cstring, that I had hardcoded before.
> 
> As discussed, GiST/GIN would need more infrastructure to mark which
> opclasses can return tuples, but as long as GiST/GIN doesn't support
> regurgitation at all, I'm not going to complicate the catalogs with that.
> 
> There's also some planner fixes - indexes that are only useful because
> of index-only quals are not considered for bitmap scans, and volatile
> expressions mustn't be used as index-only quals.
> 
> 
> This patch comes in two parts. Indexam API changes, which just splits
> the indexam_getnext function into two without providing any new
> functionality, and the main patch that applies on top of the indexam API
> changes. The patches are also available at
> git://git.postgresql.org/git/users/heikki/postgres.git, branches
> 'indexam-api-changes' and 'indexfilter'.
> 
> Barring objections, I'm going to apply the indexam API changes part,
> since that simplifies the code in question regardless of the rest of the
> work. I'm pretty happy now with the indexfilter patch as well, but want
> to do some more testing on that before committing. Some more eyeballs
> would be appreciated as well.
> 
> -- 
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com




--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +