Re: WAL bypass for CTAS - Mailing list pgsql-patches

From Tom Lane
Subject Re: WAL bypass for CTAS
Date
Msg-id 19788.1119293419@sss.pgh.pa.us
Whole thread Raw
In response to WAL bypass for CTAS  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: WAL bypass for CTAS
List pgsql-patches
Simon Riggs <simon@2ndquadrant.com> writes:
> I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> SELECT, when not in archive mode (PITR). The main use case for this is
> large BI environments that create summary tables or prejoined tables,
> though there are many general applications.

Applied after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(, and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries.  It will
be a very real hazard for COPY however.  The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones.  To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit.  We can just omit updating the FSM's
running average, if it even has one.  (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

Patch as applied is attached.

            regards, tom lane

*** src/backend/access/heap/heapam.c.orig    Wed Jun  8 11:50:21 2005
--- src/backend/access/heap/heapam.c    Mon Jun 20 13:50:16 2005
***************
*** 1034,1042 ****
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid)
  {
      TransactionId xid = GetCurrentTransactionId();
      Buffer        buffer;
--- 1034,1053 ----
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
+  *
+  * If use_wal is false, the new tuple is not logged in WAL, even for a
+  * non-temp relation.  Safe usage of this behavior requires that we arrange
+  * that all new tuples go into new pages not containing any tuples from other
+  * transactions, that the relation gets fsync'd before commit, and that the
+  * transaction emits at least one WAL record to ensure RecordTransactionCommit
+  * will decide to WAL-log the commit.
+  *
+  * use_fsm is passed directly to RelationGetBufferForTuple, which see for
+  * more info.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm)
  {
      TransactionId xid = GetCurrentTransactionId();
      Buffer        buffer;
***************
*** 1086,1092 ****
          heap_tuple_toast_attrs(relation, tup, NULL);

      /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, tup->t_len, InvalidBuffer);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
--- 1097,1104 ----
          heap_tuple_toast_attrs(relation, tup, NULL);

      /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, tup->t_len,
!                                        InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
***************
*** 1096,1102 ****
      pgstat_count_heap_insert(&relation->pgstat_info);

      /* XLOG stuff */
!     if (!relation->rd_istemp)
      {
          xl_heap_insert xlrec;
          xl_heap_header xlhdr;
--- 1108,1119 ----
      pgstat_count_heap_insert(&relation->pgstat_info);

      /* XLOG stuff */
!     if (relation->rd_istemp)
!     {
!         /* No XLOG record, but still need to flag that XID exists on disk */
!         MyXactMadeTempRelUpdate = true;
!     }
!     else if (use_wal)
      {
          xl_heap_insert xlrec;
          xl_heap_header xlhdr;
***************
*** 1151,1161 ****
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
-     else
-     {
-         /* No XLOG record, but still need to flag that XID exists on disk */
-         MyXactMadeTempRelUpdate = true;
-     }

      END_CRIT_SECTION();

--- 1168,1173 ----
***************
*** 1183,1189 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId());
  }

  /*
--- 1195,1201 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }

  /*
***************
*** 1743,1749 ****
          {
              /* Assume there's no chance to put newtup on same page. */
              newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                buffer);
          }
          else
          {
--- 1755,1761 ----
          {
              /* Assume there's no chance to put newtup on same page. */
              newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                buffer, true);
          }
          else
          {
***************
*** 1760,1766 ****
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                    buffer);
              }
              else
              {
--- 1772,1778 ----
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                    buffer, true);
              }
              else
              {
*** src/backend/access/heap/hio.c.orig    Sat May  7 17:32:23 2005
--- src/backend/access/heap/hio.c    Mon Jun 20 13:50:16 2005
***************
*** 79,90 ****
   *    happen if space is freed in that page after heap_update finds there's not
   *    enough there).    In that case, the page will be pinned and locked only once.
   *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
--- 79,104 ----
   *    happen if space is freed in that page after heap_update finds there's not
   *    enough there).    In that case, the page will be pinned and locked only once.
   *
+  *    If use_fsm is true (the normal case), we use FSM to help us find free
+  *    space.  If use_fsm is false, we always append a new empty page to the
+  *    end of the relation if the tuple won't fit on the current target page.
+  *    This can save some cycles when we know the relation is new and doesn't
+  *    contain useful amounts of free space.
+  *
+  *    The use_fsm = false case is also useful for non-WAL-logged additions to a
+  *    relation, if the caller holds exclusive lock and is careful to invalidate
+  *    relation->rd_targblock before the first insertion --- that ensures that
+  *    all insertions will occur into newly added pages and not be intermixed
+  *    with tuples from other transactions.  That way, a crash can't risk losing
+  *    any committed data of other transactions.  (See heap_insert's comments
+  *    for additional constraints needed for safe usage of this behavior.)
+  *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer, bool use_fsm)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
***************
*** 121,131 ****
       * on each page that proves not to be suitable.)  If the FSM has no
       * record of a page with enough free space, we give up and extend the
       * relation.
       */

      targetBlock = relation->rd_targblock;

!     if (targetBlock == InvalidBlockNumber)
      {
          /*
           * We have no cached target page, so ask the FSM for an initial
--- 135,148 ----
       * on each page that proves not to be suitable.)  If the FSM has no
       * record of a page with enough free space, we give up and extend the
       * relation.
+      *
+      * When use_fsm is false, we either put the tuple onto the existing
+      * target page or extend the relation.
       */

      targetBlock = relation->rd_targblock;

!     if (targetBlock == InvalidBlockNumber && use_fsm)
      {
          /*
           * We have no cached target page, so ask the FSM for an initial
***************
*** 208,213 ****
--- 225,234 ----
              LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
              ReleaseBuffer(buffer);
          }
+
+         /* Without FSM, always fall out of the loop and extend */
+         if (!use_fsm)
+             break;

          /*
           * Update FSM as to condition of this page, and ask for another
*** src/backend/executor/execMain.c.orig    Sun May 22 18:30:19 2005
--- src/backend/executor/execMain.c    Mon Jun 20 14:22:16 2005
***************
*** 33,38 ****
--- 33,39 ----
  #include "postgres.h"

  #include "access/heapam.h"
+ #include "access/xlog.h"
  #include "catalog/heap.h"
  #include "catalog/namespace.h"
  #include "commands/tablecmds.h"
***************
*** 44,49 ****
--- 45,51 ----
  #include "optimizer/clauses.h"
  #include "optimizer/var.h"
  #include "parser/parsetree.h"
+ #include "storage/smgr.h"
  #include "utils/acl.h"
  #include "utils/guc.h"
  #include "utils/lsyscache.h"
***************
*** 784,789 ****
--- 786,805 ----
           * And open the constructed table for writing.
           */
          intoRelationDesc = heap_open(intoRelationId, AccessExclusiveLock);
+
+         /* use_wal off requires rd_targblock be initially invalid */
+         Assert(intoRelationDesc->rd_targblock == InvalidBlockNumber);
+
+         /*
+          * We can skip WAL-logging the insertions, unless PITR is in use.
+          *
+          * Note that for a non-temp INTO table, this is safe only because
+          * we know that the catalog changes above will have been WAL-logged,
+          * and so RecordTransactionCommit will think it needs to WAL-log the
+          * eventual transaction commit.  Else the commit might be lost, even
+          * though all the data is safely fsync'd ...
+          */
+         estate->es_into_relation_use_wal = XLogArchivingActive();
      }

      estate->es_into_relation_descriptor = intoRelationDesc;
***************
*** 979,985 ****
--- 995,1016 ----
       * close the "into" relation if necessary, again keeping lock
       */
      if (estate->es_into_relation_descriptor != NULL)
+     {
+         /*
+          * If we skipped using WAL, and it's not a temp relation,
+          * we must force the relation down to disk before it's
+          * safe to commit the transaction.  This requires forcing
+          * out any dirty buffers and then doing a forced fsync.
+          */
+         if (!estate->es_into_relation_use_wal &&
+             !estate->es_into_relation_descriptor->rd_istemp)
+         {
+             FlushRelationBuffers(estate->es_into_relation_descriptor);
+             smgrimmedsync(estate->es_into_relation_descriptor->rd_smgr);
+         }
+
          heap_close(estate->es_into_relation_descriptor, NoLock);
+    }

      /*
       * close any relations selected FOR UPDATE/FOR SHARE, again keeping locks
***************
*** 1307,1313 ****

          tuple = ExecCopySlotTuple(slot);
          heap_insert(estate->es_into_relation_descriptor, tuple,
!                     estate->es_snapshot->curcid);
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
--- 1338,1346 ----

          tuple = ExecCopySlotTuple(slot);
          heap_insert(estate->es_into_relation_descriptor, tuple,
!                     estate->es_snapshot->curcid,
!                     estate->es_into_relation_use_wal,
!                     false);        /* never any point in using FSM */
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
***************
*** 1386,1392 ****
       * insert the tuple
       */
      newId = heap_insert(resultRelationDesc, tuple,
!                         estate->es_snapshot->curcid);

      IncrAppended();
      (estate->es_processed)++;
--- 1419,1426 ----
       * insert the tuple
       */
      newId = heap_insert(resultRelationDesc, tuple,
!                         estate->es_snapshot->curcid,
!                         true, true);

      IncrAppended();
      (estate->es_processed)++;
***************
*** 2089,2094 ****
--- 2123,2129 ----
      epqstate->es_result_relation_info = estate->es_result_relation_info;
      epqstate->es_junkFilter = estate->es_junkFilter;
      epqstate->es_into_relation_descriptor = estate->es_into_relation_descriptor;
+     epqstate->es_into_relation_use_wal = estate->es_into_relation_use_wal;
      epqstate->es_param_list_info = estate->es_param_list_info;
      if (estate->es_topPlan->nParamExec > 0)
          epqstate->es_param_exec_vals = (ParamExecData *)
*** src/backend/executor/execUtils.c.orig    Thu Apr 28 17:47:12 2005
--- src/backend/executor/execUtils.c    Mon Jun 20 13:08:30 2005
***************
*** 186,192 ****
--- 186,194 ----
      estate->es_result_relation_info = NULL;

      estate->es_junkFilter = NULL;
+
      estate->es_into_relation_descriptor = NULL;
+     estate->es_into_relation_use_wal = false;

      estate->es_param_list_info = NULL;
      estate->es_param_exec_vals = NULL;
*** src/backend/storage/smgr/md.c.orig    Sun May 29 00:23:05 2005
--- src/backend/storage/smgr/md.c    Mon Jun 20 14:26:50 2005
***************
*** 660,665 ****
--- 660,668 ----

  /*
   *    mdimmedsync() -- Immediately sync a relation to stable storage.
+  *
+  * Note that only writes already issued are synced; this routine knows
+  * nothing of dirty buffers that may exist inside the buffer manager.
   */
  bool
  mdimmedsync(SMgrRelation reln)
*** src/backend/storage/smgr/smgr.c.orig    Fri Jun 17 18:32:46 2005
--- src/backend/storage/smgr/smgr.c    Mon Jun 20 14:26:50 2005
***************
*** 650,656 ****
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *        Synchronously force all of the specified relation down to disk.
   *
   *        This is useful for building completely new relations (eg, new
   *        indexes).  Instead of incrementally WAL-logging the index build
--- 650,657 ----
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *        Synchronously force all previous writes to the specified relation
!  *        down to disk.
   *
   *        This is useful for building completely new relations (eg, new
   *        indexes).  Instead of incrementally WAL-logging the index build
***************
*** 664,669 ****
--- 665,674 ----
   *
   *        The preceding writes should specify isTemp = true to avoid
   *        duplicative fsyncs.
+  *
+  *        Note that you need to do FlushRelationBuffers() first if there is
+  *        any possibility that there are dirty buffers for the relation;
+  *        otherwise the sync is not very meaningful.
   */
  void
  smgrimmedsync(SMgrRelation reln)
*** src/include/access/heapam.h.orig    Mon Jun  6 13:01:24 2005
--- src/include/access/heapam.h    Mon Jun 20 13:46:30 2005
***************
*** 156,162 ****
                      ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);

! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
              CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
--- 156,163 ----
                      ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);

! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!                         bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
              CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
*** src/include/access/hio.h.orig    Fri Dec 31 17:46:38 2004
--- src/include/access/hio.h    Mon Jun 20 13:08:23 2005
***************
*** 19,24 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer);

  #endif   /* HIO_H */
--- 19,24 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                                         Buffer otherBuffer, bool use_fsm);

  #endif   /* HIO_H */
*** src/include/nodes/execnodes.h.orig    Fri Jun 17 14:54:18 2005
--- src/include/nodes/execnodes.h    Mon Jun 20 13:08:12 2005
***************
*** 304,310 ****
--- 304,312 ----
      ResultRelInfo *es_result_relation_info;        /* currently active array
                                                   * elt */
      JunkFilter *es_junkFilter;    /* currently active junk filter */
+
      Relation    es_into_relation_descriptor;    /* for SELECT INTO */
+     bool        es_into_relation_use_wal;

      /* Parameter info: */
      ParamListInfo es_param_list_info;    /* values of external params */

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: WAL bypass for CTAS
Next
From: Simon Riggs
Date:
Subject: Re: WAL bypass for CTAS