Thread: WAL bypass for CTAS

WAL bypass for CTAS

From
Simon Riggs
Date:
I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

This contains all the lower level machinery required to do the same
thing for COPY, as discussed on hackers. The machinery includes some
additional freespace thinkery, aimed mainly at the forthcoming COPY
patch, which solely needs to be integrated with Alon's work.

Patch is diff -c format, compiles and make checks on cvstip as of now.

No performance tests *on this patch*, though the general principle has
already been proven via a similar prototype patch not published on list.

Best Regards, Simon Riggs

Attachment

Re: WAL bypass for CTAS

From
Bruce Momjian
Date:
Simon Riggs wrote:
> I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> SELECT, when not in archive mode (PITR). The main use case for this is
> large BI environments that create summary tables or prejoined tables,
> though there are many general applications.
>
> There is no user interface for this. The speed gain is automatic, when
> archiving is not enabled.

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?  I would _love_ to see
pg_dump loads use this automatically, without having to add clauses to
pg_dump output.

I think we decided we can't do it automatically for all zero-row COPYs
because of locking concerns.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL bypass for CTAS

From
Neil Conway
Date:
Bruce Momjian wrote:
> Could we do your NOLOGGING automatically in COPY if we test to see if
> anyone else is connected to our current database?

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

> I would _love_ to see pg_dump loads use this automatically, without
> having to add clauses to pg_dump output.

What's wrong with adding clauses to the pg_dump output?

-Neil


Re: WAL bypass for CTAS

From
Bruce Momjian
Date:
Neil Conway wrote:
> Bruce Momjian wrote:
> > Could we do your NOLOGGING automatically in COPY if we test to see if
> > anyone else is connected to our current database?
>
> That seems pretty fragile -- what happens if someone connects after the
> COPY has started? Considering that many COPY operations can take many
> minutes or hours, I don't think it is wise to make assumptions based on
> the initial state of the system.
>
> > I would _love_ to see pg_dump loads use this automatically, without
> > having to add clauses to pg_dump output.
>
> What's wrong with adding clauses to the pg_dump output?

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL bypass for CTAS

From
Neil Conway
Date:
Bruce Momjian wrote:
> Well, it isn't going to help us for 8.1 because 8.0 will not have it,
> and if we add the clause we make loading the data into previous releases
> harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

-Neil

Re: WAL bypass for CTAS

From
Russell Smith
Date:
On Sun, 5 Jun 2005 10:29 am, Neil Conway wrote:
> Bruce Momjian wrote:
> > Well, it isn't going to help us for 8.1 because 8.0 will not have it,
> > and if we add the clause we make loading the data into previous releases
> > harder.
>
> pg_dump output in general is not compatible with prior releases. It
> would be a nice feature to have, but until we have it, I don't see that
> changing or not changing the COPY syntax will make a major difference to
> dump backward compatibility.

Don't we usually suggest using the new pg_dump to dump the old database anyway?

If that's the case, then we just add the locking options in there.  Otherwise, yes you are
stuck with the original locking mechanism.  But if people are smart and want faster loading
they will play with sed and friends to make it work.

Even if people for 8.1 just get the supposed 500% speed increase because of a better parser,
lots of people will be happy.

Regards

Russell Smith

Re: WAL bypass for CTAS

From
Simon Riggs
Date:
On Sun, 2005-06-05 at 10:20 +1000, Neil Conway wrote:
> Bruce Momjian wrote:
> > Could we do your NOLOGGING automatically in COPY if we test to see if
> > anyone else is connected to our current database?

Remember that this patch doe NOT yet handle COPY, but that is planned...

> That seems pretty fragile -- what happens if someone connects after the
> COPY has started? Considering that many COPY operations can take many
> minutes or hours, I don't think it is wise to make assumptions based on
> the initial state of the system.

Agreed.

Best Regards, Simon Riggs


Re: WAL bypass for CTAS

From
Bruce Momjian
Date:
Tom has applied this patch.  Thanks.

---------------------------------------------------------------------------

Simon Riggs wrote:
> I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> SELECT, when not in archive mode (PITR). The main use case for this is
> large BI environments that create summary tables or prejoined tables,
> though there are many general applications.
>
> There is no user interface for this. The speed gain is automatic, when
> archiving is not enabled.
>
> This contains all the lower level machinery required to do the same
> thing for COPY, as discussed on hackers. The machinery includes some
> additional freespace thinkery, aimed mainly at the forthcoming COPY
> patch, which solely needs to be integrated with Alon's work.
>
> Patch is diff -c format, compiles and make checks on cvstip as of now.
>
> No performance tests *on this patch*, though the general principle has
> already been proven via a similar prototype patch not published on list.
>
> Best Regards, Simon Riggs

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL bypass for CTAS

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> SELECT, when not in archive mode (PITR). The main use case for this is
> large BI environments that create summary tables or prejoined tables,
> though there are many general applications.

Applied after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(, and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries.  It will
be a very real hazard for COPY however.  The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones.  To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit.  We can just omit updating the FSM's
running average, if it even has one.  (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

Patch as applied is attached.

            regards, tom lane

*** src/backend/access/heap/heapam.c.orig    Wed Jun  8 11:50:21 2005
--- src/backend/access/heap/heapam.c    Mon Jun 20 13:50:16 2005
***************
*** 1034,1042 ****
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid)
  {
      TransactionId xid = GetCurrentTransactionId();
      Buffer        buffer;
--- 1034,1053 ----
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
+  *
+  * If use_wal is false, the new tuple is not logged in WAL, even for a
+  * non-temp relation.  Safe usage of this behavior requires that we arrange
+  * that all new tuples go into new pages not containing any tuples from other
+  * transactions, that the relation gets fsync'd before commit, and that the
+  * transaction emits at least one WAL record to ensure RecordTransactionCommit
+  * will decide to WAL-log the commit.
+  *
+  * use_fsm is passed directly to RelationGetBufferForTuple, which see for
+  * more info.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm)
  {
      TransactionId xid = GetCurrentTransactionId();
      Buffer        buffer;
***************
*** 1086,1092 ****
          heap_tuple_toast_attrs(relation, tup, NULL);

      /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, tup->t_len, InvalidBuffer);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
--- 1097,1104 ----
          heap_tuple_toast_attrs(relation, tup, NULL);

      /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, tup->t_len,
!                                        InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
***************
*** 1096,1102 ****
      pgstat_count_heap_insert(&relation->pgstat_info);

      /* XLOG stuff */
!     if (!relation->rd_istemp)
      {
          xl_heap_insert xlrec;
          xl_heap_header xlhdr;
--- 1108,1119 ----
      pgstat_count_heap_insert(&relation->pgstat_info);

      /* XLOG stuff */
!     if (relation->rd_istemp)
!     {
!         /* No XLOG record, but still need to flag that XID exists on disk */
!         MyXactMadeTempRelUpdate = true;
!     }
!     else if (use_wal)
      {
          xl_heap_insert xlrec;
          xl_heap_header xlhdr;
***************
*** 1151,1161 ****
          PageSetLSN(page, recptr);
          PageSetTLI(page, ThisTimeLineID);
      }
-     else
-     {
-         /* No XLOG record, but still need to flag that XID exists on disk */
-         MyXactMadeTempRelUpdate = true;
-     }

      END_CRIT_SECTION();

--- 1168,1173 ----
***************
*** 1183,1189 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId());
  }

  /*
--- 1195,1201 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }

  /*
***************
*** 1743,1749 ****
          {
              /* Assume there's no chance to put newtup on same page. */
              newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                buffer);
          }
          else
          {
--- 1755,1761 ----
          {
              /* Assume there's no chance to put newtup on same page. */
              newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                buffer, true);
          }
          else
          {
***************
*** 1760,1766 ****
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                    buffer);
              }
              else
              {
--- 1772,1778 ----
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
!                                                    buffer, true);
              }
              else
              {
*** src/backend/access/heap/hio.c.orig    Sat May  7 17:32:23 2005
--- src/backend/access/heap/hio.c    Mon Jun 20 13:50:16 2005
***************
*** 79,90 ****
   *    happen if space is freed in that page after heap_update finds there's not
   *    enough there).    In that case, the page will be pinned and locked only once.
   *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
--- 79,104 ----
   *    happen if space is freed in that page after heap_update finds there's not
   *    enough there).    In that case, the page will be pinned and locked only once.
   *
+  *    If use_fsm is true (the normal case), we use FSM to help us find free
+  *    space.  If use_fsm is false, we always append a new empty page to the
+  *    end of the relation if the tuple won't fit on the current target page.
+  *    This can save some cycles when we know the relation is new and doesn't
+  *    contain useful amounts of free space.
+  *
+  *    The use_fsm = false case is also useful for non-WAL-logged additions to a
+  *    relation, if the caller holds exclusive lock and is careful to invalidate
+  *    relation->rd_targblock before the first insertion --- that ensures that
+  *    all insertions will occur into newly added pages and not be intermixed
+  *    with tuples from other transactions.  That way, a crash can't risk losing
+  *    any committed data of other transactions.  (See heap_insert's comments
+  *    for additional constraints needed for safe usage of this behavior.)
+  *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer, bool use_fsm)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
***************
*** 121,131 ****
       * on each page that proves not to be suitable.)  If the FSM has no
       * record of a page with enough free space, we give up and extend the
       * relation.
       */

      targetBlock = relation->rd_targblock;

!     if (targetBlock == InvalidBlockNumber)
      {
          /*
           * We have no cached target page, so ask the FSM for an initial
--- 135,148 ----
       * on each page that proves not to be suitable.)  If the FSM has no
       * record of a page with enough free space, we give up and extend the
       * relation.
+      *
+      * When use_fsm is false, we either put the tuple onto the existing
+      * target page or extend the relation.
       */

      targetBlock = relation->rd_targblock;

!     if (targetBlock == InvalidBlockNumber && use_fsm)
      {
          /*
           * We have no cached target page, so ask the FSM for an initial
***************
*** 208,213 ****
--- 225,234 ----
              LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
              ReleaseBuffer(buffer);
          }
+
+         /* Without FSM, always fall out of the loop and extend */
+         if (!use_fsm)
+             break;

          /*
           * Update FSM as to condition of this page, and ask for another
*** src/backend/executor/execMain.c.orig    Sun May 22 18:30:19 2005
--- src/backend/executor/execMain.c    Mon Jun 20 14:22:16 2005
***************
*** 33,38 ****
--- 33,39 ----
  #include "postgres.h"

  #include "access/heapam.h"
+ #include "access/xlog.h"
  #include "catalog/heap.h"
  #include "catalog/namespace.h"
  #include "commands/tablecmds.h"
***************
*** 44,49 ****
--- 45,51 ----
  #include "optimizer/clauses.h"
  #include "optimizer/var.h"
  #include "parser/parsetree.h"
+ #include "storage/smgr.h"
  #include "utils/acl.h"
  #include "utils/guc.h"
  #include "utils/lsyscache.h"
***************
*** 784,789 ****
--- 786,805 ----
           * And open the constructed table for writing.
           */
          intoRelationDesc = heap_open(intoRelationId, AccessExclusiveLock);
+
+         /* use_wal off requires rd_targblock be initially invalid */
+         Assert(intoRelationDesc->rd_targblock == InvalidBlockNumber);
+
+         /*
+          * We can skip WAL-logging the insertions, unless PITR is in use.
+          *
+          * Note that for a non-temp INTO table, this is safe only because
+          * we know that the catalog changes above will have been WAL-logged,
+          * and so RecordTransactionCommit will think it needs to WAL-log the
+          * eventual transaction commit.  Else the commit might be lost, even
+          * though all the data is safely fsync'd ...
+          */
+         estate->es_into_relation_use_wal = XLogArchivingActive();
      }

      estate->es_into_relation_descriptor = intoRelationDesc;
***************
*** 979,985 ****
--- 995,1016 ----
       * close the "into" relation if necessary, again keeping lock
       */
      if (estate->es_into_relation_descriptor != NULL)
+     {
+         /*
+          * If we skipped using WAL, and it's not a temp relation,
+          * we must force the relation down to disk before it's
+          * safe to commit the transaction.  This requires forcing
+          * out any dirty buffers and then doing a forced fsync.
+          */
+         if (!estate->es_into_relation_use_wal &&
+             !estate->es_into_relation_descriptor->rd_istemp)
+         {
+             FlushRelationBuffers(estate->es_into_relation_descriptor);
+             smgrimmedsync(estate->es_into_relation_descriptor->rd_smgr);
+         }
+
          heap_close(estate->es_into_relation_descriptor, NoLock);
+    }

      /*
       * close any relations selected FOR UPDATE/FOR SHARE, again keeping locks
***************
*** 1307,1313 ****

          tuple = ExecCopySlotTuple(slot);
          heap_insert(estate->es_into_relation_descriptor, tuple,
!                     estate->es_snapshot->curcid);
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
--- 1338,1346 ----

          tuple = ExecCopySlotTuple(slot);
          heap_insert(estate->es_into_relation_descriptor, tuple,
!                     estate->es_snapshot->curcid,
!                     estate->es_into_relation_use_wal,
!                     false);        /* never any point in using FSM */
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
***************
*** 1386,1392 ****
       * insert the tuple
       */
      newId = heap_insert(resultRelationDesc, tuple,
!                         estate->es_snapshot->curcid);

      IncrAppended();
      (estate->es_processed)++;
--- 1419,1426 ----
       * insert the tuple
       */
      newId = heap_insert(resultRelationDesc, tuple,
!                         estate->es_snapshot->curcid,
!                         true, true);

      IncrAppended();
      (estate->es_processed)++;
***************
*** 2089,2094 ****
--- 2123,2129 ----
      epqstate->es_result_relation_info = estate->es_result_relation_info;
      epqstate->es_junkFilter = estate->es_junkFilter;
      epqstate->es_into_relation_descriptor = estate->es_into_relation_descriptor;
+     epqstate->es_into_relation_use_wal = estate->es_into_relation_use_wal;
      epqstate->es_param_list_info = estate->es_param_list_info;
      if (estate->es_topPlan->nParamExec > 0)
          epqstate->es_param_exec_vals = (ParamExecData *)
*** src/backend/executor/execUtils.c.orig    Thu Apr 28 17:47:12 2005
--- src/backend/executor/execUtils.c    Mon Jun 20 13:08:30 2005
***************
*** 186,192 ****
--- 186,194 ----
      estate->es_result_relation_info = NULL;

      estate->es_junkFilter = NULL;
+
      estate->es_into_relation_descriptor = NULL;
+     estate->es_into_relation_use_wal = false;

      estate->es_param_list_info = NULL;
      estate->es_param_exec_vals = NULL;
*** src/backend/storage/smgr/md.c.orig    Sun May 29 00:23:05 2005
--- src/backend/storage/smgr/md.c    Mon Jun 20 14:26:50 2005
***************
*** 660,665 ****
--- 660,668 ----

  /*
   *    mdimmedsync() -- Immediately sync a relation to stable storage.
+  *
+  * Note that only writes already issued are synced; this routine knows
+  * nothing of dirty buffers that may exist inside the buffer manager.
   */
  bool
  mdimmedsync(SMgrRelation reln)
*** src/backend/storage/smgr/smgr.c.orig    Fri Jun 17 18:32:46 2005
--- src/backend/storage/smgr/smgr.c    Mon Jun 20 14:26:50 2005
***************
*** 650,656 ****
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *        Synchronously force all of the specified relation down to disk.
   *
   *        This is useful for building completely new relations (eg, new
   *        indexes).  Instead of incrementally WAL-logging the index build
--- 650,657 ----
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *        Synchronously force all previous writes to the specified relation
!  *        down to disk.
   *
   *        This is useful for building completely new relations (eg, new
   *        indexes).  Instead of incrementally WAL-logging the index build
***************
*** 664,669 ****
--- 665,674 ----
   *
   *        The preceding writes should specify isTemp = true to avoid
   *        duplicative fsyncs.
+  *
+  *        Note that you need to do FlushRelationBuffers() first if there is
+  *        any possibility that there are dirty buffers for the relation;
+  *        otherwise the sync is not very meaningful.
   */
  void
  smgrimmedsync(SMgrRelation reln)
*** src/include/access/heapam.h.orig    Mon Jun  6 13:01:24 2005
--- src/include/access/heapam.h    Mon Jun 20 13:46:30 2005
***************
*** 156,162 ****
                      ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);

! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
              CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
--- 156,163 ----
                      ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);

! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!                         bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
              CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
*** src/include/access/hio.h.orig    Fri Dec 31 17:46:38 2004
--- src/include/access/hio.h    Mon Jun 20 13:08:23 2005
***************
*** 19,24 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer);

  #endif   /* HIO_H */
--- 19,24 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                                         Buffer otherBuffer, bool use_fsm);

  #endif   /* HIO_H */
*** src/include/nodes/execnodes.h.orig    Fri Jun 17 14:54:18 2005
--- src/include/nodes/execnodes.h    Mon Jun 20 13:08:12 2005
***************
*** 304,310 ****
--- 304,312 ----
      ResultRelInfo *es_result_relation_info;        /* currently active array
                                                   * elt */
      JunkFilter *es_junkFilter;    /* currently active junk filter */
+
      Relation    es_into_relation_descriptor;    /* for SELECT INTO */
+     bool        es_into_relation_use_wal;

      /* Parameter info: */
      ParamListInfo es_param_list_info;    /* values of external params */

Re: WAL bypass for CTAS

From
Simon Riggs
Date:
On Mon, 2005-06-20 at 14:50 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> > SELECT, when not in archive mode (PITR). The main use case for this is
> > large BI environments that create summary tables or prejoined tables,
> > though there are many general applications.
>
> Applied

Thanks

> after heavy corrections --- there were a number of things wrong
> with this "simple" patch, starting with having gotten the tests
> backwards :-(

Sorry, I thought I had corrected that error before submission. I was
aware that I had made that error earlier.

> and extending to not having actually flushed the data
> before commit (smgrimmedsync isn't enough, you have to
> FlushRelationBuffers).

I followed the logic as seen in nbtsort.c as you suggested. That code
doesn't perform a FlushRelationBuffers and it looks like I fooled myself
into thinking the CTAS/SELECT INTO case was also in local.

Perhaps we should be building CTAS/SELECT INTO in local buffers anyway?
It looks like we could save time by avoiding shared_buffers completely
and build up a whole page before writing it anywhere. (But thats a story
for another day).

Perhaps this is also related to metapage errors, since the metapage is
always the last page to be written?

> A consideration we had all missed in the original discussions is that
> if the transaction doesn't emit any WAL records at all,
> RecordTransactionCommit will think that it need not WAL-log the
> transaction commit, leading to the possibility that the commit is lost
> even though all the data is preserved :-(

> This is not a hazard for CREATE TABLE AS, since it will certainly have
> emitted WAL records while creating the table's catalog entries.  It will
> be a very real hazard for COPY however.

OK, but I haven't written that patch yet!

> The cleanest solution I can
> think of is that the COPY code should emit a WAL record for the first
> tuple copied in, but not for later ones.  To this end, I separated the
> "use_wal" and "use_fsm" aspects of what the patch was doing.

Not very clean, but will do as you suggest.

> I didn't apply the freespace.c changes either; that struck me as a
> serious kluge with no real benefit.  We can just omit updating the FSM's
> running average, if it even has one.  (ISTM there's a reasonable
> argument to be made that the tuple sizes during CREATE/COPY might not be
> representative of later requests anyway.)

I was striving for completeness only. I was doubtful about that part of
the patch, but thought I'd add that rather than have you say I hadn't
thought about the FSM avg_request_size.

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Best Regards, Simon Riggs


Re: WAL bypass for CTAS

From
Alvaro Herrera
Date:
On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:

> I put those changes in mainly for COPY. If you don't make any request at
> all to FSM then a relation never gets to the MRU relation FSM list. I
> agree that it is not strictly necessary, but leaving it off would be a
> change in behaviour, since COPY did previously cause the relation to get
> to the MRU. That could be a problem, since a relation might not then be
> allocated any FSM pages following a vacuum.

Is that a problem?  If the pages don't fit in FSM, then maybe the system
is misconfigured anyway.  The person running the DW should just increase
the FSM settings, which is hardly a costly thing because it uses so
little memory.

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"No renuncies a nada. No te aferres a nada."

Re: WAL bypass for CTAS

From
Simon Riggs
Date:
On Mon, 2005-06-20 at 17:09 -0400, Alvaro Herrera wrote:
> On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:
>
> > I put those changes in mainly for COPY. If you don't make any request at
> > all to FSM then a relation never gets to the MRU relation FSM list. I
> > agree that it is not strictly necessary, but leaving it off would be a
> > change in behaviour, since COPY did previously cause the relation to get
> > to the MRU. That could be a problem, since a relation might not then be
> > allocated any FSM pages following a vacuum.
>
> Is that a problem?

Not for me, but I wanted to explain the change in behaviour that
implies.

> If the pages don't fit in FSM, then maybe the system
> is misconfigured anyway.  The person running the DW should just increase
> the FSM settings, which is hardly a costly thing because it uses so
> little memory.

If you aren't on the relation list you don't get any more pages than the
minimum. No matter how many fsm_pages you allocate. If fsm_pages covers
everything, then you are right, there is no problem.

Best Regards, Simon Riggs