Re: CompactCheckpointerRequestQueue versus pad bytes - Mailing list pgsql-hackers

From Tom Lane
Subject Re: CompactCheckpointerRequestQueue versus pad bytes
Date
Msg-id 8687.1342480655@sss.pgh.pa.us
Whole thread Raw
In response to Re: CompactCheckpointerRequestQueue versus pad bytes  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: CompactCheckpointerRequestQueue versus pad bytes
Re: CompactCheckpointerRequestQueue versus pad bytes
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 16, 2012 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So I started to do this, and immediately hit a roadblock: although it's
>> clear that we can discard any attempt to fsync a backend-local relation,
>> it's not so clear that we don't need to queue UNLINK_RELATION_REQUEST
>> operations for local relations.
>>
>> I think that this is in fact okay.  The reason for delaying unlink until
>> after the next checkpoint is that if we did not, and the relfilenode got
>> re-used for an unrelated relation, and then we crashed and had to replay
>> WAL, we would replay any WAL referencing the old relation into the
>> unrelated relation's storage, with potential bad consequences if you're
>> unlucky.  However, no WAL should ever be generated for a backend-local
>> relation, so there is nothing to guard against and hence no harm in
>> immediately unlinking backend-local rels when they are deleted.  So we
>> can adjust mdunlink to include SmgrIsTemp() among the reasons to unlink
>> immediately rather than doing the truncate-and-register_unlink dance.
>>
>> If anybody sees a hole in this reasoning, speak now ...

> Hmm, yeah, I have a feeling this might be why I didn't do this when I
> created RelFileNodeBackend.  But I think your reasoning is correct.
> Sticking the above text in a comment might not be out of order,
> however.

Most of this info was already in the comment for mdunlink, so I just
added a bit there.

The attached patch covers everything discussed in this thread, except
for the buggy handling of stats, which I think should be fixed in a
separate patch since it's only relevant to 9.2+.

I had thought that we might get a performance boost here by saving fsync
queue traffic, but I see that md.c was already not calling
register_dirty_segment for temp rels, so there's no joy there.  We might
win a bit by not queuing deletion of temp rels, though.  There's also
some distributed savings by eliminating one field from the request queue
and hash table, but probably not enough to notice.  I think the main
reason to do this is just to tighten up the question of whether pad
bytes matter.

Haven't started on back-patching yet.  Some but not all of this will
need to go back as far as we back-patched the queue compaction logic.

            regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 417e3bb0d1b0c90632aafa5fbec351ab21f7e5a0..92fd4276cd1b3be81d1ac741f9f6ea09d241ea52 100644
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
***************
*** 105,111 ****
   */
  typedef struct
  {
!     RelFileNodeBackend rnode;
      ForkNumber    forknum;
      BlockNumber segno;            /* see md.c for special values */
      /* might add a real request-type field later; not needed yet */
--- 105,111 ----
   */
  typedef struct
  {
!     RelFileNode    rnode;
      ForkNumber    forknum;
      BlockNumber segno;            /* see md.c for special values */
      /* might add a real request-type field later; not needed yet */
*************** CheckpointerShmemSize(void)
*** 924,940 ****
  void
  CheckpointerShmemInit(void)
  {
      bool        found;

      CheckpointerShmem = (CheckpointerShmemStruct *)
          ShmemInitStruct("Checkpointer Data",
!                         CheckpointerShmemSize(),
                          &found);

      if (!found)
      {
!         /* First time through, so initialize */
!         MemSet(CheckpointerShmem, 0, sizeof(CheckpointerShmemStruct));
          SpinLockInit(&CheckpointerShmem->ckpt_lck);
          CheckpointerShmem->max_requests = NBuffers;
      }
--- 924,945 ----
  void
  CheckpointerShmemInit(void)
  {
+     Size        size = CheckpointerShmemSize();
      bool        found;

      CheckpointerShmem = (CheckpointerShmemStruct *)
          ShmemInitStruct("Checkpointer Data",
!                         size,
                          &found);

      if (!found)
      {
!         /*
!          * First time through, so initialize.  Note that we zero the whole
!          * requests array; this is so that CompactCheckpointerRequestQueue
!          * can assume that any pad bytes in the request structs are zeroes.
!          */
!         MemSet(CheckpointerShmem, 0, size);
          SpinLockInit(&CheckpointerShmem->ckpt_lck);
          CheckpointerShmem->max_requests = NBuffers;
      }
*************** RequestCheckpoint(int flags)
*** 1091,1101 ****
   *        Forward a file-fsync request from a backend to the checkpointer
   *
   * Whenever a backend is compelled to write directly to a relation
!  * (which should be seldom, if the checkpointer is getting its job done),
   * the backend calls this routine to pass over knowledge that the relation
   * is dirty and must be fsync'd before next checkpoint.  We also use this
   * opportunity to count such writes for statistical purposes.
   *
   * segno specifies which segment (not block!) of the relation needs to be
   * fsync'd.  (Since the valid range is much less than BlockNumber, we can
   * use high values for special flags; that's all internal to md.c, which
--- 1096,1110 ----
   *        Forward a file-fsync request from a backend to the checkpointer
   *
   * Whenever a backend is compelled to write directly to a relation
!  * (which should be seldom, if the background writer is getting its job done),
   * the backend calls this routine to pass over knowledge that the relation
   * is dirty and must be fsync'd before next checkpoint.  We also use this
   * opportunity to count such writes for statistical purposes.
   *
+  * This functionality is only supported for regular (not backend-local)
+  * relations, so the rnode argument is intentionally RelFileNode not
+  * RelFileNodeBackend.
+  *
   * segno specifies which segment (not block!) of the relation needs to be
   * fsync'd.  (Since the valid range is much less than BlockNumber, we can
   * use high values for special flags; that's all internal to md.c, which
*************** RequestCheckpoint(int flags)
*** 1112,1119 ****
   * let the backend know by returning false.
   */
  bool
! ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
!                     BlockNumber segno)
  {
      CheckpointerRequest *request;
      bool        too_full;
--- 1121,1127 ----
   * let the backend know by returning false.
   */
  bool
! ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
  {
      CheckpointerRequest *request;
      bool        too_full;
*************** ForwardFsyncRequest(RelFileNodeBackend r
*** 1169,1174 ****
--- 1177,1183 ----
  /*
   * CompactCheckpointerRequestQueue
   *        Remove duplicates from the request queue to avoid backend fsyncs.
+  *        Returns "true" if any entries were removed.
   *
   * Although a full fsync request queue is not common, it can lead to severe
   * performance problems when it does happen.  So far, this situation has
*************** ForwardFsyncRequest(RelFileNodeBackend r
*** 1178,1184 ****
   * gets very expensive and can slow down the whole system.
   *
   * Trying to do this every time the queue is full could lose if there
!  * aren't any removable entries.  But should be vanishingly rare in
   * practice: there's one queue entry per shared buffer.
   */
  static bool
--- 1187,1193 ----
   * gets very expensive and can slow down the whole system.
   *
   * Trying to do this every time the queue is full could lose if there
!  * aren't any removable entries.  But that should be vanishingly rare in
   * practice: there's one queue entry per shared buffer.
   */
  static bool
*************** CompactCheckpointerRequestQueue(void)
*** 1200,1217 ****
      /* must hold CheckpointerCommLock in exclusive mode */
      Assert(LWLockHeldByMe(CheckpointerCommLock));

      /* Initialize temporary hash table */
      MemSet(&ctl, 0, sizeof(ctl));
      ctl.keysize = sizeof(CheckpointerRequest);
      ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
      ctl.hash = tag_hash;
      htab = hash_create("CompactCheckpointerRequestQueue",
                         CheckpointerShmem->num_requests,
                         &ctl,
!                        HASH_ELEM | HASH_FUNCTION);
!
!     /* Initialize skip_slot array */
!     skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);

      /*
       * The basic idea here is that a request can be skipped if it's followed
--- 1209,1228 ----
      /* must hold CheckpointerCommLock in exclusive mode */
      Assert(LWLockHeldByMe(CheckpointerCommLock));

+     /* Initialize skip_slot array */
+     skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
+
      /* Initialize temporary hash table */
      MemSet(&ctl, 0, sizeof(ctl));
      ctl.keysize = sizeof(CheckpointerRequest);
      ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
      ctl.hash = tag_hash;
+     ctl.hcxt = CurrentMemoryContext;
+
      htab = hash_create("CompactCheckpointerRequestQueue",
                         CheckpointerShmem->num_requests,
                         &ctl,
!                        HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);

      /*
       * The basic idea here is that a request can be skipped if it's followed
*************** CompactCheckpointerRequestQueue(void)
*** 1226,1244 ****
       * anyhow), but it's not clear that the extra complexity would buy us
       * anything.
       */
!     for (n = 0; n < CheckpointerShmem->num_requests; ++n)
      {
          CheckpointerRequest *request;
          struct CheckpointerSlotMapping *slotmap;
          bool        found;

          request = &CheckpointerShmem->requests[n];
          slotmap = hash_search(htab, request, HASH_ENTER, &found);
          if (found)
          {
              skip_slot[slotmap->slot] = true;
!             ++num_skipped;
          }
          slotmap->slot = n;
      }

--- 1237,1264 ----
       * anyhow), but it's not clear that the extra complexity would buy us
       * anything.
       */
!     for (n = 0; n < CheckpointerShmem->num_requests; n++)
      {
          CheckpointerRequest *request;
          struct CheckpointerSlotMapping *slotmap;
          bool        found;

+         /*
+          * We use the request struct directly as a hashtable key.  This
+          * assumes that any padding bytes in the structs are consistently the
+          * same, which should be okay because we zeroed them in
+          * CheckpointerShmemInit.  Note also that RelFileNode had better
+          * contain no pad bytes.
+          */
          request = &CheckpointerShmem->requests[n];
          slotmap = hash_search(htab, request, HASH_ENTER, &found);
          if (found)
          {
+             /* Duplicate, so mark the previous occurrence as skippable */
              skip_slot[slotmap->slot] = true;
!             num_skipped++;
          }
+         /* Remember slot containing latest occurrence of this request value */
          slotmap->slot = n;
      }

*************** CompactCheckpointerRequestQueue(void)
*** 1253,1259 ****
      }

      /* We found some duplicates; remove them. */
!     for (n = 0, preserve_count = 0; n < CheckpointerShmem->num_requests; ++n)
      {
          if (skip_slot[n])
              continue;
--- 1273,1280 ----
      }

      /* We found some duplicates; remove them. */
!     preserve_count = 0;
!     for (n = 0; n < CheckpointerShmem->num_requests; n++)
      {
          if (skip_slot[n])
              continue;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 78145472e169decd91557181367b4cc68af84578..a3bf9a4d44e9d880e42ccf53e0fcc02d2e7a04fa 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** DropRelFileNodeBuffers(RelFileNodeBacken
*** 2049,2055 ****
      int            i;

      /* If it's a local relation, it's localbuf.c's problem. */
!     if (rnode.backend != InvalidBackendId)
      {
          if (rnode.backend == MyBackendId)
              DropRelFileNodeLocalBuffers(rnode.node, forkNum, firstDelBlock);
--- 2049,2055 ----
      int            i;

      /* If it's a local relation, it's localbuf.c's problem. */
!     if (RelFileNodeBackendIsTemp(rnode))
      {
          if (rnode.backend == MyBackendId)
              DropRelFileNodeLocalBuffers(rnode.node, forkNum, firstDelBlock);
*************** DropRelFileNodeAllBuffers(RelFileNodeBac
*** 2103,2109 ****
      int            i;

      /* If it's a local relation, it's localbuf.c's problem. */
!     if (rnode.backend != InvalidBackendId)
      {
          if (rnode.backend == MyBackendId)
              DropRelFileNodeAllLocalBuffers(rnode.node);
--- 2103,2109 ----
      int            i;

      /* If it's a local relation, it's localbuf.c's problem. */
!     if (RelFileNodeBackendIsTemp(rnode))
      {
          if (rnode.backend == MyBackendId)
              DropRelFileNodeAllLocalBuffers(rnode.node);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e5dec9d2a329b1a36e82c4ed62f2ba6be48217c6..10afe949786dac3908fce9c0c5e61c8cb4ba9fbd 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 38,44 ****
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
!  * Note that CompactcheckpointerRequestQueue assumes that it's OK to remove an
   * fsync request from the queue if an identical, subsequent request is found.
   * See comments there before making changes here.
   */
--- 38,44 ----
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
!  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
   * fsync request from the queue if an identical, subsequent request is found.
   * See comments there before making changes here.
   */
*************** static MemoryContext MdCxt;        /* context
*** 122,134 ****
   * be deleted after the next checkpoint, but we use a linked list instead of
   * a hash table, because we don't expect there to be any duplicate requests.
   *
   * (Regular backends do not track pending operations locally, but forward
   * them to the checkpointer.)
   */
  typedef struct
  {
!     RelFileNodeBackend rnode;    /* the targeted relation */
!     ForkNumber    forknum;
      BlockNumber segno;            /* which segment */
  } PendingOperationTag;

--- 122,138 ----
   * be deleted after the next checkpoint, but we use a linked list instead of
   * a hash table, because we don't expect there to be any duplicate requests.
   *
+  * These mechanisms are only used for non-temp relations; we never fsync
+  * temp rels, nor do we need to postpone their deletion (see comments in
+  * mdunlink).
+  *
   * (Regular backends do not track pending operations locally, but forward
   * them to the checkpointer.)
   */
  typedef struct
  {
!     RelFileNode    rnode;            /* the targeted relation */
!     ForkNumber    forknum;        /* which fork */
      BlockNumber segno;            /* which segment */
  } PendingOperationTag;

*************** typedef struct
*** 143,149 ****

  typedef struct
  {
!     RelFileNodeBackend rnode;    /* the dead relation to delete */
      CycleCtr    cycle_ctr;        /* mdckpt_cycle_ctr when request was made */
  } PendingUnlinkEntry;

--- 147,153 ----

  typedef struct
  {
!     RelFileNode    rnode;            /* the dead relation to delete */
      CycleCtr    cycle_ctr;        /* mdckpt_cycle_ctr when request was made */
  } PendingUnlinkEntry;

*************** mdcreate(SMgrRelation reln, ForkNumber f
*** 302,312 ****
  /*
   *    mdunlink() -- Unlink a relation.
   *
!  * Note that we're passed a RelFileNode --- by the time this is called,
   * there won't be an SMgrRelation hashtable entry anymore.
   *
!  * Actually, we don't unlink the first segment file of the relation, but
!  * just truncate it to zero length, and record a request to unlink it after
   * the next checkpoint.  Additional segments can be unlinked immediately,
   * however.  Leaving the empty file in place prevents that relfilenode
   * number from being reused.  The scenario this protects us from is:
--- 306,316 ----
  /*
   *    mdunlink() -- Unlink a relation.
   *
!  * Note that we're passed a RelFileNodeBackend --- by the time this is called,
   * there won't be an SMgrRelation hashtable entry anymore.
   *
!  * For regular relations, we don't unlink the first segment file of the rel,
!  * but just truncate it to zero length, and record a request to unlink it after
   * the next checkpoint.  Additional segments can be unlinked immediately,
   * however.  Leaving the empty file in place prevents that relfilenode
   * number from being reused.  The scenario this protects us from is:
*************** mdcreate(SMgrRelation reln, ForkNumber f
*** 323,328 ****
--- 327,336 ----
   * number until it's safe, because relfilenode assignment skips over any
   * existing file.
   *
+  * We do not need to go through this dance for temp relations, though, because
+  * we never make WAL entries for temp rels, and so a temp rel poses no threat
+  * to the health of a regular rel that has taken over its relfilenode number.
+  *
   * All the above applies only to the relation's main fork; other forks can
   * just be removed immediately, since they are not needed to prevent the
   * relfilenode number from being recycled.    Also, we do not carefully
*************** mdunlink(RelFileNodeBackend rnode, ForkN
*** 345,360 ****

      /*
       * We have to clean out any pending fsync requests for the doomed
!      * relation, else the next mdsync() will fail.
       */
!     ForgetRelationFsyncRequests(rnode, forkNum);

      path = relpath(rnode, forkNum);

      /*
       * Delete or truncate the first segment.
       */
!     if (isRedo || forkNum != MAIN_FORKNUM)
      {
          ret = unlink(path);
          if (ret < 0 && errno != ENOENT)
--- 353,370 ----

      /*
       * We have to clean out any pending fsync requests for the doomed
!      * relation, else the next mdsync() will fail.  There can't be any such
!      * requests for a temp relation, though.
       */
!     if (!RelFileNodeBackendIsTemp(rnode))
!         ForgetRelationFsyncRequests(rnode.node, forkNum);

      path = relpath(rnode, forkNum);

      /*
       * Delete or truncate the first segment.
       */
!     if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
      {
          ret = unlink(path);
          if (ret < 0 && errno != ENOENT)
*************** mdsync(void)
*** 1081,1088 ****
                   * dirtied through this same smgr relation, and so we can save
                   * a file open/close cycle.
                   */
!                 reln = smgropen(entry->tag.rnode.node,
!                                 entry->tag.rnode.backend);

                  /*
                   * It is possible that the relation has been dropped or
--- 1091,1097 ----
                   * dirtied through this same smgr relation, and so we can save
                   * a file open/close cycle.
                   */
!                 reln = smgropen(entry->tag.rnode, InvalidBackendId);

                  /*
                   * It is possible that the relation has been dropped or
*************** mdpostckpt(void)
*** 1228,1234 ****
          Assert((CycleCtr) (entry->cycle_ctr + 1) == mdckpt_cycle_ctr);

          /* Unlink the file */
!         path = relpath(entry->rnode, MAIN_FORKNUM);
          if (unlink(path) < 0)
          {
              /*
--- 1237,1243 ----
          Assert((CycleCtr) (entry->cycle_ctr + 1) == mdckpt_cycle_ctr);

          /* Unlink the file */
!         path = relpathperm(entry->rnode, MAIN_FORKNUM);
          if (unlink(path) < 0)
          {
              /*
*************** mdpostckpt(void)
*** 1255,1275 ****
   *
   * If there is a local pending-ops table, just make an entry in it for
   * mdsync to process later.  Otherwise, try to pass off the fsync request
!  * to the background writer process.  If that fails, just do the fsync
!  * locally before returning (we expect this will not happen often enough
   * to be a performance problem).
   */
  static void
  register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
  {
      if (pendingOpsTable)
      {
          /* push it into local pending-ops table */
!         RememberFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno);
      }
      else
      {
!         if (ForwardFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno))
              return;                /* passed it off successfully */

          ereport(DEBUG1,
--- 1264,1287 ----
   *
   * If there is a local pending-ops table, just make an entry in it for
   * mdsync to process later.  Otherwise, try to pass off the fsync request
!  * to the checkpointer process.  If that fails, just do the fsync
!  * locally before returning (we hope this will not happen often enough
   * to be a performance problem).
   */
  static void
  register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
  {
+     /* Temp relations should never be fsync'd */
+     Assert(!SmgrIsTemp(reln));
+
      if (pendingOpsTable)
      {
          /* push it into local pending-ops table */
!         RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
      }
      else
      {
!         if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
              return;                /* passed it off successfully */

          ereport(DEBUG1,
*************** register_dirty_segment(SMgrRelation reln
*** 1286,1301 ****
  /*
   * register_unlink() -- Schedule a file to be deleted after next checkpoint
   *
   * As with register_dirty_segment, this could involve either a local or
   * a remote pending-ops table.
   */
  static void
  register_unlink(RelFileNodeBackend rnode)
  {
      if (pendingOpsTable)
      {
          /* push it into local pending-ops table */
!         RememberFsyncRequest(rnode, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
      }
      else
      {
--- 1298,1320 ----
  /*
   * register_unlink() -- Schedule a file to be deleted after next checkpoint
   *
+  * We don't bother passing in the fork number, because this is only used
+  * with main forks.
+  *
   * As with register_dirty_segment, this could involve either a local or
   * a remote pending-ops table.
   */
  static void
  register_unlink(RelFileNodeBackend rnode)
  {
+     /* Should never be used with temp relations */
+     Assert(!RelFileNodeBackendIsTemp(rnode));
+
      if (pendingOpsTable)
      {
          /* push it into local pending-ops table */
!         RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
!                              UNLINK_RELATION_REQUEST);
      }
      else
      {
*************** register_unlink(RelFileNodeBackend rnode
*** 1307,1313 ****
           * XXX should we just leave the file orphaned instead?
           */
          Assert(IsUnderPostmaster);
!         while (!ForwardFsyncRequest(rnode, MAIN_FORKNUM,
                                      UNLINK_RELATION_REQUEST))
              pg_usleep(10000L);    /* 10 msec seems a good number */
      }
--- 1326,1332 ----
           * XXX should we just leave the file orphaned instead?
           */
          Assert(IsUnderPostmaster);
!         while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
                                      UNLINK_RELATION_REQUEST))
              pg_usleep(10000L);    /* 10 msec seems a good number */
      }
*************** register_unlink(RelFileNodeBackend rnode
*** 1333,1340 ****
   * structure for them.)
   */
  void
! RememberFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
!                      BlockNumber segno)
  {
      Assert(pendingOpsTable);

--- 1352,1358 ----
   * structure for them.)
   */
  void
! RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
  {
      Assert(pendingOpsTable);

*************** RememberFsyncRequest(RelFileNodeBackend
*** 1347,1353 ****
          hash_seq_init(&hstat, pendingOpsTable);
          while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             if (RelFileNodeBackendEquals(entry->tag.rnode, rnode) &&
                  entry->tag.forknum == forknum)
              {
                  /* Okay, cancel this entry */
--- 1365,1371 ----
          hash_seq_init(&hstat, pendingOpsTable);
          while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             if (RelFileNodeEquals(entry->tag.rnode, rnode) &&
                  entry->tag.forknum == forknum)
              {
                  /* Okay, cancel this entry */
*************** RememberFsyncRequest(RelFileNodeBackend
*** 1368,1374 ****
          hash_seq_init(&hstat, pendingOpsTable);
          while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             if (entry->tag.rnode.node.dbNode == rnode.node.dbNode)
              {
                  /* Okay, cancel this entry */
                  entry->canceled = true;
--- 1386,1392 ----
          hash_seq_init(&hstat, pendingOpsTable);
          while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             if (entry->tag.rnode.dbNode == rnode.dbNode)
              {
                  /* Okay, cancel this entry */
                  entry->canceled = true;
*************** RememberFsyncRequest(RelFileNodeBackend
*** 1382,1388 ****
              PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);

              next = lnext(cell);
!             if (entry->rnode.node.dbNode == rnode.node.dbNode)
              {
                  pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
                  pfree(entry);
--- 1400,1406 ----
              PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);

              next = lnext(cell);
!             if (entry->rnode.dbNode == rnode.dbNode)
              {
                  pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
                  pfree(entry);
*************** RememberFsyncRequest(RelFileNodeBackend
*** 1446,1455 ****
  }

  /*
!  * ForgetRelationFsyncRequests -- forget any fsyncs for a rel
   */
  void
! ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
  {
      if (pendingOpsTable)
      {
--- 1464,1473 ----
  }

  /*
!  * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
   */
  void
! ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
  {
      if (pendingOpsTable)
      {
*************** ForgetRelationFsyncRequests(RelFileNodeB
*** 1484,1495 ****
  void
  ForgetDatabaseFsyncRequests(Oid dbid)
  {
!     RelFileNodeBackend rnode;

!     rnode.node.dbNode = dbid;
!     rnode.node.spcNode = 0;
!     rnode.node.relNode = 0;
!     rnode.backend = InvalidBackendId;

      if (pendingOpsTable)
      {
--- 1502,1512 ----
  void
  ForgetDatabaseFsyncRequests(Oid dbid)
  {
!     RelFileNode rnode;

!     rnode.dbNode = dbid;
!     rnode.spcNode = 0;
!     rnode.relNode = 0;

      if (pendingOpsTable)
      {
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 996065c2edff17f6fb1c63ef8f2d3394f5c72ab2..2e97e6aea5551f338939e680cb58e903fa409dc4 100644
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
*************** extern void CheckpointerMain(void) __att
*** 31,37 ****
  extern void RequestCheckpoint(int flags);
  extern void CheckpointWriteDelay(int flags, double progress);

! extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                      BlockNumber segno);
  extern void AbsorbFsyncRequests(void);

--- 31,37 ----
  extern void RequestCheckpoint(int flags);
  extern void CheckpointWriteDelay(int flags, double progress);

! extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
  extern void AbsorbFsyncRequests(void);

diff --git a/src/include/storage/relfilenode.h b/src/include/storage/relfilenode.h
index 60c38295375d596609cc3526b7af717a5347114d..5ec1d8f71771204cb354e9942e18ccf44bde5281 100644
*** a/src/include/storage/relfilenode.h
--- b/src/include/storage/relfilenode.h
*************** typedef enum ForkNumber
*** 69,74 ****
--- 69,78 ----
   * Note: in pg_class, relfilenode can be zero to denote that the relation
   * is a "mapped" relation, whose current true filenode number is available
   * from relmapper.c.  Again, this case is NOT allowed in RelFileNodes.
+  *
+  * Note: various places use RelFileNode in hashtable keys.  Therefore,
+  * there *must not* be any unused padding bytes in this struct.  That
+  * should be safe as long as all the fields are of type Oid.
   */
  typedef struct RelFileNode
  {
*************** typedef struct RelFileNode
*** 79,85 ****

  /*
   * Augmenting a relfilenode with the backend ID provides all the information
!  * we need to locate the physical storage.
   */
  typedef struct RelFileNodeBackend
  {
--- 83,93 ----

  /*
   * Augmenting a relfilenode with the backend ID provides all the information
!  * we need to locate the physical storage.  The backend ID is InvalidBackendId
!  * for regular relations (those accessible to more than one backend), or the
!  * owning backend's ID for backend-local relations.  Backend-local relations
!  * are always transient and removed in case of a database crash; they are
!  * never WAL-logged or fsync'd.
   */
  typedef struct RelFileNodeBackend
  {
*************** typedef struct RelFileNodeBackend
*** 87,97 ****
      BackendId    backend;
  } RelFileNodeBackend;

  /*
   * Note: RelFileNodeEquals and RelFileNodeBackendEquals compare relNode first
   * since that is most likely to be different in two unequal RelFileNodes.  It
   * is probably redundant to compare spcNode if the other fields are found equal,
!  * but do it anyway to be sure.
   */
  #define RelFileNodeEquals(node1, node2) \
      ((node1).relNode == (node2).relNode && \
--- 95,109 ----
      BackendId    backend;
  } RelFileNodeBackend;

+ #define RelFileNodeBackendIsTemp(rnode) \
+     ((rnode).backend != InvalidBackendId)
+
  /*
   * Note: RelFileNodeEquals and RelFileNodeBackendEquals compare relNode first
   * since that is most likely to be different in two unequal RelFileNodes.  It
   * is probably redundant to compare spcNode if the other fields are found equal,
!  * but do it anyway to be sure.  Likewise for checking the backend ID in
!  * RelFileNodeBackendEquals.
   */
  #define RelFileNodeEquals(node1, node2) \
      ((node1).relNode == (node2).relNode && \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f8fc2b2d6e82857ed2483dc0a3444e470ffcd43a..3560d539076da83a3f2897f3109fcabc4b72d5d0 100644
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
*************** typedef struct SMgrRelationData
*** 69,75 ****
  typedef SMgrRelationData *SMgrRelation;

  #define SmgrIsTemp(smgr) \
!     ((smgr)->smgr_rnode.backend != InvalidBackendId)

  extern void smgrinit(void);
  extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
--- 69,75 ----
  typedef SMgrRelationData *SMgrRelation;

  #define SmgrIsTemp(smgr) \
!     RelFileNodeBackendIsTemp((smgr)->smgr_rnode)

  extern void smgrinit(void);
  extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
*************** extern void mdsync(void);
*** 124,133 ****
  extern void mdpostckpt(void);

  extern void SetForwardFsyncRequests(void);
! extern void RememberFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                       BlockNumber segno);
! extern void ForgetRelationFsyncRequests(RelFileNodeBackend rnode,
!                             ForkNumber forknum);
  extern void ForgetDatabaseFsyncRequests(Oid dbid);

  /* smgrtype.c */
--- 124,132 ----
  extern void mdpostckpt(void);

  extern void SetForwardFsyncRequests(void);
! extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                       BlockNumber segno);
! extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
  extern void ForgetDatabaseFsyncRequests(Oid dbid);

  /* smgrtype.c */

pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Closing out the June commitfest
Next
From: Samuel Vogel
Date:
Subject: b-tree index search algorithms