Thread: Fix mdsync never-ending loop problem

Fix mdsync never-ending loop problem

From: Heikki Linnakangas
Here's a fix for the problem that on a busy system, mdsync never
finishes. See the original problem description on hackers:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php

The solution is taken from ITAGAKI Takahiro's Load Distributed
Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is
copied to a linked list, and that list is then processed until it's empty.
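
Condensed, the new mdsync() flow looks roughly like this (an orientation
sketch only; error reporting, the fsync=off short-circuit and most comments
are trimmed; the real thing is in the patch below):

void
mdsync(void)
{
    HASH_SEQ_STATUS hstat;
    List       *syncOps = NIL;
    ListCell   *cell;
    PendingOperationEntry *entry;

    AbsorbFsyncRequests();

    /* snapshot pendingOpsTable into a list of palloc'd copies */
    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    {
        PendingOperationEntry *dup = palloc(sizeof(PendingOperationEntry));

        memcpy(dup, entry, sizeof(PendingOperationEntry));
        syncOps = lappend(syncOps, dup);
    }

    /* drain the snapshot; requests absorbed meanwhile wait for the next sync */
    while ((cell = list_head(syncOps)) != NULL)
    {
        entry = (PendingOperationEntry *) lfirst(cell);

        AbsorbFsyncRequests();

        /* the request may have been revoked by a relation/database drop */
        if (hash_search(pendingOpsTable, &entry->tag, HASH_FIND, NULL) == NULL)
        {
            syncOps = list_delete_cell(syncOps, cell, NULL);
            continue;
        }

        /* open the segment and FileSync() it; on failure, report and retry */

        /* on success, forget the request in both places */
        (void) hash_search(pendingOpsTable, &entry->tag, HASH_REMOVE, NULL);
        syncOps = list_delete_cell(syncOps, cell, NULL);
    }
}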

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.127
diff -c -r1.127 md.c
*** src/backend/storage/smgr/md.c    17 Jan 2007 16:25:01 -0000    1.127
--- src/backend/storage/smgr/md.c    5 Apr 2007 10:43:56 -0000
***************
*** 863,989 ****
  void
  mdsync(void)
  {
!     bool        need_retry;

      if (!pendingOpsTable)
          elog(ERROR, "cannot sync without a pendingOpsTable");

      /*
!      * The fsync table could contain requests to fsync relations that have
!      * been deleted (unlinked) by the time we get to them.  Rather than
!      * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
!      * what we will do is retry the whole process after absorbing fsync
!      * request messages again.  Since mdunlink() queues a "revoke" message
!      * before actually unlinking, the fsync request is guaranteed to be gone
!      * the second time if it really was this case.  DROP DATABASE likewise
!      * has to tell us to forget fsync requests before it starts deletions.
       */
!     do {
!         HASH_SEQ_STATUS hstat;
!         PendingOperationEntry *entry;
!         int            absorb_counter;

!         need_retry = false;

          /*
!          * If we are in the bgwriter, the sync had better include all fsync
!          * requests that were queued by backends before the checkpoint REDO
!          * point was determined. We go that a little better by accepting all
!          * requests queued up to the point where we start fsync'ing.
           */
          AbsorbFsyncRequests();

!         absorb_counter = FSYNCS_PER_ABSORB;
!         hash_seq_init(&hstat, pendingOpsTable);
!         while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             /*
!              * If fsync is off then we don't have to bother opening the file
!              * at all.  (We delay checking until this point so that changing
!              * fsync on the fly behaves sensibly.)
!              */
!             if (enableFsync)
!             {
!                 SMgrRelation reln;
!                 MdfdVec    *seg;

!                 /*
!                  * If in bgwriter, we want to absorb pending requests every so
!                  * often to prevent overflow of the fsync request queue.  This
!                  * could result in deleting the current entry out from under
!                  * our hashtable scan, so the procedure is to fall out of the
!                  * scan and start over from the top of the function.
!                  */
!                 if (--absorb_counter <= 0)
!                 {
!                     need_retry = true;
!                     break;
!                 }

!                 /*
!                  * Find or create an smgr hash entry for this relation. This
!                  * may seem a bit unclean -- md calling smgr?  But it's really
!                  * the best solution.  It ensures that the open file reference
!                  * isn't permanently leaked if we get an error here. (You may
!                  * say "but an unreferenced SMgrRelation is still a leak!" Not
!                  * really, because the only case in which a checkpoint is done
!                  * by a process that isn't about to shut down is in the
!                  * bgwriter, and it will periodically do smgrcloseall(). This
!                  * fact justifies our not closing the reln in the success path
!                  * either, which is a good thing since in non-bgwriter cases
!                  * we couldn't safely do that.)  Furthermore, in many cases
!                  * the relation will have been dirtied through this same smgr
!                  * relation, and so we can save a file open/close cycle.
!                  */
!                 reln = smgropen(entry->tag.rnode);
!
!                 /*
!                  * It is possible that the relation has been dropped or
!                  * truncated since the fsync request was entered.  Therefore,
!                  * allow ENOENT, but only if we didn't fail once already on
!                  * this file.  This applies both during _mdfd_getseg() and
!                  * during FileSync, since fd.c might have closed the file
!                  * behind our back.
!                  */
!                 seg = _mdfd_getseg(reln,
!                                    entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
!                                    false, EXTENSION_RETURN_NULL);
!                 if (seg == NULL ||
!                     FileSync(seg->mdfd_vfd) < 0)
!                 {
!                     /*
!                      * XXX is there any point in allowing more than one try?
!                      * Don't see one at the moment, but easy to change the
!                      * test here if so.
!                      */
!                     if (!FILE_POSSIBLY_DELETED(errno) ||
!                         ++(entry->failures) > 1)
!                         ereport(ERROR,
!                                 (errcode_for_file_access(),
!                                  errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
!                                         entry->tag.segno,
!                                         entry->tag.rnode.spcNode,
!                                         entry->tag.rnode.dbNode,
!                                         entry->tag.rnode.relNode)));
!                     else
!                         ereport(DEBUG1,
!                                 (errcode_for_file_access(),
!                                  errmsg("could not fsync segment %u of relation %u/%u/%u, but retrying: %m",
!                                         entry->tag.segno,
!                                         entry->tag.rnode.spcNode,
!                                         entry->tag.rnode.dbNode,
!                                         entry->tag.rnode.relNode)));
!                     need_retry = true;
!                     continue;    /* don't delete the hashtable entry */
!                 }
!             }

              /* Okay, delete this entry */
              if (hash_search(pendingOpsTable, &entry->tag,
                              HASH_REMOVE, NULL) == NULL)
                  elog(ERROR, "pendingOpsTable corrupted");
          }
!     } while (need_retry);
  }

  /*
--- 863,1012 ----
  void
  mdsync(void)
  {
!     HASH_SEQ_STATUS hstat;
!     List           *syncOps;
!     ListCell       *cell;
!     PendingOperationEntry *entry;

      if (!pendingOpsTable)
          elog(ERROR, "cannot sync without a pendingOpsTable");

      /*
!      * If fsync=off, mdsync is a no-op. Just clear the pendingOpsTable.
       */
!     if(!enableFsync)
!     {
!         hash_seq_init(&hstat, pendingOpsTable);
!         while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
!         {
!             if (hash_search(pendingOpsTable, &entry->tag,
!                             HASH_REMOVE, NULL) == NULL)
!                 elog(ERROR, "pendingOpsTable corrupted");
!         }
!         return;
!     }
!
!     /*
!      * If we are in the bgwriter, the sync had better include all fsync
!      * requests that were queued by backends before the checkpoint REDO
!      * point was determined. We go that a little better by accepting all
!      * requests queued up to the point where we start fsync'ing.
!      */
!     AbsorbFsyncRequests();
!
!     /* Take a snapshot of pendingOpsTable into a list. We don't want to
!      * scan through pendingOpsTable directly in the loop because we call
!      * AbsorbFsyncRequests inside it which modifies the table.
!      */
!     syncOps = NULL;
!     hash_seq_init(&hstat, pendingOpsTable);
!     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
!     {
!         PendingOperationEntry *dup;
!
!         /* Entries could be deleted in our scan, so copy them. */
!         dup = (PendingOperationEntry *) palloc(sizeof(PendingOperationEntry));
!         memcpy(dup, entry, sizeof(PendingOperationEntry));
!         syncOps = lappend(syncOps, dup);
!     }
!
!     /* Process fsync requests until the list is empty */
!     while((cell = list_head(syncOps)) != NULL)
!     {
!         SMgrRelation reln;
!         MdfdVec    *seg;

!         entry = lfirst(cell);

          /*
!          * If in bgwriter, we want to absorb pending requests every so
!          * often to prevent overflow of the fsync request queue.
           */
          AbsorbFsyncRequests();

!         /* Check that the entry is still in pendingOpsTable. It could've
!          * been deleted by an absorbed relation or database deletion event.
!          */
!         entry = hash_search(pendingOpsTable, &entry->tag, HASH_FIND, NULL);
!         if(entry == NULL)
          {
!             list_delete_cell(syncOps, cell, NULL);
!             continue;
!         }

!         /*
!          * Find or create an smgr hash entry for this relation. This
!          * may seem a bit unclean -- md calling smgr?  But it's really
!          * the best solution.  It ensures that the open file reference
!          * isn't permanently leaked if we get an error here. (You may
!          * say "but an unreferenced SMgrRelation is still a leak!" Not
!          * really, because the only case in which a checkpoint is done
!          * by a process that isn't about to shut down is in the
!          * bgwriter, and it will periodically do smgrcloseall(). This
!          * fact justifies our not closing the reln in the success path
!          * either, which is a good thing since in non-bgwriter cases
!          * we couldn't safely do that.)  Furthermore, in many cases
!          * the relation will have been dirtied through this same smgr
!          * relation, and so we can save a file open/close cycle.
!          */
!         reln = smgropen(entry->tag.rnode);

!         /*
!          * It is possible that the relation has been dropped or
!          * truncated since the fsync request was entered.  Therefore,
!          * allow ENOENT, but only if we didn't fail once already on
!          * this file.  This applies both during _mdfd_getseg() and
!          * during FileSync, since fd.c might have closed the file
!          * behind our back.
!          */
!         seg = _mdfd_getseg(reln,
!                            entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
!                            false, EXTENSION_RETURN_NULL);
!         if (seg == NULL || FileSync(seg->mdfd_vfd) < 0)
!         {
!             /*
!              * The fsync table could contain requests to fsync relations that have
!              * been deleted (unlinked) by the time we get to them.  Rather than
!              * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
!              * what we will do is retry after absorbing fsync
!              * request messages again.  Since mdunlink() queues a "revoke" message
!              * before actually unlinking, the fsync request is guaranteed to be gone
!              * the second time if it really was this case.  DROP DATABASE likewise
!              * has to tell us to forget fsync requests before it starts deletions.
!              *
!              * XXX is there any point in allowing more than one try?
!              * Don't see one at the moment, but easy to change the
!              * test here if so.
!              */
!             if (!FILE_POSSIBLY_DELETED(errno) ||
!                 ++(entry->failures) > 1)
!                 ereport(ERROR,
!                         (errcode_for_file_access(),
!                          errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
!                                 entry->tag.segno,
!                                 entry->tag.rnode.spcNode,
!                                 entry->tag.rnode.dbNode,
!                                 entry->tag.rnode.relNode)));
!             else
!                 ereport(DEBUG1,
!                         (errcode_for_file_access(),
!                          errmsg("could not fsync segment %u of relation %u/%u/%u, but retrying: %m",
!                                 entry->tag.segno,
!                                 entry->tag.rnode.spcNode,
!                                 entry->tag.rnode.dbNode,
!                                 entry->tag.rnode.relNode)));

+         }
+         else
+         {
              /* Okay, delete this entry */
              if (hash_search(pendingOpsTable, &entry->tag,
                              HASH_REMOVE, NULL) == NULL)
                  elog(ERROR, "pendingOpsTable corrupted");
+
+             list_delete_cell(syncOps, cell, NULL);
          }
!     }
  }

  /*

Re: Fix mdsync never-ending loop problem

From: Alvaro Herrera
While skimming over this I was baffled a bit about the usage of
(InvalidBlockNumber - 1) as value for FORGET_DATABASE_FSYNC.  It took me
a while to realize that this code is abusing the BlockNumber typedef to
pass around *segment* numbers, so the useful range is much smaller and
thus the usage of that value is not a problem in practice.

I wonder if it wouldn't be better to clean this up by creating a
separate typedef for segment numbers, with its own special values?
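
Something along these lines is what I have in mind (a rough sketch only,
with invented names):

/* give segment numbers their own type instead of reusing BlockNumber */
typedef uint32 SegmentNumber;

#define InvalidSegmentNumber    ((SegmentNumber) 0xFFFFFFFF)

/* special segment numbers used in fsync request messages */
#define FORGET_RELATION_FSYNC   InvalidSegmentNumber
#define FORGET_DATABASE_FSYNC   ((SegmentNumber) (InvalidSegmentNumber - 1))

That would also make it obvious in the fsync-request code which values are
block numbers and which are segment numbers.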

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Fix mdsync never-ending loop problem

From: Heikki Linnakangas
Heikki Linnakangas wrote:
> Here's a fix for the problem that on a busy system, mdsync never
> finishes. See the original problem description on hackers:
> http://archives.postgresql.org/pgsql-hackers/2007-04/msg00259.php
>
> The solution is taken from ITAGAKI Takahiro's Load Distributed
> Checkpoint patch. At the beginning of mdsync, the pendingOpsTable is
> copied to a linked list, and that list is then processed until it's empty.

Here's an updated patch; the one I sent earlier is broken: I ignored the
return value of list_delete_cell.

We could just review and apply ITAGAKI's patch as it is instead of this
snippet of it, but because that can take some time I'd like to see this
applied before that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/smgr/md.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/md.c,v
retrieving revision 1.127
diff -c -r1.127 md.c
*** src/backend/storage/smgr/md.c    17 Jan 2007 16:25:01 -0000    1.127
--- src/backend/storage/smgr/md.c    5 Apr 2007 16:09:31 -0000
***************
*** 863,989 ****
  void
  mdsync(void)
  {
!     bool        need_retry;

      if (!pendingOpsTable)
          elog(ERROR, "cannot sync without a pendingOpsTable");

      /*
!      * The fsync table could contain requests to fsync relations that have
!      * been deleted (unlinked) by the time we get to them.  Rather than
!      * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
!      * what we will do is retry the whole process after absorbing fsync
!      * request messages again.  Since mdunlink() queues a "revoke" message
!      * before actually unlinking, the fsync request is guaranteed to be gone
!      * the second time if it really was this case.  DROP DATABASE likewise
!      * has to tell us to forget fsync requests before it starts deletions.
       */
!     do {
!         HASH_SEQ_STATUS hstat;
!         PendingOperationEntry *entry;
!         int            absorb_counter;

!         need_retry = false;

          /*
!          * If we are in the bgwriter, the sync had better include all fsync
!          * requests that were queued by backends before the checkpoint REDO
!          * point was determined. We go that a little better by accepting all
!          * requests queued up to the point where we start fsync'ing.
           */
          AbsorbFsyncRequests();

!         absorb_counter = FSYNCS_PER_ABSORB;
!         hash_seq_init(&hstat, pendingOpsTable);
!         while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
          {
!             /*
!              * If fsync is off then we don't have to bother opening the file
!              * at all.  (We delay checking until this point so that changing
!              * fsync on the fly behaves sensibly.)
!              */
!             if (enableFsync)
!             {
!                 SMgrRelation reln;
!                 MdfdVec    *seg;

!                 /*
!                  * If in bgwriter, we want to absorb pending requests every so
!                  * often to prevent overflow of the fsync request queue.  This
!                  * could result in deleting the current entry out from under
!                  * our hashtable scan, so the procedure is to fall out of the
!                  * scan and start over from the top of the function.
!                  */
!                 if (--absorb_counter <= 0)
!                 {
!                     need_retry = true;
!                     break;
!                 }

!                 /*
!                  * Find or create an smgr hash entry for this relation. This
!                  * may seem a bit unclean -- md calling smgr?  But it's really
!                  * the best solution.  It ensures that the open file reference
!                  * isn't permanently leaked if we get an error here. (You may
!                  * say "but an unreferenced SMgrRelation is still a leak!" Not
!                  * really, because the only case in which a checkpoint is done
!                  * by a process that isn't about to shut down is in the
!                  * bgwriter, and it will periodically do smgrcloseall(). This
!                  * fact justifies our not closing the reln in the success path
!                  * either, which is a good thing since in non-bgwriter cases
!                  * we couldn't safely do that.)  Furthermore, in many cases
!                  * the relation will have been dirtied through this same smgr
!                  * relation, and so we can save a file open/close cycle.
!                  */
!                 reln = smgropen(entry->tag.rnode);
!
!                 /*
!                  * It is possible that the relation has been dropped or
!                  * truncated since the fsync request was entered.  Therefore,
!                  * allow ENOENT, but only if we didn't fail once already on
!                  * this file.  This applies both during _mdfd_getseg() and
!                  * during FileSync, since fd.c might have closed the file
!                  * behind our back.
!                  */
!                 seg = _mdfd_getseg(reln,
!                                    entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
!                                    false, EXTENSION_RETURN_NULL);
!                 if (seg == NULL ||
!                     FileSync(seg->mdfd_vfd) < 0)
!                 {
!                     /*
!                      * XXX is there any point in allowing more than one try?
!                      * Don't see one at the moment, but easy to change the
!                      * test here if so.
!                      */
!                     if (!FILE_POSSIBLY_DELETED(errno) ||
!                         ++(entry->failures) > 1)
!                         ereport(ERROR,
!                                 (errcode_for_file_access(),
!                                  errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
!                                         entry->tag.segno,
!                                         entry->tag.rnode.spcNode,
!                                         entry->tag.rnode.dbNode,
!                                         entry->tag.rnode.relNode)));
!                     else
!                         ereport(DEBUG1,
!                                 (errcode_for_file_access(),
!                                  errmsg("could not fsync segment %u of relation %u/%u/%u, but retrying: %m",
!                                         entry->tag.segno,
!                                         entry->tag.rnode.spcNode,
!                                         entry->tag.rnode.dbNode,
!                                         entry->tag.rnode.relNode)));
!                     need_retry = true;
!                     continue;    /* don't delete the hashtable entry */
!                 }
!             }

              /* Okay, delete this entry */
              if (hash_search(pendingOpsTable, &entry->tag,
                              HASH_REMOVE, NULL) == NULL)
                  elog(ERROR, "pendingOpsTable corrupted");
          }
!     } while (need_retry);
  }

  /*
--- 863,1012 ----
  void
  mdsync(void)
  {
!     HASH_SEQ_STATUS hstat;
!     List           *syncOps;
!     ListCell       *cell;
!     PendingOperationEntry *entry;

      if (!pendingOpsTable)
          elog(ERROR, "cannot sync without a pendingOpsTable");

      /*
!      * If fsync=off, mdsync is a no-op. Just clear the pendingOpsTable.
       */
!     if(!enableFsync)
!     {
!         hash_seq_init(&hstat, pendingOpsTable);
!         while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
!         {
!             if (hash_search(pendingOpsTable, &entry->tag,
!                             HASH_REMOVE, NULL) == NULL)
!                 elog(ERROR, "pendingOpsTable corrupted");
!         }
!         return;
!     }
!
!     /*
!      * If we are in the bgwriter, the sync had better include all fsync
!      * requests that were queued by backends before the checkpoint REDO
!      * point was determined. We go that a little better by accepting all
!      * requests queued up to the point where we start fsync'ing.
!      */
!     AbsorbFsyncRequests();
!
!     /* Take a snapshot of pendingOpsTable into a list. We don't want to
!      * scan through pendingOpsTable directly in the loop because we call
!      * AbsorbFsyncRequests inside it which modifies the table.
!      */
!     syncOps = NIL;
!     hash_seq_init(&hstat, pendingOpsTable);
!     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
!     {
!         PendingOperationEntry *dup;
!
!         /* Entries could be deleted in our scan, so copy them. */
!         dup = (PendingOperationEntry *) palloc(sizeof(PendingOperationEntry));
!         memcpy(dup, entry, sizeof(PendingOperationEntry));
!         syncOps = lappend(syncOps, dup);
!     }
!
!     /* Process fsync requests until the list is empty */
!     while((cell = list_head(syncOps)) != NULL)
!     {
!         SMgrRelation reln;
!         MdfdVec    *seg;

!         entry = lfirst(cell);

          /*
!          * If in bgwriter, we want to absorb pending requests every so
!          * often to prevent overflow of the fsync request queue.
           */
          AbsorbFsyncRequests();

!         /* Check that the entry is still in pendingOpsTable. It could've
!          * been deleted by an absorbed relation or database deletion event.
!          */
!         entry = hash_search(pendingOpsTable, &entry->tag, HASH_FIND, NULL);
!         if(entry == NULL)
          {
!             syncOps = list_delete_cell(syncOps, cell, NULL);
!             continue;
!         }

!         /*
!          * Find or create an smgr hash entry for this relation. This
!          * may seem a bit unclean -- md calling smgr?  But it's really
!          * the best solution.  It ensures that the open file reference
!          * isn't permanently leaked if we get an error here. (You may
!          * say "but an unreferenced SMgrRelation is still a leak!" Not
!          * really, because the only case in which a checkpoint is done
!          * by a process that isn't about to shut down is in the
!          * bgwriter, and it will periodically do smgrcloseall(). This
!          * fact justifies our not closing the reln in the success path
!          * either, which is a good thing since in non-bgwriter cases
!          * we couldn't safely do that.)  Furthermore, in many cases
!          * the relation will have been dirtied through this same smgr
!          * relation, and so we can save a file open/close cycle.
!          */
!         reln = smgropen(entry->tag.rnode);

!         /*
!          * It is possible that the relation has been dropped or
!          * truncated since the fsync request was entered.  Therefore,
!          * allow ENOENT, but only if we didn't fail once already on
!          * this file.  This applies both during _mdfd_getseg() and
!          * during FileSync, since fd.c might have closed the file
!          * behind our back.
!          */
!         seg = _mdfd_getseg(reln,
!                            entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
!                            false, EXTENSION_RETURN_NULL);
!         if (seg == NULL || FileSync(seg->mdfd_vfd) < 0)
!         {
!             /*
!              * The fsync table could contain requests to fsync relations that have
!              * been deleted (unlinked) by the time we get to them.  Rather than
!              * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
!              * what we will do is retry after absorbing fsync
!              * request messages again.  Since mdunlink() queues a "revoke" message
!              * before actually unlinking, the fsync request is guaranteed to be gone
!              * the second time if it really was this case.  DROP DATABASE likewise
!              * has to tell us to forget fsync requests before it starts deletions.
!              *
!              * XXX is there any point in allowing more than one try?
!              * Don't see one at the moment, but easy to change the
!              * test here if so.
!              */
!             if (!FILE_POSSIBLY_DELETED(errno) ||
!                 ++(entry->failures) > 1)
!                 ereport(ERROR,
!                         (errcode_for_file_access(),
!                          errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
!                                 entry->tag.segno,
!                                 entry->tag.rnode.spcNode,
!                                 entry->tag.rnode.dbNode,
!                                 entry->tag.rnode.relNode)));
!             else
!                 ereport(DEBUG1,
!                         (errcode_for_file_access(),
!                          errmsg("could not fsync segment %u of relation %u/%u/%u, but retrying: %m",
!                                 entry->tag.segno,
!                                 entry->tag.rnode.spcNode,
!                                 entry->tag.rnode.dbNode,
!                                 entry->tag.rnode.relNode)));

+         }
+         else
+         {
              /* Okay, delete this entry */
              if (hash_search(pendingOpsTable, &entry->tag,
                              HASH_REMOVE, NULL) == NULL)
                  elog(ERROR, "pendingOpsTable corrupted");
+
+             syncOps = list_delete_cell(syncOps, cell, NULL);
          }
!     }
  }

  /*

Re: Fix mdsync never-ending loop problem

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Here's a fix for the problem that on a busy system, mdsync never
> finishes. See the original problem description on hackers:

This leaks memory, no?  (list_delete_cell only deletes the ListCell.)
But I dislike copying the table entries anyway, see comment on -hackers.

BTW, it's very hard to see what a patch like this is actually changing.
It might be better to submit a version that doesn't reindent the chunks
of code you aren't changing, so as to reduce the visual size of the
diff.  A note to the committer to reindent the whole function is
sufficient (or if he forgets, pg_indent will fix it eventually).

            regards, tom lane

Re: Fix mdsync never-ending loop problem

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I wonder if it wouldn't be better to clean this up by creating a
> separate typedef for segment numbers, with its own special values?

Probably.  I remember having thought about it when I put in the
FORGET_DATABASE_FSYNC hack.  I think I didn't do it because I needed
to backpatch and so I wanted a minimal-size patch.  Feel free to do it
in HEAD ...

            regards, tom lane

Re: Fix mdsync never-ending loop problem

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Here's a fix for the problem that on a busy system, mdsync never
>> finishes. See the original problem description on hackers:
>
> This leaks memory, no?  (list_delete_cell only deletes the ListCell.)

Oh, I just spotted another problem with it and posted an updated patch,
but I missed that.

> But I dislike copying the table entries anyway, see comment on -hackers.

Frankly, the cycle id idea sounds uglier and more fragile to me. You'll
need to do multiple scans of the hash table that way, starting from the
top every time you call AbsorbFsyncRequests (like we do now). But whatever...

> BTW, it's very hard to see what a patch like this is actually changing.
> It might be better to submit a version that doesn't reindent the chunks
> of code you aren't changing, so as to reduce the visual size of the
> diff.  A note to the committer to reindent the whole function is
> sufficient (or if he forgets, pg_indent will fix it eventually).

Ok, will do that. Or would you like to just take over from here?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Fix mdsync never-ending loop problem

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> But I dislike copying the table entries anyway, see comment on -hackers.

> Frankly, the cycle id idea sounds uglier and more fragile to me. You'll
> need to do multiple scans of the hash table that way, starting from the
> top every time you call AbsorbFsyncRequests (like we do now).

How so?  You just ignore entries whose cycleid is too large.  You'd have
to be careful about wraparound in the comparisons, but that's not hard
to deal with.  Also, AFAICS you still have the retry problem (and an
even bigger memory leak problem) with this coding --- the "to-do list"
doesn't eliminate the issue of correct handling of a failure.
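
(The wraparound-safe comparison is just the usual modular-arithmetic trick;
a minimal sketch, assuming a 16-bit counter:)

typedef uint16 CycleCtr;        /* wraps around at 65536 */

static CycleCtr sync_cycle_ctr = 0;

/*
 * True if the entry was queued in a cycle later than the one this mdsync
 * run is processing, i.e. it should be ignored for now.  The subtract-and-
 * cast trick is safe across wraparound as long as entries never lag by
 * more than 32767 cycles.
 */
static bool
entry_is_from_later_cycle(CycleCtr entry_cycle)
{
    return (int16) (entry_cycle - sync_cycle_ctr) > 0;
}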

> Ok, will do that. Or would you like to just take over from here?

No, I'm up to my ears in varlena.  You're the one in a position to test
this, anyway.

            regards, tom lane

Re: Fix mdsync never-ending loop problem

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> But I dislike copying the table entries anyway, see comment on -hackers.
>
>> Frankly, the cycle id idea sounds uglier and more fragile to me. You'll
>> need to do multiple scans of the hash table that way, starting from the
>> top every time you call AbsorbFsyncRequests (like we do now).
>
> How so?  You just ignore entries whose cycleid is too large.  You'd have
> to be careful about wraparound in the comparisons, but that's not hard
> to deal with.  Also, AFAICS you still have the retry problem (and an
> even bigger memory leak problem) with this coding --- the "to-do list"
> doesn't eliminate the issue of correct handling of a failure.

You have to start the hash_seq_search from scratch after each call to
AbsorbFsyncRequests because it can remove entries, including the one the
scan is stopped on.

I think the failure handling is correct in the "to-do list" approach:
when an entry is read from the list, we check that it hasn't been
removed from the hash table. Actually, there was a bug in the failure
handling of the original LDC patch: it replaced the per-entry failures
counter with a local retry_counter variable, but that variable wasn't
cleared after a successful write, which would lead to bogus ERRORs when
multiple relations are dropped during mdsync. I kept the original
per-entry counter, though the local-variable approach could be made to work.

The memory leak obviously needs to be fixed, but that's just a matter of
adding a pfree.
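I.e. in both places that drop a cell, free our palloc'd copy first;
something like this (untested):

    /* request was revoked (or synced): free our copy of the entry too */
    pfree(lfirst(cell));
    syncOps = list_delete_cell(syncOps, cell, NULL);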

>> Ok, will do that. Or would you like to just take over from here?
>
> No, I'm up to my ears in varlena.  You're the one in a position to test
> this, anyway.

Ok.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Fix mdsync never-ending loop problem

From: Tom Lane
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I think the failure handling is correct in the "to-do list" approach:
> when an entry is read from the list, we check that it hasn't been
> removed from the hash table. Actually, there was a bug in the failure
> handling of the original LDC patch: it replaced the per-entry failures
> counter with a local retry_counter variable, but that variable wasn't
> cleared after a successful write, which would lead to bogus ERRORs when
> multiple relations are dropped during mdsync. I kept the original
> per-entry counter, though the local-variable approach could be made to work.

Yeah.  One of the things that bothered me about the patch was that it
would be easy to mess up by updating state in the copied entry instead
of the "real" info in the hashtable.  It would be clearer what's
happening if the to-do list contains only the lookup keys and not the
whole struct.
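
Roughly like this, I mean (untested sketch, keying the to-do list on
PendingOperationTag and doing all bookkeeping in the live hash entry):

    List       *todo = NIL;

    hash_seq_init(&hstat, pendingOpsTable);
    while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
    {
        PendingOperationTag *tag = palloc(sizeof(PendingOperationTag));

        memcpy(tag, &entry->tag, sizeof(PendingOperationTag));
        todo = lappend(todo, tag);
    }

    while ((cell = list_head(todo)) != NULL)
    {
        PendingOperationTag *tag = (PendingOperationTag *) lfirst(cell);

        AbsorbFsyncRequests();

        entry = (PendingOperationEntry *)
            hash_search(pendingOpsTable, tag, HASH_FIND, NULL);
        if (entry != NULL)
        {
            /*
             * Open the segment and FileSync() it as in the patch.  On a
             * recoverable failure do "entry->failures++; continue;" so the
             * tag stays on the list; on success HASH_REMOVE the entry.
             */
        }

        pfree(tag);
        todo = list_delete_cell(todo, cell, NULL);
    }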

            regards, tom lane

Re: Fix mdsync never-ending loop problem

From: Simon Riggs
On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:

> We could just review and apply ITAGAKI's patch as it is instead of
> this snippet of it, but because that can take some time I'd like to
> see this applied before that.

I think we are just beginning to understand the quality of Itagaki's
thinking.

We should give him a chance to interact on this, and if there are parts
of his patch that we want, then it should be him that does it. I'm not
sure that carving the good bits off each other's patches is likely to
help teamwork in the long term. At the very least he deserves much credit
for his farsighted work.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Fix mdsync never-ending loop problem

From: Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2007-04-05 at 17:14 +0100, Heikki Linnakangas wrote:
>
>> We could just review and apply ITAGAKI's patch as it is instead of
>> this snippet of it, but because that can take some time I'd like to
>> see this applied before that.
>
> I think we are just beginning to understand the quality of Itagaki's
> thinking.
>
> We should give him a chance to interact on this, and if there are parts
> of his patch that we want, then it should be him that does it.

Itagaki, would you like to take a stab at this?

> I'm not
> sure that carving the good bits off each other's patches is likely to
> help teamwork in the long term. At the very least he deserves much credit
> for his farsighted work.

Oh sure! Thank you for your efforts, Itagaki!

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com