Spread checkpoint sync

From
Greg Smith
Date:
Final patch in this series for today spreads out the individual
checkpoint fsync calls over time, and was written by myself and Simon
Riggs.  Patch is based against a system that's already had the two
patches I sent over earlier today applied, rather than HEAD, as both are
useful for measuring how well this one works.  You can grab a tree with
all three from my Github repo, via the "checkpoint" branch:
https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress.  While I've seen this reduce checkpoint
spike latency significantly on a large system, I don't have any
referenceable performance numbers I can share yet.  There are also a
couple of problems I know about, and I'm sure others I haven't thought
of yet.  The first known issue is that it delays manual or other
"forced" checkpoints, which is not necessarily wrong if you really are
serious about spreading syncs out, but it is certainly surprising when
you run into it.  I notice this most when running createdb on a busy
system.  There's no real reason for this to happen; the code passes
down the fact that it's a forced checkpoint but just doesn't act on it yet.

The second issue is that the delay between sync calls is currently
hard-coded, at 3 seconds.  I believe the right path here is to consider
the current checkpoint_completion_target to still be valid, then work
back from there.  That raises the question of what percentage of the
time writes should now be compressed into relative to that, to leave
some time to spread the sync calls.  If we're willing to say "writes
finish in first 1/2 of target, syncs execute in second 1/2", I
could implement that here.  Maybe that ratio needs to be another
tunable.  Still thinking about that part, and it's certainly open to
community debate.  The thing to realize that complicates the design is
that the actual sync execution may take a considerable period of time.
It's much more likely for that to happen than in the case of an
individual write, as the current spread checkpoint does, because those
are usually cached.  In the spread sync case, it's easy for one slow
sync to make the rest turn into ones that fire in quick succession, to
make up for lost time.

There's some history behind this design that impacts review.  Circa 8.3
development in 2007, I had experimented with putting some delay between
each of the fsync calls that the background writer executes during a
checkpoint.  It didn't help smooth things out at all at the time.  It
turns out that's mainly because all my tests were on Linux using ext3.
On that filesystem, fsync is not very granular.  It's quite likely it
will push out data you haven't asked to sync yet, which means one giant
sync is almost impossible to avoid no matter how you space the fsync
calls.  If you try to review this on ext3, I expect you'll find a big
spike early in each checkpoint (where it flushes just about everything
out) and then quick response for the later files involved.

The system this patch originated to help fix was running XFS.  There,
I've confirmed that problem doesn't exist, that individual syncs only
seem to push out the data related to one file.  The same should be true
on ext4, but I haven't tested that myself.  Not sure how granular the
fsync calls are on Solaris, FreeBSD, Darwin, etc. yet.  Note that it's
still possible to get hung on one sync call for a while, even on XFS.
The worst case seems to be if you've created a new 1GB database table
chunk and fully populated it since the last checkpoint, on a system
that's just cached the whole thing so far.

One change that turned out to be necessary rather than optional--to get
good performance from the system under tuning--was to make regular
background writer activity, including fsync absorb checks, happen during
these sync pauses.  The existing code ran the checkpoint sync work in a
pretty tight loop, which as I alluded to in an earlier patch today can
lead to the backends competing with the background writer to get their
sync calls executed.  This squashes that problem if the background
writer is set up properly.

What does properly mean?  Well, it can't do that cleanup if the
background writer is sleeping.  This whole area was refactored.  The
current sync absorb code uses the constant WRITES_PER_ABSORB to make
decisions.  This new version replaces that hard-coded value with
something that scales to the system size.  It now ignores doing work
until the number of pending absorb requests has reached 10% of the
number possible to store (BgWriterShmem->max_requests, which is set to
the size of shared_buffers in 8K pages, AKA NBuffers).  This may
actually postpone this work for too long on systems with large
shared_buffers settings; that's one area I'm still investigating.

As for concerns about this 10% setting not doing enough work, which
is something I do see, you can always increase how often absorbing
happens by decreasing bgwriter_delay now--which gives other benefits too.
For example, if you run the fsync-stress-v2.sh script I included with
the last patch I sent, you'll discover the spread sync version of the
server leaves just as many unabsorbed writes behind as the old code
did.  Those are happening because of periods the background writer is
sleeping.  They drop as you decrease the delay; here's a table showing
some values I tested here, with all three patches installed:

bgwriter_delay    buffers_backend_sync
200 ms            90
 50 ms            28
 25 ms             3

There's a bunch of performance-related review work that needs to be done
here, in addition to the usual code review for the patch.  My hope is
that I can get enough of that done on public hardware to validate that
this does what it's supposed to, so that a later version of this patch
can be considered for the next CommitFest.  It's a little more raw than I'd
like still, but the idea has been tested enough here that I believe it's
fundamentally sound and valuable.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 43a149e..0ce8e2b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct

 static BgWriterShmemStruct *BgWriterShmem;

-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB        1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR    10

 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
         /*
          * Process any requests or signals received recently.
          */
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(false);

         if (got_SIGHUP)
         {
@@ -636,7 +636,7 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(true);
         udelay -= 1000000L;
     }

@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-    static int    absorb_counter = WRITES_PER_ABSORB;
-
     /* Do nothing if checkpoint is being executed by non-bgwriter process */
     if (!am_bg_writer)
         return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
             ProcessConfigFile(PGC_SIGHUP);
         }

-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);

         BgBufferSync();
         CheckArchiveTimeout();
         BgWriterNap();
     }
-    else if (--absorb_counter <= 0)
+    else
     {
         /*
-         * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-         * operations even when we don't sleep, to prevent overflow of the
-         * fsync request queue.
+         * Check for overflow of the fsync request queue.
          */
-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);
+    }
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+    pg_time_t    now;
+    pg_time_t    sync_start_time;
+    int            sync_delay_secs;
+
+    /*
+     * Delay after each sync, in seconds.  This could be a parameter.  But
+     * since ideally this will be auto-tuning in the near future, not
+     * assigning it a GUC setting yet.
+     */
+#define EXTRA_SYNC_DELAY    3
+
+    /* Do nothing if checkpoint is being executed by non-bgwriter process */
+    if (!am_bg_writer)
+        return;
+
+    sync_start_time = (pg_time_t) time(NULL);
+
+    /*
+     * Perform the usual bgwriter duties.
+     */
+    for (;;)
+    {
+        AbsorbFsyncRequests(false);
+        BgBufferSync();
+        CheckArchiveTimeout();
+        BgWriterNap();
+
+        /*
+         * Are we there yet?
+         */
+        now = (pg_time_t) time(NULL);
+        sync_delay_secs = now - sync_start_time;
+        if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+            break;
+    }
 }

@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
     BgWriterRequest *requests = NULL;
     BgWriterRequest *request;
     int            n;

+    /*
+     * Act only once the queue is at least 1 / ABSORB_ACTION_DIVISOR
+     * (by default 10%) full.  If this isn't good enough, you probably
+     * need to lower bgwriter_delay, rather than presume this needs to
+     * be a tunable you can decrease.
+     */
+
     if (!am_bg_writer)
         return;

     /*
+     * If the queue isn't very large, don't worry about absorbing yet.
+     * Access integer counter without lock, to avoid queuing.
+     */
+    if (!force && BgWriterShmem->num_requests <
+            (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+    {
+        if (BgWriterShmem->num_requests > 0)
+            elog(DEBUG1, "Absorb queue: %d fsync requests, not processing",
+                BgWriterShmem->num_requests);
+        return;
+    }
+
+    elog(DEBUG1, "Absorb queue: %d fsync requests, processing",
+        BgWriterShmem->num_requests);
+
+    /*
      * We have to PANIC if we fail to absorb all the pending requests (eg,
      * because our hashtable runs out of memory).  This is because the system
      * cannot run safely if we are unable to fsync what we have been told to
@@ -1167,4 +1233,9 @@ AbsorbFsyncRequests(void)
         pfree(requests);

     END_CRIT_SECTION();
+
+    /*
+     * Send off activity statistics to the stats collector
+     */
+    pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7140b94..57066c4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -36,9 +36,6 @@
  */
 #define DEBUG_FSYNC    1

-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB        10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC    (InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC    (InvalidBlockNumber-1)
@@ -931,7 +928,6 @@ mdsync(void)

     HASH_SEQ_STATUS hstat;
     PendingOperationEntry *entry;
-    int            absorb_counter;

 #ifdef DEBUG_FSYNC
     /* Statistics on sync times */
@@ -958,7 +954,7 @@ mdsync(void)
      * queued an fsync request before clearing the buffer's dirtybit, so we
      * are safe as long as we do an Absorb after completing BufferSync().
      */
-    AbsorbFsyncRequests();
+    AbsorbFsyncRequests(true);

     /*
      * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1001,7 +997,6 @@ mdsync(void)
     mdsync_in_progress = true;

     /* Now scan the hashtable for fsync requests to process */
-    absorb_counter = FSYNCS_PER_ABSORB;
     hash_seq_init(&hstat, pendingOpsTable);
     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
     {
@@ -1026,17 +1021,9 @@ mdsync(void)
             int            failures;

             /*
-             * If in bgwriter, we want to absorb pending requests every so
-             * often to prevent overflow of the fsync request queue.  It is
-             * unspecified whether newly-added entries will be visited by
-             * hash_seq_search, but we don't care since we don't need to
-             * process them anyway.
+             * If in bgwriter, perform normal duties.
              */
-            if (--absorb_counter <= 0)
-            {
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;
-            }
+            CheckpointSyncDelay();

             /*
              * The fsync table could contain requests to fsync segments that
@@ -1131,10 +1118,9 @@ mdsync(void)
                 pfree(path);

                 /*
-                 * Absorb incoming requests and check to see if canceled.
+                 * If in bgwriter, perform normal duties.
                  */
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;        /* might as well... */
+                CheckpointSyncDelay();

                 if (entry->canceled)
                     break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);

 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);

 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                     BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);

 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);

Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> The second issue is that the delay between sync calls is currently
> hard-coded, at 3 seconds.  I believe the right path here is to consider the
> current checkpoint_completion_target to still be valid, then work back from
> there.  That raises the question of what percentage of the time writes
> should now be compressed into relative to that, to leave some time to spread
> the sync calls.  If we're willing to say "writes finish in first 1/2 of
> target, syncs execute in second 1/2", I could implement that here.
>  Maybe that ratio needs to be another tunable.  Still thinking about that
> part, and it's certainly open to community debate.  The thing to realize
> that complicates the design is that the actual sync execution may take a
> considerable period of time.  It's much more likely for that to happen than
> in the case of an individual write, as the current spread checkpoint does,
> because those are usually cached.  In the spread sync case, it's easy for
> one slow sync to make the rest turn into ones that fire in quick succession,
> to make up for lost time.

I think the behavior of file systems and operating systems is highly
relevant here.  We seem to have a theory that allowing a delay between
the write and the fsync should give the OS a chance to start writing
the data out, but do we have any evidence indicating whether and under
what circumstances that actually occurs?  For example, if we knew that
it's important to wait at least 30 s but waiting 60 s is no better,
that would be useful information.

Another question I have is about how we're actually going to know when
any given fsync can be performed.  For any given segment, there are a
certain number of pages A that are already dirty at the start of the
checkpoint.  Then there are a certain number of additional pages B
that are going to be written out during the checkpoint.  If it so
happens that B = 0, we can call fsync() at the beginning of the
checkpoint without losing anything (in fact, we gain something: any
pages dirtied by cleaning scans or backend writes during the
checkpoint won't need to hit the disk; and if the filesystem dumps
more of its cache than necessary on fsync, we may as well take that
hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
attempt the fsync() until we've written them all; otherwise we'll end
up having to fsync() that segment twice.

Doing all the writes and then all the fsyncs meets this requirement
trivially, but I'm not so sure that's a good idea.  For example, given
files F1 ... Fn with dirty pages needing checkpoint writes, we could
do the following: first, do any pending fsyncs for files not among F1
.. Fn; then, write all pages for F1 and fsync, write all pages for F2
and fsync, write all pages for F3 and fsync, etc.  This might seem
dumb because we're not really giving the OS a chance to write anything
out before we fsync, but think about the ext3 case where the whole
filesystem cache gets flushed anyway.  It's much better to dump the
cache at the beginning of the checkpoint and then again after every
file than it is to spew many GB of dirty stuff into the cache and then
drop the hammer.

I'm just brainstorming here; feel free to tell me I'm all wet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Jeff Janes
Date:
On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The second issue is that the delay between sync calls is currently
>> hard-coded, at 3 seconds.  I believe the right path here is to consider the
>> current checkpoint_completion_target to still be valid, then work back from
>> there.  That raises the question of what percentage of the time writes
>> should now be compressed into relative to that, to leave some time to spread
>> the sync calls.  If we're willing to say "writes finish in first 1/2 of
>> target, syncs execute in second 1/2", I could implement that here.
>>  Maybe that ratio needs to be another tunable.  Still thinking about that
>> part, and it's certainly open to community debate.

I would speculate that the answer is likely to be nearly binary.  The
best option would either be to do the writes as fast as possible and
spread out the fsyncs, or spread out the writes and do the fsyncs as
fast as possible, depending on how the system is set up.


>> The thing to realize
>> that complicates the design is that the actual sync execution may take a
>> considerable period of time.  It's much more likely for that to happen than
>> in the case of an individual write, as the current spread checkpoint does,
>> because those are usually cached.  In the spread sync case, it's easy for
>> one slow sync to make the rest turn into ones that fire in quick succession,
>> to make up for lost time.
>
> I think the behavior of file systems and operating systems is highly
> relevant here.  We seem to have a theory that allowing a delay between
> the write and the fsync should give the OS a chance to start writing
> the data out,

I thought that the theory was that doing too many fsyncs in short order
can lead to some kind of starvation of other IO.

If the theory is that we want to wait between writes and fsyncs, then
the current behavior is probably the best: spreading out the writes
and then doing all the syncs at the end gives the longest delay
between an average write and the sync of the file it was written to.  Or,
spread the writes out over 150 seconds, sleep for 140 seconds, then do
the fsyncs.  But I don't think that that is the theory.


> but do we have any evidence indicating whether and under
> what circumstances that actually occurs?  For example, if we knew that
> it's important to wait at least 30 s but waiting 60 s is no better,
> that would be useful information.
>
> Another question I have is about how we're actually going to know when
> any given fsync can be performed.  For any given segment, there are a
> certain number of pages A that are already dirty at the start of the
> checkpoint.

Dirty in the shared pool, or dirty in the OS cache?

> Then there are a certain number of additional pages B
> that are going to be written out during the checkpoint.  If it so
> happens that B = 0, we can call fsync() at the beginning of the
> checkpoint without losing anything (in fact, we gain something: any
> pages dirtied by cleaning scans or backend writes during the
> checkpoint won't need to hit the disk;

Aren't those pages written out by cleaning scans and backend writes
while the checkpoint is occurring exactly what you defined to be page
set B, and then to be zero?

> and if the filesystem dumps
> more of its cache than necessary on fsync, we may as well take that
> hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
> attempt the fsync() until we've written them all; otherwise we'll end
> up having to fsync() that segment twice.
>
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea.  For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc.  This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.  It's much better to dump the
> cache at the beginning of the checkpoint and then again after every
> file than it is to spew many GB of dirty stuff into the cache and then
> drop the hammer.

But the kernel has knobs to prevent that from happening.
dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
is supposed to do a journal commit every 5 seconds under default mount
conditions.

Cheers,

Jeff


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> The thing to realize
>>> that complicates the design is that the actual sync execution may take a
>>> considerable period of time.  It's much more likely for that to happen than
>>> in the case of an individual write, as the current spread checkpoint does,
>>> because those are usually cached.  In the spread sync case, it's easy for
>>> one slow sync to make the rest turn into ones that fire in quick succession,
>>> to make up for lost time.
>>
>> I think the behavior of file systems and operating systems is highly
>> relevant here.  We seem to have a theory that allowing a delay between
>> the write and the fsync should give the OS a chance to start writing
>> the data out,
>
> I thought that the theory was that doing too many fsyncs in short order
> can lead to some kind of starvation of other IO.
>
> If the theory is that we want to wait between writes and fsyncs, then
> the current behavior is probably the best: spreading out the writes
> and then doing all the syncs at the end gives the longest delay
> between an average write and the sync of the file it was written to.  Or,
> spread the writes out over 150 seconds, sleep for 140 seconds, then do
> the fsyncs.  But I don't think that that is the theory.

Well, I've heard Bruce and, I think, possibly also Greg talk about
wanting to wait after doing the writes in the hopes that the kernel
will start to flush the dirty pages, but I'm wondering whether it
wouldn't be better to just give up on that and do: small batch of
writes - fsync those writes - another small batch of writes - fsync
that batch - etc.

>> but do we have any evidence indicating whether and under
>> what circumstances that actually occurs?  For example, if we knew that
>> it's important to wait at least 30 s but waiting 60 s is no better,
>> that would be useful information.
>>
>> Another question I have is about how we're actually going to know when
>> any given fsync can be performed.  For any given segment, there are a
>> certain number of pages A that are already dirty at the start of the
>> checkpoint.
>
> Dirty in the shared pool, or dirty in the OS cache?

OS cache, sorry.

>> Then there are a certain number of additional pages B
>> that are going to be written out during the checkpoint.  If it so
>> happens that B = 0, we can call fsync() at the beginning of the
>> checkpoint without losing anything (in fact, we gain something: any
>> pages dirtied by cleaning scans or backend writes during the
>> checkpoint won't need to hit the disk;
>
> Aren't those pages written out by cleaning scans and backend writes
> while the checkpoint is occurring exactly what you defined to be page
> set B, and then to be zero?

No, sorry, I'm referring to cases where all the dirty pages in a
segment have been written out the OS but we have not yet issued the
necessary fsync.

>> and if the filesystem dumps
>> more of its cache than necessary on fsync, we may as well take that
>> hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
>> attempt the fsync() until we've written them all; otherwise we'll end
>> up having to fsync() that segment twice.
>>
>> Doing all the writes and then all the fsyncs meets this requirement
>> trivially, but I'm not so sure that's a good idea.  For example, given
>> files F1 ... Fn with dirty pages needing checkpoint writes, we could
>> do the following: first, do any pending fsyncs for files not among F1
>> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
>> and fsync, write all pages for F3 and fsync, etc.  This might seem
>> dumb because we're not really giving the OS a chance to write anything
>> out before we fsync, but think about the ext3 case where the whole
>> filesystem cache gets flushed anyway.  It's much better to dump the
>> cache at the beginning of the checkpoint and then again after every
>> file than it is to spew many GB of dirty stuff into the cache and then
>> drop the hammer.
>
> But the kernel has knobs to prevent that from happening.
> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
> kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
> is supposed to do a journal commit every 5 seconds under default mount
> conditions.

I don't know in detail.  dirty_expire_centisecs sounds useful; I think
the problem with dirty_background_ratio and dirty_ratio is that the
default ratios are large enough that on systems with a huge pile of
memory, they allow more dirty data to accumulate than can be flushed
without causing an I/O storm.  I believe Greg Smith made a comment
along the lines of - memory sizes are growing faster than I/O speeds;
therefore a ratio that is OK for a low-end system with a modest amount
of memory causes problems on a high-end system that has faster I/O but
MUCH more memory.

As a kernel developer, I suspect the tendency is to try to set the
ratio so that you keep enough free memory around to service future
allocation requests.  Optimizing for possible future fsyncs is
probably not the top priority...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Jeff Janes
Date:
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

>>> Doing all the writes and then all the fsyncs meets this requirement
>>> trivially, but I'm not so sure that's a good idea.  For example, given
>>> files F1 ... Fn with dirty pages needing checkpoint writes, we could
>>> do the following: first, do any pending fsyncs for files not among F1
>>> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
>>> and fsync, write all pages for F3 and fsync, etc.  This might seem
>>> dumb because we're not really giving the OS a chance to write anything
>>> out before we fsync, but think about the ext3 case where the whole
>>> filesystem cache gets flushed anyway.  It's much better to dump the
>>> cache at the beginning of the checkpoint and then again after every
>>> file than it is to spew many GB of dirty stuff into the cache and then
>>> drop the hammer.
>>
>> But the kernel has knobs to prevent that from happening.
>> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
>> kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
>> is supposed to do a journal commit every 5 seconds under default mount
>> conditions.
>
> I don't know in detail.  dirty_expire_centisecs sounds useful; I think
> the problem with dirty_background_ratio and dirty_ratio is that the
> default ratios are large enough that on systems with a huge pile of
> memory, they allow more dirty data to accumulate than can be flushed
> without causing an I/O storm.

True, but I think that changing these from their defaults is not
considered to be a dark art reserved for kernel hackers, i.e. they are
something that sysadmins are expected to tweak to suit their
workload, just like shmmax and such.  And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percent points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With the dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.

> I believe Greg Smith made a comment
> along the lines of - memory sizes are growing faster than I/O speeds;
> therefore a ratio that is OK for a low-end system with a modest amount
> of memory causes problems on a high-end system that has faster I/O but
> MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)

Some experiments I have just done show that dirty_expire_centisecs
does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes
seem reliable on ext2, ext3, and ext4.

But that may not apply to RAID; I don't have one I can test.


Cheers,

Jeff


Re: Spread checkpoint sync

From
Greg Smith
Date:
Jeff Janes wrote:
> And for very large memory
> systems, even 1% may be too much to cache (dirty*_ratio can only be
> set in integer percent points), so recent kernels introduced
> dirty*_bytes parameters.  I like these better because they do what
> they say.  With the dirty*_ratio, I could never figure out what it was
> a ratio of, and the results were unpredictable without extensive
> experimentation.
>   

Right, you can't set dirty_background_ratio low enough to make this 
problem go away.  Even attempts to set it to 1%, back when that was 
the right size for it, seem to be defeated by other mechanisms within 
the kernel.  Last time I looked at the related source code, it seemed 
the "congestion control" logic that kicks in to throttle writes was a 
likely suspect.  This is why I'm not really optimistic about newer 
mechanisms like the dirty_background_bytes added in 2.6.29 to help here, as 
that just gives a mapping to setting lower values; the same basic logic 
is under the hood.

Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly 
due to the same congestion mechanism. 

> Yes, but how much work do we want to put into redoing the checkpoint
> logic so that the sysadmin on a particular OS and configuration and FS
> can avoid having to change the kernel parameters away from their
> defaults?  (Assuming of course I am correctly understanding the
> problem, always a dangerous assumption.)
>   

I've been trying to make this problem go away using just the kernel 
tunables available since 2006.  I adjusted them carefully on the server 
that ran into this problem so badly that it motivated the submitted 
patch, months before this issue got bad.  It didn't help.  Maybe if they 
were running a later kernel that supported dirty_background_bytes that 
would have worked better.  During the last few years, the only thing 
that has consistently helped in every case is the checkpoint spreading 
logic that went into 8.3.  I no longer expect that the kernel developers 
will ever make this problem go away the way checkpoints are written out 
right now, whereas the last good PostgreSQL work in this area definitely 
helped.

The basic premise of the current checkpoint code is that if you write 
all of the buffers out early enough, by the time syncs execute enough of 
the data should have gone out that those don't take very long to 
process.  That was usually true for the last few years, on systems with 
a battery-backed cache: the amount of memory cached by the OS was 
relatively small compared to the RAID cache size.  That's no longer the 
case, and the divergence is growing.

The idea that the checkpoint sync code can run in a relatively tight 
loop, without stopping to do the normal background writer cleanup work, 
is also busted by that observation.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea.  For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc.  This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I 
consider that filesystem fundamentally broken from the perspective of 
its ability to deliver low latency here.  I wouldn't want a patch that 
improved behavior on filesystems with granular fsync to make the ext3 
situation worse; that's as far as I'd want the design to lean toward 
considering its quirks.  Jeff Janes made a case downthread for "why not 
make it the admin/OS's job to worry about this?"  In cases where there 
is a reasonable solution available, in the form of "switch to XFS or 
ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and 
magnitude of the problem case I've been working on here.  In the 
situation that leads to the near hour-long sync phase I've seen, 
checkpoints will start with about a 3GB backlog of data in the kernel 
write cache to deal with.  That's about 4% of RAM, just under the 5% 
threshold set by dirty_background_ratio.  Whether or not the 256MB write 
cache on the controller is also filled is a relatively minor detail I 
can't monitor easily.  The checkpoint itself?  <250MB each time. 

This proportion is why I didn't think to follow the alternate path of 
worrying about spacing the write and fsync calls out differently.  I 
shrank shared_buffers down to make the actual checkpoints smaller, which 
helped to some degree; that's what got them down to smaller than the 
RAID cache size.  But the amount of data cached by the operating system 
is the real driver of total sync time here.  Whether or not you include 
all of the writes from the checkpoint itself before you start calling 
fsync didn't actually matter very much; in the case I've been chasing, 
those are getting cached anyway.  The write storm from the fsync calls 
themselves forcing things out seems to be the driver on I/O spikes, 
which is why I started with spacing those out.

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog 
takes a minimum of 10 minutes of real time.  There are about 300 1GB 
relation files involved in the case I've been chasing.  This is where 
the 3 second delay number came from; 300 files, 3 seconds each, 900 
seconds = 15 minutes of sync spread.  You can turn that math around to 
figure out how much delay per relation you can afford while still 
keeping checkpoints to a planned end time, which isn't done in the patch 
I submitted yet.
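Turned around, that math is just the sync-phase time budget divided by the number of files with pending syncs; a minimal sketch of the idea (the function and parameter names here are illustrative, not from the patch):

```c
/*
 * How long can we nap after each fsync and still finish the sync phase
 * on time?  sync_budget_secs would ultimately be derived from
 * checkpoint_timeout and checkpoint_completion_target; both names here
 * are assumptions for illustration.
 */
double
delay_per_file(double sync_budget_secs, int files_to_sync)
{
    if (files_to_sync <= 0)
        return 0.0;             /* nothing pending: no delay needed */
    return sync_budget_secs / files_to_sync;
}
```

With the numbers above, delay_per_file(900.0, 300) gives back the 3 second spacing.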

Ultimately what I want to do here is some sort of smarter write-behind 
sync operation, perhaps with a LRU on relations with pending fsync 
requests.  The idea would be to sync relations that haven't been touched 
in a while in advance of the checkpoint even.  I think that's similar to 
the general idea Robert is suggesting here, to get some sync calls 
flowing before all of the checkpoint writes have happened.  I think that 
the final sync calls will need to get spread out regardless, and since 
doing that requires a fairly small amount of code too that's why we 
started with that.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Martijn van Oosterhout
Date:
On Sun, Nov 21, 2010 at 04:54:00PM -0500, Greg Smith wrote:
> Ultimately what I want to do here is some sort of smarter write-behind
> sync operation, perhaps with a LRU on relations with pending fsync
> requests.  The idea would be to sync relations that haven't been touched
> in a while in advance of the checkpoint even.  I think that's similar to
> the general idea Robert is suggesting here, to get some sync calls
> flowing before all of the checkpoint writes have happened.  I think that
> the final sync calls will need to get spread out regardless, and since
> doing that requires a fairly small amount of code too that's why we
> started with that.

For a similar problem we had (kernel buffering too much) we had success
using the fadvise and madvise WONTNEED syscalls to force the data to
exit the cache much sooner than it would otherwise. This was on Linux
and it had the side-effect that the data was deleted from the kernel
cache, which we wanted, but probably isn't appropriate here.

There is also sync_file_range, but that's linux specific, although
close to what you want I think. It would allow you to work with blocks
smaller than 1GB.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Spread checkpoint sync

From
Andres Freund
Date:
On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote:
> For a similar problem we had (kernel buffering too much) we had success
> using the fadvise and madvise WONTNEED syscalls to force the data to
> exit the cache much sooner than it would otherwise. This was on Linux
> and it had the side-effect that the data was deleted from the kernel
> cache, which we wanted, but probably isn't appropriate here.
Yep, works fine. Although it has the issue that the data will get read again if 
archiving/SR is enabled.

> There is also sync_file_range, but that's linux specific, although
> close to what you want I think. It would allow you to work with blocks
> smaller than 1GB.
Unfortunately that puts the data under quite high write-out pressure inside 
the kernel - which is not what you actually want because it limits reordering 
and such significantly.

It would be nicer if you could get a mix of both semantics (looking at it, 
depending on the approach that seems to be about a 10 line patch to the 
kernel). I.e. indicate that you want to write the pages soonish, but don't put 
it on the head of the writeout queue.

Andres


Re: Spread checkpoint sync

From
Josh Berkus
Date:
On 11/20/10 6:11 PM, Jeff Janes wrote:
> True, but I think that changing these from their defaults is not
> considered to be a dark art reserved for kernel hackers, i.e they are
> something that sysadmins are expected to tweak to suite their work
> load, just like the shmmax and such. 

I disagree.  Linux kernel hackers know about these kinds of parameters,
and I suppose that Linux performance experts do.  But very few
sysadmins, in my experience, have any idea.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
 


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sun, Nov 21, 2010 at 4:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Let me throw some numbers out [...]

Interesting.

> Ultimately what I want to do here is some sort of smarter write-behind sync
> operation, perhaps with a LRU on relations with pending fsync requests.  The
> idea would be to sync relations that haven't been touched in a while in
> advance of the checkpoint even.  I think that's similar to the general idea
> Robert is suggesting here, to get some sync calls flowing before all of the
> checkpoint writes have happened.  I think that the final sync calls will
> need to get spread out regardless, and since doing that requires a fairly
> small amount of code too that's why we started with that.

Doing some kind of background fsync-ing might indeed be sensible, but
I agree that's secondary to trying to spread out the fsyncs during the
checkpoint itself.  I guess the question is what we can do there
sensibly without an unreasonable amount of new code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Cédric Villemain
Date:
2010/11/21 Andres Freund <andres@anarazel.de>:
> On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote:
>> For a similar problem we had (kernel buffering too much) we had success
>> using the fadvise and madvise WONTNEED syscalls to force the data to
>> exit the cache much sooner than it would otherwise. This was on Linux
>> and it had the side-effect that the data was deleted from the kernel
>> cache, which we wanted, but probably isn't appropriate here.
> Yep, works fine. Although it has the issue that the data will get read again if
> archiving/SR is enabled.

Mmhh, the current code does call DONTNEED or WILLNEED for WAL,
depending on whether archiving is off or on.

This matters *only* once the data is written (fsync, fdatasync); before
that it should not have an effect.

>
>> There is also sync_file_range, but that's linux specific, although
>> close to what you want I think. It would allow you to work with blocks
>> smaller than 1GB.
> Unfortunately that puts the data under quite high write-out pressure inside
> the kernel - which is not what you actually want because it limits reordering
> and such significantly.
>
> It would be nicer if you could get a mix of both semantics (looking at it,
> depending on the approach that seems to be about a 10 line patch to the
> kernel). I.e. indicate that you want to write the pages soonish, but don't put
> it on the head of the writeout queue.
>
> Andres
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Spread checkpoint sync

From
Ron Mayer
Date:
Josh Berkus wrote:
> On 11/20/10 6:11 PM, Jeff Janes wrote:
>> True, but I think that changing these from their defaults is not
>> considered to be a dark art reserved for kernel hackers, i.e they are
>> something that sysadmins are expected to tweak to suite their work
>> load, just like the shmmax and such. 
> 
> I disagree.  Linux kernel hackers know about these kinds of parameters,
> and I suppose that Linux performance experts do.  But very few
> sysadmins, in my experience, have any idea.

To me, a lot of this conversation feels parallel to the
arguments that occasionally come up debating writing directly
to raw disks, bypassing the filesystem altogether.

Might smoother checkpoints be better solved by talking
to the OS vendors and virtual-memory-tuning-knob authors
to work with them on exposing the ideal knobs, rather than
saying that because our only tool is a hammer (fsync), the
problem must be handled as a nail?


Hypothetically - what would the ideal knobs be?

Something like madvise WONTNEED but that leaves pages
in the OS's cache after writing them?



Re: Spread checkpoint sync

From
Greg Smith
Date:
Ron Mayer wrote:
> Might smoother checkpoints be better solved by talking
> to the OS vendors & virtual-memory-tunning-knob-authors
> to work with them on exposing the ideal knobs; rather than
> saying that our only tool is a hammer(fsync) so the problem
> must be handled as a nail.
>

Maybe, but it's hard to deny that the current implementation--just
doing all of the sync calls as fast as possible, one after the other--is
going to produce worst-case behavior in a lot of situations.  Given that
it's not a huge amount of code to do better, I'd rather do some work in
that direction, instead of presuming the kernel authors will ever make
this go away.  Spreading the writes out as part of the checkpoint rework
in 8.3 worked better than any kernel changes I've tested since then, and
I'm not real optimistic about this getting resolved at the system level.
So long as the database changes aren't antagonistic toward kernel
improvements, I'd prefer to have some options here that become effective
as soon as the database code is done.

I've attached an updated version of the initial sync spreading patch
here, one that applies cleanly on top of HEAD and over top of the sync
instrumentation patch too.  The conflict that made that hard before is
gone now.

Having the pg_stat_bgwriter.buffers_backend_fsync patch available all
the time now has made me reconsider how important one potential bit of
refactoring here would be.  I managed to catch one of the situations
where really popular relations were being heavily updated in a way that
was competing with the checkpoint on my test system (which I can happily
share the logs of), with the instrumentation patch applied but not the
spread sync one:

LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s,
total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung on trying to get 16448_fsm written
out, but the backends were issuing constant competing fsync calls to
these other relations.  This is very similar to the production case this
patch was written to address, which I hadn't been able to share a good
example of yet.  That's essentially what it looks like, except with the
contention going on for minutes instead of seconds.

One of the ideas Simon and I had been considering at one point was
adding some better de-duplication logic to the fsync absorb code, which
I'm reminded by the pattern here might be helpful independently of other
improvements.
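That de-duplication idea can be sketched roughly as follows, with a simplified stand-in for the real BgWriterRequest structure and shared queue (all names and the linear scan are illustrative, not from the patch):

```c
#include <stdbool.h>

/*
 * Simplified stand-in for the (rnode, segno) key that identifies an
 * fsync request in the shared absorb queue.
 */
typedef struct
{
    unsigned    rnode;
    unsigned    segno;
} FsyncRequest;

/*
 * Append a request unless an identical one is already pending, so a
 * hot relation being fsync'd repeatedly doesn't fill the queue.
 * Returns false when the request was dropped as a duplicate or the
 * queue was full (in the real code a full queue makes the backend
 * fsync the file itself, which is the contention shown in the log).
 */
bool
remember_fsync_request(FsyncRequest *queue, int *nreq, int maxreq,
                       unsigned rnode, unsigned segno)
{
    int         i;

    for (i = 0; i < *nreq; i++)
        if (queue[i].rnode == rnode && queue[i].segno == segno)
            return false;       /* already covered by a pending request */

    if (*nreq >= maxreq)
        return false;           /* queue overflow */

    queue[*nreq].rnode = rnode;
    queue[*nreq].segno = segno;
    (*nreq)++;
    return true;
}
```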

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct

 static BgWriterShmemStruct *BgWriterShmem;

-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB        1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR    10

 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
         /*
          * Process any requests or signals received recently.
          */
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(false);

         if (got_SIGHUP)
         {
@@ -636,7 +636,7 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(true);
         udelay -= 1000000L;
     }

@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-    static int    absorb_counter = WRITES_PER_ABSORB;
-
     /* Do nothing if checkpoint is being executed by non-bgwriter process */
     if (!am_bg_writer)
         return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
             ProcessConfigFile(PGC_SIGHUP);
         }

-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);

         BgBufferSync();
         CheckArchiveTimeout();
         BgWriterNap();
     }
-    else if (--absorb_counter <= 0)
+    else
     {
         /*
-         * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-         * operations even when we don't sleep, to prevent overflow of the
-         * fsync request queue.
+         * Check for overflow of the fsync request queue.
          */
-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);
+    }
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+    pg_time_t    now;
+    pg_time_t    sync_start_time;
+    int            sync_delay_secs;
+
+    /*
+     * Delay after each sync, in seconds.  This could be a parameter.  But
+     * since ideally this will be auto-tuning in the near future, not
+     * assigning it a GUC setting yet.
+     */
+#define EXTRA_SYNC_DELAY    3
+
+    /* Do nothing if checkpoint is being executed by non-bgwriter process */
+    if (!am_bg_writer)
+        return;
+
+    sync_start_time = (pg_time_t) time(NULL);
+
+    /*
+     * Perform the usual bgwriter duties.
+     */
+    for (;;)
+    {
+        AbsorbFsyncRequests(false);
+        BgBufferSync();
+        CheckArchiveTimeout();
+        BgWriterNap();
+
+        /*
+         * Are we there yet?
+         */
+        now = (pg_time_t) time(NULL);
+        sync_delay_secs = now - sync_start_time;
+        if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+            break;
     }
 }

@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
     BgWriterRequest *requests = NULL;
     BgWriterRequest *request;
     int            n;

+    /*
+     * The size of the request queue is divided by ABSORB_ACTION_DIVISOR
+     * to determine when absorption action needs to be taken.  The
+     * default aims to empty the queue whenever 1 / 10 = 10% of it is
+     * full.  If this isn't good enough, you probably need to lower
+     * bgwriter_delay, rather than presume this needs to be a tunable
+     * you can decrease.
+     */
+
     if (!am_bg_writer)
         return;

     /*
+     * If the queue isn't very large, don't worry about absorbing yet.
+     * Access integer counter without lock, to avoid queuing.
+     */
+    if (!force && BgWriterShmem->num_requests <
+            (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+    {
+        if (BgWriterShmem->num_requests > 0)
+            elog(DEBUG1, "Absorb queue: %d fsync requests, not processing",
+                 BgWriterShmem->num_requests);
+        return;
+    }
+
+    elog(DEBUG1, "Absorb queue: %d fsync requests, processing",
+         BgWriterShmem->num_requests);
+
+    /*
      * We have to PANIC if we fail to absorb all the pending requests (eg,
      * because our hashtable runs out of memory).  This is because the system
      * cannot run safely if we are unable to fsync what we have been told to
@@ -1164,4 +1230,9 @@ AbsorbFsyncRequests(void)
         pfree(requests);

     END_CRIT_SECTION();
+
+    /*
+     * Send off activity statistics to the stats collector
+     */
+    pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index cadd938..c89486e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,9 +31,6 @@
 #include "pg_trace.h"


-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB        10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC    (InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC    (InvalidBlockNumber-1)
@@ -926,7 +923,6 @@ mdsync(void)

     HASH_SEQ_STATUS hstat;
     PendingOperationEntry *entry;
-    int            absorb_counter;

     /* Statistics on sync times */
     int processed = 0;
@@ -951,7 +947,7 @@ mdsync(void)
      * queued an fsync request before clearing the buffer's dirtybit, so we
      * are safe as long as we do an Absorb after completing BufferSync().
      */
-    AbsorbFsyncRequests();
+    AbsorbFsyncRequests(true);

     /*
      * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -994,7 +990,6 @@ mdsync(void)
     mdsync_in_progress = true;

     /* Now scan the hashtable for fsync requests to process */
-    absorb_counter = FSYNCS_PER_ABSORB;
     hash_seq_init(&hstat, pendingOpsTable);
     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
     {
@@ -1019,17 +1014,9 @@ mdsync(void)
             int            failures;

             /*
-             * If in bgwriter, we want to absorb pending requests every so
-             * often to prevent overflow of the fsync request queue.  It is
-             * unspecified whether newly-added entries will be visited by
-             * hash_seq_search, but we don't care since we don't need to
-             * process them anyway.
+             * If in bgwriter, perform normal duties.
              */
-            if (--absorb_counter <= 0)
-            {
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;
-            }
+            CheckpointSyncDelay();

             /*
              * The fsync table could contain requests to fsync segments that
@@ -1121,10 +1108,9 @@ mdsync(void)
                 pfree(path);

                 /*
-                 * Absorb incoming requests and check to see if canceled.
+                 * If in bgwriter, perform normal duties.
                  */
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;        /* might as well... */
+                CheckpointSyncDelay();

                 if (entry->canceled)
                     break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);

 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);

 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                     BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);

 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);

Re: Spread checkpoint sync

From
Josh Berkus
Date:
> Maybe, but it's hard to deny that the current implementation--just
> doing all of the sync calls as fast as possible, one after the other--is
> going to produce worst-case behavior in a lot of situations.  Given that
> it's not a huge amount of code to do better, I'd rather do some work in
> that direction, instead of presuming the kernel authors will ever make
> this go away.  Spreading the writes out as part of the checkpoint rework
> in 8.3 worked better than any kernel changes I've tested since then, and
> I'm not real optimistic about this getting resolved at the system level.
> So long as the database changes aren't antagonistic toward kernel
> improvements, I'd prefer to have some options here that become effective
> as soon as the database code is done.

Besides, even if kernel/FS authors did improve things, the improvements
would not be available on production platforms for years.  And, for that
matter, while Linux and BSD are pretty responsive to our feedback,
Apple, Microsoft and Oracle are most definitely not.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
 


Re: Spread checkpoint sync

From
Jeff Janes
Date:
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:

...

> One change that turned out be necessary rather than optional--to get good
> performance from the system under tuning--was to make regular background
> writer activity, including fsync absorb checks, happen during these sync
> pauses.  The existing code ran the checkpoint sync work in a pretty tight
> loop, which as I alluded to in an earlier patch today can lead to the
> backends competing with the background writer to get their sync calls
> executed.  This squashes that problem if the background writer is setup
> properly.

Have you tested out this "absorb during syncing phase" code without
the sleep between the syncs?  I.e. so that it is still a tight loop,
but the loop alternates between sync and absorb, with no intentional
pause?

I wonder whether all the improvement you see might be due entirely to
the absorb between syncs, and none or very little to the sleep itself.

I ask because I don't have a mental model of how the pause can help.
Given that this dirty data has been hanging around for many minutes
already, what is a 3 second pause going to heal?

The healing power of clearing out the absorb queue seems much more obvious.

Cheers,

Jeff


Re: Spread checkpoint sync

From
Greg Smith
Date:
Jeff Janes wrote:
> Have you tested out this "absorb during syncing phase" code without
> the sleep between the syncs?
> I.e. so that it still a tight loop, but the loop alternates between
> sync and absorb, with no intentional pause?
>   

Yes; that's how it was developed.  It helped to have just the extra 
absorb work without the pauses, but that alone wasn't enough to really 
improve things on the server we ran into this problem badly on.

> I ask because I don't have a mental model of how the pause can help.
> Given that this dirty data has been hanging around for many minutes
> already, what is a 3 second pause going to heal?
>   

The difference is that once an fsync call is made, dirty data is much 
more likely to be forced out.  It's the one thing that bypasses all 
other ways the kernel might try to avoid writing the data--both the 
dirty ratio guidelines and the congestion control logic--and forces 
those writes to happen as soon as they can be scheduled.  If you graph 
the amount of data shown "Dirty:" by /proc/meminfo over time, once the 
sync calls start happening it's like a descending staircase pattern, 
dropping a little bit as each sync fires. 
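That staircase is easy to watch for yourself; a small Linux-only sketch that pulls the "Dirty:" figure out of /proc/meminfo, which, sampled in a loop, produces the graph described (the parsing helper is an illustration, not anything from the patch):

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the "Dirty:" value, in kB, from meminfo-formatted text;
 * returns -1 if it can't be found.
 */
long
dirty_kb_from(const char *meminfo_text)
{
    const char *p = strstr(meminfo_text, "Dirty:");
    long        kb;

    if (p == NULL || sscanf(p, "Dirty: %ld", &kb) != 1)
        return -1;
    return kb;
}

/* Read the current value from /proc/meminfo (Linux-specific). */
long
dirty_kb(void)
{
    char        buf[8192];
    size_t      n;
    FILE       *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return dirty_kb_from(buf);
}
```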

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Heikki Linnakangas
Date:
On 01.12.2010 06:25, Greg Smith wrote:
> Jeff Janes wrote:
>> I ask because I don't have a mental model of how the pause can help.
>> Given that this dirty data has been hanging around for many minutes
>> already, what is a 3 second pause going to heal?
>
> The difference is that once an fsync call is made, dirty data is much
> more likely to be forced out. It's the one thing that bypasses all other
> ways the kernel might try to avoid writing the data--both the dirty
> ratio guidelines and the congestion control logic--and forces those
> writes to happen as soon as they can be scheduled. If you graph the
> amount of data shown "Dirty:" by /proc/meminfo over time, once the sync
> calls start happening it's like a descending staircase pattern, dropping
> a little bit as each sync fires.

Do you have any idea how to autotune the delay between fsyncs?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Spread checkpoint sync

From
Greg Smith
Date:
Heikki Linnakangas wrote:
> Do you have any idea how to autotune the delay between fsyncs?

I'm thinking to start by counting the number of relations that need them 
at the beginning of the checkpoint.  Then use the same basic math that 
drives the spread writes, where you assess whether you're on schedule or 
not based on segment/time progress relative to how many have been sync'd 
out of that total.  At a high level I think that idea translates over 
almost directly into the existing write spread code.  Was hoping for a 
sanity check from you in particular about whether that seems reasonable 
or not before diving into the coding.
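To make the idea concrete, here's a hypothetical sketch of that scheduling 
test (names invented, not actual PostgreSQL code): treat files synced out of 
the starting total as the progress fraction, and only sleep between fsyncs 
while that fraction is at or ahead of the elapsed segment/time estimate, just 
as the write phase does.

```c
#include <stdbool.h>

/*
 * Hypothetical schedule check for spread syncs: files_synced out of
 * files_total plays the role the written-buffer count plays in the
 * existing spread-write code.  elapsed_progress is the segment/time
 * progress estimate, as a fraction in [0, 1].
 */
static bool
sync_on_schedule(int files_synced, int files_total, double elapsed_progress)
{
    double sync_progress;

    if (files_total <= 0)
        return true;            /* nothing to sync: trivially on schedule */
    sync_progress = (double) files_synced / files_total;
    /* Sleep before the next fsync only while we're not behind. */
    return sync_progress >= elapsed_progress;
}
```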

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Heikki Linnakangas
Date:
On 01.12.2010 23:30, Greg Smith wrote:
> Heikki Linnakangas wrote:
>> Do you have any idea how to autotune the delay between fsyncs?
>
> I'm thinking to start by counting the number of relations that need them
> at the beginning of the checkpoint. Then use the same basic math that
> drives the spread writes, where you assess whether you're on schedule or
> not based on segment/time progress relative to how many have been sync'd
> out of that total. At a high level I think that idea translates over
> almost directly into the existing write spread code. Was hoping for a
> sanity check from you in particular about whether that seems reasonable
> or not before diving into the coding.

Sounds reasonable to me. fsync()s are a lot less uniform than write()s, 
though. If you fsync() a file with one dirty page in it, it's going to 
return very quickly, but a 1GB file will take a while. That could be 
problematic if you have a thousand small files and a couple of big ones, 
as you would want to reserve more time for the big ones. I'm not sure 
what to do about it, maybe it's not a problem in practice.
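One possible answer, sketched hypothetically: weight each pending sync by its 
file size rather than counting files equally, so a 1GB segment reserves 
proportionally more of the sync window than a one-page file.

```c
/*
 * Hypothetical size-weighted progress: bytes[] holds the size of each
 * file pending fsync, and files are assumed to be synced in array
 * order.  Returns the fraction of sync work done after nsynced files,
 * weighted by size instead of by file count.
 */
static double
weighted_sync_progress(const long *bytes, int nfiles, int nsynced)
{
    long done = 0, total = 0;
    int  i;

    for (i = 0; i < nfiles; i++)
    {
        total += bytes[i];
        if (i < nsynced)
            done += bytes[i];
    }
    return (total > 0) ? (double) done / total : 1.0;
}
```

With a thousand 8KB files and one 1GB file, syncing all the small ones first 
still counts as under one percent of the work, which is the behavior you'd 
want here.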

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Spread checkpoint sync

From
Greg Stark
Date:
On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> I ask because I don't have a mental model of how the pause can help.
>> Given that this dirty data has been hanging around for many minutes
>> already, what is a 3 second pause going to heal?
>>
>
> The difference is that once an fsync call is made, dirty data is much more
> likely to be forced out.  It's the one thing that bypasses all other ways
> the kernel might try to avoid writing the data

I had always assumed the problem was that other I/O had been done to
the files in the meantime. I.e. the fsync is not just syncing the
checkpoint but any other blocks that had been flushed since the
checkpoint started. The longer the checkpoint is spread over the more
other I/O is included as well.

Using sync_file_range you can specify the set of blocks to sync and
then block on them only after some time has passed. But there's no
documentation on how this relates to the I/O scheduler so it's not
clear it would have any effect on the problem. We might still have to
delay the beginning of the sync to allow the dirty blocks to be synced
naturally, and then when we issue it we might still end up catching a lot
of other I/O as well.
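For reference, the two-phase pattern described above looks roughly like this 
(Linux-only, since sync_file_range() is not portable; the scratch file and 
sizes are just for illustration):

```c
#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Dirty a scratch file, start writeback of the known range without
 * blocking, then later block until that range is actually on disk.
 * Returns 0 on success, -1 on any failure.
 */
static int
demo_spread_sync(void)
{
    char buf[8192];
    char path[] = "/tmp/sfr_demoXXXXXX";
    int  fd = mkstemp(path);
    int  rc = -1;

    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof(buf));
    if (write(fd, buf, sizeof(buf)) == (ssize_t) sizeof(buf) &&
        /* Phase 1: kick off asynchronous writeback of these pages. */
        sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) == 0 &&
        /* ... other work happens here, I/O scheduler permitting ... */
        /* Phase 2: wait until the same range has reached disk. */
        sync_file_range(fd, 0, sizeof(buf),
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) == 0)
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```

Whether the kernel actually schedules that writeback any differently than an 
eventual fsync() would is exactly the undocumented part.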




--
greg


Re: Spread checkpoint sync

From
Josh Berkus
Date:
> Using sync_file_range you can specify the set of blocks to sync and
> then block on them only after some time has passed. But there's no
> documentation on how this relates to the I/O scheduler so it's not
> clear it would have any effect on the problem. We might still have to
> delay the beginning of the sync to allow the dirty blocks to be synced
> naturally, and then when we issue it we might still end up catching a lot
> of other I/O as well.

This *really* sounds like we should be working with the FS geeks on
making the OS do this work for us.  Greg, you wanna go to LinuxCon next
year?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Thu, Dec 2, 2010 at 2:24 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> I ask because I don't have a mental model of how the pause can help.
>>> Given that this dirty data has been hanging around for many minutes
>>> already, what is a 3 second pause going to heal?
>>>
>>
>> The difference is that once an fsync call is made, dirty data is much more
>> likely to be forced out.  It's the one thing that bypasses all other ways
>> the kernel might try to avoid writing the data
>
> I had always assumed the problem was that other I/O had been done to
> the files in the meantime. I.e. the fsync is not just syncing the
> checkpoint but any other blocks that had been flushed since the
> checkpoint started.

It strikes me that we might start the syncs of the files that the
checkpoint isn't going to dirty further at the start of the
checkpoint, and do the rest at the end.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Greg Stark wrote:
> Using sync_file_range you can specify the set of blocks to sync and
> then block on them only after some time has passed. But there's no
> documentation on how this relates to the I/O scheduler so it's not
> clear it would have any effect on the problem. 

I believe this is the exact spot we're stalled at in regards to getting 
this improved on the Linux side, as I understand it at least.  *The* 
answer for this class of problem on Linux is to use sync_file_range, and 
I don't think we'll ever get any sympathy from those kernel developers 
until we do.  But that's a Linux specific call, so doing that is going 
to add a write path fork with platform-specific code into the database.  
If I thought sync_file_range was a silver bullet guaranteed to make this 
better, maybe I'd go for that.  I think there's some relatively 
low-hanging fruit on the database side that would do better before going 
to that extreme though, thus the patch.

> We might still have to delay the beginning of the sync to allow the dirty blocks to be synced
> naturally and then when we issue it still end up catching a lot of
> other i/o as well.
>   

Whether it's "lots" or not is really workload dependent.  I work from 
the assumption that the blocks being written out by the checkpoint are 
the most popular ones in the database, the ones that accumulate a high 
usage count and stay there.  If that's true, my guess is that the writes 
being done while the checkpoint is executing are a bit less likely to be 
touching the same files.  You raise a valid concern, I just haven't seen 
that actually happen in practice yet.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us




Re: Spread checkpoint sync

From
Greg Smith
Date:
Heikki Linnakangas wrote:
> If you fsync() a file with one dirty page in it, it's going to return 
> very quickly, but a 1GB file will take a while. That could be 
> problematic if you have a thousand small files and a couple of big 
> ones, as you would want to reserve more time for the big ones. I'm not 
> sure what to do about it, maybe it's not a problem in practice.

It's a problem in practice all right, with bulk loading being the main 
situation where you'll hit it.  If somebody is running a giant COPY 
to populate a table at the time the checkpoint starts, there's probably 
a 1GB file of unsynced dirty data around there somewhere.  I 
think doing anything about that situation requires an additional leap in 
thinking about buffer cache eviction and fsync absorption though.  
Ultimately I think we'll end up doing sync calls for relations that have 
gone "cold" for a while all the time as part of BGW activity, not just 
at checkpoint time, to try and avoid this whole area better.  That's a 
lot more than I'm trying to do in my first pass of improvements though.

In the interest of cutting the number of messy items left in the 
official CommitFest, I'm going to mark my patch here "Returned with 
Feedback" and continue working in the general direction I was already 
going.  Concept shared, underlying patches continue to advance, good 
discussion around it; those were my goals for this CF and I think we're 
there.

I have a good idea how to autotune the sync spread that's hardcoded in 
the current patch.  I'll work on finishing that up and organizing some 
more extensive performance tests.  Right now I'm more concerned about 
finishing the tests around the wal_sync_method issues, which are related 
to this and need to get sorted out a bit more urgently.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us




Re: Spread checkpoint sync

From
Rob Wultsch
Date:
On Sun, Dec 5, 2010 at 2:53 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Heikki Linnakangas wrote:
>>
>> If you fsync() a file with one dirty page in it, it's going to return very
>> quickly, but a 1GB file will take a while. That could be problematic if you
>> have a thousand small files and a couple of big ones, as you would want to
>> reserve more time for the big ones. I'm not sure what to do about it, maybe
>> it's not a problem in practice.
>
> It's a problem in practice all right, with the bulk-loading situation being
> the main one you'll hit it.  If somebody is running a giant COPY to populate
> a table at the time the checkpoint starts, there's probably a 1GB file of
> dirty data that's unsynced around there somewhere.  I think doing anything
> about that situation requires an additional leap in thinking about buffer
> cache eviction and fsync absorption though.  Ultimately I think we'll end
> up doing sync calls for relations that have gone "cold" for a while all the
> time as part of BGW activity, not just at checkpoint time, to try and avoid
> this whole area better.  That's a lot more than I'm trying to do in my first
> pass of improvements though.
>
> In the interest of cutting the number of messy items left in the official
> CommitFest, I'm going to mark my patch here "Returned with Feedback" and
> continue working in the general direction I was already going.  Concept
> shared, underlying patches continue to advance, good discussion around it;
> those were my goals for this CF and I think we're there.
>
> I have a good idea how to autotune the sync spread that's hardcoded in the
> current patch.  I'll work on finishing that up and organizing some more
> extensive performance tests.  Right now I'm more concerned about finishing
> the tests around the wal_sync_method issues, which are related to this and
> need to get sorted out a bit more urgently.
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
>

Forgive me, but is all of this a step on the slippery slope to
direct I/O?  And is this a bad thing?


--
Rob Wultsch
wultsch@gmail.com


Re: Spread checkpoint sync

From
Greg Smith
Date:
Rob Wultsch wrote:
> Forgive me, but is all of this a step on the slippery slope to
> direct I/O?  And is this a bad thing?

I don't really think so.  There's an important difference in my head 
between direct I/O, where the kernel is told "write this immediately!", 
and what I'm trying to achieve.  I want to give the kernel an opportunity 
to write blocks out in an efficient way, so that it can take advantage 
of elevator sorting, write combining, and similar tricks.  But, 
eventually, those writes have to make it out to disk.  Linux claims to 
have concepts like a "deadline" for I/O to happen, but they turn out to 
not be so effective once the system gets backed up with enough writes.  
Since fsync time is the only effective deadline, I'm progressing from 
the standpoint that adjusting when it happens relative to the write will 
help, while still allowing the kernel an opportunity to get the writes 
out on its own schedule.

What ends up happening if you push toward fully sync I/O is the design 
you see in some other databases, where you need multiple writer 
processes.  Then requests for new pages can continue to allocate as 
needed, while keeping any one write from blocking things.  That's one 
sort of a way to simulate asynchronous I/O, and you can substitute true 
async I/O instead in many of those implementations.  We didn't have much 
luck with portability on async I/O when that was last experimented with, 
and having multiple background writer processes seems like overkill; 
that whole direction worries me.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us




Re: Spread checkpoint sync

From
Alvaro Herrera
Date:
Excerpts from Greg Smith's message of dom dic 05 20:02:48 -0300 2010:

> What ends up happening if you push toward fully sync I/O is the design 
> you see in some other databases, where you need multiple writer 
> processes.  Then requests for new pages can continue to allocate as 
> needed, while keeping any one write from blocking things.  That's one 
> sort of a way to simulate asynchronous I/O, and you can substitute true 
> async I/O instead in many of those implementations.  We didn't have much 
> luck with portability on async I/O when that was last experimented with, 
> and having multiple background writer processes seems like overkill; 
> that whole direction worries me.

Why would multiple bgwriter processes worry you?

Of course, it wouldn't work to have multiple processes trying to execute
a checkpoint simultaneously, but what if we separated the tasks so that
one process is in charge of checkpoints, and another one is in charge of
the LRU scan?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Spread checkpoint sync

From
Greg Smith
Date:
Alvaro Herrera wrote:
> Why would multiple bgwriter processes worry you?
>
> Of course, it wouldn't work to have multiple processes trying to execute
> a checkpoint simultaneously, but what if we separated the tasks so that
> one process is in charge of checkpoints, and another one is in charge of
> the LRU scan?
>   

I was commenting more in the context of development resource 
allocation.  Moving toward that design would be helpful, but it alone 
isn't enough to improve the checkpoint sync issues.  My concern is that 
putting work into that area will be a distraction from making progress 
on those.  If individual syncs take so long that the background writer 
gets lost for a while executing them, and therefore doesn't do LRU 
cleanup, you've got a problem that LRU-related improvements probably 
aren't enough to solve.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Simon Riggs
Date:
On Mon, 2010-12-06 at 23:26 -0300, Alvaro Herrera wrote:

> Why would multiple bgwriter processes worry you?

Because it complicates the tracking of files requiring fsync.

As Greg says, the last attempt to do that was a lot of code.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the
> time now has made me reconsider how important one potential bit of
> refactoring here would be.  I managed to catch one of the situations where
> really popular relations were being heavily updated in a way that was
> competing with the checkpoint on my test system (which I can happily share
> the logs of), with the instrumentation patch applied but not the spread sync
> one:
>
> LOG:  checkpoint starting: xlog
> DEBUG:  could not forward fsync request because request queue is full
> CONTEXT:  writing block 7747 of relation base/16424/16442
> DEBUG:  could not forward fsync request because request queue is full
> CONTEXT:  writing block 42688 of relation base/16424/16437
> DEBUG:  could not forward fsync request because request queue is full
> CONTEXT:  writing block 9723 of relation base/16424/16442
> DEBUG:  could not forward fsync request because request queue is full
> CONTEXT:  writing block 58117 of relation base/16424/16437
> DEBUG:  could not forward fsync request because request queue is full
> CONTEXT:  writing block 165128 of relation base/16424/16437
> [330 of these total, all referring to the same two relations]
>
> DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm
> time=10132.830000 msec
> DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
> DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
> DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
> DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
> DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
> DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
> DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000
> msec
> DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000
> msec
> DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
> DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
> DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000
> msec
> DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000
> msec
> LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s,
> total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s
>
> Note here how the checkpoint was hung on trying to get 16448_fsm written
> out, but the backends were issuing constant competing fsync calls to these
> other relations.  This is very similar to the production case this patch was
> written to address, which I hadn't been able to share a good example of yet.
>  That's essentially what it looks like, except with the contention going on
> for minutes instead of seconds.
>
> One of the ideas Simon and I had been considering at one point was adding
> some better de-duplication logic to the fsync absorb code, which I'm
> reminded by the pattern here might be helpful independently of other
> improvements.

Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would
be today while en route from point A to point B.  The answer turned
out to be "not very", so PFA a patch that seems to work.  I tested it
by attaching gdb to the background writer while running pgbench, and
it eliminated the backend fsyncs without even breaking a sweat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> One of the ideas Simon and I had been considering at one point was adding
>> some better de-duplication logic to the fsync absorb code, which I'm
>> reminded by the pattern here might be helpful independently of other
>> improvements.
>
> Hopefully I'm not stepping on any toes here, but I thought this was an
> awfully good idea and had a chance to take a look at how hard it would
> be today while en route from point A to point B.  The answer turned
> out to be "not very", so PFA a patch that seems to work.  I tested it
> by attaching gdb to the background writer while running pgbench, and
> it eliminated the backend fsyncs without even breaking a sweat.

No toe damage, this is great; I hadn't gotten to coding for this angle 
yet at all.  Suffering from an overload of ideas and (mostly wasted) 
test data, so thanks for exploring this concept and proving it works.

I'm not sure what to do with the rest of the work I've been doing in 
this area, so I'm tempted to just combine this new bit from you with the 
older patch I submitted, streamline, and see if that makes sense.  
Expected to be there already, then "how about spending 5 minutes first 
checking out that autovacuum lock patch again" turned out to be a wild 
underestimate.

Part of the problem is that it's become obvious to me over the last 
month that right now is a terrible time to be doing Linux benchmarks 
that impact filesystem sync behavior.  The recent kernel changes showing 
up in the next rev of the enterprise distributions--like RHEL6 and 
Debian Squeeze, both working to get a stable 2.6.32--have made testing a 
nightmare.  I don't want to dump a lot of time into optimizing for 
pre-2.6.32 kernels if this problem changes its form in newer ones, but 
the distributions built around newer kernels are just not fully baked 
enough yet to tell.  For example, the pre-release Squeeze numbers we're 
seeing are awful so far, but it's not really done yet either.  I expect 
that 3-6 months from today all this will have settled down enough that I 
can make some sense of it.  Lately my work with the latest distributions 
has just been burning time installing stuff that doesn't work quite 
right yet.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> No toe damage, this is great, I hadn't gotten to coding for this angle yet
> at all.  Suffering from an overload of ideas and (mostly wasted) test data,
> so thanks for exploring this concept and proving it works.

Yeah - obviously I want to make sure that someone reviews the logic
carefully, since a loss of fsyncs or a corruption of the request queue
could affect system stability, but only very rarely, since you'd need
full fsync queue + crash.  But the code is pretty simple, so it should
be possible to convince ourselves as to its correctness (or
otherwise).  Obviously, major credit to you and Simon for identifying
the problem and coming up with a proposed fix.

> I'm not sure what to do with the rest of the work I've been doing in this
> area here, so I'm tempted to just combine this new bit from you with the
> older patch I submitted, streamline, and see if that makes sense.  Expected
> to be there already, then "how about spending 5 minutes first checking out
> that autovacuum lock patch again" turned out to be a wild underestimate.

I'd rather not combine the patches, because this one is pretty simple
and just does one thing, but feel free to write something that applies
over top of it.  Looking through your old patch (sync-spread-v3),
there seem to be a couple of components there:

- Compact the fsync queue based on percentage fill rather than number
of writes per absorb.  If we apply my queue-compacting logic, do we
still need this?  The queue compaction may hold the BgWriterCommLock
for slightly longer than AbsorbFsyncRequests() would, but I'm not
inclined to jump to the conclusion that this is worth getting excited
about.  The whole idea of accessing BgWriterShmem->num_requests
without the lock gives me the willies anyway - sure, it'll probably
work OK most of the time, especially on x86, but it seems hard to
predict whether there will be occasional bad behavior on platforms
with weak memory ordering.

- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests().
Not sure what the motivation for this is.

- CheckpointSyncDelay(), to make sure that we absorb fsync requests
and free up buffers during a long checkpoint.  I think this part is
clearly valuable, although I'm not sure we've satisfactorily solved
the problem of how to spread out the fsyncs and still complete the
checkpoint on schedule.

As to that, I have a couple of half-baked ideas I'll throw out so you
can laugh at them.  Some of these may be recycled versions of ideas
you've already had/mentioned, so, again, credit to you for getting the
ball rolling.

Idea #1: When we absorb fsync requests, don't just remember that there
was an fsync request; also remember the time of said fsync request.
If a new fsync request arrives for a segment for which we're already
remembering an fsync request, update the timestamp on the request.
Periodically scan the fsync request queue for requests older than,
say, 30 s, and perform one such request.   The idea is - if we wrote a
bunch of data to a relation and then haven't touched it for a while,
force it out to disk before the checkpoint actually starts so that the
volume of work required by the checkpoint is lessened.
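A hypothetical data structure for that (the real queue lives in shared memory 
and is keyed by relation/segment, not a plain int):

```c
#include <time.h>

/*
 * Hypothetical aged fsync request: the timestamp is set when the
 * request first arrives and refreshed whenever a duplicate shows up.
 */
typedef struct
{
    int    rel_id;          /* stand-in for the real relation/segment key */
    time_t last_request;    /* time of the most recent request */
} PendingSync;

/*
 * Return the index of one request at least max_age_secs old, or -1 if
 * none qualifies; the caller would fsync that file ahead of the
 * checkpoint.
 */
static int
find_stale_request(const PendingSync *q, int n, time_t now, int max_age_secs)
{
    int i;

    for (i = 0; i < n; i++)
        if (now - q[i].last_request >= max_age_secs)
            return i;
    return -1;
}
```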

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs.  Start by
performing any fsyncs that need to happen but have no buffers to sync
(i.e. everything that must be written to that relation has already
been written).  Then, start performing the writes, decrementing the
pending-write counters as you go.  If the pending-write count for a
relation hits zero, you can add it to the list of fsyncs that can be
performed before the writes are finished.  If it doesn't hit zero
(perhaps because a non-bgwriter process wrote a buffer that we were
going to write anyway), then we'll do it at the end.  One problem with
this - aside from complexity - is that most likely most fsyncs would
either happen at the beginning or very near the end, because there's
no reason to assume that buffers for the same relation would be
clustered together in shared_buffers.  But I'm inclined to think that
at least the idea of performing fsyncs for which no dirty buffers
remain in shared_buffers at the beginning of the checkpoint rather
than at the end might have some value.

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
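Idea #3 is just arithmetic; a sketch with the numbers from the patch:

```c
/*
 * Back up the write-phase deadline by the time the sync phase is
 * expected to consume: nsyncs fsyncs with a fixed sync_delay seconds
 * between each.  All times are in seconds.
 */
static double
write_phase_deadline(double checkpoint_deadline, int nsyncs, double sync_delay)
{
    return checkpoint_deadline - (double) nsyncs * sync_delay;
}
```

So a checkpoint due at t=300s with 10 files to sync and the current 
hard-coded 3 s delay would need its writes done by t=270s.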

Idea #4: For ext3 filesystems that like to dump the entire buffer
cache instead of only the requested file, write a little daemon that
runs alongside (and completely independently of) PostgreSQL.  Every
30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
closes the file, thus dumping the cache and preventing a ridiculous
growth in the amount of data to be sync'd at checkpoint time.
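Taken literally, one iteration of that daemon is only a few lines (the path 
here is arbitrary; wrap the call in a loop with sleep(30) to get the actual 
daemon):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Rewrite and fsync a one-byte file; on an ext3 filesystem with the
 * cache-dumping behavior described above, this drags the rest of the
 * dirty cache out with it.  Returns 0 on success, -1 on failure.
 */
static int
tickle_once(const char *path, char byte)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, &byte, 1) != 1 || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```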

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Simon Riggs
Date:
On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
> Robert Haas wrote: 
> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> >   
> > > One of the ideas Simon and I had been considering at one point was adding
> > > some better de-duplication logic to the fsync absorb code, which I'm
> > > reminded by the pattern here might be helpful independently of other
> > > improvements.
> > >     
> > 
> > Hopefully I'm not stepping on any toes here, but I thought this was an
> > awfully good idea and had a chance to take a look at how hard it would
> > be today while en route from point A to point B.  The answer turned
> > out to be "not very", so PFA a patch that seems to work.  I tested it
> > by attaching gdb to the background writer while running pgbench, and
> > it eliminated the backend fsyncs without even breaking a sweat.
> >   
> 
> No toe damage, this is great, I hadn't gotten to coding for this angle
> yet at all.  Suffering from an overload of ideas and (mostly wasted)
> test data, so thanks for exploring this concept and proving it works.

No toe damage either, but are we sure we want the de-duplication logic
and in this place?

I was originally of the opinion that de-duplicating the list would save
time in the bgwriter, but that guess was wrong by about two orders of
magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
>> Robert Haas wrote:
>> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> >
>> > > One of the ideas Simon and I had been considering at one point was adding
>> > > some better de-duplication logic to the fsync absorb code, which I'm
>> > > reminded by the pattern here might be helpful independently of other
>> > > improvements.
>> > >
>> >
>> > Hopefully I'm not stepping on any toes here, but I thought this was an
>> > awfully good idea and had a chance to take a look at how hard it would
>> > be today while en route from point A to point B.  The answer turned
>> > out to be "not very", so PFA a patch that seems to work.  I tested it
>> > by attaching gdb to the background writer while running pgbench, and
>> > it eliminated the backend fsyncs without even breaking a sweat.
>> >
>>
>> No toe damage, this is great, I hadn't gotten to coding for this angle
>> yet at all.  Suffering from an overload of ideas and (mostly wasted)
>> test data, so thanks for exploring this concept and proving it works.
>
> No toe damage either, but are we sure we want the de-duplication logic
> and in this place?
>
> I was originally of the opinion that de-duplicating the list would save
> time in the bgwriter, but that guess was wrong by about two orders of
> magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

Well, the point of this is not to save time in the bgwriter - I'm not
surprised to hear that wasn't noticeable.  The point is that when the
fsync request queue fills up, backends start performing an fsync *for
every block they write*, and that's about as bad for performance as
it's possible to be.  So it's worth going to a little bit of trouble
to try to make sure it doesn't happen.  It didn't happen *terribly*
frequently before, but it does seem to be common enough to worry about
- e.g. on one occasion, I was able to reproduce it just by running
pgbench -i -s 25 or something like that on a laptop.

With this patch applied, there's no performance impact vs. current
code in the very, very common case where space remains in the queue -
999 times out of 1000, writing to the fsync queue will be just as fast
as ever.  But in the unusual case where the queue has been filled up,
compacting the queue is much much faster than performing an fsync, and
the best part is that the compaction is generally massive.  I was
seeing things like "4096 entries compressed to 14".  So clearly even
if the compaction took as long as the fsync itself it would be worth
it, because the next 4000+ guys who come along again go through the
fast path.  But in fact I think it's much faster than an fsync.

In order to get pathological behavior even with this patch applied,
you'd need to have NBuffers pending fsync requests and they'd all have
to be different.  I don't think that's theoretically impossible, but
Greg's research seems to indicate that even on busy systems we don't
come even a little bit close to the circumstances that would cause it
to occur in practice.  Every other change we might make in this area
will further improve this case, too: for example, doing an absorb
after each fsync would presumably help, as would the more drastic step
of splitting the bgwriter into two background processes (one to do
background page cleaning, and the other to do checkpoints, for
example).  But even without those sorts of changes, I think this is
enough to effectively eliminate the full fsync queue problem in
practice, which seems worth doing independently of anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> Idea #2: At the beginning of a checkpoint when we scan all the
> buffers, count the number of buffers that need to be synced for each
> relation.  Use the same hashtable that we use for tracking pending
> fsync requests.  Then, interleave the writes and the fsyncs...
>
> Idea #3: Stick with the idea of a fixed delay between fsyncs, but
> compute how many fsyncs you think you're ultimately going to need at
> the start of the checkpoint, and back up the target completion time by
> 3 s per fsync from the get-go, so that the checkpoint still finishes
> on schedule.
>

What I've been working on is something halfway between these two ideas.
I have a patch, and it doesn't work right yet because I just broke it,
but since I have some faint hope this will all come together any minute
now I'm going to share it before someone announces a deadline has passed
or something.  (whistling).  I'm going to add this messy thing and the
patch you submitted upthread to the CF list; I'll review yours, I'll
either fix the remaining problem in this one myself or rewrite to one of
your ideas, and then it's onto a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was
to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't reliably work for everyone because of the now well-understood
ext3 issues--I never replicated that speedup at the time, for example.
And this was before the spread checkpoint code was in 8.3.
The hope was that it wasn't really going to be necessary after that anyway.

Back to today...instead of something complicated, it struck me that if I
just had a count of exactly how many files were involved in each
checkpoint, that would be helpful.  I could keep the idea of a fixed
delay between fsyncs, but just auto-tune that delay amount based on the
count.  And how do you count the number of unique things in a list?
Well, you can always sort them.  I thought that if the sorted writes
patch got back to functional again, it could serve two purposes.  It
would group all of the writes for a file together, and if you did the
syncs in the same sorted order they would have the maximum odds of
discovering the data was already written.  So rather than this possible
order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get
this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been
written before the associated sync comes along.

Also, I could just traverse the sorted list with some simple logic to
count the number of unique files, and then set the delay between fsync
writes based on it.  In the above, once the list was sorted, easy to
just see how many times the table name changes on a linear scan of the
sorted data.  3 files, so if the checkpoint target gives me, say, a
minute of time to sync them, I can delay 20 seconds between.  Simple
math, and exactly the sort I used to get reasonable behavior on the busy
production system this all started on.  There's some unresolved
trickiness in the segment-driven checkpoint case, but one thing at a time.

So I fixed the bitrot on the old sorted patch, which was fun as it came
from before the 8.3 changes.  It seemed to work.  I then moved the
structure it uses to hold the list of buffers to write, the thing that's
sorted, into shared memory.  It's got a predictable maximum size,
relying on palloc in the middle of the checkpoint code seems bad, and
there's some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

 BadArgument("!(((header->context) != ((void *)0) &&
(((((Node*)((header->context)))->type) == T_AllocSetContext))))", File:
"mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used
up my window of time wandering down.

There's good bits in the patch I submitted for the last CF and in the
patch you wrote earlier this week.  This unfinished patch may be a
valuable idea to fit in there too once I fix it, or maybe it's
fundamentally flawed and one of the other ideas you suggested (or I have
sitting on the potential design list) will work better.  There's a patch
integration problem that needs to be solved here, but I think almost all
the individual pieces are available.  I'd hate to see this fail to get
integrated now just for lack of time, considering the problem is so
serious when you run into it.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..c8c0f67 100644
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
***************
*** 20,25 ****
--- 20,26 ----

  BufferDesc *BufferDescriptors;
  char       *BufferBlocks;
+ BufAndTag  *BufferTags;
  int32       *PrivateRefCount;


*************** int32       *PrivateRefCount;
*** 72,79 ****
  void
  InitBufferPool(void)
  {
!     bool        foundBufs,
!                 foundDescs;

      BufferDescriptors = (BufferDesc *)
          ShmemInitStruct("Buffer Descriptors",
--- 73,81 ----
  void
  InitBufferPool(void)
  {
!     bool        foundBufs;
!     bool        foundDescs;
!     bool        foundTags;

      BufferDescriptors = (BufferDesc *)
          ShmemInitStruct("Buffer Descriptors",
*************** InitBufferPool(void)
*** 83,92 ****
          ShmemInitStruct("Buffer Blocks",
                          NBuffers * (Size) BLCKSZ, &foundBufs);

!     if (foundDescs || foundBufs)
      {
!         /* both should be present or neither */
!         Assert(foundDescs && foundBufs);
          /* note: this path is only taken in EXEC_BACKEND case */
      }
      else
--- 85,98 ----
          ShmemInitStruct("Buffer Blocks",
                          NBuffers * (Size) BLCKSZ, &foundBufs);

!     BufferTags = (BufAndTag *)
!         ShmemInitStruct("Dirty Buffer Tags",
!                         NBuffers * sizeof(BufAndTag), &foundTags);
!
!     if (foundDescs || foundBufs || foundTags)
      {
!         /* all should be present or none */
!         Assert(foundDescs && foundBufs && foundTags);
          /* note: this path is only taken in EXEC_BACKEND case */
      }
      else
*************** BufferShmemSize(void)
*** 171,176 ****
--- 177,185 ----
      /* size of data pages */
      size = add_size(size, mul_size(NBuffers, BLCKSZ));

+     /* size of checkpoint buffer tags */
+     size = add_size(size, mul_size(NBuffers, sizeof(BufAndTag)));
+
      /* size of stuff controlled by freelist.c */
      size = add_size(size, StrategyShmemSize());

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..bd779bf 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1158,1181 ----
      }
  }

+ static int
+ bufcmp(const void *a, const void *b)
+ {
+     const BufAndTag *lhs = (const BufAndTag *) a;
+     const BufAndTag *rhs = (const BufAndTag *) b;
+     int        r;
+
+     r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+     if (r != 0)
+         return r;
+     if (lhs->tag.blockNum < rhs->tag.blockNum)
+         return -1;
+     else if (lhs->tag.blockNum > rhs->tag.blockNum)
+         return 1;
+     else
+         return 0;
+ }
+
  /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
*************** static void
*** 1171,1180 ****
  BufferSync(int flags)
  {
      int            buf_id;
-     int            num_to_scan;
      int            num_to_write;
      int            num_written;
      int            mask = BM_DIRTY;

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1189,1202 ----
  BufferSync(int flags)
  {
      int            buf_id;
      int            num_to_write;
      int            num_written;
      int            mask = BM_DIRTY;
+     int            dirty_buf;
+     int            dirty_files;
+     Oid            last_seen_rel;
+     ForkNumber  last_seen_fork;
+     BlockNumber last_seen_block;

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1238,1245 ----
          if ((bufHdr->flags & mask) == mask)
          {
              bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+             BufferTags[num_to_write].buf_id = buf_id;
+             BufferTags[num_to_write].tag = bufHdr->tag;
              num_to_write++;
          }

*************** BufferSync(int flags)
*** 1225,1246 ****
      if (num_to_write == 0)
          return;                    /* nothing to do */

      TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);

      /*
       * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
!      * since we might as well dump soon-to-be-recycled buffers first.
       *
       * Note that we don't read the buffer alloc count here --- that should be
       * left untouched till the next BgBufferSync() call.
!      */
!     buf_id = StrategySyncStart(NULL, NULL);
!     num_to_scan = NBuffers;
      num_written = 0;
!     while (num_to_scan-- > 0)
!     {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
--- 1249,1307 ----
      if (num_to_write == 0)
          return;                    /* nothing to do */

+     /*
+      * Sort the list of buffers to write.  It's then straightforward to
+      * count the approximate number of files involved.  There may be
+      * some small error from buffers that turn out to be skipped below,
+      * but for the purpose the file count is used for, that's acceptable.
+      */
+     qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+
+     /*
+      * Count the number of unique node/fork combinations, relying on the
+      * sorted order
+      */
+
+     /* Initialize with the first entry in the dirty buffer list */
+     last_seen_rel = BufferTags[0].tag.rnode.relNode;
+     last_seen_fork = BufferTags[0].tag.forkNum;
+     last_seen_block = BufferTags[0].tag.blockNum;
+     dirty_files = 1;
+
+     for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+     {
+         if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+             (last_seen_fork != BufferTags[dirty_buf].tag.forkNum))
+         {
+             last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+             last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+             dirty_files++;
+         }
+     }
+
+     /*
+      * TODO:  This doesn't account for the fact that blocks might span multiple
+      * files within the same relation yet.
+      */
+
+     elog(DEBUG1, "BufferSync found %d buffers to write involving %d files",
+          num_to_write, dirty_files);
+
      TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);

      /*
       * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.
       *
       * Note that we don't read the buffer alloc count here --- that should be
       * left untouched till the next BgBufferSync() call.
!      */
      num_written = 0;
!     for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
!     {
!         volatile BufferDesc *bufHdr;
!         buf_id = BufferTags[dirty_buf].buf_id;
!         bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
                  num_written++;

                  /*
-                  * We know there are at most num_to_write buffers with
-                  * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
-                  * num_written reaches num_to_write.
-                  *
-                  * Note that num_written doesn't include buffers written by
-                  * other backends, or by the bgwriter cleaning scan. That
-                  * means that the estimate of how much progress we've made is
-                  * conservative, and also that this test will often fail to
-                  * trigger.  But it seems worth making anyway.
-                  */
-                 if (num_written >= num_to_write)
-                     break;
-
-                 /*
                   * Perform normal bgwriter duties and sleep to throttle our
                   * I/O rate.
                   */
--- 1324,1329 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
                                       (double) num_written / num_to_write);
              }
          }
-
-         if (++buf_id >= NBuffers)
-             buf_id = 0;
      }

      /*
--- 1331,1336 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0652bdf..1c9c910 100644
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 167,175 ****
--- 167,185 ----
  #define LockBufHdr(bufHdr)        SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)    SpinLockRelease(&(bufHdr)->buf_hdr_lock)

+ /*
+  * Checkpoint time mapping between the buffer id values and the associated
+  * buffer tags of dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+     int            buf_id;
+     BufferTag    tag;
+ } BufAndTag;

  /* in buf_init.c */
  extern PGDLLIMPORT BufferDesc *BufferDescriptors;
+ extern PGDLLIMPORT BufAndTag *BufferTags;

  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;

Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Once upon a time we got a patch from Itagaki Takahiro whose purpose was to
> sort writes before sending them out:
>
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

> Which has very low odds of the sync on "a" finishing quickly, we'd get this
> one:
>
> table block
> a 1
> a 2
> b 1
> b 2
> c 1
> c 2
> sync a
> sync b
> sync c
>
> Which sure seems like a reasonable way to improve the odds data has been
> written before the associated sync comes along.

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.  It seems to me that it's
going to be a bear to figure out what fraction of the checkpoint
you've completed if you put all of the syncs at the end, and this
whole problem appears to be predicated on the assumption that the OS
*isn't* writing out in a timely fashion.  Are we sure that postponing
the fsync relative to the writes is anything more than wishful
thinking?

> Also, I could just traverse the sorted list with some simple logic to count
> the number of unique files, and then set the delay between fsync writes
> based on it.  In the above, once the list was sorted, easy to just see how
> many times the table name changes on a linear scan of the sorted data.  3
> files, so if the checkpoint target gives me, say, a minute of time to sync
> them, I can delay 20 seconds between.  Simple math, and exactly the sort I

How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.

> So I fixed the bitrot on the old sorted patch, which was fun as it came from
> before the 8.3 changes.  It seemed to work.  I then moved the structure it
> uses to hold the list of buffers to write, the thing that's sorted, into
> shared memory.  It's got a predictable maximum size, relying on palloc in
> the middle of the checkpoint code seems bad, and there's some potential gain
> from not reallocating it every time through.

Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.

> There's good bits in the patch I submitted for the last CF and in the patch
> you wrote earlier this week.  This unfinished patch may be a valuable idea
> to fit in there too once I fix it, or maybe it's fundamentally flawed and
> one of the other ideas you suggested (or I have sitting on the potential
> design list) will work better.  There's a patch integration problem that
> needs to be solved here, but I think almost all the individual pieces are
> available.  I'd hate to see this fail to get integrated now just for lack of
> time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either.  I
think we're in a position to crush the full-fsync-queue problem flat
(my patch should do that, and there are several other obvious things
we can do for extra certainty) but the problem of spreading out the
fsyncs looks to me like something we don't completely know how to
solve.  If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> I'll believe it when I see it.  How about this:
>
> a 1
> a 2
> sync a
> b 1
> b 2
> sync b
> c 1
> c 2
> sync c
>
> Or maybe some variant, where we become willing to fsync a file a
> certain number of seconds after writing the last block, or when all
> the writes are done, whichever comes first.

That's going to give worse performance than the current code in some 
cases.  The goal of what's in there now is that you get a sequence like 
this:

a1
b1
a2
[Filesystem writes a1]
b2
[Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so 
large that it becomes lazier about writing things.  The fundamental 
idea--push writes out some time before the sync, in hopes the filesystem 
will get to them before the sync arrives--is not unsound.  On some systems, 
doing the sync more aggressively than that will be a regression.  This 
approach just breaks down in some cases, and those cases are happening 
more now because their likelihood scales with total RAM.  I don't want 
to screw the people with smaller systems, who may be getting 
considerable benefit from the existing sequence.  Today's little 
systems--which are very similar to the high-end ones the spread 
checkpoint stuff was developed on during 8.3--do get some benefit from 
it as far as I know.

Anyway, now that the ability to get logging on all this stuff went in 
during the last CF, it's way easier to just set up a random system to run 
tests in this area than it used to be.  Whatever testing does happen 
should include, say, a 2GB laptop with a single hard drive in it.  I 
think that's the bottom of what is reasonable to consider a target 
for tweaking write performance on, given the hardware 9.1 is likely 
to be deployed on.

> How does the checkpoint target give you any time to sync them?  Unless
> you squeeze the writes together more tightly, but that seems sketchy.
>   

Obviously the checkpoint target idea needs to be shuffled around some 
too.  I was thinking of making the new default 0.8, and having it split 
the time in half for write and sync.  That will make the write phase 
close to the speed people are seeing now, at the default of 0.5, while 
giving some window for spread sync too.  The exact way to redistribute 
that around I'm not so concerned about yet.  When I get to where that's 
the most uncertain thing left I'll benchmark the TPS vs. latency 
trade-off and see what happens.  If the rest of the code is good enough 
but this just needs to be tweaked, that's a perfect thing to get beta 
feedback to finalize.

> Well you don't have to put it in shared memory on account of any of
> that.  You can just hang it on a global variable.
>   

Hmm.  Because it's so similar to other things being allocated in shared 
memory, I just automatically pushed it over to there.  But you're right; 
it doesn't need to be that complicated.  Nobody is touching it but the 
background writer.

> If we can find something that's a modest improvement on the
> status quo and we can be confident in quickly, good, but I'd rather
> have 9.1 go out the door on time without fully fixing this than delay
> the release.
>   

I'm not somebody who needs to be convinced of that.  There are two 
near-commit-quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is 
happening, rather than starting it and ignoring everything else until 
it's done.

2) Compact fsync requests when the queue fills

If that's all we can get for 9.1, it will still be a major improvement.  
I realize I only have a very short period of time to complete a major 
integration breakthrough on the pieces floating around before the goal 
here has to drop to something less ambitious.  I head to the West Coast 
for a week on the 23rd; I'll be forced to throw in the towel at that 
point if I can't get the better ideas we have in pieces here all 
assembled well by then.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Simon Riggs
Date:
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote:
> On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
> >> Robert Haas wrote:
> >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> >> >
> >> > > One of the ideas Simon and I had been considering at one point was adding
> >> > > some better de-duplication logic to the fsync absorb code, which I'm
> >> > > reminded by the pattern here might be helpful independently of other
> >> > > improvements.
> >> > >
> >> >
> >> > Hopefully I'm not stepping on any toes here, but I thought this was an
> >> > awfully good idea and had a chance to take a look at how hard it would
> >> > be today while en route from point A to point B.  The answer turned
> >> > out to be "not very", so PFA a patch that seems to work.  I tested it
> >> > by attaching gdb to the background writer while running pgbench, and
> >> > it eliminate the backend fsyncs without even breaking a sweat.
> >> >
> >>
> >> No toe damage, this is great, I hadn't gotten to coding for this angle
> >> yet at all.  Suffering from an overload of ideas and (mostly wasted)
> >> test data, so thanks for exploring this concept and proving it works.
> >
> > No toe damage either, but are we sure we want the de-duplication logic
> > and in this place?
> >
> > I was originally of the opinion that de-duplicating the list would save
> > time in the bgwriter, but that guess was wrong by about two orders of
> > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.
> 
> Well, the point of this is not to save time in the bgwriter - I'm not
> surprised to hear that wasn't noticeable.  The point is that when the
> fsync request queue fills up, backends start performing an fsync *for
> every block they write*, and that's about as bad for performance as
> it's possible to be.  So it's worth going to a little bit of trouble
> to try to make sure it doesn't happen.  It didn't happen *terribly*
> frequently before, but it does seem to be common enough to worry about
> - e.g. on one occasion, I was able to reproduce it just by running
> pgbench -i -s 25 or something like that on a laptop.
> 
> With this patch applied, there's no performance impact vs. current
> code in the very, very common case where space remains in the queue -
> 999 times out of 1000, writing to the fsync queue will be just as fast
> as ever.  But in the unusual case where the queue has been filled up,
> compacting the queue is much much faster than performing an fsync, and
> the best part is that the compaction is generally massive.  I was
> seeing things like "4096 entries compressed to 14".  So clearly even
> if the compaction took as long as the fsync itself it would be worth
> it, because the next 4000+ guys who come along again go through the
> fast path.  But in fact I think it's much faster than an fsync.
> 
> In order to get pathological behavior even with this patch applied,
> you'd need to have NBuffers pending fsync requests and they'd all have
> to be different.  I don't think that's theoretically impossible, but
> Greg's research seems to indicate that even on busy systems we don't
> come even a little bit close to the circumstances that would cause it
> to occur in practice.  Every other change we might make in this area
> will further improve this case, too: for example, doing an absorb
> after each fsync would presumably help, as would the more drastic step
> of splitting the bgwriter into two background processes (one to do
> background page cleaning, and the other to do checkpoints, for
> example).  But even without those sorts of changes, I think this is
> enough to effectively eliminate the full fsync queue problem in
> practice, which seems worth doing independently of anything else.

You've persuaded me.

-- 
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 10:31 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> That's going to give worse performance than the current code in some cases.

OK.

>> How does the checkpoint target give you any time to sync them?  Unless
>> you squeeze the writes together more tightly, but that seems sketchy.
>
> Obviously the checkpoint target idea needs to be shuffled around some too.
>  I was thinking of making the new default 0.8, and having it split the time
> in half for write and sync.  That will make the write phase close to the
> speed people are seeing now, at the default of 0.5, while giving some window
> for spread sync too.  The exact way to redistribute that around I'm not so
> concerned about yet.  When I get to where that's the most uncertain thing
> left I'll benchmark the TPS vs. latency trade-off and see what happens.  If
> the rest of the code is good enough but this just needs to be tweaked,
> that's a perfect thing to get beta feedback to finalize.

That seems like a bad idea - don't we routinely recommend that people
crank this up to 0.9?  You'd be effectively bounding the upper range
of this setting to a value less than the lowest value we
recommend anyone use today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> That seems like a bad idea - don't we routinely recommend that people
> crank this up to 0.9?  You'd be effectively bounding the upper range
> of this setting to a value less than the lowest value we
> recommend anyone use today.
>   

I was just giving an example of how I might do an initial split.  
There's a checkpoint happening now at time T; we have a rough idea that 
it needs to be finished before some upcoming time T+D.  Currently with 
default parameters this becomes:

Write:  0.5 * D; Sync:  0

Even though Sync obviously doesn't take zero.  The slop here is enough 
that it usually works anyway.

I was suggesting that a quick reshuffling to:

Write:  0.4 * D; Sync:  0.4 * D

Might be a good first candidate for how to split the time up better.  
The fact that this gives less writing time than the current biggest 
spread possible:

Write:  0.9 * D; Sync: 0

Is true.  It's also true that in the case where sync time really is 
zero, this new default would spread writes less than the current 
default.  I think that's optimistic, but it could happen if checkpoints 
are small and you have a good write cache.

Step back from that a second though.  Ultimately, the person who is 
getting checkpoints at a 5 minute interval, and is being nailed by 
spikes, should have the option of just increasing the parameters to make 
that interval bigger.  First you increase the measly default segments to 
a reasonable range, then checkpoint_completion_target is the second one 
you can try.  But from there, you quickly move onto making 
checkpoint_timeout longer.  At some point, there is no option but to 
give up checkpoints every 5 minutes as being practical, and make the 
average interval longer.

Whether or not a refactoring here makes things slightly worse for cases 
closer to the default doesn't bother me too much.  What bothers me is 
the way trying to stretch checkpoints out further fails to deliver as 
well as it should.  I'd be OK with saying "to get the exact same spread 
situation as in older versions, you may need to retarget for checkpoints 
every 6 minutes" *if* in the process I get a much better sync latency 
distribution in most cases.

Here's an interesting data point from the customer site this all started 
at, one I don't think they'll mind sharing since it helps make the 
situation more clear to the community.  After applying this code to 
spread sync out, in order to get their server back to functional we had 
to move all the parameters involved up to where checkpoints were spaced 
35 minutes apart.  It just wasn't possible to write any faster than that 
without disrupting foreground activity. 

The whole current model where people think of this stuff in terms of 
segments and completion targets is a UI disaster.  The direction I want 
to go in is where users can say "make sure checkpoints happen every N 
minutes", and something reasonable happens without additional parameter 
fiddling.  And if the resulting checkpoint I/O spike is too big, they 
just increase the timeout to N+1 or N*2 to spread the checkpoint 
further.  Getting too bogged down thinking in terms of the current, 
really terrible interface is something I'm trying to break myself of.  
Long-term, I want there to be checkpoint_timeout, and all the other 
parameters are gone, replaced by an internal implementation of the best 
practices proven to work even on busy systems.  I don't have as much 
clarity on exactly what that best practice is the way that, say, I just 
suggested exactly how to eliminate wal_buffers as an important thing to 
manually set.  But that's the dream UI:  you shoot for a checkpoint 
interval, and something reasonable happens; if that's too intense, you 
just increase the interval to spread further.  There will probably be 
small performance regressions possible vs. the current code with 
parameter combinations that happen to work well on it.  Preserving every 
one of those is not as important to me as making the tuning interface 
simple and clear.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Marti Raudsepp
Date:
On Sat, Jan 15, 2011 at 14:05, Robert Haas <robertmhaas@gmail.com> wrote:
> Idea #4: For ext3 filesystems that like to dump the entire buffer
> cache instead of only the requested file, write a little daemon that
> runs alongside of (and completely independently of) PostgreSQL.  Every
> 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
> closes the file, thus dumping the cache and preventing a ridiculous
> growth in the amount of data to be sync'd at checkpoint time.
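[The daemon described above is simple enough to sketch.  This is an illustration only -- the file path and interval are placeholders, and it is not part of any posted patch:]

```python
# Illustration of the daemon idea quoted above: periodically fsync a tiny
# dummy file so that ext3 (in data=ordered mode) flushes its dirty data
# often, instead of accumulating it all for checkpoint time.  The path
# and interval here are placeholders, not from any actual patch.
import os
import time

def tick(path):
    """Flip one byte in the file and fsync it, which on ext3 drags the
    rest of the filesystem's dirty data out to disk with it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        old = os.pread(fd, 1, 0) or b"\0"   # file is empty on first run
        os.pwrite(fd, bytes([old[0] ^ 1]), 0)
        os.fsync(fd)
    finally:
        os.close(fd)

def metronome(path, interval_s=30, iterations=None):
    """Run tick() every interval_s seconds; iterations=None runs forever,
    as the real daemon would."""
    count = 0
    while iterations is None or count < iterations:
        tick(path)
        count += 1
        time.sleep(interval_s)
```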

Wouldn't it be easier to just mount in data=writeback mode? This
provides a similar level of journaling as most other file systems.

Regards,
Marti


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 5:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I was just giving an example of how I might do an initial split.  There's a
> checkpoint happening now at time T; we have a rough idea that it needs to be
> finished before some upcoming time T+D.  Currently with default parameters
> this becomes:
>
> Write:  0.5 * D; Sync:  0
>
> Even though Sync obviously doesn't take zero.  The slop here is enough that
> it usually works anyway.
>
> I was suggesting that a quick reshuffling to:
>
> Write:  0.4 * D; Sync:  0.4 * D
>
> Might be a good first candidate for how to split the time up better.

What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time, but I have a pretty
clear sense of why that's a sensible thing to try: you give the rest
of the system a moment or two to get some I/O done for something other
than the checkpoint before flushing the next batch of buffers.  But
the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> What is the basis for thinking that the sync should get the same
> amount of time as the writes?  That seems pretty arbitrary.  Right
> now, you're allowing 3 seconds per fsync, which could be a lot more or
> a lot less than 40% of the total checkpoint time...

Just that it's where I ended up at when fighting with this for a month 
on the system I've seen the most problems at.  The 3 second number was 
worked backwards from a computation that said "aim for an interval of X 
minutes; we have Y relations on average involved in the checkpoint".  The 
direction my latest patch is struggling to go is computing a reasonable 
time automatically in the same way--count the relations, do a time 
estimate, add enough delay so the sync calls should be spread linearly 
over the given time range.
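[That computation is simple enough to sketch.  The function name and parameters below are invented for illustration, not taken from the patch:]

```python
# Back-of-the-envelope version of the delay computation described above:
# given the time budget for the sync phase and the number of relations
# that need an fsync, space the calls out linearly.  Names are invented
# for illustration; the real patch tracks this inside the checkpoint code.

def sync_delay_secs(sync_phase_secs, relations_to_sync):
    """Seconds to sleep between consecutive sync calls so that they are
    spread evenly across the sync phase."""
    if relations_to_sync <= 1:
        return 0.0
    return sync_phase_secs / (relations_to_sync - 1)

# e.g. a 90-second sync window over 31 relations gives 3 seconds between
# calls, the same order of magnitude as the hard-coded 3-second delay
```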


> the checkpoint activity is always going to be spikey if it does
> anything at all, so spacing it out *more* isn't obviously useful.
>   

One of the components to the write queue is some notion that writes that 
have been waiting longest should eventually be flushed out.  Linux has 
this number called dirty_expire_centisecs which suggests it enforces 
just that, set to a default of 30 seconds.  This is why some 5-minute 
interval checkpoints with default parameters, effectively spreading the 
checkpoint over 2.5 minutes, can work under the current design.  
Anything you wrote at T+0 to T+2:00 *should* have been written out 
already when you reach T+2:30 and sync.  Unfortunately, when the system 
gets busy, there is this "congestion control" logic that basically 
throws out any guarantee of writes starting shortly after the expiration 
time.
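[As a concrete restatement of that timeline -- this is just arithmetic with the default values, not kernel code:]

```python
# The timeline above, as arithmetic: with a 5-minute checkpoint_timeout
# and checkpoint_completion_target = 0.5, writes finish by T+2:30; with
# dirty_expire_centisecs at its default of 3000 (30 seconds), anything
# written up to 30 seconds before the sync should already be on disk --
# unless the kernel's congestion control has deferred the writeback.

DIRTY_EXPIRE_S = 30        # kernel default: 3000 centiseconds
SYNC_START_S = 150         # T+2:30 into a 5-minute checkpoint

def should_be_clean_at_sync(write_time_s):
    """True if a page dirtied write_time_s seconds after checkpoint start
    should have expired (and been written back) before the sync phase."""
    return write_time_s + DIRTY_EXPIRE_S <= SYNC_START_S
```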

It turns out that the only things that really work are the tunables that 
block new writes from happening once the queue is full, but they can't 
be set low enough to work well in earlier kernels when combined with 
lots of RAM.  Using the terminology of 
http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt eventually 
you hit a point where "a process generating disk writes will itself 
start writeback."  This is analogous to the PostgreSQL situation where 
backends do their own fsync calls.  The kernel will eventually move to 
where those trying to write new data are instead recruited into being 
additional sources of write flushing.  That's the part you just can't 
make aggressive enough on older kernels; dirty writers can always win.  
Ideally, the system never digs itself into a hole larger than you can 
afford to wait to write out.  It's a transaction speed vs. latency thing 
though, and the older kernels just don't consider the latency side well 
enough.

There is a new mechanism in the latest kernels to control this much 
better:  dirty_bytes and dirty_background_bytes are the tunables.  I 
haven't had a chance to test yet.  As mentioned upthread, some of the 
bleeding edge kernels that have this feature available are showing 
such large general performance regressions in our tests, compared to the 
boring old RHEL5 kernel, that whether this feature works or not is 
irrelevant.  I haven't yet tracked down which new kernel distributions 
work well performance-wise for PostgreSQL and which don't.

I'm hoping that when I get there, I'll see results like 
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages 
, where the ideal setting for dirty_bytes to keep latency under control 
with a BBWC was 15MB.  To put that into perspective, the lowest useful 
setting you can set dirty_ratio to is 5% of RAM.  That's 410MB on my 
measly 8GB desktop, and 3.3GB on the 64GB production server I've been 
trying to tune.
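[The arithmetic behind those numbers, for reference -- just unit conversion, nothing PostgreSQL-specific:]

```python
# Why dirty_ratio can't be set low enough: it's a percentage of RAM, so
# its effective floor of 5% becomes a huge absolute writeback threshold
# on a large machine, while dirty_bytes can be set to an arbitrary size.
GB = 1024 ** 3
MB = 1024 ** 2

def dirty_threshold_bytes(ram_bytes, dirty_ratio_pct):
    """Absolute amount of dirty data allowed before writeback kicks in."""
    return ram_bytes * dirty_ratio_pct // 100

print(dirty_threshold_bytes(8 * GB, 5) / MB)   # ~410 MB on the 8GB desktop
print(dirty_threshold_bytes(64 * GB, 5) / GB)  # a bit over 3 GB on the 64GB server
```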

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



ToDo List Item - System Table Index Clustering

From
Simone Aiken
Date:
Hello Postgres Hackers,

In reference to this todo item about clustering system table indexes,
( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
I have been studying the system tables to see which would benefit  from
clustering.  I have some index suggestions and a question if you have a
moment.

Cluster Candidates:

pg_attribute:  Make the existing index ( attrelid, attnum ) clustered to
order it by table and column.

pg_attrdef:  Existing index ( adrelid, adnum ) clustered to order it by
table and column.

pg_constraint:  Existing index ( conrelid ) clustered to get table
constraints contiguous.

pg_depend:  Existing index ( refclassid, refobjid, refobjsubid ) clustered
so that when the referenced object is changed its dependencies are
contiguous.

pg_description:  Make the existing index ( objoid, classoid, objsubid )
clustered to order it by entity, catalog, and optional column.
    * reversing the first two columns makes more sense to me ...
    catalog, object, column or, since object implies catalog ( right? ),
    just dispensing with catalog altogether, but that would mean
    creating a new index.

pg_shdepend:  Existing index ( refclassid, refobjid ) clustered for the
same reason as pg_depend.

pg_statistic:  Existing index ( starelid, staattnum ) clustered to order
it by table and column.

pg_trigger:  Make the existing index ( tgrelid, tgname ) clustered to
order it by table then name, getting all the triggers on a table together.

Maybe Cluster:

pg_rewrite:  Not sure about this one ... The existing index ( ev_class,
rulename ) seems logical to cluster, to get all the rewrite rules for a
given table contiguous, but in the DBs available to me virtually every
table has only one rewrite rule.

pg_auth_members:  We could order it by role or by member of that role.
Not sure which would be more valuable.


Stupid newbie question:

Is there a way to make queries on the system tables show me what is
actually there when I'm poking around?  So for example:

    Select * from pg_type limit 1;

tells me that the typoutput is 'boolout'.  An English string rather than
a number.  So even though the documentation says that column maps to
pg_proc.oid I can't then write:

    Select * from pg_proc where oid = 'boolout';

It would be very helpful if I wasn't learning the system, but since I
am I'd like to turn it off for now.  Fewer layers of abstraction.


Thanks,

Simone Aiken

303-956-7188
Quietly Competent Consulting






Re: ToDo List Item - System Table Index Clustering

From
Nicolas Barbier
Date:
2011/1/16 Simone Aiken <saiken@ulfheim.net>:

>        is there a way to make queries on the system tables show me what
>        is actually there when I'm poking around?  So for example:
>
>                Select * from pg_type limit 1;
>
>        tells me that the typoutput is 'boolout'.  An english string rather than
>        a number.  So even though the documentation says that column
>        maps to pg_proc.oid I can't then write:
>
>                Select * from pg_proc where oid = 'boolout';

The type of typoutput is "regproc", which is really an oid with a
different output function. To get the numeric value, do:

Select typoutput::oid from pg_type limit 1;

Nicolas


Re: ToDo List Item - System Table Index Clustering

From
Tom Lane
Date:
Nicolas Barbier <nicolas.barbier@gmail.com> writes:
> 2011/1/16 Simone Aiken <saiken@ulfheim.net>:
>>        ... So even though the documentation says that column
>>        maps to pg_proc.oid I can't then write:
>>                Select * from pg_proc where oid = 'boolout';

> Type type of typoutput is "regproc", which is really an oid with a
> different output function. To get the numeric value, do:
> Select typoutput::oid from pg_type limit 1;

Also, you *can* go back the other way.  It's very common to write
              Select * from pg_proc where oid = 'boolout'::regproc

rather than looking up the OID first.  There are similar pseudotypes for
relation and operator names; see "Object Identifier Types" in the
manual.
        regards, tom lane


Re: ToDo List Item - System Table Index Clustering

From
Simone Aiken
Date:

>> Select typoutput::oid from pg_type limit 1;


> Also, you *can* go back the other way.  It's very common to write
> 
>               Select * from pg_proc where oid = 'boolout'::regproc
> 
> rather than looking up the OID first.  


>  see "Object Identifier Types" in the manual.


Many thanks to you both, that helps tremendously.   

- Simone Aiken




Re: Spread checkpoint sync

From
Jeff Janes
Date:
On Tue, Jan 11, 2011 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> One of the ideas Simon and I had been considering at one point was adding
>> some better de-duplication logic to the fsync absorb code, which I'm
>> reminded by the pattern here might be helpful independently of other
>> improvements.
>
> Hopefully I'm not stepping on any toes here, but I thought this was an
> awfully good idea and had a chance to take a look at how hard it would
> be today while en route from point A to point B.  The answer turned
> out to be "not very", so PFA a patch that seems to work.  I tested it
> by attaching gdb to the background writer while running pgbench, and
> it eliminated the backend fsyncs without even breaking a sweat.

I had been concerned about how long the lock would be held, and I was
pondering ways to do only partial deduplication to reduce the time.

But since you already wrote a patch to do the whole thing, I figured
I'd time it.

I arranged to test an instrumented version of your patch under large
shared_buffers of 4GB, conditions that would maximize the opportunity
for it to take a long time.  Running your compaction to go from 524288
to a handful (14 to 29, depending on run) took between 36 and 39
milliseconds.

For comparison, doing just the memcpy part of AbsorbFsyncRequest on
a full queue took from 24 to 27 milliseconds.

They are close enough to each other that I am no longer interested in
partial deduplication.  But both are long enough that I wonder if
having a hash table in shared memory that is kept unique automatically
at each update might not be worthwhile.
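[In pseudocode terms, the compaction being timed here amounts to a stable de-duplication pass over the request queue.  This is a simplified model; the real code is C operating on the shared-memory request array:]

```python
# Simplified model of the fsync request queue compaction being timed:
# walk the queue once, keep the first occurrence of each request key,
# and drop the duplicates.  Real requests identify a relation segment;
# here any hashable tuple stands in for one.

def compact_fsync_queue(queue):
    """Return the queue with duplicate requests removed, preserving the
    order in which requests first arrived."""
    seen = set()
    compacted = []
    for req in queue:
        if req not in seen:
            seen.add(req)
            compacted.append(req)
    return compacted
```

A queue of 524288 entries collapsing to a couple of dozen distinct files, as in the measurements above, is the behavior this kind of pass produces when most requests repeat.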

Cheers,

Jeff


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sun, Jan 16, 2011 at 7:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But since you already wrote a patch to do the whole thing, I figured
> I'd time it.

Thanks!

> I arranged to test an instrumented version of your patch under large
> shared_buffers of 4GB, conditions that would maximize the opportunity
> for it to take a long time.  Running your compaction to go from 524288
> to a handful (14 to 29, depending on run) took between 36 and 39
> milliseconds.
>
> For comparison, doing just the memcpy part of AbsorbFsyncRequest on
> a full queue took from 24 to 27 milliseconds.
>
> They are close enough to each other that I am no longer interested in
> partial deduplication.  But both are long enough that I wonder if
> having a hash table in shared memory that is kept unique automatically
> at each update might not be worthwhile.

There are basically three operations that we care about here: (1) time
to add an fsync request to the queue, (2) time to absorb requests from
the queue, and (3) time to compact the queue.  The first is by far the
most common, and at least in any situation that anyone's analyzed so
far, the second will be far more common than the third.  Therefore, it
seems unwise to accept any slowdown in #1 to speed up either #2 or #3,
and a hash table probe is definitely going to be slower than what's
required to add an element under the status quo.

We could perhaps mitigate this by partitioning the hash table.
Alternatively, we could split the queue in half and maintain a global
variable - protected by the same lock - indicating which half is
currently open for insertions.  The background writer would grab the
lock, flip the global, release the lock, and then drain the half not
currently open to insertions; the next iteration would flush the other
half.  However, it's unclear to me that either of these things has any
value.  I can't remember any reports of contention on the
BgWriterCommLock, so it seems like changing the logic as minimally as
possible is the way to go.

(In contrast, note that the WAL insert lock, proc array lock, and lock
manager/buffer manager partition locks are all known to be heavily
contended in certain workloads.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
I have finished a first run of benchmarking the current 9.1 code at 
various sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm 
for many details.  The interesting stuff is in Test Set 3, near the 
bottom.  That's the first one that includes buffer_backend_fsync data.  
This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one 
from Ubuntu 10.04.

The results are classic Linux in 2010:  latency pauses from checkpoint 
sync will easily leave the system at a dead halt for a minute, with the 
worst one observed this time stalling for 108 seconds.  That one 
is weird, but these two are completely average cases:

http://www.2ndquadrant.us/pgbench-results/210/index.html
http://www.2ndquadrant.us/pgbench-results/215/index.html

I think a helpful next step here would be to put Robert's fsync 
compaction patch into here and see if that helps.  There are enough 
backend syncs showing up in the difficult workloads (scale>=1000, 
clients >=32) that its impact should be obvious.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Sun, Jan 16, 2011 at 10:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I have finished a first run of benchmarking the current 9.1 code at various
> sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm for many
> details.  The interesting stuff is in Test Set 3, near the bottom.  That's
> the first one that includes buffer_backend_fsync data.  This is all on ext3 so
> far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04.
>
> The results are classic Linux in 2010:  latency pauses from checkpoint sync
> will easily leave the system at a dead halt for a minute, with the worst one
> observed this time stalling for 108 seconds.

I wish I understood better what makes Linux systems "freeze up" under
heavy I/O load.  Linux - like other UNIX-like systems - generally has
reasonably effective mechanisms for preventing a single task from
monopolizing the (or a) CPU in the presence of other processes that
also wish to be time-sliced, but the same thing doesn't appear to be
true of I/O.

> I think a helpful next step here would be to put Robert's fsync compaction
> patch into here and see if that helps.  There are enough backend syncs
> showing up in the difficult workloads (scale>=1000, clients >=32) that its
> impact should be obvious.

Thanks for doing this work.  I look forward to the results.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Greg Smith wrote:
> One of the components to the write queue is some notion that writes that 
> have been waiting longest should eventually be flushed out.  Linux has 
> this number called dirty_expire_centisecs which suggests it enforces 
> just that, set to a default of 30 seconds.  This is why some 5-minute 
> interval checkpoints with default parameters, effectively spreading the 
> checkpoint over 2.5 minutes, can work under the current design.  
> Anything you wrote at T+0 to T+2:00 *should* have been written out 
> already when you reach T+2:30 and sync.  Unfortunately, when the system 
> gets busy, there is this "congestion control" logic that basically 
> throws out any guarantee of writes starting shortly after the expiration 
> time.

Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Jeff Janes
Date:
On Sun, Jan 16, 2011 at 7:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I have finished a first run of benchmarking the current 9.1 code at various
> sizes.  See http://www.2ndquadrant.us/pgbench-results/index.htm for many
> details.  The interesting stuff is in Test Set 3, near the bottom.  That's
> the first one that includes buffer_backend_fsync data.  This is all on ext3 so
> far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04.
>
> The results are classic Linux in 2010:  latency pauses from checkpoint sync
> will easily leave the system at a dead halt for a minute, with the worst one
> observed this time stalling for 108 seconds.  That one is weird, but
> these two are completely average cases:
>
> http://www.2ndquadrant.us/pgbench-results/210/index.html
> http://www.2ndquadrant.us/pgbench-results/215/index.html
>
> I think a helpful next step here would be to put Robert's fsync compaction
> patch into here and see if that helps.  There are enough backend syncs
> showing up in the difficult workloads (scale>=1000, clients >=32) that its
> impact should be obvious.

Have you ever tested Robert's other idea of having a metronome process
do a periodic fsync on a dummy file which is located on the same ext3fs
as the table files?  I think that that would be interesting to see.

Cheers,

Jeff


Re: Spread checkpoint sync

From
Greg Smith
Date:
Jeff Janes wrote:
> Have you ever tested Robert's other idea of having a metronome process
> do a periodic fsync on a dummy file which is located on the same ext3fs
> as the table files?  I think that that would be interesting to see.
>   

To be frank, I really don't care about fixing this behavior on ext3, 
especially in the context of that sort of hack.  That filesystem is not 
the future, it's not possible to ever really make it work right, and 
every minute spent on pandering to its limitations would be better spent 
elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide 
some proper context for the worst-case behavior people can see right 
now, and to make sure refactoring here doesn't make things worse on it.  
My target is same or slightly better on ext3, much better on XFS and ext4.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Jim Nasby
Date:
On Jan 15, 2011, at 8:15 AM, Robert Haas wrote:
> Well, the point of this is not to save time in the bgwriter - I'm not
> surprised to hear that wasn't noticeable.  The point is that when the
> fsync request queue fills up, backends start performing an fsync *for
> every block they write*, and that's about as bad for performance as
> it's possible to be.  So it's worth going to a little bit of trouble
> to try to make sure it doesn't happen.  It didn't happen *terribly*
> frequently before, but it does seem to be common enough to worry about
> - e.g. on one occasion, I was able to reproduce it just by running
> pgbench -i -s 25 or something like that on a laptop.

Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system
is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means
to track it?
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 17, 2011 at 6:07 PM, Jim Nasby <jim@nasby.net> wrote:
> On Jan 15, 2011, at 8:15 AM, Robert Haas wrote:
>> Well, the point of this is not to save time in the bgwriter - I'm not
>> surprised to hear that wasn't noticeable.  The point is that when the
>> fsync request queue fills up, backends start performing an fsync *for
>> every block they write*, and that's about as bad for performance as
>> it's possible to be.  So it's worth going to a little bit of trouble
>> to try to make sure it doesn't happen.  It didn't happen *terribly*
>> frequently before, but it does seem to be common enough to worry about
>> - e.g. on one occasion, I was able to reproduce it just by running
>> pgbench -i -s 25 or something like that on a laptop.
>
> Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production
> system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some
> other means to track it?

Something like this?

http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=3134d8863e8473e3ed791e27d484f9e548220411

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Jim Nasby wrote:
> Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production
> system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some
> other means to track it?
 

That's why we already added pg_stat_bgwriter.buffers_backend_fsync to 
track the problem before trying to improve it.  It was driving me crazy 
on a production server not having any visibility into when it happened.  
I haven't seen that we need anything beyond that so far.  In the context 
of this new patch for example, if you get to where a backend does its 
own sync, you'll know it did a compaction as part of that.  The existing 
statistic would tell you enough.

There's now enough data in test set 3 at 
http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how 
this breaks down on a moderately big system (well, by most people's 
standards, but not Jim for whom this is still a toy).  Note the 
backend_sync column on the right, very end of the page; that's the 
relevant counter I'm commenting on:

scale=175:  Some backend fsync with 64 clients, 2/3 runs.
scale=250:  Significant backend fsync with 32 and 64 clients, every run.
scale=500:  Moderate to large backend fsync at any client count >=16.  
This seems to be the worst spot of those mapped.  Above here, I would guess 
the TPS numbers start slowing enough that the fsync request queue 
activity drops, too.
scale=1000:  Backend fsync starting at 8 clients
scale=2000:  Backend fsync starting at 16 clients.  By here I think the 
TPS volumes are getting low enough that clients are stuck significantly 
more often waiting for seeks rather than fsync.

Looks like the most effective spot for me to focus testing on with this 
server is scales of 500 and 1000, with 16 to 64 clients.  Now that I've 
got the scale fine tuned better, I may crank up the client counts too 
and see what that does.  I'm glad these are appearing in reasonable 
volume here though, was starting to get nervous about only having NDA 
restricted results to work against.  Some days you just have to cough up 
for your own hardware.

I just tagged pgbench-tools-0.6.0 and pushed to 
GitHub/git.postgresql.org with the changes that track and report on 
buffers_backend_fsync if anyone else wants to try this out.  It includes 
those numbers if you have a 9.1 with them, otherwise just reports 0 for 
it all the time; detection of the feature wasn't hard to add.  The end 
portion of a config file for the program (the first part specifies 
host/username info and the like) that would replicate the third test set 
here is:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="1 10 100 175 250 500 1000 2000"
SETCLIENTS="4 8 16 32 64"
SETTIMES=3
RUNTIME=600
TOTTRANS=""

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Greg Smith
Date:
Bruce Momjian wrote:
> Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?
>   

The idea of having a dead period doing no work at all between write 
phase and sync phase may have some merit.  I don't have enough test data 
yet on some more fundamental issues in this area to comment on whether 
that smaller optimization would be valuable.  It may be a worthwhile 
concept to throw into the sequencing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: ToDo List Item - System Table Index Clustering

From
Simone Aiken
Date:

Followup on System Table Index clustering ToDo -

It looks like to implement this I need to do the following:

1 - Add statements to indexing.h to cluster the selected indexes.
A do-nothing define at the top to suppress warnings and then
lines below for perl to parse out.

#define DECLARE_CLUSTER_INDEX(table,index) ...
( add the defines under the index declarations ).

2 - Alter genbki.pl to produce the appropriate statements in 
postgres.bki when it reads the new lines in indexing.h.
Will hold them in memory until the end of the file so they
will come in after 'Build Indices' is called.

CLUSTER tablename USING indexname

3 - Initdb will pipe the commands in postgres.bki to the
postgres executable running in --boot mode. Code
will need to be added to bootparse.y to recognize
this new command and resolve it into a call to
    cluster_rel( tabOID, indOID, 0, 0, -1, -1 );


Speak now before I learn Bison ... actually I should probably
learn Bison anyway.  After ProC other pre-compilation languages
can't be that bad.

Sound all right?

Thanks,

-Simone Aiken



Re: Spread checkpoint sync

From
Cédric Villemain
Date:
2011/1/18 Greg Smith <greg@2ndquadrant.com>:
> Bruce Momjian wrote:
>>
>> Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?
>>
>
> The idea of having a dead period doing no work at all between write phase
> and sync phase may have some merit.  I don't have enough test data yet on
> some more fundamental issues in this area to comment on whether that smaller
> optimization would be valuable.  It may be a worthwhile concept to throw
> into the sequencing.

Are we able to have some pause without a strict rule like 'stop for 30
sec'?  (Case: my hardware is very good and I can write 400MB/sec with
no interruption, XXX IOPS.)

I wonder if we are not going to have issues with "RAID firmware + BBU
+ Linux scheduler", because we are adding 'unexpected' behavior in the
middle.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> Idea #4: For ext3 filesystems that like to dump the entire buffer
> cache instead of only the requested file, write a little daemon that
> runs alongside of (and completely independently of) PostgreSQL.  Every
> 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
> closes the file, thus dumping the cache and preventing a ridiculous
> growth in the amount of data to be sync'd at checkpoint time.
>   

Today's data suggests this problem has been resolved in the latest 
kernels.  I saw the "giant flush/series of small flushes" pattern quite 
easily on the CentOS5 system I last did heavy pgbench testing on.  The 
one I'm testing now has kernel 2.6.32 (Ubuntu 10.04), and it doesn't 
show it at all.

Here's what a bad checkpoint looks like on this system:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG:  checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG:  checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG:  checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG:  checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG:  checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG:  checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG:  checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG:  checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG:  checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG:  checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG:  checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG:  checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG:  checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG:  checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG:  checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG:  checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG:  checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG:  checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG:  checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec
DEBUG:  checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG:  checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1057833 of relation base/24746/36596

[>800 more of these]

DEBUG:  checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec
DEBUG:  could not forward fsync request because request queue is full
DEBUG:  checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG:  checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG:  checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 20619 of relation base/24746/36601
DEBUG:  checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG:  checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 278967 of relation base/24746/36596
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 1582400 of relation base/24746/36596
DEBUG:  checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG:  checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG:  checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG:  checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG:  checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG:  checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG:  checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG:  checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG:  checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG:  checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s

Here the file that's been brutally delayed via backend contention is 
#23, after already seeing quite long delays on the earlier ones.  That's 
something I never saw with earlier kernels running ext3.

This is good in that it makes it more likely a spread sync approach that 
works on XFS will also work on these newer kernels with ext4.  Then the 
only group we wouldn't be able to help, if that works out, is the ext3 + 
old kernel crowd.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: ToDo List Item - System Table Index Clustering

From
Alvaro Herrera
Date:
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
> 
> Hello Postgres Hackers,
> 
> In reference to this todo item about clustering system table indexes,           
> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) 
> I have been studying the system tables to see which would benefit  from 
> clustering.  I have some index suggestions and a question if you have a 
> moment.

Wow, this is really old stuff.  I don't know if this is really of any
benefit, given that these catalogs are loaded into syscaches anyway.
Furthermore, if you cluster at initdb time, they will soon lose the
ordering, given that updates move tuples around and inserts put them
anywhere.  So you'd need the catalogs to be re-clustered once in a
while, and I don't see how you'd do that (except by asking the user to
do it, which doesn't sound so great).

I think you need some more discussion on the operational details before
engaging in the bootstrap bison stuff (unless you just want to play with
Bison for educational purposes, of course, which is always a good thing
to do).

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: ToDo List Item - System Table Index Clustering

From
Alvaro Herrera
Date:
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
> 
> Hello Postgres Hackers,

BTW whatever you do, don't start a new thread by replying to an existing
message and just changing the subject line.  It will mess up the
threading for some readers, and some might not even see your message.
Compose a fresh message instead.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
>>
>> Hello Postgres Hackers,
>>
>> In reference to this todo item about clustering system table indexes,
>> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
>> I have been studying the system tables to see which would benefit  from
>> clustering.  I have some index suggestions and a question if you have a
>> moment.
>
> Wow, this is really old stuff.  I don't know if this is really of any
> benefit, given that these catalogs are loaded into syscaches anyway.
> Furthermore, if you cluster at initdb time, they will soon lose the
> ordering, given that updates move tuples around and inserts put them
> anywhere.  So you'd need the catalogs to be re-clustered once in a
> while, and I don't see how you'd do that (except by asking the user to
> do it, which doesn't sound so great).

The idea of the TODO seems to have been to set the default clustering
to something reasonable.  That doesn't necessarily seem like a bad
idea even if we can't automatically maintain the cluster order, but
there's some question in my mind whether we'd get any measurable
benefit from the clustering.  Even on a database with a gigantic
number of tables, it seems likely that the relevant system catalogs
will stay fully cached and, as you point out, the system caches will
further blunt the impact of any work in this area.  I think the first
thing to do would be to try to come up with a reproducible test case
where clustering the tables improves performance.  If we can't, that
might mean it's time to remove this TODO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
Simone Aiken
Date:

On Jan 18, 2011, at 6:35 AM, Alvaro Herrera wrote:


> Wow, this is really old stuff.  I don't know if this is really of any
> benefit, given that these catalogs are loaded into syscaches anyway.


The benefit is educational primarily.  I was looking for a todo list item
that would expose me to the system tables.  Learning the data model
of a new system is always step 1 for me.  So that one was perfect as
it would have me study and consider each one to determine if there
was any benefit from clustering on its initial load into cache.  


> Furthermore, if you cluster at initdb time, they will soon lose the
> ordering, given that updates move tuples around and inserts put them
> anywhere.  So you'd need the catalogs to be re-clustered once in a
> while, and I don't see how you'd do that (except by asking the user to
> do it, which doesn't sound so great).


I did discover that last night.  I'm used to databases that keep up their
clustering.  One that falls apart over time is distinctly strange.  And the
way you guys do your re-clustering logic is overkill if just a few rows
are out of place.  On the upside, a call to mass re-clustering goes
and updates all the clustered indexes in the system and that includes
these tables.  Will have to study auto-vacuum as well to consider that.
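If I'm reading the docs correctly, that mass re-clustering is just the
bare command:

```sql
-- Re-cluster every table in the current database that already has a
-- clustering index recorded (and that the caller owns):
CLUSTER;
```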


> (unless you just want to play with
> Bison for educational purposes, of course, which is always a good thing
> to do).

Pretty much, yeah.  


- Simone Aiken





Re: ToDo List Item - System Table Index Clustering

From
"Simone Aiken"
Date:

On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera <alvherre@commandprompt.com>
wrote:
>> Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
>>>
>>> Hello Postgres Hackers,
>>>
>>> In reference to this todo item about clustering system table indexes,
>>> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
>>
>> Wow, this is really old stuff.  I don't know if this is really of any
>
> If we can't, that might mean it's time to remove this TODO.

When I'm learning a new system I like to first learn how to use it,
second learn its data model, third start seriously looking at the code.
So that Todo is ideal for my learning method.

If there is something else that would also involve studying all the system
tables it would also be great.  For example, I noticed we have column
level comments on the web but not in the database itself.  This seems
silly.  Why not have the comments in the database and have the web
query the tables of template databases for the given versions?

That way \d+ pg_tablename would provide instant gratification for users.
And we all like our gratification to be instant.  They could be worked into
the .h files as inserts to pg_description, though then they wouldn't provide
an excuse to learn bison.

I'm open to other suggestions as well.

-Simone Aiken




Re: Spread checkpoint sync

From
Josh Berkus
Date:
> To be frank, I really don't care about fixing this behavior on ext3,
> especially in the context of that sort of hack.  That filesystem is not
> the future, it's not possible to ever really make it work right, and
> every minute spent on pandering to its limitations would be better spent
> elsewhere IMHO.  I'm starting with the ext3 benchmarks just to provide
> some proper context for the worst-case behavior people can see right
> now, and to make sure refactoring here doesn't make things worse on it. 
> My target is same or slightly better on ext3, much better on XFS and ext4.

Please don't forget that we need to avoid performance regressions on
NTFS and ZFS as well.  They don't need to improve, but we can't let them
regress.  I think we can ignore BSD/UFS and Solaris/UFS, as well as
HFS+, though.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: ToDo List Item - System Table Index Clustering

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011:
> >>
> >> Hello Postgres Hackers,
> >>
> >> In reference to this todo item about clustering system table indexes,
> >> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
> >> I have been studying the system tables to see which would benefit from
> >> clustering.  I have some index suggestions and a question if you have a
> >> moment.
> >
> > Wow, this is really old stuff.  I don't know if this is really of any
> > benefit, given that these catalogs are loaded into syscaches anyway.
> > Furthermore, if you cluster at initdb time, they will soon lose the
> > ordering, given that updates move tuples around and inserts put them
> > anywhere.  So you'd need the catalogs to be re-clustered once in a
> > while, and I don't see how you'd do that (except by asking the user to
> > do it, which doesn't sound so great).
> 
> The idea of the TODO seems to have been to set the default clustering
> to something reasonable.  That doesn't necessarily seem like a bad
> idea even if we can't automatically maintain the cluster order, but
> there's some question in my mind whether we'd get any measurable
> benefit from the clustering.  Even on a database with a gigantic
> number of tables, it seems likely that the relevant system catalogs
> will stay fully cached and, as you point out, the system caches will
> further blunt the impact of any work in this area.  I think the first
> thing to do would be to try to come up with a reproducible test case
> where clustering the tables improves performance.  If we can't, that
> might mean it's time to remove this TODO.

I think CLUSTER is a win when you are looking up multiple rows in the
same table, either using a non-unique index or a range search.  What
places do such lookups?  Having them all in adjacent pages would be a
win --- single-row lookups are usually not.
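A concrete example of that kind of lookup (the table name is
illustrative): fetching all the column metadata for one table is a
multi-row scan on pg_attribute's (attrelid, attnum) index, and it
touches fewer pages when those rows are physically adjacent:

```sql
SELECT attname, atttypid, attnotnull
FROM pg_attribute
WHERE attrelid = 'some_table'::regclass  -- illustrative table name
  AND attnum > 0
ORDER BY attnum;
```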

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 12:16 PM, Simone Aiken <saiken@ulfheim.net> wrote:
> When I'm learning a new system I like to first learn how to use it,
> second learn its data model, third start seriously looking at the code.
> So that Todo is ideal for my learning method.

Sure - my point is just that we usually have as a criteria for any
performance related patch that it actually does improve performance.
So, we'd need a test case.

> If there is something else that would also involve studying all the system
> tables it would also be great.  For example, I noticed we have column
> level comments on the web but not in the database itself.  This seems
> silly.  Why not have the comments in the database and have the web
> query the tables of template databases for the given versions?

Uh... I don't know what this means.

> I'm open to other suggestions as well.

Here are a few TODO items that look relatively easy to me (they may
not actually be easy when you dig in, of course):

Clear table counters on TRUNCATE
Allow the clearing of cluster-level statistics
Allow ALTER TABLE ... ALTER CONSTRAINT ... RENAME
Allow ALTER TABLE to change constraint deferrability and actions

Unfortunately we don't have a lot of easy TODOs.  People keep doing
the ones we think up...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
"Simone Aiken"
Date:
-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com] 
Sent: Tuesday, January 18, 2011 2:53 PM
To: Simone Aiken
Cc: Alvaro Herrera; pgsql-hackers
Subject: Re: [HACKERS] ToDo List Item - System Table Index Clustering


>Sure - my point is just that we usually have as a criteria for any
>performance related patch that it actually does improve performance.


Sorry, wasn't arguing your point.  Conceding it, actually. =)
I wasn't explaining why I chose it in order to contest your statements,
but as an invitation for you to point me towards something more useful
that fits what I was looking for in a task.


>
> Uh... I don't know what this means.
>

Pages like this one have column comments for the system tables:

http://www.psql.it/manuale/8.3/catalog-pg-attribute.html

But in my database when I look for comments they aren't there:

qcc=> \d+ pg_attribute
         Table "pg_catalog.pg_attribute"
    Column     |   Type   | Modifiers | Description
---------------+----------+-----------+-------------
 attrelid      | oid      | not null  |
 attname       | name     | not null  |
 atttypid      | oid      | not null  |
 attstattarget | integer  | not null  |
 attlen        | smallint | not null  |
 attnum        | smallint | not null  |
 attndims      | integer  | not null  |
 attcacheoff   | integer  | not null  |
 atttypmod     | integer  | not null  |
 attbyval      | boolean  | not null  |
 attstorage    | "char"   | not null  |
 attalign      | "char"   | not null  |
 attnotnull    | boolean  | not null  |
 atthasdef     | boolean  | not null  |
 attisdropped  | boolean  | not null  |
 attislocal    | boolean  | not null  |
 attinhcount   | integer  | not null  |


So I have to fire up a web browser and start googling to learn 
about the columns.  Putting them in pg_description would be 
more handy, no?
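Something like this, sketched with the standard COMMENT syntax (the
description text here is made up for illustration):

```sql
-- Hypothetical: store a column description where \d+ can find it.
COMMENT ON COLUMN pg_catalog.pg_attribute.attstattarget IS
    'Per-column statistics target for ANALYZE';

-- psql's \d+ reads descriptions back out of pg_description, roughly:
SELECT a.attname,
       col_description(a.attrelid, a.attnum) AS description
FROM pg_attribute a
WHERE a.attrelid = 'pg_catalog.pg_attribute'::regclass
  AND a.attnum > 0
ORDER BY a.attnum;
```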


-Simone Aiken




Re: ToDo List Item - System Table Index Clustering

From
"Simone Aiken"
Date:
> Robert
> 
> I think the first 
> thing to do would be to try to come up with a reproducible test case 
> where clustering the tables improves performance.  
>

On that note, is there any standard way you guys do benchmarks?  


> Bruce
>
> I think CLUSTER is a win when you are looking up multiple rows in the same
> table, either using a non-unique index or a range search.  What places do
> such lookups?  Having them all in adjacent pages would be a win ---
> single-row lookups are usually not.
>

Mostly the tables that track column level data.  Typically you will want to
grab rows for multiple columns for a given table at once so it would be
helpful to have them be contiguous on disk. 

I could design a benchmark to display this by building a thousand tables one
column at a time, using 'alter add column' to scatter the catalog rows for
the tables across many blocks.  So there'll be a range with column 1 for each
table, then column 2 for each table, then column 3 for each table.  Then
fill a couple of data tables with a lot of data and set some noise makers to
loop through them over and over with full table scans, filling up cache
with unrelated data and hopefully ageing out the cached pg_ tables.
Then do some benchmark index lookup queries to see the retrieval time before
and after clustering the pg_catalog tables to record a difference.
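The scattering setup could be scripted roughly like this (table names
and counts are illustrative):

```sql
DO $$
BEGIN
    -- Pass 1: a thousand one-column tables, so their initial
    -- pg_attribute rows land in contiguous blocks.
    FOR i IN 1..1000 LOOP
        EXECUTE 'CREATE TABLE bench_t' || i || ' (c1 int)';
    END LOOP;
    -- Passes 2 and 3: add columns table by table, pushing each
    -- table's later pg_attribute rows into distant blocks.
    FOR i IN 1..1000 LOOP
        EXECUTE 'ALTER TABLE bench_t' || i || ' ADD COLUMN c2 int';
    END LOOP;
    FOR i IN 1..1000 LOOP
        EXECUTE 'ALTER TABLE bench_t' || i || ' ADD COLUMN c3 int';
    END LOOP;
END;
$$;
```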

If the criterion is "doesn't hurt anything and helps a little", I think this
passes.  Especially since clusters aren't maintained automatically, so adding
them has no negative impact on insert or update.  It'd just be a nice thing
to do for anyone who knows it can be done, and it doesn't harm anyone who
doesn't know.


-Simone Aiken






Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken
<saiken@quietlycompetent.com> wrote:
> Pages like this one have column comments for the system tables:
>
> http://www.psql.it/manuale/8.3/catalog-pg-attribute.html

Oh, I see.  I don't think we want to go there.  We'd need some kind of
system for keeping the two places in sync.  And there'd be no easy way
to upgrade the in-database descriptions when we upgraded to a newer
minor release, supposing they'd changed in the meantime.  And some of
the descriptions are quite long, so they wouldn't fit nicely in the
amount of space you typically have available when you run \d+.  And it
would enlarge the size of an empty database by however much was
required to store all those comments, which could be an issue for
PostgreSQL instances that have many small databases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken
> <saiken@quietlycompetent.com> wrote:
>> Pages like this one have column comments for the system tables:
>> 
>> http://www.psql.it/manuale/8.3/catalog-pg-attribute.html

> Oh, I see.  I don't think we want to go there.  We'd need some kind of
> system for keeping the two places in sync.

I seem to recall some muttering about teaching genbki to extract such
comments from the SGML sources or perhaps the C header files.  I tend to
agree though that it would be a lot more work than it's worth.  And as
you say, pg_description entries aren't free.

Which brings up another point though.  I have a personal TODO item to
make the comments for operator support functions more consistent:
http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us
Should we consider removing those comments altogether, instead?
        regards, tom lane


Re: ToDo List Item - System Table Index Clustering

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of mié ene 19 15:25:00 -0300 2011:

> Oh, I see.  I don't think we want to go there.  We'd need some kind of
> system for keeping the two places in sync.

Maybe autogenerate both the .sgml and the postgres.description files
from a single source.

> And there'd be no easy way
> to upgrade the in-database descriptions when we upgraded to a newer
> minor release, supposing they'd changed in the meantime.

I wouldn't worry about this issue.  We don't do many catalog changes in
minor releases anyway.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken
>> <saiken@quietlycompetent.com> wrote:
>>> Pages like this one have column comments for the system tables:
>>>
>>> http://www.psql.it/manuale/8.3/catalog-pg-attribute.html
>
>> Oh, I see.  I don't think we want to go there.  We'd need some kind of
>> system for keeping the two places in sync.
>
> I seem to recall some muttering about teaching genbki to extract such
> comments from the SGML sources or perhaps the C header files.  I tend to
> agree though that it would be a lot more work than it's worth.  And as
> you say, pg_description entries aren't free.
>
> Which brings up another point though.  I have a personal TODO item to
> make the comments for operator support functions more consistent:
> http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us
> Should we consider removing those comments altogether, instead?

I could go either way on that.  Most of those comments are pretty
short, aren't they?  How much storage are they really costing us?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Which brings up another point though. I have a personal TODO item to
>> make the comments for operator support functions more consistent:
>> http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us
>> Should we consider removing those comments altogether, instead?

> I could go either way on that.  Most of those comments are pretty
> short, aren't they?  How much storage are they really costing us?

Well, on my machine pg_description is about 210K (per database) as of
HEAD.  90% of its contents are pg_proc entries, though I have no good
fix on how much of that is for internal-use-only functions.  A very
rough estimate from counting pg_proc and pg_operator entries suggests
that the answer might be "about a third".  So if we do what was said in
the above-cited thread, ie move existing comments to pg_operator and
add boilerplate ones to pg_proc, we probably would pay <100K for it.
        regards, tom lane


Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Which brings up another point though. I have a personal TODO item to
>>> make the comments for operator support functions more consistent:
>>> http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us
>>> Should we consider removing those comments altogether, instead?
>
>> I could go either way on that.  Most of those comments are pretty
>> short, aren't they?  How much storage are they really costing us?
>
> Well, on my machine pg_description is about 210K (per database) as of
> HEAD.  90% of its contents are pg_proc entries, though I have no good
> fix on how much of that is for internal-use-only functions.  A very
> rough estimate from counting pg_proc and pg_operator entries suggests
> that the answer might be "about a third".  So if we do what was said in
> the above-cited thread, ie move existing comments to pg_operator and
> add boilerplate ones to pg_proc, we probably would pay <100K for it.

I guess that's not enormously expensive, but it's not insignificant
either.  On my machine, a template database is 5.5MB.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 19, 2011 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, on my machine pg_description is about 210K (per database) as of
>> HEAD.  90% of its contents are pg_proc entries, though I have no good
>> fix on how much of that is for internal-use-only functions.  A very
>> rough estimate from counting pg_proc and pg_operator entries suggests
>> that the answer might be "about a third".  So if we do what was said in
>> the above-cited thread, ie move existing comments to pg_operator and
>> add boilerplate ones to pg_proc, we probably would pay <100K for it.

> I guess that's not enormously expensive, but it's not insignificant
> either.  On my machine, a template database is 5.5MB.

The implementation I was thinking about was to have initdb run a SQL
command that would do something like

INSERT INTO pg_description
    SELECT oprcode, 'pg_proc'::regclass, 0, 'implementation of ' || oprname
    FROM pg_operator
    WHERE theres-not-already-a-description-of-the-oprcode-function

So it would be minimal work to either provide or omit the boilerplate
descriptions.  I think we can postpone the decision till we have a
closer fix on the number of entries we're talking about.
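One hypothetical concrete spelling of that
not-already-a-description condition (ignoring, for the sketch, operators
that share an underlying function):

```sql
INSERT INTO pg_description
SELECT o.oprcode, 'pg_proc'::regclass, 0, 'implementation of ' || o.oprname
FROM pg_operator o
WHERE NOT EXISTS (SELECT 1 FROM pg_description d
                  WHERE d.objoid = o.oprcode
                    AND d.classoid = 'pg_proc'::regclass
                    AND d.objsubid = 0);
```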
        regards, tom lane


Re: ToDo List Item - System Table Index Clustering

From
"Simone Aiken"
Date:
>
> I seem to recall some muttering about teaching genbki to extract such
> comments from the SGML sources or perhaps the C header files.  I tend to
> agree though that it would be a lot more work than it's worth.  And as you
> say, pg_description entries aren't free.
>

I know I can't do all of the work, any submission requires review etc, but
it is worth it to me provided it does no harm to the codebase.

So the only outstanding question is the impact of increased size.

In my experience size increases related to documentation are almost always
worth it.  So I'm prejudiced right out of the gate.  I was wondering if
every pg_ table gets copied out to every database ... If there is already a
mechanism for not replicating all of them, we could utilize views or
rewrite rules to merge a single copy of catalog comments in a separate
table with each deployed database's pg_descriptions.

If all catalog descriptions were handled this way it would actually decrease
the size of a deployed database ( by 210K? ) by absorbing the
pg_descriptions that are currently being duplicated.   Since users shouldn't
be messing with them anyway and they are purely for humans to refer to - not
computers to calculate explain plans with -  there shouldn't be anything
inherently wrong with moving static descriptions out of user space.  In
theory at least.  


-Simone Aiken





Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Wed, Jan 19, 2011 at 4:27 PM, Simone Aiken <saiken@ulfheim.net> wrote:
> In my experience size increases related to documentation are almost always
> worth it.  So I'm prejudiced right out of the gate.  I was wondering if
> every pg_ table gets copied out to every database ..  if there is already a
> mechanism for not replicating all of them we could utilize views or
> re-writes rules to merge a single copy of catalog comments in a separate
> table with each deployed database's pg_descriptions.

All of them get copied, except for a handful of so-called shared
catalogs.  Changing that would be difficult.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
"Simone Aiken"
Date:
After playing with this in benchmarks and researching the weird results I
got, I'm going to advise dropping the todo for now unless something happens
to change how Postgres handles clustering.  You guys probably already
grokked this, so I am just recording it for the list archives.

The primary factor here is that postgres doesn't maintain clustered indexes.
Clustering is a one-time operation that clusters the table at this current
point in time.  Basically, there really isn't any such thing in postgres as
a clustered index.  There is an operation - Cluster - which takes an index
and a table as input and re-orders the table according to the index.   But
it is borderline fiction to call the index used "clustered" because the next
row inserted will pop in at the end of the table instead of slipping into
the middle of the table per the desired ordering.  

All the pg_table cluster candidates are candidates because they have a row
per table column and we expect that a query will want to get several of
these rows at once.  These rows are naturally clustered because the scripts
that create them insert their information into the catalog contiguously.
When you create a catalog table the pg_attribute rows for its columns are
inserted together.  When you then create all its triggers they too are put
into pg_trigger one after the other.  So calling the Cluster operation
after initdb doesn't help anything.

Over time table alterations can fragment this information.   If a user loads
a bunch of tables, then alters them over time the columns added later on
will have their metadata stored separately from the columns created
originally.     

Which gets us to the down and dirty of how the Cluster function works.  It
puts an access exclusive lock on the entire table - blocking all attempts to
read and write to the table - creates a copy of the table in the desired
order, drops the original, and renames the copy.  Doing this to a catalog
table that is relevant to queries pretty much brings everything else in the
database to a halt while the system table is locked up.  And the brute force
logic makes this time consuming even if the table is perfectly ordered
already.  Additionally, snapshots taken of the table during the Cluster
operation make the table appear to be empty which introduces the possibility
of system table corruption if transactions are run concurrently with a
Cluster operation.

So basically, the Cluster operation in its current form is not something you
want running automatically on a bunch of system tables.  It gives your
system the hiccups.  You would only want to run it manually during downtime.
And you can do that just as easily with or without any preparation during
initdb.


Thanks everyone,

-Simone Aiken






Re: ToDo List Item - System Table Index Clustering

From
Robert Haas
Date:
On Thu, Jan 20, 2011 at 4:40 PM, Simone Aiken
<saiken@quietlycompetent.com> wrote:
> After playing with this in benchmarks and researching the weird results I
> got I'm going to advise dropping the todo for now unless something happens
> to change how postgres handles clustering.

I agree, let's remove it.

That having been said, analyzing TODO items to figure out which ones
are worthless is a useful thing to do, so please feel free to keep at
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ToDo List Item - System Table Index Clustering

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Thu, Jan 20, 2011 at 4:40 PM, Simone Aiken
> <saiken@quietlycompetent.com> wrote:
> > After playing with this in benchmarks and researching the weird results I
> > got I'm going to advise dropping the todo for now unless something happens
> > to change how postgres handles clustering.
> 
> I agree, let's remove it.
> 
> That having been said, analyzing TODO items to figure out which ones
> are worthless is a useful thing to do, so please feel free to keep at
> it.

OK, removed.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Greg Smith
Date:
Greg Smith wrote:
> I think a helpful next step here would be to put Robert's fsync 
> compaction patch into here and see if that helps.  There are enough 
> backend syncs showing up in the difficult workloads (scale>=1000, 
> clients >=32) that its impact should be obvious.

Initial tests show everything expected from this change and more.  This 
took me a while to isolate because of issues where the filesystem 
involved degraded over time, giving a heavy bias toward a faster first 
test run, before anything was fragmented.  I just had to do a whole new 
mkfs on the database/xlog disks when switching between test sets in 
order to eliminate that.

At a scale of 500, I see the following average behavior:

Clients TPS backend-fsync
16 557 155
32 587 572
64 628 843
128 621 1442
256 632 2504

On one run through with the fsync compaction patch applied this turned into:

Clients TPS backend-fsync
16 637 0
32 621 0
64 721 0
128 716 0
256 841 0

So not only are all the backend fsyncs gone, there is a very clear TPS 
improvement too.  The changes in results at >=64 clients are well above 
the usual noise threshold in these tests. 

The problem where individual fsync calls during checkpoints can take a 
long time is not appreciably better.  But I think this will greatly 
reduce the odds of running into the truly dysfunctional breakdown, where 
checkpoint and backend fsync calls compete with one another, that caused 
the worst-case situation kicking off this whole line of research here.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Thu, Jan 27, 2011 at 12:18 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Greg Smith wrote:
>>
>> I think a helpful next step here would be to put Robert's fsync compaction
>> patch into here and see if that helps.  There are enough backend syncs
>> showing up in the difficult workloads (scale>=1000, clients >=32) that its
>> impact should be obvious.
>
> Initial tests show everything expected from this change and more.  This took
> me a while to isolate because of issues where the filesystem involved
> degraded over time, giving a heavy bias toward a faster first test run,
> before anything was fragmented.  I just had to do a whole new mkfs on the
> database/xlog disks when switching between test sets in order to eliminate
> that.
>
> At a scale of 500, I see the following average behavior:
>
> Clients TPS backend-fsync
> 16 557 155
> 32 587 572
> 64 628 843
> 128 621 1442
> 256 632 2504
>
> On one run through with the fsync compaction patch applied this turned into:
>
> Clients TPS backend-fsync
> 16 637 0
> 32 621 0
> 64 721 0
> 128 716 0
> 256 841 0
>
> So not only are all the backend fsyncs gone, there is a very clear TPS
> improvement too.  The change in results at >=64 clients are well above the
> usual noise threshold in these tests.
> The problem where individual fsync calls during checkpoints can take a long
> time is not appreciably better.  But I think this will greatly reduce the
> odds of running into the truly dysfunctional breakdown, where checkpoint and
> backend fsync calls compete with one another, that caused the worst-case
> situation kicking off this whole line of research here.

Dude!  That's pretty cool.  Thanks for doing that measurement work -
that's really awesome.

Barring objections, I'll go ahead and commit my patch.

Based on what I saw looking at this, I'm thinking that the backend
fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs
spread uniformly throughout the test, but clusters of 100 or more that
happen in very quick succession, followed by relief when the
background writer gets around to emptying the queue.  During each
cluster, the system probably slows way down, and then recovers when
the queue is emptied.  So the TPS improvement isn't at all a uniform
speedup, but simply relief from the stall that would otherwise result
from a full queue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> Based on what I saw looking at this, I'm thinking that the backend
> fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs
> spread uniformly throughout the test, but clusters of 100 or more that
> happen in very quick succession, followed by relief when the
> background writer gets around to emptying the queue.

That's exactly the case.  You'll be running along fine, the queue will 
fill, and then hundreds of them can pile up in seconds.  Since the worst 
of that seemed to be during the sync phase of the checkpoint, adding 
additional queue management logic there is where we started.  I 
thought this compaction idea would be more difficult to implement than 
your patch proved to be though, so doing this first is working out quite 
well instead.

This is what all the log messages from the patch look like here, at 
scale=500 and shared_buffers=256MB:

DEBUG:  compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm 
guessing 7 entries refer to the 1GB segments of the accounts table, 2 to 
its main index, and the other 2 are likely branches/tellers data.  Since 
I know the production system I ran into this on has about 400 file 
segments regularly dirtied at a higher shared_buffers than that, I 
expect this will demolish this class of problem on it, too.
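As a rough illustration of what that compaction does, here is a toy
Python model; the real patch operates on the shared-memory fsync request
queue in C, and the relation and segment names below are invented:

```python
def compact_fsync_queue(queue):
    """Collapse duplicate (relation, segment) fsync requests, keeping the
    first occurrence of each so relative order is preserved."""
    seen = set()
    compacted = []
    for request in queue:
        if request not in seen:
            seen.add(request)
            compacted.append(request)
    return compacted

# A busy checkpoint can queue the same few segments over and over:
queue = [("accounts", seg) for seg in range(7)] * 4000       # 7 x 1GB segments
queue += [("accounts_pkey", 0), ("accounts_pkey", 1)] * 2000
compacted = compact_fsync_queue(queue)
print(len(queue), "->", len(compacted))   # 32000 -> 9
```

Only one fsync per distinct dirty file is ever needed, which is why
collapsing the queue this way is safe.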

I'll have all the TPS over time graphs available to publish by the end 
of my day here, including tests at a scale of 1000 as well.  Those 
should give a little more insight into how the patch is actually 
impacting high-level performance.  I don't dare disturb the ongoing 
tests by copying all that data out of there until they're finished, will 
be a few hours yet.

My only potential concern over committing this is that I haven't done a 
sanity check over whether it impacts the fsync mechanics in a way that 
might cause an issue.  Your assumptions there are documented and look 
reasonable on quick review; I just haven't had much time yet to look for 
flaws in them.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> During each cluster, the system probably slows way down, and then recovers when
> the queue is emptied.  So the TPS improvement isn't at all a uniform
> speedup, but simply relief from the stall that would otherwise result
> from a full queue.
>

That does seem to be the case here.
http://www.2ndquadrant.us/pgbench-results/index.htm now has results from
my long test series, at two database scales that caused many backend
fsyncs during earlier tests.  Set #5 is the existing server code, #6 is
with the patch applied.  There are zero backend fsync calls with the
patch applied, which isn't surprising given how simple the schema is on
this test case.  An average of a 14% TPS gain appears at a scale of 500
and an 8% one at 1000; the attached CSV file summarizes the average
figures for the archives.  The gains do appear to be from smoothing out
the dead periods that normally occur during the sync phase of the checkpoint.

For example, here are the fastest runs at scale=1000/clients=256 with
and without the patch:

http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the reduction in slowdown around the checkpoint end points is
really obvious, and clearly an improvement.
You can see the same thing to a lesser extent at the other end of the
scale; here's the fastest runs at scale=500/clients=16:

http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

While there are still very ugly maximum latency figures here in every
case, these periods just aren't as wide with the patch in place.

I'm moving on to briefly testing some of the newer kernel behavior
here, then returning to testing the other checkpoint spreading ideas on
top of this compaction patch, presuming something like it will end up
being committed first.  I think it's safe to say I can throw away the
changes to try and alter the fsync absorption code present in what I
submitted before, as this scheme does a much better job of avoiding that
problem than those earlier queue alteration ideas.  I'm glad Robert
grabbed the right one from the pile of ideas I threw out for what else
might help here.

P.S. Yes, I know I have other review work to do as well.  Starting on
the rest of that tomorrow.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

,,"Unmodified",,"Compacted Fsync",,,
"scale","clients","tps","max_latency","tps","max_latency","TPS Gain","% Gain"
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,,,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%

Re: Spread checkpoint sync

From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 12:53 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Where there are still very ugly maximum latency figures here in every case,
> these periods just aren't as wide with the patch in place.

OK, committed the patch, with some additional commenting, and after
fixing the compiler warning Chris Browne noticed.

> P.S. Yes, I know I have other review work to do as well.  Starting on the
> rest of that tomorrow.

*cracks whip*

Man, this thing doesn't work at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I've attached an updated version of the initial sync spreading patch here,
> one that applies cleanly on top of HEAD and over top of the sync
> instrumentation patch too.  The conflict that made that hard before is gone
> now.

With the fsync queue compaction patch applied, I think most of this is
now not needed.  Attached please find an attempt to isolate the
portion that looks like it might still be useful.  The basic idea of
what remains here is to make the background writer still do its normal
stuff even when it's checkpointing.  In particular, with this patch
applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

I suspect that #1 is probably a good idea.  It seems pretty clear
based on your previous testing that the fsync compaction patch should
be sufficient to prevent us from hitting the wall, but if we're going
to any kind of nontrivial work here then cleaning the queue is a
sensible thing to do along the way, and there's little downside.

I also suspect #2 is a good idea.  The fact that we're checkpointing
doesn't mean that the system suddenly doesn't require clean buffers,
and the experimentation I've done recently (see: limiting hint bit
I/O) convinces me that it's pretty expensive from a performance
standpoint when backends have to start writing out their own buffers,
so continuing to do that work during the sync phase of a checkpoint,
just as we do during the write phase, seems pretty sensible.

I think something along the lines of #3 is probably a good idea, but
the current coding doesn't take checkpoint_completion_target into
account.  The underlying problem here is that it's at least somewhat
reasonable to assume that if we write() a whole bunch of blocks, each
write() will take approximately the same amount of time.  But this is
not true at all with respect to fsync() - they neither take the same
amount of time as each other, nor is there any fixed ratio between
write() time and fsync() time to go by.  So if we want the checkpoint
to finish in, say, 20 minutes, we can't know whether the write phase
needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

One idea I have is to try to get some of the fsyncs out of the queue
at times other than end-of-checkpoint.  Even if this resulted in some
modest increase in the total number of fsync() calls, it might improve
performance by causing data to be flushed to disk in smaller chunks.
For example, suppose we kept an LRU list of pending fsync requests -
every time we remember an fsync request for a particular relation, we
move it to the head (hot end) of the LRU.  And periodically we pull
the tail entry off the list and fsync it - say, after
checkpoint_timeout / (# of items in the list).  That way, when we
arrive at the end of the checkpoint and starting syncing everything,
the syncs hopefully complete more quickly because we've already forced
a bunch of the data down to disk.  That algorithm may well be too
stupid or just not work in real life, but perhaps there's some
variation that would be sensible.  The point is: instead of or in
addition to trying to spread out the sync phase, we might want to
investigate whether it's possible to reduce its size.
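The LRU idea sketched above might look like this; a Python toy, not the
actual bgwriter code, and only the "pull the tail entry" half of the
scheme is shown:

```python
from collections import OrderedDict

class PendingFsyncLRU:
    """Track relations with pending fsync requests.  Remembering a write
    moves the relation to the hot end; pop_coldest() returns the relation
    that has gone longest without a write, to be fsync'ed early."""
    def __init__(self):
        self.lru = OrderedDict()

    def remember_write(self, rel):
        if rel in self.lru:
            self.lru.move_to_end(rel)   # re-written: move to hot end
        else:
            self.lru[rel] = True

    def pop_coldest(self):
        rel, _ = self.lru.popitem(last=False)   # tail = least recently written
        return rel

lru = PendingFsyncLRU()
for rel in ("accounts", "branches", "tellers"):
    lru.remember_write(rel)
lru.remember_write("accounts")   # accounts written again, becomes hot
print(lru.pop_coldest())         # branches
```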

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Spread checkpoint sync

From
Itagaki Takahiro
Date:
On Mon, Jan 31, 2011 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.
>
> So if we want the checkpoint
> to finish in, say, 20 minutes, we can't know whether the write phase
> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

We probably need deadline-based scheduling, like what is already used in
the write() phase.  If we want to sync 100 files in 20 minutes, each file
should be sync'ed in 12 seconds if we assume each fsync takes the same
time.  If we had a better estimation algorithm (file size? dirty ratio?),
each fsync could have some weight factor.  But deadline-based scheduling
would still be needed then.

BTW, we should not sleep during a full-speed checkpoint.  The CHECKPOINT
command, shutdown, pg_start_backup(), and some of the checkpoints during
recovery might not want to sleep.
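The deadline-based schedule described here, with optional per-file
weights, can be sketched as follows (illustrative only; how the weights
would actually be estimated is an open question):

```python
def sync_deadlines(total_seconds, weights):
    """Assign each file an fsync deadline proportional to its weight
    (file size, dirty ratio, ...).  Equal weights reproduce the simple
    schedule: 100 files in 20 minutes means one fsync every 12 seconds."""
    total_weight = sum(weights)
    deadlines, elapsed = [], 0.0
    for w in weights:
        elapsed += total_seconds * (w / total_weight)
        deadlines.append(elapsed)
    return deadlines

d = sync_deadlines(20 * 60, [1] * 100)   # 100 equally weighted files
print(d[0], d[-1])                       # 12.0 1200.0
```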

-- 
Itagaki Takahiro


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> On Mon, Jan 31, 2011 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
>> So if we want the checkpoint
>> to finish in, say, 20 minutes, we can't know whether the write phase
>> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.
>
> We probably need deadline-based scheduling, that is being used in write()
> phase. If we want to sync 100 files in 20 minutes, each file should be
> sync'ed in 12 seconds if we think each fsync takes the same time.
> If we would have better estimation algorithm (file size? dirty ratio?),
> each fsync chould have some weight factor.  But deadline-based scheduling
> is still needed then.

Right.  I think the problem is balancing the write and sync phases.
For example, if your operating system is very aggressively writing out
dirty pages to disk, then you want the write phase to be as long as
possible and the sync phase can be very short because there won't be
much work to do.  But if your operating system is caching lots of
stuff in memory and writing dirty pages out to disk only when
absolutely necessary, then the write phase could be relatively quick
without much hurting anything, but the sync phase will need to be long
to keep from crushing the I/O system.  The trouble is, we don't really
have a priori way to know which it's doing.  Maybe we could try to
tune based on the behavior of previous checkpoints, but I'm wondering
if we oughtn't to take the cheesy path first and split
checkpoint_completion_target into checkpoint_write_target and
checkpoint_sync_target.  That's another parameter to set, but I'd
rather add a parameter that people have to play with to find the right
value than impose an arbitrary rule that creates unavoidable bad
performance in certain environments.
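With two separate targets as floated above, the per-phase deadlines fall
out trivially; a sketch, where checkpoint_write_target and
checkpoint_sync_target are the hypothetical new parameters, each a
fraction of the checkpoint interval:

```python
def checkpoint_schedule(checkpoint_timeout, write_target, sync_target):
    """Return (write_deadline, sync_deadline) in seconds from checkpoint
    start.  Both targets are fractions of checkpoint_timeout, and together
    they must leave some slack before the next checkpoint is due."""
    assert write_target + sync_target <= 1.0
    write_deadline = checkpoint_timeout * write_target
    sync_deadline = checkpoint_timeout * (write_target + sync_target)
    return write_deadline, sync_deadline

# e.g. a 20-minute checkpoint interval: writes spread over the first half,
# syncs spread over the following quarter:
w, s = checkpoint_schedule(20 * 60, 0.5, 0.25)
print(w, s)   # 600.0 900.0
```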

> BTW, we should not sleep in full-speed checkpoint. CHECKPOINT command,
> shutdown, pg_start_backup(), and some of checkpoints during recovery
> might don't want to sleep.

Yeah, I think that's understood.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Heikki Linnakangas
Date:
On 31.01.2011 16:44, Robert Haas wrote:
> On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro
> <itagaki.takahiro@gmail.com>  wrote:
>> On Mon, Jan 31, 2011 at 13:41, Robert Haas<robertmhaas@gmail.com>  wrote:
>>> 1. Absorb fsync requests a lot more often during the sync phase.
>>> 2. Still try to run the cleaning scan during the sync phase.
>>> 3. Pause for 3 seconds after every fsync.
>>>
>>> So if we want the checkpoint
>>> to finish in, say, 20 minutes, we can't know whether the write phase
>>> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.
>>
>> We probably need deadline-based scheduling, that is being used in write()
>> phase. If we want to sync 100 files in 20 minutes, each file should be
>> sync'ed in 12 seconds if we think each fsync takes the same time.
>> If we would have better estimation algorithm (file size? dirty ratio?),
>> each fsync chould have some weight factor.  But deadline-based scheduling
>> is still needed then.
>
> Right.  I think the problem is balancing the write and sync phases.
> For example, if your operating system is very aggressively writing out
> dirty pages to disk, then you want the write phase to be as long as
> possible and the sync phase can be very short because there won't be
> much work to do.  But if your operating system is caching lots of
> stuff in memory and writing dirty pages out to disk only when
> absolutely necessary, then the write phase could be relatively quick
> without much hurting anything, but the sync phase will need to be long
> to keep from crushing the I/O system.  The trouble is, we don't really
> have a priori way to know which it's doing.  Maybe we could try to
> tune based on the behavior of previous checkpoints, ...

IMHO we should re-consider the patch to sort the writes. Not so much 
because of the performance gain that gives, but because we can then 
re-arrange the fsyncs so that you write one file, then fsync it, then 
write the next file and so on.  That way the time taken by the fsyncs 
is distributed between the writes, so we don't need to accurately 
estimate how long each will take. If one fsync takes a long time, the 
writes that follow will just be done a bit faster to catch up.
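A minimal model of that ordering, in Python for brevity; fd_for and the
dirty-buffer layout are invented here and bear no resemblance to
PostgreSQL's actual smgr interface:

```python
import os

def checkpoint_files(dirty_buffers, fd_for):
    """Write out all of one file's dirty buffers, fsync that file, then
    move to the next file in sorted order.  dirty_buffers maps file name
    to a list of (offset, bytes) pairs; fd_for maps file name to an open
    file descriptor."""
    for fname in sorted(dirty_buffers):
        fd = fd_for(fname)
        for offset, block in dirty_buffers[fname]:
            os.pwrite(fd, block, offset)
        os.fsync(fd)   # fsync cost lands between files, not all at the end
```

If one fsync runs long, the writes for the following files simply
proceed faster afterwards, which is the self-correcting property
described above.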

> ... but I'm wondering
> if we oughtn't to take the cheesy path first and split
> checkpoint_completion_target into checkpoint_write_target and
> checkpoint_sync_target.  That's another parameter to set, but I'd
> rather add a parameter that people have to play with to find the right
> value than impose an arbitrary rule that creates unavoidable bad
> performance in certain environments.

That is of course simpler..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Spread checkpoint sync

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> IMHO we should re-consider the patch to sort the writes. Not so much 
> because of the performance gain that gives, but because we can then 
> re-arrange the fsyncs so that you write one file, then fsync it, then 
> write the next file and so on.

Isn't that going to make performance worse not better?  Generally you
want to give the kernel as much scheduling flexibility as possible,
which you do by issuing the write as far before the fsync as you can.
An arrangement like the above removes all cross-file scheduling freedom.
For example, if two files are on different spindles, you've just
guaranteed that no I/O overlap is possible.

> That way we the time taken by the fsyncs 
> is distributed between the writes,

That sounds like you have an entirely wrong mental model of where the
cost comes from.  Those times are not independent.
        regards, tom lane


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> IMHO we should re-consider the patch to sort the writes. Not so much
>> because of the performance gain that gives, but because we can then
>> re-arrange the fsyncs so that you write one file, then fsync it, then
>> write the next file and so on.
>
> Isn't that going to make performance worse not better?  Generally you
> want to give the kernel as much scheduling flexibility as possible,
> which you do by issuing the write as far before the fsync as you can.
> An arrangement like the above removes all cross-file scheduling freedom.
> For example, if two files are on different spindles, you've just
> guaranteed that no I/O overlap is possible.
>
>> That way we the time taken by the fsyncs
>> is distributed between the writes,
>
> That sounds like you have an entirely wrong mental model of where the
> cost comes from.  Those times are not independent.

Yeah, Greg Smith made the same point a week or three ago.  But it
seems to me that there is potential value in overlaying the write and
sync phases to some degree.  For example, if the write phase is spread
over 15 minutes and you have 30 files, then by, say, minute 7, it's
probably OK to flush the file you wrote first.  Waiting longer isn't
necessarily going to help - the kernel has probably written what it is
going to write without prodding.

In fact, it might be that on a busy system, you could lose by waiting
*too long* to perform the fsync.  The cleaning scan and/or backends
may kick out additional dirty buffers that will now have to get forced
down to disk, even though you don't really care about them (because
they were dirtied after the checkpoint write had already been done).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That sounds like you have an entirely wrong mental model of where the
>> cost comes from.  Those times are not independent.

> Yeah, Greg Smith made the same point a week or three ago.  But it
> seems to me that there is potential value in overlaying the write and
> sync phases to some degree.  For example, if the write phase is spread
> over 15 minutes and you have 30 files, then by, say, minute 7, it's a
> probably OK to flush the file you wrote first.

Yeah, probably, but we can't do anything as stupid as file-by-file.

I wonder whether it'd be useful to keep track of the total amount of
data written-and-not-yet-synced, and to issue fsyncs often enough to
keep that below some parameter; the idea being that the parameter would
limit how much dirty kernel disk cache there is.  Of course, ideally the
kernel would have a similar tunable and this would be a waste of effort
on our part...
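A sketch of that byte-count trigger; the 256MB limit is an arbitrary
placeholder, and the real counter would be kept locally by the bgwriter
rather than in a Python object:

```python
class WriteTracker:
    """Count bytes written but not yet synced; once the count exceeds the
    limit on dirty kernel cache, trigger an fsync pass and reset."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.unsynced = 0
        self.fsync_passes = 0

    def note_write(self, nbytes):
        self.unsynced += nbytes
        if self.unsynced >= self.limit:
            self.fsync_passes += 1   # here we would fsync the pending files
            self.unsynced = 0

tracker = WriteTracker(limit_bytes=256 * 1024 * 1024)
for _ in range(100):
    tracker.note_write(8192 * 1000)   # 1000 8K blocks per batch
print(tracker.fsync_passes)           # 3
```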
        regards, tom lane


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That sounds like you have an entirely wrong mental model of where the
>>> cost comes from.  Those times are not independent.
>
>> Yeah, Greg Smith made the same point a week or three ago.  But it
>> seems to me that there is potential value in overlaying the write and
>> sync phases to some degree.  For example, if the write phase is spread
>> over 15 minutes and you have 30 files, then by, say, minute 7, it's a
>> probably OK to flush the file you wrote first.
>
> Yeah, probably, but we can't do anything as stupid as file-by-file.

Eh?

> I wonder whether it'd be useful to keep track of the total amount of
> data written-and-not-yet-synced, and to issue fsyncs often enough to
> keep that below some parameter; the idea being that the parameter would
> limit how much dirty kernel disk cache there is.  Of course, ideally the
> kernel would have a similar tunable and this would be a waste of effort
> on our part...

It's not clear to me how you'd maintain that information without it
turning into a contention bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> 3. Pause for 3 seconds after every fsync.

> I think something along the lines of #3 is probably a good idea,

Really?  Any particular delay is guaranteed wrong.
        regards, tom lane


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> 3. Pause for 3 seconds after every fsync.
>
>> I think something along the lines of #3 is probably a good idea,
>
> Really?  Any particular delay is guaranteed wrong.

What I was getting at was - I think it's probably a good idea not to
do the fsyncs at top speed, but I'm not too sure how they should be
spaced out.  I agree a fixed delay isn't necessarily right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether it'd be useful to keep track of the total amount of
>> data written-and-not-yet-synced, and to issue fsyncs often enough to
>> keep that below some parameter; the idea being that the parameter would
>> limit how much dirty kernel disk cache there is.  Of course, ideally the
>> kernel would have a similar tunable and this would be a waste of effort
>> on our part...

> It's not clear to me how you'd maintain that information without it
> turning into a contention bottleneck.

What contention bottleneck?  I was just visualizing the bgwriter process
locally tracking how many writes it'd issued.  Backend-issued writes
should happen seldom enough to be ignorable for this purpose.
        regards, tom lane


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether it'd be useful to keep track of the total amount of
>>> data written-and-not-yet-synced, and to issue fsyncs often enough to
>>> keep that below some parameter; the idea being that the parameter would
>>> limit how much dirty kernel disk cache there is.  Of course, ideally the
>>> kernel would have a similar tunable and this would be a waste of effort
>>> on our part...
>
>> It's not clear to me how you'd maintain that information without it
>> turning into a contention bottleneck.
>
> What contention bottleneck?  I was just visualizing the bgwriter process
> locally tracking how many writes it'd issued.  Backend-issued writes
> should happen seldom enough to be ignorable for this purpose.

Ah.  Well, if you ignore backend writes, then yes, there's no
contention bottleneck.  However, I seem to recall Greg Smith showing a
system at PGCon last year with a pretty respectable volume of backend
writes (30%?) and saying "OK, so here's a healthy system".  Perhaps
I'm misremembering.  But at any rate any backend that is using a
BufferAccessStrategy figures to do a lot of its own writes.  This is
probably an area for improvement in future releases, if we can figure
out how to do it: if we're doing a bulk load into a system with 4GB of
shared_buffers using a 16MB ring buffer, we'd ideally like the
background writer - or somebody other than the foreground process - to
go nuts on those buffers, writing them out as fast as it possibly can
- rather than letting the backend do it when the ring wraps around.

Back to the idea at hand - I proposed something a bit along these
lines upthread, but my idea was to proactively perform the fsyncs on
the relations that had gone the longest without a write, rather than
the ones with the most dirty data.  I'm not sure which is better.
Obviously, doing the ones that have "gone idle" gives the OS more time
to write out the data, but OTOH it might not succeed in purging much
dirty data.  Doing the ones with the most dirty data will definitely
reduce the size of the final checkpoint, but might also cause a
latency spike if it's triggered immediately after heavy write activity
on that file.
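The two selection rules being compared could be sketched as follows (hypothetical structures, not actual PostgreSQL code):

```c
/* Two candidate rules for choosing which pending file to fsync next:
 * the one that has gone longest without a write ("gone idle"), or the
 * one with the most dirty data.  Hypothetical sketch only. */
#include <stddef.h>

typedef struct PendingSync
{
    int    file_id;
    long   last_write_time;  /* e.g. seconds since checkpoint start */
    size_t dirty_bytes;      /* unsynced data written to this file  */
} PendingSync;

/* "Gone idle" rule: oldest last write wins. */
static int
pick_longest_idle(const PendingSync *files, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (files[i].last_write_time < files[best].last_write_time)
            best = i;
    return best;
}

/* "Most dirty" rule: largest amount of unsynced data wins. */
static int
pick_most_dirty(const PendingSync *files, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (files[i].dirty_bytes > files[best].dirty_bytes)
            best = i;
    return best;
}
```

The trade-off described above is visible in the data each rule ignores: the idle rule never looks at dirty_bytes, and the dirty rule never looks at write recency.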

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Robert Haas wrote:
> Back to the idea at hand - I proposed something a bit along these
> lines upthread, but my idea was to proactively perform the fsyncs on
> the relations that had gone the longest without a write, rather than
> the ones with the most dirty data.  I'm not sure which is better.
> Obviously, doing the ones that have "gone idle" gives the OS more time
> to write out the data, but OTOH it might not succeed in purging much
> dirty data.  Doing the ones with the most dirty data will definitely
> reduce the size of the final checkpoint, but might also cause a
> latency spike if it's triggered immediately after heavy write activity
> on that file.

Crazy idea #2 --- it would be interesting if you issued an fsync
_before_ you wrote out data to a file that needed an fsync.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Greg Smith
Date:
Tom Lane wrote:
> I wonder whether it'd be useful to keep track of the total amount of
> data written-and-not-yet-synced, and to issue fsyncs often enough to
> keep that below some parameter; the idea being that the parameter would
> limit how much dirty kernel disk cache there is.  Of course, ideally the
> kernel would have a similar tunable and this would be a waste of effort
> on our part...
>   

I wanted to run the tests again before reporting in detail here, because 
the results are so bad, but I threw out an initial report about trying 
to push this down to be the kernel's job, at 
http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html

So far it looks like the newish Linux dirty_bytes parameter works well 
at reducing latency by limiting how much dirty data can pile up before 
it gets nudged heavily toward disk.  But the throughput drop you pay on 
VACUUM in particular is brutal, I'm seeing over a 50% slowdown in some 
cases.  I suspect we need to let the regular cleaner and backend writes 
queue up in the largest possible cache for VACUUM, so it benefits as 
much as possible from elevator sorting of writes.  I suspect that VACUUM 
being the worst case for a tightly controlled write cache is an 
unintended side-effect of the ring buffer implementation it now uses.

Right now I'm running the same tests on XFS instead of ext3, and those 
are just way more sensible all around; I'll revisit this on that 
filesystem and ext4.  The scale=500 tests I've been running a lot of lately 
are a full 3X TPS faster on XFS relative to ext3, with about 1/8 as much 
worst-case latency.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Back to the idea at hand - I proposed something a bit along these
> lines upthread, but my idea was to proactively perform the fsyncs on
> the relations that had gone the longest without a write, rather than
> the ones with the most dirty data.

Yeah.  What I meant to suggest, but evidently didn't explain well, was
to use that or something much like it as the rule for deciding *what* to
fsync next, but to use amount-of-unsynced-data-versus-threshold as the
method for deciding *when* to do the next fsync.
        regards, tom lane


Re: Spread checkpoint sync

From
Greg Smith
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> 3. Pause for 3 seconds after every fsync.
>
>> I think something along the lines of #3 is probably a good idea,
>
> Really?  Any particular delay is guaranteed wrong.

'3 seconds' is just a placeholder for whatever comes out of a "total 
time scheduled to sync / relations to sync" computation.  (Still doing 
all my thinking in terms of time, although I recognize a showdown with 
segment-based checkpoints is coming too)

I think the right way to compute "relations to sync" is to finish the 
sorted writes patch I sent over a not quite right yet update to already, 
which is my next thing to work on here.  I remain pessimistic that any 
attempt to issue fsync calls without the maximum possible delay after 
asking the kernel to write things out first will work out well.  My 
recent tests with low values of dirty_bytes on Linux just reinforce how 
bad that can turn out.  In addition to computing the relation count 
while sorting them, placing writes in order by relation and then doing 
all writes followed by all syncs should place the database right in the 
middle of the throughput/latency trade-off here.  It will have had the 
maximum amount of time we can give it to sort and flush writes for any 
given relation before it is asked to sync it.  I don't want to try and 
be any smarter than that without trying to be a *lot* smarter--timing 
individual sync calls, feedback loops on time estimation, etc.

At this point I have to agree with Robert's observation that splitting 
checkpoints into checkpoint_write_target and checkpoint_sync_target is 
the only reasonable thing left that might be possible to complete in a 
short period.  So that's how this can compute the total time numerator 
here.

The main thing I will warn about in relation to today's discussion is 
the danger of true deadline-oriented scheduling in this area.  The 
checkpoint process may discover the sync phase is falling behind 
expectations because the individual sync calls are taking longer than 
expected.  If that happens, aiming for the "finish on target anyway" 
goal puts you right back to a guaranteed nasty write spike again.  I 
think many people would prefer logging the overrun as tuning feedback 
for the DBA rather than accelerating, which is likely to make the 
problem even worse if the checkpoint is falling behind.  But since 
ultimately the feedback for this will be "make the checkpoints longer 
or increase checkpoint_sync_target", sync acceleration to meet the 
deadline isn't unacceptable; the DBA can try both of those themselves 
if seeing spikes.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Re: Spread checkpoint sync

From
Greg Smith
Date:
Greg Smith wrote:
> I think the right way to compute "relations to sync" is to finish the sorted writes patch I sent over a not quite right yet update to already

Attached update now makes much more sense than the misguided patch I submitted two weeks ago.  This takes the original sorted write code, first adjusting it so it only allocates the memory its tag structure is stored in once (in a kind of lazy way I can improve on right now).  It then computes a bunch of derived statistics from a single walk of the sorted data on each pass through.  Here's an example of what comes out:

DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG:  BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG:  BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG:  BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG:  BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG:  BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG:  BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG:  BufferSync 2097 buffers to write, 17 total dirty segment file(s) expected to need sync

This is the first checkpoint after starting to populate a new pgbench database.  The next four show it extending into new segments:

DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync

DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync

DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync

DEBUG:  BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG:  BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync

The fact that it's always showing 2048 dirty blocks on these makes me think I'm computing something wrong still, but the general idea here is working now.  I had to use some magic from the md layer to let bufmgr.c know how its writes were going to get mapped into file segments and correspondingly fsync calls later.  Not happy about breaking the API encapsulation there, but don't see an easy way to compute that data at the per-segment level--and it's not like that's going to change in the near future anyway.
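As a rough illustration of the statistic being computed: once the dirty-buffer list is sorted by relation and block number, a single walk can count the distinct segment files expected to need a sync.  This sketch uses hypothetical structures and the default of 131072 8kB blocks per 1GB segment; it is not the patch's actual code:

```c
/* Sort dirty buffers by (relation, block) and count distinct segment
 * files, i.e. the expected number of fsync calls.  Hypothetical
 * stand-in for the BufferSync statistics shown above. */
#include <stdlib.h>

#define RELSEG_SIZE 131072   /* blocks per 1GB segment at 8kB BLCKSZ */

typedef struct DirtyBuf
{
    unsigned relnode;    /* relation file node    */
    unsigned blocknum;   /* block within relation */
} DirtyBuf;

static int
cmp_dirtybuf(const void *a, const void *b)
{
    const DirtyBuf *x = a, *y = b;
    if (x->relnode != y->relnode)
        return x->relnode < y->relnode ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

/* Sort the list in place and return how many distinct segment files
 * it touches; sorting by block also groups the segments together. */
static int
count_dirty_segments(DirtyBuf *bufs, int n)
{
    int segments = 0;
    qsort(bufs, n, sizeof(DirtyBuf), cmp_dirtybuf);
    for (int i = 0; i < n; i++)
    {
        unsigned seg = bufs[i].blocknum / RELSEG_SIZE;
        if (i == 0 ||
            bufs[i].relnode != bufs[i - 1].relnode ||
            seg != bufs[i - 1].blocknum / RELSEG_SIZE)
            segments++;
    }
    return segments;
}
```

This is where the block-to-segment mapping knowledge borrowed from the md layer comes in: without it, bufmgr.c has no way to know which block numbers share an fsync target.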

I like this approach for a providing a map of how to spread syncs out for a couple of reasons:

-It computes data that could be used to drive sync spread timing in a relatively short amount of simple code.

-You get write sorting at the database level helping out the OS.  Everything I've been seeing recently on benchmarks says Linux at least needs all the help it can get in that regard, even if block order doesn't necessarily align perfectly with disk order.

-It's obvious how to take this same data and build a future model where the time allocated for fsyncs was proportional to how much that particular relation was touched.

Benchmarks of just the impact of the sorting step and continued bug swatting to follow.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Re: Spread checkpoint sync

From
Robert Haas
Date:
On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Back to the idea at hand - I proposed something a bit along these
>> lines upthread, but my idea was to proactively perform the fsyncs on
>> the relations that had gone the longest without a write, rather than
>> the ones with the most dirty data.
>
> Yeah.  What I meant to suggest, but evidently didn't explain well, was
> to use that or something much like it as the rule for deciding *what* to
> fsync next, but to use amount-of-unsynced-data-versus-threshold as the
> method for deciding *when* to do the next fsync.

Oh, I see.  Yeah, that could be a good algorithm.

I also think Bruce's idea of calling fsync() on each relation just
*before* we start writing the pages from that relation might have some
merit.  (I'm assuming here that we are sorting the writes.)  That
should tend to result in the end-of-checkpoint fsyncs being quite
fast, because we'll only have as much dirty data floating around as we
actually wrote during the checkpoint, which according to Greg Smith is
usually a small fraction of the total data in need of flushing.  Also,
if one of the pre-write fsyncs takes a long time, then that'll get
factored into our calculations of how fast we need to write the
remaining data to finish the checkpoint on schedule.  Of course
there's still the possibility that the I/O system literally can't
finish a checkpoint in X minutes, but even in that case, the I/O
saturation will hopefully be more spread out across the entire
checkpoint instead of falling like a hammer at the very end.

Back to your idea: One problem with trying to bound the unflushed data
is that it's not clear what the bound should be.  I've had this mental
model where we want the OS to write out pages to disk, but that's not
always true, per Greg Smith's recent posts about Linux kernel tuning
slowing down VACUUM.  A possible advantage of the Momjian algorithm
(as it's known in the literature) is that we don't actually start
forcing anything out to disk until we have a reason to do so - namely,
an impending checkpoint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> I also think Bruce's idea of calling fsync() on each relation just
> *before* we start writing the pages from that relation might have
> some merit.
What bothers me about that is that you may have a lot of the same
dirty pages in the OS cache as the PostgreSQL cache, and you've just
ensured that the OS will write those *twice*.  I'm pretty sure that
the reason the aggressive background writer settings we use have not
caused any noticeable increase in OS disk writes is that many
PostgreSQL writes of the same buffer keep an OS buffer page from
becoming stale enough to get flushed until PostgreSQL writes to it
taper off.  Calling fsync() right before doing "one last push" of
the data could be really pessimal for some workloads.
-Kevin


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Robert Haas wrote:
> Back to your idea: One problem with trying to bound the unflushed data
> is that it's not clear what the bound should be.  I've had this mental
> model where we want the OS to write out pages to disk, but that's not
> always true, per Greg Smith's recent posts about Linux kernel tuning
> slowing down VACUUM.  A possible advantage of the Momjian algorithm
> (as it's known in the literature) is that we don't actually start
> forcing anything out to disk until we have a reason to do so - namely,
> an impending checkpoint.

My trivial idea was:  let's assume we checkpoint every 10 minutes, and
it takes 5 minutes for us to write the data to the kernel.   If no one
else is writing to those files, we can safely wait maybe 5 more minutes
before issuing the fsync.  If, however, hundreds of writes are coming in
for the same files in those final 5 minutes, we should fsync right away.

My idea is that our delay between writes and fsync should somehow be
controlled by how many writes to the same files are coming to the kernel
while we are considering waiting because the only downside to delay is
the accumulation of non-critical writes coming into the kernel for the
same files we are going to fsync later.
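A toy version of that rule, with hypothetical names:

```c
/* Sketch of the heuristic above: after a file's checkpoint writes go
 * out, keep delaying its fsync while the file stays quiet, but sync
 * right away once new writes start arriving for it, since each one
 * adds non-critical data to the eventual fsync. */
#include <stdbool.h>

typedef struct SyncDecision
{
    long   new_writes_to_file;  /* writes seen since checkpoint write */
    double seconds_of_slack;    /* time left before we must sync      */
} SyncDecision;

static bool
should_fsync_now(const SyncDecision *d)
{
    if (d->seconds_of_slack <= 0.0)
        return true;                  /* deadline reached: sync anyway */
    return d->new_writes_to_file > 0; /* file is hot again: sync early */
}
```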

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Greg Smith wrote:
> Greg Smith wrote:
> > I think the right way to compute "relations to sync" is to finish the 
> > sorted writes patch I sent over a not quite right yet update to already
> 
> Attached update now makes much more sense than the misguided patch I 
> submitted two weeks ago.  This takes the original sorted write code, 
> first adjusting it so it only allocates the memory its tag structure is 
> stored in once (in a kind of lazy way I can improve on right now).  It 
> then computes a bunch of derived statistics from a single walk of the 
> sorted data on each pass through.  Here's an example of what comes out:

In that patch, I would like to see a meta-comment explaining why the
sorting is happening and what we hope to gain.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Robert Haas
Date:
On Tue, Feb 1, 2011 at 12:58 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I also think Bruce's idea of calling fsync() on each relation just
>> *before* we start writing the pages from that relation might have
>> some merit.
>
> What bothers me about that is that you may have a lot of the same
> dirty pages in the OS cache as the PostgreSQL cache, and you've just
> ensured that the OS will write those *twice*.  I'm pretty sure that
> the reason the aggressive background writer settings we use have not
> caused any noticeable increase in OS disk writes is that many
> PostgreSQL writes of the same buffer keep an OS buffer page from
> becoming stale enough to get flushed until PostgreSQL writes to it
> taper off.  Calling fsync() right before doing "one last push" of
> the data could be really pessimal for some workloads.

I was thinking about what Greg reported here:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01387.php

If the amount of pre-checkpoint dirty data is 3GB and the checkpoint
is writing 250MB, then you shouldn't have all that many extra
writes... but you might have some, and that might be enough to send
the whole thing down the tubes.

InnoDB apparently handles this problem by advancing the redo pointer
in small steps instead of in large jumps.  AIUI, in addition to
tracking the LSN of each page, they also track the first-dirtied LSN.
That lets you checkpoint to an arbitrary LSN by flushing just the
pages with an older first-dirtied LSN.  So instead of doing a
checkpoint every hour, you might do a mini-checkpoint every 10
minutes.  Since the mini-checkpoints each need to flush less data,
they should be less disruptive than a full checkpoint.  But that, too,
will generate some extra writes.  Basically, any idea that involves
calling fsync() more often is going to tend to smooth out the I/O load
at the cost of some increase in the total number of writes.
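The mini-checkpoint idea can be sketched like this (an outline of the described scheme, not actual InnoDB or PostgreSQL code):

```c
/* Each page tracks the LSN at which it first became dirty, so a
 * mini-checkpoint to a target LSN flushes exactly the pages first
 * dirtied before that point, advancing the redo pointer in small
 * steps.  Hypothetical sketch only. */
#include <stdbool.h>

typedef unsigned long LSN;

typedef struct Page
{
    LSN  page_lsn;          /* LSN of the latest change to the page  */
    LSN  first_dirtied_lsn; /* LSN when the page first became dirty  */
    bool dirty;
} Page;

/* Flush (here: just mark clean) every page that must go to disk to
 * advance the redo pointer to target; return how many were flushed. */
static int
mini_checkpoint(Page *pages, int n, LSN target)
{
    int flushed = 0;
    for (int i = 0; i < n; i++)
    {
        if (pages[i].dirty && pages[i].first_dirtied_lsn < target)
        {
            pages[i].dirty = false;   /* stand-in for write + fsync */
            flushed++;
        }
    }
    return flushed;
}
```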

If we don't want any increase at all in the number of writes,
spreading out the fsync() calls is pretty much the only other option.
I'm worried that even with good tuning that won't be enough to tamp
down the latency spikes.  But maybe it will be...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>  
> > I also think Bruce's idea of calling fsync() on each relation just
> > *before* we start writing the pages from that relation might have
> > some merit.
>  
> What bothers me about that is that you may have a lot of the same
> dirty pages in the OS cache as the PostgreSQL cache, and you've just
> ensured that the OS will write those *twice*.  I'm pretty sure that
> the reason the aggressive background writer settings we use have not
> caused any noticeable increase in OS disk writes is that many
> PostgreSQL writes of the same buffer keep an OS buffer page from
> becoming stale enough to get flushed until PostgreSQL writes to it
> taper off.  Calling fsync() right before doing "one last push" of
> the data could be really pessimal for some workloads.

OK, maybe my idea needs to be adjusted and we should trigger an early
fsync if non-fsync writes are coming in for blocks _other_ than the ones
we already wrote for that checkpoint.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> My trivial idea was:  let's assume we checkpoint every 10 minutes, and
> it takes 5 minutes for us to write the data to the kernel.   If no one
> else is writing to those files, we can safely wait maybe 5 more minutes
> before issuing the fsync.  If, however, hundreds of writes are coming in
> for the same files in those final 5 minutes, we should fsync right away.

Huh?  I would surely hope we could assume that nobody but Postgres is
writing the database files?  Or are you considering that the bgwriter
doesn't know exactly what the backends are doing?  That's true, but
I still maintain that we should design the bgwriter's behavior on the
assumption that writes from backends are negligible.  Certainly the
backends aren't issuing fsyncs.
        regards, tom lane


Re: Spread checkpoint sync

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > My trivial idea was:  let's assume we checkpoint every 10 minutes, and
> > it takes 5 minutes for us to write the data to the kernel.   If no one
> > else is writing to those files, we can safely wait maybe 5 more minutes
> > before issuing the fsync.  If, however, hundreds of writes are coming in
> > for the same files in those final 5 minutes, we should fsync right away.
> 
> Huh?  I would surely hope we could assume that nobody but Postgres is
> writing the database files?  Or are you considering that the bgwriter
> doesn't know exactly what the backends are doing?  That's true, but
> I still maintain that we should design the bgwriter's behavior on the
> assumption that writes from backends are negligible.  Certainly the
> backends aren't issuing fsyncs.

Right, no one else is writing but us.  When I said "no one else" I meant
no other bgwriter writes are going to the files we wrote as part of the
checkpoint, but have not fsync'ed yet.  I assume we have two write
streams --- the checkpoint writes, which we know at the start of the
checkpoint, and the bgwriter writes that are happening in an
unpredictable way based on database activity.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Spread checkpoint sync

From
Michael Banck
Date:
On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
> For example, the pre-release Squeeze numbers we're seeing are awful so
> far, but it's not really done yet either. 

Unfortunately, it does not look like Debian squeeze will change any more
(or has changed much since your post) at this point, except for maybe
further stable kernel updates.  

Which file system did you see those awful numbers on and could you maybe
go into some more detail?


Thanks,

Michael

-- 
<marco_g> I did send an email to propose multithreading to       grub-devel on the first of april.
<marco_g> Unfortunately everyone thought I was serious ;-)


Re: Spread checkpoint sync

From
Greg Smith
Date:
Michael Banck wrote:
> On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
>> For example, the pre-release Squeeze numbers we're seeing are awful so
>> far, but it's not really done yet either.
>
> Unfortunately, it does not look like Debian squeeze will change any more
> (or has changed much since your post) at this point, except for maybe
> further stable kernel updates.
>
> Which file system did you see those awful numbers on and could you maybe
> go into some more detail?

Once the release comes out any day now I'll see if I can duplicate them 
on hardware I can talk about fully, and share the ZCAV graphs if the 
problem is still there.  The server I've been running all of the 
extended pgbench tests in this thread on is running Ubuntu simply as a 
temporary way to get 2.6.32 before Squeeze ships.  Last time I tried 
installing one of the Squeeze betas I didn't get anywhere; hoping the 
installer bug I ran into has been sorted when I try again.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Re: Spread checkpoint sync

From
Greg Smith
Date:
As already mentioned in the broader discussion at 
http://archives.postgresql.org/message-id/4D4C4610.1030109@2ndquadrant.com 
, I'm seeing no solid performance swing in the checkpoint sorting code 
itself.  Better sometimes, worse others, but never by a large amount.

Here's what the statistics part derived from the sorted data looks like 
on a real checkpoint spike:

2011-02-04 07:02:51 EST: LOG:  checkpoint starting: xlog
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in 
relation.segment_fork 17216.0_2
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 159 dirty blocks in 
relation.segment_fork 17216.0_1
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 10 dirty blocks in 
relation.segment_fork 17216.3_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 548 dirty blocks in 
relation.segment_fork 17216.4_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 808 dirty blocks in 
relation.segment_fork 17216.5_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 799 dirty blocks in 
relation.segment_fork 17216.6_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 807 dirty blocks in 
relation.segment_fork 17216.7_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 716 dirty blocks in 
relation.segment_fork 17216.8_0
2011-02-04 07:02:51 EST: DEBUG:  BufferSync 3857 buffers to write, 8 
total dirty segment file(s) expected to need sync
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=1 
file=base/16384/17216.5 time=1324.614 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=2 
file=base/16384/17216.4 time=0.002 msec
2011-02-04 07:03:31 EST: DEBUG:  checkpoint sync: number=3 
file=base/16384/17216_fsm time=0.001 msec
2011-02-04 07:03:47 EST: DEBUG:  checkpoint sync: number=4 
file=base/16384/17216.10 time=16446.753 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=5 
file=base/16384/17216.8 time=5804.252 msec
2011-02-04 07:03:53 EST: DEBUG:  checkpoint sync: number=6 
file=base/16384/17216.7 time=0.001 msec
2011-02-04 07:03:54 EST: DEBUG:  compacted fsync request queue from 
32768 entries to 2 entries
2011-02-04 07:03:54 EST: CONTEXT:  writing block 1642223 of relation 
base/16384/17216
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=7 
file=base/16384/17216.11 time=6350.577 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=8 
file=base/16384/17216.9 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=9 
file=base/16384/17216.6 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=10 
file=base/16384/17216.3 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG:  checkpoint sync: number=11 
file=base/16384/17216_vm time=0.001 msec
2011-02-04 07:04:00 EST: LOG:  checkpoint complete: wrote 3813 buffers 
(11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled; 
write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11, 
longest=16.446 s, average=2.720 s

You can see that it ran out of fsync absorption space in the middle of 
the sync phase, which is usually when compaction is needed, but the 
recent patch to fix that kicked in and did its thing.

Couple of observations:

-The total number of buffers I'm computing based on the checkpoint 
writes being sorted is not a perfect match to the number reported by the 
"checkpoint complete" status line.  Sometimes they are the same, 
sometimes not.  Not sure why yet.

-The estimate for "expected to need sync" computed as a by-product of 
the checkpoint sorting is not completely accurate either.  This 
particular one has a fairly large error in it, percentage-wise, being 
off by 3 with a total of 11.  Presumably these are absorbed fsync 
requests that were already queued up before the checkpoint even 
started.  So any time estimate I drive based off of this count is only 
going to be approximate.

-The order in which the sync phase processes files is unrelated to the 
order in which they are written out.  Note that 17216.10 here, the 
biggest victim (cause?) of the I/O spike, isn't even listed among the 
checkpoint writes!

The fuzziness here is a bit disconcerting, and I'll keep digging for why 
it happens.  But I don't see any reason not to continue forward using 
the rough count here to derive a nap time from, which I can then feed 
into the "useful leftovers" patch that Robert already refactored here.  
Can always sharpen up that estimate later, I need to get some solid 
results I can share on what the delay time does to the 
throughput/latency pattern next.
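For reference, one plausible way to turn that rough count into a nap time between fsyncs, using the write/sync target split discussed upthread (the formula and parameter names are assumptions, not the patch's actual code):

```c
/* Derive the pause between checkpoint fsync calls from the time
 * budgeted for the sync phase and the expected number of dirty
 * segment files, replacing the hard-coded 3 second delay.
 * Hypothetical sketch only. */
static double
sync_nap_seconds(double checkpoint_timeout,  /* full cycle, seconds   */
                 double write_target,        /* writes end here, 0..1 */
                 double sync_target,         /* syncs end here, 0..1  */
                 int files_to_sync)
{
    double budget = checkpoint_timeout * (sync_target - write_target);

    if (files_to_sync <= 1 || budget <= 0.0)
        return 0.0;
    /* one nap between each pair of consecutive fsync calls */
    return budget / (files_to_sync - 1);
}
```

Because the file count is only an estimate, any nap derived this way is approximate too, which is consistent with treating overruns as tuning feedback rather than a hard deadline.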

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Fri, Feb 4, 2011 at 2:08 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> -The total number of buffers I'm computing based on the checkpoint writes
> being sorted is not a perfect match to the number reported by the
> "checkpoint complete" status line.  Sometimes they are the same, sometimes
> not.  Not sure why yet.

My first guess would be that in the cases where it's not the same,
some backend evicted the buffer before the background writer got to
it.  That's expected under heavy contention for shared_buffers.

> -The estimate for "expected to need sync" computed as a by-product of the
> checkpoint sorting is not completely accurate either.  This particular one
> has a fairly large error in it, percentage-wise, being off by 3 with a total
> of 11.  Presumably these are absorbed fsync requests that were already
> queued up before the checkpoint even started.  So any time estimate I drive
> based off of this count is only going to be approximate.

As previously noted, I wonder if we ought to sync queued-up requests that
don't require writes before beginning the write phase.

> -The order in which the sync phase processes files is unrelated to the order
> in which they are written out.  Note that 17216.10 here, the biggest victim
> (cause?) of the I/O spike, isn't even listed among the checkpoint writes!

That's awful.  If more than 50% of the I/O is going to happen from one
fsync() call, that seems to put a pretty pessimal bound on how much
improvement we can hope to achieve here.  Or am I missing something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Spread checkpoint sync

From
Greg Smith
Date:
Robert Haas wrote:
> With the fsync queue compaction patch applied, I think most of this is
> now not needed.  Attached please find an attempt to isolate the
> portion that looks like it might still be useful.  The basic idea of
> what remains here is to make the background writer still do its normal
> stuff even when it's checkpointing.  In particular, with this patch
> applied, PG will:
>
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.
>

Yes, the bits you extracted were the remaining useful parts from the
original patch.  Today was quiet here because there were sports on or
something, and I added full auto-tuning magic to the attached update.  I
need to kick off benchmarks and report back tomorrow to see how well
this does, but any additional patch here would only be code cleanup on
the messy stuff I did in here (plus proper implementation of the pair of
GUCs).  This has finally gotten to the exact logic I've been meaning to
complete as spread sync since the idea was first postponed in 8.3, with
the benefit of some fsync absorption improvements along the way too.

The automatic timing is modeled on the existing
checkpoint_completion_target concept, except with a new tunable (not yet
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
now.  What I think I want to do is make the existing
checkpoint_completion_target now be the target for the end of the sync
phase, matching its name; people who bumped it up won't necessarily even
have to change anything.  Then the new guc can be
checkpoint_write_target, representing the target that is in there right now.

I tossed the earlier idea of counting relations to sync based on the
write phase data as too inaccurate after testing, and with it for now
goes checkpoint sorting.  Instead, I just take a first pass over
pendingOpsTable to get a total number of things to sync, which will
always match the real count barring strange circumstances (like dropping
a table).

As for automatically determining the interval, I take the number of
syncs that have finished so far, divide by the total, and get a number
between 0.0 and 1.0 that represents progress on the sync phase.  I then
use the same basic CheckpointWriteDelay logic that is there for
spreading writes out, except with the new sync target.  I realized that
if we assume the checkpoint writes should have finished in
CheckPointCompletionTarget worth of time or segments, we can compute a
new progress metric with the formula:

progress = CheckPointCompletionTarget + (1.0 -
CheckPointCompletionTarget) * finished / goal;

Where "finished" is the number of segments synced so far, while "goal" is
the total.  To turn this into an example, let's say the default
parameters are set, the writes are done, and we've finished 1 out of 4
syncs; that much work will be considered:

progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

On a scale that effectively aims to have the sync work finished by 0.8.

I don't use quite the same logic as CheckpointWriteDelay though.  It
turns out the existing checkpoint_completion_target implementation
doesn't always work like I thought it did, which provides some very
interesting insight into why my attempts to work around checkpoint
problems haven't worked as well as expected the last few years.  I
thought that what it did was wait until the amount of time determined by
the target was reached before issuing the next write.  That's not quite
it; what it actually does is check progress against the target, then
sleep exactly one nap interval if it is ahead of schedule.  That is only
the same thing if you have a lot of buffers to write relative to the
amount of time involved.  There's some alternative logic if you don't
have bgwriter_lru_maxpages set, but in the normal situation it
effectively means that:

maximum write spread time = bgwriter_delay * checkpoint dirty blocks

No matter how far apart you try to spread the checkpoints.  Now,
typically, when people run into these checkpoint spikes in production,
reducing shared_buffers improves that.  But I now realize that doing so
will then reduce the average number of dirty blocks participating in the
checkpoint, and therefore potentially pull the spread down at the same
time!  Also, if you try and tune bgwriter_delay down to get better
background cleaning, you're also reducing the maximum spread.  Between
this issue and the bad behavior when the fsync queue fills, no wonder
this has been so hard to tune out of production systems.  At some point,
the reduction in spread defeats further attempts to reduce the size of
what's written at checkpoint time, by lowering the amount of data involved.

What I do instead is nap until just after the planned schedule, then
execute the sync.  What ends up happening then is that there can be a
long pause between the end of the write phase and when syncs start to
happen, which I consider a good thing.  Gives the kernel a little more
time to try and get writes moving out to disk.  Here's what that looks
like on my development desktop:

2011-02-07 00:46:24 EST: LOG:  checkpoint starting: time
2011-02-07 00:48:04 EST: DEBUG:  checkpoint sync:  estimated segments=10
2011-02-07 00:48:24 EST: DEBUG:  checkpoint sync: naps=99
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=1
file=base/16736/16749.1 time=12033.898 msec
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=2
file=base/16736/16749 time=60.799 msec
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: naps=59
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: number=3
file=base/16736/16756 time=0.003 msec
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: number=4
file=base/16736/16750 time=0.003 msec
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: number=5
file=base/16736/16737 time=0.004 msec
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: number=6
file=base/16736/16749_fsm time=0.004 msec
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: number=7
file=base/16736/16740 time=0.003 msec
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: number=8
file=base/16736/16749_vm time=0.003 msec
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: number=9
file=base/16736/16752 time=0.003 msec
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: number=10
file=base/16736/16754 time=0.003 msec
2011-02-07 00:50:12 EST: LOG:  checkpoint complete: wrote 14335 buffers
(43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled;
write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10,
longest=12.033 s, average=1.209 s

Since this is ext3 the spike during the first sync is brutal, anyway,
but it tried very hard to avoid that:  it waited 99 * 200ms = 19.8
seconds between writing the last buffer and when it started syncing them
(00:48:04 to 00:48:24).  Given the slow write for #1, it was then
behind, so it immediately moved on to #2.  But after that, it was able to
insert a moderate nap time between successive syncs--60 naps is 12
seconds, and it keeps that pace for the remainder of the sync.  This is
the same sort of thing I'd worked out as optimal on the system this
patch originated from, except it had a lot more dirty relations; that's
why its naptime was the 3 seconds hard-coded into earlier versions of
this patch.

Results on XFS with mini-server class hardware should be interesting...

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..f58ac3e 100644
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
*************** static bool am_bg_writer = false;
*** 168,173 ****
--- 168,175 ----

  static bool ckpt_active = false;

+ static int checkpoint_flags = 0;
+
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
  static XLogRecPtr ckpt_start_recptr;
*************** static pg_time_t last_xlog_switch_time;
*** 180,186 ****

  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);

--- 182,188 ----

  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress,double target);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);

*************** CheckpointWriteDelay(int flags, double p
*** 691,696 ****
--- 693,701 ----
      if (!am_bg_writer)
          return;

+     /* Cache this value for a later spread sync */
+     checkpoint_flags=flags;
+
      /*
       * Perform the usual bgwriter duties and take a nap, unless we're behind
       * schedule, in which case we just try to catch up as quickly as possible.
*************** CheckpointWriteDelay(int flags, double p
*** 698,704 ****
      if (!(flags & CHECKPOINT_IMMEDIATE) &&
          !shutdown_requested &&
          !ImmediateCheckpointRequested() &&
!         IsCheckpointOnSchedule(progress))
      {
          if (got_SIGHUP)
          {
--- 703,709 ----
      if (!(flags & CHECKPOINT_IMMEDIATE) &&
          !shutdown_requested &&
          !ImmediateCheckpointRequested() &&
!         IsCheckpointOnSchedule(progress,CheckPointCompletionTarget))
      {
          if (got_SIGHUP)
          {
*************** CheckpointWriteDelay(int flags, double p
*** 726,731 ****
--- 731,799 ----
  }

  /*
+  * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+  *
+  * This function is called after each file sync performed by mdsync().
+  * It is responsible for keeping the bgwriter's normal activities in
+  * progress during a long checkpoint.
+  */
+ void
+ CheckpointSyncDelay(int finished,int goal)
+ {
+     int flags = checkpoint_flags;
+     int nap_count = 0;
+     double progress;
+     double CheckPointSyncTarget = 0.8;
+
+     /* Do nothing if checkpoint is being executed by non-bgwriter process */
+     if (!am_bg_writer)
+         return;
+
+     /*
+      * Limit progress to the goal, which may be necessary if the
+      * number of segments to sync was calculated wrong.
+      */
+     ckpt_cached_elapsed = 0;
+     if (finished > goal) finished=goal;
+
+     /*
+      * Base our progress on the assumption that the write took
+      * checkpoint_completion_target worth of time, and that sync
+      * progress is advancing from beyond that point.
+      */
+     progress = CheckPointCompletionTarget +
+         (1.0 - CheckPointCompletionTarget) * finished / goal;
+
+     /*
+      * Perform the usual bgwriter duties and nap until we've just
+      * crossed our deadline.
+      */
+     elog(DEBUG2,"checkpoint sync: considering a nap after progress=%.1f",progress);
+     while (!(flags & CHECKPOINT_IMMEDIATE) &&
+             !shutdown_requested &&
+             !ImmediateCheckpointRequested() &&
+             (IsCheckpointOnSchedule(progress,CheckPointSyncTarget)))
+     {
+         if (got_SIGHUP)
+         {
+             got_SIGHUP = false;
+             ProcessConfigFile(PGC_SIGHUP);
+         }
+
+         elog(DEBUG2,"checkpoint sync: nap count=%d",nap_count);
+         nap_count++;
+
+         AbsorbFsyncRequests();
+
+         BgBufferSync();
+         CheckArchiveTimeout();
+         BgWriterNap();
+     }
+     if (nap_count > 0)
+         elog(DEBUG1,"checkpoint sync: naps=%d",nap_count);
+ }
+
+ /*
   * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
   *         in time?
   *
*************** CheckpointWriteDelay(int flags, double p
*** 734,740 ****
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress)
  {
      XLogRecPtr    recptr;
      struct timeval now;
--- 802,808 ----
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress,double target)
  {
      XLogRecPtr    recptr;
      struct timeval now;
*************** IsCheckpointOnSchedule(double progress)
*** 743,750 ****

      Assert(ckpt_active);

!     /* Scale progress according to checkpoint_completion_target. */
!     progress *= CheckPointCompletionTarget;

      /*
       * Check against the cached value first. Only do the more expensive
--- 811,820 ----

      Assert(ckpt_active);

!     /* Scale progress according to given target. */
!     progress *= target;
!
!     elog(DEBUG2,"checkpoint schedule check: scaled progress=%.1f target=%.1f",progress,target);

      /*
       * Check against the cached value first. Only do the more expensive
*************** IsCheckpointOnSchedule(double progress)
*** 773,778 ****
--- 843,850 ----
               ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
              CheckPointSegments;

+         elog(DEBUG2,"checkpoint schedule: elapsed xlogs=%.1f",elapsed_xlogs);
+
          if (progress < elapsed_xlogs)
          {
              ckpt_cached_elapsed = elapsed_xlogs;
*************** IsCheckpointOnSchedule(double progress)
*** 787,792 ****
--- 859,866 ----
      elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                      now.tv_usec / 1000000.0) / CheckPointTimeout;

+     elog(DEBUG2,"checkpoint schedule: elapsed time=%.1f",elapsed_time);
+
      if (progress < elapsed_time)
      {
          ckpt_cached_elapsed = elapsed_time;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9d585b6..f294f6f 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 31,39 ****
  #include "pg_trace.h"


- /* interval for calling AbsorbFsyncRequests in mdsync */
- #define FSYNCS_PER_ABSORB        10
-
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
--- 31,36 ----
*************** mdsync(void)
*** 932,938 ****

      HASH_SEQ_STATUS hstat;
      PendingOperationEntry *entry;
-     int            absorb_counter;

      /* Statistics on sync times */
      int            processed = 0;
--- 929,934 ----
*************** mdsync(void)
*** 943,948 ****
--- 939,948 ----
      uint64        longest = 0;
      uint64        total_elapsed = 0;

+     /* Sync spreading counters */
+     int            sync_segments = 0;
+     int            current_segment = 0;
+
      /*
       * This is only called during checkpoints, and checkpoints should only
       * occur in processes that have created a pendingOpsTable.
*************** mdsync(void)
*** 1001,1008 ****
      /* Set flag to detect failure if we don't reach the end of the loop */
      mdsync_in_progress = true;

      /* Now scan the hashtable for fsync requests to process */
-     absorb_counter = FSYNCS_PER_ABSORB;
      hash_seq_init(&hstat, pendingOpsTable);
      while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
      {
--- 1001,1033 ----
      /* Set flag to detect failure if we don't reach the end of the loop */
      mdsync_in_progress = true;

+     /* For spread sync timing purposes, make a scan through the
+      * hashtable to count its entries.  Using hash_get_num_entries
+      * instead would require a stronger lock than we want to have at
+      * this point, and we don't want to count requests destined for
+      * next cycle anyway
+      *
+      * XXX Should we skip this if there is no sync spreading, or if
+      *     fsync is off?
+      */
+     hash_seq_init(&hstat, pendingOpsTable);
+     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+     {
+         if (entry->cycle_ctr == mdsync_cycle_ctr)
+             continue;
+         sync_segments++;
+     }
+
+     /*
+      * In the unexpected situation where the above estimate says there
+      * is nothing to sync, avoid division by zero errors in the
+      * progress computation below.
+      */
+     if (sync_segments == 0)
+         sync_segments = 1;
+     elog(DEBUG1, "checkpoint sync:  estimated segments=%d",sync_segments);
+
      /* Now scan the hashtable for fsync requests to process */
      hash_seq_init(&hstat, pendingOpsTable);
      while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
      {
*************** mdsync(void)
*** 1027,1043 ****
              int            failures;

              /*
!              * If in bgwriter, we want to absorb pending requests every so
!              * often to prevent overflow of the fsync request queue.  It is
!              * unspecified whether newly-added entries will be visited by
!              * hash_seq_search, but we don't care since we don't need to
!              * process them anyway.
               */
!             if (--absorb_counter <= 0)
!             {
!                 AbsorbFsyncRequests();
!                 absorb_counter = FSYNCS_PER_ABSORB;
!             }

              /*
               * The fsync table could contain requests to fsync segments that
--- 1052,1060 ----
              int            failures;

              /*
!              * If in bgwriter, perform normal duties.
               */
!             CheckpointSyncDelay(current_segment,sync_segments);

              /*
               * The fsync table could contain requests to fsync segments that
*************** mdsync(void)
*** 1131,1140 ****
                  pfree(path);

                  /*
!                  * Absorb incoming requests and check to see if canceled.
                   */
!                 AbsorbFsyncRequests();
!                 absorb_counter = FSYNCS_PER_ABSORB;        /* might as well... */

                  if (entry->canceled)
                      break;
--- 1148,1156 ----
                  pfree(path);

                  /*
!                  * If in bgwriter, perform normal duties.
                   */
!                 CheckpointSyncDelay(current_segment,sync_segments);

                  if (entry->canceled)
                      break;
*************** mdsync(void)
*** 1149,1154 ****
--- 1165,1172 ----
          if (hash_search(pendingOpsTable, &entry->tag,
                          HASH_REMOVE, NULL) == NULL)
              elog(ERROR, "pendingOpsTable corrupted");
+
+         current_segment++;
      }                            /* end loop over hashtable entries */

      /* Return sync performance metrics for report at checkpoint end */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..5da0aa2 100644
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
*************** extern void BackgroundWriterMain(void);
*** 26,31 ****
--- 26,32 ----

  extern void RequestCheckpoint(int flags);
  extern void CheckpointWriteDelay(int flags, double progress);
+ extern void CheckpointSyncDelay(int finished,int goal);

  extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                      BlockNumber segno);

Re: Spread checkpoint sync

From
Cédric Villemain
Date:
2011/2/7 Greg Smith <greg@2ndquadrant.com>:
> Robert Haas wrote:
>>
>> With the fsync queue compaction patch applied, I think most of this is
>> now not needed.  Attached please find an attempt to isolate the
>> portion that looks like it might still be useful.  The basic idea of
>> what remains here is to make the background writer still do its normal
>> stuff even when it's checkpointing.  In particular, with this patch
>> applied, PG will:
>>
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
>
> Yes, the bits you extracted were the remaining useful parts from the
> original patch.  Today was quiet here because there were sports on or
> something, and I added full auto-tuning magic to the attached update.  I
> need to kick off benchmarks and report back tomorrow to see how well this
> does, but any additional patch here would only be code cleanup on the messy
> stuff I did in here (plus proper implementation of the pair of GUCs).  This
> has finally gotten to the exact logic I've been meaning to complete as
> spread sync since the idea was first postponed in 8.3, with the benefit of
> some fsync absorption improvements along the way too.
>
> The automatic timing is modeled on the existing checkpoint_completion_target
> concept, except with a new tunable (not yet added as a GUC) currently called
> CheckPointSyncTarget, set to 0.8 right now.  What I think I want to do is
> make the existing checkpoint_completion_target now be the target for the end
> of the sync phase, matching its name; people who bumped it up won't
> necessarily even have to change anything.  Then the new guc can be
> checkpoint_write_target, representing the target that is in there right now.

Is it worth starting a new thread about the different I/O improvements
done or in progress so far, and how we might add new GUCs (if
required!) with some intelligence shared between those patches?  (For
instance, the hint bit I/O limit probably needs a tunable defining
something similar to hint_write_completion_target and/or an I/O
throttling strategy... items which are still in gestation...)

>
> I tossed the earlier idea of counting relations to sync based on the write
> phase data as too inaccurate after testing, and with it for now goes
> checkpoint sorting.  Instead, I just take a first pass over pendingOpsTable
> to get a total number of things to sync, which will always match the real
> count barring strange circumstances (like dropping a table).
>
> As for the automatically determining the interval, I take the number of
> syncs that have finished so far, divide by the total, and get a number
> between 0.0 and 1.0 that represents progress on the sync phase.  I then use
> the same basic CheckpointWriteDelay logic that is there for spreading writes
> out, except with the new sync target.  I realized that if we assume the
> checkpoint writes should have finished in CheckPointCompletionTarget worth
> of time or segments, we can compute a new progress metric with the formula:
>
> progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) *
> finished / goal;
>
> Where "finished" is the number of segments written out, while "goal" is the
> total.  To turn this into an example, let's say the default parameters are
> set, we've finished the writes, and  finished 1 out of 4 syncs; that much
> work will be considered:
>
> progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625
>
> On a scale that effectively aims to have the sync work finished by 0.8.
>
> I don't use quite the same logic as the CheckpointWriteDelay though.  It
> turns out the existing checkpoint_completion implementation doesn't always
> work like I thought it did, which provide some very interesting insight into
> why my attempts to work around checkpoint problems haven't worked as well as
> expected the last few years.  I thought that what it did was wait until an
> amount of time determined by the target was reached until it did the next
> write.  That's not quite it; what it actually does is check progress against
> the target, then sleep exactly one nap interval if it is ahead of
> schedule.  That is only the same thing if you have a lot of buffers to write
> relative to the amount of time involved.  There's some alternative logic if
> you don't have bgwriter_lru_maxpages set, but in the normal situation it
> effectively it means that:
>
> maximum write spread time=bgwriter_delay * checkpoint dirty blocks
>
> No matter how far apart you try to spread the checkpoints.  Now, typically,
> when people run into these checkpoint spikes in production, reducing
> shared_buffers improves that.  But I now realize that doing so will then
> reduce the average number of dirty blocks participating in the checkpoint,
> and therefore potentially pull the spread down at the same time!  Also, if
> you try and tune bgwriter_delay down to get better background cleaning,
> you're also reducing the maximum spread.  Between this issue and the bad
> behavior when the fsync queue fills, no wonder this has been so hard to tune
> out of production systems.  At some point, the reduction in spread defeats
> further attempts to reduce the size of what's written at checkpoint time, by
> lowering the amount of data involved.

interesting!

>
> What I do instead is nap until just after the planned schedule, then execute
> the sync.  What ends up happening then is that there can be a long pause
> between the end of the write phase and when syncs start to happen, which I
> consider a good thing.  Gives the kernel a little more time to try and get
> writes moving out to disk.

Sounds like a really good idea like that.

> Here's what that looks like on my development
> desktop:
>
> 2011-02-07 00:46:24 EST: LOG:  checkpoint starting: time
> 2011-02-07 00:48:04 EST: DEBUG:  checkpoint sync:  estimated segments=10
> 2011-02-07 00:48:24 EST: DEBUG:  checkpoint sync: naps=99
> 2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=1
> file=base/16736/16749.1 time=12033.898 msec
> 2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=2
> file=base/16736/16749 time=60.799 msec
> 2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: naps=59
> 2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: number=3
> file=base/16736/16756 time=0.003 msec
> 2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: number=4
> file=base/16736/16750 time=0.003 msec
> 2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: number=5
> file=base/16736/16737 time=0.004 msec
> 2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: number=6
> file=base/16736/16749_fsm time=0.004 msec
> 2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: number=7
> file=base/16736/16740 time=0.003 msec
> 2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: number=8
> file=base/16736/16749_vm time=0.003 msec
> 2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: number=9
> file=base/16736/16752 time=0.003 msec
> 2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: naps=60
> 2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: number=10
> file=base/16736/16754 time=0.003 msec
> 2011-02-07 00:50:12 EST: LOG:  checkpoint complete: wrote 14335 buffers
> (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled;
> write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10,
> longest=12.033 s, average=1.209 s
>
> Since this is ext3 the spike during the first sync is brutal, anyway, but it
> tried very hard to avoid that:  it waited 99 * 200ms = 19.8 seconds between
> writing the last buffer and when it started syncing them (00:48:04 to
> 00:48:24).  Given the slow write for #1, it was then behind, so it
> immediately moved onto #2.  But after that, it was able to insert a moderate
> nap time between successive syncs--60 naps is 12 seconds, and it keeps that
> pace for the remainder of the sync.  This is the same sort of thing I'd
> worked out as optimal on the system this patch originated from, except it
> had a lot more dirty relations; that's why its naptime was the 3 seconds
> hard-coded into earlier versions of this patch.
>
> Results on XFS with mini-server class hardware should be interesting...
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Spread checkpoint sync

From
Greg Smith
Date:
Cédric Villemain wrote:
> Is it worth starting a new thread about the different I/O improvements
> done or in progress so far, and how we might add new GUCs (if
> required!) with some intelligence shared between those patches?  (For
> instance, the hint bit I/O limit probably needs a tunable defining
> something similar to hint_write_completion_target and/or an I/O
> throttling strategy... items which are still in gestation...)
>   

Maybe, but I wouldn't bring all that up right now.  Trying to wrap up 
the CommitFest, too distracting, etc.

As a larger statement on this topic, I'm never very excited about 
redesigning here starting from any point other than "saw a bottleneck 
doing <x> on a production system".  There's a long list of such things 
already around waiting to be addressed, and I've never seen any good 
evidence of work related to hint bits being on it.  Please correct me if 
you know of some--I suspect you do from the way you're bringing this up.  
If we were to consider kicking off some larger work here, I would drive 
that by asking where the data supporting that work being necessary is at 
first.  It's hard enough to fix a bottleneck that's staring right at 
you; trying to address one that's just theorized is impossible.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:

> As a larger statement on this topic, I'm never very excited about
> redesigning here starting from any point other than "saw a
> bottleneck doing <x> on a production system".  There's a long list
> of such things already around waiting to be addressed, and I've
> never seen any good evidence of work related to hint bits being on
> it.  Please correct me if you know of some--I suspect you do from
> the way you're brining this up.

There are occasional posts from those wondering why their read-only
queries are so slow after a bulk load, and why they are doing heavy
writes.  (I remember when I posted about that, as a relative newbie,
and I know I've seen others.)

I think worst case is probably:

- Bulk load data.
- Analyze (but don't vacuum) the new data.
- Start a workload with a lot of small, concurrent random reads.
- Watch performance tank when the write cache gluts.

This pattern is why we've adopted a pretty strict rule in our shop
that we run VACUUM FREEZE ANALYZE between a bulk load and putting
the database back into production.  It's probably a bigger issue for
those who can't do that.

-Kevin


Re: Spread checkpoint sync

From
Greg Smith
Date:
Kevin Grittner wrote:
> There are occasional posts from those wondering why their read-only
> queries are so slow after a bulk load, and why they are doing heavy
> writes.  (I remember when I posted about that, as a relative newbie,
> and I know I've seen others.)
>   

Sure; I created http://wiki.postgresql.org/wiki/Hint_Bits a while back 
specifically to have a resource to explain that mystery to offer 
people.  But there's a difference between having a performance issue 
that people don't understand, and having a real bottleneck you can't get 
rid of.  My experience is that people who have hint bit issues run into 
them as a minor side-effect of a larger vacuum issue, and that if you 
get that under control they're only a minor detail in comparison.  Makes 
it hard to get too excited about optimizing them.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Greg Smith
Date:
Looks like it's time to close the book on this one for 9.1 
development...the unfortunate results are at 
http://www.2ndquadrant.us/pgbench-results/index.htm  Test set #12 is the 
one with spread sync I was hoping would turn out better than #9, the 
reference I was trying to improve on.  TPS is about 5% slower on the 
scale=500 and 15% slower on the scale=1000 tests with sync spread out.  
Even worse, maximum latency went up a lot. 

I am convinced of a couple of things now:

1) Most of the benefit we were seeing from the original patch I 
submitted was simply from doing much better at absorbing fsync requests 
from backends while the checkpoint sync was running.  The already 
committed fsync compaction patch effectively removes that problem 
though, to the extent it's possible to do so, making the remaining 
pieces here not as useful in its wake.

2) I need to start over testing here with something that isn't 100% 
writes all of the time the way pgbench is.  It's really hard to isolate 
latency improvements when the test program guarantees all associated 
write caches will be completely filled at every moment.  Also, pgbench 
can't show any benefit from changes that improve performance only for 
readers, which makes it quite unrealistic relative to real-world 
workloads.

3) The existing write spreading code in the background writer needs to 
be overhauled, too, before spreading the syncs around is going to give 
the benefits I was hoping for.

Given all that, I'm going to take my feedback and give the test server a 
much deserved break.  I'm happy that the fsync compaction patch has made 
9.1 much more tolerant of write-heavy loads than earlier versions, so 
it's not like no progress was made in this release.

For anyone who wants more details here...the news on this spread sync 
implementation is not all bad.  If you compare this result from HEAD, 
with scale=1000 and clients=256:

http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:

http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer samples in the >2000 ms latency 
area.  That shows up as a reduction in the 90th percentile latency 
figures I compute, and you can see it in the graph if you look at how 
much denser the points are in the 2000 - 4000 ms area on #611.  But 
that's a pretty weak improvement.

But the most disappointing part here relative to what I was hoping is 
what happens with bigger buffer caches.  The main idea driving this 
approach was that it would enable larger values of shared_buffers 
without the checkpoint spikes being as bad.  Test set #13 tries that 
out, by increasing shared_buffers from 256MB to 4GB, along with a big 
enough increase in checkpoint_segments to make most checkpoints time 
based.  Not only did smaller scale TPS drop in half, all kinds of bad 
things happened to latency.  Here's a sample of the sort of 
dysfunctional checkpoints that came out of that:

2011-02-10 02:41:17 EST: LOG:  checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync:  estimated segments=22
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=1 
file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=2 
file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG:  checkpoint sync: number=3 
file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=4 
file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG:  checkpoint sync: number=5 
file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG:  checkpoint sync: number=6 
file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG:  checkpoint sync: number=7 
file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG:  checkpoint sync: number=8 
file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=9 
file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG:  checkpoint sync: number=10 
file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG:  checkpoint sync: number=11 
file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=12 
file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=13 
file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=14 
file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=15 
file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=16 
file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=17 
file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=18 
file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=19 
file=base/16384/16761_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=20 
file=base/16384/16749_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=21 
file=base/16384/16385 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG:  checkpoint sync: number=22 
file=base/16384/16761.1 time=175.575 msec
2011-02-10 02:54:10 EST: LOG:  checkpoint complete: wrote 242614 buffers 
(46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled; 
write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22, 
longest=20.874 s, average=2.484 s
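The summary figures on that last line can be recomputed from the 
per-file DEBUG lines.  Here's a quick sketch of doing that (this is my 
own throwaway parser, not anything from the patch; it just matches the 
output format shown above):

```python
import re

# Matches the per-file DEBUG lines, which the archive wraps across two
# lines, e.g.:
#   checkpoint sync: number=9
#   file=base/16384/16766 time=20874.167 msec
# The \s+ between fields tolerates that wrapped newline.
SYNC_RE = re.compile(
    r"checkpoint sync: number=(\d+)\s+file=(\S+)\s+time=([\d.]+) msec")

def summarize_syncs(log_text):
    """Return (files, longest_s, average_s) from the per-file sync
    lines, matching the 'sync files=N, longest=X s, average=Y s'
    summary at checkpoint complete."""
    times_ms = [float(m.group(3)) for m in SYNC_RE.finditer(log_text)]
    if not times_ms:
        return (0, 0.0, 0.0)
    return (len(times_ms),
            max(times_ms) / 1000.0,
            sum(times_ms) / len(times_ms) / 1000.0)
```

Run over the full 22-line sample it reproduces the reported longest of 
20.874 s (file 16766) and the 2.484 s average.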

That's 12 minutes for the write phase, even though checkpoints should be 
happening every 5 minutes here.  With that bad of a write phase overrun, 
spread sync had no room to work, so no net improvement at all.  What is 
happening here is similar to the behavior I described seeing on my 
client system but didn't have an example to share until now.  During the 
write phase, looking at "Dirty:" in /proc/meminfo showed the value 
peaking at over 1GB while writes were happening, and eventually the 
background writer process wasn't getting any serious CPU time compared 
to the backends; this is what it looked like via ps:
%CPU     %MEM        TIME+     COMMAND
4    0    01:51.28     /home/gsmith/pgwork/inst/spread-sync/bin/pgbench 
-f /home/gsmith/pgbench-tools
2    8.1    00:39.71     postgres: gsmith pgbench ::1(43871) UPDATE
2    8    00:39.28     postgres: gsmith pgbench ::1(43875) UPDATE
2    8.1    00:39.92     postgres: gsmith pgbench ::1(43865) UPDATE
2    8.1    00:39.54     postgres: gsmith pgbench ::1(43868) UPDATE
2    8    00:39.36     postgres: gsmith pgbench ::1(43870) INSERT
2    8.1    00:39.47     postgres: gsmith pgbench ::1(43877) UPDATE
1    8    00:39.39     postgres: gsmith pgbench ::1(43864) COMMIT
1    8.1    00:39.78     postgres: gsmith pgbench ::1(43866) UPDATE
1    8    00:38.99     postgres: gsmith pgbench ::1(43867) UPDATE
1    8.1    00:39.55     postgres: gsmith pgbench ::1(43872) UPDATE
1    8.1    00:39.90     postgres: gsmith pgbench ::1(43873) UPDATE
1    8.1    00:39.64     postgres: gsmith pgbench ::1(43876) UPDATE
1    8.1    00:39.93     postgres: gsmith pgbench ::1(43878) UPDATE
1    8.1    00:39.83     postgres: gsmith pgbench ::1(43863) UPDATE
1    8    00:39.47     postgres: gsmith pgbench ::1(43869) UPDATE
1    8.1    00:40.11     postgres: gsmith pgbench ::1(43874) UPDATE
1    0    00:11.91     [flush-9:1]
0    0    27:43.75     [xfsdatad/6]
0    9.4    00:02.21     postgres: writer process
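To put that write phase overrun in numbers: the spread writes are 
supposed to finish within checkpoint_timeout * 
checkpoint_completion_target.  A back-of-the-envelope check, assuming 
the default completion_target of 0.5 (the server's actual setting isn't 
shown above):

```python
def write_phase_budget(checkpoint_timeout_s, completion_target):
    """Time the checkpoint's spread writes are supposed to fit into."""
    return checkpoint_timeout_s * completion_target

# Figures from the log above: write=716.637 s, against checkpoints that
# should fire every 5 minutes.
budget = write_phase_budget(300, 0.5)  # 150.0 s of write budget
overrun = 716.637 / budget             # roughly 4.8x over budget
```

At nearly 5x over its budget, the write phase leaves nothing for the 
sync phase to spread into.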

I want to make this problem go away, but as you can see spreading the 
sync calls around isn't enough.  I think the main write loop needs to 
get spread out more, too, so that the background writer is trying to 
work at a more reasonable pace.  I am pleased I've been able to 
reproduce this painful behavior at home using test data, because that 
much improves my odds of being able to isolate its cause and test 
solutions.  But it's a tricky problem, and I'm certainly not going to
fix it in the next week.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: Spread checkpoint sync

From
Robert Haas
Date:
On Thu, Feb 10, 2011 at 10:30 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> 3) The existing write spreading code in the background writer needs to be
> overhauled, too, before spreading the syncs around is going to give the
> benefits I was hoping for.

I've been thinking about this problem a bit.  It strikes me that the
whole notion of a background writer delay is probably wrong-headed.
Instead of having fixed-length cycles, we might want to make the delay
dependent on whether we're actually keeping up.  So during each cycle,
we decide how many buffers we want to clean, and we write 'em.  Then
we go to sleep.  When we wake up again, we figure out whether we kept
up.  If the number of buffers we wrote during the prior cycle was more
than the required number, then we'll sleep longer the next time, up to
some maximum; if we didn't write enough, we'll reduce the sleep.

Along with this, we'd want to change the minimum rate of writing
checkpoint buffers from 1 per cycle to 1 for every 200 ms, or
something like that.
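
A minimal sketch of that feedback loop (the function name, step size, 
and bounds here are all invented for illustration; nothing above 
specifies constants):

```python
def next_delay(current_delay_ms, buffers_written, buffers_required,
               min_delay_ms=10, max_delay_ms=1000, step_ms=50):
    """Adjust the background writer's sleep based on whether the last
    cycle kept up: wrote more than required -> sleep longer, up to a
    maximum; wrote less than required -> sleep less, down to a floor."""
    if buffers_written > buffers_required:
        return min(current_delay_ms + step_ms, max_delay_ms)
    if buffers_written < buffers_required:
        return max(current_delay_ms - step_ms, min_delay_ms)
    return current_delay_ms
```

The point is just that the cycle length becomes a function of observed 
progress rather than a fixed bgwriter_delay.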

We could even possibly have a system where backends wake the
background writer up early if they notice that it's not keeping up,
although it's not exactly clear what a good algorithm would be.
Another thing that would be really nice is if backends could somehow
let the background writer know when they're using a
BufferAccessStrategy, and somehow convince the background writer to
write those buffers out to the OS at top speed.

> I want to make this problem go away, but as you can see spreading the sync
> calls around isn't enough.  I think the main write loop needs to get spread
> out more, too, so that the background writer is trying to work at a more
> reasonable pace.  I am pleased I've been able to reproduce this painful
> behavior at home using test data, because that much improves my odds of
> being able to isolate its cause and test solutions.  But it's a tricky
> problem, and I'm certainly not going to fix it in the next week.

Thanks for working on this.  I hope we get a better handle on it for 9.2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company