Thread: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

   PostgreSQL (<= 9.4) tries to spread a checkpoint's buffer writes smoothly over checkpoint_completion_target of the checkpoint interval (checkpoint_timeout or checkpoint_segments). But when we use synchronous_commit = off, there is a problem with the checkpoint_segments target: xlog is written very fast right after the checkpoint starts (because of the full-page write for the first touch of each page after the checkpoint), so the checkpointer cannot sleep and the buffer writes are not smooth.
   Here is a test:
# stap -DMAXSKIPPED=100000 -v 11111 -e '
global s_var, e_var, stat_var;

/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}

/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}

/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}

/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}

probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}

probe timer.s(1) {
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
  printf("----------------------------------end-----------------------------\n")
  delete s_var
  delete e_var
  delete stat_var
}'
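
(Note for anyone reproducing this: the smgr__md__*, buffer__sync__* and checkpoint__* probe points used above are only compiled into postgres when it is built with --enable-dtrace, as Amit also points out further down in the thread.)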



Use the test table and data:
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();


Use pgbench to test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;


$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000
When the on-schedule checkpoint occurs, the tps:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524

The parameters for the on-schedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9

stap's output:
There are 156467 dirty blocks. Looking at the buffer writes per second below, the writes are not spread smoothly against the time target, but follow the xlog (checkpoint_segments) target instead.
A smooth spread would be about 156467 / (5 * 60 * 0.9) = 579.5 writes per second, yet the first seconds of the checkpoint show 8000-9000 writes per second.


checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216
......



I think we can add a condition to IsCheckpointOnSchedule:
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }


# vi src/backend/postmaster/checkpointer.c

#include "access/xact.h"

/*
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
 *               in time?
 *
 * Compares the current progress against the time/segments elapsed since last
 * checkpoint, and returns true if the progress we've made this far is greater
 * than the elapsed time/segments.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
        XLogRecPtr      recptr;
        struct timeval now;
        double          elapsed_xlogs,
                                elapsed_time;

        Assert(ckpt_active);

        /* Scale progress according to checkpoint_completion_target. */
        progress *= CheckPointCompletionTarget;

        /*
         * Check against the cached value first. Only do the more expensive
         * calculations once we reach the target previously calculated. Since
         * neither time or WAL insert pointer moves backwards, a freshly
         * calculated value can only be greater than or equal to the cached value.
         */
        if (progress < ckpt_cached_elapsed)
                return false;

        /*
         * Check progress against WAL segments written and checkpoint_segments.
         *
         * We compare the current WAL insert location against the location
         * computed before calling CreateCheckPoint. The code in XLogInsert that
         * actually triggers a checkpoint when checkpoint_segments is exceeded
         * compares against RedoRecptr, so this is not completely accurate.
         * However, it's good enough for our purposes, we're only calculating an
         * estimate anyway.
         */
        if (!RecoveryInProgress())
        {
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }
        }

        /*
         * Check progress against time elapsed and checkpoint_timeout.
         */
        gettimeofday(&now, NULL);
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;

        if (progress < elapsed_time)
        {
                ckpt_cached_elapsed = elapsed_time;
                return false;
        }

        /* It looks like we're on schedule. */
        return true;
}

# gmake && gmake install

$ pg_ctl restart -m fast
Test again:
progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309

We can see the checkpointer writing buffers smoothly (spread over the time period) this time.
checkpoint start
----------------------------------end-----------------------------
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545


But there is still a small problem: when PostgreSQL writes enough xlog to reach checkpoint_segments earlier than checkpoint_timeout, the next checkpoint will start soon afterwards, so we must tune checkpoint_segments larger when checkpoints are busy.
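
(A rough worked example, assuming the usual 16 MB WAL segments: checkpoint_segments = 512 corresponds to about 512 * 16 MB ≈ 8 GB of WAL per checkpoint cycle, so with checkpoint_timeout = 5min the xlog target wins whenever the sustained WAL rate exceeds roughly 8 GB / 300 s ≈ 27 MB/s. To keep checkpoints time-driven, checkpoint_segments has to be at least wal_rate * checkpoint_timeout / 16 MB.)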

Regards,
Digoal

--
Public welfare is a lifelong commitment. I'm Digoal, Just Do It.
On 05/12/2015 03:27 AM, digoal zhou wrote:
>     PostgreSQL (<= 9.4) tries to spread a checkpoint's buffer writes
> smoothly over checkpoint_completion_target of the checkpoint interval
> (checkpoint_timeout or checkpoint_segments). But when we use
> synchronous_commit = off, there is a problem with the checkpoint_segments
> target: xlog is written very fast right after the checkpoint starts
> (because of the full-page write for the first touch of each page after the
> checkpoint), so the checkpointer cannot sleep and the buffer writes are not
> smooth.
> ...
> I think we can add a condition to IsCheckpointOnSchedule:
>                  if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
>                  {
>                      recptr = GetInsertRecPtr();
>                      elapsed_xlogs = (((double) (recptr -
> ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
>
>                      if (progress < elapsed_xlogs)
>                      {
>                          ckpt_cached_elapsed = elapsed_xlogs;
>                          return false;
>                      }
>                   }

This has nothing to do with synchronous_commit, except that setting
synchronous_commit=off makes your test case run faster, and hit the
problem harder.

I think the real problem here is that IsCheckpointOnSchedule assumes
that the rate of WAL generated is constant throughout the checkpoint
cycle, but in reality you generate a lot more WAL immediately after the
checkpoint begins, thanks to full_page_writes. For example, in the
beginning of the cycle, you quickly use up, say, 20% of the WAL space in
the first 10 seconds, and the scheduling thinks it's in a lot of hurry
to finish the checkpoint because it extrapolates that the rest of the
WAL will be used up in the next 40 seconds. But in reality, the WAL
consumption levels off, and you have many minutes left until
CheckPointSegments.

Can you try the attached patch? It modifies the above calculation to
take the full-page-write effect into account. I used X^1.5 as the
corrective function, which roughly reflects the typical WAL consumption
pattern. You can adjust the exponent, 1.5, to make the correction more
or less aggressive.
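
(The attachment itself is not reproduced in this thread. As a sketch only, assuming the correction is applied directly to the WAL-based branch of IsCheckpointOnSchedule shown earlier in the thread, the idea amounts to something like the following; pow() needs <math.h>, and the exponent is the 1.5 mentioned above:

        if (!RecoveryInProgress())
        {
                recptr = GetInsertRecPtr();
                elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) /
                                 XLogSegSize) / CheckPointSegments;

                /*
                 * Compensate for the full-page-write burst right after the
                 * redo point: raising the consumed-WAL fraction to the power
                 * 1.5 asks for less buffer progress early in the cycle, e.g.
                 * 20% of the WAL budget consumed maps to a ~9% buffer target.
                 */
                elapsed_xlogs = pow(elapsed_xlogs, 1.5);

                if (progress < elapsed_xlogs)
                {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                }
        }

The actual patch may differ in detail; see the attachment below.)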

- Heikki


Attachment
(please keep the mailing list CC'd, and please don't top-post)

On 05/13/2015 05:00 AM, digoal zhou wrote:
> I tested it, but using an exponent is not perfect in every environment.
> Why can't we use time only?

As you mentioned yourself earlier, if you only use time but you reach
checkpoint_segments before checkpoint_timeout, you will not complete the
checkpoint until you'd already need to begin the next checkpoint. You
can't completely ignore checkpoint_segments.

Comparing the numbers you give below with
compensate-fpw-effect-on-checkpoint-scheduling-1.patch, with the ones
from your first post, it looks like the patch already made the situation
much better. You still have a significant burst in the beginning of the
checkpoint cycle, but it's a lot smaller than without the patch. Before
the patch, the "count" topped at 9078, and below it topped at 2964.
There is a strange "lull" after the burst, I'm not sure what's going on
there, but overall it seems like a big improvement.

Did the patch alleviate the bump in latency that pgbench reports?

I put the "count" numbers from your original post and below into a
spreadsheet, and created some fancy charts. See attached. It shows the
same thing but with pretty pictures. Assuming we want the checkpoint to
be spread as evenly as possible across the cycle, the ideal would be a
straight line from 0 to about 150000 in 270 seconds in the cumulative
chart. You didn't give the full data, but you can extrapolate the lines
to get a rough picture of how close the different versions are from that
ideal.

In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.

I don't want to over-compensate for the full-page-write effect either,
because there are also applications where that effect isn't so big. For
example, an application that performs a lot of updates, but all the
updates are on a small number of pages, so the full-page-write storm
immediately after checkpoint doesn't last long. A worst case for this
patch would be such an application - lots of updates on only a few pages
- with a long checkpoint_timeoout but relatively small
checkpoint_segments, so that checkpoints are always driven by
checkpoint_segments. I'd like to see some benchmarking of that worst
case before committing anything like this.

> ----------------------------------end-----------------------------
> checkpoint start
> buffer__sync__start num_buffers: 524288, dirty_buffers: 156931
> r1_or_w2 2, pid: 29132, min: 44, max: 151, avg: 52, sum: 49387, count: 932
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 95, avg: 49, sum: 41532, count: 837
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 747, avg: 54, sum: 100419, count: 1849
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 372, avg: 52, sum: 110701, count: 2090
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 115, avg: 57, sum: 147510, count: 2575
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 470, avg: 58, sum: 145217, count: 2476
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 120, avg: 54, sum: 161401, count: 2964
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 208, avg: 59, sum: 170280, count: 2847
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 10089, avg: 62, sum: 136106, count:
> 2181
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 41, max: 487, avg: 56, sum: 88990, count: 1570
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 39, max: 102, avg: 55, sum: 59807, count: 1083
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 557, avg: 56, sum: 117274, count: 2083
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 537, avg: 58, sum: 169867, count: 2882
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 147, avg: 60, sum: 92835, count: 1538
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 55, sum: 14641, count: 264
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 92, avg: 56, sum: 11834, count: 210
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 91, avg: 56, sum: 9151, count: 162
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 8621, count: 151
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 57, sum: 7962, count: 139
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 93, avg: 58, sum: 7194, count: 123
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 58, sum: 7143, count: 123
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 57, sum: 6801, count: 118
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 58, sum: 6818, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 57, sum: 6982, count: 121
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 6459, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 88, avg: 58, sum: 7022, count: 121
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 47, max: 94, avg: 57, sum: 5952, count: 104
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 49, max: 95, avg: 57, sum: 6871, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 49, max: 85, avg: 58, sum: 6829, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 89, avg: 57, sum: 6851, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 57, sum: 6779, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 93, avg: 55, sum: 6502, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 47, max: 98, avg: 58, sum: 6805, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 57, sum: 6771, count: 118
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 56, sum: 6593, count: 116
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 57, sum: 6809, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 100, avg: 57, sum: 6171, count: 107
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 105, avg: 57, sum: 6801, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 95, avg: 57, sum: 6792, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6693, count: 118
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 57, sum: 6878, count: 120
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6664, count: 117
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 57, sum: 7051, count: 123
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 6957, count: 120
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6842, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 100, avg: 57, sum: 6865, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 49, max: 102, avg: 58, sum: 6915, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6187, count: 107
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 50, max: 86, avg: 58, sum: 6957, count: 119
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 97, avg: 55, sum: 33636, count: 609
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 55, sum: 34180, count: 620
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 29, max: 92, avg: 53, sum: 36569, count: 680
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 91, avg: 54, sum: 37374, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 86, avg: 54, sum: 33347, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 54, sum: 37603, count: 684
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 93, avg: 55, sum: 33777, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 104, avg: 55, sum: 37566, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 92, avg: 54, sum: 37037, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 42, max: 106, avg: 57, sum: 35181, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 99, avg: 54, sum: 36981, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 53, sum: 33202, count: 622
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 89, avg: 54, sum: 36825, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 88, avg: 53, sum: 33917, count: 635
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 55, sum: 36234, count: 658
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 99, avg: 55, sum: 37719, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 54, sum: 33491, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 37365, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 101, avg: 54, sum: 33481, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 37, max: 93, avg: 54, sum: 37102, count: 685
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 87, avg: 54, sum: 36968, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 84, avg: 54, sum: 33565, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 54, sum: 37271, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 48, max: 96, avg: 55, sum: 34272, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 85, avg: 54, sum: 37378, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 107, avg: 53, sum: 36715, count: 680
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 54, sum: 33620, count: 616
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 29, max: 94, avg: 54, sum: 37093, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 39, max: 110, avg: 53, sum: 33013, count: 612
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 97, avg: 54, sum: 37215, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 37, max: 90, avg: 54, sum: 37240, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 41, max: 95, avg: 54, sum: 33555, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 38, max: 89, avg: 54, sum: 37503, count: 683
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 38, max: 95, avg: 55, sum: 33803, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 89, avg: 56, sum: 38403, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 92, avg: 54, sum: 37354, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 55, sum: 33881, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 34, max: 91, avg: 54, sum: 37047, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 85, avg: 53, sum: 33003, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 92, avg: 53, sum: 36854, count: 683
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 92, avg: 54, sum: 36597, count: 673
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33689, count: 620
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 92, avg: 54, sum: 37194, count: 684
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 53, sum: 32813, count: 612
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 100, avg: 54, sum: 37485, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 54, sum: 33294, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 94, avg: 54, sum: 37320, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 28, max: 92, avg: 54, sum: 37067, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 87, avg: 54, sum: 33766, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 110, avg: 53, sum: 36220, count: 680
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 98, avg: 54, sum: 33442, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 97, avg: 55, sum: 37692, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 54, sum: 37073, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 88, avg: 54, sum: 33676, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 103, avg: 53, sum: 36770, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33447, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 91, avg: 55, sum: 37643, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 54, sum: 37377, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 103, avg: 56, sum: 34531, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 121, avg: 54, sum: 37412, count: 683
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 40, max: 89, avg: 54, sum: 33173, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 31, max: 94, avg: 54, sum: 37385, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 28, max: 106, avg: 55, sum: 38132, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 34, max: 96, avg: 55, sum: 33800, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 37, max: 98, avg: 56, sum: 38305, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 28, max: 104, avg: 55, sum: 33744, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 103, avg: 54, sum: 36923, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 36, max: 89, avg: 55, sum: 37797, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 103, avg: 56, sum: 34902, count: 620
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 38025, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 43, max: 102, avg: 56, sum: 34545, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 34, max: 94, avg: 55, sum: 37756, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 54, sum: 33530, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 55, sum: 37992, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 99, avg: 55, sum: 37923, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 39, max: 101, avg: 55, sum: 34027, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 53, sum: 36078, count: 680
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 41, max: 89, avg: 51, sum: 31563, count: 612
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 31, max: 92, avg: 52, sum: 35596, count: 680
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 102, avg: 55, sum: 37816, count: 685
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 45, max: 102, avg: 55, sum: 33828, count: 613
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 54, sum: 37285, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 90, avg: 55, sum: 34037, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 86, avg: 54, sum: 37584, count: 684
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 32, max: 103, avg: 55, sum: 37946, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34556, count: 617
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 33, max: 99, avg: 56, sum: 38213, count: 681
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34613, count: 614
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 55, sum: 37925, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 35, max: 93, avg: 55, sum: 35504, count: 639
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 46, max: 90, avg: 55, sum: 36459, count: 655
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 30, max: 97, avg: 54, sum: 37369, count: 682
> ----------------------------------end-----------------------------
> r1_or_w2 2, pid: 29132, min: 31, max: 93, avg: 54, sum: 33161, count: 612
> ----------------------------------end----------------------------

- Heikki

Attachment
This needs more performance testing.



> This needs more performance testing.

Definitely. I may do that some day. However I'm not sure that this is 
currently the main issue in the checkpointer.

-- 
Fabien.



On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
 
In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.

Agreed
 
I don't want to over-compensate for the full-page-write effect either, because there are also applications where that effect isn't so big. For example, an application that performs a lot of updates, but all the updates are on a small number of pages, so the full-page-write storm immediately after checkpoint doesn't last long. A worst case for this patch would be such an application - lots of updates on only a few pages - with a long checkpoint_timeoout but relatively small checkpoint_segments, so that checkpoints are always driven by checkpoint_segments. I'd like to see some benchmarking of that worst case before committing anything like this.

We could do better, but that is not a reason not to commit this, as is. Commit, please.

This has been in place for a while and still remains: TODO: reduce impact of full page writes

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
 
In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.

Agreed

Do we want to consider whether wal_compression is enabled, as that
can reduce the effect of full_page_writes?


Also I am planning to run some tests for this patch, but I am not sure
if tps and/or latency numbers from pgbench are sufficient, or do you
people want to see actual read/write counts via some form of
dynamic tracing (stap) as done by the reporter of this issue?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello Simon,

> We could do better, but that is not a reason not to commit this, as is.
> Commit, please.

My 0,02€: Please do not commit without further testing...

I've submitted a patch to improve checkpoint write scheduling, including 
X00 hours of performance test on various cases. This patch changes 
significantly the load distribution over the whole checkpoint, and AFAICS 
has been tested on rather small cases.

I'm not sure that the power 1.5 is the right one for all cases. For a big 
checkpoint over 30 minutes, it may have, or not, very large and possibly 
unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what 
I really think is that it needs performance measures.

In conclusion, and very egoistically, I would prefer if this patch could 
wait for the checkpoint scheduling patch to be considered, as it would 
basically invalidate the X00 hours of performance tests I ran:-)

-- 
Fabien.

On 3 July 2015 at 06:38, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Simon,

We could do better, but that is not a reason not to commit this, as is.
Commit, please.

My 0,02€: Please do not commit without further testing...

I've submitted a patch to improve checkpoint write scheduling, including X00 hours of performance test on various cases. This patch changes significantly the load distribution over the whole checkpoint, and AFAICS has been tested on rather small cases.

I'm not sure that the power 1.5 is the right one for all cases. For a big checkpoint over 30 minutes, it may have, or not, very large and possibly unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I really think is that it needs performance measures.

power 1,5 is almost certainly not right for all cases, but it is simple and better. And easy to remove if something even better arrives.

I don't see the two patches being in conflict.
 
In conclusion, and very egoistically, I would prefer if this patch could wait for the checkpoint scheduling patch to be considered, as it would basically invalidate the X00 hours of performance tests I ran:-)

 I recommend making peace with yourself that probably 50% of development time is wasted. But we try to keep the best half.

Thank you for your time spent contributing.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-03 07:38:15 +0200, Fabien COELHO wrote:
> I've submitted a patch to improve checkpoint write scheduling, including X00
> hours of performance test on various cases. This patch changes significantly
> the load distribution over the whole checkpoint, and AFAICS has been tested
> on rather small cases.
> 
> I'm not sure that the power 1.5 is the right one for all cases. For a big
> checkpoint over 30 minutes, it may have, or not, very large and possibly
> unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I
> really think is that it needs performance measures.
> 
> In conclusion, and very egoistically, I would prefer if this patch could
> wait for the checkpoint scheduling patch to be considered, as it would
> basically invalidate the X00 hours of performance tests I ran:-)

These two patches target pretty independent mechanics. If your patch were
significantly influenced by this, something would be wrong. It might
decrease the benefit of your patch a mite, but that's not really a
problem.



> power 1,5 is almost certainly not right for all cases, but it is simple 
> and better.

It is better "in some cases", as I've been told on my patch. If you have a 
separate disk for WAL writes the power formula may just degrade 
performance, or maybe not, or not too much, or it really should be a guc.

Well, I just think that it needs more performance testing with various 
loads and sizes, really. I'm not against this patch at all.

> And easy to remove if something even better arrives.
>
> I don't see the two patches being in conflict.

They are not "in conflict" from a git point of view, and even if they were,
it would be trivial to solve.

They are in conflict as the patch changes the checkpoint load 
significantly, which would mean that my X00 hours of performance testing 
on the checkpoint scheduler should more or less be run again. Ok, it is 
somewhat egoistic, but I'm trying to avoid wasting people's time.

Another point is that I'm not sure I understand the decision process: for 
some patch in some area extensive performance tests are required, and for 
other patches in the same area they would not be.

-- 
Fabien.



Hello Andres,

>> In conclusion, and very egoistically, I would prefer if this patch could
>> wait for the checkpoint scheduling patch to be considered, as it would
>> basically invalidate the X00 hours of performance tests I ran:-)
>
> These two patches target pretty independent mechanics. If your patch were
> significantly influenced by this, something would be wrong. It might
> decrease the benefit of your patch a mite, but that's not really a
> problem.

That is not the issue I see. On the principle of performance testing it 
really means that I should rerun the tests, even if I expect that the 
overall influence would be pretty small in this case. This is my egoistic 
argument. Well, probably I would just rerun a few cases to check that the 
impact is "mite", as you said, not all cases.

Another point is that I'm not sure that this patch is ripe, in particular 
I'm skeptical about the hardcoded 1.5 without further testing. Maybe it is 
good, maybe 1.3 or 1.6 is better, maybe it depends and it should just be a 
guc with some advice about how to set it. So I really think that it needs
more performance figures than "it has a positive effect on one load".

Well, this is just my opinion, no need to care too much about it:-)

-- 
Fabien.



On Thu, Jul 2, 2015 at 4:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>  
>>>
>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.
>>
>>
>> Agreed
>
>
> Do we want to consider whether wal_compression is enabled, as that
> can reduce the effect of full_page_writes?
>
>
> Also I am planning to run some tests for this patch, but I am not sure
> if tps and/or latency numbers from pgbench are sufficient, or do you
> people want to see actual read/write counts via some form of
> dynamic tracing (stap) as done by the reporter of this issue?
>

I have run some tests with this patch and the detailed results of the runs are
attached with this mail.  I thought the patch should show a difference if I keep
max_wal_size at a somewhat lower or moderate value so that the checkpoint
gets triggered due to WAL size, but I am not seeing any major difference
in the writes spreading.  Below is the configuration I have used to run the
tests

synchronous_commit=off
checkpoint_timeout = 5min
max_wal_size=1GB
min_wal_size=80MB
checkpoint_completion_target=0.9
shared_buffers=8GB

Test (used almost the same testing mechanism as reported in the mail above)
------------------------------------------------------------------------------------------------
1.
Use the test table and data:
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();

2.
Use pgbench to test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;

3.
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 300

In another session, I ran the SystemTap script:
stap pgbenchrun.stp -DMAXSKIPPED=100000 -o pgbenchrun.out

pgbenchrun.stp (attached with mail).

During configure, you need to use --enable-dtrace and the debug build option
if you want to capture results using the SystemTap script.
Also, to run this, SystemTap should be installed on your system.

I should have run more tests to see in which cases this patch makes a
difference, but I thought it better to first discuss the current results
and see what other kinds of tests you want to run for this patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't
>>>> completely eliminate the problem, but it makes it a lot better.

I've looked at the maths.

I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) around 40% progress, and ends up at 1.5, that
is, the finishing load is 1.5 times the average load, just before fsyncing files.
This looks like a recipe for a bad time: I would say this is too large an
overload. I would suggest a much lower value, say around 1.1...
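
(Making the arithmetic explicit, under the assumption that the consumed-WAL fraction x advances linearly with time, which Heikki questions below: with a buffer-progress target of f(x) = x^1.5, the instantaneous write rate is f'(x) = 1.5 * x^0.5; it equals 1.0, the average rate, at x = (2/3)^2 ≈ 0.44, i.e. a bit over 40% of the cycle, and reaches 1.5 at x = 1.)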

The other issue with this function is that it should only degrade
performance by disrupting the write distribution if someone has WAL on a
different disk. As I understand it, this thing only makes sense if the
WAL & the data are on the same disk. This really suggests a guc.

> I have run some tests with this patch and the detailed results of the
> runs are attached with this mail.

I do not really understand the aggregated figures in the attached files.

I guess that maybe between "end" markers there is a summary of figures 
collected for 28 backends over 300-second runs (?), but I do not know what 
the min/max/avg/sum/count figures are about.

> I thought the patch should show a difference if I keep max_wal_size at a
> somewhat lower or moderate value so that the checkpoint gets triggered
> due to WAL size, but I am not seeing any major difference in the writes
> spreading.

I'm not sure I understand your point. I would say that at full speed 
pgbench the disk is always busy writing as much as possible, either 
checkpoint writes or wal writes, so the write load as such should not be 
that different anyway?

I understood that the point of the patch is to check whether there is a 
tps dip or not when the checkpoint begins, but I'm not sure how this can 
be inferred from the aggregated data you sent, and from my recent
tests the tps is very variable anyway on HDD.

-- 
Fabien.



On Jul 4, 2015, at 11:34 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't
>>>>> completely eliminate the problem, but it makes it a lot better.
>
> I've looked at the maths.
>
> I think that the load is distributed as the derivative of this function, that is (1.5 * x ** 0.5): it starts at 0 but
> very quickly reaches 0.5, it passes 1.0 (the average load) around 40% progress, and ends up at 1.5, that is, the
> finishing load is 1.5 times the average load, just before fsyncing files. This looks like a recipe for a bad time: I
> would say this is too large an overload. I would suggest a much lower value, say around 1.1...
>
> The other issue with this function is that it should only degrade performance by disrupting the write distribution if
> someone has WAL on a different disk. As I understand it, this thing only makes sense if the WAL & the data are on the
> same disk. This really suggests a guc.

I am a bit skeptical about this.  We need test scenarios that clearly show the benefit of having and of not having this
behavior. It might be that doing this always is fine for everyone.

...Robert


Hello Robert,

>> I've looked at the maths.
>>
>> I think that the load is distributed as the derivative of this
>> function, that is (1.5 * x ** 0.5): it starts at 0 but very quickly
>> reaches 0.5, it passes 1.0 (the average load) around 40% progress, and
>> ends up at 1.5, that is, the finishing load is 1.5 times the average
>> load, just before fsyncing files. This looks like a recipe for a bad time:
>> I would say this is too large an overload. I would suggest a much lower
>> value, say around 1.1...

>> The other issue with this function is that it should only degrade
>> performance by disrupting the write distribution if someone has WAL on
>> a different disk. As I understand it, this thing only makes sense if
>> the WAL & the data are on the same disk. This really suggests a guc.
>
> I am a bit skeptical about this.  We need test scenarios that clearly 
> show the benefit of having and of not having this behavior. It might be 
> that doing this always is fine for everyone.

Do you mean I have to prove that there is an actual problem induced by
this patch?

The logic fails me: I thought the patch submitter would have to show that 
his/her patch did not harm performance in various reasonable cases. At 
least this is what I'm told in another thread:-)

Currently this patch heavily changes the checkpoint write load
distribution in many cases, with a proof which consists of showing that it
may improve tps *briefly* on *one* example, as far as I understood the
issue and the tests. If this is enough proof to apply the patch, then the
minimum is that it should be possible to deactivate it, hence a guc.

Having a guc would also help to test the feature with different values 
than 1.5, which really seems harmful from a math point of view. I'm not 
sure at all that a power formula is the right approach.

The potential impact I see would be to significantly aggravate the write
stall issues I'm working on, but the measures provided in these tests do
not even look at or measure that.

-- 
Fabien.



On 07/04/2015 07:34 PM, Fabien COELHO wrote:
>
>>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't
>>>>> completely eliminate the problem, but it makes it a lot better.
>
> I've looked at the maths.
>
> I think that the load is distributed as the derivative of this function,
> that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
> passes 1.0 (the average load) around 40% progress, and ends up at 1.5, that
> is, the finishing load is 1.5 times the average load, just before fsyncing files.
> This looks like a recipe for a bad time: I would say this is too large an
> overload. I would suggest a much lower value, say around 1.1...

Hmm. Load is distributed as a derivative of that, but probably not the way
you think. Note that X means the amount of WAL consumed, not time. The 
goal is that I/O is constant over time, but the consumption of WAL over 
time is non-linear, with a lot more WAL consumed in the beginning of a 
checkpoint cycle. The function compensates for that.

> The other issue with this function is that it should only degrade
> performance by disrupting the write distribution if someone has WAL on a
> different disk. As I understand it, this thing only makes sense if the
> WAL & the data are on the same disk. This really suggests a guc.

No, the I/O storm caused by full-page-writes is a problem even if WAL is 
on a different disk. Even though the burst of WAL I/O then happens on a 
different disk, the fact that we consume a lot of WAL in the beginning 
of a checkpoint makes the checkpointer think that it needs to hurry up, 
in order to meet the deadline. It will flush a lot of pages in a rush, 
so you get a burst of I/O on the data disk too. Yes, it's even worse 
when WAL and data are on the same disk, but even then, I think the 
random I/O caused by the checkpointer hurrying is more significant than 
the extra WAL I/O, which is sequential.

To illustrate that, imagine that the checkpoint begins now. The 
checkpointer calculates that it has 10 minutes to complete the 
checkpoint (checkpoint_timeout), or until 1 GB of WAL has been generated 
(derived from max_wal_size), whichever happens first. Immediately after 
the Redo-point has been established, in the very beginning of the 
checkpoint, the WAL storm begins. Every backend that dirties a page also 
writes a full-page image. After just 10 seconds, those backends have 
already written 200 MB of WAL. That's 1/5 of the quota, and based on 
that, the checkpointer will quickly flush 1/5 of all buffers. In 
reality, the WAL consumption is not linear, and will slow down as time 
passes and fewer full-page writes happen. So in reality, the checkpointer 
would have a lot more time to complete the checkpoint - it is 
unnecessarily aggressive in the beginning of the checkpoint.

The correction factor in the patch compensates for that. With the X^1.5 
formula, when 20% of the WAL has already been consumed, the checkpointer 
will have flushed only ~9% of the buffers, not 20% as without the patch.
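
To spell that out as code, here is a minimal sketch of the correction 
(illustrative only, not the actual patch; the function name and the 
assumption that elapsed_xlogs is the 0..1 fraction of the WAL quota are 
mine):

#include <math.h>

/*
 * Illustrative only: map the fraction of the WAL quota consumed so far
 * (0.0 .. 1.0) to the buffer-write progress the checkpointer is expected
 * to have reached.  pow(x, 1.5) stays below x for 0 < x < 1, so a
 * FPW-heavy start no longer makes the checkpointer rush.
 */
static double
compensated_progress(double elapsed_xlogs)
{
    return pow(elapsed_xlogs, 1.5);     /* 0.2 -> ~0.089, 1.0 -> 1.0 */
}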

The ideal correction formula f(x) would be such that f(g(X)) = X, where:
 X is time, 0 = beginning of checkpoint, 1.0 = targeted end of 
checkpoint (checkpoint_timeout), and
 g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0 
= targeted end of checkpoint (derived from max_wal_size).

Unfortunately, we don't know the shape of g(X), as that depends on the 
workload. It might be linear, if there is no effect at all from 
full_page_writes. Or it could be a step-function, where every write 
causes a full page write, until all pages have been touched, and after 
that none do (something like an UPDATE without a where-clause might 
cause that). In pgbench-like workloads, it's something like sqrt(x). I 
picked X^1.5 as a reasonable guess. It's close enough to linear that it 
shouldn't hurt too much if g(x) is linear. But it cuts the worst spike 
at the very beginning, if g(x) is more like sqrt(x).
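
(If g(X) were exactly sqrt(X), the exact inverse would be f(x) = x^2, 
since f(g(X)) = (sqrt(X))^2 = X; x^1.5 sits between that and the identity 
function, which is the exact inverse when g is linear.)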

This is all assuming that the application load is constant. If it's 
not, g(x) can obviously have any shape, and there's no way we can 
predict that. But that's a different story, nothing to do with 
full_page_writes.

>> I have run some tests with this patch and the detailed results of the
>> runs are attached with this mail.
>
> I do not really understand the aggregated figures in the files attached.

Me neither. It looks like Amit measured the time spent in mdread and 
mdwrite, but I'm not sure what conclusions one can draw from that.

>> I thought the patch should show a difference if I keep max_wal_size at a
>> somewhat lower or moderate value so that checkpoints get triggered
>> due to WAL size, but I am not seeing any major difference in the write
>> spreading.
>
> I'm not sure I understand your point. I would say that at full speed
> pgbench the disk is always busy writing as much as possible, either
> checkpoint writes or wal writes, so the write load as such should not be
> that different anyway?
>
> I understood that the point of the patch is to check whether there is a
> tps dip or not when the checkpoint begins, but I'm not sure how this can
> be inferred from the many aggregated data you sent, and from my recent
> tests the tps is very variable anyway on HDD.

Right, that's my understanding too. If the disk is not saturated, 
perhaps because you used pgbench's rate-limiting option, then measuring 
the disk I/O would be useful too: flatter is better.

- Heikki




On 07/05/2015 08:19 AM, Fabien COELHO wrote:
>> I am a bit skeptical about this.  We need test scenarios that clearly
>> show the benefit of having and of not having this behavior. It might be
>> that doing this always is fine for everyone.
>
> Do you mean I have to prove that there is an actual problem induced by
> this patch?

You don't have to do anything if you don't want to. I said myself that 
this needs performance testing of the worst-case scenario, one where we 
would expect this to perform worse than without the patch. Then we can 
look at how bad that effect is, and decide if that's acceptable.

That said, if you could do that testing, that would be great! I'm not 
planning to spend much time on this myself, and it would take me a fair 
amount of time to set up the hardware and tools to test this. I was 
hoping Digoal would have the time to do that, since he started this 
thread, or someone else that has a system ready for this kind of 
testing. If no-one steps up to the plate to test this more, however, 
we'll have to just forget about this.

> Having a guc would also help to test the feature with different values
> than 1.5, which really seems harmful from a math point of view. I'm not
> sure at all that a power formula is the right approach.

Yeah, a GUC would be helpful in testing this. I'm hoping that we would 
come up with a reasonable formula that would work well enough for 
everyone that we wouldn't need to have a GUC in the final patch, though.

- Heikki




> You don't have to do anything if you don't want to.

Sure:-) What I mean is that I think that this patch is not ripe, and I 
understood that some people were suggesting that it could be applied as is 
right away. I'm really disagreeing with that.

> I said myself that this needs performance testing of the worst-case 
> scenario, one where we would expect this to perform worse than without 
> the patch. Then we can look at how bad that effect is, and decide if 
> that's acceptable.

Ok, I'm fine with that. It's quite different from "looks ok, apply now".

> That said, if you could do that testing, that would be great!

Hmmm. I was not really planning to. On the other hand, I have some scripts 
and a small setup that I've been using to test checkpointer flushing, and 
it would be easy to start some tests.

>> Having a guc would also help to test the feature with different values
>> than 1.5, which really seems harmful from a math point of view. I'm not
>> sure at all that a power formula is the right approach.
>
> Yeah, a GUC would be helpful in testing this. I'm hoping that we would come 
> up with a reasonable formula that would work well enough for everyone that we 
> wouldn't need to have a GUC in the final patch, though.

Yep. If it is a guc, testing is quite easy and I may run my scripts...

-- 
Fabien.



Hello Heikki,

>> I think that the load is distributed as the derivative of this function,
>> that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
>> passes 1.0 (the average load) around 40% progress, and ends up at 1.5, that
>> is the finishing load is 1.5 times the average load, just before fsyncing
>> files. This looks like a recipe for a bad time: I would say this is too
>> large an overload. I would suggest a much lower value, say around 1.1...
>
> Hmm. Load is distributed as a derivate of that, but probably not the way you 
> think. Note that X means the amount of WAL consumed, not time.

Interesting point. After a look at IsCheckpointOnSchedule, and if I 
understand the code correctly, it is actually *both*, so it really depends 
on whether the checkpoint was xlog or time triggered, and especially which 
one (time/xlog) is predominant at the beginning of the checkpoint.

If it is time triggered and paced, my reasoning is probably right and 
things will go bad/worse in the end, but if it is xlog-triggered and paced, 
your line of argument is probably closer to what happens.

This suggests that the corrective function should be applied with more 
care, maybe only for the xlog-based on-schedule test, but not for the 
time-based check.
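
For reference, the relevant logic is roughly this (a simplified 
paraphrase of IsCheckpointOnSchedule, with made-up lowercase names, not 
the exact source):

#include <stdbool.h>

/*
 * Simplified paraphrase: the checkpoint is considered "on schedule" only
 * if the fraction of buffers written so far is ahead of *both* the
 * WAL-based and the time-based progress estimates, scaled by
 * checkpoint_completion_target.
 */
static bool
is_checkpoint_on_schedule(double progress,       /* buffers written, 0..1 */
                          double elapsed_xlogs,  /* WAL quota consumed, 0..1 */
                          double elapsed_time,   /* timeout elapsed, 0..1 */
                          double completion_target)
{
    progress *= completion_target;

    if (progress < elapsed_xlogs)
        return false;           /* behind on the xlog-based schedule */
    if (progress < elapsed_time)
        return false;           /* behind on the time-based schedule */

    return true;
}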

> The goal is that I/O is constant over time, but the consumption of WAL 
> over time is non-linear, with a lot more WAL consumed in the beginning 
> of a checkpoint cycle. The function compensates for that.

*If* the checkpointer pacing comes from WAL size, which may or may not be 
the case.

> [...]
>
> Unfortunately, we don't know the shape of g(X), as that depends on the 
> workload. It might be linear, if there is no effect at all from 
> full_page_writes. Or it could be a step-function, where every write causes a 
> full page write, until all pages have been touched, and after that none do 
> (something like an UPDATE without a where-clause might cause that).

If postgresql is running in its cache (i.e. within shared buffers), the 
usual assumption would be an unknown exponential probability decreasing 
with time while the same pages are hit over and over.

If postgresql is running from memory or disk (effective database size 
greater than shared buffers), pages are statistically not reused by another 
update before being written out, so a full page write would be needed 
throughout the whole checkpoint: there is no WAL storm (or it is always a 
storm, depending on the point of view) and the corrective factor would 
only create issues...

So basically I would say that what to do heavily depends on the database 
size and checkpoint trigger (time vs xlog), which really suggests that a 
guc is indispensable, and maybe that the place where the correction is 
applied is currently not the right one.

> In pgbench-like workloads, it's something like sqrt(x).

Probably for a small database size?

> I picked X^1.5 as a reasonable guess. It's close enough to linear that 
> it shouldn't hurt too much if g(x) is linear.

My understanding is still a 50% overload at the end of the checkpoint just 
before issuing fsync... I think that could hurt in some cases.
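
(To restate the arithmetic: the derivative of x^1.5 is 1.5 * x^0.5, which 
starts at 0, crosses the average load 1.0 at x = (2/3)^2 ~ 0.44, and ends 
at 1.5 at x = 1, right before the fsync phase.)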

> But it cuts the worst spike at the very beginning, if g(x) is more like 
> sqrt(x).

Hmmm. It's a trade-off: saving the first 10 seconds of the checkpoint 
at the price of risking a panic at the end of the checkpoint.

Now the right approach might be for pg to know what is happening by 
collecting statistics while running, and to apply a correction when it is 
needed, for the amount needed.

-- 
Fabien.



On Sun, Jul 5, 2015 at 1:19 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Do you mean I have to prove that there is an actual problem induced by
> this patch?

No, I'm not saying anyone *has* to do anything.  What I'm saying is
that I'm not convinced by your analysis.  I don't think we have enough
evidence at this point to conclude that a GUC is necessary, and I hope
it isn't, because I can't imagine what advice we would be able to give
people about how to set it, other than "try all the values and see what
works best", which isn't going to be satisfying.

More broadly, I don't really know how to test this patch and show when
it helps and when it hurts.  And I think we need that, rather than
just a theoretical analysis, to tune the behavior.  Heikki, can you
describe what you think a good test setup would be?  Like, what
workload should we run, and what measurements should we gather to see
what the patch is doing that is good or bad?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2015-07-05 11:05:28 -0400, Robert Haas wrote:
> More broadly, I don't really know how to test this patch and show when
> it helps and when it hurts.  And I think we need that, rather than
> just a theoretical analysis, to tune the behavior.  Heikki, can you
> describe what you think a good test setup would be?  Like, what
> workload should we run, and what measurements should we gather to see
> what the patch is doing that is good or bad?

I think a good start would be to graph the writeout rate over several
checkpoints.  It'd be cool if there were a better way, but it's probably
easiest to just graph the number of bytes written (using iostat) and the
number of dirty bytes in the kernel. That'll unfortunately include WAL,
but I can't immediately see how to avoid that.



> No, I'm not saying anyone *has* to do anything.  What I'm saying is
> that I'm not convinced by your analysis.

Well, the gist of my analysis is really to say that there are potential 
performance issues with the proposed change, and that it must be tested 
thoroughly. The details may vary:-)

> I don't think we have enough evidence at this point to conclude that a 
> GUC is necessary, and I hope it isn't, because I can't imagine what 
> advice we would be able to give people about how to set it, other than 
> "try all the value and see what works best", which isn't going to be 
> satisfying.

At least for testing, ISTM that a GUC would be really useful.

> More broadly, I don't really know how to test this patch and show when 
> it helps and when it hurts. And I think we need that, rather than just a 
> theoretical analysis, to tune the behavior.

The point of an analysis is to think about how it works and what to test, 
but it is not a substitute for testing, obviously.

-- 
Fabien.



On Sun, Jul 5, 2015 at 1:18 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/04/2015 07:34 PM, Fabien COELHO wrote:
>>> I have run some tests with this patch and the detailed results of the
>>> runs are attached with this mail.
>>
>> I do not really understand the aggregated figures in the files attached.
>
> Me neither. It looks like Amit measured the time spent in mdread and
> mdwrite, but I'm not sure what conclusions one can draw from that.

As Heikki has pointed out, it is stats data for mdread and mdwrite
between the checkpoints (in the data, you need to search for
"checkpoint start"/"checkpoint done").  Between checkpoint
start and checkpoint done, the data shows the amount of read/
write done (I am just trying to reproduce what Digoal has reported, so
I am using his script and I also don't understand everything, but I think
we can look at the counts between checkpoints to deduce whether the IO
is flattened after the patch).  Digoal was seeing a spike at the beginning of
the checkpoint (after checkpoint start) in his configuration without this patch,
and the spike seems to be reduced with this patch, whereas in my tests
I don't see a spike immediately after the checkpoint (although there are some
spikes in between) even without the patch, which means that either I might not
be using the right configuration to measure the IO or there is some other
difference between the way Digoal ran the test and the way I ran the tests.
I have done the setup (even though the hardware will not be the same, at
least I can run the tests and collect the data in a format similar to
Digoal's), so if you guys have suggestions about which parameters we should
tweak or which tests to run to gather the results, I can do that and present
the results here for further discussion.
>>> I thought the patch should show a difference if I keep max_wal_size at a
>>> somewhat lower or moderate value so that checkpoints get triggered
>>> due to WAL size, but I am not seeing any major difference in the write
>>> spreading.
>>
>> I'm not sure I understand your point. I would say that at full speed
>> pgbench the disk is always busy writing as much as possible, either
>> checkpoint writes or wal writes, so the write load as such should not be
>> that different anyway?
>>
>> I understood that the point of the patch is to check whether there is a
>> tps dip or not when the checkpoint begins, but I'm not sure how this can
>> be inferred from the many aggregated data you sent, and from my recent
>> tests the tps is very variable anyway on HDD.

Yes, we definitely want to see the effect on TPS at the beginning of checkpoint,
but even measuring the IO during checkpoint with the way Digoal was capturing
the data can show the effect of this patch.


With Regards,
Amit Kapila.



On Mon, Jul 6, 2015 at 12:30 PM, Amit Kapila wrote:
> Yes, we definitely want to see the effect on TPS at the beginning of
> checkpoint,
> but even measuring the IO during checkpoint with the way Digoal was
> capturing
> the data can show the effect of this patch.

I am marking this patch as returned with feedback.
-- 
Michael



Hi,

I was planning to do some review/testing on this patch, but then I 
noticed it was rejected with feedback in 2015-07 and never resubmitted 
into another CF. So I won't waste time in testing this unless someone 
shouts that I should do that anyway. Instead I'll just post some ideas 
about how we might improve the patch, because I'd forget about them 
otherwise.

On 07/05/2015 09:48 AM, Heikki Linnakangas wrote:
>
> The ideal correction formula f(x) would be such that f(g(X)) = X, where:
>
>   X is time, 0 = beginning of checkpoint, 1.0 = targeted end of
> checkpoint (checkpoint_timeout), and
>
>   g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0
> = targeted end of checkpoint (derived from max_wal_size).
>
> Unfortunately, we don't know the shape of g(X), as that depends on the
> workload. It might be linear, if there is no effect at all from
> full_page_writes. Or it could be a step-function, where every write
> causes a full page write, until all pages have been touched, and after
> that none do (something like an UPDATE without a where-clause might
> cause that). In pgbench-like workloads, it's something like sqrt(x). I
> picked X^1.5 as a reasonable guess. It's close enough to linear that it
> shouldn't hurt too much if g(x) is linear. But it cuts the worst spike
> at the very beginning, if g(x) is more like sqrt(x).

Exactly. I think the main "problem" here is that we do mix two types of 
WAL records, with quite different characteristics:
 (a) full_page_writes - very high volume right after checkpoint, then
     usually drops to much lower volume
 (b) regular records - about the same volume over time (well, lower
     volume right after the checkpoint, as that's where FPWs happen)


We completely ignore this when computing elapsed_xlogs, because we 
compute it (about) like this:
    elapsed_xlogs = wal_since_checkpoint / CheckPointSegments;
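    /* i.e. the fraction of the WAL quota (CheckPointSegments) consumed so far */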

which of course gets confused when we write a lot of WAL right after a 
checkpoint, because of FPW. But what if we actually tracked the amount 
of WAL produced by FPW in a checkpoint (which we currently don't, AFAIK)?

Then we could compute the expected *remaining* amount of WAL to be 
produced within the checkpoint interval, and use that to compute a 
better progress like this:
  wal_bytes          - WAL (total)
  wal_fpw_bytes      - WAL (due to FPW)
  prev_wal_bytes     - WAL (total) in previous checkpoint
  prev_wal_fpw_bytes - WAL (due to FPW) in previous checkpoint


So we know that we should expect about
  (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
  (       regular WAL        ) + (              FPW WAL             )

to be produced until the end of the current checkpoint. I don't have a 
clear idea how to transform this into the 'progress' yet, but I'm pretty 
sure tracking the two types of WAL is a key to a better solution. The 
x^1.5 is probably a step in the right direction, but I don't feel 
particularly confident about the 1.5 (which is rather arbitrary).
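
Just to make the idea concrete, a rough sketch of how such tracking could 
be turned into a progress estimate (one possible reading of the formula 
above; the variable names follow the mail, everything else is illustrative):

/*
 * Illustrative only: estimate checkpoint progress from the WAL written
 * so far plus the WAL we still expect, using the previous checkpoint
 * as a template for the current one.
 */
static double
estimated_progress(double wal_bytes,          /* WAL (total) so far */
                   double wal_fpw_bytes,      /* WAL (due to FPW) so far */
                   double prev_wal_bytes,     /* WAL (total), previous ckpt */
                   double prev_wal_fpw_bytes) /* WAL (FPW), previous ckpt */
{
    /* WAL we still expect before the end of this checkpoint */
    double remaining = (prev_wal_bytes - wal_bytes) +
                       (prev_wal_fpw_bytes - wal_fpw_bytes);

    if (remaining < 0)
        remaining = 0;          /* already past the previous cycle's totals */
    if (wal_bytes + remaining <= 0)
        return 0;               /* nothing written yet, no history */

    return wal_bytes / (wal_bytes + remaining);
}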

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So we know that we should expect about
>
>   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
>
>   (       regular WAL        ) + (              FPW WAL             )
>
> to be produced until the end of the current checkpoint. I don't have a clear
> idea how to transform this into the 'progress' yet, but I'm pretty sure
> tracking the two types of WAL is a key to a better solution. The x^1.5 is
> probably a step in the right direction, but I don't feel particularly
> confident about the 1.5 (which is rather arbitrary).

If it works well empirically, does it really matter that it's
arbitrary?  I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and which
to ignore.  The proof that we have made good decisions there is in the
query plans it generates.  (The proof that we have made bad decisions
in some cases in the query plans, too.)

I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 17/12/15 19:07, Robert Haas wrote:
> On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So we know that we should expect about
>>
>>    (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
>>
>>    (       regular WAL        ) + (              FPW WAL             )
>>
>> to be produced until the end of the current checkpoint. I don't have a clear
>> idea how to transform this into the 'progress' yet, but I'm pretty sure
>> tracking the two types of WAL is a key to a better solution. The x^1.5 is
>> probably a step in the right direction, but I don't feel particularly
>> confident about the 1.5 (which is rather arbitrary).
>
> If it works well empirically, does it really matter that it's
> arbitrary?  I mean, the entire planner is full of fairly arbitrary
> assumptions about which things to consider in the cost model and which
> to ignore.  The proof that we have made good decisions there is in the
> query plans it generates.  (The proof that we have made bad decisions
> in some cases in the query plans, too.)

Agreed.

> I think a bigger problem for this patch is that Heikki seems to have
> almost completely disappeared.

Yeah, there's that problem too :-).

The reason I didn't commit this back then was lack of performance 
testing. I'm fairly confident that this would be a significant 
improvement for some workloads, and shouldn't hurt much even in the 
worst case. But I did only a little testing on my laptop. I think Simon 
was in favor of just committing it immediately, and Fabien wanted to see 
more performance testing before committing.

I was hoping that Digoal would re-run his original test case, and report 
back on whether it helps. Fabien had a performance test setup, for 
testing another patch, but he didn't want to run it to test this patch. 
Amit did some testing, but didn't see a difference. We can take that as 
a positive sign - no regression - or as a negative sign, but I think 
that basically means that his test was just not sensitive to the FPW issue.

So Tomas, if you're willing to do some testing on this, that would be 
brilliant!

- Heikki




Hi,

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:
> On 17/12/15 19:07, Robert Haas wrote:
>> On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> So we know that we should expect about
>>>
>>>    (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
>>>
>>>    (       regular WAL        ) + (              FPW WAL             )
>>>
>>> to be produced until the end of the current checkpoint. I don't
>>> have a clear idea how to transform this into the 'progress' yet,
>>> but I'm pretty sure tracking the two types of WAL is a key to a
>>> better solution. The x^1.5 is probably a step in the right
>>> direction, but I don't feel particularly confident about the 1.5
>>> (which is rather arbitrary).
>>
>> If it works well empirically, does it really matter that it's
>> arbitrary? I mean, the entire planner is full of fairly arbitrary
>> assumptions about which things to consider in the cost model and
>> which to ignore. The proof that we have made good decisions there
>> is in the query plans it generates. (The proof that we have made
>> bad decisions in some cases in the query plans, too.)
>
> Agreed.

What if it only seems to work well because it was tested on cases it was 
designed for? What about the workloads that behave differently?

Whenever we do changes to costing and query planning, we carefully 
consider counter-examples and cases where it might fail. I see nothing 
like that in this thread - all I see is a bunch of pgbench tests, which 
seems rather insufficient to me.

>
>> I think a bigger problem for this patch is that Heikki seems to have
>> almost completely disappeared.
>
> Yeah, there's that problem too :-).
>
> The reason I didn't commit this back then was lack of performance
> testing. I'm fairly confident that this would be a significant
> improvement for some workloads, and shouldn't hurt much even in the
> worst case. But I did only a little testing on my laptop. I think
> Simon was in favor of just committing it immediately, and Fabien
> wanted to see more performance testing before committing.
>
> I was hoping that Digoal would re-run his original test case, and
> report back on whether it helps. Fabien had a performance test setup,
> for testing another patch, but he didn't want to run it to test this
> patch. Amit did some testing, but didn't see a difference. We can
> take that as a positive sign - no regression - or as a negative sign,
> but I think that basically means that his test was just not sensitive
> to the FPW  issue.
>
> So Tomas, if you're willing to do some testing on this, that would
> be brilliant!

I'm ready to spend some time on this, assuming we can agree on what 
tests to run. Can we come up with realistic workloads where we expect 
the patch might actually work poorly?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 21/12/15 13:53, Tomas Vondra wrote:
> On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:
>> On 17/12/15 19:07, Robert Haas wrote:
>>> If it works well empirically, does it really matter that it's
>>> arbitrary? I mean, the entire planner is full of fairly arbitrary
>>> assumptions about which things to consider in the cost model and
>>> which to ignore. The proof that we have made good decisions there
>>> is in the query plans it generates. (The proof that we have made
>>> bad decisions in some cases in the query plans, too.)
>>
>> Agreed.
>
> What if it only seems to work well because it was tested on cases it was
> designed for? What about the workloads that behave differently?
>
> Whenever we do changes to costing and query planning, we carefully
> consider counter-examples and cases where it might fail. I see nothing
> like that in this thread - all I see is a bunch of pgbench tests, which
> seems rather insufficient to me.

Agreed on that too.

> I'm ready to spend some time on this, assuming we can agree on what
> tests to run. Can we come up with realistic workloads where we expect
> the patch might actually work poorly?

I think the worst case scenario would be the case where there is no 
FPW-related WAL burst at all, and checkpoints are always triggered by 
max_wal_size rather than checkpoint_timeout. In that scenario, the 
compensation formula will cause the checkpoint to be too lazy in the 
beginning, and it will have to catch up more aggressively towards the 
end of the checkpoint cycle.
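
(Concretely: with a perfectly linear g, at the halfway point of the WAL 
quota the compensated schedule only asks for 0.5^1.5 ~ 35% of the buffers 
to be written, leaving ~65% of the writes for the second half of the cycle.)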

One such scenario might be to do only COPYs into a table with no 
indexes. Or hack pgbench to concentrate all the updates on only a very 
few rows. There will be FPWs on those few pages initially, but the 
spike will be much shorter. Or turn full_page_writes=off, hack the 
patch to do the compensation even when full_page_writes=off, and then just 
run pgbench.

- Heikki




Hello Heikki,

> The reason I didn't commit this back then was lack of performance testing. 
> I'm fairly confident that this would be a significant improvement for some 
> workloads, and shouldn't hurt much even in the worst case. But I did only a 
> little testing on my laptop. I think Simon was in favor of just committing it 
> immediately, and

> Fabien wanted to see more performance testing before committing.

I confirm. To summarize my opinion:

I think that the 1.5 value somewhere in the patch is much too high for the 
purpose because it shifts the checkpoint load quite a lot (50% more load 
at the end of the checkpoint) just for the purpose of avoiding a spike 
which lasts a few seconds (I think) at the beginning. A much smaller value 
should be used (1.0 <= factor < 1.1), as it would be much less disruptive 
and would probably avoid the issue just the same. I recommend not to 
commit with a 1.5 factor in any case.

Another issue I raised is that the load change occurs both with xlog and 
time triggered checkpoints, and I'm sure it should be applied in both 
case.

Another issue is that the patch makes sense when the WAL & relations are 
on the same disk, but might degrade performance otherwise.

Another point is that it potentially interacts with a patch I submitted which 
has a large impact on performance (an order of magnitude better in some cases, 
by sorting & flushing blocks on checkpoints), so it would make sense to 
check that.

So more testing is definitely needed. A guc would be nice for this 
purpose, especially to look at different factors.

> I was hoping that Digoal would re-ran his original test case, and report 
> back on whether it helps. Fabien had a performance test setup, for 
> testing another patch, but he didn't want to run it to test this patch.

Indeed, I have, but I'm quite behind at the moment, so I cannot promise 
anything. Moreover, I'm not sure I see this "spike" issue in my setting, 
AFAICR.

-- 
Fabien.



On Mon, Dec 21, 2015 at 7:51 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I think that the 1.5 value somewhere in the patch is much too high for the
> purpose because it shifts the checkpoint load quite a lot (50% more load at
> the end of the checkpoint) just for the purpose of avoiding a spike which
> lasts a few seconds (I think) at the beginning. A much smaller value should
> be used (1.0 <= factor < 1.1), as it would be much less disruptive and would
> probably avoid the issue just the same. I recommend not to commit with a 1.5
> factor in any case.

Wait, what?  On what workload does the FPW spike last only a few
seconds?  That's certainly not the case in testing I've done.  It
would have to be the case that almost all the writes were concentrated
on a very few pages.

> Another issue I raised is that the load change occurs both with xlog and
> time triggered checkpoints, and I'm sure it should be applied in both case.

Is this sentence missing a "not"?

> Another issue is that the patch makes sense when the WAL & relations are on
> the same disk, but might degrade performance otherwise.

Yes, that would be a good case to test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Hello Robert,

>> I think that the 1.5 value somewhere in the patch is much too high for the
>> purpose because it shifts the checkpoint load quite a lot (50% more load at
>> the end of the checkpoint) just for the purpose of avoiding a spike which
>> lasts a few seconds (I think) at the beginning. A much smaller value should
>> be used (1.0 <= factor < 1.1), as it would be much less disruptive and would
>> probably avoid the issue just the same. I recommend not to commit with a 1.5
>> factor in any case.
>
> Wait, what?  On what workload does the FPW spike last only a few
> seconds? [...]

Ok. AFAICR, a relatively small part at the beginning of the checkpoint, 
but possibly more than a few seconds.

My actual point is that it should be tested with different and especially 
smaller values, because 1.5 changes the overall load distribution *a lot*. 
For testing purposes I suggested that a guc would help, but the patch 
author has never come back to intervene on the thread, discuss the 
arguments, or provide another patch.

>> Another issue I raised is that the load change occurs both with xlog and
>> time triggered checkpoints, and I'm sure it should be applied in both case.
>
> Is this sentence missing a "not"?

Indeed. I think that it makes sense for xlog triggered checkpoints, but 
less so with time triggered checkpoints. I may be wrong, but I think that 
this deserves careful analysis.

-- 
Fabien.



On Wed, Dec 23, 2015 at 9:22 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Wait, what?  On what workload does the FPW spike last only a few
>> seconds? [...]
>
> Ok. AFAICR, a relatively small part at the beginning of the checkpoint, but
> possibly more than a few seconds.

On a pgbench test, and probably many other workloads, the impact of
FPWs declines exponentially (or maybe geometrically, but I think
exponentially) as we get further into the checkpoint.  The first write
is dead certain to need an FPW; after that, if access is more or less
random, the chance of needing an FPW for the next write increases in
proportion to the number of FPWs already written.  As the chances of
NOT needing an FPW grow higher, the tps rate starts to increase,
initially just a bit, but then faster and faster as the percentage of
the working set that has already had an FPW grows.  If the working set
is large, we're still doing FPWs pretty frequently when the next
checkpoint hits - if it's small, then it'll tail off sooner.
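
Just to illustrate the shape (a toy calculation assuming a working set of 
one million uniformly accessed pages, not a claim about any particular 
benchmark):

#include <math.h>
#include <stdio.h>

/*
 * Toy illustration: probability that the k-th random write into a
 * working set of N pages still needs a full-page image, assuming
 * uniform access: roughly (1 - 1/N)^k ~= exp(-k/N).
 */
int
main(void)
{
    const double N = 1000000.0;         /* assumed working-set size in pages */

    for (double k = 0; k <= 4 * N; k += N)
        printf("after %8.0f writes: P(FPW) ~ %.3f\n", k, exp(-k / N));

    return 0;
}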

> My actual point is that it should be tested with different and especially
> smaller values, because 1.5 changes the overall load distribution *a lot*.
> For testing purposes I suggested that a guc would help, but the patch author
> has never come back to intervene on the thread, discuss the arguments, or
> provide another patch.

Well, somebody else should be able to hack a GUC into the patch.

I think one thing that this conversation exposes is that the size of
the working set matters a lot.   For example, if the workload is
pgbench, you're going to see a relatively short FPW-related spike at
scale factor 100, but at scale factor 3000 it's going to be longer and
at some larger scale factor it will be longer still.  Therefore you're
probably right that 1.5 is unlikely to be optimal for everyone.

Another point (which Jan Wieck made me think of) is that the optimal
behavior here likely depends on whether xlog and data are on the same
disk controller.  If they aren't, the FPW spike and background writes
may not interact as much.

>>> Another issue I raised is that the load change occurs both with xlog and
>>> time triggered checkpoints, and I'm sure it should be applied in both
>>> case.
>>
>> Is this sentence missing a "not"?
> Indeed. I think that it makes sense for xlog triggered checkpoints, but less
> so with time triggered checkpoints. I may be wrong, but I think that this
> deserves careful analysis.

Hmm, off-hand I don't see why that should make any difference.  No
matter what triggers the checkpoint, there is going to be a spike of
FPI activity at the beginning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Hello Robert,

> On a pgbench test, and probably many other workloads, the impact of
> FPWs declines exponentially (or maybe geometrically, but I think
> exponentially) as we get further into the checkpoint.

Indeed. If the probability of hitting a page is uniform, I think that the 
FPW probability is exp(-n/N) for the n-th page access.
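
(Each of the first n-1 writes misses a given page with probability 
1 - 1/N, so the n-th write still needs an FPW with probability about 
(1 - 1/N)^(n-1) ~ exp(-n/N), with N the number of pages in the working set.)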

> The first write is dead certain to need an FPW; after that, if access is 
> more or less random, the chance of needing an FPW for the next write 
> decreases as the number of FPWs already written grows.  As the 
> chances of NOT needing an FPW grow higher, the tps rate starts to 
> increase, initially just a bit, but then faster and faster as the 
> percentage of the working set that has already had an FPW grows.  If the 
> working set is large, we're still doing FPWs pretty frequently when the 
> next checkpoint hits - if it's small, then it'll tail off sooner.

Yes.

>> My actual point is that it should be tested with different and especially
>> smaller values, because 1.5 changes the overall load distribution *a lot*.
>> For testing purposes I suggested that a guc would help, but the patch author
>> has never come back to intervene on the thread, discuss the arguments, or
>> provide another patch.
>
> Well, somebody else should be able to hack a GUC into the patch.

Yep. But I'm so far behind everything that I was basically waiting for the 
author to do it:-)

> I think one thing that this conversation exposes is that the size of
> the working set matters a lot.   For example, if the workload is
> pgbench, you're going to see a relatively short FPW-related spike at
> scale factor 100, but at scale factor 3000 it's going to be longer and
> at some larger scale factor it will be longer still.  Therefore you're
> probably right that 1.5 is unlikely to be optimal for everyone.
>
> Another point (which Jan Wieck made me think of) is that the optimal
> behavior here likely depends on whether xlog and data are on the same
> disk controller.  If they aren't, the FPW spike and background writes
> may not interact as much.

Yep, I pointed out that as well. In which case the patch just disrupts the 
checkpoint load for no benefit... Which would make a guc mandatory.

>> [...]. I think that it makes sense for xlog triggered checkpoints, but 
>> less so with time triggered checkpoints. I may be wrong, but I think 
>> that this deserves careful analysis.
>
> Hmm, off-hand I don't see why that should make any difference.  No
> matter what triggers the checkpoint, there is going to be a spike of
> FPI activity at the beginning.

Hmmm. Let us try with both hands:

AFAICR with xlog-triggered checkpoints, the checkpointer progress is 
measured with respect to the size of the WAL file, which does not grow 
linearly in time for the reason you pointed above (a lot of FPW at the 
beginning, less in the end). As the WAL file is growing quickly, the 
checkpointer thinks that it is late and that it has some catchup to do, so 
it will start to try writing quickly as well. There is a double whammy as 
both are writing more, and are probably not succeeding.

For time triggered checkpoints, the WAL file gets filled up *but* the 
checkpointer load is balanced against time. This is a "simple" whammy, 
where the checkpointer uses IO bandwidth which is needed for the WAL, and 
it could wait a little bit because the WAL will need less later, but it is 
not trying to catch up by even writing more, so the load shifting needed 
in this case is not the same as the previous case.

As you point out there is a WAL spike in both case, but in one case there 
is also a checkpointer spike and in the other the checkpointer load is 
flat.

So I think that the correction should not be the same in both cases. 
Moreover no correction is needed if WAL & relations are on different 
disks. Also, as you pointed out, it also depends on the load (for a large 
base the FPW is spead more evenly, for smaller bases there is a spike), so 
the corrective formula should take that information into account, which 
means that some evaluation of the FPW distribution should be collected...

All this is non trivial. I may do some math to try to solve this, but I'm 
pretty sure that a blank 1.5 correction in all cases is not the solution.

-- 
Fabien.



Hi,

On 12/23/2015 03:38 PM, Robert Haas wrote:
>
> I think one thing that this conversation exposes is that the size of
> the working set matters a lot. For example, if the workload is
> pgbench, you're going to see a relatively short FPW-related spike at
> scale factor 100, but at scale factor 3000 it's going to be longer
> and at some larger scale factor it will be longer still. Therefore
> you're probably right that 1.5 is unlikely to be optimal for
> everyone.

Right.

Also, when you say "pgbench" you probably mean the default uniform 
distribution. But we now have gaussian and exponential distributions 
which might be handy to simulate other types of workloads.

>
> Another point (which Jan Wieck made me think of) is that the optimal
> behavior here likely depends on whether xlog and data are on the same
> disk controller. If they aren't, the FPW spike and background writes
> may not interact as much.

I'm not sure what exactly you mean by "optimal behavior" here. Surely if 
you want to minimize interference between WAL and regular I/O, you'll do 
that.

But I don't see what that has to do with the writes generated by the 
checkpoint? If we do much more writes at the beginning of the checkpoint 
(due to getting confused by FPW), and OS starts flushing that to disk 
because we exceed dirty_(background)_bytes, that surely interferes with 
reads (which is a major issue for queries).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Dec 23, 2015 at 10:37 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hmmm. Let us try with both hands:
>
> AFAICR with xlog-triggered checkpoints, the checkpointer progress is
> measured with respect to the size of the WAL file, which does not grow
> linearly in time for the reason you pointed above (a lot of FPW at the
> beginning, less in the end). As the WAL file is growing quickly, the
> checkpointer thinks that it is late and that it has some catchup to do, so
> it will start to try writing quickly as well. There is a double whammy as
> both are writing more, and are probably not succeeding.
>
> For time triggered checkpoints, the WAL file gets filled up *but* the
> checkpointer load is balanced against time. This is a "simple" whammy, where
> the checkpointer uses IO bandwidth which is needed for the WAL, and it could
> wait a little bit because the WAL will need less later, but it is not trying
> to catch up by even writing more, so the load shifting needed in this case
> is not the same as the previous case.

I see your point, but this isn't a function of what triggered the
checkpoint.  It's a function of how we measure whether the
already-triggered checkpoint is on schedule - we may be behind either
because of time, or because of xlog, or both.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Another point (which Jan Wieck made me think of) is that the optimal
>> behavior here likely depends on whether xlog and data are on the same
>> disk controller. If they aren't, the FPW spike and background writes
>> may not interact as much.
>
> I'm not sure what exactly you mean by "optimal behavior" here. Surely if you
> want to minimize interference between WAL and regular I/O, you'll do that.
>
> But I don't see what that has to do with the writes generated by the
> checkpoint? If we do much more writes at the beginning of the checkpoint
> (due to getting confused by FPW), and OS starts flushing that to disk
> because we exceed dirty_(background)_bytes, that surely interferes with
> reads (which is a major issue for queries).

Well, it's true that the checkpointer dirty page writes could
interfere with reads, but if you've also got lots of FPW-bloated WAL
records being written to the same disk at the same time, I would think
that'd be worse.  No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




On 12/21/2015 01:11 PM, Heikki Linnakangas wrote:
> On 21/12/15 13:53, Tomas Vondra wrote:
>> On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:
>>> On 17/12/15 19:07, Robert Haas wrote:
>>>> If it works well empirically, does it really matter that it's
>>>> arbitrary? I mean, the entire planner is full of fairly arbitrary
>>>> assumptions about which things to consider in the cost model and
>>>> which to ignore. The proof that we have made good decisions there
>>>> is in the query plans it generates. (The proof that we have made
>>>> bad decisions in some cases in the query plans, too.)
>>>
>>> Agreed.
>>
>> What if it only seems to work well because it was tested on cases it was
>> designed for? What about the workloads that behave differently?
>>
>> Whenever we do changes to costing and query planning, we carefully
>> consider counter-examples and cases where it might fail. I see nothing
>> like that in this thread - all I see is a bunch of pgbench tests, which
>> seems rather insufficient to me.
>
> Agreed on that too.
>
>> I'm ready to spend some time on this, assuming we can agree on what
>> tests to run. Can we come up with realistic workloads where we expect
>> the patch might actually work poorly?
>
> I think the worst case scenario would be the case where there is no
> FPW-related WAL burst at all, and checkpoints are always triggered by
> max_wal_size rather than checkpoint_timeout. In that scenario, the
> compensation formula will cause the checkpoint to be too lazy in the
> beginning, and it will have to catch up more aggressively towards the
> end of the checkpoint cycle.
>
> One such scenario might be to do only COPYs into a table with no
> indexes. Or hack pgbench to concentrate all the updates on only a very
> few rows. There will be FPWs on those few pages initially, but the
> spike will be much shorter. Or turn full_page_writes=off, hack the
> patch to do the compensation even when full_page_writes=off, and then just
> run pgbench.

OK, the COPY scenario seems interesting and also realistic because it 
probably applies to systems doing batch loads.

So that's one test to do; can we come up with some others?

We probably do want to do a bunch of pgbench tests, with various scales 
and also distributions - the gaussian/exponential distributions seem 
useful for simulating OLTP systems that usually have just a small active 
set (instead of touching all the data). This surely affects how much FPW 
we do and at what point - my expectation is that the non-uniform 
distributions will have a long tail of FPW.

So I was thinking about these combinations:

* modes: uniform, gaussian, exponential
* scales: 1000 (15GB), 10000 (150GB)
* clients: 1, 2, 4, 8, 16 (to see impact on scalability, if any)
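
(That is 3 x 2 x 5 = 30 combinations per machine.)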

Each combination needs to run for at least an hour or two, possibly with 
multiple runs. I'll also try running this both on an SSD-based system and a 
system with 10k drives, because those will probably behave differently.

Also, are we tracking the amount of FPW during the checkpoint, 
somewhere? That'd be useful, at least for this patch. Or do we need to 
just track the amount of WAL produced?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi,

On 12/23/2015 08:22 PM, Robert Haas wrote:
> On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> Another point (which Jan Wieck made me think of) is that the optimal
>>> behavior here likely depends on whether xlog and data are on the same
>>> disk controller. If they aren't, the FPW spike and background writes
>>> may not interact as much.
>>
>> I'm not sure what exactly you mean by "optimal behavior" here. Surely if you
>> want to minimize interference between WAL and regular I/O, you'll do that.
>>
>> But I don't see what that has to do with the writes generated by the
>> checkpoint? If we do much more writes at the beginning of the checkpoint
>> (due to getting confused by FPW), and OS starts flushing that to disk
>> because we exceed dirty_(background)_bytes, that surely interferes with
>> reads (which is a major issue for queries).
>
> Well, it's true that the checkpointer dirty page writes could
> interfere with reads, but if you've also got lots of FPW-bloated WAL
> records being written to the same disk at the same time, I would think
> that'd be worse.  No?

Yes, sure. My point was that in both cases the "optimal behavior" is not 
to get confused by the initially higher amount of WAL (due to FPW), and 
track the "real" un-skewed checkpoint progress.

Placing both data and WAL on the same device/controller makes the 
interference worse, especially when we have a lot of FPW at the 
beginning of the checkpoint.

I.e. there's only one "optimal" behavior for both cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



>> AFAICR with xlog-triggered checkpoints, the checkpointer progress is
>> measured with respect to the size of the WAL file, which does not grow
>> linearly in time for the reason you pointed above (a lot of FPW at the
>> beginning, less in the end). As the WAL file is growing quickly, the
>> checkpointer thinks that it is late and that it has some catchup to do, so
>> it will start to try writing quickly as well. There is a double whammy as
>> both are writing more, and are probably not succeeding.
>>
>> For time triggered checkpoints, the WAL file gets filled up *but* the
>> checkpointer load is balanced against time. This is a "simple" whammy, where
>> the checkpointer uses IO bandwidth which is needed for the WAL, and it could
>> wait a little bit because the WAL will need less later, but it is not trying
>> to catch up by even writing more, so the load shifting needed in this case
>> is not the same as the previous case.
>
> I see your point, but this isn't a function of what triggered the
> checkpoint.  It's a function of how we measure whether the
> already-triggered checkpoint is on schedule - we may be behind either
> because of time, or because of xlog, or both.

Yes. Indeed the current implementation does some kind of both time & xlog.

My reasoning was that for time triggered checkpoints (probably average to 
low load) the time is likely to be used for the checkpoint schedule, while 
for xlog-triggered checkpoints (probably higher load) it would be more 
likely to be the xlog, which is skewed.

Anyway careful thinking is needed to balance WAL and checkpointer IOs, 
only when needed, not a rough formula applied blindly.

-- 
Fabien.



On Wed, Dec 23, 2015 at 04:37:00PM +0100, Fabien COELHO wrote:
> Hmmm. Let us try with both hands:
> 
> AFAICR with xlog-triggered checkpoints, the checkpointer progress is
> measured with respect to the size of the WAL file, which does not
> grow linearly in time for the reason you pointed above (a lot of FPW
> at the beginning, less in the end). As the WAL file is growing
> quickly, the checkpointer thinks that it is late and that it has
> some catchup to do, so it will start to try writing quickly as well.
> There is a double whammy as both are writing more, and are probably
> not succeeding.
> 
> For time triggered checkpoints, the WAL file gets filled up *but*
> the checkpointer load is balanced against time. This is a "simple"
> whammy, where the checkpointer uses IO bandwidth which is needed for
> the WAL, and it could wait a little bit because the WAL will need
> less later, but it is not trying to catch up by even writing more,
> so the load shifting needed in this case is not the same as the
> previous case.
> 
> As you point out there is a WAL spike in both case, but in one case
> there is also a checkpointer spike and in the other the checkpointer
> load is flat.

Coming in late here, but I always thought the fact that the FPWs happen
mostly at the start of the checkpoint, and the checkpoint writes/fsyncs
happen mostly in the first half of the checkpoint period, was always
suboptimal, i.e. it would be nice if one of these was more active in the
second half of the checkpoint period.  I assume that is what is being
discussed here.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



> Coming in late here, but I always thought the fact that the FPWs happen
> mostly at the start of the checkpoint, and the checkpoint writes/fsyncs
> happen mostly in the first half of the checkpoint period, was always
> suboptimal, i.e. it would be nice if one of these was more active in the
> second half of the checkpoint period.  I assume that is what is being
> discussed here.

Yes, this is the subject of the thread.

On the one hand, whether it is the first half or first quarter or first 
tenth really depends on the actual load, so how much to rebalance depends 
on that dynamic information. At the beginning there should be a short 
spike for index pages which are quickly reused, and a longer spike for 
data pages depending on the pattern of access and the size of the table.

On the other hand the rebalancing also depends on the measure chosen to 
know about the overall progress, either WAL writing or time, and their 
behavior is not the same, so this should be taken into account.

My conclusion is that there is no simple static fix to this issue, as 
proposed in the submitted patch. The problem needs thinking and maths.

-- 
Fabien.