Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule? - Mailing list pgsql-hackers

From digoal zhou
Subject Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Date
Msg-id CAKHd5Ce-bnD=gEEdtXiT2_AY7shquTKd0yHXXk5F4zVEKRPX-w@mail.gmail.com
Whole thread Raw
Responses Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
   PostgreSQL (<= 9.4) tries to spread buffer writes smoothly across a checkpoint cycle, up to checkpoint_completion_target of the interval (checkpoint_timeout or checkpoint_segments). But with synchronous_commit = off there is a problem with the checkpoint_segments target: xlog is written very fast (mostly full-page writes, since the first modification of each page after a checkpoint logs the whole page), so the checkpointer can't sleep and the buffer writes are not smooth.
   Here is a test. First, a SystemTap script to observe the checkpointer's I/O:
# stap -DMAXSKIPPED=100000 -v 11111 -e '
global s_var, e_var, stat_var;

/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}

/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}

/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}

/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}

probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}

probe timer.s(1) {
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
  printf("----------------------------------end-----------------------------\n")
  delete s_var
  delete e_var
  delete stat_var
}'



Create the test table and data (the original omitted the key column's type; int is assumed here):
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();


Test it with pgbench:
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;


$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000
When an on-schedule checkpoint occurs, the tps drops:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524

The parameters for the on-schedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9

stap's output:
There are 156467 dirty blocks. Looking at the buffer writes per second, the writes are not spread smoothly across the time target; they follow the xlog target instead.
A smooth spread would be about 156467 / (5 * 60 * 0.9) ≈ 579.5 writes per second, but the first seconds show over 8000 writes per second.


checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216
......



I think we can add a condition to IsCheckpointOnSchedule, so the xlog-based check is skipped when synchronous_commit is off:
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }


# vi src/backend/postmaster/checkpointer.c

#include "access/xact.h"

/*
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
 *               in time?
 *
 * Compares the current progress against the time/segments elapsed since last
 * checkpoint, and returns true if the progress we've made this far is greater
 * than the elapsed time/segments.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
        XLogRecPtr      recptr;
        struct timeval now;
        double          elapsed_xlogs,
                                elapsed_time;

        Assert(ckpt_active);

        /* Scale progress according to checkpoint_completion_target. */
        progress *= CheckPointCompletionTarget;

        /*
         * Check against the cached value first. Only do the more expensive
         * calculations once we reach the target previously calculated. Since
         * neither time or WAL insert pointer moves backwards, a freshly
         * calculated value can only be greater than or equal to the cached value.
         */
        if (progress < ckpt_cached_elapsed)
                return false;

        /*
         * Check progress against WAL segments written and checkpoint_segments.
         *
         * We compare the current WAL insert location against the location
         * computed before calling CreateCheckPoint. The code in XLogInsert that
         * actually triggers a checkpoint when checkpoint_segments is exceeded
         * compares against RedoRecptr, so this is not completely accurate.
         * However, it's good enough for our purposes, we're only calculating an
         * estimate anyway.
         */
        if (!RecoveryInProgress())
        {
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }
        }

        /*
         * Check progress against time elapsed and checkpoint_timeout.
         */
        gettimeofday(&now, NULL);
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;

        if (progress < elapsed_time)
        {
                ckpt_cached_elapsed = elapsed_time;
                return false;
        }

        /* It looks like we're on schedule. */
        return true;
}

# gmake && gmake install

$ pg_ctl restart -m fast
Test again:
progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309

We can see that the checkpointer writes buffers smoothly (spread across the time period) this time.
checkpoint start
----------------------------------end-----------------------------
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545


But there is still a small problem: when PostgreSQL writes enough xlog to reach checkpoint_segments earlier than checkpoint_timeout, the next checkpoint will start soon after the current one, so checkpoint_segments must be tuned larger when checkpoints occur while the system is busy.
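For example (the values here are illustrative only, to be sized against the actual WAL volume per checkpoint_timeout), raising checkpoint_segments so that the time target fires first keeps the smoothing driven by the clock:

```
checkpoint_segments = 1024           # large enough that checkpoint_timeout triggers first
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
```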

Regards,
Digoal

--
Public welfare is a lifelong commitment. I'm Digoal, Just Do It.
