Thread: about fsync in CLOG buffer write

From
"张广舟(明虚)"
Date:

Hi Hackers,

We host many PG instances on each of our machines, and we want to minimize the number of fsync calls within each instance in order to minimize the possible impact on other instances. We found there is an fsync call when a CLOG buffer is written out in SlruPhysicalWritePage(). It is often triggered when a backend needs to check transaction status with SimpleLruReadPage().


The code is as follows (in src/backend/access/transam/slru.c line 837 in PG 9.4):
	/*
	 * If not part of Flush, need to fsync now.  We assume this happens
	 * infrequently enough that it's not a performance issue.
	 */
	if (!fdata)
	{
		if (ctl->do_fsync && pg_fsync(fd))
		{
			slru_errcause = SLRU_FSYNC_FAILED;
			slru_errno = errno;
			CloseTransientFile(fd);
			return false;
		}

		if (CloseTransientFile(fd))
		{
			slru_errcause = SLRU_CLOSE_FAILED;
			slru_errno = errno;
			return false;
		}
	}

ctl->do_fsync is true for CLOG.  The question is: could we just disable fsync for the CLOG buffer write-out here? Is it safe to do so? My understanding is that a checkpoint calls SimpleLruFlush to flush all CLOG buffers at once, so the fsync call here (for the buffer write-out) is not necessary.

Thanks for your time!
Guangzhou

Re: about fsync in CLOG buffer write

From
Andres Freund
Date:
On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
> We found there is a fsync call when CLOG buffer
> is written out in SlruPhysicalWritePage(). It is often called when a backend
> needs to check transaction status with SimpleLruReadPage().

That's when there are not enough buffers available and some other buffer
(in your case a dirty one) needs to be written out.

You could try increasing the max (32) in CLOGShmemBuffers() further.
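For reference, a sketch of that sizing logic, reconstructed from memory of 9.4's src/backend/access/transam/clog.c (verify against your source tree before relying on it):

```c
#include <assert.h>

/* Sketch of 9.4's CLOGShmemBuffers(): Min(32, Max(4, NBuffers / 512)).
 * The arena scales with shared_buffers but is clamped to 4..32 pages,
 * so raising shared_buffers alone never gets you past 32 CLOG pages;
 * the 32 itself is the cap Andres suggests raising. */
static int
clog_shmem_buffers(int nbuffers)
{
    int n = nbuffers / 512;     /* one CLOG page per 512 shared buffers */
    if (n < 4)
        n = 4;                  /* Max(4, ...) floor */
    if (n > 32)
        n = 32;                 /* Min(32, ...) hard cap */
    return n;
}
```

So even at shared_buffers = 128MB (NBuffers = 16384) the clamp is already hit; any further headroom requires raising the constant in the source.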

>  ctl->do_fsync is true for CLOG.  Question is, could we just disable fsync
> for CLOG buffer write out here? Is it safe to do so? I understand a
> checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, and
> the fsync call here (for buffer write out) is not necessary.

No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage()
is that the flush during checkpoint won't necessarily touch those files
at all (as they've been replaced with other buffers in memory).

This could be optimized though - it should be possible to delay the
fsync for slru files that have been evicted from memory till
checkpoint. Using something like ForwardFsyncRequest() except that it
obviously has to be usable for other files than normal relfilenodes.
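A minimal toy model of that deferral, assuming a queue the checkpointer drains; all names below are hypothetical, not PostgreSQL APIs:

```c
#include <assert.h>

/* Toy model of deferred SLRU fsyncs: when a dirty CLOG page is evicted,
 * record which segment file needs an fsync instead of calling pg_fsync()
 * immediately; the checkpointer later fsyncs each recorded segment once.
 * In PostgreSQL proper this would go through the checkpointer's request
 * queue (compare ForwardFsyncRequest()), generalized to files other than
 * normal relfilenodes. */
#define MAX_PENDING 64

static int pending_segs[MAX_PENDING];
static int npending = 0;

/* Remember a segment that must be fsynced before the next checkpoint
 * completes.  Deduplicates, so N evictions from the same segment cost
 * one fsync instead of N. */
static void
remember_dirty_segment(int segno)
{
    for (int i = 0; i < npending; i++)
        if (pending_segs[i] == segno)
            return;             /* already queued */
    if (npending == MAX_PENDING)
        return;                 /* queue full: a real implementation
                                 * would fsync synchronously instead */
    pending_segs[npending++] = segno;
}

/* Checkpoint-time drain: fsync each pending segment once.  Returns the
 * number of fsync calls that would be issued (stand-in for pg_fsync()). */
static int
flush_pending_segments(void)
{
    int fsyncs = npending;
    npending = 0;
    return fsyncs;
}
```

The value of the dedup shows up in the traces later in this thread: the same segment (pg_clog/0181) gets fsynced repeatedly within seconds, which a checkpoint-time drain would collapse into one fsync per segment per checkpoint.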

Greetings,

Andres Freund



Re: about fsync in CLOG buffer write

From
"张广舟(明虚)"
Date:
Andres,

Thanks for the reply!

I will try the ForwardFsyncRequest-like approach.

On 2015-9-2 8:32 PM, "Andres Freund" <andres@anarazel.de> wrote:

>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a
>>backend
>> needs to check transaction status with SimpleLruReadPage().
>
>That's when there are not enough buffers available and some other buffer
>(in your case a dirty one) needs to be written out.
>
>You could try increasing the max (32) in CLOGShmemBuffers() further.
>
>>  ctl->do_fsync is true for CLOG.  Question is, could we just disable
>>fsync
>> for CLOG buffer write out here? Is it safe to do so? I understand a
>> checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once,
>>and
>> the fsync call here (for buffer write out) is not necessary.
>
>No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage()
>is that the flush during checkpoint won't necessarily touch those files
>at all (as they've been replaced with other buffers in memory).
>
>This could be optimized though - it should be possible to delay the
>fsync for slru files that have been evicted from memory till
>checkpoint. Using something like ForwardFsyncRequest() except that it
>obviously has to be usable for other files than normal relfilenodes.
>
>Greetings,
>
>Andres Freund





Re: [HACKERS] about fsync in CLOG buffer write

From
"周正中(德歌)"
Date:

>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().

>That's when there are not enough buffers available and some other buffer
>(in your case a dirty one) needs to be written out.

>You could try increasing the max (32) in CLOGShmemBuffers() further.
I think increasing CLOGShmemBuffers will not help in our case, because PG calls pg_fsync once for every CLOG page (BLCKSZ) during ExtendCLOG.
Here is a trace:

global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call { 
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

pg_fsync is called about every 32K xacts by ExtendCLOG:
     0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
    31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
    53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
   102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
  1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
  1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

>>  ctl->do_fsync is true for CLOG.  Question is, could we just disable fsync
>> for CLOG buffer write out here? Is it safe to do so? I understand a
>> checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, and
>> the fsync call here (for buffer write out) is not necessary.

>No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage()
>is that the flush during checkpoint won't necessarily touch those files
>at all (as they've been replaced with other buffers in memory).

>This could be optimized though - it should be possible to delay the
>fsync for slru files that have been evicted from memory till
>checkpoint. Using something like ForwardFsyncRequest() except that it
>obviously has to be usable for other files than normal relfilenodes.

>Greetings,

>Andres Freund

Best regards,
digoal

Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

From
Amit Kapila
Date:
On Sun, Sep 6, 2015 at 7:34 AM, 周正中(德歌) <dege.zzz@alibaba-inc.com> wrote:

>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().

>That's when there are not enough buffers available and some other buffer
>(in your case a dirty one) needs to be written out.

>You could try increasing the max (32) in CLOGShmemBuffers() further.
I think increasing CLOGShmemBuffers will not help in our case, because PG calls pg_fsync once for every CLOG page (BLCKSZ) during ExtendCLOG.

Ideally that should not happen every time; a CLOG page is written out only
when there is no non-dirty CLOG buffer available to accommodate a new page
during ExtendCLOG.  I think it will happen only if there has been no flush of
CLOG pages by checkpoint, or by any other process trying to commit a
transaction, during the last 32 pages' worth of transactions (which is
4 (CLOG_XACTS_PER_BYTE) * 8192 (BLCKSZ) * 32 (CLOG buffers)).
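Amit's parenthetical works out to about a million transactions; spelled out as a quick sketch (constant names mirror the mail, slightly simplified):

```c
#include <assert.h>

/* Capacity of the in-memory CLOG arena in 9.4: 2 status bits per
 * transaction means 4 transactions per byte, pages are 8192 bytes,
 * and at most 32 pages are buffered. */
enum {
    CLOG_XACTS_PER_BYTE = 4,    /* 2 bits of status per transaction */
    BLCKSZ_BYTES        = 8192, /* CLOG page size */
    CLOG_MAX_BUFFERS    = 32    /* cap from CLOGShmemBuffers() */
};

/* Total transactions whose status fits in the buffered CLOG pages. */
static long
clog_arena_xacts(void)
{
    return (long) CLOG_XACTS_PER_BYTE * BLCKSZ_BYTES * CLOG_MAX_BUFFERS;
}
```

That is 1,048,576 transactions in total, and 32,768 transactions per individual CLOG page, which is where the "every 32K xacts" figure in this thread comes from.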

What kind of workload are you using?  Could you try once with the sync call
commented out and see if there is any gain?  (I understand that is not
correct, but at least we will learn how much we could save by moving such
a sync to checkpoint.)


With Regards,
Amit Kapila.

Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

From
"周正中(德歌)"
Date:

Hi,
  Here is the test case.
Use stap to trace fsync calls on pg_clog/*:
[root@digoal ~]# cat trc.stp 
global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call { 
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$return)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}

Use pgbench like this:
postgres@digoal-> cat 7.sql
select txid_current();

postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
About 32K tps.
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112

Trace the PID:
postgres@digoal-> ps -ewf|grep postgres
postgres  2921  1883 29 09:37 pts/1    00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres  2924  1841 66 09:37 ?        00:00:13 postgres: postgres postgres [local] SELECT

[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
     0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
    31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
    53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
   102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
  1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
  1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'

The fsync interval is about 1 sec, i.e. every 32K xacts, one per CLOG page:
postgres=# select 1441503928836-1441503927731;
 ?column? 
----------
     1105
(1 row)

postgres=# select 1441503929819-1441503928836;
 ?column? 
----------
      983
(1 row)
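For what it's worth, the ~1 second gap between the trace timestamps matches the page arithmetic: one CLOG page covers 8192 * 4 = 32768 transactions, so at ~32K tps ExtendCLOG crosses a page boundary (and fsyncs, in this workload) roughly once per second. A small sketch (the function name is mine, not from PostgreSQL):

```c
#include <assert.h>

/* Expected seconds between CLOG-extension fsyncs at a given commit
 * rate: in this txid_current() workload a dirty page is evicted and
 * fsynced once per CLOG page boundary, i.e. once every
 * 8192 * 4 = 32768 transactions. */
static double
expected_fsync_interval_secs(double tps)
{
    const double xacts_per_page = 8192.0 * 4.0; /* BLCKSZ * xacts/byte */
    return xacts_per_page / tps;
}
```

At the ~32K tps reported by pgbench this predicts an interval of just over one second, consistent with the 1105 ms and 983 ms gaps computed above.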


------------------------------------------------------------------
From: Amit Kapila <amit.kapila16@gmail.com>
Sent: Tuesday, September 8, 2015 15:37
To: 周正中(德歌) <dege.zzz@alibaba-inc.com>
Cc: Andres Freund <andres@anarazel.de>, 张广舟(明虚) <guangzhou.zgz@alibaba-inc.com>, pgsql-hackers <pgsql-hackers@postgresql.org>, 范孝剑(康贤) <funnyxj.fxj@alibaba-inc.com>, 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>, 窦贤明(执白) <xianming.dxm@alibaba-inc.com>, 萧少聪(铁庵) <shaocong.xsc@alibaba-inc.com>, 陈新坚(惧留孙) <xinjian.chen@alibaba-inc.com>
Subject: Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

On Sun, Sep 6, 2015 at 7:34 AM, 周正中(德歌) <dege.zzz@alibaba-inc.com> wrote:

>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().

>That's when there are not enough buffers available and some other buffer
>(in your case a dirty one) needs to be written out.

>You could try increasing the max (32) in CLOGShmemBuffers() further.
I think increasing CLOGShmemBuffers will not help in our case, because PG calls pg_fsync once for every CLOG page (BLCKSZ) during ExtendCLOG.

Ideally that should not happen every time, CLOG page is written only
when there is no non-dirty CLOG buffer to accommodate a new page
during ExtendCLOG.  I think it will happen only if there is no flush of
CLOG pages by Checkpoint or by any other process trying to commit
the transaction during last 32 pages worth of transactions (which is
4 (TRANS_PER_BYTE) * 8192 (BLKSIZE) * 32 (CLOG_BUFFERS)).

What kind of workload you are using, can you once try by commenting
the sync call and see if there is any gain (I understand that is not right,
but at least we will come to know how much we can save by moving such
a sync to checkpoint).


With Regards,
Amit Kapila.

Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

From
Andres Freund
Date:
Hi,

On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
> postgres@digoal-> cat 7.sql
> select txid_current();
>
> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
> About 32K tps.
> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127

So you're benchmarking how fast you can assign txids. Is that actually
something meaningful? If you have other writes interleaved you'll write
much more WAL, so there'll be checkpoints and such.

FWIW, if you measure something realistic and there's checkpoints,
there'll be fewer fsyncs if you increase the slru buffer size - as
there'll often be clean buffers due to the checkpoint having written
them out.

Greetings,

Andres Freund



On Tue, Sep 8, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
>> postgres@digoal-> cat 7.sql
>> select txid_current();
>>
>> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
>> About 32K tps.
>> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
>> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
>
> So you're benchmarking how fast you can assign txids. Is that actually
> something meaningful? If you have other writes interleaved you'll write
> much more WAL, so there'll be checkpoints and such.
>
> FWIW, if you measure something realistic and there's checkpoints,
> there'll be fewer fsyncs if you increase the slru buffer size - as
> there'll often be clean buffers due to the checkpoint having written
> them out.

But I think it's not very hard to come up with a workload where 32
clog buffers isn't enough, and you end up waiting for backends to
fsync().  Suppose you have a pgbench workload with N tuples.  We
update tuples at random, so sometimes we hit one that's just recently
been updated, and other times we hit one that hasn't been updated for
many transactions.  At 32k transactions per page, each transaction's
chance of updating a tuple whose existing xmin is on the most recent
clog page is 32k/N.  The chance of hitting a tuple whose existing xmin
is on the next page back is (1 - 32k/N) * 32k/N.  The chance of
hitting a page whose current xmin is at least X pages prior to the
most recent one is (1 - 32k/N)^X.  So, how many tuples do we need in
order for each update to have a 50% chance of hitting a tuple that is
at least 32 pages back?

(1 - 32k/N)^32 = .5
1 - 32k/N = 0.9785720620877
32k/N = 0.0214279379122999
N = 32k/0.0214279379122999
N = 1529218.54329206

...or in other words, scale factor 16.  At scale factors >= 1044, the
chance that the next update hits an xmin more than 32 clog buffers
back is > 99%.  So any test of this sort causes extremely rapid clog
page eviction - basically every transaction is going to request a
buffer that's probably not cached, and as soon as it's done with it,
some other transaction will evict it to bring in a different buffer
that's not cached.

How often such a workload actually has to replace a *dirty* clog
buffer obviously depends on how often you checkpoint, but if you're
getting ~28k TPS you can completely fill 32 clog buffers (1 million
transactions) in less than 40 seconds, and you're probably not
checkpointing nearly that often.
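Robert's derivation can be checked numerically; this is a quick verification sketch, not part of the original mail (a plain loop stands in for pow() to stay libm-free):

```c
#include <assert.h>

/* Probability that a randomly chosen pgbench tuple's xmin lies at
 * least `pages` CLOG pages behind the newest page, given N tuples
 * updated uniformly at random and 32768 xacts per CLOG page:
 * (1 - 32768/N)^pages. */
static double
prob_xmin_at_least_pages_back(double n_tuples, int pages)
{
    double miss_one_page = 1.0 - 32768.0 / n_tuples;
    double p = 1.0;

    for (int i = 0; i < pages; i++)
        p *= miss_one_page;     /* one factor per page of history */
    return p;
}
```

At N = 1,529,219 tuples the probability comes out at very nearly 0.5, matching the derivation, and at scale factor 1044 (~104.4M tuples) it is about 0.99. The fill-time figure also checks out: 1,048,576 transactions at ~28K TPS is roughly 37 seconds, under the 40-second bound cited above.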

I'm not entirely sure how much fsync absorption for SLRU buffers will
help, but I think it would be worth trying.  Back when I was spending
more time on this area, I saw some evidence that those fsyncs did
cause at least some latency spikes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> ... How often such a workload actually has to replace a *dirty* clog
> buffer obviously depends on how often you checkpoint, but if you're
> getting ~28k TPS you can completely fill 32 clog buffers (1 million
> transactions) in less than 40 seconds, and you're probably not
> checkpointing nearly that often.

But by the same token, at that kind of transaction rate, no clog page is
actively getting dirtied for more than a couple of seconds.  So while it
might get swapped in and out of the SLRU arena pretty often after that,
this scenario seems unconvincing as a source of repeated fsyncs.

Like Andres, I'd want to see a more realistic problem case before
expending a lot of work here.

            regards, tom lane



On Wed, Sep 9, 2015 at 10:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> ... How often such a workload actually has to replace a *dirty* clog
>> buffer obviously depends on how often you checkpoint, but if you're
>> getting ~28k TPS you can completely fill 32 clog buffers (1 million
>> transactions) in less than 40 seconds, and you're probably not
>> checkpointing nearly that often.
>
> But by the same token, at that kind of transaction rate, no clog page is
> actively getting dirtied for more than a couple of seconds.  So while it
> might get swapped in and out of the SLRU arena pretty often after that,
> this scenario seems unconvincing as a source of repeated fsyncs.

Well, if you're filling ~1 clog page per second, you're doing ~1 fsync
per second too.  Or if you are not, then you are thrashing the
progressively smaller and smaller number of clean slots ever-harder
until no clean pages remain and you're forced to fsync something -
probably, a bunch of things all at once.

> Like Andres, I'd want to see a more realistic problem case before
> expending a lot of work here.

I think the question here isn't whether the problem case is realistic
- I am quite sure that a pgbench workload is - but rather how much of
a problem it actually causes.  I'm very sure that individual pgbench
backends experience multi-second stalls as a result of this.  What
I'm not sure about is how frequently it happens, and how much of an
effect it has on overall latency.  I think it would be worth someone's
time to try to write some good instrumentation code here and figure
that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2015-09-09 10:46:47 -0400, Robert Haas wrote:
> Well, if you're filling ~1 clog page per second, you're doing ~1 fsync
> per second too.  Or if you are not, then you are thrashing the
> progressively smaller and smaller number of clean slots ever-harder
> until no clean pages remain and you're forced to fsync something -
> probably, a bunch of things all at once.

And you're also triggering a lot of other fsyncs. At the very least
you're pretty constantly flushing the WAL, even with synchronous_commit
= off.

> > Like Andres, I'd want to see a more realistic problem case before
> > expending a lot of work here.
>
> I think the question here isn't whether the problem case is realistic
> - I am quite sure that a pgbench workload is - but rather how much of
> a problem it actually causes.  I'm very sure that individual pgbench
> backends experience multi-second stallls as a result of this.

The fsync causes that? That seems odd. I mean we release the slru
control lock for writing out pages, don't we?

To me it seems the fsyncs are a red herring here, and the problems are,
uh, much bigger. ISTM that cache replacement locking granularity et al
have a much bigger effect than a fsync every now and then.

Greetings,

Andres Freund



Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

From
"周正中(德歌)"
Date:
Hi,
   Sorry, that was my misunderstanding.
   I had not read SlruSelectLRUPage carefully before.

  /* See if page already has a buffer assigned */
  for (slotno = 0; slotno < shared->num_slots; slotno++)
  {
   if (shared->page_number[slotno] == pageno &&
    shared->page_status[slotno] != SLRU_PAGE_EMPTY)
    return slotno;
  }

  /*
   * If we find any EMPTY slot, just select that one. Else choose a
   * victim page to replace.  We normally take the least recently used
   * valid page, but we will never take the slot containing
   * latest_page_number, even if it appears least recently used.  We
   * will select a slot that is already I/O busy only if there is no
   * other choice: a read-busy slot will not be least recently used once
   * the read finishes, and waiting for an I/O on a write-busy slot is
   * inferior to just picking some other slot.  Testing shows the slot
   * we pick instead will often be clean, allowing us to begin a read at
   * once.
   *
   * Normally the page_lru_count values will all be different and so
   * there will be a well-defined LRU page.  But since we allow
   * concurrent execution of SlruRecentlyUsed() within
   * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
   * acquire the same lru_count values.  In that case we break ties by
   * choosing the furthest-back page.
   *
   * Notice that this next line forcibly advances cur_lru_count to a
   * value that is certainly beyond any value that will be in the
   * page_lru_count array after the loop finishes.  This ensures that
   * the next execution of SlruRecentlyUsed will mark the page newly
   * used, even if it's for a page that has the current counter value.
   * That gets us back on the path to having good data when there are
   * multiple pages with the same lru_count.
   */
  cur_count = (shared->cur_lru_count)++;
  for (slotno = 0; slotno < shared->num_slots; slotno++)
  {
   int   this_delta;
   int   this_page_number;

   if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
    return slotno;
   this_delta = cur_count - shared->page_lru_count[slotno];
   if (this_delta < 0)
   {
    /*
     * Clean up in case shared updates have caused cur_count
     * increments to get "lost".  We back off the page counts,
     * rather than trying to increase cur_count, to avoid any
     * question of infinite loops or failure in the presence of
     * wrapped-around counts.
     */
    shared->page_lru_count[slotno] = cur_count;
    this_delta = 0;
   }
   this_page_number = shared->page_number[slotno];
   if (this_page_number == shared->latest_page_number)
    continue;
   if (shared->page_status[slotno] == SLRU_PAGE_VALID)
   {
    if (this_delta > best_valid_delta ||
     (this_delta == best_valid_delta &&
      ctl->PagePrecedes(this_page_number,
            best_valid_page_number)))
    {
     bestvalidslot = slotno;
     best_valid_delta = this_delta;
     best_valid_page_number = this_page_number;
    }
   }
   else
   {
    if (this_delta > best_invalid_delta ||
     (this_delta == best_invalid_delta &&
      ctl->PagePrecedes(this_page_number,
            best_invalid_page_number)))
    {
     bestinvalidslot = slotno;
     best_invalid_delta = this_delta;
     best_invalid_page_number = this_page_number;
    }
   }
  }

  /*
   * If all pages (except possibly the latest one) are I/O busy, we'll
   * have to wait for an I/O to complete and then retry.  In that
   * unhappy case, we choose to wait for the I/O on the least recently
   * used slot, on the assumption that it was likely initiated first of
   * all the I/Os in progress and may therefore finish first.
   */
  if (best_valid_delta < 0)
  {
   SimpleLruWaitIO(ctl, bestinvalidslot);
   continue;
  }

  /*
   * If the selected page is clean, we're set.
   */
  if (!shared->page_dirty[bestvalidslot])
   return bestvalidslot;

------------------------------------------------------------------
From: Andres Freund <andres@anarazel.de>
Sent: Wednesday, September 9, 2015 02:28
To: 周正中(德歌) <dege.zzz@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, 张广舟(明虚) <guangzhou.zgz@alibaba-inc.com>, pgsql-hackers <pgsql-hackers@postgresql.org>, 范孝剑(康贤) <funnyxj.fxj@alibaba-inc.com>, 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>, 窦贤明(执白) <xianming.dxm@alibaba-inc.com>, 萧少聪(铁庵) <shaocong.xsc@alibaba-inc.com>, 陈新坚(惧留孙) <xinjian.chen@alibaba-inc.com>
Subject: Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] about fsync in CLOG buffer write

Hi,

On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
> postgres@digoal-> cat 7.sql
> select txid_current();

> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
> About 32K tps.
> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127

So you're benchmarking how fast you can assign txids. Is that actually
something meaningful? If you have other writes interleaved you'll write
much more WAL, so there'll be checkpoints and such.

FWIW, if you measure something realistic and there's checkpoints,
there'll be fewer fsyncs if you increase the slru buffer size - as
there'll often be clean buffers due to the checkpoint having written
them out.

Greetings,

Andres Freund

Re: about fsync in CLOG buffer write

From
Jeff Janes
Date:
On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
> We found there is a fsync call when CLOG buffer
> is written out in SlruPhysicalWritePage(). It is often called when a backend
> needs to check transaction status with SimpleLruReadPage().

That's when there aren't enough buffers available and some other buffer, in
your case a dirty one, needs to be written out.

Why bother to find a place to store the page in shared memory at all?  If we just want to read it, and it isn't already in shared memory, then why not just ask the kernel for the specific byte we need?  The byte we want to read can't differ between shared memory and kernel, because it doesn't exist in shared memory.
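(For concreteness, a minimal sketch of what Jeff is suggesting: compute which byte of which pg_clog segment file holds the xid's 2-bit status, and pread() just that byte, never touching the SLRU buffers. The constants mirror clog.c/slru.c; the function names clog_locate and clog_read_status_direct are invented for illustration, and this deliberately ignores the locking questions that come up later in the thread:)

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ                 8192
#define CLOG_BITS_PER_XACT     2
#define CLOG_XACTS_PER_BYTE    4
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK      ((1 << CLOG_BITS_PER_XACT) - 1)
#define SLRU_PAGES_PER_SEGMENT 32

typedef uint32_t TransactionId;

/* Map an xid to its page, its byte within the page, and the bit shift of
 * its 2-bit status within that byte -- the same arithmetic as clog.c's
 * TransactionIdToPage / TransactionIdToByte / TransactionIdToBIndex. */
void
clog_locate(TransactionId xid, int *pageno, int *byteno, int *bshift)
{
    *pageno = xid / CLOG_XACTS_PER_PAGE;
    *byteno = (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
    *bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;
}

/* Hypothetical direct read: pread() the one byte we need from the segment
 * file under clogdir, bypassing the SLRU buffers entirely.  Returns the
 * 2-bit status, or -1 on error. */
int
clog_read_status_direct(const char *clogdir, TransactionId xid)
{
    int             pageno, byteno, bshift;
    char            path[128];
    unsigned char   byte;

    clog_locate(xid, &pageno, &byteno, &bshift);
    /* segment files are named with 4 hex digits, 32 pages per segment */
    snprintf(path, sizeof(path), "%s/%04X",
             clogdir, pageno / SLRU_PAGES_PER_SEGMENT);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t offset = (off_t) (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ + byteno;
    if (pread(fd, &byte, 1, offset) != 1)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return (byte >> bshift) & CLOG_XACT_BITMASK;
}
```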

Cheers,

Jeff

Re: about fsync in CLOG buffer write

From
Andres Freund
Date:
On September 12, 2015 5:18:28 PM PDT, Jeff Janes <jeff.janes@gmail.com> wrote:
>On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> > We found there is a fsync call when CLOG buffer
>> > is written out in SlruPhysicalWritePage(). It is often called when
>a
>> backend
>> > needs to check transaction status with SimpleLruReadPage().
>>
>> That's when there aren't enough buffers available and some other buffer,
>> in your case a dirty one, needs to be written out.
>>
>
>Why bother to find a place to store the page in shared memory at all? 
>If
>we just want to read it, and it isn't already in shared memory, then
>why
>not just ask the kernel for the specific byte we need?  The byte we
>want to
>read can't differ between shared memory and kernel, because it doesn't
>exist in shared memory.

I doubt that'd help - the next access would be more expensive, and we'd need to have a more complex locking regime.
These pages aren't necessarily read-only at that point.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: about fsync in CLOG buffer write

From
Jeff Janes
Date:
On Sat, Sep 12, 2015 at 5:21 PM, Andres Freund <andres@anarazel.de> wrote:
On September 12, 2015 5:18:28 PM PDT, Jeff Janes <jeff.janes@gmail.com> wrote:
>On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> > We found there is a fsync call when CLOG buffer
>> > is written out in SlruPhysicalWritePage(). It is often called when
>a
>> backend
>> > needs to check transaction status with SimpleLruReadPage().
>>
>> That's when there aren't enough buffers available and some other buffer,
>> in your case a dirty one, needs to be written out.
>>
>
>Why bother to find a place to store the page in shared memory at all?
>If
>we just want to read it, and it isn't already in shared memory, then
>why
>not just ask the kernel for the specific byte we need?  The byte we
>want to
>read can't differ between shared memory and kernel, because it doesn't
>exist in shared memory.

I doubt that'd help - the next access would be more expensive, and we'd need to have a more complex locking regime. These pages aren't necessarily read only at that point.

My (naive) expectation is that no additional locking is needed.  

Once we decide to consult the clog, we already know the transaction is no longer in progress, so it can't be in flight changing the clog entry we care about, because it was required to have done that already.

Once we have verified (under existing locking) that the relevant page is already not in memory, we know it can't be dirty in memory.  If someone pulls it into memory after we observe it to be not there, it doesn't matter to us as whatever transaction they are about to change can't be the one we care about.

Perhaps someone will want the same page later so that they can write to it and so will have to pull it in.  But we have to play the odds, and the odds are that a page already dirty in memory is more likely to be needed to be written to in the near future, than another page which was not already dirty and is only needed with read intent.

If we are wrong, all that happens is someone later on has to do the same work that we would have had to do anyway, at no greater cost than if we did it now.  If we are right, we avoid an fsync to make room for a new page, and then later on avoid someone else having to shove out the page we brought in (or a different one) only to replace it with the same page we just wrote, fsynced, and shoved out.

Is there a chance that, if we read a byte from the kernel when someone is in the process of writing adjacent bytes (or writing the same byte, with changes only to bits in it which we don't care about), the kernel will deliver us something which is neither the old value nor the new value, but some monstrosity?

Cheers,

Jeff

Re: about fsync in CLOG buffer write

From
Andres Freund
Date:
On 2015-10-04 12:14:05 -0700, Jeff Janes wrote:
> My (naive) expectation is that no additional locking is needed.
> 
> Once we decide to consult the clog, we already know the transaction is no
> longer in progress, so it can't be in-flight to change that clog entry we
> care about because it was required to have done that already.

Other xids on the same page can still be in progress, and those might
concurrently need to be written to.

> Once we have verified (under existing locking) that the relevant page is
> already not in memory, we know it can't be dirty in memory.  If someone
> pulls it into memory after we observe it to be not there, it doesn't matter
> to us as whatever transaction they are about to change can't be the one we
> care about.

A concurrent process's read of the page from disk might have happened
before our direct write, i.e. it would contain an unmodified copy, and
future writes of that buffer would then overwrite the entry we wrote
directly. I think there are a bunch of related issues.

Such things are currently prevented by the IO locks in slru.c.

> Is there a chance that, if we read a byte from the kernel when someone is
> in the process of writing adjacent bytes (or writing the same byte, with
> changes only to bits in it which we don't care about), the kernel will
> deliver us something which is neither the old value nor the new value, but
> some monstrosity?

Depends on the granularity of the write/read and the OS IIRC.


I don't think it's worth investing time and complexity to bypass SLRU in
certain cases. We should rather rewrite the thing completely.


Greetings,

Andres Freund



Re: about fsync in CLOG buffer write

From
Robert Haas
Date:
On Sun, Oct 4, 2015 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> I don't think it's worth investing time and complexity to bypass SLRU in
> certain cases. We should rather rewrite the thing completely.

+1.  That code is considerably past its sell-by date.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company