Thread: about fsync in CLOG buffer write
Hi Hackers,
We host many PG instances on each of our machines, and we want to minimize the calls to fsync within each instance, in order to minimize possible impact on other instances. We found there is a fsync call when CLOG buffer is written out in SlruPhysicalWritePage(). It is often called when a backend needs to check transaction status with SimpleLruReadPage().
The code is as follows (in src/backend/access/transam/slru.c line 837 in PG 9.4):
/*
* If not part of Flush, need to fsync now. We assume this happens
* infrequently enough that it's not a performance issue.
*/
if (!fdata)
{
if (ctl->do_fsync && pg_fsync(fd))
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}
ctl->do_fsync is true for CLOG. Question is, could we just disable fsync for CLOG buffer write out here? Is it safe to do so? I understand a checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, and the fsync call here (for buffer write out) is not necessary.
Thanks for your time!
Guangzhou
On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote: > We found there is a fsync call when CLOG buffer > is written out in SlruPhysicalWritePage(). It is often called when a backend > needs to check transaction status with SimpleLruReadPage(). That's when there's not enough buffers available some other, and your case dirty, needs to be written out. You could try increasing the max (32) in CLOGShmemBuffers() further. > ctl->do_fsync is true for CLOG. Question is, could we just disable fsync > for CLOG buffer write out here? Is it safe to do so? I understand a > checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, and > the fsync call here (for buffer write out) is not necessary. No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage() is that the flush during checkpoint won't necessarily touch those files at all (as they've been replaced with other buffers in memory). This could be optimized though - it should be possible to delay the fsync for slru files that have been evicted from memory till checkpoint. Using something like ForwardFsyncRequest() except that it obviously has to be usable for other files than normal relfilenodes. Greetings, Andres Freund
Andres, Thanks for the reply! I will try the ForwardFsyncRequest-like approach. 在 15-9-2 下午8:32, "Andres Freund" <andres@anarazel.de> 写入: >On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote: >> We found there is a fsync call when CLOG buffer >> is written out in SlruPhysicalWritePage(). It is often called when a >>backend >> needs to check transaction status with SimpleLruReadPage(). > >That's when there's not enough buffers available some other, and your >case dirty, needs to be written out. > >You could try increasing the max (32) in CLOGShmemBuffers() further. > >> ctl->do_fsync is true for CLOG. Question is, could we just disable >>fsync >> for CLOG buffer write out here? Is it safe to do so? I understand a >> checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, >>and >> the fsync call here (for buffer write out) is not necessary. > >No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage() >is that the flush during checkpoint won't necessarily touch those files >at all (as they've been replaced with other buffers in memory). > >This could be optimized though - it should be possible to delay the >fsync for slru files that have been evicted from memory till >checkpoint. Using something like ForwardFsyncRequest() except that it >obviously has to be usable for other files than normal relfilenodes. > >Greetings, > >Andres Freund
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().
>That's when there's not enough buffers available some other, and your
>case dirty, needs to be written out.
>You could try increasing the max (32) in CLOGShmemBuffers() further.
I think increasing CLOGShmemBuffers have no help for our case, because PG will call pg_fsync once for every CLOG PAGE(BLOCKSZ) when ExtendCLOG。
there is an trace:
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
pg_fsync called every 32K xacts by ExtendCLOG .
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
>> ctl->do_fsync is true for CLOG. Question is, could we just disable fsync
>> for CLOG buffer write out here? Is it safe to do so? I understand a
>> checkpoint will calls SimpleLruFlush to flush all CLOG buffer at once, and
>> the fsync call here (for buffer write out) is not necessary.
>No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage()
>is that the flush during checkpoint won't necessarily touch those files
>at all (as they've been replaced with other buffers in memory).
>This could be optimized though - it should be possible to delay the
>fsync for slru files that have been evicted from memory till
>checkpoint. Using something like ForwardFsyncRequest() except that it
>obviously has to be usable for other files than normal relfilenodes.
>Greetings,
>Andres Freund
Best regards,
digoal
On Sun, Sep 6, 2015 at 7:34 AM, 周正中(德歌) <dege.zzz@alibaba-inc.com> wrote:
With Regards,
Amit Kapila.
>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().
>That's when there's not enough buffers available some other, and your
>case dirty, needs to be written out.
>You could try increasing the max (32) in CLOGShmemBuffers() further.I think increasing CLOGShmemBuffers have no help for our case, because PG will call pg_fsync once for every CLOG PAGE(BLOCKSZ) when ExtendCLOG。
Ideally that should not happen every time, CLOG page is written only
when there is no non-dirty CLOG buffer to accommodate a new page
during ExtendCLOG. I think it will happen only if there is no flush of
CLOG pages by Checkpoint or by any other process trying to commit
the transaction during last 32 pages worth of transactions (which is
4 (TRANS_PER_BYTE) * 8192 (BLKSIZE) * 32 (CLOG_BUFFERS)).
What kind of workload you are using, can you once try by commenting
the sync call and see if there is any gain (I understand that is not right,
but at least we will come to know how much we can save by moving such
a sync to checkpoint).
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
HI,
this is the test case:
Use stap track pg_clog/* fsync call.
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
Use pgbench like this:
postgres@digoal-> cat 7.sql
select txid_current();
postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
About 32K tps.
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112
Trace the PID:
postgres@digoal-> ps -ewf|grep postgres
postgres 2921 1883 29 09:37 pts/1 00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres 2924 1841 66 09:37 ? 00:00:13 postgres: postgres postgres [local] SELECT
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
fsync interval about 1 sec, every 32KB xacts, per CLOG PAGE.
postgres=# select 1441503928836-1441503927731;
?column?
----------
1105
(1 row)
postgres=# select 1441503929819-1441503928836;
?column?
----------
983
(1 row)
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
Use pgbench like this:
postgres@digoal-> cat 7.sql
select txid_current();
postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
About 32K tps.
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112
Trace the PID:
postgres@digoal-> ps -ewf|grep postgres
postgres 2921 1883 29 09:37 pts/1 00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres 2924 1841 66 09:37 ? 00:00:13 postgres: postgres postgres [local] SELECT
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
fsync interval about 1 sec, every 32KB xacts, per CLOG PAGE.
postgres=# select 1441503928836-1441503927731;
?column?
----------
1105
(1 row)
postgres=# select 1441503929819-1441503928836;
?column?
----------
983
(1 row)
------------------------------------------------------------------发件人:Amit Kapila <amit.kapila16@gmail.com>发送时间:2015年9月8日(星期二) 15:37收件人:周正中(德歌) <dege.zzz@alibaba-inc.com>抄 送:Andres Freund <andres@anarazel.de>,张广舟(明虚) <guangzhou.zgz@alibaba-inc.com>,pgsql-hackers <pgsql-hackers@postgresql.org>,范孝剑(康贤) <funnyxj.fxj@alibaba-inc.com>,曾文旌(义从) <wenjing.zwj@alibaba-inc.com>,窦贤明(执白) <xianming.dxm@alibaba-inc.com>,萧少聪(铁庵) <shaocong.xsc@alibaba-inc.com>,陈新坚(惧留孙) <xinjian.chen@alibaba-inc.com>主 题:Re: [HACKERS] 答复:[HACKERS] about fsync in CLOG buffer writeOn Sun, Sep 6, 2015 at 7:34 AM, 周正中(德歌) <dege.zzz@alibaba-inc.com> wrote:>On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> We found there is a fsync call when CLOG buffer
>> is written out in SlruPhysicalWritePage(). It is often called when a backend
>> needs to check transaction status with SimpleLruReadPage().
>That's when there's not enough buffers available some other, and your
>case dirty, needs to be written out.
>You could try increasing the max (32) in CLOGShmemBuffers() further.I think increasing CLOGShmemBuffers have no help for our case, because PG will call pg_fsync once for every CLOG PAGE(BLOCKSZ) when ExtendCLOG。Ideally that should not happen every time, CLOG page is written onlywhen there is no non-dirty CLOG buffer to accommodate a new pageduring ExtendCLOG. I think it will happen only if there is no flush ofCLOG pages by Checkpoint or by any other process trying to committhe transaction during last 32 pages worth of transactions (which is4 (TRANS_PER_BYTE) * 8192 (BLKSIZE) * 32 (CLOG_BUFFERS)).What kind of workload you are using, can you once try by commentingthe sync call and see if there is any gain (I understand that is not right,but at least we will come to know how much we can save by moving sucha sync to checkpoint).
With Regards,
Amit Kapila.EnterpriseDB: http://www.enterprisedb.com
Hi, On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote: > postgres@digoal-> cat 7.sql > select txid_current(); > > postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000 > About 32K tps. > progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183 > progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127 So you're benchmarking how fast you can assign txids. Is that actually something meaningful? If you have other writes interleaved you'll write much more WAL, so there'll be checkpoints and such. FWIW, if you measure something realistic and there's checkpoints, there'll be fewer fsyncs if you increase the slru buffer size - as there'll often be clean buffers due to the checkpoint having written them out. Greetings, Andres Freund
On Tue, Sep 8, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote: >> postgres@digoal-> cat 7.sql >> select txid_current(); >> >> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000 >> About 32K tps. >> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183 >> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127 > > So you're benchmarking how fast you can assign txids. Is that actually > something meaningful? If you have other writes interleaved you'll write > much more WAL, so there'll be checkpoints and such. > > FWIW, if you measure something realistic and there's checkpoints, > there'll be fewer fsyncs if you increase the slru buffer size - as > there'll often be clean buffers due to the checkpoint having written > them out. But I think it's not very hard to come up with a workload where 32 clog buffers isn't enough, and you end up waiting for backends to fsync(). Suppose you have a pgbench workload with N tuples. We update tuples at random, so sometimes we hit one that's just recently been updated, and other times we hit one that hasn't been updated for many transactions. At 32k transactions per page, each transaction's chance of updating a tuple whose existing xmin is on the most recent clog page is 32k/N. The chance of hitting a tuple whose existing xmin is on the next page back is (1 - 32k/N) * 32k/N. The chance of hitting a page whose current xmin is at least X pages prior to the most recent one is (1 - 32k/N)^X. So, how many tuples do we need in order for each update to have a 50% chance of hitting a tuple that is at least 32 pages back? (1 - 32k/N)^32 = .5 1 - 32k/N = 0.9785720620877 32k/N = 0.0214279379122999 N = 32k/0.0214279379122999 N = 1529218.54329206 ...or in other words, scale factor 16. At scale factors >= 1044, the chance that the next update hits an xmin more than 32 clog buffers back is > 99%. So any test of this sort causes extremely rapid clog page eviction - basically every transaction is going to request a buffer that's probably not cached, and as soon as it's done with it, some other transaction will evict it to bring in a different buffer that's not cached. How often such a workload actually has to replace a *dirty* clog buffer obviously depends on how often you checkpoint, but if you're getting ~28k TPS you can completely fill 32 clog buffers (1 million transactions) in less than 40 seconds, and you're probably not checkpointing nearly that often. I'm not entirely sure how much fsync absorption for SLRU buffers will help, but I think it would be worth trying. Back when I was spending more time on this area, I saw some evidence that those fsyncs did cause at least some latency spikes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... How often such a workload actually has to replace a *dirty* clog > buffer obviously depends on how often you checkpoint, but if you're > getting ~28k TPS you can completely fill 32 clog buffers (1 million > transactions) in less than 40 seconds, and you're probably not > checkpointing nearly that often. But by the same token, at that kind of transaction rate, no clog page is actively getting dirtied for more than a couple of seconds. So while it might get swapped in and out of the SLRU arena pretty often after that, this scenario seems unconvincing as a source of repeated fsyncs. Like Andres, I'd want to see a more realistic problem case before expending a lot of work here. regards, tom lane
Re: [HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write
From
Robert Haas
Date:
On Wed, Sep 9, 2015 at 10:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... How often such a workload actually has to replace a *dirty* clog >> buffer obviously depends on how often you checkpoint, but if you're >> getting ~28k TPS you can completely fill 32 clog buffers (1 million >> transactions) in less than 40 seconds, and you're probably not >> checkpointing nearly that often. > > But by the same token, at that kind of transaction rate, no clog page is > actively getting dirtied for more than a couple of seconds. So while it > might get swapped in and out of the SLRU arena pretty often after that, > this scenario seems unconvincing as a source of repeated fsyncs. Well, if you're filling ~1 clog page per second, you're doing ~1 fsync per second too. Or if you are not, then you are thrashing the progressively smaller and smaller number of clean slots ever-harder until no clean pages remain and you're forced to fsync something - probably, a bunch of things all at once. > Like Andres, I'd want to see a more realistic problem case before > expending a lot of work here. I think the question here isn't whether the problem case is realistic - I am quite sure that a pgbench workload is - but rather how much of a problem it actually causes. I'm very sure that individual pgbench backends experience multi-second stallls as a result of this. What I'm not sure about is how frequently it happens, and how much of an effect it has on overall latency. I think it would be worth someone's time to try to write some good instrumentation code here and figure that out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: [HACKERS] Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer write
From
Andres Freund
Date:
On 2015-09-09 10:46:47 -0400, Robert Haas wrote: > Well, if you're filling ~1 clog page per second, you're doing ~1 fsync > per second too. Or if you are not, then you are thrashing the > progressively smaller and smaller number of clean slots ever-harder > until no clean pages remain and you're forced to fsync something - > probably, a bunch of things all at once. And you're also triggering a lot of other fsyncs. At the very least you're pretty constantly flushing the WAL, even with synchronous_commit = off. > > Like Andres, I'd want to see a more realistic problem case before > > expending a lot of work here. > > I think the question here isn't whether the problem case is realistic > - I am quite sure that a pgbench workload is - but rather how much of > a problem it actually causes. I'm very sure that individual pgbench > backends experience multi-second stallls as a result of this. The fsync causes that? That seems odd. I mean we release the slru control lock for writing out pages, don't we? To me it seems the fsyncs are a red herring here, and the problems are, uh, much bigger. ISTM that cache replacement locking granularity et al have a much bigger effect than a fsync every now and then. Greetings, Andres Freund
HI,
Sorry, it's my misunderstand.
I don't read seriously SlruSelectLRUPage before.
/* See if page already has a buffer assigned */
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
return slotno;
}
/*
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
* valid page, but we will never take the slot containing
* latest_page_number, even if it appears least recently used. We
* will select a slot that is already I/O busy only if there is no
* other choice: a read-busy slot will not be least recently used once
* the read finishes, and waiting for an I/O on a write-busy slot is
* inferior to just picking some other slot. Testing shows the slot
* we pick instead will often be clean, allowing us to begin a read at
* once.
*
* Normally the page_lru_count values will all be different and so
* there will be a well-defined LRU page. But since we allow
* concurrent execution of SlruRecentlyUsed() within
* SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
* acquire the same lru_count values. In that case we break ties by
* choosing the furthest-back page.
*
* Notice that this next line forcibly advances cur_lru_count to a
* value that is certainly beyond any value that will be in the
* page_lru_count array after the loop finishes. This ensures that
* the next execution of SlruRecentlyUsed will mark the page newly
* used, even if it's for a page that has the current counter value.
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
int this_delta;
int this_page_number;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
return slotno;
this_delta = cur_count - shared->page_lru_count[slotno];
if (this_delta < 0)
{
/*
* Clean up in case shared updates have caused cur_count
* increments to get "lost". We back off the page counts,
* rather than trying to increase cur_count, to avoid any
* question of infinite loops or failure in the presence of
* wrapped-around counts.
*/
shared->page_lru_count[slotno] = cur_count;
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
if (this_page_number == shared->latest_page_number)
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
if (this_delta > best_valid_delta ||
(this_delta == best_valid_delta &&
ctl->PagePrecedes(this_page_number,
best_valid_page_number)))
{
bestvalidslot = slotno;
best_valid_delta = this_delta;
best_valid_page_number = this_page_number;
}
}
else
{
if (this_delta > best_invalid_delta ||
(this_delta == best_invalid_delta &&
ctl->PagePrecedes(this_page_number,
best_invalid_page_number)))
{
bestinvalidslot = slotno;
best_invalid_delta = this_delta;
best_invalid_page_number = this_page_number;
}
}
}
/*
* If all pages (except possibly the latest one) are I/O busy, we'll
* have to wait for an I/O to complete and then retry. In that
* unhappy case, we choose to wait for the I/O on the least recently
* used slot, on the assumption that it was likely initiated first of
* all the I/Os in progress and may therefore finish first.
*/
if (best_valid_delta < 0)
{
SimpleLruWaitIO(ctl, bestinvalidslot);
continue;
}
/*
* If the selected page is clean, we're set.
*/
if (!shared->page_dirty[bestvalidslot])
return bestvalidslot;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
return slotno;
}
/*
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
* valid page, but we will never take the slot containing
* latest_page_number, even if it appears least recently used. We
* will select a slot that is already I/O busy only if there is no
* other choice: a read-busy slot will not be least recently used once
* the read finishes, and waiting for an I/O on a write-busy slot is
* inferior to just picking some other slot. Testing shows the slot
* we pick instead will often be clean, allowing us to begin a read at
* once.
*
* Normally the page_lru_count values will all be different and so
* there will be a well-defined LRU page. But since we allow
* concurrent execution of SlruRecentlyUsed() within
* SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
* acquire the same lru_count values. In that case we break ties by
* choosing the furthest-back page.
*
* Notice that this next line forcibly advances cur_lru_count to a
* value that is certainly beyond any value that will be in the
* page_lru_count array after the loop finishes. This ensures that
* the next execution of SlruRecentlyUsed will mark the page newly
* used, even if it's for a page that has the current counter value.
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
int this_delta;
int this_page_number;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
return slotno;
this_delta = cur_count - shared->page_lru_count[slotno];
if (this_delta < 0)
{
/*
* Clean up in case shared updates have caused cur_count
* increments to get "lost". We back off the page counts,
* rather than trying to increase cur_count, to avoid any
* question of infinite loops or failure in the presence of
* wrapped-around counts.
*/
shared->page_lru_count[slotno] = cur_count;
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
if (this_page_number == shared->latest_page_number)
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID)
{
if (this_delta > best_valid_delta ||
(this_delta == best_valid_delta &&
ctl->PagePrecedes(this_page_number,
best_valid_page_number)))
{
bestvalidslot = slotno;
best_valid_delta = this_delta;
best_valid_page_number = this_page_number;
}
}
else
{
if (this_delta > best_invalid_delta ||
(this_delta == best_invalid_delta &&
ctl->PagePrecedes(this_page_number,
best_invalid_page_number)))
{
bestinvalidslot = slotno;
best_invalid_delta = this_delta;
best_invalid_page_number = this_page_number;
}
}
}
/*
* If all pages (except possibly the latest one) are I/O busy, we'll
* have to wait for an I/O to complete and then retry. In that
* unhappy case, we choose to wait for the I/O on the least recently
* used slot, on the assumption that it was likely initiated first of
* all the I/Os in progress and may therefore finish first.
*/
if (best_valid_delta < 0)
{
SimpleLruWaitIO(ctl, bestinvalidslot);
continue;
}
/*
* If the selected page is clean, we're set.
*/
if (!shared->page_dirty[bestvalidslot])
return bestvalidslot;
------------------------------------------------------------------发件人:Andres Freund <andres@anarazel.de>发送时间:2015年9月9日(星期三) 02:28收件人:周正中(德歌) <dege.zzz@alibaba-inc.com>抄 送:Amit Kapila <amit.kapila16@gmail.com>,张广舟(明虚) <guangzhou.zgz@alibaba-inc.com>,pgsql-hackers <pgsql-hackers@postgresql.org>,范孝剑(康贤) <funnyxj.fxj@alibaba-inc.com>,曾文旌(义从) <wenjing.zwj@alibaba-inc.com>,窦贤明(执白) <xianming.dxm@alibaba-inc.com>,萧少聪(铁庵) <shaocong.xsc@alibaba-inc.com>,陈新坚(惧留孙) <xinjian.chen@alibaba-inc.com>主 题:Re: [HACKERS] 答复:[HACKERS] 答复:[HACKERS] about fsync in CLOG buffer writeHi,
On 2015-09-08 15:58:26 +0800, 周正中(德歌) wrote:
> postgres@digoal-> cat 7.sql
> select txid_current();
>
> postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
> About 32K tps.
> progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
> progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
So you're benchmarking how fast you can assign txids. Is that actually
something meaningful? If you have other writes interleaved you'll write
much more WAL, so there'll be checkpoints and such.
FWIW, if you measure something realistic and there's checkpoints,
there'll be fewer fsyncs if you increase the slru buffer size - as
there'll often be clean buffers due to the checkpoint having written
them out.
Greetings,
Andres Freund
On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
> We found there is a fsync call when CLOG buffer
> is written out in SlruPhysicalWritePage(). It is often called when a backend
> needs to check transaction status with SimpleLruReadPage().
That's when there's not enough buffers available some other, and your
case dirty, needs to be written out.
Why bother to find a place to store the page in shared memory at all? If we just want to read it, and it isn't already in shared memory, then why not just ask the kernel for the specific byte we need? The byte we want to read can't differ between shared memory and kernel, because it doesn't exist in shared memory.
Cheers,
Jeff
On September 12, 2015 5:18:28 PM PDT, Jeff Janes <jeff.janes@gmail.com> wrote: >On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de> >wrote: > >> On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote: >> > We found there is a fsync call when CLOG buffer >> > is written out in SlruPhysicalWritePage(). It is often called when >a >> backend >> > needs to check transaction status with SimpleLruReadPage(). >> >> That's when there's not enough buffers available some other, and your >> case dirty, needs to be written out. >> > >Why bother to find a place to store the page in shared memory at all? >If >we just want to read it, and it isn't already in shared memory, then >why >not just ask the kernel for the specific byte we need? The byte we >want to >read can't differ between shared memory and kernel, because it doesn't >exist in shared memory. I doubt that'd help - the next access would be more expensive, and we'd need to have a more complex locking regime. Thesepages aren't necessarily read only at that point. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Sat, Sep 12, 2015 at 5:21 PM, Andres Freund <andres@anarazel.de> wrote:
On September 12, 2015 5:18:28 PM PDT, Jeff Janes <jeff.janes@gmail.com> wrote:
>On Wed, Sep 2, 2015 at 5:32 AM, Andres Freund <andres@anarazel.de>
>wrote:
>
>> On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
>> > We found there is a fsync call when CLOG buffer
>> > is written out in SlruPhysicalWritePage(). It is often called when
>a
>> backend
>> > needs to check transaction status with SimpleLruReadPage().
>>
>> That's when there's not enough buffers available some other, and your
>> case dirty, needs to be written out.
>>
>
>Why bother to find a place to store the page in shared memory at all?
>If
>we just want to read it, and it isn't already in shared memory, then
>why
>not just ask the kernel for the specific byte we need? The byte we
>want to
>read can't differ between shared memory and kernel, because it doesn't
>exist in shared memory.
I doubt that'd help - the next access would be more expensive, and we'd need to have a more complex locking regime. These pages aren't necessarily read only at that point.
My (naive) expectation is that no additional locking is needed.
Once we decide to consult the clog, we already know the transaction is no longer in progress, so it can't be in-flight to change that clog entry we care about because it was required to have done that already.
Once we have verified (under existing locking) that the relevant page is already not in memory, we know it can't be dirty in memory. If someone pulls it into memory after we observe it to be not there, it doesn't matter to us as whatever transaction they are about to change can't be the one we care about.
Perhaps someone will want the same page later so that they can write to it and so will have to pull it in. But we have to play the odds, and the odds are that a page already dirty in memory is more likely to be needed to be written to in the near future, than another page which was not already dirty and is only needed with read intent.
If we are wrong, all that happens is someone later on has to do the same work that we would have had to do anyway, at no greater cost than we if did it now. If we are right, we avoid an fsync to make room for new page, and then later on avoid someone else having to shove out the page we brought in (or a different one) only to replace it with the same page we just wrote, fsynced, and shoved out.
Is there a chance that, if we read a byte from the kernel when someone is in the process of writing adjacent bytes (or writing the same byte, with changes only to bits in it which we don't care about), the kernel will deliver us something which is neither the old value nor the new value, but some monstrosity?
Cheers,
Jeff
On 2015-10-04 12:14:05 -0700, Jeff Janes wrote: > My (naive) expectation is that no additional locking is needed. > > Once we decide to consult the clog, we already know the transaction is no > longer in progress, so it can't be in-flight to change that clog entry we > care about because it was required to have done that already. Other xids on the same page can still be in progress and those concurrently might need to be written to. > Once we have verified (under existing locking) that the relevant page is > already not in memory, we know it can't be dirty in memory. If someone > pulls it into memory after we observe it to be not there, it doesn't matter > to us as whatever transaction they are about to change can't be the one we > care about. The read of the page from disk from a concurrent process might have been before our write, i.e. containing an unmodified page, but now future writes will overwrite the entry we wrote directly. I think there's a bunch of related issues. Such things will currently prevented by the IO locks in slru.c. > Is there a chance that, if we read a byte from the kernel when someone is > in the process of writing adjacent bytes (or writing the same byte, with > changes only to bits in it which we don't care about), the kernel will > deliver us something which is neither the old value nor the new value, but > some monstrosity? Depends on the granularity of the write/read and the OS IIRC. I don't think it's worth investing time and complexity to bypass SLRU in certain cases. We should rather rewrite the thing completely. Greetings, Andres Freund
On Sun, Oct 4, 2015 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote: > I don't think it's worth investing time and complexity to bypass SLRU in > certain cases. We should rather rewrite the thing completely. +1. That code is considerably past its sell-by date. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company