Thread: About to add WAL write/fsync statistics to pg_stat_wal view
Hi, I propose to add wal write/fsync statistics to pg_stat_wal view. It's useful not only for developing/improving source code related to WAL but also for users to detect workload changes, HW failure, and so on. I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view. I separate the parameter from "track_io_timing" to "track_wal_io_timing" because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity. ``` postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal; -[ RECORD 1 ]--+---- wal_write | 650 # Total number of times WAL data was written to the disk wal_write_time | 43 # Total amount of time that has been spent in the portion of WAL data was written to disk # if track-wal-io-timing is enabled, otherwise zero wal_sync | 78 # Total number of times WAL data was synced to the disk wal_sync_time | 104 # Total amount of time that has been spent in the portion of WAL data was synced to disk # if track-wal-io-timing is enabled, otherwise zero ``` What do you think? Please let me know your comments. Regards -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
Hi, > On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > Hi, > > I propose to add wal write/fsync statistics to pg_stat_wal view. > It's useful not only for developing/improving source code related to WAL > but also for users to detect workload changes, HW failure, and so on. > > I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view. > I separate the parameter from "track_io_timing" to "track_wal_io_timing" > because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity. > > ``` > postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal; > -[ RECORD 1 ]--+---- > wal_write | 650 # Total number of times WAL data was written to the disk > > wal_write_time | 43 # Total amount of time that has been spent in the portion of WAL data was written to disk > # if track-wal-io-timing is enabled, otherwise zero > > wal_sync | 78 # Total number of times WAL data was synced to the disk > > wal_sync_time | 104 # Total amount of time that has been spent in the portion of WAL data was synced to disk > # if track-wal-io-timing is enabled, otherwise zero > ``` > > What do you think? > Please let me know your comments. > > Regards > -- > Masahiro Ikeda > NTT DATA CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch> There is a no previous prototype warning for ‘fsyncMethodCalled’, and it now only used in xlog.c, should we declare with static? And this function wants a boolean as a return, should we use true/false other than 0/1? +/* + * Check if fsync mothod is called. + */ +bool +fsyncMethodCalled() +{ + if (!enableFsync) + return 0; + + switch (sync_method) + { + case SYNC_METHOD_FSYNC: + case SYNC_METHOD_FSYNC_WRITETHROUGH: + case SYNC_METHOD_FDATASYNC: + return 1; + default: + /* others don't have a specific fsync method */ + return 0; + } +} + -- Best regards ChengDu WenWu Information Technology Co.,Ltd. Japin Li
On 2020-12-08 16:45, Li Japin wrote: > Hi, > >> On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> >> wrote: >> >> Hi, >> >> I propose to add wal write/fsync statistics to pg_stat_wal view. >> It's useful not only for developing/improving source code related to >> WAL >> but also for users to detect workload changes, HW failure, and so on. >> >> I introduce "track_wal_io_timing" parameter and provide the following >> information on pg_stat_wal view. >> I separate the parameter from "track_io_timing" to >> "track_wal_io_timing" >> because IIUC, WAL I/O activity may have a greater impact on query >> performance than database I/O activity. >> >> ``` >> postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time >> FROM pg_stat_wal; >> -[ RECORD 1 ]--+---- >> wal_write | 650 # Total number of times WAL data was written to >> the disk >> >> wal_write_time | 43 # Total amount of time that has been spent in >> the portion of WAL data was written to disk >> # if track-wal-io-timing is enabled, otherwise >> zero >> >> wal_sync | 78 # Total number of times WAL data was synced to >> the disk >> >> wal_sync_time | 104 # Total amount of time that has been spent in >> the portion of WAL data was synced to disk >> # if track-wal-io-timing is enabled, otherwise >> zero >> ``` >> >> What do you think? >> Please let me know your comments. >> >> Regards >> -- >> Masahiro Ikeda >> NTT DATA >> CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch> > > There is a no previous prototype warning for ‘fsyncMethodCalled’, and > it now only used in xlog.c, > should we declare with static? And this function wants a boolean as a > return, should we use > true/false other than 0/1? > > +/* > + * Check if fsync mothod is called. > + */ > +bool > +fsyncMethodCalled() > +{ > + if (!enableFsync) > + return 0; > + > + switch (sync_method) > + { > + case SYNC_METHOD_FSYNC: > + case SYNC_METHOD_FSYNC_WRITETHROUGH: > + case SYNC_METHOD_FDATASYNC: > + return 1; > + default: > + /* others don't have a specific fsync method */ > + return 0; > + } > +} > + Thanks for your review. I agree with your comments. I fixed them. Regards -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
Hi, I rebased the patch to the master branch. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
RE: About to add WAL write/fsync statistics to pg_stat_wal view
From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san, This patch cannot be applied to the HEAD, but anyway I put a comment. ``` + /* + * Measure i/o timing to fsync WAL data. + * + * The wal receiver skip to collect it to avoid performance degradation of standy servers. + * If sync_method doesn't have its fsync method, to skip too. + */ + if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled()) + INSTR_TIME_SET_CURRENT(start); ``` I think m_wal_sync_time should be collected even if the process is WalRecevier. Because all wal_fsync should be recorded, and some performance issues have been aleady occurred if track_wal_io_timing is turned on. I think it's strange only to take care of the walrecevier case. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > Hi, > > I rebased the patch to the master branch. Thank you for working on this. I've read the latest patch. Here are comments: --- + if (track_wal_io_timing) + { + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(duration); + } * I think it should add the time in micro sec. After running pgbench with track_wal_io_timing = on for 30 sec, pg_stat_wal showed the following on my environment: postgres(1:61569)=# select * from pg_stat_wal; -[ RECORD 1 ]----+----------------------------- wal_records | 285947 wal_fpi | 53285 wal_bytes | 442008213 wal_buffers_full | 0 wal_write | 25516 wal_write_time | 0 wal_sync | 25437 wal_sync_time | 14490 stats_reset | 2021-01-22 10:56:13.29464+09 Since writes can complete less than a millisecond, wal_write_time didn't increase. I think sync_time could also have the same problem. --- + /* + * Measure i/o timing to fsync WAL data. + * + * The wal receiver skip to collect it to avoid performance degradation of standy servers. + * If sync_method doesn't have its fsync method, to skip too. + */ + if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled()) + INSTR_TIME_SET_CURRENT(start); * Why does only the wal receiver skip it even if track_wal_io_timinig is true? I think the performance degradation is also true for backend processes. If there is another reason for that, I think it's better to mention in both the doc and comment. * How about checking track_wal_io_timing first? * s/standy/standby/ --- + /* increment the i/o timing and the number of times to fsync WAL data */ + if (fsyncMethodCalled()) + { + if (!AmWalReceiverProcess() && track_wal_io_timing) + { + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration); + } + + WalStats.m_wal_sync++; + } * I'd avoid always calling fsyncMethodCalled() in this path. How about incrementing m_wal_sync after each sync operation? --- +/* + * Check if fsync mothod is called. + */ +static bool +fsyncMethodCalled() +{ + if (!enableFsync) + return false; + + switch (sync_method) + { + case SYNC_METHOD_FSYNC: + case SYNC_METHOD_FSYNC_WRITETHROUGH: + case SYNC_METHOD_FDATASYNC: + return true; + default: + /* others don't have a specific fsync method */ + return false; + } +} * I'm concerned that the function name could confuse the reader because it's called even before the fsync method is called. As I commented above, calling to fsyncMethodCalled() can be eliminated. That way, this function is called at only once. So do we really need this function? * As far as I read the code, issue_xlog_fsync() seems to do fsync even if enableFsync is false. Why does the function return false in that case? I might be missing something. * void is missing as argument? * s/mothod/method/ Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2021-01-22 11:54, kuroda.hayato@fujitsu.com wrote: > Dear Ikeda-san, > > This patch cannot be applied to the HEAD, but anyway I put a comment. > > ``` > + /* > + * Measure i/o timing to fsync WAL data. > + * > + * The wal receiver skip to collect it to avoid performance > degradation of standy servers. > + * If sync_method doesn't have its fsync method, to skip too. > + */ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > ``` > > I think m_wal_sync_time should be collected even if the process is > WalRecevier. > Because all wal_fsync should be recorded, and > some performance issues have been aleady occurred if > track_wal_io_timing is turned on. > I think it's strange only to take care of the walrecevier case. Kuroda-san, Thanks for your comments. Although I thought that the performance impact may be bigger in standby servers because WALReceiver didn't use wal buffers, it's no need to be considered. I agreed that if track_wal_io_timing is turned on, the primary server's performance degradation occurs too. I will make rebased and modified. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On 2021-01-22 14:50, Masahiko Sawada wrote: > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: >> >> Hi, >> >> I rebased the patch to the master branch. > > Thank you for working on this. I've read the latest patch. Here are > comments: > > --- > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > > * I think it should add the time in micro sec. > After running pgbench with track_wal_io_timing = on for 30 sec, > pg_stat_wal showed the following on my environment: > > postgres(1:61569)=# select * from pg_stat_wal; > -[ RECORD 1 ]----+----------------------------- > wal_records | 285947 > wal_fpi | 53285 > wal_bytes | 442008213 > wal_buffers_full | 0 > wal_write | 25516 > wal_write_time | 0 > wal_sync | 25437 > wal_sync_time | 14490 > stats_reset | 2021-01-22 10:56:13.29464+09 > > Since writes can complete less than a millisecond, wal_write_time > didn't increase. I think sync_time could also have the same problem. Thanks for your comments. I didn't notice that. I changed the unit from milliseconds to microseconds. > --- > + /* > + * Measure i/o timing to fsync WAL data. > + * > + * The wal receiver skip to collect it to avoid performance > degradation of standy servers. > + * If sync_method doesn't have its fsync method, to skip too. > + */ > + if (!AmWalReceiverProcess() && track_wal_io_timing && > fsyncMethodCalled()) > + INSTR_TIME_SET_CURRENT(start); > > * Why does only the wal receiver skip it even if track_wal_io_timinig > is true? I think the performance degradation is also true for backend > processes. If there is another reason for that, I think it's better to > mention in both the doc and comment. > * How about checking track_wal_io_timing first? > * s/standy/standby/ I fixed it. As kuroda-san mentioned too, the skip is no need to be considered. > --- > + /* increment the i/o timing and the number of times to fsync WAL > data */ > + if (fsyncMethodCalled()) > + { > + if (!AmWalReceiverProcess() && track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time += > INSTR_TIME_GET_MILLISEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > incrementing m_wal_sync after each sync operation? I think if syncing the disk does not occur, m_wal_sync should not be incremented. It depends enableFsync and sync_method. enableFsync is checked in each fsync method like pg_fsync_no_writethrough(), so if incrementing m_wal_sync after each sync operation, it should be implemented in each fsync method. It leads to many duplicated codes. So, why don't you change the function to a flag whether to sync data to the disk will be occurred or not in issue_xlog_fsync()? > --- > +/* > + * Check if fsync mothod is called. > + */ > +static bool > +fsyncMethodCalled() > +{ > + if (!enableFsync) > + return false; > + > + switch (sync_method) > + { > + case SYNC_METHOD_FSYNC: > + case SYNC_METHOD_FSYNC_WRITETHROUGH: > + case SYNC_METHOD_FDATASYNC: > + return true; > + default: > + /* others don't have a specific fsync method */ > + return false; > + } > +} > > * I'm concerned that the function name could confuse the reader > because it's called even before the fsync method is called. As I > commented above, calling to fsyncMethodCalled() can be eliminated. > That way, this function is called at only once. So do we really need > this function? Thanks to your comments, I removed them. > * As far as I read the code, issue_xlog_fsync() seems to do fsync even > if enableFsync is false. Why does the function return false in that > case? I might be missing something. IIUC, the reason is that I thought that each fsync functions like pg_fsync_no_writethrough() check enableFsync. If this code doesn't check, m_wal_sync_time may be incremented even though some sync methods like SYNC_METHOD_OPEN don't call to sync some data to the disk at the time. > * void is missing as argument? > > * s/mothod/method/ I removed them. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
Hi, Masahiro Thanks for you update the v4 patch. Here are some comments: (1) + char *msg = NULL; + bool sync_called; /* whether to sync data to the disk. */ + instr_time start; + instr_time duration; + + /* check whether to sync data to the disk is really occurred. */ + sync_called = false; Maybe we can initialize the "sync_called" variable when declare it. (2) + if (sync_called) + { + /* increment the i/o timing and the number of times to fsync WAL data */ + if (track_wal_io_timing) + { + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); + } + + WalStats.m_wal_sync++; + } There is an extra space before INSTR_TIME_GET_MICROSEC(duration). In the issue_xlog_fsync(), the comment says that if sync_method is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced. Does that mean it synced when write the WAL data? And for those cases, we cannot get accurate write/sync timing and number of write/sync times, right? case SYNC_METHOD_OPEN: case SYNC_METHOD_OPEN_DSYNC: /* write synced it already */ break; On Fri, 22 Jan 2021 at 21:05, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > On 2021-01-22 14:50, Masahiko Sawada wrote: >> On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >> <ikedamsh@oss.nttdata.com> wrote: >>> >>> Hi, >>> >>> I rebased the patch to the master branch. >> >> Thank you for working on this. I've read the latest patch. Here are >> comments: >> >> --- >> + if (track_wal_io_timing) >> + { >> + INSTR_TIME_SET_CURRENT(duration); >> + INSTR_TIME_SUBTRACT(duration, start); >> + WalStats.m_wal_write_time += >> INSTR_TIME_GET_MILLISEC(duration); >> + } >> >> * I think it should add the time in micro sec. >> After running pgbench with track_wal_io_timing = on for 30 sec, >> pg_stat_wal showed the following on my environment: >> >> postgres(1:61569)=# select * from pg_stat_wal; >> -[ RECORD 1 ]----+----------------------------- >> wal_records | 285947 >> wal_fpi | 53285 >> wal_bytes | 442008213 >> wal_buffers_full | 0 >> wal_write | 25516 >> wal_write_time | 0 >> wal_sync | 25437 >> wal_sync_time | 14490 >> stats_reset | 2021-01-22 10:56:13.29464+09 >> >> Since writes can complete less than a millisecond, wal_write_time >> didn't increase. I think sync_time could also have the same problem. > > Thanks for your comments. I didn't notice that. > I changed the unit from milliseconds to microseconds. > >> --- >> + /* >> + * Measure i/o timing to fsync WAL data. >> + * >> + * The wal receiver skip to collect it to avoid performance >> degradation of standy servers. >> + * If sync_method doesn't have its fsync method, to skip too. >> + */ >> + if (!AmWalReceiverProcess() && track_wal_io_timing && >> fsyncMethodCalled()) >> + INSTR_TIME_SET_CURRENT(start); >> >> * Why does only the wal receiver skip it even if track_wal_io_timinig >> is true? I think the performance degradation is also true for backend >> processes. If there is another reason for that, I think it's better to >> mention in both the doc and comment. >> * How about checking track_wal_io_timing first? >> * s/standy/standby/ > > I fixed it. > As kuroda-san mentioned too, the skip is no need to be considered. > >> --- >> + /* increment the i/o timing and the number of times to fsync WAL >> data */ >> + if (fsyncMethodCalled()) >> + { >> + if (!AmWalReceiverProcess() && track_wal_io_timing) >> + { >> + INSTR_TIME_SET_CURRENT(duration); >> + INSTR_TIME_SUBTRACT(duration, start); >> + WalStats.m_wal_sync_time += >> INSTR_TIME_GET_MILLISEC(duration); >> + } >> + >> + WalStats.m_wal_sync++; >> + } >> >> * I'd avoid always calling fsyncMethodCalled() in this path. How about >> incrementing m_wal_sync after each sync operation? > > I think if syncing the disk does not occur, m_wal_sync should not be > incremented. > It depends enableFsync and sync_method. > > enableFsync is checked in each fsync method like > pg_fsync_no_writethrough(), > so if incrementing m_wal_sync after each sync operation, it should be > implemented > in each fsync method. It leads to many duplicated codes. > > So, why don't you change the function to a flag whether to > sync data to the disk will be occurred or not in issue_xlog_fsync()? > > >> --- >> +/* >> + * Check if fsync mothod is called. >> + */ >> +static bool >> +fsyncMethodCalled() >> +{ >> + if (!enableFsync) >> + return false; >> + >> + switch (sync_method) >> + { >> + case SYNC_METHOD_FSYNC: >> + case SYNC_METHOD_FSYNC_WRITETHROUGH: >> + case SYNC_METHOD_FDATASYNC: >> + return true; >> + default: >> + /* others don't have a specific fsync method */ >> + return false; >> + } >> +} >> >> * I'm concerned that the function name could confuse the reader >> because it's called even before the fsync method is called. As I >> commented above, calling to fsyncMethodCalled() can be eliminated. >> That way, this function is called at only once. So do we really need >> this function? > > Thanks to your comments, I removed them. > > >> * As far as I read the code, issue_xlog_fsync() seems to do fsync even >> if enableFsync is false. Why does the function return false in that >> case? I might be missing something. > > IIUC, the reason is that I thought that each fsync functions like > pg_fsync_no_writethrough() check enableFsync. > > If this code doesn't check, m_wal_sync_time may be incremented > even though some sync methods like SYNC_METHOD_OPEN don't call to sync > some data to the disk at the time. > >> * void is missing as argument? >> >> * s/mothod/method/ > > I removed them. > > > Regards, -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Hi, Japin Thanks for your comments. On 2021-01-23 01:46, japin wrote: > Hi, Masahiro > > Thanks for you update the v4 patch. Here are some comments: > > (1) > + char *msg = NULL; > + bool sync_called; /* whether to sync > data to the disk. */ > + instr_time start; > + instr_time duration; > + > + /* check whether to sync data to the disk is really occurred. > */ > + sync_called = false; > > Maybe we can initialize the "sync_called" variable when declare it. Yes, I fixed it. > (2) > + if (sync_called) > + { > + /* increment the i/o timing and the number of times to > fsync WAL data */ > + if (track_wal_io_timing) > + { > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time = > INSTR_TIME_GET_MICROSEC(duration); > + } > + > + WalStats.m_wal_sync++; > + } > > There is an extra space before INSTR_TIME_GET_MICROSEC(duration). Yes, I removed it. > In the issue_xlog_fsync(), the comment says that if sync_method is > SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced. > Does that mean it synced when write the WAL data? And for those cases, > we > cannot get accurate write/sync timing and number of write/sync times, > right? > > case SYNC_METHOD_OPEN: > case SYNC_METHOD_OPEN_DSYNC: > /* write synced it already */ > break; Yes, I add the following comments in the document. @@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i </para> <para> Total number of times WAL data was synced to disk + (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or + <literal>open_sync</literal>, this value is zero because WAL data is synced + when to write it). </para></entry> </row> @@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i <para> Total amount of time that has been spent in the portion of WAL data was synced to disk, in milliseconds - (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero) + (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero. + if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or + <literal>open_sync</literal>, this value is zero too because WAL data is synced + when to write it). </para></entry> </row> I attached a modified patch. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
RE: About to add WAL write/fsync statistics to pg_stat_wal view
From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san, Thank you for updating the patch. This can be applied to master, and can be used on my RHEL7. wal_write_time and wal_sync_time increase normally :-). ``` postgres=# select * from pg_stat_wal; -[ RECORD 1 ]----+------------------------------ wal_records | 121781 wal_fpi | 2287 wal_bytes | 36055146 wal_buffers_full | 799 wal_write | 12770 wal_write_time | 4.469 wal_sync | 11962 wal_sync_time | 132.352 stats_reset | 2021-01-25 00:51:40.674412+00 ``` I put a further comment: ``` @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i <structfield>wal_buffers_full</structfield> <type>bigint</type> </para> <para> - Number of times WAL data was written to disk because WAL buffers became full + Total number of times WAL data was written to disk because WAL buffers became full + </para></entry> + </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>wal_write</structfield> <type>bigint</type> + </para> + <para> + Total number of times WAL data was written to disk + </para></entry> + </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>wal_write_time</structfield> <type>double precision</type> + </para> + <para> + Total amount of time that has been spent in the portion of + WAL data was written to disk, in milliseconds + (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero). + </para></entry> + </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>wal_sync</structfield> <type>bigint</type> + </para> + <para> + Total number of times WAL data was synced to disk + (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or + <literal>open_sync</literal>, this value is zero because WAL data is synced + when to write it). + </para></entry> + </row> + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>wal_sync_time</structfield> <type>double precision</type> + </para> + <para> + Total amount of time that has been spent in the portion of + WAL data was synced to disk, in milliseconds + (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero. + if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or + <literal>open_sync</literal>, this value is zero too because WAL data is synced + when to write it). </para></entry> </row> ``` Maybe "Total amount of time" should be used, not "Total number of time." Other views use "amount." I have no comments anymore. Hayato Kuroda FUJITSU LIMITED
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > On 2021-01-22 14:50, Masahiko Sawada wrote: > > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > > <ikedamsh@oss.nttdata.com> wrote: > >> > >> Hi, > >> > >> I rebased the patch to the master branch. > > > > Thank you for working on this. I've read the latest patch. Here are > > comments: > > > > --- > > + if (track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_write_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > > > * I think it should add the time in micro sec. > > After running pgbench with track_wal_io_timing = on for 30 sec, > > pg_stat_wal showed the following on my environment: > > > > postgres(1:61569)=# select * from pg_stat_wal; > > -[ RECORD 1 ]----+----------------------------- > > wal_records | 285947 > > wal_fpi | 53285 > > wal_bytes | 442008213 > > wal_buffers_full | 0 > > wal_write | 25516 > > wal_write_time | 0 > > wal_sync | 25437 > > wal_sync_time | 14490 > > stats_reset | 2021-01-22 10:56:13.29464+09 > > > > Since writes can complete less than a millisecond, wal_write_time > > didn't increase. I think sync_time could also have the same problem. > > Thanks for your comments. I didn't notice that. > I changed the unit from milliseconds to microseconds. > > > --- > > + /* > > + * Measure i/o timing to fsync WAL data. > > + * > > + * The wal receiver skip to collect it to avoid performance > > degradation of standy servers. > > + * If sync_method doesn't have its fsync method, to skip too. > > + */ > > + if (!AmWalReceiverProcess() && track_wal_io_timing && > > fsyncMethodCalled()) > > + INSTR_TIME_SET_CURRENT(start); > > > > * Why does only the wal receiver skip it even if track_wal_io_timinig > > is true? I think the performance degradation is also true for backend > > processes. If there is another reason for that, I think it's better to > > mention in both the doc and comment. > > * How about checking track_wal_io_timing first? > > * s/standy/standby/ > > I fixed it. > As kuroda-san mentioned too, the skip is no need to be considered. I think you also removed the code to have the wal receiver report the stats. So with the latest patch, the wal receiver tracks those statistics but doesn't report. And maybe XLogWalRcvWrite() also needs to track I/O? > > > --- > > + /* increment the i/o timing and the number of times to fsync WAL > > data */ > > + if (fsyncMethodCalled()) > > + { > > + if (!AmWalReceiverProcess() && track_wal_io_timing) > > + { > > + INSTR_TIME_SET_CURRENT(duration); > > + INSTR_TIME_SUBTRACT(duration, start); > > + WalStats.m_wal_sync_time += > > INSTR_TIME_GET_MILLISEC(duration); > > + } > > + > > + WalStats.m_wal_sync++; > > + } > > > > * I'd avoid always calling fsyncMethodCalled() in this path. How about > > incrementing m_wal_sync after each sync operation? > > I think if syncing the disk does not occur, m_wal_sync should not be > incremented. > It depends enableFsync and sync_method. > > enableFsync is checked in each fsync method like > pg_fsync_no_writethrough(), > so if incrementing m_wal_sync after each sync operation, it should be > implemented > in each fsync method. It leads to many duplicated codes. Right. I missed that each fsync function checks enableFsync. > So, why don't you change the function to a flag whether to > sync data to the disk will be occurred or not in issue_xlog_fsync()? Looks better. Since we don't necessarily need to increment m_wal_sync after doing fsync we can write the code without an additional variable as follows: if (enableFsync) { switch (sync_method) { case SYNC_METHOD_FSYNC: #ifdef HAVE_FSYNC_WRITETHROUGH case SYNC_METHOD_FSYNC_WRITETHROUGH: #endif #ifdef HAVE_FDATASYNC case SYNC_METHOD_FDATASYNC: #endif WalStats.m_wal_sync++; if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); break; default: break; } } (do fsync and error handling here) /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } I think we can change the first switch-case to an if statement. > > > > * As far as I read the code, issue_xlog_fsync() seems to do fsync even > > if enableFsync is false. Why does the function return false in that > > case? I might be missing something. > > IIUC, the reason is that I thought that each fsync functions like > pg_fsync_no_writethrough() check enableFsync. > > If this code doesn't check, m_wal_sync_time may be incremented > even though some sync methods like SYNC_METHOD_OPEN don't call to sync > some data to the disk at the time. Right. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: >> >> On 2021-01-22 14:50, Masahiko Sawada wrote: >> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >> > <ikedamsh@oss.nttdata.com> wrote: >> >> >> >> Hi, >> >> >> >> I rebased the patch to the master branch. >> > >> > Thank you for working on this. I've read the latest patch. Here are >> > comments: >> > >> > --- >> > + if (track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_write_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > >> > * I think it should add the time in micro sec. >> > After running pgbench with track_wal_io_timing = on for 30 sec, >> > pg_stat_wal showed the following on my environment: >> > >> > postgres(1:61569)=# select * from pg_stat_wal; >> > -[ RECORD 1 ]----+----------------------------- >> > wal_records | 285947 >> > wal_fpi | 53285 >> > wal_bytes | 442008213 >> > wal_buffers_full | 0 >> > wal_write | 25516 >> > wal_write_time | 0 >> > wal_sync | 25437 >> > wal_sync_time | 14490 >> > stats_reset | 2021-01-22 10:56:13.29464+09 >> > >> > Since writes can complete less than a millisecond, wal_write_time >> > didn't increase. I think sync_time could also have the same problem. >> >> Thanks for your comments. I didn't notice that. >> I changed the unit from milliseconds to microseconds. >> >> > --- >> > + /* >> > + * Measure i/o timing to fsync WAL data. >> > + * >> > + * The wal receiver skip to collect it to avoid performance >> > degradation of standy servers. >> > + * If sync_method doesn't have its fsync method, to skip too. >> > + */ >> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >> > fsyncMethodCalled()) >> > + INSTR_TIME_SET_CURRENT(start); >> > >> > * Why does only the wal receiver skip it even if track_wal_io_timinig >> > is true? I think the performance degradation is also true for backend >> > processes. If there is another reason for that, I think it's better to >> > mention in both the doc and comment. >> > * How about checking track_wal_io_timing first? >> > * s/standy/standby/ >> >> I fixed it. >> As kuroda-san mentioned too, the skip is no need to be considered. > > I think you also removed the code to have the wal receiver report the > stats. So with the latest patch, the wal receiver tracks those > statistics but doesn't report. > > And maybe XLogWalRcvWrite() also needs to track I/O? > >> >> > --- >> > + /* increment the i/o timing and the number of times to fsync WAL >> > data */ >> > + if (fsyncMethodCalled()) >> > + { >> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_sync_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > + >> > + WalStats.m_wal_sync++; >> > + } >> > >> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >> > incrementing m_wal_sync after each sync operation? >> >> I think if syncing the disk does not occur, m_wal_sync should not be >> incremented. >> It depends enableFsync and sync_method. >> >> enableFsync is checked in each fsync method like >> pg_fsync_no_writethrough(), >> so if incrementing m_wal_sync after each sync operation, it should be >> implemented >> in each fsync method. It leads to many duplicated codes. > > Right. I missed that each fsync function checks enableFsync. > >> So, why don't you change the function to a flag whether to >> sync data to the disk will be occurred or not in issue_xlog_fsync()? > > Looks better. Since we don't necessarily need to increment m_wal_sync > after doing fsync we can write the code without an additional variable > as follows: > > if (enableFsync) > { > switch (sync_method) > { > case SYNC_METHOD_FSYNC: > #ifdef HAVE_FSYNC_WRITETHROUGH > case SYNC_METHOD_FSYNC_WRITETHROUGH: > #endif > #ifdef HAVE_FDATASYNC > case SYNC_METHOD_FDATASYNC: > #endif > WalStats.m_wal_sync++; > if (track_wal_io_timing) > INSTR_TIME_SET_CURRENT(start); > break; > default: > break; > } > } > > (do fsync and error handling here) > > /* increment the i/o timing and the number of times to fsync WAL data */ > if (track_wal_io_timing) > { > INSTR_TIME_SET_CURRENT(duration); > INSTR_TIME_SUBTRACT(duration, start); > WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > } > > I think we can change the first switch-case to an if statement. > +1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
On 2021-01-25 10:34, kuroda.hayato@fujitsu.com wrote: > Dear Ikeda-san, > > Thank you for updating the patch. This can be applied to master, and > can be used on my RHEL7. > wal_write_time and wal_sync_time increase normally :-). > > ``` > postgres=# select * from pg_stat_wal; > -[ RECORD 1 ]----+------------------------------ > wal_records | 121781 > wal_fpi | 2287 > wal_bytes | 36055146 > wal_buffers_full | 799 > wal_write | 12770 > wal_write_time | 4.469 > wal_sync | 11962 > wal_sync_time | 132.352 > stats_reset | 2021-01-25 00:51:40.674412+00 > ``` Thanks for checking. > I put a further comment: > > ``` > @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM > pg_stat_activity WHERE wait_event i > <structfield>wal_buffers_full</structfield> <type>bigint</type> > </para> > <para> > - Number of times WAL data was written to disk because WAL > buffers became full > + Total number of times WAL data was written to disk because WAL > buffers became full > + </para></entry> > + </row> > + > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>wal_write</structfield> <type>bigint</type> > + </para> > + <para> > + Total number of times WAL data was written to disk > + </para></entry> > + </row> > + > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>wal_write_time</structfield> <type>double > precision</type> > + </para> > + <para> > + Total amount of time that has been spent in the portion of > + WAL data was written to disk, in milliseconds > + (if <xref linkend="guc-track-wal-io-timing"/> is enabled, > otherwise zero). > + </para></entry> > + </row> > + > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>wal_sync</structfield> <type>bigint</type> > + </para> > + <para> > + Total number of times WAL data was synced to disk > + (if <xref linkend="guc-wal-sync-method"/> is > <literal>open_datasync</literal> or > + <literal>open_sync</literal>, this value is zero because WAL > data is synced > + when to write it). > + </para></entry> > + </row> > + > + <row> > + <entry role="catalog_table_entry"><para > role="column_definition"> > + <structfield>wal_sync_time</structfield> <type>double > precision</type> > + </para> > + <para> > + Total amount of time that has been spent in the portion of > + WAL data was synced to disk, in milliseconds > + (if <xref linkend="guc-track-wal-io-timing"/> is enabled, > otherwise zero. > + if <xref linkend="guc-wal-sync-method"/> is > <literal>open_datasync</literal> or > + <literal>open_sync</literal>, this value is zero too because > WAL data is synced > + when to write it). > </para></entry> > </row> > ``` > > Maybe "Total amount of time" should be used, not "Total number of > time." > Other views use "amount." Thanks. I checked columns' descriptions of other views. There are "Number of xxx", "Total number of xxx", "Total amount of time that xxx" and "Total time spent xxx". Since the "time" is used for showing spending time, not count, I'll change it to "Total number of WAL data written/synced to disk". Thought? Regards, -- Masahiro Ikeda NTT DATA CORPORATION
RE: About to add WAL write/fsync statistics to pg_stat_wal view
From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san, > I checked columns' descriptions of other views. > There are "Number of xxx", "Total number of xxx", "Total amount of time > that xxx" and "Total time spent xxx". Right. > Since the "time" is used for showing spending time, not count, > I'll change it to "Total number of WAL data written/synced to disk". > Thought? I misread your patch, sorry. I prefer your suggestion. Please fix like that way with others. Best Regards, Hayato Kuroda FUJITSU LIMITED
On 2021-01-25 10:36, Masahiko Sawada wrote: > On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: >> >> On 2021-01-22 14:50, Masahiko Sawada wrote: >> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >> > <ikedamsh@oss.nttdata.com> wrote: >> >> >> >> Hi, >> >> >> >> I rebased the patch to the master branch. >> > >> > Thank you for working on this. I've read the latest patch. Here are >> > comments: >> > >> > --- >> > + if (track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_write_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > >> > * I think it should add the time in micro sec. >> > After running pgbench with track_wal_io_timing = on for 30 sec, >> > pg_stat_wal showed the following on my environment: >> > >> > postgres(1:61569)=# select * from pg_stat_wal; >> > -[ RECORD 1 ]----+----------------------------- >> > wal_records | 285947 >> > wal_fpi | 53285 >> > wal_bytes | 442008213 >> > wal_buffers_full | 0 >> > wal_write | 25516 >> > wal_write_time | 0 >> > wal_sync | 25437 >> > wal_sync_time | 14490 >> > stats_reset | 2021-01-22 10:56:13.29464+09 >> > >> > Since writes can complete less than a millisecond, wal_write_time >> > didn't increase. I think sync_time could also have the same problem. >> >> Thanks for your comments. I didn't notice that. >> I changed the unit from milliseconds to microseconds. >> >> > --- >> > + /* >> > + * Measure i/o timing to fsync WAL data. >> > + * >> > + * The wal receiver skip to collect it to avoid performance >> > degradation of standy servers. >> > + * If sync_method doesn't have its fsync method, to skip too. >> > + */ >> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >> > fsyncMethodCalled()) >> > + INSTR_TIME_SET_CURRENT(start); >> > >> > * Why does only the wal receiver skip it even if track_wal_io_timinig >> > is true? I think the performance degradation is also true for backend >> > processes. If there is another reason for that, I think it's better to >> > mention in both the doc and comment. >> > * How about checking track_wal_io_timing first? >> > * s/standy/standby/ >> >> I fixed it. >> As kuroda-san mentioned too, the skip is no need to be considered. > > I think you also removed the code to have the wal receiver report the > stats. So with the latest patch, the wal receiver tracks those > statistics but doesn't report. > And maybe XLogWalRcvWrite() also needs to track I/O? Thanks, I forgot to add them. I'll fix it. >> >> > --- >> > + /* increment the i/o timing and the number of times to fsync WAL >> > data */ >> > + if (fsyncMethodCalled()) >> > + { >> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >> > + { >> > + INSTR_TIME_SET_CURRENT(duration); >> > + INSTR_TIME_SUBTRACT(duration, start); >> > + WalStats.m_wal_sync_time += >> > INSTR_TIME_GET_MILLISEC(duration); >> > + } >> > + >> > + WalStats.m_wal_sync++; >> > + } >> > >> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >> > incrementing m_wal_sync after each sync operation? >> >> I think if syncing the disk does not occur, m_wal_sync should not be >> incremented. >> It depends enableFsync and sync_method. >> >> enableFsync is checked in each fsync method like >> pg_fsync_no_writethrough(), >> so if incrementing m_wal_sync after each sync operation, it should be >> implemented >> in each fsync method. It leads to many duplicated codes. > > Right. I missed that each fsync function checks enableFsync. > >> So, why don't you change the function to a flag whether to >> sync data to the disk will be occurred or not in issue_xlog_fsync()? > > Looks better. Since we don't necessarily need to increment m_wal_sync > after doing fsync we can write the code without an additional variable > as follows: > > if (enableFsync) > { > switch (sync_method) > { > case SYNC_METHOD_FSYNC: > #ifdef HAVE_FSYNC_WRITETHROUGH > case SYNC_METHOD_FSYNC_WRITETHROUGH: > #endif > #ifdef HAVE_FDATASYNC > case SYNC_METHOD_FDATASYNC: > #endif > WalStats.m_wal_sync++; > if (track_wal_io_timing) > INSTR_TIME_SET_CURRENT(start); > break; > default: > break; > } > } > > (do fsync and error handling here) > > /* increment the i/o timing and the number of times to fsync WAL > data */ > if (track_wal_io_timing) > { > INSTR_TIME_SET_CURRENT(duration); > INSTR_TIME_SUBTRACT(duration, start); > WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > } IIUC, I think we can't handle the following case. When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and "track_wal_io_timing" is enabled, "start" doesn't be initialized. My understanding is something wrong, isn't it? > I think we can change the first switch-case to an if statement. Yes, I'll change it. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On 2021-01-25 11:47, japin wrote: > On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda >> <ikedamsh@oss.nttdata.com> wrote: >>> >>> On 2021-01-22 14:50, Masahiko Sawada wrote: >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >>> > <ikedamsh@oss.nttdata.com> wrote: >>> >> >>> >> Hi, >>> >> >>> >> I rebased the patch to the master branch. >>> > >>> > Thank you for working on this. I've read the latest patch. Here are >>> > comments: >>> > >>> > --- >>> > + if (track_wal_io_timing) >>> > + { >>> > + INSTR_TIME_SET_CURRENT(duration); >>> > + INSTR_TIME_SUBTRACT(duration, start); >>> > + WalStats.m_wal_write_time += >>> > INSTR_TIME_GET_MILLISEC(duration); >>> > + } >>> > >>> > * I think it should add the time in micro sec. >>> > After running pgbench with track_wal_io_timing = on for 30 sec, >>> > pg_stat_wal showed the following on my environment: >>> > >>> > postgres(1:61569)=# select * from pg_stat_wal; >>> > -[ RECORD 1 ]----+----------------------------- >>> > wal_records | 285947 >>> > wal_fpi | 53285 >>> > wal_bytes | 442008213 >>> > wal_buffers_full | 0 >>> > wal_write | 25516 >>> > wal_write_time | 0 >>> > wal_sync | 25437 >>> > wal_sync_time | 14490 >>> > stats_reset | 2021-01-22 10:56:13.29464+09 >>> > >>> > Since writes can complete less than a millisecond, wal_write_time >>> > didn't increase. I think sync_time could also have the same problem. >>> >>> Thanks for your comments. I didn't notice that. >>> I changed the unit from milliseconds to microseconds. >>> >>> > --- >>> > + /* >>> > + * Measure i/o timing to fsync WAL data. >>> > + * >>> > + * The wal receiver skip to collect it to avoid performance >>> > degradation of standy servers. >>> > + * If sync_method doesn't have its fsync method, to skip too. >>> > + */ >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >>> > fsyncMethodCalled()) >>> > + INSTR_TIME_SET_CURRENT(start); >>> > >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig >>> > is true? I think the performance degradation is also true for backend >>> > processes. If there is another reason for that, I think it's better to >>> > mention in both the doc and comment. >>> > * How about checking track_wal_io_timing first? >>> > * s/standy/standby/ >>> >>> I fixed it. >>> As kuroda-san mentioned too, the skip is no need to be considered. >> >> I think you also removed the code to have the wal receiver report the >> stats. So with the latest patch, the wal receiver tracks those >> statistics but doesn't report. >> >> And maybe XLogWalRcvWrite() also needs to track I/O? >> >>> >>> > --- >>> > + /* increment the i/o timing and the number of times to fsync WAL >>> > data */ >>> > + if (fsyncMethodCalled()) >>> > + { >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >>> > + { >>> > + INSTR_TIME_SET_CURRENT(duration); >>> > + INSTR_TIME_SUBTRACT(duration, start); >>> > + WalStats.m_wal_sync_time += >>> > INSTR_TIME_GET_MILLISEC(duration); >>> > + } >>> > + >>> > + WalStats.m_wal_sync++; >>> > + } >>> > >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >>> > incrementing m_wal_sync after each sync operation? >>> >>> I think if syncing the disk does not occur, m_wal_sync should not be >>> incremented. >>> It depends enableFsync and sync_method. >>> >>> enableFsync is checked in each fsync method like >>> pg_fsync_no_writethrough(), >>> so if incrementing m_wal_sync after each sync operation, it should be >>> implemented >>> in each fsync method. It leads to many duplicated codes. >> >> Right. I missed that each fsync function checks enableFsync. >> >>> So, why don't you change the function to a flag whether to >>> sync data to the disk will be occurred or not in issue_xlog_fsync()? >> >> Looks better. Since we don't necessarily need to increment m_wal_sync >> after doing fsync we can write the code without an additional variable >> as follows: >> >> if (enableFsync) >> { >> switch (sync_method) >> { >> case SYNC_METHOD_FSYNC: >> #ifdef HAVE_FSYNC_WRITETHROUGH >> case SYNC_METHOD_FSYNC_WRITETHROUGH: >> #endif >> #ifdef HAVE_FDATASYNC >> case SYNC_METHOD_FDATASYNC: >> #endif >> WalStats.m_wal_sync++; >> if (track_wal_io_timing) >> INSTR_TIME_SET_CURRENT(start); >> break; >> default: >> break; >> } >> } >> >> (do fsync and error handling here) >> >> /* increment the i/o timing and the number of times to fsync WAL >> data */ >> if (track_wal_io_timing) >> { >> INSTR_TIME_SET_CURRENT(duration); >> INSTR_TIME_SUBTRACT(duration, start); >> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); >> } >> >> I think we can change the first switch-case to an if statement. >> > > +1. We can also narrow the scope of "duration" into "if > (track_wal_io_timing)" branch. Thanks, I'll change it. -- Masahiro Ikeda NTT DATA CORPORATION
On 2021-01-25 13:15, Masahiro Ikeda wrote: > On 2021-01-25 10:36, Masahiko Sawada wrote: >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda >> <ikedamsh@oss.nttdata.com> wrote: >>> >>> On 2021-01-22 14:50, Masahiko Sawada wrote: >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda >>> > <ikedamsh@oss.nttdata.com> wrote: >>> >> >>> >> Hi, >>> >> >>> >> I rebased the patch to the master branch. >>> > >>> > Thank you for working on this. I've read the latest patch. Here are >>> > comments: >>> > >>> > --- >>> > + if (track_wal_io_timing) >>> > + { >>> > + INSTR_TIME_SET_CURRENT(duration); >>> > + INSTR_TIME_SUBTRACT(duration, start); >>> > + WalStats.m_wal_write_time += >>> > INSTR_TIME_GET_MILLISEC(duration); >>> > + } >>> > >>> > * I think it should add the time in micro sec. >>> > After running pgbench with track_wal_io_timing = on for 30 sec, >>> > pg_stat_wal showed the following on my environment: >>> > >>> > postgres(1:61569)=# select * from pg_stat_wal; >>> > -[ RECORD 1 ]----+----------------------------- >>> > wal_records | 285947 >>> > wal_fpi | 53285 >>> > wal_bytes | 442008213 >>> > wal_buffers_full | 0 >>> > wal_write | 25516 >>> > wal_write_time | 0 >>> > wal_sync | 25437 >>> > wal_sync_time | 14490 >>> > stats_reset | 2021-01-22 10:56:13.29464+09 >>> > >>> > Since writes can complete less than a millisecond, wal_write_time >>> > didn't increase. I think sync_time could also have the same problem. >>> >>> Thanks for your comments. I didn't notice that. >>> I changed the unit from milliseconds to microseconds. >>> >>> > --- >>> > + /* >>> > + * Measure i/o timing to fsync WAL data. >>> > + * >>> > + * The wal receiver skip to collect it to avoid performance >>> > degradation of standy servers. >>> > + * If sync_method doesn't have its fsync method, to skip too. >>> > + */ >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing && >>> > fsyncMethodCalled()) >>> > + INSTR_TIME_SET_CURRENT(start); >>> > >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig >>> > is true? I think the performance degradation is also true for backend >>> > processes. If there is another reason for that, I think it's better to >>> > mention in both the doc and comment. >>> > * How about checking track_wal_io_timing first? >>> > * s/standy/standby/ >>> >>> I fixed it. >>> As kuroda-san mentioned too, the skip is no need to be considered. >> >> I think you also removed the code to have the wal receiver report the >> stats. So with the latest patch, the wal receiver tracks those >> statistics but doesn't report. >> And maybe XLogWalRcvWrite() also needs to track I/O? > > Thanks, I forgot to add them. > I'll fix it. > > >>> >>> > --- >>> > + /* increment the i/o timing and the number of times to fsync WAL >>> > data */ >>> > + if (fsyncMethodCalled()) >>> > + { >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing) >>> > + { >>> > + INSTR_TIME_SET_CURRENT(duration); >>> > + INSTR_TIME_SUBTRACT(duration, start); >>> > + WalStats.m_wal_sync_time += >>> > INSTR_TIME_GET_MILLISEC(duration); >>> > + } >>> > + >>> > + WalStats.m_wal_sync++; >>> > + } >>> > >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about >>> > incrementing m_wal_sync after each sync operation? >>> >>> I think if syncing the disk does not occur, m_wal_sync should not be >>> incremented. >>> It depends enableFsync and sync_method. >>> >>> enableFsync is checked in each fsync method like >>> pg_fsync_no_writethrough(), >>> so if incrementing m_wal_sync after each sync operation, it should be >>> implemented >>> in each fsync method. It leads to many duplicated codes. >> >> Right. I missed that each fsync function checks enableFsync. >> >>> So, why don't you change the function to a flag whether to >>> sync data to the disk will be occurred or not in issue_xlog_fsync()? >> >> Looks better. Since we don't necessarily need to increment m_wal_sync >> after doing fsync we can write the code without an additional variable >> as follows: >> >> if (enableFsync) >> { >> switch (sync_method) >> { >> case SYNC_METHOD_FSYNC: >> #ifdef HAVE_FSYNC_WRITETHROUGH >> case SYNC_METHOD_FSYNC_WRITETHROUGH: >> #endif >> #ifdef HAVE_FDATASYNC >> case SYNC_METHOD_FDATASYNC: >> #endif >> WalStats.m_wal_sync++; >> if (track_wal_io_timing) >> INSTR_TIME_SET_CURRENT(start); >> break; >> default: >> break; >> } >> } >> >> (do fsync and error handling here) >> >> /* increment the i/o timing and the number of times to fsync WAL >> data */ >> if (track_wal_io_timing) >> { >> INSTR_TIME_SET_CURRENT(duration); >> INSTR_TIME_SUBTRACT(duration, start); >> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); >> } > > IIUC, I think we can't handle the following case. > > When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and > "track_wal_io_timing" is enabled, "start" doesn't be initialized. > > My understanding is something wrong, isn't it? I thought the following is better. ``` /* Measure i/o timing to sync WAL data.*/ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); (do fsync and error handling here) /* check whether to sync WAL data to the disk right now. */ if (enableFsync) { if ((sync_method == SYNC_METHOD_FSYNC) || (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) || (sync_method == SYNC_METHOD_FDATASYNC)) { /* increment the i/o timing and the number of times to fsync WAL data */ if (track_wal_io_timing) { instr_time duration; INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } WalStats.m_wal_sync++; } } ``` Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless of the "sync_method" and "enableFsync", we don't make additional variables. But it's ok because "track_wal_io_timing" leads already performance degradation. What do you think? Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > On 2021-01-25 13:15, Masahiro Ikeda wrote: > > On 2021-01-25 10:36, Masahiko Sawada wrote: > >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda > >> <ikedamsh@oss.nttdata.com> wrote: > >>> > >>> On 2021-01-22 14:50, Masahiko Sawada wrote: > >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda > >>> > <ikedamsh@oss.nttdata.com> wrote: > >>> >> > >>> >> Hi, > >>> >> > >>> >> I rebased the patch to the master branch. > >>> > > >>> > Thank you for working on this. I've read the latest patch. Here are > >>> > comments: > >>> > > >>> > --- > >>> > + if (track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_write_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > > >>> > * I think it should add the time in micro sec. > >>> > After running pgbench with track_wal_io_timing = on for 30 sec, > >>> > pg_stat_wal showed the following on my environment: > >>> > > >>> > postgres(1:61569)=# select * from pg_stat_wal; > >>> > -[ RECORD 1 ]----+----------------------------- > >>> > wal_records | 285947 > >>> > wal_fpi | 53285 > >>> > wal_bytes | 442008213 > >>> > wal_buffers_full | 0 > >>> > wal_write | 25516 > >>> > wal_write_time | 0 > >>> > wal_sync | 25437 > >>> > wal_sync_time | 14490 > >>> > stats_reset | 2021-01-22 10:56:13.29464+09 > >>> > > >>> > Since writes can complete less than a millisecond, wal_write_time > >>> > didn't increase. I think sync_time could also have the same problem. > >>> > >>> Thanks for your comments. I didn't notice that. > >>> I changed the unit from milliseconds to microseconds. > >>> > >>> > --- > >>> > + /* > >>> > + * Measure i/o timing to fsync WAL data. > >>> > + * > >>> > + * The wal receiver skip to collect it to avoid performance > >>> > degradation of standy servers. > >>> > + * If sync_method doesn't have its fsync method, to skip too. > >>> > + */ > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing && > >>> > fsyncMethodCalled()) > >>> > + INSTR_TIME_SET_CURRENT(start); > >>> > > >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig > >>> > is true? I think the performance degradation is also true for backend > >>> > processes. If there is another reason for that, I think it's better to > >>> > mention in both the doc and comment. > >>> > * How about checking track_wal_io_timing first? > >>> > * s/standy/standby/ > >>> > >>> I fixed it. > >>> As kuroda-san mentioned too, the skip is no need to be considered. > >> > >> I think you also removed the code to have the wal receiver report the > >> stats. So with the latest patch, the wal receiver tracks those > >> statistics but doesn't report. > >> And maybe XLogWalRcvWrite() also needs to track I/O? > > > > Thanks, I forgot to add them. > > I'll fix it. > > > > > >>> > >>> > --- > >>> > + /* increment the i/o timing and the number of times to fsync WAL > >>> > data */ > >>> > + if (fsyncMethodCalled()) > >>> > + { > >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing) > >>> > + { > >>> > + INSTR_TIME_SET_CURRENT(duration); > >>> > + INSTR_TIME_SUBTRACT(duration, start); > >>> > + WalStats.m_wal_sync_time += > >>> > INSTR_TIME_GET_MILLISEC(duration); > >>> > + } > >>> > + > >>> > + WalStats.m_wal_sync++; > >>> > + } > >>> > > >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about > >>> > incrementing m_wal_sync after each sync operation? > >>> > >>> I think if syncing the disk does not occur, m_wal_sync should not be > >>> incremented. > >>> It depends enableFsync and sync_method. > >>> > >>> enableFsync is checked in each fsync method like > >>> pg_fsync_no_writethrough(), > >>> so if incrementing m_wal_sync after each sync operation, it should be > >>> implemented > >>> in each fsync method. It leads to many duplicated codes. > >> > >> Right. I missed that each fsync function checks enableFsync. > >> > >>> So, why don't you change the function to a flag whether to > >>> sync data to the disk will be occurred or not in issue_xlog_fsync()? > >> > >> Looks better. Since we don't necessarily need to increment m_wal_sync > >> after doing fsync we can write the code without an additional variable > >> as follows: > >> > >> if (enableFsync) > >> { > >> switch (sync_method) > >> { > >> case SYNC_METHOD_FSYNC: > >> #ifdef HAVE_FSYNC_WRITETHROUGH > >> case SYNC_METHOD_FSYNC_WRITETHROUGH: > >> #endif > >> #ifdef HAVE_FDATASYNC > >> case SYNC_METHOD_FDATASYNC: > >> #endif > >> WalStats.m_wal_sync++; > >> if (track_wal_io_timing) > >> INSTR_TIME_SET_CURRENT(start); > >> break; > >> default: > >> break; > >> } > >> } > >> > >> (do fsync and error handling here) > >> > >> /* increment the i/o timing and the number of times to fsync WAL > >> data */ > >> if (track_wal_io_timing) > >> { > >> INSTR_TIME_SET_CURRENT(duration); > >> INSTR_TIME_SUBTRACT(duration, start); > >> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > >> } > > > > IIUC, I think we can't handle the following case. > > > > When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and > > "track_wal_io_timing" is enabled, "start" doesn't be initialized. > > > > My understanding is something wrong, isn't it? You're right. We might want to initialize 'start' with 0 in those two cases and check if INSTR_TIME_IS_ZERO() later when accumulating the I/O time. > > I thought the following is better. > > > ``` > /* Measure i/o timing to sync WAL data.*/ > if (track_wal_io_timing) > INSTR_TIME_SET_CURRENT(start); > > (do fsync and error handling here) > > /* check whether to sync WAL data to the disk right now. */ > if (enableFsync) > { > if ((sync_method == SYNC_METHOD_FSYNC) || > (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) || > (sync_method == SYNC_METHOD_FDATASYNC)) > { > /* increment the i/o timing and the number of times to fsync WAL data > */ > if (track_wal_io_timing) > { > instr_time duration; > > INSTR_TIME_SET_CURRENT(duration); > INSTR_TIME_SUBTRACT(duration, start); > WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > } > WalStats.m_wal_sync++; > } > } > ``` > > Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless > of the "sync_method" and "enableFsync", we don't make additional > variables. > But it's ok because "track_wal_io_timing" leads already performance > degradation. > > What do you think? That also fine with me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, thanks for the reviews. I updated the attached patch. The summary of the changes is following. 1. fix document I followed another view's comments. 2. refactor issue_xlog_fsync() I removed "sync_called" variables, narrowed the "duration" scope and change the switch statement to if statement. 3. make wal-receiver report WAL statistics I add the code to collect the statistics for a written operation in XLogWalRcvWrite() and to report stats in WalReceiverMain(). Since WalReceiverMain() can loop fast, to avoid loading stats collector, I add "force" argument to the pgstat_send_wal function. If "force" is false, it can skip reporting until at least 500 msec since it last reported. WalReceiverMain() almost calls pgstat_send_wal() with "force" as false. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > Hi, thanks for the reviews. > > I updated the attached patch. Thank you for updating the patch! > The summary of the changes is following. > > 1. fix document > > I followed another view's comments. > > > 2. refactor issue_xlog_fsync() > > I removed "sync_called" variables, narrowed the "duration" scope and > change the switch statement to if statement. Looking at the code again, I think if we check if an fsync was really called when calculating the I/O time, it's better to check that before starting the measurement. bool issue_fsync = false; if (enableFsync && (sync_method == SYNC_METHOD_FSYNC || sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || sync_method == SYNC_METHOD_FDATASYNC)) { if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); issue_fsync = true; } (snip) if (issue_fsync) { if (track_wal_io_timing) { instr_time duration; INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); } WalStats.m_wal_sync++; } So I prefer either the above which is a modified version of the original approach or my idea that doesn’t introduce a new local variable I proposed before. But I'm not going to insist on that. > > > 3. make wal-receiver report WAL statistics > > I add the code to collect the statistics for a written operation > in XLogWalRcvWrite() and to report stats in WalReceiverMain(). > > Since WalReceiverMain() can loop fast, to avoid loading stats collector, > I add "force" argument to the pgstat_send_wal function. If "force" is > false, it can skip reporting until at least 500 msec since it last > reported. WalReceiverMain() almost calls pgstat_send_wal() with "force" > as false. void -pgstat_send_wal(void) +pgstat_send_wal(bool force) { /* We assume this initializes to zeroes */ static const PgStat_MsgWal all_zeroes; + static TimestampTz last_report = 0; + TimestampTz now; WalUsage walusage; + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL + * msec since we last sent one or specified "force". + */ + now = GetCurrentTimestamp(); + if (!force && + !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL)) + return; + + last_report = now; Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this purpose since it is used as a minimum time for stats file updates. If we want an interval, I think we should define another one Also, with the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time even if track_wal_io_timing is off, which is not good. On the other hand, I agree that your concern that the wal receiver should not send the stats for whenever receiving wal records. So an idea could be to send the wal stats when finishing the current WAL segment file and when timeout in the main loop. That way we can guarantee that the wal stats on a replica is updated at least every time finishing a WAL segment file when actively receiving WAL records and every NAPTIME_PER_CYCLE in other cases. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2021-01-26 00:03, Masahiko Sawada wrote: > On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: >> >> Hi, thanks for the reviews. >> >> I updated the attached patch. > > Thank you for updating the patch! > >> The summary of the changes is following. >> >> 1. fix document >> >> I followed another view's comments. >> >> >> 2. refactor issue_xlog_fsync() >> >> I removed "sync_called" variables, narrowed the "duration" scope and >> change the switch statement to if statement. > > Looking at the code again, I think if we check if an fsync was really > called when calculating the I/O time, it's better to check that before > starting the measurement. > > bool issue_fsync = false; > > if (enableFsync && > (sync_method == SYNC_METHOD_FSYNC || > sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || > sync_method == SYNC_METHOD_FDATASYNC)) > { > if (track_wal_io_timing) > INSTR_TIME_SET_CURRENT(start); > issue_fsync = true; > } > (snip) > if (issue_fsync) > { > if (track_wal_io_timing) > { > instr_time duration; > > INSTR_TIME_SET_CURRENT(duration); > INSTR_TIME_SUBTRACT(duration, start); > WalStats.m_wal_sync_time = > INSTR_TIME_GET_MICROSEC(duration); > } > WalStats.m_wal_sync++; > } > > So I prefer either the above which is a modified version of the > original approach or my idea that doesn’t introduce a new local > variable I proposed before. But I'm not going to insist on that. Thanks for the comments. I change the code to the above. >> >> >> 3. make wal-receiver report WAL statistics >> >> I add the code to collect the statistics for a written operation >> in XLogWalRcvWrite() and to report stats in WalReceiverMain(). >> >> Since WalReceiverMain() can loop fast, to avoid loading stats >> collector, >> I add "force" argument to the pgstat_send_wal function. If "force" is >> false, it can skip reporting until at least 500 msec since it last >> reported. WalReceiverMain() almost calls pgstat_send_wal() with >> "force" >> as false. > > void > -pgstat_send_wal(void) > +pgstat_send_wal(bool force) > { > /* We assume this initializes to zeroes */ > static const PgStat_MsgWal all_zeroes; > + static TimestampTz last_report = 0; > > + TimestampTz now; > WalUsage walusage; > > + /* > + * Don't send a message unless it's been at least > PGSTAT_STAT_INTERVAL > + * msec since we last sent one or specified "force". > + */ > + now = GetCurrentTimestamp(); > + if (!force && > + !TimestampDifferenceExceeds(last_report, now, > PGSTAT_STAT_INTERVAL)) > + return; > + > + last_report = now; > > Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this > purpose since it is used as a minimum time for stats file updates. If > we want an interval, I think we should define another one Also, with > the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time > even if track_wal_io_timing is off, which is not good. On the other > hand, I agree that your concern that the wal receiver should not send > the stats for whenever receiving wal records. So an idea could be to > send the wal stats when finishing the current WAL segment file and > when timeout in the main loop. That way we can guarantee that the wal > stats on a replica is updated at least every time finishing a WAL > segment file when actively receiving WAL records and every > NAPTIME_PER_CYCLE in other cases. I agree with your comments. I think it should report when reaching the end of WAL too. I add the code to report the stats when finishing the current WAL segment file when timeout in the main loop and when reaching the end of WAL. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> Hi, thanks for the reviews.
>
> I updated the attached patch.
Thank you for updating the patch!
Your original email with "total number of times" is more correct, removing the "of times" and just writing "total number of WAL" is not good wording.
Specifically, this change is strictly worse than the original.
- Number of times WAL data was written to disk because WAL buffers became full
+ Total number of WAL data written to disk because WAL buffers became full
+ Total number of WAL data written to disk because WAL buffers became full
Both have the flaw that they leave implied exactly what it means to "write WAL to disk". It is also unclear whether a counter, bytes, or both, would be more useful here. I've incorporated this into my documentation suggestions below:
(wal_buffers_full)
-- Revert - the original was better, though maybe add more detail similar to the below. I didn't research exactly how this works.
(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite
-- Seems like this should have a bytes version too
(wal_write_time)
The amount of time spent writing WAL buffers to disk, excluding sync time unless the wal_sync_method is either open_datasync or open_sync.
Units are in milliseconds with microsecond resolution. This is zero when track_wal_io_timing is disabled.
(wal_sync)
The number of times WAL files were synced to disk while wal_sync_method was set to one of the "sync at commit" options (i.e., fdatasync, fsync, or fsync_writethrough).
-- it is not going to be zero just because those settings are presently disabled as they could have been enabled at some point since the last time these statistics were reset.
(wal_sync_time)
The amount of time spent syncing WAL files to disk, in milliseconds with microsecond resolution. This requires setting wal_sync_method to one of the "sync at commit" options (i.e., fdatasync, fsync, or fsync_writethrough).
Also,
I would suggest extracting the changes to postmaster/pgstat.c and replication/walreceiver.c to a separate patch as you've fundamentally changed how it behaves with regards to that function and how it interacts with the WAL receiver. That seems an entirely separate topic warranting its own patch and discussion.
David J.
On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.
The following is not an improvement:
- /* Send WAL statistics to the stats collector. */
+ /* Send WAL statistics to stats collector */The word "the" there makes it proper English. Your copy-pasting should have kept the existing good wording in the other locations rather than replace the existing location with the newly added incorrect wording.
This doesn't make sense:
* current WAL segment file to avoid loading stats collector.
Maybe "overloading" or "overwhelming"?
I see you removed the pgstat_send_wal(force) change. The rest of my comments on the v6 patch still stand I believe.
David J.
Hi, David. Thanks for your comments. On 2021-01-26 08:48, David G. Johnston wrote: > On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > >> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda >> <ikedamsh@oss.nttdata.com> wrote: >>> >>> Hi, thanks for the reviews. >>> >>> I updated the attached patch. >> >> Thank you for updating the patch! > > Your original email with "total number of times" is more correct, > removing the "of times" and just writing "total number of WAL" is not > good wording. > > Specifically, this change is strictly worse than the original. > > - Number of times WAL data was written to disk because WAL > buffers became full > + Total number of WAL data written to disk because WAL buffers > became full > > Both have the flaw that they leave implied exactly what it means to > "write WAL to disk". It is also unclear whether a counter, bytes, or > both, would be more useful here. I've incorporated this into my > documentation suggestions below: > (wal_buffers_full) > > -- Revert - the original was better, though maybe add more detail > similar to the below. I didn't research exactly how this works. OK, I understood. I reverted since this is a counter statistics. > (wal_write) > The number of times WAL buffers were written out to disk via XLogWrite > Thanks. I thought it's better to omit "The" and "XLogWrite" because other views' description omits "The" and there is no description of "XlogWrite" in the documents. What do you think? > -- Seems like this should have a bytes version too Do you mean that we need to separate statistics for wal write? > (wal_write_time) > The amount of time spent writing WAL buffers to disk, excluding sync > time unless the wal_sync_method is either open_datasync or open_sync. > Units are in milliseconds with microsecond resolution. This is zero > when track_wal_io_timing is disabled. Thanks, I'll fix it. > (wal_sync) > The number of times WAL files were synced to disk while > wal_sync_method was set to one of the "sync at commit" options (i.e., > fdatasync, fsync, or fsync_writethrough). Thanks, I'll fix it. > -- it is not going to be zero just because those settings are > presently disabled as they could have been enabled at some point since > the last time these statistics were reset. Right, your description is correct. The "track_wal_io_timing" has the same limitation, doesn't it? > (wal_sync_time) > The amount of time spent syncing WAL files to disk, in milliseconds > with microsecond resolution. This requires setting wal_sync_method to > one of the "sync at commit" options (i.e., fdatasync, fsync, or > fsync_writethrough). Thanks, I'll fix it. I will add the comments related to "track_wal_io_timing". > Also, > > I would suggest extracting the changes to postmaster/pgstat.c and > replication/walreceiver.c to a separate patch as you've fundamentally > changed how it behaves with regards to that function and how it > interacts with the WAL receiver. That seems an entirely separate > topic warranting its own patch and discussion. OK, I will separate two patches. On 2021-01-26 08:52, David G. Johnston wrote: > On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: > >> I agree with your comments. I think it should report when >> reaching the end of WAL too. I add the code to report the stats >> when finishing the current WAL segment file when timeout in the >> main loop and when reaching the end of WAL. > > The following is not an improvement: > > - /* Send WAL statistics to the stats collector. */+ /* Send WAL > statistics to stats collector */ > > The word "the" there makes it proper English. Your copy-pasting > should have kept the existing good wording in the other locations > rather than replace the existing location with the newly added > incorrect wording. Thanks, I'll fix it. > This doesn't make sense: > > * current WAL segment file to avoid loading stats collector. > > Maybe "overloading" or "overwhelming"? > > I see you removed the pgstat_send_wal(force) change. The rest of my > comments on the v6 patch still stand I believe. Yes, "overloading" is right. Thanks. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> (wal_write)
> The number of times WAL buffers were written out to disk via XLogWrite
>
Thanks.
I thought it's better to omit "The" and "XLogWrite" because other views'
description
omits "The" and there is no description of "XlogWrite" in the documents.
What do you think?
The documentation for WAL does get into the public API level of detail and doing so here makes what this measures crystal clear. The potential absence of sufficient detail elsewhere should be corrected instead of making this description more vague. Specifically, probably XLogWrite should be added to the WAL overview as part of this update and probably even have the descriptive section of the documentation note that the number of times that said function is executed is exposed as a counter in the wal statistics table - thus closing the loop.
David J.
On 2021-01-27 00:14, David G. Johnston wrote: > On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: > >>> (wal_write) >>> The number of times WAL buffers were written out to disk via >> XLogWrite >>> >> >> Thanks. >> >> I thought it's better to omit "The" and "XLogWrite" because other >> views' >> description >> omits "The" and there is no description of "XlogWrite" in the >> documents. >> What do you think? > > The documentation for WAL does get into the public API level of detail > and doing so here makes what this measures crystal clear. The > potential absence of sufficient detail elsewhere should be corrected > instead of making this description more vague. Specifically, probably > XLogWrite should be added to the WAL overview as part of this update > and probably even have the descriptive section of the documentation > note that the number of times that said function is executed is > exposed as a counter in the wal statistics table - thus closing the > loop. Thanks for your comments. I added the descriptions in documents and separated the patch into attached two patches. First is to add wal i/o activity statistics. Second is to make the wal receiver report the wal statistics. Please let me know if you have any comments. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
I pgindented the patches. Regards -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On 2021/02/05 8:45, Masahiro Ikeda wrote: > I pgindented the patches. Thanks for updating the patches! + <function>XLogWrite</function>, which nomally called by an + <function>issue_xlog_fsync</function>, which nomally called by an Typo: "nomally" should be "normally"? + <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>) + <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>), Isn't it better to add a space character just after "request"? + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration); If several cycles happen in the do-while loop, m_wal_write_time should be updated with the sum of "duration" in those cycles instead of "duration" in the last cycle? If yes, "+=" should be used instead of "=" when updating m_wal_write_time? + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); Also "=" should be "+=" in the above? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021/02/08 13:01, Fujii Masao wrote: > > > On 2021/02/05 8:45, Masahiro Ikeda wrote: >> I pgindented the patches. > > Thanks for updating the patches! > > + <function>XLogWrite</function>, which nomally called by an > + <function>issue_xlog_fsync</function>, which nomally called by an > > Typo: "nomally" should be "normally"? > > + <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>) > + <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>), > > Isn't it better to add a space character just after "request"? > > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration); > > If several cycles happen in the do-while loop, m_wal_write_time should be > updated with the sum of "duration" in those cycles instead of "duration" > in the last cycle? If yes, "+=" should be used instead of "=" when updating > m_wal_write_time? > > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > > Also "=" should be "+=" in the above? + /* Send WAL statistics */ + pgstat_send_wal(); This may cause overhead in WAL-writing by walwriter because it's called every cycles even when walwriter needs to write more WAL next cycle (don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal() should be called only when WaitLatch() returns with WL_TIMEOUT? - <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>) + <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>), + or WAL data written out to disk by WAL receiver. So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and wal_sync_time are updated even while the other values are not. Isn't this confusing to users? If so, what about reporting those walreceiver stats in pg_stat_wal_receiver? if (endofwal) + { + /* Send WAL statistics to the stats collector */ + pgstat_send_wal(); break; You added pgstat_send_wal() so that it's called in some cases where walreceiver exits. But ISTM that there are other walreceiver-exit cases. For example, in the case where SIGTERM is received. Instead, pgstat_send_wal() should be called in WalRcvDie() for those all cases? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
I pgindented the patches.
... <function>XLogWrite</function>, which is invoked during an <function>XLogFlush</function> request (see ...). This is also incremented by the WAL receiver during replication.
("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste)
is ether -> is either
"This parameter is off by default as it will repeatedly query the operating system..."
", because" -> "as"
wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication."
"The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in....) This is undesirable because ..."
I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable.
David J.
On 2021-02-08 13:01, Fujii Masao wrote: > On 2021/02/05 8:45, Masahiro Ikeda wrote: >> I pgindented the patches. > > Thanks for updating the patches! Thanks for checking the patches. > + <function>XLogWrite</function>, which nomally called by an > + <function>issue_xlog_fsync</function>, which nomally called by > an > > Typo: "nomally" should be "normally"? Yes, I'll fix it. > + <function>XLogFlush</function> request(see <xref > linkend="wal-configuration"/>) > + <function>XLogFlush</function> request(see <xref > linkend="wal-configuration"/>), > > Isn't it better to add a space character just after "request"? Thanks, I'll fix it. > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration); > > If several cycles happen in the do-while loop, m_wal_write_time should > be > updated with the sum of "duration" in those cycles instead of > "duration" > in the last cycle? If yes, "+=" should be used instead of "=" when > updating > m_wal_write_time? > + INSTR_TIME_SET_CURRENT(duration); > + INSTR_TIME_SUBTRACT(duration, start); > + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration); > > Also "=" should be "+=" in the above? Yes, they are my mistake when changing the unit from milliseconds to microseconds. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On 2021-02-08 14:26, Fujii Masao wrote: > On 2021/02/08 13:01, Fujii Masao wrote: >> >> >> On 2021/02/05 8:45, Masahiro Ikeda wrote: >>> I pgindented the patches. >> >> Thanks for updating the patches! >> >> + <function>XLogWrite</function>, which nomally called by an >> + <function>issue_xlog_fsync</function>, which nomally called by >> an >> >> Typo: "nomally" should be "normally"? >> >> + <function>XLogFlush</function> request(see <xref >> linkend="wal-configuration"/>) >> + <function>XLogFlush</function> request(see <xref >> linkend="wal-configuration"/>), >> >> Isn't it better to add a space character just after "request"? >> >> + INSTR_TIME_SET_CURRENT(duration); >> + INSTR_TIME_SUBTRACT(duration, start); >> + WalStats.m_wal_write_time = >> INSTR_TIME_GET_MICROSEC(duration); >> >> If several cycles happen in the do-while loop, m_wal_write_time should >> be >> updated with the sum of "duration" in those cycles instead of >> "duration" >> in the last cycle? If yes, "+=" should be used instead of "=" when >> updating >> m_wal_write_time? >> >> + INSTR_TIME_SET_CURRENT(duration); >> + INSTR_TIME_SUBTRACT(duration, start); >> + WalStats.m_wal_sync_time = >> INSTR_TIME_GET_MICROSEC(duration); >> >> Also "=" should be "+=" in the above? > > + /* Send WAL statistics */ > + pgstat_send_wal(); > > This may cause overhead in WAL-writing by walwriter because it's called > every cycles even when walwriter needs to write more WAL next cycle > (don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal() > should be called only when WaitLatch() returns with WL_TIMEOUT? Thanks, I didn't notice that. I'll fix it. > - <function>XLogFlush</function> request(see <xref > linkend="wal-configuration"/>) > + <function>XLogFlush</function> request(see <xref > linkend="wal-configuration"/>), > + or WAL data written out to disk by WAL receiver. > > So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and > wal_sync_time are updated even while the other values are not. Isn't > this > confusing to users? If so, what about reporting those walreceiver stats > in > pg_stat_wal_receiver? OK, I'll add new infrastructure code to interect with wal receiver and stats collector and show the stats in pg_stat_wal_receiver. > if (endofwal) > + { > + /* Send WAL statistics to the stats collector */ > + pgstat_send_wal(); > break; > > You added pgstat_send_wal() so that it's called in some cases where > walreceiver exits. But ISTM that there are other walreceiver-exit > cases. > For example, in the case where SIGTERM is received. Instead, > pgstat_send_wal() should be called in WalRcvDie() for those all cases? Thanks, I forgot the case. I'll fix it. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On 2021-02-10 00:51, David G. Johnston wrote: > On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: > >> I pgindented the patches. > > ... <function>XLogWrite</function>, which is invoked during an > <function>XLogFlush</function> request (see ...). This is also > incremented by the WAL receiver during replication. > > ("which normally called" should be "which is normally called" or > "which normally is called" if you want to keep true to the original) > You missed the adding the space before an opening parenthesis here and > elsewhere (probably copy-paste) > > is ether -> is either > "This parameter is off by default as it will repeatedly query the > operating system..." > ", because" -> "as" Thanks, I fixed them. > wal_write_time and the sync items also need the note: "This is also > incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. > "The number of times it happened..." -> " (the tally of this event is > reported in wal_buffers_full in....) This is undesirable because ..." Thanks, I fixed it. > I notice that the patch for WAL receiver doesn't require explicitly > computing the sync statistics but does require computing the write > statistics. This is because of the presence of issue_xlog_fsync but > absence of an equivalent pg_xlog_pwrite. Additionally, I observe that > the XLogWrite code path calls pgstat_report_wait_*() while the WAL > receiver path does not. It seems technically straight-forward to > refactor here to avoid the almost-duplicated logic in the two places, > though I suspect there may be a trade-off for not adding another > function call to the stack given the importance of WAL processing > (though that seems marginalized compared to the cost of actually > writing the WAL). Or, as Fujii noted, go the other way and don't have > any shared code between the two but instead implement the WAL receiver > one to use pg_stat_wal_receiver instead. In either case, this > half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On 2021/03/15 10:39, Masahiro Ikeda wrote: > Thanks, I understood get_sync_bit() checks the sync flags and > the write unit of generated wal data and replicated wal data is different. > (It's interesting optimization whether to use kernel cache or not.) > > OK. Although I agree to separate the stats for the walrecever, > I want to hear opinions from other people too. I didn't change the patch. > > Please feel to your comments. What about applying the patch for common WAL write function like XLogWriteFile(), separately from the patch for walreceiver's stats? Seems the former reaches the consensus, so we can commit it firstly. Also even only the former change is useful because which allows walreceiver to report WALWrite wait event. > OK. I agree. > > I wonder to change the error check ways depending on who calls this function? > Now, only the walreceiver checks (1)errno==0 and doesn't check (2)errno==ENITR. > Other processes are the opposite. > > IIUC, it's appropriate that every process checks (1)(2). > Please let me know my understanding is wrong. I'm thinking the same. Regarding (2), commit 79ce29c734 introduced that code. According to the following commit log, it seems harmless to retry on EINTR even walreceiver. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive. >> BTW, currently XLogWrite() increments IO timing even when pg_pwrite() >> reports an error. But this is useless. Probably IO timing should be >> incremented after the return code of pg_pwrite() is checked, instead? > > Yes, I agree. I fixed it. > (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for the patch! nleft = nbytes; do { - errno = 0; + written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset, + ThisTimeLineID, openLogSegNo, wal_segment_size); Can we merge this do-while loop in XLogWrite() into the loop in XLogWriteFile()? If we do that, ISTM that the following codes are not necessary in XLogWrite(). nleft -= written; from += written; + * 'segsize' is a segment size of WAL segment file. Since segsize is always wal_segment_size, segsize argument seems not necessary in XLogWriteFile(). +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset, + TimeLineID timelineid, XLogSegNo segno, int segsize) Why did you use "const void *" instead of "char *" for *buf? Regarding 0005 patch, I will review it later. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021-03-19 16:30, Fujii Masao wrote: > On 2021/03/15 10:39, Masahiro Ikeda wrote: >> Thanks, I understood get_sync_bit() checks the sync flags and >> the write unit of generated wal data and replicated wal data is >> different. >> (It's interesting optimization whether to use kernel cache or not.) >> >> OK. Although I agree to separate the stats for the walrecever, >> I want to hear opinions from other people too. I didn't change the >> patch. >> >> Please feel to your comments. > > What about applying the patch for common WAL write function like > XLogWriteFile(), separately from the patch for walreceiver's stats? > Seems the former reaches the consensus, so we can commit it firstly. > Also even only the former change is useful because which allows > walreceiver to report WALWrite wait event. Agreed. I separated the patches. If only the former is committed, my trivial concern is that there may be a disadvantage, but no advantage for the standby server. It may lead to performance degradation to the wal receiver by calling INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the latter patch is committed. I think it's ok because this not happening in the case to disable the "track_wal_io_timing" in the standby server. Although some users may start the standby server using the backup which "track_wal_io_timing" is enabled in the primary server, they will say it's ok since the users already accept the performance degradation in the primary server. >> OK. I agree. >> >> I wonder to change the error check ways depending on who calls this >> function? >> Now, only the walreceiver checks (1)errno==0 and doesn't check >> (2)errno==ENITR. >> Other processes are the opposite. >> >> IIUC, it's appropriate that every process checks (1)(2). >> Please let me know my understanding is wrong. > > I'm thinking the same. Regarding (2), commit 79ce29c734 introduced > that code. According to the following commit log, it seems harmless > to retry on EINTR even walreceiver. > > Also retry on EINTR. All signals used in the backend are flagged > SA_RESTART > nowadays, so it shouldn't happen, but better to be defensive. Thanks, I understood. >>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite() >>> reports an error. But this is useless. Probably IO timing should be >>> incremented after the return code of pg_pwrite() is checked, instead? >> >> Yes, I agree. I fixed it. >> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) > > Thanks for the patch! > > nleft = nbytes; > do > { > - errno = 0; > + written = XLogWriteFile(openLogFile, from, nleft, (off_t) > startoffset, > + ThisTimeLineID, openLogSegNo, wal_segment_size); > > Can we merge this do-while loop in XLogWrite() into the loop > in XLogWriteFile()? > If we do that, ISTM that the following codes are not necessary in > XLogWrite(). > > nleft -= written; > from += written; OK, I fixed it. > + * 'segsize' is a segment size of WAL segment file. > > Since segsize is always wal_segment_size, segsize argument seems > not necessary in XLogWriteFile(). Right. I fixed it. > +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset, > + TimeLineID timelineid, XLogSegNo segno, int segsize) > > Why did you use "const void *" instead of "char *" for *buf? I followed the argument of pg_pwrite(). But, I think "char *" is better, so fixed it. > Regarding 0005 patch, I will review it later. Thanks. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On 2021/03/22 9:50, ikedamsh wrote: > Agreed. I separated the patches. > > If only the former is committed, my trivial concern is that there may be > a disadvantage, but no advantage for the standby server. It may lead to > performance degradation to the wal receiver by calling > INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the > latter patch is committed. Your concern is valid, so let's polish and commit also the 0003 patch to v14. I'm still thinking that it's better to separate wal_xxx columns into walreceiver's and the others. But if we count even walreceiver activity on the existing columns, regarding 0003 patch, we need to update the document? For example, "Number of times WAL buffers were written out to disk via XLogWrite request." should be "Number of times WAL buffers were written out to disk via XLogWrite request and by WAL receiver process."? Maybe we need to append some descriptions about this into "WAL configuration" section? > I followed the argument of pg_pwrite(). > But, I think "char *" is better, so fixed it. Thanks for updating the patch! +extern int XLogWriteFile(int fd, char *buf, + size_t nbyte, off_t offset, + TimeLineID timelineid, XLogSegNo segno, + bool write_all); write_all seems not to be necessary. You added this flag for walreceiver, I guess. But even without the argument, walreceiver seems to work expectedly. So, what about the attached patch? I applied some cosmetic changes to the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
On 2021/03/22 16:50, Fujii Masao wrote: > > > On 2021/03/22 9:50, ikedamsh wrote: >> Agreed. I separated the patches. >> >> If only the former is committed, my trivial concern is that there may be >> a disadvantage, but no advantage for the standby server. It may lead to >> performance degradation to the wal receiver by calling >> INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the >> latter patch is committed. > > Your concern is valid, so let's polish and commit also the 0003 patch to v14. > I'm still thinking that it's better to separate wal_xxx columns into > walreceiver's and the others. But if we count even walreceiver activity on > the existing columns, regarding 0003 patch, we need to update the document? > For example, "Number of times WAL buffers were written out to disk via > XLogWrite request." should be "Number of times WAL buffers were written > out to disk via XLogWrite request and by WAL receiver process."? Maybe > we need to append some descriptions about this into "WAL configuration" > section? Agreed. Users can know whether the stats is for walreceiver or not. The pg_stat_wal view in standby server shows for the walreceiver, and in primary server it shows for the others. So, I updated the document. (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) >> I followed the argument of pg_pwrite(). >> But, I think "char *" is better, so fixed it. > > Thanks for updating the patch! > > +extern int XLogWriteFile(int fd, char *buf, > + size_t nbyte, off_t offset, > + TimeLineID timelineid, XLogSegNo segno, > + bool write_all); > > write_all seems not to be necessary. You added this flag for walreceiver, > I guess. But even without the argument, walreceiver seems to work expectedly. > So, what about the attached patch? I applied some cosmetic changes to the patch. Thanks a lot. Yes, "write_all" is unnecessary. Your patch is looks good to me. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Attachment
On 2021/03/22 20:25, ikedamsh wrote: > Agreed. Users can know whether the stats is for walreceiver or not. The > pg_stat_wal view in standby server shows for the walreceiver, and in primary > server it shows for the others. So, I updated the document. > (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the docs! There was the discussion about when the stats collector is invoked, at [1]. Currently during archive recovery or standby, the stats collector is invoked when the startup process reaches the consistent state, sends PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting read-only connections. But walreceiver can be invoked at earlier stage. This can cause walreceiver to generate and send the statistics about WAL writing even though the stats collector has not been running yet. This might be problematic? If so, maybe we need to ensure that the stats collector is invoked before walreceiver? During recovery, the stats collector is not invoked if hot standby mode is disabled. But walreceiver can be running in this case. So probably we should change walreceiver so that it's invoked even when hot standby is disabled? Otherwise we cannnot collect the statistics about WAL writing by walreceiver in that case. [1] https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2021/03/23 16:10, Fujii Masao wrote: > > > On 2021/03/22 20:25, ikedamsh wrote: >> Agreed. Users can know whether the stats is for walreceiver or not. The >> pg_stat_wal view in standby server shows for the walreceiver, and in primary >> server it shows for the others. So, I updated the document. >> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) > > Thanks for updating the docs! > > There was the discussion about when the stats collector is invoked, at [1]. > Currently during archive recovery or standby, the stats collector is > invoked when the startup process reaches the consistent state, sends > PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting > read-only connections. But walreceiver can be invoked at earlier stage. > This can cause walreceiver to generate and send the statistics about WAL > writing even though the stats collector has not been running yet. This might > be problematic? If so, maybe we need to ensure that the stats collector is > invoked before walreceiver? > > During recovery, the stats collector is not invoked if hot standby mode is > disabled. But walreceiver can be running in this case. So probably we should > change walreceiver so that it's invoked even when hot standby is disabled? > Otherwise we cannnot collect the statistics about WAL writing by walreceiver > in that case. > > [1] > https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com Thanks for comments! I didn't notice that. As I mentioned[1], if my understanding is right, this issue seem to be not for only the wal receiver. Since the shared memory thread already handles these issues, does this patch, which to collect the stats for the wal receiver and make a common function for writing wal files, have to be committed after the patches for share memory stats are committed? Or to handle them in this thread because we don't know when the shared memory stats patches will be committed. I think the former is better because to collect stats in shared memory is very useful feature for users and it make a big change in design. So, I think it's beneficial to make an effort to move the shared memory stats thread forward (by reviewing or testing) instead of handling the issues in this thread. [1] https://www.postgresql.org/message-id/9f4e19ad-518d-b91a-e500-25a666471c42%40oss.nttdata.com Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On 2021/03/25 11:50, Masahiro Ikeda wrote: > > > On 2021/03/23 16:10, Fujii Masao wrote: >> >> >> On 2021/03/22 20:25, ikedamsh wrote: >>> Agreed. Users can know whether the stats is for walreceiver or not. The >>> pg_stat_wal view in standby server shows for the walreceiver, and in primary >>> server it shows for the others. So, I updated the document. >>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) >> >> Thanks for updating the docs! >> >> There was the discussion about when the stats collector is invoked, at [1]. >> Currently during archive recovery or standby, the stats collector is >> invoked when the startup process reaches the consistent state, sends >> PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting >> read-only connections. But walreceiver can be invoked at earlier stage. >> This can cause walreceiver to generate and send the statistics about WAL >> writing even though the stats collector has not been running yet. This might >> be problematic? If so, maybe we need to ensure that the stats collector is >> invoked before walreceiver? >> >> During recovery, the stats collector is not invoked if hot standby mode is >> disabled. But walreceiver can be running in this case. So probably we should >> change walreceiver so that it's invoked even when hot standby is disabled? >> Otherwise we cannnot collect the statistics about WAL writing by walreceiver >> in that case. >> >> [1] >> https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com > > Thanks for comments! I didn't notice that. > As I mentioned[1], if my understanding is right, this issue seem to be not for > only the wal receiver. > > Since the shared memory thread already handles these issues, does this patch, > which to collect the stats for the wal receiver and make a common function for > writing wal files, have to be committed after the patches for share memory > stats are committed? Or to handle them in this thread because we don't know > when the shared memory stats patches will be committed. > > I think the former is better because to collect stats in shared memory is very > useful feature for users and it make a big change in design. So, I think it's > beneficial to make an effort to move the shared memory stats thread forward > (by reviewing or testing) instead of handling the issues in this thread. Sounds reasonable. Agreed. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION