Thread: About to add WAL write/fsync statistics to pg_stat_wal view

About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following 
information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query 
performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time 
FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write      | 650  # Total number of times WAL data was written to 
the disk

wal_write_time | 43   # Total amount of time that has been spent in the 
portion of WAL data was written to disk
                       # if track-wal-io-timing is enabled, otherwise 
zero

wal_sync       | 78   # Total number of times WAL data was synced to the 
disk

wal_sync_time  | 104  # Total amount of time that has been spent in the 
portion of WAL data was synced to disk
                       # if track-wal-io-timing is enabled, otherwise 
zero
```

What do you think?
Please let me know your comments.

Regards
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Li Japin
Date:
Hi,

> On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> 
> Hi,
> 
> I propose to add wal write/fsync statistics to pg_stat_wal view.
> It's useful not only for developing/improving source code related to WAL
> but also for users to detect workload changes, HW failure, and so on.
> 
> I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view.
> I separate the parameter from "track_io_timing" to "track_wal_io_timing"
> because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity.
> 
> ```
> postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal;
> -[ RECORD 1 ]--+----
> wal_write      | 650  # Total number of times WAL data was written to the disk
> 
> wal_write_time | 43   # Total amount of time that has been spent in the portion of WAL data was written to disk
>                      # if track-wal-io-timing is enabled, otherwise zero
> 
> wal_sync       | 78   # Total number of times WAL data was synced to the disk
> 
> wal_sync_time  | 104  # Total amount of time that has been spent in the portion of WAL data was synced to disk
>                      # if track-wal-io-timing is enabled, otherwise zero
> ```
> 
> What do you think?
> Please let me know your comments.
> 
> Regards
> -- 
> Masahiro Ikeda
> NTT DATA CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a no previous prototype warning for ‘fsyncMethodCalled’, and it now only used in xlog.c,
should we declare with static? And this function wants a boolean as a return, should we use
true/false other than 0/1?

+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+       if (!enableFsync)
+               return 0;
+
+       switch (sync_method)
+       {
+               case SYNC_METHOD_FSYNC:
+               case SYNC_METHOD_FSYNC_WRITETHROUGH:
+               case SYNC_METHOD_FDATASYNC:
+                       return 1;
+               default:
+                       /* others don't have a specific fsync method */
+                       return 0;
+       }
+}
+

--
Best regards
ChengDu WenWu Information Technology Co.,Ltd.
Japin Li


Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2020-12-08 16:45, Li Japin wrote:
> Hi,
> 
>> On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> 
>> wrote:
>> 
>> Hi,
>> 
>> I propose to add wal write/fsync statistics to pg_stat_wal view.
>> It's useful not only for developing/improving source code related to 
>> WAL
>> but also for users to detect workload changes, HW failure, and so on.
>> 
>> I introduce "track_wal_io_timing" parameter and provide the following 
>> information on pg_stat_wal view.
>> I separate the parameter from "track_io_timing" to 
>> "track_wal_io_timing"
>> because IIUC, WAL I/O activity may have a greater impact on query 
>> performance than database I/O activity.
>> 
>> ```
>> postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time 
>> FROM pg_stat_wal;
>> -[ RECORD 1 ]--+----
>> wal_write      | 650  # Total number of times WAL data was written to 
>> the disk
>> 
>> wal_write_time | 43   # Total amount of time that has been spent in 
>> the portion of WAL data was written to disk
>>                      # if track-wal-io-timing is enabled, otherwise 
>> zero
>> 
>> wal_sync       | 78   # Total number of times WAL data was synced to 
>> the disk
>> 
>> wal_sync_time  | 104  # Total amount of time that has been spent in 
>> the portion of WAL data was synced to disk
>>                      # if track-wal-io-timing is enabled, otherwise 
>> zero
>> ```
>> 
>> What do you think?
>> Please let me know your comments.
>> 
>> Regards
>> --
>> Masahiro Ikeda
>> NTT DATA 
>> CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>
> 
> There is a no previous prototype warning for ‘fsyncMethodCalled’, and
> it now only used in xlog.c,
> should we declare with static? And this function wants a boolean as a
> return, should we use
> true/false other than 0/1?
> 
> +/*
> + * Check if fsync mothod is called.
> + */
> +bool
> +fsyncMethodCalled()
> +{
> +       if (!enableFsync)
> +               return 0;
> +
> +       switch (sync_method)
> +       {
> +               case SYNC_METHOD_FSYNC:
> +               case SYNC_METHOD_FSYNC_WRITETHROUGH:
> +               case SYNC_METHOD_FDATASYNC:
> +                       return 1;
> +               default:
> +                       /* others don't have a specific fsync method */
> +                       return 0;
> +       }
> +}
> +

Thanks for your review.
I agree with your comments. I fixed them.

Regards
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
Hi,

I rebased the patch to the master branch.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

RE: About to add WAL write/fsync statistics to pg_stat_wal view

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.

```
+    /*
+     * Measure i/o timing to fsync WAL data.
+     *
+     * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+     * If sync_method doesn't have its fsync method, to skip too.
+     */
+    if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+        INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even if the process is WalRecevier.
Because all wal_fsync should be recorded, and
some performance issues have been aleady occurred if track_wal_io_timing is turned on.
I think it's strange only to take care of the walrecevier case.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiko Sawada
Date:
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> Hi,
>
> I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.

After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records      | 285947
wal_fpi          | 53285
wal_bytes        | 442008213
wal_buffers_full | 0
wal_write        | 25516
wal_write_time   | 0
wal_sync         | 25437
wal_sync_time    | 14490
stats_reset      | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.

* How about checking track_wal_io_timing first?

* s/standy/standby/

---
+   /* increment the i/o timing and the number of times to fsync WAL data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

* void is missing as argument?

* s/mothod/method/

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-22 11:54, kuroda.hayato@fujitsu.com wrote:
> Dear Ikeda-san,
> 
> This patch cannot be applied to the HEAD, but anyway I put a comment.
> 
> ```
> +    /*
> +     * Measure i/o timing to fsync WAL data.
> +     *
> +     * The wal receiver skip to collect it to avoid performance
> degradation of standy servers.
> +     * If sync_method doesn't have its fsync method, to skip too.
> +     */
> +    if (!AmWalReceiverProcess() && track_wal_io_timing && 
> fsyncMethodCalled())
> +        INSTR_TIME_SET_CURRENT(start);
> ```
> 
> I think m_wal_sync_time should be collected even if the process is 
> WalRecevier.
> Because all wal_fsync should be recorded, and
> some performance issues have been aleady occurred if
> track_wal_io_timing is turned on.
> I think it's strange only to take care of the walrecevier case.

Kuroda-san, Thanks for your comments.

Although I thought that the performance impact may be bigger in standby 
servers
because WALReceiver didn't use wal buffers, it's no need to be 
considered.
I agreed that if track_wal_io_timing is turned on, the primary server's
performance degradation occurs too.

I will make rebased and modified.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-22 14:50, Masahiko Sawada wrote:
> On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda 
> <ikedamsh@oss.nttdata.com> wrote:
>> 
>> Hi,
>> 
>> I rebased the patch to the master branch.
> 
> Thank you for working on this. I've read the latest patch. Here are 
> comments:
> 
> ---
> +               if (track_wal_io_timing)
> +               {
> +                   INSTR_TIME_SET_CURRENT(duration);
> +                   INSTR_TIME_SUBTRACT(duration, start);
> +                   WalStats.m_wal_write_time +=
> INSTR_TIME_GET_MILLISEC(duration);
> +               }
> 
> * I think it should add the time in micro sec.
> After running pgbench with track_wal_io_timing = on for 30 sec,
> pg_stat_wal showed the following on my environment:
> 
> postgres(1:61569)=# select * from pg_stat_wal;
> -[ RECORD 1 ]----+-----------------------------
> wal_records      | 285947
> wal_fpi          | 53285
> wal_bytes        | 442008213
> wal_buffers_full | 0
> wal_write        | 25516
> wal_write_time   | 0
> wal_sync         | 25437
> wal_sync_time    | 14490
> stats_reset      | 2021-01-22 10:56:13.29464+09
> 
> Since writes can complete less than a millisecond, wal_write_time
> didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

> ---
> +   /*
> +    * Measure i/o timing to fsync WAL data.
> +    *
> +    * The wal receiver skip to collect it to avoid performance
> degradation of standy servers.
> +    * If sync_method doesn't have its fsync method, to skip too.
> +    */
> +   if (!AmWalReceiverProcess() && track_wal_io_timing && 
> fsyncMethodCalled())
> +       INSTR_TIME_SET_CURRENT(start);
> 
> * Why does only the wal receiver skip it even if track_wal_io_timinig
> is true? I think the performance degradation is also true for backend
> processes. If there is another reason for that, I think it's better to
> mention in both the doc and comment.
> * How about checking track_wal_io_timing first?
> * s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

> ---
> +   /* increment the i/o timing and the number of times to fsync WAL 
> data */
> +   if (fsyncMethodCalled())
> +   {
> +       if (!AmWalReceiverProcess() && track_wal_io_timing)
> +       {
> +           INSTR_TIME_SET_CURRENT(duration);
> +           INSTR_TIME_SUBTRACT(duration, start);
> +           WalStats.m_wal_sync_time += 
> INSTR_TIME_GET_MILLISEC(duration);
> +       }
> +
> +       WalStats.m_wal_sync++;
> +   }
> 
> * I'd avoid always calling fsyncMethodCalled() in this path. How about
> incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be 
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like 
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be 
implemented
in each fsync method. It leads to many duplicated codes.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?


> ---
> +/*
> + * Check if fsync mothod is called.
> + */
> +static bool
> +fsyncMethodCalled()
> +{
> +   if (!enableFsync)
> +       return false;
> +
> +   switch (sync_method)
> +   {
> +       case SYNC_METHOD_FSYNC:
> +       case SYNC_METHOD_FSYNC_WRITETHROUGH:
> +       case SYNC_METHOD_FDATASYNC:
> +           return true;
> +       default:
> +           /* others don't have a specific fsync method */
> +           return false;
> +   }
> +}
> 
> * I'm concerned that the function name could confuse the reader
> because it's called even before the fsync method is called. As I
> commented above, calling to fsyncMethodCalled() can be eliminated.
> That way, this function is called at only once. So do we really need
> this function?

Thanks to your comments, I removed them.


> * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> if enableFsync is false. Why does the function return false in that
> case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like 
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync 
some data to the disk at the time.

> * void is missing as argument?
> 
> * s/mothod/method/

I removed them.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
japin
Date:
Hi, Masahiro

Thanks for you update the v4 patch.  Here are some comments:

(1)
+       char            *msg = NULL;
+       bool            sync_called;            /* whether to sync data to the disk. */
+       instr_time      start;
+       instr_time      duration;
+
+       /* check whether to sync data to the disk is really occurred. */
+       sync_called = false;

Maybe we can initialize the "sync_called" variable when declare it.

(2)
+       if (sync_called)
+       {
+               /* increment the i/o timing and the number of times to fsync WAL data */
+               if (track_wal_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(duration);
+                       INSTR_TIME_SUBTRACT(duration, start);
+                       WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
+               }
+
+               WalStats.m_wal_sync++;
+       }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

In the issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
Does that mean it synced when write the WAL data?  And for those cases, we
cannot get accurate write/sync timing and number of write/sync times, right?

        case SYNC_METHOD_OPEN:
        case SYNC_METHOD_OPEN_DSYNC:
            /* write synced it already */
            break;

On Fri, 22 Jan 2021 at 21:05, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> On 2021-01-22 14:50, Masahiko Sawada wrote:
>> On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda 
>> <ikedamsh@oss.nttdata.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I rebased the patch to the master branch.
>> 
>> Thank you for working on this. I've read the latest patch. Here are 
>> comments:
>> 
>> ---
>> +               if (track_wal_io_timing)
>> +               {
>> +                   INSTR_TIME_SET_CURRENT(duration);
>> +                   INSTR_TIME_SUBTRACT(duration, start);
>> +                   WalStats.m_wal_write_time +=
>> INSTR_TIME_GET_MILLISEC(duration);
>> +               }
>> 
>> * I think it should add the time in micro sec.
>> After running pgbench with track_wal_io_timing = on for 30 sec,
>> pg_stat_wal showed the following on my environment:
>> 
>> postgres(1:61569)=# select * from pg_stat_wal;
>> -[ RECORD 1 ]----+-----------------------------
>> wal_records      | 285947
>> wal_fpi          | 53285
>> wal_bytes        | 442008213
>> wal_buffers_full | 0
>> wal_write        | 25516
>> wal_write_time   | 0
>> wal_sync         | 25437
>> wal_sync_time    | 14490
>> stats_reset      | 2021-01-22 10:56:13.29464+09
>> 
>> Since writes can complete less than a millisecond, wal_write_time
>> didn't increase. I think sync_time could also have the same problem.
>
> Thanks for your comments. I didn't notice that.
> I changed the unit from milliseconds to microseconds.
>
>> ---
>> +   /*
>> +    * Measure i/o timing to fsync WAL data.
>> +    *
>> +    * The wal receiver skip to collect it to avoid performance
>> degradation of standy servers.
>> +    * If sync_method doesn't have its fsync method, to skip too.
>> +    */
>> +   if (!AmWalReceiverProcess() && track_wal_io_timing && 
>> fsyncMethodCalled())
>> +       INSTR_TIME_SET_CURRENT(start);
>> 
>> * Why does only the wal receiver skip it even if track_wal_io_timinig
>> is true? I think the performance degradation is also true for backend
>> processes. If there is another reason for that, I think it's better to
>> mention in both the doc and comment.
>> * How about checking track_wal_io_timing first?
>> * s/standy/standby/
>
> I fixed it.
> As kuroda-san mentioned too, the skip is no need to be considered.
>
>> ---
>> +   /* increment the i/o timing and the number of times to fsync WAL 
>> data */
>> +   if (fsyncMethodCalled())
>> +   {
>> +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>> +       {
>> +           INSTR_TIME_SET_CURRENT(duration);
>> +           INSTR_TIME_SUBTRACT(duration, start);
>> +           WalStats.m_wal_sync_time += 
>> INSTR_TIME_GET_MILLISEC(duration);
>> +       }
>> +
>> +       WalStats.m_wal_sync++;
>> +   }
>> 
>> * I'd avoid always calling fsyncMethodCalled() in this path. How about
>> incrementing m_wal_sync after each sync operation?
>
> I think if syncing the disk does not occur, m_wal_sync should not be 
> incremented.
> It depends enableFsync and sync_method.
>
> enableFsync is checked in each fsync method like 
> pg_fsync_no_writethrough(),
> so if incrementing m_wal_sync after each sync operation, it should be 
> implemented
> in each fsync method. It leads to many duplicated codes.
>
> So, why don't you change the function to a flag whether to
> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>
>
>> ---
>> +/*
>> + * Check if fsync mothod is called.
>> + */
>> +static bool
>> +fsyncMethodCalled()
>> +{
>> +   if (!enableFsync)
>> +       return false;
>> +
>> +   switch (sync_method)
>> +   {
>> +       case SYNC_METHOD_FSYNC:
>> +       case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> +       case SYNC_METHOD_FDATASYNC:
>> +           return true;
>> +       default:
>> +           /* others don't have a specific fsync method */
>> +           return false;
>> +   }
>> +}
>> 
>> * I'm concerned that the function name could confuse the reader
>> because it's called even before the fsync method is called. As I
>> commented above, calling to fsyncMethodCalled() can be eliminated.
>> That way, this function is called at only once. So do we really need
>> this function?
>
> Thanks to your comments, I removed them.
>
>
>> * As far as I read the code, issue_xlog_fsync() seems to do fsync even
>> if enableFsync is false. Why does the function return false in that
>> case? I might be missing something.
>
> IIUC, the reason is that I thought that each fsync functions like 
> pg_fsync_no_writethrough() check enableFsync.
>
> If this code doesn't check, m_wal_sync_time may be incremented
> even though some sync methods like SYNC_METHOD_OPEN don't call to sync 
> some data to the disk at the time.
>
>> * void is missing as argument?
>> 
>> * s/mothod/method/
>
> I removed them.
>
>
> Regards,


-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
Hi, Japin

Thanks for your comments.

On 2021-01-23 01:46, japin wrote:
> Hi, Masahiro
> 
> Thanks for you update the v4 patch.  Here are some comments:
> 
> (1)
> +       char            *msg = NULL;
> +       bool            sync_called;            /* whether to sync
> data to the disk. */
> +       instr_time      start;
> +       instr_time      duration;
> +
> +       /* check whether to sync data to the disk is really occurred. 
> */
> +       sync_called = false;
> 
> Maybe we can initialize the "sync_called" variable when declare it.

Yes, I fixed it.

> (2)
> +       if (sync_called)
> +       {
> +               /* increment the i/o timing and the number of times to
> fsync WAL data */
> +               if (track_wal_io_timing)
> +               {
> +                       INSTR_TIME_SET_CURRENT(duration);
> +                       INSTR_TIME_SUBTRACT(duration, start);
> +                       WalStats.m_wal_sync_time =
> INSTR_TIME_GET_MICROSEC(duration);
> +               }
> +
> +               WalStats.m_wal_sync++;
> +       }
> 
> There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

Yes, I removed it.

> In the issue_xlog_fsync(), the comment says that if sync_method is
> SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
> Does that mean it synced when write the WAL data?  And for those cases, 
> we
> cannot get accurate write/sync timing and number of write/sync times, 
> right?
> 
>         case SYNC_METHOD_OPEN:
>         case SYNC_METHOD_OPEN_DSYNC:
>             /* write synced it already */
>             break;

Yes, I add the following comments in the document.

@@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        </para>
        <para>
         Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>

@@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        <para>
         Total amount of time that has been spent in the portion of
         WAL data was synced to disk, in milliseconds
-       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero)
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>


I attached a modified patch.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

RE: About to add WAL write/fsync statistics to pg_stat_wal view

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and
can be used on my RHEL7.
wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records      | 121781
wal_fpi          | 2287
wal_bytes        | 36055146
wal_buffers_full | 799
wal_write        | 12770
wal_write_time   | 4.469
wal_sync         | 11962
wal_sync_time    | 132.352
stats_reset      | 2021-01-25 00:51:40.674412+00
```

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL data is synced
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced
+       when to write it).
       </para></entry>
      </row>
 ```

Maybe "Total amount of time" should be used, not "Total number of time."
Other views use "amount."

I have no comments anymore.

Hayato Kuroda
FUJITSU LIMITED



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiko Sawada
Date:
On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
>
> On 2021-01-22 14:50, Masahiko Sawada wrote:
> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
> > <ikedamsh@oss.nttdata.com> wrote:
> >>
> >> Hi,
> >>
> >> I rebased the patch to the master branch.
> >
> > Thank you for working on this. I've read the latest patch. Here are
> > comments:
> >
> > ---
> > +               if (track_wal_io_timing)
> > +               {
> > +                   INSTR_TIME_SET_CURRENT(duration);
> > +                   INSTR_TIME_SUBTRACT(duration, start);
> > +                   WalStats.m_wal_write_time +=
> > INSTR_TIME_GET_MILLISEC(duration);
> > +               }
> >
> > * I think it should add the time in micro sec.
> > After running pgbench with track_wal_io_timing = on for 30 sec,
> > pg_stat_wal showed the following on my environment:
> >
> > postgres(1:61569)=# select * from pg_stat_wal;
> > -[ RECORD 1 ]----+-----------------------------
> > wal_records      | 285947
> > wal_fpi          | 53285
> > wal_bytes        | 442008213
> > wal_buffers_full | 0
> > wal_write        | 25516
> > wal_write_time   | 0
> > wal_sync         | 25437
> > wal_sync_time    | 14490
> > stats_reset      | 2021-01-22 10:56:13.29464+09
> >
> > Since writes can complete less than a millisecond, wal_write_time
> > didn't increase. I think sync_time could also have the same problem.
>
> Thanks for your comments. I didn't notice that.
> I changed the unit from milliseconds to microseconds.
>
> > ---
> > +   /*
> > +    * Measure i/o timing to fsync WAL data.
> > +    *
> > +    * The wal receiver skip to collect it to avoid performance
> > degradation of standy servers.
> > +    * If sync_method doesn't have its fsync method, to skip too.
> > +    */
> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
> > fsyncMethodCalled())
> > +       INSTR_TIME_SET_CURRENT(start);
> >
> > * Why does only the wal receiver skip it even if track_wal_io_timinig
> > is true? I think the performance degradation is also true for backend
> > processes. If there is another reason for that, I think it's better to
> > mention in both the doc and comment.
> > * How about checking track_wal_io_timing first?
> > * s/standy/standby/
>
> I fixed it.
> As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?

>
> > ---
> > +   /* increment the i/o timing and the number of times to fsync WAL
> > data */
> > +   if (fsyncMethodCalled())
> > +   {
> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
> > +       {
> > +           INSTR_TIME_SET_CURRENT(duration);
> > +           INSTR_TIME_SUBTRACT(duration, start);
> > +           WalStats.m_wal_sync_time +=
> > INSTR_TIME_GET_MILLISEC(duration);
> > +       }
> > +
> > +       WalStats.m_wal_sync++;
> > +   }
> >
> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
> > incrementing m_wal_sync after each sync operation?
>
> I think if syncing the disk does not occur, m_wal_sync should not be
> incremented.
> It depends enableFsync and sync_method.
>
> enableFsync is checked in each fsync method like
> pg_fsync_no_writethrough(),
> so if incrementing m_wal_sync after each sync operation, it should be
> implemented
> in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

> So, why don't you change the function to a flag whether to
> sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

    if (enableFsync)
    {
        switch (sync_method)
        {
            case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
            case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
            case SYNC_METHOD_FDATASYNC:
#endif
                WalStats.m_wal_sync++;
                if (track_wal_io_timing)
                    INSTR_TIME_SET_CURRENT(start);
                break;
            default:
                break;
        }
    }

  (do fsync and error handling here)

   /* increment the i/o timing and the number of times to fsync WAL data */
   if (track_wal_io_timing)
   {
       INSTR_TIME_SET_CURRENT(duration);
       INSTR_TIME_SUBTRACT(duration, start);
       WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
   }

I think we can change the first switch-case to an if statement.

>
>
> > * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> > if enableFsync is false. Why does the function return false in that
> > case? I might be missing something.
>
> IIUC, the reason is that I thought that each fsync functions like
> pg_fsync_no_writethrough() check enableFsync.
>
> If this code doesn't check, m_wal_sync_time may be incremented
> even though some sync methods like SYNC_METHOD_OPEN don't call to sync
> some data to the disk at the time.

Right.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
japin
Date:
On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>>
>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>> > <ikedamsh@oss.nttdata.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I rebased the patch to the master branch.
>> >
>> > Thank you for working on this. I've read the latest patch. Here are
>> > comments:
>> >
>> > ---
>> > +               if (track_wal_io_timing)
>> > +               {
>> > +                   INSTR_TIME_SET_CURRENT(duration);
>> > +                   INSTR_TIME_SUBTRACT(duration, start);
>> > +                   WalStats.m_wal_write_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > +               }
>> >
>> > * I think it should add the time in micro sec.
>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>> > pg_stat_wal showed the following on my environment:
>> >
>> > postgres(1:61569)=# select * from pg_stat_wal;
>> > -[ RECORD 1 ]----+-----------------------------
>> > wal_records      | 285947
>> > wal_fpi          | 53285
>> > wal_bytes        | 442008213
>> > wal_buffers_full | 0
>> > wal_write        | 25516
>> > wal_write_time   | 0
>> > wal_sync         | 25437
>> > wal_sync_time    | 14490
>> > stats_reset      | 2021-01-22 10:56:13.29464+09
>> >
>> > Since writes can complete less than a millisecond, wal_write_time
>> > didn't increase. I think sync_time could also have the same problem.
>>
>> Thanks for your comments. I didn't notice that.
>> I changed the unit from milliseconds to microseconds.
>>
>> > ---
>> > +   /*
>> > +    * Measure i/o timing to fsync WAL data.
>> > +    *
>> > +    * The wal receiver skip to collect it to avoid performance
>> > degradation of standy servers.
>> > +    * If sync_method doesn't have its fsync method, to skip too.
>> > +    */
>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
>> > fsyncMethodCalled())
>> > +       INSTR_TIME_SET_CURRENT(start);
>> >
>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>> > is true? I think the performance degradation is also true for backend
>> > processes. If there is another reason for that, I think it's better to
>> > mention in both the doc and comment.
>> > * How about checking track_wal_io_timing first?
>> > * s/standy/standby/
>>
>> I fixed it.
>> As kuroda-san mentioned too, the skip is no need to be considered.
>
> I think you also removed the code to have the wal receiver report the
> stats. So with the latest patch, the wal receiver tracks those
> statistics but doesn't report.
>
> And maybe XLogWalRcvWrite() also needs to track I/O?
>
>>
>> > ---
>> > +   /* increment the i/o timing and the number of times to fsync WAL
>> > data */
>> > +   if (fsyncMethodCalled())
>> > +   {
>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>> > +       {
>> > +           INSTR_TIME_SET_CURRENT(duration);
>> > +           INSTR_TIME_SUBTRACT(duration, start);
>> > +           WalStats.m_wal_sync_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > +       }
>> > +
>> > +       WalStats.m_wal_sync++;
>> > +   }
>> >
>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>> > incrementing m_wal_sync after each sync operation?
>>
>> I think if syncing the disk does not occur, m_wal_sync should not be
>> incremented.
>> It depends enableFsync and sync_method.
>>
>> enableFsync is checked in each fsync method like
>> pg_fsync_no_writethrough(),
>> so if incrementing m_wal_sync after each sync operation, it should be
>> implemented
>> in each fsync method. It leads to many duplicated codes.
>
> Right. I missed that each fsync function checks enableFsync.
>
>> So, why don't you change the function to a flag whether to
>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
>
>     if (enableFsync)
>     {
>         switch (sync_method)
>         {
>             case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
>             case SYNC_METHOD_FDATASYNC:
> #endif
>                 WalStats.m_wal_sync++;
>                 if (track_wal_io_timing)
>                     INSTR_TIME_SET_CURRENT(start);
>                 break;
>             default:
>                 break;
>         }
>     }
>
>   (do fsync and error handling here)
>
>    /* increment the i/o timing and the number of times to fsync WAL data */
>    if (track_wal_io_timing)
>    {
>        INSTR_TIME_SET_CURRENT(duration);
>        INSTR_TIME_SUBTRACT(duration, start);
>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
>    }
>
> I think we can change the first switch-case to an if statement.
>

+1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.



RE: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-25 10:34, kuroda.hayato@fujitsu.com wrote:
> Dear Ikeda-san,
> 
> Thank you for updating the patch. This can be applied to master, and
> can be used on my RHEL7.
> wal_write_time and wal_sync_time increase normally :-).
> 
> ```
> postgres=# select * from pg_stat_wal;
> -[ RECORD 1 ]----+------------------------------
> wal_records      | 121781
> wal_fpi          | 2287
> wal_bytes        | 36055146
> wal_buffers_full | 799
> wal_write        | 12770
> wal_write_time   | 4.469
> wal_sync         | 11962
> wal_sync_time    | 132.352
> stats_reset      | 2021-01-25 00:51:40.674412+00
> ```

Thanks for checking.

> I put a further comment:
> 
> ```
> @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
>         <structfield>wal_buffers_full</structfield> <type>bigint</type>
>        </para>
>        <para>
> -       Number of times WAL data was written to disk because WAL
> buffers became full
> +       Total number of times WAL data was written to disk because WAL
> buffers became full
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para 
> role="column_definition">
> +       <structfield>wal_write</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Total number of times WAL data was written to disk
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para 
> role="column_definition">
> +       <structfield>wal_write_time</structfield> <type>double 
> precision</type>
> +      </para>
> +      <para>
> +       Total amount of time that has been spent in the portion of
> +       WAL data was written to disk, in milliseconds
> +       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
> otherwise zero).
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para 
> role="column_definition">
> +       <structfield>wal_sync</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Total number of times WAL data was synced to disk
> +       (if <xref linkend="guc-wal-sync-method"/> is
> <literal>open_datasync</literal> or
> +       <literal>open_sync</literal>, this value is zero because WAL
> data is synced
> +       when to write it).
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para 
> role="column_definition">
> +       <structfield>wal_sync_time</structfield> <type>double 
> precision</type>
> +      </para>
> +      <para>
> +       Total amount of time that has been spent in the portion of
> +       WAL data was synced to disk, in milliseconds
> +       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
> otherwise zero.
> +       if <xref linkend="guc-wal-sync-method"/> is
> <literal>open_datasync</literal> or
> +       <literal>open_sync</literal>, this value is zero too because
> WAL data is synced
> +       when to write it).
>        </para></entry>
>       </row>
>  ```
> 
> Maybe "Total amount of time" should be used, not "Total number of 
> time."
> Other views use "amount."

Thanks.

I checked columns' descriptions of other views.
There are "Number of xxx", "Total number of xxx", "Total amount of time 
that xxx" and "Total time spent xxx".

Since the "time" is used for showing spending time, not count,
I'll change it to "Total number of WAL data written/synced to disk".
Thought?

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



RE: About to add WAL write/fsync statistics to pg_stat_wal view

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Ikeda-san,

> I checked columns' descriptions of other views.
> There are "Number of xxx", "Total number of xxx", "Total amount of time
> that xxx" and "Total time spent xxx".

Right.

> Since the "time" is used for showing spending time, not count,
> I'll change it to "Total number of WAL data written/synced to disk".
> Thought?

I misread your patch, sorry. I prefer your suggestion.
Please fix like that way with others.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-25 10:36, Masahiko Sawada wrote:
> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>> 
>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>> > <ikedamsh@oss.nttdata.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I rebased the patch to the master branch.
>> >
>> > Thank you for working on this. I've read the latest patch. Here are
>> > comments:
>> >
>> > ---
>> > +               if (track_wal_io_timing)
>> > +               {
>> > +                   INSTR_TIME_SET_CURRENT(duration);
>> > +                   INSTR_TIME_SUBTRACT(duration, start);
>> > +                   WalStats.m_wal_write_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > +               }
>> >
>> > * I think it should add the time in micro sec.
>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>> > pg_stat_wal showed the following on my environment:
>> >
>> > postgres(1:61569)=# select * from pg_stat_wal;
>> > -[ RECORD 1 ]----+-----------------------------
>> > wal_records      | 285947
>> > wal_fpi          | 53285
>> > wal_bytes        | 442008213
>> > wal_buffers_full | 0
>> > wal_write        | 25516
>> > wal_write_time   | 0
>> > wal_sync         | 25437
>> > wal_sync_time    | 14490
>> > stats_reset      | 2021-01-22 10:56:13.29464+09
>> >
>> > Since writes can complete less than a millisecond, wal_write_time
>> > didn't increase. I think sync_time could also have the same problem.
>> 
>> Thanks for your comments. I didn't notice that.
>> I changed the unit from milliseconds to microseconds.
>> 
>> > ---
>> > +   /*
>> > +    * Measure i/o timing to fsync WAL data.
>> > +    *
>> > +    * The wal receiver skip to collect it to avoid performance
>> > degradation of standy servers.
>> > +    * If sync_method doesn't have its fsync method, to skip too.
>> > +    */
>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
>> > fsyncMethodCalled())
>> > +       INSTR_TIME_SET_CURRENT(start);
>> >
>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>> > is true? I think the performance degradation is also true for backend
>> > processes. If there is another reason for that, I think it's better to
>> > mention in both the doc and comment.
>> > * How about checking track_wal_io_timing first?
>> > * s/standy/standby/
>> 
>> I fixed it.
>> As kuroda-san mentioned too, the skip is no need to be considered.
> 
> I think you also removed the code to have the wal receiver report the
> stats. So with the latest patch, the wal receiver tracks those
> statistics but doesn't report.
> And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.


>> 
>> > ---
>> > +   /* increment the i/o timing and the number of times to fsync WAL
>> > data */
>> > +   if (fsyncMethodCalled())
>> > +   {
>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>> > +       {
>> > +           INSTR_TIME_SET_CURRENT(duration);
>> > +           INSTR_TIME_SUBTRACT(duration, start);
>> > +           WalStats.m_wal_sync_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > +       }
>> > +
>> > +       WalStats.m_wal_sync++;
>> > +   }
>> >
>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>> > incrementing m_wal_sync after each sync operation?
>> 
>> I think if syncing the disk does not occur, m_wal_sync should not be
>> incremented.
>> It depends enableFsync and sync_method.
>> 
>> enableFsync is checked in each fsync method like
>> pg_fsync_no_writethrough(),
>> so if incrementing m_wal_sync after each sync operation, it should be
>> implemented
>> in each fsync method. It leads to many duplicated codes.
> 
> Right. I missed that each fsync function checks enableFsync.
> 
>> So, why don't you change the function to a flag whether to
>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
> 
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
> 
>     if (enableFsync)
>     {
>         switch (sync_method)
>         {
>             case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
>             case SYNC_METHOD_FDATASYNC:
> #endif
>                 WalStats.m_wal_sync++;
>                 if (track_wal_io_timing)
>                     INSTR_TIME_SET_CURRENT(start);
>                 break;
>             default:
>                 break;
>         }
>     }
> 
>   (do fsync and error handling here)
> 
>    /* increment the i/o timing and the number of times to fsync WAL 
> data */
>    if (track_wal_io_timing)
>    {
>        INSTR_TIME_SET_CURRENT(duration);
>        INSTR_TIME_SUBTRACT(duration, start);
>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
>    }

IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?


> I think we can change the first switch-case to an if statement.

Yes, I'll change it.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-25 11:47, japin wrote:
> On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> 
> wrote:
>> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
>> <ikedamsh@oss.nttdata.com> wrote:
>>> 
>>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>>> > <ikedamsh@oss.nttdata.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I rebased the patch to the master branch.
>>> >
>>> > Thank you for working on this. I've read the latest patch. Here are
>>> > comments:
>>> >
>>> > ---
>>> > +               if (track_wal_io_timing)
>>> > +               {
>>> > +                   INSTR_TIME_SET_CURRENT(duration);
>>> > +                   INSTR_TIME_SUBTRACT(duration, start);
>>> > +                   WalStats.m_wal_write_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +               }
>>> >
>>> > * I think it should add the time in micro sec.
>>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>>> > pg_stat_wal showed the following on my environment:
>>> >
>>> > postgres(1:61569)=# select * from pg_stat_wal;
>>> > -[ RECORD 1 ]----+-----------------------------
>>> > wal_records      | 285947
>>> > wal_fpi          | 53285
>>> > wal_bytes        | 442008213
>>> > wal_buffers_full | 0
>>> > wal_write        | 25516
>>> > wal_write_time   | 0
>>> > wal_sync         | 25437
>>> > wal_sync_time    | 14490
>>> > stats_reset      | 2021-01-22 10:56:13.29464+09
>>> >
>>> > Since writes can complete less than a millisecond, wal_write_time
>>> > didn't increase. I think sync_time could also have the same problem.
>>> 
>>> Thanks for your comments. I didn't notice that.
>>> I changed the unit from milliseconds to microseconds.
>>> 
>>> > ---
>>> > +   /*
>>> > +    * Measure i/o timing to fsync WAL data.
>>> > +    *
>>> > +    * The wal receiver skip to collect it to avoid performance
>>> > degradation of standy servers.
>>> > +    * If sync_method doesn't have its fsync method, to skip too.
>>> > +    */
>>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
>>> > fsyncMethodCalled())
>>> > +       INSTR_TIME_SET_CURRENT(start);
>>> >
>>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>>> > is true? I think the performance degradation is also true for backend
>>> > processes. If there is another reason for that, I think it's better to
>>> > mention in both the doc and comment.
>>> > * How about checking track_wal_io_timing first?
>>> > * s/standy/standby/
>>> 
>>> I fixed it.
>>> As kuroda-san mentioned too, the skip is no need to be considered.
>> 
>> I think you also removed the code to have the wal receiver report the
>> stats. So with the latest patch, the wal receiver tracks those
>> statistics but doesn't report.
>> 
>> And maybe XLogWalRcvWrite() also needs to track I/O?
>> 
>>> 
>>> > ---
>>> > +   /* increment the i/o timing and the number of times to fsync WAL
>>> > data */
>>> > +   if (fsyncMethodCalled())
>>> > +   {
>>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>>> > +       {
>>> > +           INSTR_TIME_SET_CURRENT(duration);
>>> > +           INSTR_TIME_SUBTRACT(duration, start);
>>> > +           WalStats.m_wal_sync_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +       }
>>> > +
>>> > +       WalStats.m_wal_sync++;
>>> > +   }
>>> >
>>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>>> > incrementing m_wal_sync after each sync operation?
>>> 
>>> I think if syncing the disk does not occur, m_wal_sync should not be
>>> incremented.
>>> It depends enableFsync and sync_method.
>>> 
>>> enableFsync is checked in each fsync method like
>>> pg_fsync_no_writethrough(),
>>> so if incrementing m_wal_sync after each sync operation, it should be
>>> implemented
>>> in each fsync method. It leads to many duplicated codes.
>> 
>> Right. I missed that each fsync function checks enableFsync.
>> 
>>> So, why don't you change the function to a flag whether to
>>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>> 
>> Looks better. Since we don't necessarily need to increment m_wal_sync
>> after doing fsync we can write the code without an additional variable
>> as follows:
>> 
>>     if (enableFsync)
>>     {
>>         switch (sync_method)
>>         {
>>             case SYNC_METHOD_FSYNC:
>> #ifdef HAVE_FSYNC_WRITETHROUGH
>>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> #endif
>> #ifdef HAVE_FDATASYNC
>>             case SYNC_METHOD_FDATASYNC:
>> #endif
>>                 WalStats.m_wal_sync++;
>>                 if (track_wal_io_timing)
>>                     INSTR_TIME_SET_CURRENT(start);
>>                 break;
>>             default:
>>                 break;
>>         }
>>     }
>> 
>>   (do fsync and error handling here)
>> 
>>    /* increment the i/o timing and the number of times to fsync WAL 
>> data */
>>    if (track_wal_io_timing)
>>    {
>>        INSTR_TIME_SET_CURRENT(duration);
>>        INSTR_TIME_SUBTRACT(duration, start);
>>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
>>    }
>> 
>> I think we can change the first switch-case to an if statement.
>> 
> 
> +1. We can also narrow the scope of "duration" into "if
> (track_wal_io_timing)" branch.

Thanks, I'll change it.

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-25 13:15, Masahiro Ikeda wrote:
> On 2021-01-25 10:36, Masahiko Sawada wrote:
>> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
>> <ikedamsh@oss.nttdata.com> wrote:
>>> 
>>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>>> > <ikedamsh@oss.nttdata.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I rebased the patch to the master branch.
>>> >
>>> > Thank you for working on this. I've read the latest patch. Here are
>>> > comments:
>>> >
>>> > ---
>>> > +               if (track_wal_io_timing)
>>> > +               {
>>> > +                   INSTR_TIME_SET_CURRENT(duration);
>>> > +                   INSTR_TIME_SUBTRACT(duration, start);
>>> > +                   WalStats.m_wal_write_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +               }
>>> >
>>> > * I think it should add the time in micro sec.
>>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>>> > pg_stat_wal showed the following on my environment:
>>> >
>>> > postgres(1:61569)=# select * from pg_stat_wal;
>>> > -[ RECORD 1 ]----+-----------------------------
>>> > wal_records      | 285947
>>> > wal_fpi          | 53285
>>> > wal_bytes        | 442008213
>>> > wal_buffers_full | 0
>>> > wal_write        | 25516
>>> > wal_write_time   | 0
>>> > wal_sync         | 25437
>>> > wal_sync_time    | 14490
>>> > stats_reset      | 2021-01-22 10:56:13.29464+09
>>> >
>>> > Since writes can complete less than a millisecond, wal_write_time
>>> > didn't increase. I think sync_time could also have the same problem.
>>> 
>>> Thanks for your comments. I didn't notice that.
>>> I changed the unit from milliseconds to microseconds.
>>> 
>>> > ---
>>> > +   /*
>>> > +    * Measure i/o timing to fsync WAL data.
>>> > +    *
>>> > +    * The wal receiver skip to collect it to avoid performance
>>> > degradation of standy servers.
>>> > +    * If sync_method doesn't have its fsync method, to skip too.
>>> > +    */
>>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
>>> > fsyncMethodCalled())
>>> > +       INSTR_TIME_SET_CURRENT(start);
>>> >
>>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>>> > is true? I think the performance degradation is also true for backend
>>> > processes. If there is another reason for that, I think it's better to
>>> > mention in both the doc and comment.
>>> > * How about checking track_wal_io_timing first?
>>> > * s/standy/standby/
>>> 
>>> I fixed it.
>>> As kuroda-san mentioned too, the skip is no need to be considered.
>> 
>> I think you also removed the code to have the wal receiver report the
>> stats. So with the latest patch, the wal receiver tracks those
>> statistics but doesn't report.
>> And maybe XLogWalRcvWrite() also needs to track I/O?
> 
> Thanks, I forgot to add them.
> I'll fix it.
> 
> 
>>> 
>>> > ---
>>> > +   /* increment the i/o timing and the number of times to fsync WAL
>>> > data */
>>> > +   if (fsyncMethodCalled())
>>> > +   {
>>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>>> > +       {
>>> > +           INSTR_TIME_SET_CURRENT(duration);
>>> > +           INSTR_TIME_SUBTRACT(duration, start);
>>> > +           WalStats.m_wal_sync_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +       }
>>> > +
>>> > +       WalStats.m_wal_sync++;
>>> > +   }
>>> >
>>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>>> > incrementing m_wal_sync after each sync operation?
>>> 
>>> I think if syncing the disk does not occur, m_wal_sync should not be
>>> incremented.
>>> It depends enableFsync and sync_method.
>>> 
>>> enableFsync is checked in each fsync method like
>>> pg_fsync_no_writethrough(),
>>> so if incrementing m_wal_sync after each sync operation, it should be
>>> implemented
>>> in each fsync method. It leads to many duplicated codes.
>> 
>> Right. I missed that each fsync function checks enableFsync.
>> 
>>> So, why don't you change the function to a flag whether to
>>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>> 
>> Looks better. Since we don't necessarily need to increment m_wal_sync
>> after doing fsync we can write the code without an additional variable
>> as follows:
>> 
>>     if (enableFsync)
>>     {
>>         switch (sync_method)
>>         {
>>             case SYNC_METHOD_FSYNC:
>> #ifdef HAVE_FSYNC_WRITETHROUGH
>>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> #endif
>> #ifdef HAVE_FDATASYNC
>>             case SYNC_METHOD_FDATASYNC:
>> #endif
>>                 WalStats.m_wal_sync++;
>>                 if (track_wal_io_timing)
>>                     INSTR_TIME_SET_CURRENT(start);
>>                 break;
>>             default:
>>                 break;
>>         }
>>     }
>> 
>>   (do fsync and error handling here)
>> 
>>    /* increment the i/o timing and the number of times to fsync WAL 
>> data */
>>    if (track_wal_io_timing)
>>    {
>>        INSTR_TIME_SET_CURRENT(duration);
>>        INSTR_TIME_SUBTRACT(duration, start);
>>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
>>    }
> 
> IIUC, I think we can't handle the following case.
> 
> When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
> "track_wal_io_timing" is enabled, "start" doesn't be initialized.
> 
> My understanding is something wrong, isn't it?

I thought the following is better.


```
    /* Measure i/o timing to sync WAL data.*/
    if (track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(start);

   (do fsync and error handling here)

    /* check whether to sync WAL data to the disk right now. */
    if (enableFsync)
    {
        if ((sync_method == SYNC_METHOD_FSYNC) ||
            (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
            (sync_method == SYNC_METHOD_FDATASYNC))
        {
            /* increment the i/o timing and the number of times to fsync WAL data 
*/
            if (track_wal_io_timing)
            {
                instr_time    duration;

                INSTR_TIME_SET_CURRENT(duration);
                INSTR_TIME_SUBTRACT(duration, start);
                WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
            }
            WalStats.m_wal_sync++;
        }
    }
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional 
variables.
But it's ok because "track_wal_io_timing" leads already performance 
degradation.

What do you think?


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiko Sawada
Date:
On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> On 2021-01-25 13:15, Masahiro Ikeda wrote:
> > On 2021-01-25 10:36, Masahiko Sawada wrote:
> >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> >> <ikedamsh@oss.nttdata.com> wrote:
> >>>
> >>> On 2021-01-22 14:50, Masahiko Sawada wrote:
> >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
> >>> > <ikedamsh@oss.nttdata.com> wrote:
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I rebased the patch to the master branch.
> >>> >
> >>> > Thank you for working on this. I've read the latest patch. Here are
> >>> > comments:
> >>> >
> >>> > ---
> >>> > +               if (track_wal_io_timing)
> >>> > +               {
> >>> > +                   INSTR_TIME_SET_CURRENT(duration);
> >>> > +                   INSTR_TIME_SUBTRACT(duration, start);
> >>> > +                   WalStats.m_wal_write_time +=
> >>> > INSTR_TIME_GET_MILLISEC(duration);
> >>> > +               }
> >>> >
> >>> > * I think it should add the time in micro sec.
> >>> > After running pgbench with track_wal_io_timing = on for 30 sec,
> >>> > pg_stat_wal showed the following on my environment:
> >>> >
> >>> > postgres(1:61569)=# select * from pg_stat_wal;
> >>> > -[ RECORD 1 ]----+-----------------------------
> >>> > wal_records      | 285947
> >>> > wal_fpi          | 53285
> >>> > wal_bytes        | 442008213
> >>> > wal_buffers_full | 0
> >>> > wal_write        | 25516
> >>> > wal_write_time   | 0
> >>> > wal_sync         | 25437
> >>> > wal_sync_time    | 14490
> >>> > stats_reset      | 2021-01-22 10:56:13.29464+09
> >>> >
> >>> > Since writes can complete less than a millisecond, wal_write_time
> >>> > didn't increase. I think sync_time could also have the same problem.
> >>>
> >>> Thanks for your comments. I didn't notice that.
> >>> I changed the unit from milliseconds to microseconds.
> >>>
> >>> > ---
> >>> > +   /*
> >>> > +    * Measure i/o timing to fsync WAL data.
> >>> > +    *
> >>> > +    * The wal receiver skip to collect it to avoid performance
> >>> > degradation of standy servers.
> >>> > +    * If sync_method doesn't have its fsync method, to skip too.
> >>> > +    */
> >>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
> >>> > fsyncMethodCalled())
> >>> > +       INSTR_TIME_SET_CURRENT(start);
> >>> >
> >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
> >>> > is true? I think the performance degradation is also true for backend
> >>> > processes. If there is another reason for that, I think it's better to
> >>> > mention in both the doc and comment.
> >>> > * How about checking track_wal_io_timing first?
> >>> > * s/standy/standby/
> >>>
> >>> I fixed it.
> >>> As kuroda-san mentioned too, the skip is no need to be considered.
> >>
> >> I think you also removed the code to have the wal receiver report the
> >> stats. So with the latest patch, the wal receiver tracks those
> >> statistics but doesn't report.
> >> And maybe XLogWalRcvWrite() also needs to track I/O?
> >
> > Thanks, I forgot to add them.
> > I'll fix it.
> >
> >
> >>>
> >>> > ---
> >>> > +   /* increment the i/o timing and the number of times to fsync WAL
> >>> > data */
> >>> > +   if (fsyncMethodCalled())
> >>> > +   {
> >>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
> >>> > +       {
> >>> > +           INSTR_TIME_SET_CURRENT(duration);
> >>> > +           INSTR_TIME_SUBTRACT(duration, start);
> >>> > +           WalStats.m_wal_sync_time +=
> >>> > INSTR_TIME_GET_MILLISEC(duration);
> >>> > +       }
> >>> > +
> >>> > +       WalStats.m_wal_sync++;
> >>> > +   }
> >>> >
> >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
> >>> > incrementing m_wal_sync after each sync operation?
> >>>
> >>> I think if syncing the disk does not occur, m_wal_sync should not be
> >>> incremented.
> >>> It depends enableFsync and sync_method.
> >>>
> >>> enableFsync is checked in each fsync method like
> >>> pg_fsync_no_writethrough(),
> >>> so if incrementing m_wal_sync after each sync operation, it should be
> >>> implemented
> >>> in each fsync method. It leads to many duplicated codes.
> >>
> >> Right. I missed that each fsync function checks enableFsync.
> >>
> >>> So, why don't you change the function to a flag whether to
> >>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
> >>
> >> Looks better. Since we don't necessarily need to increment m_wal_sync
> >> after doing fsync we can write the code without an additional variable
> >> as follows:
> >>
> >>     if (enableFsync)
> >>     {
> >>         switch (sync_method)
> >>         {
> >>             case SYNC_METHOD_FSYNC:
> >> #ifdef HAVE_FSYNC_WRITETHROUGH
> >>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
> >> #endif
> >> #ifdef HAVE_FDATASYNC
> >>             case SYNC_METHOD_FDATASYNC:
> >> #endif
> >>                 WalStats.m_wal_sync++;
> >>                 if (track_wal_io_timing)
> >>                     INSTR_TIME_SET_CURRENT(start);
> >>                 break;
> >>             default:
> >>                 break;
> >>         }
> >>     }
> >>
> >>   (do fsync and error handling here)
> >>
> >>    /* increment the i/o timing and the number of times to fsync WAL
> >> data */
> >>    if (track_wal_io_timing)
> >>    {
> >>        INSTR_TIME_SET_CURRENT(duration);
> >>        INSTR_TIME_SUBTRACT(duration, start);
> >>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
> >>    }
> >
> > IIUC, I think we can't handle the following case.
> >
> > When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
> > "track_wal_io_timing" is enabled, "start" doesn't be initialized.
> >
> > My understanding is something wrong, isn't it?

You're right. We might want to initialize 'start' with 0 in those two
cases and check if INSTR_TIME_IS_ZERO() later when accumulating the
I/O time.

>
> I thought the following is better.
>
>
> ```
>         /* Measure i/o timing to sync WAL data.*/
>         if (track_wal_io_timing)
>                 INSTR_TIME_SET_CURRENT(start);
>
>    (do fsync and error handling here)
>
>         /* check whether to sync WAL data to the disk right now. */
>         if (enableFsync)
>         {
>                 if ((sync_method == SYNC_METHOD_FSYNC) ||
>                         (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
>                         (sync_method == SYNC_METHOD_FDATASYNC))
>                 {
>                         /* increment the i/o timing and the number of times to fsync WAL data
> */
>                         if (track_wal_io_timing)
>                         {
>                                 instr_time      duration;
>
>                                 INSTR_TIME_SET_CURRENT(duration);
>                                 INSTR_TIME_SUBTRACT(duration, start);
>                                 WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>                         }
>                         WalStats.m_wal_sync++;
>                 }
>         }
> ```
>
> Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
> of the "sync_method" and "enableFsync", we don't make additional
> variables.
> But it's ok because "track_wal_io_timing" leads already performance
> degradation.
>
> What do you think?

That also fine with me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
Hi, thanks for the reviews.

I updated the attached patch.
The summary of the changes is following.

1. fix document

I followed another view's comments.


2. refactor issue_xlog_fsync()

I removed "sync_called" variables, narrowed the "duration" scope and
change the switch statement to if statement.


3. make wal-receiver report WAL statistics

I add the code to collect the statistics for a written operation
in XLogWalRcvWrite() and to report stats in WalReceiverMain().

Since WalReceiverMain() can loop fast, to avoid loading stats collector,
I add "force" argument to the pgstat_send_wal function. If "force" is
false, it can skip reporting until at least 500 msec since it last 
reported. WalReceiverMain() almost calls pgstat_send_wal() with "force" 
as false.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiko Sawada
Date:
On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> Hi, thanks for the reviews.
>
> I updated the attached patch.

Thank you for updating the patch!

> The summary of the changes is following.
>
> 1. fix document
>
> I followed another view's comments.
>
>
> 2. refactor issue_xlog_fsync()
>
> I removed "sync_called" variables, narrowed the "duration" scope and
> change the switch statement to if statement.

Looking at the code again, I think if we check if an fsync was really
called when calculating the I/O time, it's better to check that before
starting the measurement.

    bool issue_fsync = false;

    if (enableFsync &&
        (sync_method == SYNC_METHOD_FSYNC ||
         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
         sync_method == SYNC_METHOD_FDATASYNC))
    {
        if (track_wal_io_timing)
            INSTR_TIME_SET_CURRENT(start);
        issue_fsync = true;
    }
    (snip)
    if (issue_fsync)
    {
        if (track_wal_io_timing)
        {
            instr_time  duration;

            INSTR_TIME_SET_CURRENT(duration);
            INSTR_TIME_SUBTRACT(duration, start);
            WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
        }
        WalStats.m_wal_sync++;
    }

So I prefer either the above which is a modified version of the
original approach or my idea that doesn’t introduce a new local
variable I proposed before. But I'm not going to insist on that.

>
>
> 3. make wal-receiver report WAL statistics
>
> I add the code to collect the statistics for a written operation
> in XLogWalRcvWrite() and to report stats in WalReceiverMain().
>
> Since WalReceiverMain() can loop fast, to avoid loading stats collector,
> I add "force" argument to the pgstat_send_wal function. If "force" is
> false, it can skip reporting until at least 500 msec since it last
> reported. WalReceiverMain() almost calls pgstat_send_wal() with "force"
> as false.

 void
-pgstat_send_wal(void)
+pgstat_send_wal(bool force)
 {
    /* We assume this initializes to zeroes */
    static const PgStat_MsgWal all_zeroes;
+   static TimestampTz last_report = 0;

+   TimestampTz now;
    WalUsage    walusage;

+   /*
+    * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+    * msec since we last sent one or specified "force".
+    */
+   now = GetCurrentTimestamp();
+   if (!force &&
+       !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+       return;
+
+   last_report = now;

Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
purpose since it is used as a minimum time for stats file updates. If
we want an interval, I think we should define another one Also, with
the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
even if track_wal_io_timing is off, which is not good. On the other
hand, I agree that your concern that the wal receiver should not send
the stats for whenever receiving wal records. So an idea could be to
send the wal stats when finishing the current WAL segment file and
when timeout in the main loop. That way we can guarantee that the wal
stats on a replica is updated at least every time finishing a WAL
segment file when actively receiving WAL records and every
NAPTIME_PER_CYCLE in other cases.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-26 00:03, Masahiko Sawada wrote:
> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda 
> <ikedamsh@oss.nttdata.com> wrote:
>> 
>> Hi, thanks for the reviews.
>> 
>> I updated the attached patch.
> 
> Thank you for updating the patch!
> 
>> The summary of the changes is following.
>> 
>> 1. fix document
>> 
>> I followed another view's comments.
>> 
>> 
>> 2. refactor issue_xlog_fsync()
>> 
>> I removed "sync_called" variables, narrowed the "duration" scope and
>> change the switch statement to if statement.
> 
> Looking at the code again, I think if we check if an fsync was really
> called when calculating the I/O time, it's better to check that before
> starting the measurement.
> 
>     bool issue_fsync = false;
> 
>     if (enableFsync &&
>         (sync_method == SYNC_METHOD_FSYNC ||
>          sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>          sync_method == SYNC_METHOD_FDATASYNC))
>     {
>         if (track_wal_io_timing)
>             INSTR_TIME_SET_CURRENT(start);
>         issue_fsync = true;
>     }
>     (snip)
>     if (issue_fsync)
>     {
>         if (track_wal_io_timing)
>         {
>             instr_time  duration;
> 
>             INSTR_TIME_SET_CURRENT(duration);
>             INSTR_TIME_SUBTRACT(duration, start);
>             WalStats.m_wal_sync_time = 
> INSTR_TIME_GET_MICROSEC(duration);
>         }
>         WalStats.m_wal_sync++;
>     }
> 
> So I prefer either the above which is a modified version of the
> original approach or my idea that doesn’t introduce a new local
> variable I proposed before. But I'm not going to insist on that.

Thanks for the comments.
I change the code to the above.

>> 
>> 
>> 3. make wal-receiver report WAL statistics
>> 
>> I add the code to collect the statistics for a written operation
>> in XLogWalRcvWrite() and to report stats in WalReceiverMain().
>> 
>> Since WalReceiverMain() can loop fast, to avoid loading stats 
>> collector,
>> I add "force" argument to the pgstat_send_wal function. If "force" is
>> false, it can skip reporting until at least 500 msec since it last
>> reported. WalReceiverMain() almost calls pgstat_send_wal() with 
>> "force"
>> as false.
> 
>  void
> -pgstat_send_wal(void)
> +pgstat_send_wal(bool force)
>  {
>     /* We assume this initializes to zeroes */
>     static const PgStat_MsgWal all_zeroes;
> +   static TimestampTz last_report = 0;
> 
> +   TimestampTz now;
>     WalUsage    walusage;
> 
> +   /*
> +    * Don't send a message unless it's been at least 
> PGSTAT_STAT_INTERVAL
> +    * msec since we last sent one or specified "force".
> +    */
> +   now = GetCurrentTimestamp();
> +   if (!force &&
> +       !TimestampDifferenceExceeds(last_report, now, 
> PGSTAT_STAT_INTERVAL))
> +       return;
> +
> +   last_report = now;
> 
> Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
> purpose since it is used as a minimum time for stats file updates. If
> we want an interval, I think we should define another one Also, with
> the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
> even if track_wal_io_timing is off, which is not good. On the other
> hand, I agree that your concern that the wal receiver should not send
> the stats for whenever receiving wal records. So an idea could be to
> send the wal stats when finishing the current WAL segment file and
> when timeout in the main loop. That way we can guarantee that the wal
> stats on a replica is updated at least every time finishing a WAL
> segment file when actively receiving WAL records and every
> NAPTIME_PER_CYCLE in other cases.

I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
"David G. Johnston"
Date:
On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> Hi, thanks for the reviews.
>
> I updated the attached patch.

Thank you for updating the patch!
 
Your original email with "total number of times" is more correct, removing the "of times" and just writing "total number of WAL" is not good wording.

Specifically, this change is strictly worse than the original.

-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of WAL data written to disk because WAL buffers became full

Both have the flaw that they leave implied exactly what it means to "write WAL to disk".  It is also unclear whether a counter, bytes, or both, would be more useful here. I've incorporated this into my documentation suggestions below:

(wal_buffers_full)
-- Revert - the original was better, though maybe add more detail similar to the below.  I didn't research exactly how this works.

(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite

-- Seems like this should have a bytes version too

(wal_write_time)
The amount of time spent writing WAL buffers to disk, excluding sync time unless the wal_sync_method is either open_datasync or open_sync.
Units are in milliseconds with microsecond resolution.  This is zero when track_wal_io_timing is disabled.

(wal_sync)
The number of times WAL files were synced to disk while wal_sync_method was set to one of the "sync at commit" options (i.e., fdatasync, fsync, or fsync_writethrough).

-- it is not going to be zero just because those settings are presently disabled as they could have been enabled at some point since the last time these statistics were reset.

(wal_sync_time)
The amount of time spent syncing WAL files to disk, in milliseconds with microsecond resolution.  This requires setting wal_sync_method to one of the "sync at commit" options (i.e., fdatasync, fsync, or fsync_writethrough).


Also,

I would suggest extracting the changes to postmaster/pgstat.c and replication/walreceiver.c to a separate patch as you've fundamentally changed how it behaves with regards to that function and how it interacts with the WAL receiver.  That seems an entirely separate topic warranting its own patch and discussion.

David J.

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
"David G. Johnston"
Date:
On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

I agree with your comments. I think it should report when
reaching the end of WAL too. I add the code to report the stats
when finishing the current WAL segment file when timeout in the
main loop and when reaching the end of WAL.


The following is not an improvement:

- /* Send WAL statistics to the stats collector. */
+ /* Send WAL statistics to stats collector */

The word "the" there makes it proper English.  Your copy-pasting should have kept the existing good wording in the other locations rather than replace the existing location with the newly added incorrect wording.

This doesn't make sense:

* current WAL segment file to avoid loading stats collector.

Maybe "overloading" or "overwhelming"?

I see you removed the pgstat_send_wal(force) change.  The rest of my comments on the v6 patch still stand I believe.

David J.


Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
Hi, David.

Thanks for your comments.

On 2021-01-26 08:48, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> 
>> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda
>> <ikedamsh@oss.nttdata.com> wrote:
>>> 
>>> Hi, thanks for the reviews.
>>> 
>>> I updated the attached patch.
>> 
>> Thank you for updating the patch!
> 
> Your original email with "total number of times" is more correct,
> removing the "of times" and just writing "total number of WAL" is not
> good wording.
> 
> Specifically, this change is strictly worse than the original.
> 
> -       Number of times WAL data was written to disk because WAL
> buffers became full
> +       Total number of WAL data written to disk because WAL buffers
> became full
> 
> Both have the flaw that they leave implied exactly what it means to
> "write WAL to disk".  It is also unclear whether a counter, bytes, or
> both, would be more useful here. I've incorporated this into my
> documentation suggestions below:
> (wal_buffers_full)
> 
> -- Revert - the original was better, though maybe add more detail
> similar to the below.  I didn't research exactly how this works.

OK, I understood.
I reverted since this is a counter statistics.


> (wal_write)
> The number of times WAL buffers were written out to disk via XLogWrite
> 

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other views' 
description
omits "The" and there is no description of "XlogWrite" in the documents. 
What do you think?

> -- Seems like this should have a bytes version too

Do you mean that we need to separate statistics for wal write?


> (wal_write_time)
> The amount of time spent writing WAL buffers to disk, excluding sync
> time unless the wal_sync_method is either open_datasync or open_sync.
> Units are in milliseconds with microsecond resolution.  This is zero
> when track_wal_io_timing is disabled.

Thanks, I'll fix it.


> (wal_sync)
> The number of times WAL files were synced to disk while
> wal_sync_method was set to one of the "sync at commit" options (i.e.,
> fdatasync, fsync, or fsync_writethrough).

Thanks, I'll fix it.


> -- it is not going to be zero just because those settings are
> presently disabled as they could have been enabled at some point since
> the last time these statistics were reset.

Right, your description is correct.
The "track_wal_io_timing" has the same limitation, doesn't it?


> (wal_sync_time)
> The amount of time spent syncing WAL files to disk, in milliseconds
> with microsecond resolution.  This requires setting wal_sync_method to
> one of the "sync at commit" options (i.e., fdatasync, fsync, or
> fsync_writethrough).

Thanks, I'll fix it.
I will add the comments related to "track_wal_io_timing".


> Also,
> 
> I would suggest extracting the changes to postmaster/pgstat.c and
> replication/walreceiver.c to a separate patch as you've fundamentally
> changed how it behaves with regards to that function and how it
> interacts with the WAL receiver.  That seems an entirely separate
> topic warranting its own patch and discussion.

OK, I will separate two patches.


On 2021-01-26 08:52, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
> 
>> I agree with your comments. I think it should report when
>> reaching the end of WAL too. I add the code to report the stats
>> when finishing the current WAL segment file when timeout in the
>> main loop and when reaching the end of WAL.
> 
> The following is not an improvement:
> 
> - /* Send WAL statistics to the stats collector. */+ /* Send WAL
> statistics to stats collector */
> 
> The word "the" there makes it proper English.  Your copy-pasting
> should have kept the existing good wording in the other locations
> rather than replace the existing location with the newly added
> incorrect wording.

Thanks, I'll fix it.


> This doesn't make sense:
> 
> * current WAL segment file to avoid loading stats collector.
> 
> Maybe "overloading" or "overwhelming"?
> 
> I see you removed the pgstat_send_wal(force) change.  The rest of my
> comments on the v6 patch still stand I believe.

Yes, "overloading" is right. Thanks.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
"David G. Johnston"
Date:
On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

> (wal_write)
> The number of times WAL buffers were written out to disk via XLogWrite
>

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other views'
description
omits "The" and there is no description of "XlogWrite" in the documents.
What do you think?


The documentation for WAL does get into the public API level of detail and doing so here makes what this measures crystal clear.  The potential absence of sufficient detail elsewhere should be corrected instead of making this description more vague.  Specifically, probably XLogWrite should be added to the WAL overview as part of this update and probably even have the descriptive section of the documentation note that the number of times that said function is executed is exposed as a counter in the wal statistics table - thus closing the loop.

David J.

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-01-27 00:14, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
> 
>>> (wal_write)
>>> The number of times WAL buffers were written out to disk via
>> XLogWrite
>>> 
>> 
>> Thanks.
>> 
>> I thought it's better to omit "The" and "XLogWrite" because other
>> views'
>> description
>> omits "The" and there is no description of "XlogWrite" in the
>> documents.
>> What do you think?
> 
> The documentation for WAL does get into the public API level of detail
> and doing so here makes what this measures crystal clear.  The
> potential absence of sufficient detail elsewhere should be corrected
> instead of making this description more vague.  Specifically, probably
> XLogWrite should be added to the WAL overview as part of this update
> and probably even have the descriptive section of the documentation
> note that the number of times that said function is executed is
> exposed as a counter in the wal statistics table - thus closing the
> loop.

Thanks for your comments.

I added the descriptions in documents and separated the patch
into attached two patches. First is to add wal i/o activity statistics.
Second is to make the wal receiver report the wal statistics.

Please let me know if you have any comments.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
I pgindented the patches.

Regards
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/02/05 8:45, Masahiro Ikeda wrote:
> I pgindented the patches.

Thanks for updating the patches!

+       <function>XLogWrite</function>, which nomally called by an
+       <function>issue_xlog_fsync</function>, which nomally called by an

Typo: "nomally" should be "normally"?

+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),

Isn't it better to add a space character just after "request"?

+                    INSTR_TIME_SET_CURRENT(duration);
+                    INSTR_TIME_SUBTRACT(duration, start);
+                    WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);

If several cycles happen in the do-while loop, m_wal_write_time should be
updated with the sum of "duration" in those cycles instead of "duration"
in the last cycle? If yes, "+=" should be used instead of "=" when updating
m_wal_write_time?

+            INSTR_TIME_SET_CURRENT(duration);
+            INSTR_TIME_SUBTRACT(duration, start);
+            WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);

Also "=" should be "+=" in the above?

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/02/08 13:01, Fujii Masao wrote:
> 
> 
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
> 
> Thanks for updating the patches!
> 
> +       <function>XLogWrite</function>, which nomally called by an
> +       <function>issue_xlog_fsync</function>, which nomally called by an
> 
> Typo: "nomally" should be "normally"?
> 
> +       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
> +       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
> 
> Isn't it better to add a space character just after "request"?
> 
> +                    INSTR_TIME_SET_CURRENT(duration);
> +                    INSTR_TIME_SUBTRACT(duration, start);
> +                    WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
> 
> If several cycles happen in the do-while loop, m_wal_write_time should be
> updated with the sum of "duration" in those cycles instead of "duration"
> in the last cycle? If yes, "+=" should be used instead of "=" when updating
> m_wal_write_time?
> 
> +            INSTR_TIME_SET_CURRENT(duration);
> +            INSTR_TIME_SUBTRACT(duration, start);
> +            WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> 
> Also "=" should be "+=" in the above?

+        /* Send WAL statistics */
+        pgstat_send_wal();

This may cause overhead in WAL-writing by walwriter because it's called
every cycles even when walwriter needs to write more WAL next cycle
(don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
should be called only when WaitLatch() returns with WL_TIMEOUT?

-       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+       or WAL data written out to disk by WAL receiver.

So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
wal_sync_time are updated even while the other values are not. Isn't this
confusing to users? If so, what about reporting those walreceiver stats in
pg_stat_wal_receiver?

                  if (endofwal)
+                {
+                    /* Send WAL statistics to the stats collector */
+                    pgstat_send_wal();
                      break;

You added pgstat_send_wal() so that it's called in some cases where
walreceiver exits. But ISTM that there are other walreceiver-exit cases.
For example, in the case where SIGTERM is received. Instead,
pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
"David G. Johnston"
Date:
On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
I pgindented the patches.


... <function>XLogWrite</function>, which is invoked during an <function>XLogFlush</function> request (see ...).  This is also incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original)

You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste)

is ether -> is either

"This parameter is off by default as it will repeatedly query the operating system..."
", because" -> "as"

wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication."

"The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in....) This is undesirable because ..."

I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics.  This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not.  It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL).  Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead.  In either case, this half-and-half implementation seems undesirable.

David J.

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-02-08 13:01, Fujii Masao wrote:
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
> 
> Thanks for updating the patches!

Thanks for checking the patches.

> +       <function>XLogWrite</function>, which nomally called by an
> +       <function>issue_xlog_fsync</function>, which nomally called by 
> an
> 
> Typo: "nomally" should be "normally"?

Yes, I'll fix it.

> +       <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>)
> +       <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>),
> 
> Isn't it better to add a space character just after "request"?

Thanks, I'll fix it.

> +                    INSTR_TIME_SET_CURRENT(duration);
> +                    INSTR_TIME_SUBTRACT(duration, start);
> +                    WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
> 
> If several cycles happen in the do-while loop, m_wal_write_time should 
> be
> updated with the sum of "duration" in those cycles instead of 
> "duration"
> in the last cycle? If yes, "+=" should be used instead of "=" when 
> updating
> m_wal_write_time?
> +            INSTR_TIME_SET_CURRENT(duration);
> +            INSTR_TIME_SUBTRACT(duration, start);
> +            WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> 
> Also "=" should be "+=" in the above?

Yes, they are my mistake when changing the unit from milliseconds to 
microseconds.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-02-08 14:26, Fujii Masao wrote:
> On 2021/02/08 13:01, Fujii Masao wrote:
>> 
>> 
>> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>>> I pgindented the patches.
>> 
>> Thanks for updating the patches!
>> 
>> +       <function>XLogWrite</function>, which nomally called by an
>> +       <function>issue_xlog_fsync</function>, which nomally called by 
>> an
>> 
>> Typo: "nomally" should be "normally"?
>> 
>> +       <function>XLogFlush</function> request(see <xref 
>> linkend="wal-configuration"/>)
>> +       <function>XLogFlush</function> request(see <xref 
>> linkend="wal-configuration"/>),
>> 
>> Isn't it better to add a space character just after "request"?
>> 
>> +                    INSTR_TIME_SET_CURRENT(duration);
>> +                    INSTR_TIME_SUBTRACT(duration, start);
>> +                    WalStats.m_wal_write_time = 
>> INSTR_TIME_GET_MICROSEC(duration);
>> 
>> If several cycles happen in the do-while loop, m_wal_write_time should 
>> be
>> updated with the sum of "duration" in those cycles instead of 
>> "duration"
>> in the last cycle? If yes, "+=" should be used instead of "=" when 
>> updating
>> m_wal_write_time?
>> 
>> +            INSTR_TIME_SET_CURRENT(duration);
>> +            INSTR_TIME_SUBTRACT(duration, start);
>> +            WalStats.m_wal_sync_time = 
>> INSTR_TIME_GET_MICROSEC(duration);
>> 
>> Also "=" should be "+=" in the above?
> 
> +        /* Send WAL statistics */
> +        pgstat_send_wal();
> 
> This may cause overhead in WAL-writing by walwriter because it's called
> every cycles even when walwriter needs to write more WAL next cycle
> (don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
> should be called only when WaitLatch() returns with WL_TIMEOUT?

Thanks, I didn't notice that.
I'll fix it.

> -       <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>)
> +       <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>),
> +       or WAL data written out to disk by WAL receiver.
> 
> So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
> wal_sync_time are updated even while the other values are not. Isn't 
> this
> confusing to users? If so, what about reporting those walreceiver stats 
> in
> pg_stat_wal_receiver?

OK, I'll add new infrastructure code to interect with wal receiver
and stats collector and show the stats in pg_stat_wal_receiver.

>                  if (endofwal)
> +                {
> +                    /* Send WAL statistics to the stats collector */
> +                    pgstat_send_wal();
>                      break;
> 
> You added pgstat_send_wal() so that it's called in some cases where
> walreceiver exits. But ISTM that there are other walreceiver-exit 
> cases.
> For example, in the case where SIGTERM is received. Instead,
> pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Thanks, I forgot the case.
I'll fix it.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:
On 2021-02-10 00:51, David G. Johnston wrote:
> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
> 
>> I pgindented the patches.
> 
> ... <function>XLogWrite</function>, which is invoked during an
> <function>XLogFlush</function> request (see ...).  This is also
> incremented by the WAL receiver during replication.
> 
> ("which normally called" should be "which is normally called" or
> "which normally is called" if you want to keep true to the original)
> You missed the adding the space before an opening parenthesis here and
> elsewhere (probably copy-paste)
> 
> is ether -> is either
> "This parameter is off by default as it will repeatedly query the
> operating system..."
> ", because" -> "as"

Thanks, I fixed them.

> wal_write_time and the sync items also need the note: "This is also
> incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

> "The number of times it happened..." -> " (the tally of this event is
> reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

> I notice that the patch for WAL receiver doesn't require explicitly
> computing the sync statistics but does require computing the write
> statistics.  This is because of the presence of issue_xlog_fsync but
> absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
> receiver path does not.  It seems technically straight-forward to
> refactor here to avoid the almost-duplicated logic in the two places,
> though I suspect there may be a trade-off for not adding another
> function call to the stack given the importance of WAL processing
> (though that seems marginalized compared to the cost of actually
> writing the WAL).  Or, as Fujii noted, go the other way and don't have
> any shared code between the two but instead implement the WAL receiver
> one to use pg_stat_wal_receiver instead.  In either case, this
> half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

I added the infrastructure code to communicate the WAL receiver stats 
messages between the WAL receiver and the stats collector, and
the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/03/15 10:39, Masahiro Ikeda wrote:
> Thanks, I understood get_sync_bit() checks the sync flags and
> the write unit of generated wal data and replicated wal data is different.
> (It's interesting optimization whether to use kernel cache or not.)
> 
> OK. Although I agree to separate the stats for the walrecever,
> I want to hear opinions from other people too. I didn't change the patch.
> 
> Please feel to your comments.

What about applying the patch for common WAL write function like
XLogWriteFile(), separately from the patch for walreceiver's stats?
Seems the former reaches the consensus, so we can commit it firstly.
Also even only the former change is useful because which allows
walreceiver to report WALWrite wait event.

> OK. I agree.
> 
> I wonder to change the error check ways depending on who calls this function?
> Now, only the walreceiver checks (1)errno==0 and doesn't check (2)errno==ENITR.
> Other processes are the opposite.
> 
> IIUC, it's appropriate that every process checks (1)(2).
> Please let me know my understanding is wrong.

I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
that code. According to the following commit log, it seems harmless
to retry on EINTR even walreceiver.

     Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
     nowadays, so it shouldn't happen, but better to be defensive.

>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
>> reports an error. But this is useless. Probably IO timing should be
>> incremented after the return code of pg_pwrite() is checked, instead?
> 
> Yes, I agree. I fixed it.
> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for the patch!

              nleft = nbytes;
              do
              {
-                errno = 0;
+                written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+                                        ThisTimeLineID, openLogSegNo, wal_segment_size);

Can we merge this do-while loop in XLogWrite() into the loop
in XLogWriteFile()?

If we do that, ISTM that the following codes are not necessary in XLogWrite().

                  nleft -= written;
                  from += written;

+ * 'segsize' is a segment size of WAL segment file.

Since segsize is always wal_segment_size, segsize argument seems
not necessary in XLogWriteFile().

+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
+              TimeLineID timelineid, XLogSegNo segno, int segsize)

Why did you use "const void *" instead of "char *" for *buf?

Regarding 0005 patch, I will review it later.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
ikedamsh
Date:
On 2021-03-19 16:30, Fujii Masao wrote:
> On 2021/03/15 10:39, Masahiro Ikeda wrote:
>> Thanks, I understood get_sync_bit() checks the sync flags and
>> the write unit of generated wal data and replicated wal data is 
>> different.
>> (It's interesting optimization whether to use kernel cache or not.)
>> 
>> OK. Although I agree to separate the stats for the walrecever,
>> I want to hear opinions from other people too. I didn't change the 
>> patch.
>> 
>> Please feel to your comments.
> 
> What about applying the patch for common WAL write function like
> XLogWriteFile(), separately from the patch for walreceiver's stats?
> Seems the former reaches the consensus, so we can commit it firstly.
> Also even only the former change is useful because which allows
> walreceiver to report WALWrite wait event.

Agreed. I separated the patches.

If only the former is committed, my trivial concern is that there may be
a disadvantage, but no advantage for the standby server. It may lead to
performance degradation to the wal receiver by calling
INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
latter patch is committed.

I think it's ok because this not happening in the case to disable the
"track_wal_io_timing" in the standby server. Although some users may start the
standby server using the backup which "track_wal_io_timing" is enabled in the
primary server, they will say it's ok since the users already accept the
performance degradation in the primary server.

>> OK. I agree.
>> 
>> I wonder to change the error check ways depending on who calls this 
>> function?
>> Now, only the walreceiver checks (1)errno==0 and doesn't check 
>> (2)errno==ENITR.
>> Other processes are the opposite.
>> 
>> IIUC, it's appropriate that every process checks (1)(2).
>> Please let me know my understanding is wrong.
> 
> I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
> that code. According to the following commit log, it seems harmless
> to retry on EINTR even walreceiver.
> 
>     Also retry on EINTR. All signals used in the backend are flagged 
> SA_RESTART
>     nowadays, so it shouldn't happen, but better to be defensive.

Thanks, I understood.


>>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
>>> reports an error. But this is useless. Probably IO timing should be
>>> incremented after the return code of pg_pwrite() is checked, instead?
>> 
>> Yes, I agree. I fixed it.
>> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
> 
> Thanks for the patch!
> 
>              nleft = nbytes;
>              do
>              {
> -                errno = 0;
> +                written = XLogWriteFile(openLogFile, from, nleft, (off_t) 
> startoffset,
> +                                        ThisTimeLineID, openLogSegNo, wal_segment_size);
> 
> Can we merge this do-while loop in XLogWrite() into the loop
> in XLogWriteFile()?
> If we do that, ISTM that the following codes are not necessary in 
> XLogWrite().
> 
>                  nleft -= written;
>                  from += written;

OK, I fixed it.


> + * 'segsize' is a segment size of WAL segment file.
> 
> Since segsize is always wal_segment_size, segsize argument seems
> not necessary in XLogWriteFile().

Right. I fixed it.


> +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
> +              TimeLineID timelineid, XLogSegNo segno, int segsize)
> 
> Why did you use "const void *" instead of "char *" for *buf?

I followed the argument of pg_pwrite().
But, I think "char *" is better, so fixed it.


> Regarding 0005 patch, I will review it later.

Thanks.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/03/22 9:50, ikedamsh wrote:
> Agreed. I separated the patches.
> 
> If only the former is committed, my trivial concern is that there may be
> a disadvantage, but no advantage for the standby server. It may lead to
> performance degradation to the wal receiver by calling
> INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
> latter patch is committed.

Your concern is valid, so let's polish and commit also the 0003 patch to v14.
I'm still thinking that it's better to separate wal_xxx columns into
walreceiver's and the others. But if we count even walreceiver activity on
the existing columns, regarding 0003 patch, we need to update the document?
For example, "Number of times WAL buffers were written out to disk via
XLogWrite request." should be "Number of times WAL buffers were written
out to disk via XLogWrite request and by WAL receiver process."? Maybe
we need to append some descriptions about this into "WAL configuration"
section?


> I followed the argument of pg_pwrite().
> But, I think "char *" is better, so fixed it.

Thanks for updating the patch!

+extern int    XLogWriteFile(int fd, char *buf,
+                          size_t nbyte, off_t offset,
+                          TimeLineID timelineid, XLogSegNo segno,
+                          bool write_all);

write_all seems not to be necessary. You added this flag for walreceiver,
I guess. But even without the argument, walreceiver seems to work expectedly.
So, what about the attached patch? I applied some cosmetic changes to the patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
ikedamsh
Date:

On 2021/03/22 16:50, Fujii Masao wrote:
> 
> 
> On 2021/03/22 9:50, ikedamsh wrote:
>> Agreed. I separated the patches.
>>
>> If only the former is committed, my trivial concern is that there may be
>> a disadvantage, but no advantage for the standby server. It may lead to
>> performance degradation to the wal receiver by calling
>> INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the
>> latter patch is committed.
> 
> Your concern is valid, so let's polish and commit also the 0003 patch to v14.
> I'm still thinking that it's better to separate wal_xxx columns into
> walreceiver's and the others. But if we count even walreceiver activity on
> the existing columns, regarding 0003 patch, we need to update the document?
> For example, "Number of times WAL buffers were written out to disk via
> XLogWrite request." should be "Number of times WAL buffers were written
> out to disk via XLogWrite request and by WAL receiver process."? Maybe
> we need to append some descriptions about this into "WAL configuration"
> section?

Agreed. Users can know whether the stats is for walreceiver or not. The
pg_stat_wal view in standby server shows for the walreceiver, and in primary
server it shows for the others. So, I updated the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

>> I followed the argument of pg_pwrite().
>> But, I think "char *" is better, so fixed it.
> 
> Thanks for updating the patch!
> 
> +extern int    XLogWriteFile(int fd, char *buf,
> +                          size_t nbyte, off_t offset,
> +                          TimeLineID timelineid, XLogSegNo segno,
> +                          bool write_all);
> 
> write_all seems not to be necessary. You added this flag for walreceiver,
> I guess. But even without the argument, walreceiver seems to work expectedly.
> So, what about the attached patch? I applied some cosmetic changes to the patch.

Thanks a lot. Yes, "write_all" is unnecessary.
Your patch is looks good to me.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION

Attachment

Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/03/22 20:25, ikedamsh wrote:
> Agreed. Users can know whether the stats is for walreceiver or not. The
> pg_stat_wal view in standby server shows for the walreceiver, and in primary
> server it shows for the others. So, I updated the document.
> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the docs!

There was the discussion about when the stats collector is invoked, at [1].
Currently during archive recovery or standby, the stats collector is
invoked when the startup process reaches the consistent state, sends
PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
read-only connections. But walreceiver can be invoked at earlier stage.
This can cause walreceiver to generate and send the statistics about WAL
writing even though the stats collector has not been running yet. This might
be problematic? If so, maybe we need to ensure that the stats collector is
invoked before walreceiver?

During recovery, the stats collector is not invoked if hot standby mode is
disabled. But walreceiver can be running in this case. So probably we should
change walreceiver so that it's invoked even when hot standby is disabled?
Otherwise we cannnot collect the statistics about WAL writing by walreceiver
in that case.

[1]
https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Masahiro Ikeda
Date:

On 2021/03/23 16:10, Fujii Masao wrote:
> 
> 
> On 2021/03/22 20:25, ikedamsh wrote:
>> Agreed. Users can know whether the stats is for walreceiver or not. The
>> pg_stat_wal view in standby server shows for the walreceiver, and in primary
>> server it shows for the others. So, I updated the document.
>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
> 
> Thanks for updating the docs!
> 
> There was the discussion about when the stats collector is invoked, at [1].
> Currently during archive recovery or standby, the stats collector is
> invoked when the startup process reaches the consistent state, sends
> PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
> read-only connections. But walreceiver can be invoked at earlier stage.
> This can cause walreceiver to generate and send the statistics about WAL
> writing even though the stats collector has not been running yet. This might
> be problematic? If so, maybe we need to ensure that the stats collector is
> invoked before walreceiver?
> 
> During recovery, the stats collector is not invoked if hot standby mode is
> disabled. But walreceiver can be running in this case. So probably we should
> change walreceiver so that it's invoked even when hot standby is disabled?
> Otherwise we cannnot collect the statistics about WAL writing by walreceiver
> in that case.
> 
> [1]
> https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com

Thanks for comments! I didn't notice that.
As I mentioned[1], if my understanding is right, this issue seem to be not for
only the wal receiver.

Since the shared memory thread already handles these issues, does this patch,
which to collect the stats for the wal receiver and make a common function for
writing wal files, have to be committed after the patches for share memory
stats are committed? Or to handle them in this thread because we don't know
when the shared memory stats patches will be committed.

I think the former is better because to collect stats in shared memory is very
useful feature for users and it make a big change in design. So, I think it's
beneficial to make an effort to move the shared memory stats thread forward
(by reviewing or testing) instead of handling the issues in this thread.

[1]
https://www.postgresql.org/message-id/9f4e19ad-518d-b91a-e500-25a666471c42%40oss.nttdata.com

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: About to add WAL write/fsync statistics to pg_stat_wal view

From
Fujii Masao
Date:

On 2021/03/25 11:50, Masahiro Ikeda wrote:
> 
> 
> On 2021/03/23 16:10, Fujii Masao wrote:
>>
>>
>> On 2021/03/22 20:25, ikedamsh wrote:
>>> Agreed. Users can know whether the stats is for walreceiver or not. The
>>> pg_stat_wal view in standby server shows for the walreceiver, and in primary
>>> server it shows for the others. So, I updated the document.
>>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>
>> Thanks for updating the docs!
>>
>> There was the discussion about when the stats collector is invoked, at [1].
>> Currently during archive recovery or standby, the stats collector is
>> invoked when the startup process reaches the consistent state, sends
>> PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
>> read-only connections. But walreceiver can be invoked at earlier stage.
>> This can cause walreceiver to generate and send the statistics about WAL
>> writing even though the stats collector has not been running yet. This might
>> be problematic? If so, maybe we need to ensure that the stats collector is
>> invoked before walreceiver?
>>
>> During recovery, the stats collector is not invoked if hot standby mode is
>> disabled. But walreceiver can be running in this case. So probably we should
>> change walreceiver so that it's invoked even when hot standby is disabled?
>> Otherwise we cannnot collect the statistics about WAL writing by walreceiver
>> in that case.
>>
>> [1]
>> https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921bdb5@oss.nttdata.com
> 
> Thanks for comments! I didn't notice that.
> As I mentioned[1], if my understanding is right, this issue seem to be not for
> only the wal receiver.
> 
> Since the shared memory thread already handles these issues, does this patch,
> which to collect the stats for the wal receiver and make a common function for
> writing wal files, have to be committed after the patches for share memory
> stats are committed? Or to handle them in this thread because we don't know
> when the shared memory stats patches will be committed.
> 
> I think the former is better because to collect stats in shared memory is very
> useful feature for users and it make a big change in design. So, I think it's
> beneficial to make an effort to move the shared memory stats thread forward
> (by reviewing or testing) instead of handling the issues in this thread.

Sounds reasonable. Agreed.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION