Re: About to add WAL write/fsync statistics to pg_stat_wal view - Mailing list pgsql-hackers

From Masahiro Ikeda
Subject Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date
Msg-id d18188f83ab56a32b48e29baf920d245@oss.nttdata.com
Whole thread Raw
In response to Re: About to add WAL write/fsync statistics to pg_stat_wal view  (Masahiro Ikeda <ikedamsh@oss.nttdata.com>)
Responses Re: About to add WAL write/fsync statistics to pg_stat_wal view  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On 2021-01-25 13:15, Masahiro Ikeda wrote:
> On 2021-01-25 10:36, Masahiko Sawada wrote:
>> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
>> <ikedamsh@oss.nttdata.com> wrote:
>>> 
>>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>>> > <ikedamsh@oss.nttdata.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I rebased the patch to the master branch.
>>> >
>>> > Thank you for working on this. I've read the latest patch. Here are
>>> > comments:
>>> >
>>> > ---
>>> > +               if (track_wal_io_timing)
>>> > +               {
>>> > +                   INSTR_TIME_SET_CURRENT(duration);
>>> > +                   INSTR_TIME_SUBTRACT(duration, start);
>>> > +                   WalStats.m_wal_write_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +               }
>>> >
>>> > * I think it should add the time in micro sec.
>>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>>> > pg_stat_wal showed the following on my environment:
>>> >
>>> > postgres(1:61569)=# select * from pg_stat_wal;
>>> > -[ RECORD 1 ]----+-----------------------------
>>> > wal_records      | 285947
>>> > wal_fpi          | 53285
>>> > wal_bytes        | 442008213
>>> > wal_buffers_full | 0
>>> > wal_write        | 25516
>>> > wal_write_time   | 0
>>> > wal_sync         | 25437
>>> > wal_sync_time    | 14490
>>> > stats_reset      | 2021-01-22 10:56:13.29464+09
>>> >
>>> > Since writes can complete less than a millisecond, wal_write_time
>>> > didn't increase. I think sync_time could also have the same problem.
>>> 
>>> Thanks for your comments. I didn't notice that.
>>> I changed the unit from milliseconds to microseconds.
>>> 
>>> > ---
>>> > +   /*
>>> > +    * Measure i/o timing to fsync WAL data.
>>> > +    *
>>> > +    * The wal receiver skip to collect it to avoid performance
>>> > degradation of standy servers.
>>> > +    * If sync_method doesn't have its fsync method, to skip too.
>>> > +    */
>>> > +   if (!AmWalReceiverProcess() && track_wal_io_timing &&
>>> > fsyncMethodCalled())
>>> > +       INSTR_TIME_SET_CURRENT(start);
>>> >
>>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>>> > is true? I think the performance degradation is also true for backend
>>> > processes. If there is another reason for that, I think it's better to
>>> > mention in both the doc and comment.
>>> > * How about checking track_wal_io_timing first?
>>> > * s/standy/standby/
>>> 
>>> I fixed it.
>>> As kuroda-san mentioned too, the skip is no need to be considered.
>> 
>> I think you also removed the code to have the wal receiver report the
>> stats. So with the latest patch, the wal receiver tracks those
>> statistics but doesn't report.
>> And maybe XLogWalRcvWrite() also needs to track I/O?
> 
> Thanks, I forgot to add them.
> I'll fix it.
> 
> 
>>> 
>>> > ---
>>> > +   /* increment the i/o timing and the number of times to fsync WAL
>>> > data */
>>> > +   if (fsyncMethodCalled())
>>> > +   {
>>> > +       if (!AmWalReceiverProcess() && track_wal_io_timing)
>>> > +       {
>>> > +           INSTR_TIME_SET_CURRENT(duration);
>>> > +           INSTR_TIME_SUBTRACT(duration, start);
>>> > +           WalStats.m_wal_sync_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > +       }
>>> > +
>>> > +       WalStats.m_wal_sync++;
>>> > +   }
>>> >
>>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>>> > incrementing m_wal_sync after each sync operation?
>>> 
>>> I think if syncing the disk does not occur, m_wal_sync should not be
>>> incremented.
>>> It depends enableFsync and sync_method.
>>> 
>>> enableFsync is checked in each fsync method like
>>> pg_fsync_no_writethrough(),
>>> so if incrementing m_wal_sync after each sync operation, it should be
>>> implemented
>>> in each fsync method. It leads to many duplicated codes.
>> 
>> Right. I missed that each fsync function checks enableFsync.
>> 
>>> So, why don't you change the function to a flag whether to
>>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>> 
>> Looks better. Since we don't necessarily need to increment m_wal_sync
>> after doing fsync we can write the code without an additional variable
>> as follows:
>> 
>>     if (enableFsync)
>>     {
>>         switch (sync_method)
>>         {
>>             case SYNC_METHOD_FSYNC:
>> #ifdef HAVE_FSYNC_WRITETHROUGH
>>             case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> #endif
>> #ifdef HAVE_FDATASYNC
>>             case SYNC_METHOD_FDATASYNC:
>> #endif
>>                 WalStats.m_wal_sync++;
>>                 if (track_wal_io_timing)
>>                     INSTR_TIME_SET_CURRENT(start);
>>                 break;
>>             default:
>>                 break;
>>         }
>>     }
>> 
>>   (do fsync and error handling here)
>> 
>>    /* increment the i/o timing and the number of times to fsync WAL 
>> data */
>>    if (track_wal_io_timing)
>>    {
>>        INSTR_TIME_SET_CURRENT(duration);
>>        INSTR_TIME_SUBTRACT(duration, start);
>>        WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
>>    }
> 
> IIUC, I think we can't handle the following case.
> 
> When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
> "track_wal_io_timing" is enabled, "start" doesn't be initialized.
> 
> My understanding is something wrong, isn't it?

I thought the following is better.


```
    /* Measure i/o timing to sync WAL data.*/
    if (track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(start);

   (do fsync and error handling here)

    /* check whether to sync WAL data to the disk right now. */
    if (enableFsync)
    {
        if ((sync_method == SYNC_METHOD_FSYNC) ||
            (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
            (sync_method == SYNC_METHOD_FDATASYNC))
        {
            /* increment the i/o timing and the number of times to fsync WAL data 
*/
            if (track_wal_io_timing)
            {
                instr_time    duration;

                INSTR_TIME_SET_CURRENT(duration);
                INSTR_TIME_SUBTRACT(duration, start);
                WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
            }
            WalStats.m_wal_sync++;
        }
    }
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional 
variables.
But it's ok because "track_wal_io_timing" leads already performance 
degradation.

What do you think?


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Is Recovery actually paused?
Next
From: Dilip Kumar
Date:
Subject: Re: Is Recovery actually paused?