Re: pg_stat_io not tracking smgrwriteback() is confusing - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pg_stat_io not tracking smgrwriteback() is confusing
Date
Msg-id 20230424211432.u2z6v6mceemznd6t@awork3.anarazel.de
In response to Re: pg_stat_io not tracking smgrwriteback() is confusing  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: pg_stat_io not tracking smgrwriteback() is confusing
List pgsql-hackers
Hi,

On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > I noticed that the numbers in pg_stat_io don't quite add up to what I
> > expected in write heavy workloads. Particularly for checkpointer, the numbers
> > for "write" in log_checkpoints output are larger than what is visible in
> > pg_stat_io.
> > 
> > That partially is because log_checkpoints' "write" covers way too many things,
> > but there's an issue with pg_stat_io as well:
> > 
> > Checkpoints, and some other sources of writes, will often end up doing a lot
> > of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> > pre-existing forms of IO statistics.
> > 
> > It seems pretty clear that we should track writeback as well. I wonder if it's
> > worth doing so for 16? It'd give a more complete picture that way. The
> > counter-argument I see is that we didn't track the time for it in existing
> > stats either, and that nobody complained - but I suspect that's mostly because
> > nobody knew to look.
> 
> Not complaining about making pg_stat_io more accurate, but what exactly
> would we be tracking for smgrwriteback()? I assume you are talking about
> IO timing. AFAICT, on Linux, it does sync_file_range() with
> SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
> tracking the system call overhead time?

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it won't matter whether we track smgrwriteback(), but
when it matters, it can matter a lot.

As an example, I initialized pgbench with scale 1000, on a decent but not great
NVMe SSD. Created dirty data with:

  c=96;/srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -M prepared -c$c -j$c -T30 -P1
and then measured the checkpoint:
  perf trace -s -p $pid_of_checkpointer psql -x -c "SELECT pg_stat_reset_shared('io')" -c "checkpoint"

 postgres (367444), 1891280 events, 100.0%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   sync_file_range   359176      0  4560.670     0.002     0.013   238.955     10.36%
   pwrite64          582964      0  2874.634     0.003     0.005     0.156      0.06%
   fsync                242      0   251.631     0.001     1.040    42.166     18.81%
   openat               317     65     2.171     0.002     0.007     0.068      5.69%
   rename                69      0     1.973     0.012     0.029     0.084      5.81%
   fdatasync              1      0     1.543     1.543     1.543     1.543      0.00%
   unlink               150    137     1.278     0.002     0.009     0.062     10.69%
   close                250      0     0.694     0.001     0.003     0.007      3.14%
   newfstatat           140     68     0.667     0.001     0.005     0.022      7.26%
   write                  5      0     0.067     0.005     0.013     0.025     24.55%
   lseek                 14      0     0.050     0.001     0.004     0.018     33.87%
   getdents64             8      0     0.047     0.002     0.006     0.022     39.51%
   kill                   3      0     0.029     0.008     0.010     0.011     10.18%
   epoll_wait             2      1     0.006     0.000     0.003     0.006    100.00%
   read                   1      0     0.004     0.004     0.004     0.004      0.00%

Log output:

2023-04-24 14:11:59.234 PDT [367444][checkpointer][:0][] LOG:  checkpoint starting: immediate force wait
2023-04-24 14:12:09.236 PDT [367444][checkpointer][:0][] LOG:  checkpoint complete: wrote 595974 buffers (28.4%); 0 WAL
file(s) added, 0 removed, 68 recycled; write=9.740 s, sync=0.057 s, total=10.002 s; sync files=27, longest=0.043 s,
average=0.003 s; distance=4467386 kB, estimate=4467386 kB; lsn=6/E5D33F98, redo lsn=6/E5D33F28
 


# SELECT writes, write_time, fsyncs, fsync_time FROM pg_stat_io WHERE backend_type = 'checkpointer';
┌────────┬────────────────────┬────────┬────────────┐
│ writes │     write_time     │ fsyncs │ fsync_time │
├────────┼────────────────────┼────────┼────────────┤
│ 595914 │ 4002.1730000000002 │     24 │     46.359 │
└────────┴────────────────────┴────────┴────────────┘


Greetings,

Andres Freund


