pg_stat_io_histogram - Mailing list pgsql-hackers

From Jakub Wartak
Subject pg_stat_io_histogram
Msg-id CAKZiRmwvE4uJLKTgPXeBA4m+d4tTghayoefcaM9=z3_S7i72GA@mail.gmail.com
List pgsql-hackers
I'm proposing that we add pg_stat_io_histogram, which would track and show a
profile of I/O latencies so that I/O outliers can be identified quickly. From
time to time users complain that 'PostgreSQL is slow or stuck' (usually COMMIT
is slow) when it is quite apparent that the cause lies somewhere in the I/O
stack. That is easy enough to prove once proper measurement tools are in place
and one is able to correlate them, but IMHO it takes way too much time and
energy to cross-correlate all of that information (iostat -x at 1s intervals,
wait event samples every 1s, and so on), especially when a rapid response is
expected.

Right now the patch does not include per-backend/PID tracking; if there is
interest I'll add it, but I would first like to hear whether that's a good
idea at all. The current implementation uses a fast bucket calculation to
avoid overhead and tries to cover the most useful range of devices with its
buckets (128us..256ms, which covers NVMe/SSD/HDD as well as abnormally high
latencies; from time to time I end up helping with I/O that is stuck for
*seconds*, usually a sign of I/O multipath issues, devices resetting, or
hypervisor woes).
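
A rough sketch of the kind of bucket calculation I mean is below (this is an
illustration, not the code from the attached patch; the bucket count, the
constant names, and the use of the GCC/Clang __builtin_clzll are my
assumptions here, and inside PostgreSQL one would rather reach for
pg_leftmost_one_pos64() from port/pg_bitutils.h):

#include <assert.h>
#include <stdint.h>

#define IOHIST_FIRST_BUCKET_US  128
#define IOHIST_NUM_BUCKETS      12      /* 128us, 256us, ..., 262144us */

/*
 * Map a latency to its power-of-two bucket: <=128us -> 0, <=256us -> 1,
 * and so on; anything above the last bucket's bound is clamped into the
 * last bucket.
 */
static inline int
io_latency_to_bucket(uint64_t latency_us)
{
    uint64_t    v;
    int         bucket;

    if (latency_us <= IOHIST_FIRST_BUCKET_US)
        return 0;

    /* smallest b such that (128us << b) >= latency_us */
    v = (latency_us - 1) / IOHIST_FIRST_BUCKET_US;
    bucket = 64 - __builtin_clzll(v);   /* floor(log2(v)) + 1 */

    return bucket < IOHIST_NUM_BUCKETS ? bucket : IOHIST_NUM_BUCKETS - 1;
}

int
main(void)
{
    assert(io_latency_to_bucket(100) == 0);       /* page cache hit */
    assert(io_latency_to_bucket(129) == 1);       /* 256us bucket */
    assert(io_latency_to_bucket(65000) == 9);     /* 65536us bucket */
    assert(io_latency_to_bucket(5000000) == 11);  /* clamped: stuck I/O */
    return 0;
}

The point is that classifying one I/O completion is a handful of integer
instructions, so the accounting cost should stay negligible compared to the
syscall being measured.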

postgres=# select
    substring(backend_type,1,8) as backend,object,context,io_type,
    bucket_latency_us as lat_us,
    round(bucket_latency_us/1000.0, 3) as lat_ms,
    bucket_count as count
from pg_stat_get_io_histogram()
where
    bucket_count > 0
order by 1,2,3,4,5;
 backend  |  object  |  context  |  io_type  | lat_us | lat_ms | count
----------+----------+-----------+-----------+--------+--------+-------
 autovacu | relation | normal    | read      |    128 |  0.128 |    54
 autovacu | relation | normal    | read      |    256 |  0.256 |     7
 autovacu | relation | normal    | read      |    512 |  0.512 |     1
 autovacu | relation | vacuum    | read      |    128 |  0.128 |     8
 autovacu | relation | vacuum    | read      |    256 |  0.256 |     5
 backgrou | relation | bulkread  | read      |    128 |  0.128 |   658
 backgrou | relation | normal    | read      |    128 |  0.128 |     5
 checkpoi | relation | normal    | fsync     |   2048 |  2.048 |    37
 checkpoi | relation | normal    | fsync     |   4096 |  4.096 |     7
 checkpoi | relation | normal    | fsync     |  16384 | 16.384 |     4
 checkpoi | relation | normal    | fsync     |  32768 | 32.768 |     1
 checkpoi | relation | normal    | fsync     |  65536 | 65.536 |     1
 checkpoi | relation | normal    | write     |    128 |  0.128 |  2059
 checkpoi | relation | normal    | write     |    256 |  0.256 |     2
 checkpoi | relation | normal    | write     |    512 |  0.512 |     1
 checkpoi | relation | normal    | writeback |    128 |  0.128 |    64
 checkpoi | relation | normal    | writeback |    256 |  0.256 |     1
 client b | relation | bulkread  | read      |    128 |  0.128 |   675
 client b | relation | bulkread  | read      |    256 |  0.256 |     1
 client b | relation | bulkwrite | extend    |    128 |  0.128 |   260
 client b | relation | bulkwrite | extend    |    512 |  0.512 |     1
 client b | relation | bulkwrite | write     |    128 |  0.128 | 14404
 client b | relation | normal    | extend    |    128 |  0.128 |     6
 client b | relation | normal    | read      |    128 |  0.128 |   273
 client b | relation | normal    | read      |    256 |  0.256 |     6
 client b | relation | vacuum    | read      |    128 |  0.128 |   907
 client b | relation | vacuum    | read      |    256 |  0.256 |     3
 client b | relation | vacuum    | read      |    512 |  0.512 |     2

Of course most I/O calls today hit the page cache, so one would expect them to
take < 128us most of the time, but above you can also see degraded
fsync/fdatasync (BTW, that was achieved with a device-mapper delay device). My
hope is that the above would help tremendously when dealing with flaky
storage, I/O path issues, or even hypervisors being paused.

An alternative idea I had was to simply log slow I/O outliers, but meh... then
one would have to answer all those questions: what should the threshold be
(=> a GUC?), what about the risk of spamming the log, and so on (and I
wouldn't be fond of proposing yet another log_* GUC ;))

Any hints, co-authors, or help are more than welcome!

-J.

