Re: Patch: add timing of buffer I/O requests - Mailing list pgsql-hackers

From: Greg Smith
Subject: Re: Patch: add timing of buffer I/O requests
Date:
Msg-id: 4F44FF1C.1030409@2ndQuadrant.com
In response to: Patch: add timing of buffer I/O requests (Ants Aasma <ants.aasma@eesti.ee>)
Responses: Re: Patch: add timing of buffer I/O requests
List: pgsql-hackers
Attached are updated versions of this feature without the pg_test_timing
tool part, since I broke that out into another discussion thread.  I've
also split the part that updates pg_stat_statements out from the main
feature; that's attached here as a separate patch (but I'm not reviewing
it yet).  There's been a lot of bitrot since this was submitted, and
yes, I noticed that by splitting off the parts that were developed
later, I've almost recreated earlier versions of this patch.

Earlier discussion of this got sidetracked on a few things, partly my
fault.  It's worth taking a look at what this provides before judging it
too harshly.  It demos well.

The stated purpose is helping figure out which relations are gobbling up
the most access time, presumably to optimize them and/or the storage
they are on.  "What do I put onto SSD?" is surely a popular question
nowadays.  To check suitability for that, I ran the standard pgbench
test and looked at where it reported time being consumed.  Any answer
other than "pgbench_accounts, and to a lesser extent its index" is a
failing grade.  I started with a clean database and OS cache so I'd get
real read timings.
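
For anyone reproducing this, the setup was roughly as follows; the scale
factor, client count, run length, and Linux cache-drop command are
illustrative stand-ins, not the exact values I used:

$ pgbench -i -s 100 pgbench      # initialize; scale factor is hypothetical
$ pg_ctl stop
$ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'  # drop OS cache (Linux)
$ pg_ctl start
$ pgbench -c 4 -T 300 pgbench    # standard TPC-B-like run; settings hypothetical

After the run, the per-relation I/O statistics look like this: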

$ psql -d pgbench -x -c "select
relname, heap_blks_read, heap_blks_hit, heap_blks_time,
idx_blks_read, idx_blks_hit, idx_blks_time
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname        | pgbench_accounts
heap_blks_read | 7879
heap_blks_hit  | 43837
heap_blks_time | 151770
idx_blks_read  | 7503
idx_blks_hit   | 60484
idx_blks_time  | 70968

relname        | pgbench_tellers
heap_blks_read | 19
heap_blks_hit  | 15856
heap_blks_time | 105
idx_blks_read  | 11
idx_blks_hit   | 15745
idx_blks_time  | 62

relname        | pgbench_branches
heap_blks_read | 11
heap_blks_hit  | 32333
heap_blks_time | 77
idx_blks_read  | 2
idx_blks_hit   | 0
idx_blks_time  | 9

Now, the first critical question to ask is "what additional information
does this provide beyond the existing counters?"  After all, it's
possible to tell pgbench_accounts is the hotspot just from comparing
heap_blks_read, right?  To really be useful, this needs to make it
obvious that reads from pgbench_accounts are slower than from the other
two, because it's bigger and requires more seeking around to populate.
That should show up if we compute time-per-read numbers:

$ psql -d pgbench -x -c "select relname,
1.0 * heap_blks_time / heap_blks_read as time_per_read,
1.0 * idx_blks_time / idx_blks_read as time_per_idx_read
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname           | pgbench_accounts
time_per_read     | 19.2625967762406397
time_per_idx_read | 9.4586165533786485

relname           | pgbench_tellers
time_per_read     | 5.5263157894736842
time_per_idx_read | 5.6363636363636364

relname           | pgbench_branches
time_per_read     | 7.0000000000000000
time_per_idx_read | 4.5000000000000000

This run looks useful at providing the data wished for: that read times
are slower, per read, from the accounts table.  The first time I tried
this I got a bizarrely high number for pgbench_branches.heap_blks_time;
I'm not sure how reliable this is yet.  One problem that might be easy
to fix is that the write timing info doesn't show up in any of these
system views, only in EXPLAIN and the statement-level ones.
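
To show what I mean by the EXPLAIN side, the timing shows up there when
you ask for buffer accounting.  The query and numbers below are
illustrative, and the output labels come from my reading of the patch
rather than a saved session:

$ psql -d pgbench -c "explain (analyze, buffers)
  update pgbench_accounts set abalance = abalance + 1 where aid = 1"
...
   Buffers: shared hit=4 read=3 dirtied=1
   I/O Timings: read=0.213 write=0.154
...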

I still think a full wait-timing interface is the right long-term
direction here.  It's hard to reject this idea when it seems to be
working right now, though, while more comprehensive wait storage is
still at least a release off.  Opinions welcome; I'm still juggling this
around now that I have it working again.

Some implementation notes.  This currently fails the create_function_3
regression test; I haven't looked into why yet.  I've confirmed that on
a system where timing is cheap, this feature is cheap too.  On something
touching real data, not just a synthetic test shuffling memory around,
Ants couldn't measure any overhead on a server similar to mine; I can't
either.  Yes, this is going to gobble up more room for statistics.

The track_iotiming GUC seems to work as expected.  It's off by default,
you can turn it on in a session and see that session's work get timed,
and it toggles on a config reload.  That's everything needed to turn it
on only selectively; the main penalty you pay all the time is the stats
bloat.
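
Spelled out, toggling it looks like this (sketched from memory, not
pasted from a session):

-- For just the current session:
SET track_iotiming = on;

-- Server-wide: add to postgresql.conf ...
track_iotiming = on

-- ... then reload without a restart, either from the shell:
$ pg_ctl reload
-- or from a superuser session:
SELECT pg_reload_conf();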

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
