Re: index prefetching - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: index prefetching
Date
Msg-id DC2FB4TJBCYR.2U40BNPFOO51E@bowt.ie
Whole thread Raw
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu Aug 14, 2025 at 3:41 PM EDT, Andres Freund wrote:
> Hm, that is somewhat curious.
>
> I wonder if there's some wait time that's not being captured by "I/O
> Timings". A first thing to do would be to just run strace --summary-only while
> running the query, and see if there are syscall wait times that seem too long.

For the slow, sequential backwards scan query:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.271216           4     66808           io_uring_enter
  0.00    0.000004           4         1           sendto
  0.00    0.000001           0         2         1 recvfrom
  0.00    0.000000           0         5           lseek
  0.00    0.000000           0         1           epoll_wait
  0.00    0.000000           0         4           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.271221           4     66821         1 total

For the fast, random backwards scan query:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    0.351518           4     77819           io_uring_enter
  0.00    0.000007           2         3         1 epoll_wait
  0.00    0.000006           6         1           sendto
  0.00    0.000003           1         3         2 recvfrom
  0.00    0.000002           2         1           read
  0.00    0.000002           2         1         1 rt_sigreturn
  0.00    0.000002           2         1           getpid
  0.00    0.000002           1         2           kill
  0.00    0.000000           0         3           lseek
------ ----------- ----------- --------- --------- ----------------
100.00    0.351542           4     77834         4 total

> What effective_io_concurrency and io_max_concurrency setting are you using? If
> there are no free IO handles that's currently not nicely reported (because
> it's unclear how exactly to do so, see comment above pgaio_io_acquire_nb()).

effective_io_concurrency is 100.  io_max_concurrency is 64.  Nothing out of
the ordinary there.

> iostat -xmy /path/to/block/device
>
> I'd like to see the difference in average IO size (rareq-sz), queue depth
> (aqu-sz) and completion time (r_await) between the fast and slow cases.

I'll show one second interval output.

Slow, sequential backwards scan query
-------------------------------------

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util 
nvme0n1       24613.00    192.29     0.00   0.00    0.20     8.00    0.00      0.00     0.00   0.00    0.00     0.00
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00    4.92  53.20 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.22    0.00    0.44    0.85    0.00   98.50

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util 
nvme0n1       25320.00    197.81     0.00   0.00    0.20     8.00    0.00      0.00     0.00   0.00    0.00     0.00
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00    5.18  51.20 

Fast, random backwards scan query
---------------------------------

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util 
nvme0n1       27140.59    212.04     0.00   0.00    0.20     8.00    0.00      0.00     0.00   0.00    0.00     0.00
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00    5.50  23.37 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.84    0.00    0.00   98.66

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz
d/s    dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util 
nvme0n1       50401.00    393.76     0.00   0.00    0.20     8.00    0.00      0.00     0.00   0.00    0.00     0.00
0.00     0.00     0.00   0.00    0.00     0.00    0.00    0.00   10.06  41.60 

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PATCH] bms_prev_member() can read beyond the end of the array of allocated words
Next
From: Masahiko Sawada
Date:
Subject: Re: POC: Parallel processing of indexes in autovacuum