Thread: Add time spent in posix_fadvise() in I/O read time ?

Add time spent in posix_fadvise() in I/O read time ?

From: Frédéric Yhuel
Hi,

I'm currently looking at how we can integrate effective_io_concurrency 
into the planner cost model. This is actually a big overstatement, as I 
have only done a bunch of tests so far. One thing I've noticed is that 
posix_fadvise(,,POSIX_FADV_WILLNEED) isn't always non-blocking on Linux. 
As Ted Ts'o explains in this old thread[1], it blocks when the request 
queue fills up.
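
For illustration, here is a minimal standalone C sketch of the kind of behaviour I mean (not part of PostgreSQL; the file name, block size and the 1 ms threshold are arbitrary choices): it issues POSIX_FADV_WILLNEED hints against a large file and prints the calls that took a noticeable amount of time, which start appearing once the device request queue is saturated.

/*
 * Rough standalone test: issue many POSIX_FADV_WILLNEED hints against a
 * large file and time each call.  Once the device request queue fills up,
 * some calls stop being "fire and forget" and take a visible amount of time.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	struct timespec start, end;
	int			fd;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <large data file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	for (long blkno = 0; blkno < 100000; blkno++)
	{
		clock_gettime(CLOCK_MONOTONIC, &start);
		(void) posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
		clock_gettime(CLOCK_MONOTONIC, &end);

		double		us = (end.tv_sec - start.tv_sec) * 1e6 +
			(end.tv_nsec - start.tv_nsec) / 1e3;

		/* print only the calls that blocked for a while (> 1 ms) */
		if (us > 1000.0)
			printf("blkno %ld: posix_fadvise() took %.1f us\n", blkno, us);
	}

	close(fd);
	return 0;
}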

This leads to very surprising EXPLAIN output. In the following test, run 
on a machine with an ATA SSD (I can't observe the same behaviour on a 
machine with an NVMe SSD), if we increase effective_io_concurrency from 
256 to 512, the I/O shared read time decreases from 1148 ms to 350 ms, 
while the execution time remains almost the same:

CREATE UNLOGGED TABLE eic_uniform(a INT, filler TEXT);

INSERT INTO eic_uniform SELECT 10000 * random(), md5(i::text) FROM generate_series(1,20000000) AS T(i);

VACUUM ANALYZE eic_uniform;

CREATE INDEX ON eic_uniform(a);

set work_mem = '16MB';
set max_parallel_workers_per_gather = 0;
set track_io_timing = on;

set effective_io_concurrency = 256;

-- remove heap from postgres cache and page cache

EXPLAIN (ANALYZE, SETTINGS, BUFFERS, COSTS OFF) SELECT * FROM eic_uniform WHERE a < 140 AND filler > 'fffffff';
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on eic_uniform (actual time=2107.439..2107.440 rows=0 loops=1)
    Recheck Cond: (a < 140)
    Filter: (filler > 'fffffff'::text)
    Rows Removed by Filter: 279648
    Heap Blocks: exact=135778
    Buffers: shared read=136024
    I/O Timings: shared read=1147.903
    ->  Bitmap Index Scan on eic_uniform_a_idx (actual time=41.827..41.828 rows=279648 loops=1)
          Index Cond: (a < 140)
          Buffers: shared read=246
          I/O Timings: shared read=0.568
  Settings: effective_io_concurrency = '256', max_parallel_workers_per_gather = '0', work_mem = '16MB'
  Planning Time: 0.197 ms
  Execution Time: 2108.553 ms


set effective_io_concurrency = 512;

-- remove heap from postgres cache and page cache

EXPLAIN (ANALYZE, SETTINGS, BUFFERS, COSTS OFF) SELECT * FROM eic_uniform WHERE a < 140 AND filler > 'fffffff';
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on eic_uniform (actual time=2118.706..2118.707 rows=0 loops=1)
    Recheck Cond: (a < 140)
    Filter: (filler > 'fffffff'::text)
    Rows Removed by Filter: 279648
    Heap Blocks: exact=135778
    Buffers: shared read=136024
    I/O Timings: shared read=350.014
    ->  Bitmap Index Scan on eic_uniform_a_idx (actual time=44.619..44.620 rows=279648 loops=1)
          Index Cond: (a < 140)
          Buffers: shared read=246
          I/O Timings: shared read=0.581
  Settings: effective_io_concurrency = '512', max_parallel_workers_per_gather = '0', work_mem = '16MB'
  Planning Time: 0.201 ms
  Execution Time: 2119.885 ms

Now, if we apply the attached patch (it applies to REL_17_STABLE and is 
not meant to be reviewed), it becomes:

    I/O Timings: shared wait=322.476 read=1147.903

  and:

    I/O Timings: shared wait=1030.122 read=350.014

where the "wait time" measures the time spent in posix_fadvise().
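
To make this concrete, the accounting would look roughly like the following (a simplified standalone sketch, not the attached patch; the io_wait_time / io_read_time names and the file path are made up for illustration): time posix_fadvise() into a separate "wait" counter and pread() into the usual "read" counter.

/*
 * Simplified sketch of the wait/read split (illustration only).
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192

static double io_wait_time;		/* ms spent in posix_fadvise() */
static double io_read_time;		/* ms spent in pread() */

static double
elapsed_ms(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

int
main(void)
{
	char		buf[BLCKSZ];
	struct timespec s, e;
	int			fd = open("/tmp/some_large_file", O_RDONLY);	/* illustrative path */

	if (fd < 0)
		return 1;

	for (long blkno = 0; blkno < 10000; blkno++)
	{
		/* prefetch hint: accounted as "wait" time */
		clock_gettime(CLOCK_MONOTONIC, &s);
		(void) posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
		clock_gettime(CLOCK_MONOTONIC, &e);
		io_wait_time += elapsed_ms(&s, &e);

		/* actual read: accounted as "read" time, as today */
		clock_gettime(CLOCK_MONOTONIC, &s);
		(void) pread(fd, buf, BLCKSZ, blkno * BLCKSZ);
		clock_gettime(CLOCK_MONOTONIC, &e);
		io_read_time += elapsed_ms(&s, &e);
	}

	printf("shared wait=%.3f read=%.3f\n", io_wait_time, io_read_time);
	close(fd);
	return 0;
}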

I don't know whether we should do something like this... the breakdown is 
useful for my investigation, but probably not for most users... although 
it might be interesting to have statistics in pg_stat_io that would allow 
us to detect a bad setting of effective_io_concurrency?

But if we don't, then I think we should at least include the time spent 
in posix_fadvise() in the I/O read time, don't you?

BTW, with the same dataset, e_i_c = 0 performs twice as well as 
e_i_c = 1, which means that kernel readahead does a much better job than 
PostgreSQL prefetching with default settings in this very specific case, 
but I think this has already been discussed somewhere.


[1] 
https://linux.kernel.narkive.com/WcGajTwh/posix-fadvise-posix-fadv-willneed-waits-before-returning#post2



Re: Add time spent in posix_fadvise() in I/O read time ?

From: Frédéric Yhuel

On 3/14/25 09:43, Frédéric Yhuel wrote:
> One thing I've noticed is that posix_fadvise(,,POSIX_FADV_WILLNEED) 
> isn't always non-blocking on Linux. As Ted Ts'o explains in this old 
> thread[1], it blocks when the request queue fills up.

When posix_fadvise() blocks, it doesn't seem to be completely off-CPU 
(but mostly, at least on my machine), and I assume this is the reason for 
the sentence "A value higher than needed to keep the disks busy will only 
result in extra CPU overhead" in the documentation for 
effective_io_concurrency. My guess is that Linux uses a spinlock 
somewhere, which would explain the CPU overhead.

Also, on Linux, it seems we can control the size of the request queue 
with the /sys/block/XXX/queue/nr_requests setting. On my machine, it is 
set to 64 by default. When I raise it to 128, postgres spends less time 
in posix_fadvise(), and correspondingly more in pread().