Thread: [PATCH] Avoid useless prefetches in case of recent FPI WAL records
Hi,
One of our customers complained that a physical standby showed significant replay lag (a few hours) on modern, powerful bare metal servers. According to the output of "top", the startup process and a few kernel workers were consuming CPU, mostly in kernel space.
The incident was resolved by disabling recovery_prefetch after a quick analysis of perf and strace output.
Problem description
---------------------
The current XLog prefetching code doesn't consider that a shared buffer may be populated from a WAL FPI rather than read from disk, and issues fadvise() calls in such cases.
This results in a large number of useless fadvise() calls and slows down the WAL apply process.
full_page_writes is enabled by default, so after a checkpoint the first modification of each page is logged into WAL as a full page image. The startup process reads full page images from the WAL stream and loads them into shared buffers via XLogReadBufferForRedoExtended().
It seems there is no point in prefetching a page from disk if its FPI was logged recently.
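For context, the FPI handling in XLogPrefetcherNextBlock() currently looks roughly like this (a simplified sketch based on src/backend/access/transam/xlogprefetcher.c, not an exact quote). The block is skipped once, but nothing records that the page is already being loaded into shared buffers, so later references to the same page still trigger fadvise():

    if (block->has_image)
    {
        /*
         * A full page image is attached, so redo will not read this
         * page from disk; no point in prefetching it.
         */
        XLogPrefetchIncrement(&SharedStats->skip_fpw);
        return LRQ_NEXT_NO_IO;  /* but the block itself is not remembered */
    }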
Test case
----------
To reproduce the useless prefetches, please create a master instance and a physical standby with the default shared_buffers (128MiB), then insert test data larger than shared buffers. With fillfactor 10, each ~800-byte row occupies its own page, so 32*1024 + 50 rows produce 32818 pages:
drop table if exists t1;
create table t1 (id int, k int, val text ) with (fillfactor = 10, autovacuum_enabled=off);
insert into t1 (id,k,val)
select i, 0, repeat('x', 800) from generate_series(1,32*1024 + 50) as i;
checkpoint;
Reset stats on the standby server:
select pg_stat_reset_shared('io');
select pg_stat_reset_shared('recovery_prefetch');
Update the table on the master and check stats on the standby:
checkpoint;
update t1 set k = k + 1;
select object, context,reads, stats_reset from pg_stat_io where backend_type = 'startup';
select * from pg_stat_recovery_prefetch ;
Here is the output:
postgres=# select object, context,reads, stats_reset from pg_stat_io where backend_type = 'startup';
  object  |  context  | reads |          stats_reset
----------+-----------+-------+-------------------------------
 relation | bulkread  |     0 | 2025-03-29 16:38:45.664648+03
 relation | bulkwrite |     0 | 2025-03-29 16:38:45.664648+03
 relation | init      |     0 | 2025-03-29 16:38:45.664648+03
 relation | normal    |     9 | 2025-03-29 16:38:45.664648+03
 relation | vacuum    |     0 | 2025-03-29 16:38:45.664648+03
 wal      | init      |       | 2025-03-29 16:38:45.664648+03
 wal      | normal    |  8100 | 2025-03-29 16:38:45.664648+03
(7 rows)
postgres=# select * from pg_stat_recovery_prefetch ;
          stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
-------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
 2025-03-29 16:38:45.665322+03 |    32383 | 435 |         0 |        0 |    32818 |    32818 |            0 |              0 |        0
(1 row)
The startup process called fadvise() 32383 times and skipped 32818 FPW blocks, yet it read relation pages only 9 times.
Idea for a fix
---------------
As of now, XLogPrefetcher skips FPW blocks without any further tracking.
It seems more correct to book-keep such blocks so that they are not prefetched later.
Please find attached a patch implementing this idea. It records the skipped block in recent_block & recent_rlocator, as sketched below.
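The essence of the change is to record the skipped block in the same ring buffer the prefetcher already uses to detect repeated block references. A sketch of the idea, assuming the recent_rlocator/recent_block arrays and XLOGPREFETCHER_SEQ_WINDOW_SIZE from xlogprefetcher.c (the attached patch is authoritative):

    if (block->has_image)
    {
        XLogPrefetchIncrement(&SharedStats->skip_fpw);

        /*
         * Remember this block, so that subsequent references to the same
         * page are treated as repeats (skip_rep) instead of producing a
         * useless fadvise() call: the FPI has already put the page into
         * shared buffers.
         */
        prefetcher->recent_rlocator[prefetcher->recent_idx] = block->rlocator;
        prefetcher->recent_block[prefetcher->recent_idx] = block->blkno;
        prefetcher->recent_idx =
            (prefetcher->recent_idx + 1) % XLOGPREFETCHER_SEQ_WINDOW_SIZE;

        return LRQ_NEXT_NO_IO;
    }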
After the patch, the stats look as follows:
postgres=# select object, context,reads, stats_reset from pg_stat_io where backend_type = 'startup';
  object  |  context  | reads |          stats_reset
----------+-----------+-------+-------------------------------
 relation | bulkread  |     0 | 2025-03-30 01:27:45.981169+03
 relation | bulkwrite |     0 | 2025-03-30 01:27:45.981169+03
 relation | init      |     0 | 2025-03-30 01:27:45.981169+03
 relation | normal    |     9 | 2025-03-30 01:27:45.981169+03
 relation | vacuum    |     0 | 2025-03-30 01:27:45.981169+03
 wal      | init      |       | 2025-03-30 01:27:45.981169+03
 wal      | normal    |  7895 | 2025-03-30 01:27:45.981169+03
(7 rows)
postgres=# select * from pg_stat_recovery_prefetch ;
          stats_reset          | prefetch | hit | skip_init | skip_new | skip_fpw | skip_rep | wal_distance | block_distance | io_depth
-------------------------------+----------+-----+-----------+----------+----------+----------+--------------+----------------+----------
 2025-03-30 01:27:46.971602+03 |        0 |   0 |         0 |        0 |    32818 |    65636 |            0 |              0 |        0
(1 row)
i.e. there are no more useless fadvise() calls, and the number of skipped repeated blocks has increased by the number of modified blocks: the references that were previously prefetched are now caught by the existing repeat detection, sketched below.
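For reference, the repeat detection that now catches these references is roughly the following loop over the ring buffer (again a simplified sketch, not the exact source):

    for (int i = 0; i < XLOGPREFETCHER_SEQ_WINDOW_SIZE; i++)
    {
        if (block->blkno == prefetcher->recent_block[i] &&
            RelFileLocatorEquals(block->rlocator, prefetcher->recent_rlocator[i]))
        {
            /* Seen recently; count it and skip the prefetch. */
            XLogPrefetchIncrement(&SharedStats->skip_rep);
            return LRQ_NEXT_NO_IO;
        }
    }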
Please feel free to ask any questions!
Thank you!
--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru