transparent_hugepage=never is set on our prod servers, %iowait is low (0.x-1.x%), read/write IOPS are below 2k, and read/write wait is 0.x ms. We did not find anything abnormal in the OS logs either. Yes, we are discussing with our application team how to reduce concurrency.

More questions about the DataFileRead and extend wait events: when a backend starts reading data from the physical data files, it reports the DataFileRead wait event, and that wait event disappears from pg_stat_activity once the pages have been loaded into the buffer cache, right? So if there is contention in the buffer cache, the session could keep waiting there while still reporting DataFileRead, right?

Another thing is the operating system page cache: many sessions reading data in parallel may contend on the OS page cache, while bgwriter and checkpointer are flushing data, and OS background workers may be flushing dirty pages from the OS page cache as well. Could bgwriter or checkpointer contend with the backends on the OS page cache or the PostgreSQL buffer cache? I have no idea how to find the cause on an online production server.
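For reference, the pg_stat_activity check I mean is roughly like this (only an
example; the filter and grouping are just what I would sample repeatedly while
the stall is happening):

    -- snapshot of what the active sessions are waiting on
    SELECT wait_event_type, wait_event, count(*)
      FROM pg_stat_activity
     WHERE state = 'active'
     GROUP BY wait_event_type, wait_event
     ORDER BY count(*) DESC;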
Thanks,
James
On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
> pgv14, RHEL8, xfs: we suddenly saw tens of sessions waiting on "DataFileRead" and
> "extend"; it lasted about 2 seconds (based on a pg_stat_activity query). During the
> waiting time, "%sys" CPU increased to 80%, but "iostat" showed neither high IOPS
> nor increased read/write latency.
Run "sar -P all 1" and see if "%iowait" is high.
Check if you have transparent hugepages enabled:
cat /sys/kernel/mm/transparent_hugepage/enabled
If they are enabled, disable them and see if it makes a difference.
I am only guessing here.
> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into and
> delete from other tables.
One thing that almost certainly would improve your situation is to run fewer
concurrent statements, for example by using a reasonably sized connection pool.
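Just to illustrate what "fewer concurrent statements" means in numbers, something
like this (only an example) shows how many backends are busy at the same time:

    -- number of backends currently executing a statement
    SELECT count(*) AS active_backends
      FROM pg_stat_activity
     WHERE state = 'active';

If that number is far above the number of CPU cores, capping it with a connection
pool will usually help.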
Yours,
Laurenz Albe