Thread: many sessions waiting DataFileRead and extend

many sessions waiting DataFileRead and extend

From: James Pang
Date:
   pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase in read/write latency.
   Many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend" and "DataFileRead". There are AFTER DELETE triggers on this table that insert into/delete from other tables.

 time                          | wait_event           | count(sessions)
 2025-06-24 18:00:11.687245+00 | DataFileFlush        |   1
 2025-06-24 18:00:11.687245+00 | DataFileWrite        |   1
 2025-06-24 18:00:11.687245+00 | PgSleep              |   1
 2025-06-24 18:00:11.687245+00 | DataFileExtend       |   7
 2025-06-24 18:00:11.687245+00 | ProcArrayGroupUpdate |   1
 2025-06-24 18:00:11.687245+00 | WALWrite             |   5
 2025-06-24 18:00:11.687245+00 | extend               |  55
 2025-06-24 18:00:11.687245+00 | DataFileRead         |  45
 2025-06-24 18:00:11.687245+00 | DataFilePrefetch     |   2
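
For reference, a sampling query along these lines (just a sketch; it assumes it is run repeatedly, e.g. once per second from a monitoring job) produces counts like the table above:

  -- Sketch: count active sessions per wait event; run repeatedly to see spikes.
  SELECT now() AS sample_time,
         wait_event_type,
         wait_event,
         count(*) AS sessions
  FROM pg_stat_activity
  WHERE state = 'active'
    AND wait_event IS NOT NULL
  GROUP BY 1, 2, 3
  ORDER BY sessions DESC;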
 
Thanks,

James 

Re: many sessions waiting DataFileRead and extend

From: Laurenz Albe
Date:
On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
> "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the
> wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase
> in read/write latency.

Run "sar -P all 1" and see if "%iowait" is high.

Check if you have transparent hugepages enabled:

  cat /sys/kernel/mm/transparent_hugepage/enabled

If they are enabled, disable them and see if it makes a difference.

I am only guessing here.

> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into/delete
> from other tables.

One thing that almost certainly would improve your situation is to run fewer
concurrent statements, for example by using a reasonably sized connection pool.

Yours,
Laurenz Albe



Re: many sessions waiting DataFileRead and extend

From: James Pang
Date:
   transparent_hugepage=never on our prod servers. %iowait is low (0.x-1.x%), read/write IOPS are < 2k, and read/write wait is 0.x ms. We did not find anything abnormal in the OS logs either. Yes, we are discussing with our application team how to reduce concurrency. More questions about the DataFileRead and extend wait events: when a session starts reading data from the physical data files, it reports the DataFileRead wait event until the pages are loaded into the buffer cache, and only then does pg_stat_activity stop reporting the wait event, right? If there is contention in the buffer cache, the session keeps waiting there while still reporting DataFileRead, right? Another thing is the operating system page cache: many sessions starting to read data in parallel may cause contention on the OS page cache, while bgwriter and checkpointer are flushing data, or OS background workers may be flushing dirty pages from the page cache too; could bgwriter or checkpointer contend on the OS page cache or the PG buffer cache? We have no idea how to find the cause on a live production server.
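
For what it's worth, a rough way to see from the PostgreSQL side how much of the buffer writing is done by backends versus bgwriter and checkpointer is pg_stat_bgwriter (just a sketch; PG14 column names, counters are cumulative since the last stats reset):

  -- Sketch (PG14): sample before and after an incident and diff the counters.
  -- High buffers_backend / buffers_backend_fsync relative to buffers_clean and
  -- buffers_checkpoint means backends are writing out dirty buffers themselves.
  SELECT checkpoints_timed,
         checkpoints_req,
         buffers_checkpoint,
         buffers_clean,
         maxwritten_clean,
         buffers_backend,
         buffers_backend_fsync,
         buffers_alloc
  FROM pg_stat_bgwriter;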

Thanks,

James 

Laurenz Albe <laurenz.albe@cybertec.at> wrote on Wed, Jun 25, 2025 at 1:59 PM:
On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
> "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the
> wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase
> in read/write latency.

Run "sar -P all 1" and see if "%iowait" is high.

Check if you have transparent hugepages enabled:

  cat /sys/kernel/mm/transparent_hugepage/enabled

If they are enabled, disable them and see if it makes a difference.

I am only guessing here.

> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into/delete
> from other tables.

One thing that almost certainly would improve your situation is to run fewer
concurrent statements, for example by using a reasonably sized connection pool.

Yours,
Laurenz Albe

Re: many sessions waiting DataFileRead and extend

From: James Pang
Date:

    Thanks. To summarize the issue: there was no connection storm (fork) either; suddenly many sessions were waiting on "extend" and "DataFileRead", it lasted 2 seconds, and this server (64 vCPUs) has been running for a long time without issues. Only last weekend we patched from 14.8 to 14.14. We checked with Infra, VMware, and OS monitoring and, apart from the very high %sys CPU, found no other abnormal events. So we are wondering whether contention at the PostgreSQL buffer cache or OS page cache level could lead to this (if that can contribute to IO waits, DataFileRead, extend).
Attached please find the details; only the dbname, username, and IP addresses are masked.

Laurenz Albe <laurenz.albe@cybertec.at> wrote on Thu, Jun 26, 2025 at 11:03 AM:
On Thu, 2025-06-26 at 10:32 +0800, James Pang wrote:
> thanks for your suggestions, we have %iowait from sar too, copied here; checking with the infra team, we did not find abnormal IO activity either.
> 02:00:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 02:00:03 PM  all   15.92    0.00   43.02    0.65    0.76    2.56    0.00    0.00    0.00   37.09

Crazy high system time.

If it is not transparent hugepages, the other suspect is fork().
Turn on "log_connections" for a minute or two and see if there are lots
of connections established.  That can easily use your CPU.
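
For example (just a sketch, assuming you can use ALTER SYSTEM and a config reload):

  -- Sketch: enable connection logging temporarily without a restart, then revert.
  ALTER SYSTEM SET log_connections = on;
  SELECT pg_reload_conf();
  -- ... watch the server log for a minute or two ...
  ALTER SYSTEM RESET log_connections;
  SELECT pg_reload_conf();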

Other than that, I am out of guesses.

> Frits Hoogland <frits.hoogland@gmail.com> wrote on Wed, Jun 25, 2025 at 10:27 PM:
> > > On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > >
> > > Run "sar -P all 1" and see if "%iowait" is high.
> >
> > I would (strongly) advise against the use of iowait as an indicator.

I have heard that before, and I am sure you are right.

I grant that if it is low, it may just mean that the CPU is under load.
But if %iowait is high, my experience is that that indicates an I/O problem.

Yours,
Laurenz Albe

Re: many sessions waiting DataFileRead and extend

From: James Pang
Date:
in addition to "DataFileRead", we actually have even more sessions waiting on "extend", and we have log_lock_waits enabled; for example:
2025-06-24 18:00:11.368 xxxx:[1865315]:[4-1]:mbsLOG:  process 1865315 still waiting for ExclusiveLock on extension of relation 14658239 of database 16384 after 1000.161 ms
2025-06-24 18:00:11.368 xxxx:mbs@pgaaweb:[1865315]:[5-1]:mbsDETAIL:  Process holding the lock: 1748100. Wait queue: 232658, 606235, 1865315, 1939296, 1368901, 2472806, 2072266, 963893, 1304751, 1084094, 875968, 451652, 672314, 1240983, 2085501, 851401, 2043926, 2522611, 889816, 191663, 1090944, 536057, 1810868, 1240510, 1195564, 1067298.
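
For reference, a query like the following sketch could show the extension-lock queue live while it is happening (PG14 catalogs; adjust as needed):

  -- Sketch: sessions holding or waiting for relation extension locks, live.
  SELECT l.pid,
         l.granted,
         l.relation::regclass AS relation,
         a.wait_event_type,
         a.wait_event,
         left(a.query, 60) AS query
  FROM pg_locks l
  JOIN pg_stat_activity a USING (pid)
  WHERE l.locktype = 'extend'
  ORDER BY l.granted DESC, l.pid;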

We see both "DataFileRead" and "extend"; each occurrence lasts 2 seconds and recovers automatically. For this "extend" wait event, in addition to IO, could the OS page cache, the PG buffer cache, or bgwriter/checkpointer have an impact too?

Thanks,

James 

James Pang <jamespang886@gmail.com> wrote on Thu, Jun 26, 2025 at 2:47 PM:
   We faced this issue 3 times this week, each time lasting only 2 seconds, so it is not easy to run perf during peak business time to capture it; anyway, I will try. Before that, I want to understand whether the "os page cache" or the "pg buffer cache" can contribute to the time spent in the "extend" and "DataFileRead" wait events, and whether bgwriter or checkpointer flushing data to disk can impact that too. We have bgwriter enabled, and we saw the checkpointer get scheduled by "wal" during that time, so I increased max_wal_size to make checkpoints scheduled less frequently.

Thanks,

James 


Frits Hoogland <frits.hoogland@gmail.com> wrote on Thu, Jun 26, 2025 at 2:40 PM:
Okay. So it's a situation that is reproducible.
And as was mentioned, the system time (percentage) is very high.
Is this a physical machine or a virtual machine?

The next thing to do is to use perf to record for about 20 seconds or so during a period when you see this behavior (perf record -g, to capture the backtraces as well).
This samples the backtraces of on-CPU tasks, from which you can then derive what they are doing via perf report; you should see lots of tasks in kernel space, and perf report will show what they are doing there.

Frits Hoogland




On 26 Jun 2025, at 04:32, James Pang <jamespang886@gmail.com> wrote:

Thanks for your suggestions. We have %iowait from sar too, copied here; checking with the infra team, we did not find abnormal IO activity either.
02:00:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:00:03 PM  all         15.92    0.00   43.02    0.65    0.76    2.56    0.00    0.00    0.00   37.09
02:00:03 PM    0         17.59    0.00   46.73    1.01    0.50    0.50    0.00    0.00    0.00   33.67
02:00:03 PM    1          9.50     0.00   61.50    0.50    0.50    1.00    0.00    0.00    0.00   27.00
02:00:03 PM    2          20.71    0.00   44.44    1.01    0.51    0.51    0.00    0.00    0.00   32.83
02:00:03 PM    3         14.00    0.00   51.50    2.00    1.00    1.00    0.00    0.00    0.00   30.50
02:00:03 PM    4          6.57     0.00   52.53    0.51    0.51    3.54    0.00    0.00    0.00   36.36
02:00:03 PM    5          10.20    0.00   49.49    1.02    1.53    0.00    0.00    0.00    0.00   37.76
02:00:03 PM    6          27.64    0.00   41.21    0.50    0.50    0.50    0.00    0.00    0.00   29.65
02:00:03 PM    7          9.05    0.00   50.75    0.50    1.01    0.50    0.00    0.00    0.00   38.19
02:00:03 PM    8          12.18    0.00   49.75    0.51    0.51    0.51    0.00    0.00    0.00   36.55
02:00:03 PM    9          13.00    0.00    9.50    0.50    1.50   15.50    0.00    0.00    0.00   60.00
02:00:03 PM   10         15.58    0.00   46.23    0.00    0.50    0.50    0.00    0.00    0.00   37.19
02:00:03 PM   11          20.71    0.00   10.10    0.00    1.01   14.14    0.00    0.00    0.00   54.04
02:00:03 PM   12         21.00    0.00   37.00    0.50    1.00    1.00    0.00    0.00    0.00   39.50
02:00:03 PM   13         13.57    0.00   45.73    1.01    1.01    1.01    0.00    0.00    0.00   37.69
02:00:03 PM   14   18.18    0.00   39.39    1.01    0.51    0.51    0.00    0.00    0.00   40.40
02:00:03 PM   15   14.00    0.00   49.50    0.50    0.50    3.50    0.00    0.00    0.00   32.00
02:00:03 PM   16   19.39    0.00   39.80    1.02    1.53    0.51    0.00    0.00    0.00   37.76
02:00:03 PM   17   16.75    0.00   45.18    1.52    1.02    2.54    0.00    0.00    0.00   32.99
02:00:03 PM   18   12.63    0.00   50.00    0.00    1.01    0.00    0.00    0.00    0.00   36.36
02:00:03 PM   19    5.56    0.00   82.32    0.00    0.00    0.00    0.00    0.00    0.00   12.12
02:00:03 PM   20   15.08    0.00   48.24    0.50    0.50    3.52    0.00    0.00    0.00   32.16
02:00:03 PM   21   17.68    0.00    9.09    0.51    1.52   13.64    0.00    0.00    0.00   57.58
02:00:03 PM   22   13.13    0.00   43.94    0.51    0.51    0.51    0.00    0.00    0.00   41.41
02:00:03 PM   23   14.07    0.00   42.71    0.50    0.50    0.50    0.00    0.00    0.00   41.71
02:00:03 PM   24   13.13    0.00   41.92    1.01    0.51    0.51    0.00    0.00    0.00   42.93
02:00:03 PM   25   16.58    0.00   47.74    0.50    1.01    0.50    0.00    0.00    0.00   33.67
02:00:03 PM   26   16.58    0.00   46.73    0.50    1.01    0.50    0.00    0.00    0.00   34.67
02:00:03 PM   27   45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:00:03 PM   28    6.06    0.00   32.32    0.00    0.51   13.13    0.00    0.00    0.00   47.98
02:00:03 PM   29   13.93    0.00   44.78    1.00    1.00    0.50    0.00    0.00    0.00   38.81
02:00:03 PM   30   11.56    0.00   57.79    0.00    0.50    1.01    0.00    0.00    0.00   29.15
02:00:03 PM   31   33.85    0.00    9.23    0.51    1.54    0.51    0.00    0.00    0.00   54.36
02:00:03 PM   32   30.15    0.00   41.71    0.50    0.50    1.51    0.00    0.00    0.00   25.63

Thanks,

James 

Frits Hoogland <frits.hoogland@gmail.com> wrote on Wed, Jun 25, 2025 at 10:27 PM:


> On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
>> "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the
>> wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase
>> in read/write latency.
>
> Run "sar -P all 1" and see if "%iowait" is high.
I would (strongly) advise against the use of iowait as an indicator. It is a kernel approximation of time spent in IO that cannot be used in any sensible way, other than to say that you are possibly doing IO.
First of all, iowait is not a kernel state, and therefore it's taken from idle. This means that if there is no, or too little, idle time, iowait that should be there is gone.
Second, the conversion of idle time to iowait is done for synchronous IO calls only. That is currently not a problem for postgres, because it uses exactly that, but in the future it might be.
Very roughly put, the kernel keeps a counter of tasks currently in certain system IO calls and then tries to express that as iowait. The time in iowait can't be used to calculate any IO facts.

In that sense, it is in the same area as the load figure: indicative, but mostly useless, because it doesn't give you any facts about what it is expressing.
>
> Check if you have transparent hugepages enabled:
>
>  cat /sys/kernel/mm/transparent_hugepage/enabled
>
> If they are enabled, disable them and see if it makes a difference.
>
> I am only guessing here.
Absolutely. Anything that uses significant amounts of memory and was not written to take advantage of transparent hugepages will probably see more downsides from THP than benefits.
>
>> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
>> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into/delete
>> from other tables.
>
> One thing that almost certainly would improve your situation is to run fewer
> concurrent statements, for example by using a reasonably sized connection pool.
This is true if the limits of the IO device, or of anything on the way to the IO device or devices, are hit.
And in general, high "%sys" (lots of time spent in kernel mode, i.e. system time) indicates lots of time spent in system calls, which is what the read and write calls in postgres are.
Therefore these figures suggest blocking on IO, for which Laurenz's advice to lower the number of concurrent sessions doing IO generally makes sense.
A more nuanced analysis: if IO requests get queued, the tasks wait in 'D' state in linux, which by definition is off CPU, and thus they do not spend CPU (system/kernel) time.

What sounds suspicious is that you indicate you see no significant change in the amount of IO in iostat.

In order to understand this, you will first have to carefully identify the actual physical IO devices that you are using for postgres IO.
In current linux this can be tricky, depending on what the hardware or virtual machine looks like and how the disks are arranged in linux.
What you need to determine is which actual disk devices are used, and what their limits are.
The limits for any disk are IOPS (operations per second) and MBPS (megabytes per second -> bandwidth).

There is an additional thing to realize, which makes this really tricky: postgres uses buffered IO for common IO.
Buffered IO means any read or write goes through the linux page cache, and reads or writes can be served from the page cache if possible.

So in your case, even if the database performs identical read or write requests, the number of read and write IOs served from the cache can differ, which can make an enormous difference in how fast these requests are served. If the operating system somehow ends up on the physical IO path, you will see significant amounts of time spent on that, which will show up as IO-related wait events.

Not a simple answer, but this is how it works.

So I would suggest comparing the well-performing situation with the badly-performing one while the database is doing the same work.


>
> Yours,
> Laurenz Albe
>
>


Re: many sessions waiting DataFileRead and extend

From: Frits Hoogland
Date:
Postgres lives as a process in linux and keeps its own cache, which it tries to use as much as possible for data. This is postgres shared buffers, commonly called the buffer cache.

For WAL, sessions write to the wal buffer (separate from the postgres buffer cache), which must be written to disk upon commit with default settings. Please mind that wal writes are commonly performed by the wal writer, but can be done by backends too.

(slightly simplified) When postgres needs to read data for processing, it has to do that via the postgres buffer cache: it uses the buffer from the cache if it can find it there, and otherwise it has to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that a description of the change is placed in the wal buffer). As long as there are buffers available, the data changes happen in buffers in the buffer cache. However, the buffer cache is a finite number of buffers, and eventually changed buffers must make it to disk (which happens at checkpoints, and for other reasons). The important part here is that checkpointer, bgwriter and backends all perform reads and writes. That is all postgres sees; it only knows it is performing a read or write request.

These reads and writes are issued to the operating system, and the operating system essentially applies the same technique for performance as postgres does: if an IO is done buffered (which is the default), it creates a page in the linux page cache. For reads, if the request finds the pages it needs in the linux page cache, they can be served to the postgres process without an actual read having to go to the storage. Buffered writes always go to the linux page cache and are written back asynchronously by kernel writer threads. (This is the essence; there is much more detail.) Obviously, how much linux can cache depends on the amount of memory available for the page cache and on whether the pages for the request are already in it. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of available memory (commonly misunderstood as a percentage of total memory).

Now, the point I am trying to make: from the postgres point of view, it is impossible to tell whether a read or write request was served from the OS cache or needed physical IO. However, the operating system's ability to serve from cache essentially depends on the availability of free and file-backed memory, and therefore on the general usage of the operating system. Any action on the operating system that needs a significant amount of memory reduces the memory available for caching. All in all, this means that if something memory-intensive happens on the operating system, postgres can perceive a sudden drastic change in IO latency, while from the postgres perspective nothing was done differently.

Now circling back to your concrete question: if linux all of a sudden needs to do physical IO and makes your requests wait on that, instead of doing logical IO (which your session also waits on, but for significantly less time), you might suddenly see all kinds of IO-related waits. Those waits happen for the fast logical IOs too, but they take so little time that you never see them. This is where wait event occurrence counts and accumulated times would help significantly; currently we can only sample the wait event state, so much of this is never seen, and thus not known.
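
As far as postgres itself can see, the only split is buffer-cache hit versus read issued to the operating system, for example (just a sketch):

  -- Sketch: heap_blks_read counts reads postgres issued to the OS; whether they were
  -- served from the linux page cache or from disk is invisible at this level,
  -- which is exactly the point above.  Counters are cumulative since the last stats reset.
  SELECT relname,
         heap_blks_read,
         heap_blks_hit,
         round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
  FROM pg_statio_user_tables
  ORDER BY heap_blks_read DESC
  LIMIT 10;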

Frits Hoogland




On 26 Jun 2025, at 08:47, James Pang <jamespang886@gmail.com> wrote:

   We faced this issue 3 times this week, each time lasting only 2 seconds, so it is not easy to run perf during peak business time to capture it; anyway, I will try. Before that, I want to understand whether the "os page cache" or the "pg buffer cache" can contribute to the time spent in the "extend" and "DataFileRead" wait events, and whether bgwriter or checkpointer flushing data to disk can impact that too. We have bgwriter enabled, and we saw the checkpointer get scheduled by "wal" during that time, so I increased max_wal_size to make checkpoints scheduled less frequently.

Thanks,

James 


Frits Hoogland <frits.hoogland@gmail.com> wrote on Thu, Jun 26, 2025 at 2:40 PM:
Okay. So it's a situation that is reproducible.
And as was mentioned, the system time (percentage) is very high.
Is this a physical machine or a virtual machine?

The next thing to do is to use perf to record for about 20 seconds or so during a period when you see this behavior (perf record -g, to capture the backtraces as well).
This samples the backtraces of on-CPU tasks, from which you can then derive what they are doing via perf report; you should see lots of tasks in kernel space, and perf report will show what they are doing there.

Frits Hoogland




On 26 Jun 2025, at 04:32, James Pang <jamespang886@gmail.com> wrote:

Thanks for your suggestions. We have %iowait from sar too, copied here; checking with the infra team, we did not find abnormal IO activity either.
02:00:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:00:03 PM  all         15.92    0.00   43.02    0.65    0.76    2.56    0.00    0.00    0.00   37.09
02:00:03 PM    0         17.59    0.00   46.73    1.01    0.50    0.50    0.00    0.00    0.00   33.67
02:00:03 PM    1          9.50     0.00   61.50    0.50    0.50    1.00    0.00    0.00    0.00   27.00
02:00:03 PM    2          20.71    0.00   44.44    1.01    0.51    0.51    0.00    0.00    0.00   32.83
02:00:03 PM    3         14.00    0.00   51.50    2.00    1.00    1.00    0.00    0.00    0.00   30.50
02:00:03 PM    4          6.57     0.00   52.53    0.51    0.51    3.54    0.00    0.00    0.00   36.36
02:00:03 PM    5          10.20    0.00   49.49    1.02    1.53    0.00    0.00    0.00    0.00   37.76
02:00:03 PM    6          27.64    0.00   41.21    0.50    0.50    0.50    0.00    0.00    0.00   29.65
02:00:03 PM    7          9.05    0.00   50.75    0.50    1.01    0.50    0.00    0.00    0.00   38.19
02:00:03 PM    8          12.18    0.00   49.75    0.51    0.51    0.51    0.00    0.00    0.00   36.55
02:00:03 PM    9          13.00    0.00    9.50    0.50    1.50   15.50    0.00    0.00    0.00   60.00
02:00:03 PM   10         15.58    0.00   46.23    0.00    0.50    0.50    0.00    0.00    0.00   37.19
02:00:03 PM   11          20.71    0.00   10.10    0.00    1.01   14.14    0.00    0.00    0.00   54.04
02:00:03 PM   12         21.00    0.00   37.00    0.50    1.00    1.00    0.00    0.00    0.00   39.50
02:00:03 PM   13         13.57    0.00   45.73    1.01    1.01    1.01    0.00    0.00    0.00   37.69
02:00:03 PM   14   18.18    0.00   39.39    1.01    0.51    0.51    0.00    0.00    0.00   40.40
02:00:03 PM   15   14.00    0.00   49.50    0.50    0.50    3.50    0.00    0.00    0.00   32.00
02:00:03 PM   16   19.39    0.00   39.80    1.02    1.53    0.51    0.00    0.00    0.00   37.76
02:00:03 PM   17   16.75    0.00   45.18    1.52    1.02    2.54    0.00    0.00    0.00   32.99
02:00:03 PM   18   12.63    0.00   50.00    0.00    1.01    0.00    0.00    0.00    0.00   36.36
02:00:03 PM   19    5.56    0.00   82.32    0.00    0.00    0.00    0.00    0.00    0.00   12.12
02:00:03 PM   20   15.08    0.00   48.24    0.50    0.50    3.52    0.00    0.00    0.00   32.16
02:00:03 PM   21   17.68    0.00    9.09    0.51    1.52   13.64    0.00    0.00    0.00   57.58
02:00:03 PM   22   13.13    0.00   43.94    0.51    0.51    0.51    0.00    0.00    0.00   41.41
02:00:03 PM   23   14.07    0.00   42.71    0.50    0.50    0.50    0.00    0.00    0.00   41.71
02:00:03 PM   24   13.13    0.00   41.92    1.01    0.51    0.51    0.00    0.00    0.00   42.93
02:00:03 PM   25   16.58    0.00   47.74    0.50    1.01    0.50    0.00    0.00    0.00   33.67
02:00:03 PM   26   16.58    0.00   46.73    0.50    1.01    0.50    0.00    0.00    0.00   34.67
02:00:03 PM   27   45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:00:03 PM   28    6.06    0.00   32.32    0.00    0.51   13.13    0.00    0.00    0.00   47.98
02:00:03 PM   29   13.93    0.00   44.78    1.00    1.00    0.50    0.00    0.00    0.00   38.81
02:00:03 PM   30   11.56    0.00   57.79    0.00    0.50    1.01    0.00    0.00    0.00   29.15
02:00:03 PM   31   33.85    0.00    9.23    0.51    1.54    0.51    0.00    0.00    0.00   54.36
02:00:03 PM   32   30.15    0.00   41.71    0.50    0.50    1.51    0.00    0.00    0.00   25.63

Thanks,

James 

Frits Hoogland <frits.hoogland@gmail.com> wrote on Wed, Jun 25, 2025 at 10:27 PM:


> On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
>> "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the
>> wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase
>> in read/write latency.
>
> Run "sar -P all 1" and see if "%iowait" is high.
I would (strongly) advise against the use of iowait as an indicator. It is a kernel approximation of time spent in IO that cannot be used in any sensible way, other than to say that you are possibly doing IO.
First of all, iowait is not a kernel state, and therefore it's taken from idle. This means that if there is no, or too little, idle time, iowait that should be there is gone.
Second, the conversion of idle time to iowait is done for synchronous IO calls only. That is currently not a problem for postgres, because it uses exactly that, but in the future it might be.
Very roughly put, the kernel keeps a counter of tasks currently in certain system IO calls and then tries to express that as iowait. The time in iowait can't be used to calculate any IO facts.

In that sense, it is in the same area as the load figure: indicative, but mostly useless, because it doesn't give you any facts about what it is expressing.
>
> Check if you have transparent hugepages enabled:
>
>  cat /sys/kernel/mm/transparent_hugepage/enabled
>
> If they are enabled, disable them and see if it makes a difference.
>
> I am only guessing here.
Absolutely. Anything that uses significant amounts of memory and was not written to take advantage of transparent hugepages will probably see more downsides from THP than benefits.
>
>> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
>> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into/delete
>> from other tables.
>
> One thing that almost certainly would improve your situation is to run fewer
> concurrent statements, for example by using a reasonably sized connection pool.
This is true if the limits of the IO device, or of anything on the way to the IO device or devices, are hit.
And in general, high "%sys" (lots of time spent in kernel mode, i.e. system time) indicates lots of time spent in system calls, which is what the read and write calls in postgres are.
Therefore these figures suggest blocking on IO, for which Laurenz's advice to lower the number of concurrent sessions doing IO generally makes sense.
A more nuanced analysis: if IO requests get queued, the tasks wait in 'D' state in linux, which by definition is off CPU, and thus they do not spend CPU (system/kernel) time.

What sounds suspicious is that you indicate you see no significant change in the amount of IO in iostat.

In order to understand this, you will first have to carefully identify the actual physical IO devices that you are using for postgres IO.
In current linux this can be tricky, depending on what the hardware or virtual machine looks like and how the disks are arranged in linux.
What you need to determine is which actual disk devices are used, and what their limits are.
The limits for any disk are IOPS (operations per second) and MBPS (megabytes per second -> bandwidth).

There is an additional thing to realize, which makes this really tricky: postgres uses buffered IO for common IO.
Buffered IO means any read or write goes through the linux page cache, and reads or writes can be served from the page cache if possible.

So in your case, even if the database performs identical read or write requests, the number of read and write IOs served from the cache can differ, which can make an enormous difference in how fast these requests are served. If the operating system somehow ends up on the physical IO path, you will see significant amounts of time spent on that, which will show up as IO-related wait events.

Not a simple answer, but this is how it works.

So I would suggest comparing the well-performing situation with the badly-performing one while the database is doing the same work.


>
> Yours,
> Laurenz Albe
>
>



Re: many sessions waiting DataFileRead and extend

From: James Pang
Date:
Thanks for your explanation. From the PostgreSQL perspective, is it possible to see bgwriter or checkpointer blocking backend processes that are reading/writing, or vice versa?
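
One sketch of how to look at that from the PostgreSQL side (pg_blocking_pids() only reflects heavyweight-lock waits, so this is a partial view and bgwriter/checkpointer would normally not appear here):

  -- Sketch: for each session waiting on a heavyweight lock (e.g. the "extend" lock),
  -- show the pids blocking it and what kind of backend they are.
  SELECT a.pid,
         a.wait_event_type,
         a.wait_event,
         blk.pid AS blocking_pid,
         b.backend_type AS blocking_backend_type
  FROM pg_stat_activity a
  JOIN LATERAL unnest(pg_blocking_pids(a.pid)) AS blk(pid) ON true
  LEFT JOIN pg_stat_activity b ON b.pid = blk.pid
  WHERE a.wait_event_type = 'Lock';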

Thanks,

James 

Frits Hoogland <frits.hoogland@gmail.com> wrote on Thu, Jun 26, 2025 at 4:07 PM:
Postgres lives as a process in linux and keeps its own cache, which it tries to use as much as possible for data. This is postgres shared buffers, commonly called the buffer cache.

For WAL, sessions write to the wal buffer (separate from the postgres buffer cache), which must be written to disk upon commit with default settings. Please mind that wal writes are commonly performed by the wal writer, but can be done by backends too.

(slightly simplified) When postgres needs to read data for processing, it has to do that via the postgres buffer cache: it uses the buffer from the cache if it can find it there, and otherwise it has to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that a description of the change is placed in the wal buffer). As long as there are buffers available, the data changes happen in buffers in the buffer cache. However, the buffer cache is a finite number of buffers, and eventually changed buffers must make it to disk (which happens at checkpoints, and for other reasons). The important part here is that checkpointer, bgwriter and backends all perform reads and writes. That is all postgres sees; it only knows it is performing a read or write request.

These reads and writes are issued to the operating system, and the operating system essentially applies the same technique for performance as postgres does: if an IO is done buffered (which is the default), it creates a page in the linux page cache. For reads, if the request finds the pages it needs in the linux page cache, they can be served to the postgres process without an actual read having to go to the storage. Buffered writes always go to the linux page cache and are written back asynchronously by kernel writer threads. (This is the essence; there is much more detail.) Obviously, how much linux can cache depends on the amount of memory available for the page cache and on whether the pages for the request are already in it. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of available memory (commonly misunderstood as a percentage of total memory).

Now, the point I am trying to make: from the postgres point of view, it is impossible to tell whether a read or write request was served from the OS cache or needed physical IO. However, the operating system's ability to serve from cache essentially depends on the availability of free and file-backed memory, and therefore on the general usage of the operating system. Any action on the operating system that needs a significant amount of memory reduces the memory available for caching. All in all, this means that if something memory-intensive happens on the operating system, postgres can perceive a sudden drastic change in IO latency, while from the postgres perspective nothing was done differently.

Now circling back to your concrete question: if linux all of a sudden needs to do physical IO and makes your requests wait on that, instead of doing logical IO (which your session also waits on, but for significantly less time), you might suddenly see all kinds of IO-related waits. Those waits happen for the fast logical IOs too, but they take so little time that you never see them. This is where wait event occurrence counts and accumulated times would help significantly; currently we can only sample the wait event state, so much of this is never seen, and thus not known.

Frits Hoogland




On 26 Jun 2025, at 08:47, James Pang <jamespang886@gmail.com> wrote:

   We faced this issue 3 times this week, each time lasting only 2 seconds, so it is not easy to run perf during peak business time to capture it; anyway, I will try. Before that, I want to understand whether the "os page cache" or the "pg buffer cache" can contribute to the time spent in the "extend" and "DataFileRead" wait events, and whether bgwriter or checkpointer flushing data to disk can impact that too. We have bgwriter enabled, and we saw the checkpointer get scheduled by "wal" during that time, so I increased max_wal_size to make checkpoints scheduled less frequently.

Thanks,

James 


Frits Hoogland <frits.hoogland@gmail.com> wrote on Thu, Jun 26, 2025 at 2:40 PM:
Okay. So it's a situation that is reproducible.
And as was mentioned, the system time (percentage) is very high.
Is this a physical machine or a virtual machine?

The next thing to do is to use perf to record for about 20 seconds or so during a period when you see this behavior (perf record -g, to capture the backtraces as well).
This samples the backtraces of on-CPU tasks, from which you can then derive what they are doing via perf report; you should see lots of tasks in kernel space, and perf report will show what they are doing there.

Frits Hoogland




On 26 Jun 2025, at 04:32, James Pang <jamespang886@gmail.com> wrote:

Thanks for your suggestions. We have %iowait from sar too, copied here; checking with the infra team, we did not find abnormal IO activity either.
02:00:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:00:03 PM  all         15.92    0.00   43.02    0.65    0.76    2.56    0.00    0.00    0.00   37.09
02:00:03 PM    0         17.59    0.00   46.73    1.01    0.50    0.50    0.00    0.00    0.00   33.67
02:00:03 PM    1          9.50     0.00   61.50    0.50    0.50    1.00    0.00    0.00    0.00   27.00
02:00:03 PM    2          20.71    0.00   44.44    1.01    0.51    0.51    0.00    0.00    0.00   32.83
02:00:03 PM    3         14.00    0.00   51.50    2.00    1.00    1.00    0.00    0.00    0.00   30.50
02:00:03 PM    4          6.57     0.00   52.53    0.51    0.51    3.54    0.00    0.00    0.00   36.36
02:00:03 PM    5          10.20    0.00   49.49    1.02    1.53    0.00    0.00    0.00    0.00   37.76
02:00:03 PM    6          27.64    0.00   41.21    0.50    0.50    0.50    0.00    0.00    0.00   29.65
02:00:03 PM    7          9.05    0.00   50.75    0.50    1.01    0.50    0.00    0.00    0.00   38.19
02:00:03 PM    8          12.18    0.00   49.75    0.51    0.51    0.51    0.00    0.00    0.00   36.55
02:00:03 PM    9          13.00    0.00    9.50    0.50    1.50   15.50    0.00    0.00    0.00   60.00
02:00:03 PM   10         15.58    0.00   46.23    0.00    0.50    0.50    0.00    0.00    0.00   37.19
02:00:03 PM   11          20.71    0.00   10.10    0.00    1.01   14.14    0.00    0.00    0.00   54.04
02:00:03 PM   12         21.00    0.00   37.00    0.50    1.00    1.00    0.00    0.00    0.00   39.50
02:00:03 PM   13         13.57    0.00   45.73    1.01    1.01    1.01    0.00    0.00    0.00   37.69
02:00:03 PM   14   18.18    0.00   39.39    1.01    0.51    0.51    0.00    0.00    0.00   40.40
02:00:03 PM   15   14.00    0.00   49.50    0.50    0.50    3.50    0.00    0.00    0.00   32.00
02:00:03 PM   16   19.39    0.00   39.80    1.02    1.53    0.51    0.00    0.00    0.00   37.76
02:00:03 PM   17   16.75    0.00   45.18    1.52    1.02    2.54    0.00    0.00    0.00   32.99
02:00:03 PM   18   12.63    0.00   50.00    0.00    1.01    0.00    0.00    0.00    0.00   36.36
02:00:03 PM   19    5.56    0.00   82.32    0.00    0.00    0.00    0.00    0.00    0.00   12.12
02:00:03 PM   20   15.08    0.00   48.24    0.50    0.50    3.52    0.00    0.00    0.00   32.16
02:00:03 PM   21   17.68    0.00    9.09    0.51    1.52   13.64    0.00    0.00    0.00   57.58
02:00:03 PM   22   13.13    0.00   43.94    0.51    0.51    0.51    0.00    0.00    0.00   41.41
02:00:03 PM   23   14.07    0.00   42.71    0.50    0.50    0.50    0.00    0.00    0.00   41.71
02:00:03 PM   24   13.13    0.00   41.92    1.01    0.51    0.51    0.00    0.00    0.00   42.93
02:00:03 PM   25   16.58    0.00   47.74    0.50    1.01    0.50    0.00    0.00    0.00   33.67
02:00:03 PM   26   16.58    0.00   46.73    0.50    1.01    0.50    0.00    0.00    0.00   34.67
02:00:03 PM   27   45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:00:03 PM   28    6.06    0.00   32.32    0.00    0.51   13.13    0.00    0.00    0.00   47.98
02:00:03 PM   29   13.93    0.00   44.78    1.00    1.00    0.50    0.00    0.00    0.00   38.81
02:00:03 PM   30   11.56    0.00   57.79    0.00    0.50    1.01    0.00    0.00    0.00   29.15
02:00:03 PM   31   33.85    0.00    9.23    0.51    1.54    0.51    0.00    0.00    0.00   54.36
02:00:03 PM   32   30.15    0.00   41.71    0.50    0.50    1.51    0.00    0.00    0.00   25.63

Thanks,

James 

Frits Hoogland <frits.hoogland@gmail.com> wrote on Wed, Jun 25, 2025 at 10:27 PM:


> On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
>> "extend". It lasts about 2 seconds (based on pg_stat_activity queries). During the
>> wait, "%sys" CPU increases to 80%, but "iostat" shows no high IOPS and no increase
>> in read/write latency.
>
> Run "sar -P all 1" and see if "%iowait" is high.
I would (strongly) advise against the use of iowait as an indicator. It is a kernel approximation of time spent in IO that cannot be used in any sensible way, other than to say that you are possibly doing IO.
First of all, iowait is not a kernel state, and therefore it's taken from idle. This means that if there is no, or too little, idle time, iowait that should be there is gone.
Second, the conversion of idle time to iowait is done for synchronous IO calls only. That is currently not a problem for postgres, because it uses exactly that, but in the future it might be.
Very roughly put, the kernel keeps a counter of tasks currently in certain system IO calls and then tries to express that as iowait. The time in iowait can't be used to calculate any IO facts.

In that sense, it is in the same area as the load figure: indicative, but mostly useless, because it doesn't give you any facts about what it is expressing.
>
> Check if you have transparent hugepages enabled:
>
>  cat /sys/kernel/mm/transparent_hugepage/enabled
>
> If they are enabled, disable them and see if it makes a difference.
>
> I am only guessing here.
Absolutely. Anything that uses significant amounts of memory and was not written to take advantage of transparent hugepages will probably see more downsides from THP than benefits.
>
>> many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
>> and "DataFileRead"; there are AFTER DELETE triggers on this table that insert into/delete
>> from other tables.
>
> One thing that almost certainly would improve your situation is to run fewer
> concurrent statements, for example by using a reasonably sized connection pool.
This is true if the limits of the IO device, or of anything on the way to the IO device or devices, are hit.
And in general, high "%sys" (lots of time spent in kernel mode, i.e. system time) indicates lots of time spent in system calls, which is what the read and write calls in postgres are.
Therefore these figures suggest blocking on IO, for which Laurenz's advice to lower the number of concurrent sessions doing IO generally makes sense.
A more nuanced analysis: if IO requests get queued, the tasks wait in 'D' state in linux, which by definition is off CPU, and thus they do not spend CPU (system/kernel) time.

What sounds suspicious is that you indicate you see no significant change in the amount of IO in iostat.

In order to understand this, you will first have to carefully identify the actual physical IO devices that you are using for postgres IO.
In current linux this can be tricky, depending on what the hardware or virtual machine looks like and how the disks are arranged in linux.
What you need to determine is which actual disk devices are used, and what their limits are.
The limits for any disk are IOPS (operations per second) and MBPS (megabytes per second -> bandwidth).

There is an additional thing to realize, which makes this really tricky: postgres uses buffered IO for common IO.
Buffered IO means any read or write goes through the linux page cache, and reads or writes can be served from the page cache if possible.

So in your case, even if the database performs identical read or write requests, the number of read and write IOs served from the cache can differ, which can make an enormous difference in how fast these requests are served. If the operating system somehow ends up on the physical IO path, you will see significant amounts of time spent on that, which will show up as IO-related wait events.

Not a simple answer, but this is how it works.

So I would suggest comparing the well-performing situation with the badly-performing one while the database is doing the same work.


>
> Yours,
> Laurenz Albe
>
>



Re: many sessions waiting DataFileRead and extend

From: Robert Haas
Date:
On Thu, Jun 26, 2025 at 2:48 AM James Pang <jamespang886@gmail.com> wrote:
>    we faced this issue 3 times this week, each time lasting only 2 seconds, so it is not easy to run perf during
> peak business time to capture it; anyway, I will try. Before that, I want to understand whether the "os page cache"
> or the "pg buffer cache" can contribute to the time spent in the "extend" and "DataFileRead" wait events, and
> whether bgwriter or checkpointer flushing data to disk can impact that too. We have bgwriter enabled, and we saw
> the checkpointer get scheduled by "wal" during that time, so I increased max_wal_size to make checkpoints
> scheduled less frequently.

To be honest, given the additional information that this only lasted 2
seconds, this doesn't sound terribly abnormal to me. My guess would be
that a bunch of processes all tried to insert into the same relation
at the same time, so PostgreSQL had to make it bigger, and the OS couldn't
extend the file quite quickly enough.

I don't want to make it sound like there's absolutely no problem here:
PostgreSQL's relation extension logic isn't as performant as we'd like
it to be, and there have been several attempts to fix it over the
years. I don't know if the last attempt was later than PG14, so you
might have more success on a newer version, but I'm also not really
sure if the limitations of the implementation are your main problem
here.

If I were troubleshooting this, I'd try to capture a snapshot of what
queries were running at the moment the problem happens. If, for
example, all the sessions suddenly decide to insert into the same
table at the same time, it might be good to think about whether
there's a way to avoid that.
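
For example, something like this could capture those snapshots (just a sketch; the table name is made up):

  -- Sketch: an empty copy of pg_stat_activity's columns plus a sample timestamp.
  CREATE TABLE IF NOT EXISTS activity_samples AS
  SELECT now() AS sample_time, * FROM pg_stat_activity WHERE false;

  -- Run this every second or so (cron, psql \watch, or a small script) during business hours.
  INSERT INTO activity_samples
  SELECT now(), * FROM pg_stat_activity WHERE state = 'active';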

--
Robert Haas
EDB: http://www.enterprisedb.com