Re: many sessions waiting DataFileRead and extend - Mailing list pgsql-performance
From: James Pang
Subject: Re: many sessions waiting DataFileRead and extend
Msg-id: CAHgTRfdt9Yan7XNaXjx1_ES045R4_wBspuvwPXtTEh2wdGfLaQ@mail.gmail.com
In response to: Re: many sessions waiting DataFileRead and extend (Frits Hoogland <frits.hoogland@gmail.com>)
List: pgsql-performance
Thanks for your explanation. From the PostgreSQL perspective, is it possible to see the bgwriter or checkpointer blocking backend processes from reading/writing, or vice versa?
Thanks,
James
On 26 Jun 2025, at 16:07, Frits Hoogland <frits.hoogland@gmail.com> wrote:
Postgres lives as a process in Linux and keeps its own cache, which it tries to use as much as possible for data. This is Postgres shared buffers, commonly called the buffer cache. For WAL, sessions write to the WAL buffer (separate from the Postgres buffer cache), which must be written to disk upon commit with default settings. Please mind that WAL writes are commonly performed by the WAL writer, but can be done from backends too.

(Slightly simplified:) When Postgres needs to read data for processing, it has to do that via the Postgres buffer cache: it uses a buffer from the cache if it can find the requested buffer there, and otherwise it has to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that, a description of the change is placed in the WAL buffer). As long as buffers are available, data changes are performed on buffers in the buffer cache. However, the buffer cache is a finite number of buffers, and eventually changed buffers must make it to disk (which happens upon checkpoint, and for other reasons). The important part here is that the checkpointer, the bgwriter and the backends all perform reads and writes. That is all Postgres sees: it only knows it is performing a read or write request.

These reads and writes are requested from the operating system, and the operating system essentially applies the same technique for performance as Postgres does: if an IO is done buffered (which is the default), it goes through a page in the Linux page cache. For reads, if the request happens to find the pages it needs in the Linux page cache, they can be served to the Postgres process without an actual read request to the storage. Buffered writes always go to the Linux page cache and are written back asynchronously by kernel writer threads. (This is the essence; there is much more detail.) Obviously, how much Linux can cache depends on the amount of memory available for caching and on whether the pages for the request are already in the page cache. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of available memory (commonly misunderstood as being taken from total memory).

Now, the point I am trying to make: from the Postgres point of view, it is impossible to tell whether a read or write request was served from the OS cache or needed physical IO. However, the ability of the operating system to serve from cache depends essentially on the availability of free and file-backed memory, and therefore on the general usage of the operating system. Any action on the operating system that needs a significant amount of memory will reduce the memory available for caching.
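For example, one rough way to see which of these process classes is doing the writes is to sample the cumulative counters in pg_stat_bgwriter and to glance at the kernel's dirty-page accounting. A minimal sketch (the database name "postgres" is a placeholder; the column names are those of PostgreSQL 14):

    # Sample pg_stat_bgwriter twice (before/after an incident) and diff the counters
    # to attribute buffer writes to checkpointer / bgwriter / backends.
    psql -X -d postgres -c "
      SELECT now() AS sample_time,
             buffers_checkpoint,       -- buffers written by the checkpointer
             buffers_clean,            -- buffers written by the background writer
             buffers_backend,          -- buffers written directly by backends
             buffers_backend_fsync,
             maxwritten_clean
      FROM pg_stat_bgwriter;"

    # On the OS side: current writeback thresholds and the dirty-page volume.
    sysctl vm.dirty_ratio vm.dirty_background_ratio
    grep -E 'Dirty|Writeback' /proc/meminfo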
All in all, this means that if you do something significant on the operating system, it is perfectly possible to perceive that from Postgres as a sudden, drastic change in IO latency, while from the Postgres perspective you didn't do anything different.

Now circling back to your concrete question: if such a thing happens, so that Linux suddenly needs to do physical IOs and lets your request wait on those instead of doing logical IO (which your session also has to wait for, but for significantly less time), you may suddenly see all kinds of IO-related waits. You encounter these waits with fast logical IOs too, but they take so little time that you don't see them. This is where wait event occurrence and time accumulation would help significantly; currently we can only sample the wait event state, and thus much of this is never seen and therefore not known. (A crude sampling sketch follows the sar output further below.)

Frits Hoogland

On 26 Jun 2025, at 08:47, James Pang <jamespang886@gmail.com> wrote:

We faced this issue 3 times this week, and each time it lasted only 2 seconds, so it is not easy to run perf during peak business time to capture it; anyway, I will try. Before that, I want to understand whether the "OS page cache" or the "PG buffer cache" can contribute to the wait_event time for "extend" and "DataFileRead", or whether the bgwriter or checkpointer flushing data to disk can impact that too. We have the bgwriter enabled, and we see checkpoints being triggered by "wal" during that time, so I just increased max_wal_size to spread checkpoints over a longer period.

Thanks,
James

On 26 Jun 2025, at 14:40, Frits Hoogland <frits.hoogland@gmail.com> wrote:

Okay. So it's a situation that is reproducible.
And as was mentioned, the system time (percentage) is very high.
Is this a physical machine, or a virtual machine?
The next thing to do is use perf to record about 20 seconds or so during a period when you see this behavior (perf record -g, taking the backtraces with it). This records (samples) the backtraces of on-CPU tasks, from which you can then derive what they are doing (you should see lots of tasks in kernel space) and what that is, using perf report.

Frits Hoogland

On 26 Jun 2025, at 04:32, James Pang <jamespang886@gmail.com> wrote:

Thanks for your suggestions. We have iowait from the sar command too (copied here); checking with the infra team did not find abnormal IO activities either.

02:00:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:00:03 PM all 15.92 0.00 43.02 0.65 0.76 2.56 0.00 0.00 0.00 37.09
02:00:03 PM 0 17.59 0.00 46.73 1.01 0.50 0.50 0.00 0.00 0.00 33.67
02:00:03 PM 1 9.50 0.00 61.50 0.50 0.50 1.00 0.00 0.00 0.00 27.00
02:00:03 PM 2 20.71 0.00 44.44 1.01 0.51 0.51 0.00 0.00 0.00 32.83
02:00:03 PM 3 14.00 0.00 51.50 2.00 1.00 1.00 0.00 0.00 0.00 30.50
02:00:03 PM 4 6.57 0.00 52.53 0.51 0.51 3.54 0.00 0.00 0.00 36.36
02:00:03 PM 5 10.20 0.00 49.49 1.02 1.53 0.00 0.00 0.00 0.00 37.76
02:00:03 PM 6 27.64 0.00 41.21 0.50 0.50 0.50 0.00 0.00 0.00 29.65
02:00:03 PM 7 9.05 0.00 50.75 0.50 1.01 0.50 0.00 0.00 0.00 38.19
02:00:03 PM 8 12.18 0.00 49.75 0.51 0.51 0.51 0.00 0.00 0.00 36.55
02:00:03 PM 9 13.00 0.00 9.50 0.50 1.50 15.50 0.00 0.00 0.00 60.00
02:00:03 PM 10 15.58 0.00 46.23 0.00 0.50 0.50 0.00 0.00 0.00 37.19
02:00:03 PM 11 20.71 0.00 10.10 0.00 1.01 14.14 0.00 0.00 0.00 54.04
02:00:03 PM 12 21.00 0.00 37.00 0.50 1.00 1.00 0.00 0.00 0.00 39.50
02:00:03 PM 13 13.57 0.00 45.73 1.01 1.01 1.01 0.00 0.00 0.00 37.69
02:00:03 PM 14 18.18 0.00 39.39 1.01 0.51 0.51 0.00 0.00 0.00 40.40
02:00:03 PM 15 14.00 0.00 49.50 0.50 0.50 3.50 0.00 0.00 0.00 32.00
02:00:03 PM 16 19.39 0.00 39.80 1.02 1.53 0.51 0.00 0.00 0.00 37.76
02:00:03 PM 17 16.75 0.00 45.18 1.52 1.02 2.54 0.00 0.00 0.00 32.99
02:00:03 PM 18 12.63 0.00 50.00 0.00 1.01 0.00 0.00 0.00 0.00 36.36
02:00:03 PM 19 5.56 0.00 82.32 0.00 0.00 0.00 0.00 0.00 0.00 12.12
02:00:03 PM 20 15.08 0.00 48.24 0.50 0.50 3.52 0.00 0.00 0.00 32.16
02:00:03 PM 21 17.68 0.00 9.09 0.51 1.52 13.64 0.00 0.00 0.00 57.58
02:00:03 PM 22 13.13 0.00 43.94 0.51 0.51 0.51 0.00 0.00 0.00 41.41
02:00:03 PM 23 14.07 0.00 42.71 0.50 0.50 0.50 0.00 0.00 0.00 41.71
02:00:03 PM 24 13.13 0.00 41.92 1.01 0.51 0.51 0.00 0.00 0.00 42.93
02:00:03 PM 25 16.58 0.00 47.74 0.50 1.01 0.50 0.00 0.00 0.00 33.67
02:00:03 PM 26 16.58 0.00 46.73 0.50 1.01 0.50 0.00 0.00 0.00 34.67
02:00:03 PM 27 45.50 0.00 54.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:00:03 PM 28 6.06 0.00 32.32 0.00 0.51 13.13 0.00 0.00 0.00 47.98
02:00:03 PM 29 13.93 0.00 44.78 1.00 1.00 0.50 0.00 0.00 0.00 38.81
02:00:03 PM 30 11.56 0.00 57.79 0.00 0.50 1.01 0.00 0.00 0.00 29.15
02:00:03 PM 31 33.85 0.00 9.23 0.51 1.54 0.51 0.00 0.00 0.00 54.36
02:00:03 PM 32 30.15 0.00 41.71 0.50 0.50 1.51 0.00 0.00 0.00 25.63

Thanks,
James
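Related to the remark above about only being able to sample the wait event state: during an incident the wait events can be polled from pg_stat_activity in a loop. A crude sketch (interval, iteration count and database name are placeholders):

    # Poll current wait events; v14 exposes only the instantaneous state,
    # not cumulative wait times, so frequent sampling is the best available.
    for i in $(seq 1 20); do
      psql -X -At -d postgres -c "
        SELECT to_char(now(), 'HH24:MI:SS.MS'), wait_event_type, wait_event, count(*)
        FROM pg_stat_activity
        WHERE state = 'active' AND wait_event IS NOT NULL
        GROUP BY 1, 2, 3
        ORDER BY 4 DESC;"
      sleep 0.5
    done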
On 25 Jun 2025, at 22:27, Frits Hoogland <frits.hoogland@gmail.com> wrote:

> On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and
>> "extend"; it lasts about 2 seconds (based on a pg_stat_activity query). During the
>> waiting time, "%sys" CPU increased to 80%, but from "iostat" there are no high IOPS
>> and no increase in IO read/write latency either.
>
> Run "sar -P all 1" and see if "%iowait" is high.
I would (strongly) advise against the use of iowait as an indicator. It is a kernel approximation of time spent in IO, which cannot be used in any sensible way other than to suggest that you are possibly doing IO.
First of all, iowait is not a true kernel task state; it is carved out of idle time. This means that if there is no, or too little, idle time, the iowait that should be there is gone.
Second, the accounting that turns idle time into iowait is done for synchronous IO calls only. This is currently not a problem for Postgres, because that is exactly what it uses, but in the future it might be.
Very roughly put, the kernel keeps a counter of tasks currently in certain IO-related system calls and then tries to express that as iowait. The time in iowait cannot be used to calculate any IO facts.
In that sense it is in the same category as the load figure: indicative, but mostly useless, because it does not give you any facts about what it is expressing.
>
> Check if you have transparent hugepages enabled:
>
> cat /sys/kernel/mm/transparent_hugepage/enabled
>
> If they are enabled, disable them and see if it makes a difference.
>
> I am only guessing here.
Absolutely. Anything that uses significant amounts of memory and was not built to take advantage of transparent hugepages will probably experience more downsides from THP than benefits.
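For reference, a minimal sketch for checking and temporarily disabling THP at runtime; this requires root and does not persist across a reboot (a boot parameter or a tuned profile is needed for a permanent change):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag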
>
>> Many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend"
>> and "DataFileRead"; there are "after delete" triggers on this table that insert into and
>> delete from other tables.
>
> One thing that almost certainly would improve your situation is to run fewer
> concurrent statements, for example by using a reasonably sized connection pool.
This is true if the limits of the IO device, or of anything on the path to the IO device or devices, are hit.
And in general, a high "%sys" (lots of time spent in kernel mode, i.e. system time) indicates lots of time spent in system calls, which is exactly what the read and write calls in Postgres are.
Therefore these figures suggest blocking on IO, for which Laurenz's advice to lower the number of concurrent sessions doing IO generally makes sense.
A more nuanced analysis: if IO requests get queued, the tasks wait in 'D' state in Linux, which by definition is off CPU and thus does not spend CPU (system/kernel) time.
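As an illustration, a quick way to watch for such queuing is to count tasks in uninterruptible sleep ('D' state) once per second; a sketch (sampling interval and count are arbitrary):

    # Tasks in 'D' state are off CPU, typically stuck waiting on disk IO;
    # a sudden spike here lines up with IO queuing.
    for i in $(seq 1 10); do
      date '+%H:%M:%S'
      ps -eo state=,pid=,comm= | awk '$1 == "D"'
      sleep 1
    done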
What sounds suspicious is that you indicate you see no significant change in the amount of IO in iostat.
In order to understand this, you will first have to carefully find the actual physical IO devices that you are using for Postgres IO.
In current Linux this can be tricky, depending on what the hardware or virtual machine looks like and how the disks are arranged in Linux.
What you need to determine is which actual disk devices are used, and what their limits are.
The limits for any disk are IOPS (operations per second) and MBPS (megabytes per second, i.e. bandwidth).
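For example, something along these lines maps the data directory to its device and watches the per-device figures; a sketch with a placeholder data directory path, assuming sysstat's iostat is installed:

    df /var/lib/pgsql/14/data            # which filesystem/device holds the data directory
    lsblk -o NAME,TYPE,SIZE,MOUNTPOINT   # how that device is layered (LVM, multipath, ...)
    iostat -xm 1                         # r/s, w/s = IOPS; rMB/s, wMB/s = bandwidth; await, %util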
There is an additional thing to realize, which makes this really tricky: for common IO, Postgres uses buffered IO.
Buffered IO means any read or write goes through the Linux page cache, and reads or writes can be served from the page cache if possible.
So in your case, even if you made the database perform identical read or write requests, the number of read and write IOs served from the cache could differ, which can make an enormous difference in how fast these requests are served. If somehow the operating system is pushed to use the physical IO path, you will see significant time spent on that, which will show up as IO-related wait events.
Not a simple answer, but this is how it works.
So I would suggest comparing the situation where the same work is considered well performing with the situation where it performs badly.
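One possible way to make that comparison is to snapshot database-level IO counters next to device-level IO during a good and a bad interval; a sketch (the database name is a placeholder):

    # blks_hit = blocks served from shared buffers; blks_read = blocks requested
    # from the OS (which may still be OS page cache hits).
    psql -X -d postgres -c "
      SELECT now(), datname, blks_read, blks_hit
      FROM pg_stat_database
      WHERE datname = current_database();"

    # Reads/writes that actually hit the block devices during the same interval.
    iostat -xm 1 5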
>
> Yours,
> Laurenz Albe
>
>