Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug - Mailing list pgsql-general
From: hubert depesz lubaczewski
Subject: Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
Msg-id: aKiGtrwmEMBMXBFr@depesz.com
In response to: Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses:
  Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
  Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
List: pgsql-general
On Fri, Aug 22, 2025 at 07:33:03AM -0700, Adrian Klaver wrote:
> On 8/22/25 05:37, hubert depesz lubaczewski wrote:
> > On Thu, Aug 21, 2025 at 11:17:27AM -0700, Adrian Klaver wrote:
> > > Have you looked at?:
> > > https://www.pgbouncer.org/changelog.html#pgbouncer-124x
> > > To see if anything stands out.
> > > Then there is:
> > > https://www.pgbouncer.org/config.html#max_prepared_statements
> > > The below may also be worth looking at:
> > > https://github.com/pgbouncer/pgbouncer/pull/1144
> > > I can't help thinking that there is a caching issue at stake, though that
> > > is just a guess.
> >
> > Will check, but we don't use prepared statements, so this doesn't seem
> > to be part of the problem.
>
> In your OP there is:
>
> "Every now and then (usually every 3-5 minutes, but not through the whole
> day), we see situations where every query suddenly takes ~ 1 second.
>
> I'm talking about "prepare" for preparing statements. binds. also
> "discard all".
> "
>
> Is that referring to something else?

No. Same problem. The thing is that it comes and goes.

*BUT* I got a repeatable case today. It is breaking on its own every ~ 5 minutes.

So I made test.sql with:

\timing on
SELECT 'SELECT now(), pg_sleep(.05);' FROM generate_series(1,100000) i \gexec

and ran it like this:

psql -p 5432 -d the_db -f test.sql -qAtX | perl -pe 's/\|\s*$/ :: /'

This bypassed pgbouncer and connected via unix socket, so the whole thing got simpler.

After 4 minutes of 50 ms runtimes I got:

2025-08-22 14:44:14.013541+00 :: Time: 50.090 ms
2025-08-22 14:44:14.063636+00 :: Time: 50.088 ms
2025-08-22 14:44:14.11373+00 :: Time: 50.085 ms
2025-08-22 14:44:14.163822+00 :: Time: 50.099 ms
2025-08-22 14:44:14.213927+00 :: Time: 1393.932 ms (00:01.394)
2025-08-22 14:44:15.607872+00 :: Time: 50.171 ms
2025-08-22 14:44:15.658097+00 :: Time: 50.209 ms
2025-08-22 14:44:15.708353+00 :: Time: 50.212 ms
2025-08-22 14:44:15.758589+00 :: Time: 50.170 ms

This is pg_sleep. There are no locks, nothing else.
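As an aside, outliers in timing output like the above can be picked out mechanically instead of by eye. A minimal sketch, assuming the psql output was captured to a file; the file name, the inlined sample lines, and the 100 ms threshold are all illustrative choices, not from the original run:

```shell
#!/bin/sh
# Hypothetical sample of the "timestamp :: Time: N ms" output; in practice
# this file would be the captured psql output from the test run.
cat > timing.log <<'EOF'
2025-08-22 14:44:14.163822+00 :: Time: 50.099 ms
2025-08-22 14:44:14.213927+00 :: Time: 1393.932 ms (00:01.394)
2025-08-22 14:44:15.607872+00 :: Time: 50.171 ms
EOF

# Keep only queries that took noticeably longer than the 50 ms pg_sleep;
# the 100 ms cutoff is arbitrary.
awk -F'Time: ' '/Time:/ { split($2, a, " "); if (a[1] + 0 > 100) print }' timing.log
```

With a long capture this gives the list of stall timestamps, which can then be correlated against other logs.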
I think this disproves the idea that the problem is about pgbouncer, and it kinda looks like a thing in pg?

Interestingly: I ran strace on the pg backend that was doing this loop, like this:

strace -tt -T -ff -p 1266521 -s 256 -o a.log

And it showed something really strange (long sendto lines are cut off as captured):

…
14:57:00.864684 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000007>
14:57:00.864739 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050063>
14:57:00.914823 sendto(9, "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 14:57:00.86+00\0\0\0\0071266521\0\0\0\0C\0\0\0\
14:57:00.914849 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000007>
14:57:00.914891 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050064>
14:57:00.964978 sendto(9, "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 14:57:00.91+00\0\0\0\0071266521\0\0\0\0C\0\0\0\
14:57:00.965006 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000007>
14:57:00.965052 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050063>
14:57:01.015133 futex(0xffe7a6636538, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0 <1.354927>
14:57:02.370180 write(2, "\0\0\36\1YS\23\0T2025-08-22 14:57:02.370 UTC,\"postgres\",\"the.db\",1266521,\"[local]\",68a884fb.135359,5,\"SELECT\",2025-08-22 14:55:55 UTC,85/0,0,LOG,00000,\"duration: 1405.106 ms statement: SELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);
14:57:02.370521 sendto(8, "\2\0\0\0@\0\0\0\216A\0\0\0\0\0\0\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\255\221\33\0\0\0\0\0\262\341\6\0\0\0\0\0\0\0\0\0\0\0\0\0", 64, 0, NULL, 0) = 64 <0.000041>
14:57:02.370639 sendto(9, "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 14:57:00.97+00\0\0\0\0071266521\0\0\0\0C\0\0\0\
14:57:02.371063 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000019>
14:57:02.371215 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050090>
14:57:02.421381 sendto(9, "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 14:57:02.37+00\0\0\0\0071266521\0\0\0\0C\0\0\0\
14:57:02.421480 recvfrom(9, 0xaaaae7b3fc08, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
14:57:02.421507 epoll_pwait(4, [{EPOLLIN, {u32=470567512, u64=187651886697048}}], 1, -1, NULL, 8) = 1 <0.000037>
14:57:02.421570 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000012>
14:57:02.421673 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050099>
14:57:02.471866 sendto(9, "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 14:57:02.42+00\0\0\0\0071266521\0\0\0\0C\0\0\0\
14:57:02.471961 recvfrom(9, 0xaaaae7b3fc08, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
…

As in, normal execution is: recvfrom, an epoll_pwait that takes 0.05 s, and sendto. All clear, nothing fancy. *BUT* in the case of this one problematic query there is a futex() call that took 1.354 s?!
Checked, and in the ~ 800 straced queries this is the only futex() call?!

Any idea what it could be?

Best regards,

depesz