Hi,
I have noticed one of my buildfarm machines - widowbird - did not report
any results since February 17. And it seems to be stuck somewhere in
amcheck:
$ ps ax | grep postgres
1180067 ? Ss 0:02
/mnt/data/buildfarm/buildroot/HEAD/inst/bin/postgres -D data-C
1180069 ? Ss 0:00 postgres: checkpointer
1180070 ? Ss 0:00 postgres: background writer
1180072 ? Ss 0:00 postgres: walwriter
1180073 ? Ss 0:01 postgres: autovacuum launcher
1180074 ? Ss 0:00 postgres: logical replication launcher
1180107 ? Ss 0:05 postgres: buildfarm
contrib_regression_amcheck [local] INSERT
1180111 ? Ss 0:00 postgres: autovacuum worker
1180134 ? Ss 0:00 postgres: autovacuum worker
1180135 ? Ss 0:00 postgres: autovacuum worker
1374029 pts/0 S+ 0:00 grep --color=auto postgres
So there's PID 1180107, executing an insert, but not progressing. The
backtrace looks like this (first couple lines, full backtrace attached):
#0 0x0000007fa64b8ddc in __GI_epoll_pwait (epfd=5, events=0x55ad6285a8,
maxevents=1, timeout=timeout@entry=-1, set=set@entry=0x0) at
../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1 0x0000007fa64b8fe8 in epoll_wait (epfd=<optimized out>,
events=<optimized out>, maxevents=<optimized out>, timeout=timeout@entry=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:32
#2 0x000000558f043588 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ff8ed4e18, cur_timeout=-1, set=0x55ad628540) at
latch.c:1571
#3 WaitEventSetWait (set=0x55ad628540, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ff8ed4e18,
nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=134217781) at latch.c:1519
#4 0x000000558f043778 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781)
at latch.c:538
#5 0x000000558f052274 in ConditionVariableTimedSleep (cv=0x7f9ac9deb0,
timeout=timeout@entry=-1,
wait_event_info=wait_event_info@entry=134217781) at condition_variable.c:163
#6 0x000000558f05286c in ConditionVariableTimedSleep
(wait_event_info=134217781, timeout=-1, cv=<optimized out>) at
condition_variable.c:135
#7 0x000000558ed2fc90 in AdvanceXLInsertBuffer
(upto=upto@entry=608174080, tli=tli@entry=1,
opportunistic=opportunistic@entry=false) at xlog.c:2224
So, it's stuck in AdvanceXLInsertBuffer ... interesting. Another
interesting fact is it's testing 75dfde13639, which is just a couple
commits after 6a2275b895:
commit 6a2275b8953a4462d44daf001bdd60b3d48f0946
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon Feb 17 04:19:01 2025 +0200
Get rid of WALBufMappingLock
Allow multiple backends to initialize WAL buffers concurrently.
This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in
parallel without taking a single LWLock in exclusive mode.
...
which reworked AdvanceXLInsertBuffer() quite a bit, it seems. OTOH the
last (successful) run on widorbird was on eaf502747b, which already
includes 6a2275b895, so maybe it's unrelated.
Is there something else I could collect from the stuck instance, before
I restart it?
regards
--
Tomas Vondra