Re: Get rid of WALBufMappingLock - Mailing list pgsql-hackers
From: Alexander Korotkov
Subject: Re: Get rid of WALBufMappingLock
Msg-id: CAPpHfdsXCRg-arg3CFAYR4PXWTc7G1qY=2V7Yap3j83_5qe+Hg@mail.gmail.com
In response to: Re: Get rid of WALBufMappingLock (Alexander Korotkov <aekorotkov@gmail.com>)
Responses: Re: Get rid of WALBufMappingLock; Re: Get rid of WALBufMappingLock
List: pgsql-hackers
On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote:
> > > This timeout failure on hachi looks suspicious as well:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03
> > >
> > > Might be relevant that they are both aarch64?
> >
> > Just logged into the host.  The logs of the timed out run are still
> > around, and the last information I can see is from lastcommand.log,
> > which seems to have frozen in time when the timeout has begun its
> > vacuuming work:
> > ok 73        + index_including_gist                     353 ms
> > # parallel group (16 tests):  create_cast errors create_aggregate drop_if_exists infinite_recurse
> >
> > gokiburi is on the same host, and it is currently frozen in time when
> > trying to fetch a WAL buffer.  One of the stack traces:
> > #2  0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0,
> > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at
> > latch.c:1571
> > #3  WaitEventSetWait (set=0xd34ce0, timeout=-1,
> > occurred_events=occurred_events@entry=0xffffffffadd8,
> > nevents=nevents@entry=1, wait_event_info=<optimized out>,
> > wait_event_info@entry=134217781) at latch.c:1519
> > #4  0x000000000084e964 in WaitLatch (latch=<optimized out>,
> > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
> > wait_event_info=wait_event_info@entry=134217781) at latch.c:538
> > #5  0x000000000085d2f8 in ConditionVariableTimedSleep
> > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at
> > condition_variable.c:163
> > #6  0x000000000085d1ec in ConditionVariableSleep
> > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98
> > #7  0x000000000055f4f4 in AdvanceXLInsertBuffer
> > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at
> > xlog.c:2224
> > #8  0x0000000000568398 in GetXLogBuffer (ptr=ptr@entry=112064880,
> > tli=tli@entry=1) at xlog.c:1710
> > #9  0x000000000055c650 in CopyXLogRecordToWAL (write_len=80,
> > isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>,
> > EndPos=<optimized out>, tli=1) at xlog.c:1245
> > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>,
> > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized
> > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928
> > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020',
> > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523
> > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950,
> > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>,
> > parent=<optimized out>,
> >
> > So, yes, something looks really wrong with this patch.  Sounds
> > plausible to me that some other buildfarm animals could be stuck
> > without their owners knowing about it.  It's proving to be a good idea
> > to force a timeout value in the configuration file of these animals..
>
> Tom, Michael, thank you for the information.
> This patch will be better tested before next attempt.

It seems that I managed to reproduce the issue on my Raspberry Pi 4.
After running our test suite in a loop for two days, I found one
timeout.  I have a hypothesis about why it might happen: we have no
protection against two backends in parallel getting their ReservedPtr
values mapped to the same XLog buffer, as the tiny illustration below
shows.
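To make the collision concrete, here is a small standalone program
(just an illustration of mine, not code from the patch) mimicking how
xlog.c maps an LSN to a buffer slot via XLogRecPtrToBufIdx(); the
buffer count is a made-up value:

/*
 * Standalone illustration of how two reserved WAL positions can map to
 * the same buffer slot.  xlog.c does this with XLogRecPtrToBufIdx():
 * (ptr / XLOG_BLCKSZ) % number-of-buffers.
 */
#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ  8192   /* WAL block size */
#define NBUFFERS     512    /* made-up number of WAL buffers */

/* Same idea as XLogRecPtrToBufIdx() in xlog.c */
static uint64_t
buf_idx(uint64_t recptr)
{
    return (recptr / XLOG_BLCKSZ) % NBUFFERS;
}

int
main(void)
{
    uint64_t a = 42 * (uint64_t) XLOG_BLCKSZ;
    uint64_t b = a + (uint64_t) NBUFFERS * XLOG_BLCKSZ;  /* one ring later */

    /* Both print 42: without a guard, the backend reserving "b" could
     * start reinitializing the page while "a" is still in use. */
    printf("idx(a) = %llu, idx(b) = %llu\n",
           (unsigned long long) buf_idx(a),
           (unsigned long long) buf_idx(b));
    return 0;
}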
I've talked to Yura off-list about that.  He pointed out that
XLogWrite() should issue a PANIC in that case, which we didn't observe.
However, I'm not sure this analysis is complete.  One way or another,
we need protection against this situation.

The updated patch is attached.  Now, after acquiring ReservedPtr, it
waits until OldPageRqstPtr gets initialized.  Additionally, I had to
implement a more accurate calculation of OldPageRqstPtr; a rough sketch
of the idea follows.  I'm running the tests with the new patch on my
Raspberry Pi in a loop.  Let's see how it goes.
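To sketch what such a guard could look like (an illustration only, not
the actual patch hunk: the condition-variable and wait-event names here
are placeholders, while XLogRecPtrToBufIdx(), xlblocks[], and the
atomics are existing internals):

/*
 * Illustration only -- not the patch itself.  After atomically
 * reserving ReservedPtr, compute the end LSN the buffer's previous
 * occupant must have: exactly one trip around the ring earlier.  Then
 * wait until xlblocks[] actually shows that value; reading xlblocks[]
 * without this wait could return a page that another backend is still
 * initializing.
 */
uint64      bufidx = XLogRecPtrToBufIdx(ReservedPtr);
int         nbuffers = XLogCtl->XLogCacheBlck + 1;   /* ring size */

/* End of the page we are about to install at ReservedPtr */
XLogRecPtr  NewPageEndPtr = ReservedPtr - ReservedPtr % XLOG_BLCKSZ +
                            XLOG_BLCKSZ;

/* Accurate OldPageRqstPtr: the old occupant ended one full ring earlier */
XLogRecPtr  OldPageRqstPtr = NewPageEndPtr - (uint64) nbuffers * XLOG_BLCKSZ;

/* Wait until the old page is fully initialized and mapped to this slot */
while (pg_atomic_read_u64(&XLogCtl->xlblocks[bufidx]) != OldPageRqstPtr)
{
    /* placeholder CV/wait-event names; broadcast when a page is finished */
    ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar,
                           WAIT_EVENT_WAL_BUFFER_INIT);
}
ConditionVariableCancelSleep();

/* From here it is safe to request WAL to be written out up to
 * OldPageRqstPtr and then reuse the buffer for the new page. */

------
Regards,
Alexander Korotkov
Supabase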
Attachment