Re: Get rid of WALBufMappingLock - Mailing list pgsql-hackers
From: Alexander Korotkov
Subject: Re: Get rid of WALBufMappingLock
Msg-id: CAPpHfdsXCRg-arg3CFAYR4PXWTc7G1qY=2V7Yap3j83_5qe+Hg@mail.gmail.com
In response to: Re: Get rid of WALBufMappingLock (Alexander Korotkov <aekorotkov@gmail.com>)
Responses: Re: Get rid of WALBufMappingLock; Re: Get rid of WALBufMappingLock
List: pgsql-hackers
On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote:
> > > This timeout failure on hachi looks suspicious as well:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03
> > >
> > > Might be relevant that they are both aarch64?
> >
> > Just logged into the host.  The logs of the timed out run are still
> > around, and the last information I can see is from lastcommand.log,
> > which seems to have frozen in time when the timeout has begun its
> > vacuuming work:
> > ok 73        + index_including_gist                     353 ms
> > # parallel group (16 tests):  create_cast errors create_aggregate drop_if_exists infinite_recurse
> >
> > gokiburi is on the same host, and it is currently frozen in time when
> > trying to fetch a WAL buffer.  One of the stack traces:
> > #2  0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0,
> > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at
> > latch.c:1571
> > #3  WaitEventSetWait (set=0xd34ce0, timeout=-1,
> > occurred_events=occurred_events@entry=0xffffffffadd8,
> > nevents=nevents@entry=1, wait_event_info=<optimized out>,
> > wait_event_info@entry=134217781) at latch.c:1519
> > #4  0x000000000084e964 in WaitLatch (latch=<optimized out>,
> > wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=-1,
> > wait_event_info=wait_event_info@entry=134217781) at latch.c:538
> > #5  0x000000000085d2f8 in ConditionVariableTimedSleep
> > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at
> > condition_variable.c:163
> > #6  0x000000000085d1ec in ConditionVariableSleep
> > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98
> > #7  0x000000000055f4f4 in AdvanceXLInsertBuffer
> > (upto=upto@entry=112064880, tli=tli@entry=1, opportunistic=false) at
> > xlog.c:2224
> > #8  0x0000000000568398 in GetXLogBuffer (ptr=ptr@entry=112064880,
> > tli=tli@entry=1) at xlog.c:1710
> > #9  0x000000000055c650 in CopyXLogRecordToWAL (write_len=80,
> > isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>,
> > EndPos=<optimized out>, tli=1) at xlog.c:1245
> > #10 XLogInsertRecord (rdata=rdata@entry=0xcc49b0 <hdr_rdt>,
> > fpw_lsn=fpw_lsn@entry=112025520, flags=0 '\000', num_fpi=<optimized
> > out>, num_fpi@entry=0, topxid_included=false) at xlog.c:928
> > #11 0x000000000056b870 in XLogInsert (rmid=rmid@entry=16 '\020',
> > info=<optimized out>, info@entry=16 '\020') at xloginsert.c:523
> > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950,
> > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>,
> > parent=<optimized out>,
> >
> > So, yes, something looks really wrong with this patch.  Sounds
> > plausible to me that some other buildfarm animals could be stuck
> > without their owners knowing about it.  It's proving to be a good idea
> > to force a timeout value in the configuration file of these animals..
>
> Tom, Michael, thank you for the information.
> This patch will be better tested before next attempt.

It seems that I managed to reproduce the issue on my Raspberry Pi 4.
After running our test suite in a loop for two days, I found one
timeout.  I have a hypothesis about why it might happen: we have no
protection against two backends in parallel getting their ReservedPtr
values mapped to the same XLog buffer, as the tiny illustration below
shows.
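To make the collision concrete, here is a small standalone program
(just an illustration of mine, not code from the patch) mimicking how
xlog.c maps an LSN to a buffer slot via XLogRecPtrToBufIdx(); the
buffer count is a made-up value:

/*
 * Standalone illustration of how two reserved WAL positions can map to
 * the same buffer slot.  xlog.c does this with XLogRecPtrToBufIdx():
 * (ptr / XLOG_BLCKSZ) % number-of-buffers.
 */
#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ  8192   /* WAL block size */
#define NBUFFERS     512    /* made-up number of WAL buffers */

/* Same idea as XLogRecPtrToBufIdx() in xlog.c */
static uint64_t
buf_idx(uint64_t recptr)
{
    return (recptr / XLOG_BLCKSZ) % NBUFFERS;
}

int
main(void)
{
    uint64_t a = 42 * (uint64_t) XLOG_BLCKSZ;
    uint64_t b = a + (uint64_t) NBUFFERS * XLOG_BLCKSZ;  /* one ring later */

    /* Both print 42: without a guard, the backend reserving "b" could
     * start reinitializing the page while "a" is still in use. */
    printf("idx(a) = %llu, idx(b) = %llu\n",
           (unsigned long long) buf_idx(a),
           (unsigned long long) buf_idx(b));
    return 0;
}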
I've talked to Yura off-list about that.  He pointed out that
XLogWrite() should issue a PANIC in that case, which we didn't observe.
However, I'm not sure this analysis is complete.  One way or another,
we need protection against this situation.

The updated patch is attached.  Now, after acquiring ReservedPtr, it
waits until OldPageRqstPtr gets initialized.  Additionally, I had to
implement a more accurate calculation of OldPageRqstPtr; a rough sketch
of the idea follows.  I'm running the tests with the new patch on my
Raspberry Pi in a loop.  Let's see how it goes.
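To sketch what such a guard could look like (an illustration only, not
the actual patch hunk: the condition-variable and wait-event names here
are placeholders, while XLogRecPtrToBufIdx(), xlblocks[], and the
atomics are existing internals):

/*
 * Illustration only -- not the patch itself.  After atomically
 * reserving ReservedPtr, compute the end LSN the buffer's previous
 * occupant must have: exactly one trip around the ring earlier.  Then
 * wait until xlblocks[] actually shows that value; reading xlblocks[]
 * without this wait could return a page that another backend is still
 * initializing.
 */
uint64      bufidx = XLogRecPtrToBufIdx(ReservedPtr);
int         nbuffers = XLogCtl->XLogCacheBlck + 1;   /* ring size */

/* End of the page we are about to install at ReservedPtr */
XLogRecPtr  NewPageEndPtr = ReservedPtr - ReservedPtr % XLOG_BLCKSZ +
                            XLOG_BLCKSZ;

/* Accurate OldPageRqstPtr: the old occupant ended one full ring earlier */
XLogRecPtr  OldPageRqstPtr = NewPageEndPtr - (uint64) nbuffers * XLOG_BLCKSZ;

/* Wait until the old page is fully initialized and mapped to this slot */
while (pg_atomic_read_u64(&XLogCtl->xlblocks[bufidx]) != OldPageRqstPtr)
{
    /* placeholder CV/wait-event names; broadcast when a page is finished */
    ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar,
                           WAIT_EVENT_WAL_BUFFER_INIT);
}
ConditionVariableCancelSleep();

/* From here it is safe to request WAL to be written out up to
 * OldPageRqstPtr and then reuse the buffer for the new page. */

------
Regards,
Alexander Korotkov
Supabase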
Attachment