Re: TruncateMultiXact() bugs - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: TruncateMultiXact() bugs
Date
Msg-id e3be34b6-396f-402f-be0a-ba58d4538bd9@iki.fi
In response to TruncateMultiXact() bugs  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On 14/06/2024 14:37, Heikki Linnakangas wrote:
> I was performing tests around multixid wraparound, when I ran into this
> assertion:
> 
>> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
>> postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
>> postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
>> postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
>> postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
>> postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
>> postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
>> postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
>> postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
>> postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
>> postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
>> postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
>> postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
>> postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
>> postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
>> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
>> postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
>> 2024-06-14 13:11:02.025 EEST [920971] LOG:  server process (PID 920981) was terminated by signal 6: Aborted
>> 2024-06-14 13:11:02.025 EEST [920971] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13407(to prevent wraparound)
> 
> The attached Python script reproduces this pretty reliably. It's a
> reduced version of a larger test script I was working on; it could
> probably be simplified further for this particular issue.
> 
> Looking at the code, it's pretty clear how it happens:
> 
> 1. TruncateMultiXact does START_CRIT_SECTION();
> 
> 2. In the critical section, it calls PerformMembersTruncation() ->
> SlruDeleteSegment() -> SlruInternalDeleteSegment() ->
> RegisterSyncRequest() -> ForwardSyncRequest()
> 
> 3. If the fsync request queue is full, it calls
> CompactCheckpointerRequestQueue(), which calls palloc0. Pallocs are not
> allowed in a critical section.
> 
> A straightforward fix is to add a check to
> CompactCheckpointerRequestQueue() to bail out without compacting, if
> it's called in a critical section. That would cover any other cases like
> this, where RegisterSyncRequest() is called in a critical section. I
> haven't checked whether any more cases like this exist.
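
To make the proposed guard concrete, here is a minimal toy model in C. It is not the PostgreSQL source or the committed patch: crit_section_count, compact_queue() and forward_request() are hypothetical stand-ins for CritSectionCount, CompactCheckpointerRequestQueue() and ForwardSyncRequest(), and calloc() stands in for palloc0(). The only point it illustrates is that compaction needs scratch memory, so it refuses to run inside a critical section and the caller sees "queue full" instead of an assertion failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins only; none of these names exist in PostgreSQL. */
static int crit_section_count = 0;      /* models CritSectionCount */

#define QUEUE_SIZE 4
static int queue[QUEUE_SIZE];           /* models the fsync request queue */
static int queue_len = 0;

/*
 * Drop entries that are duplicated later in the queue.  This needs scratch
 * memory, so it bails out when called inside a critical section.
 * Returns true only if at least one slot was freed.
 */
static bool
compact_queue(void)
{
    bool *skip;
    int   kept = 0;

    if (crit_section_count > 0)
        return false;           /* allocation is not allowed here */

    skip = calloc(QUEUE_SIZE, sizeof(bool));
    if (skip == NULL)
        return false;

    /* Mark every entry that has an identical entry later in the queue. */
    for (int i = 0; i < queue_len; i++)
        for (int j = i + 1; j < queue_len; j++)
            if (queue[i] == queue[j])
            {
                skip[i] = true;
                break;
            }

    /* Slide the surviving entries to the front. */
    for (int i = 0; i < queue_len; i++)
        if (!skip[i])
            queue[kept++] = queue[i];

    free(skip);

    if (kept == queue_len)
        return false;           /* nothing could be removed */
    queue_len = kept;
    return true;
}

/* Try to queue a request; false means "queue full, retry later". */
static bool
forward_request(int req)
{
    if (queue_len >= QUEUE_SIZE && !compact_queue())
        return false;
    assert(queue_len < QUEUE_SIZE);
    queue[queue_len++] = req;
    return true;
}
```

In this model, a caller inside a critical section that hits a full queue simply gets false back and has to retry, which is exactly the situation that exposes the second problem described next.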
> 
> But wait, there is more!
> 
> After applying that fix in CompactCheckpointerRequestQueue(), the test
> script often gets stuck. There's a deadlock between the checkpointer,
> and the autovacuum backend trimming the SLRUs:
> 
> 1. TruncateMultiXact does this:
> 
>           MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> 
> 2. It then makes that call to PerformMembersTruncation() and
> RegisterSyncRequest(). If it cannot queue the request, it sleeps a
> little and retries. But the checkpointer is stuck waiting for the
> autovacuum backend, because of delayChkptFlags, and will never clear the
> queue.
> 
> To fix this, I propose adding AbsorbSyncRequests() calls to the wait-loops
> in CreateCheckPoint().
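
The deadlock and the proposed fix can be sketched as a single-threaded toy simulation. Again, this is not PostgreSQL code: Sim, backend_step() and checkpointer_wait() are hypothetical names, the delay_chkpt field models DELAY_CHKPT_START being set in delayChkptFlags, and emptying the queue models what AbsorbSyncRequests() would do. The two "processes" are interleaved by hand, one backend step per iteration of the checkpointer's wait loop.

```c
#include <stdbool.h>

#define QUEUE_SIZE 2

/* Hypothetical model state; field names are illustrative only. */
typedef struct
{
    int  queue_len;     /* pending sync requests in the shared queue */
    bool delay_chkpt;   /* models DELAY_CHKPT_START being set */
    int  backend_todo;  /* requests the backend still has to queue */
} Sim;

/*
 * Backend: try to queue one request.  If the queue is full it makes no
 * progress (the real backend sleeps and retries).  Once everything is
 * queued, it leaves the critical section and clears the delay flag.
 */
static void
backend_step(Sim *s)
{
    if (s->backend_todo > 0 && s->queue_len < QUEUE_SIZE)
    {
        s->queue_len++;
        s->backend_todo--;
    }
    if (s->backend_todo == 0)
        s->delay_chkpt = false;
}

/*
 * Checkpointer: wait until no backend is delaying the checkpoint.
 * Returns true if the wait ended within max_iters iterations.
 */
static bool
checkpointer_wait(Sim *s, bool absorb_while_waiting, int max_iters)
{
    for (int i = 0; i < max_iters; i++)
    {
        if (!s->delay_chkpt)
            return true;
        if (absorb_while_waiting)
            s->queue_len = 0;   /* models absorbing the queued requests */
        backend_step(s);        /* the other process gets to run */
    }
    return !s->delay_chkpt;
}
```

Starting from a full queue with the flag set, the wait never finishes when absorb_while_waiting is false, because neither side can make progress. With it set to true, the backend drains its work in a few iterations and the checkpointer's wait ends.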
> 
> 
> Attached patch fixes both of those issues.

Committed and backpatched down to v14. This particular scenario cannot
happen in older versions because the RegisterSyncRequest() call on SLRU
truncation was added in v14. In principle, I think older versions might
have similar issues, but given that when assertions are disabled this is
only a problem if you happen to run out of memory in the critical section,
it doesn't seem worth backpatching further unless someone reports a
concrete case.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



