TruncateMultiXact() bugs - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject TruncateMultiXact() bugs
Date
Msg-id ccc66933-31c1-4f6a-bf4b-45fef0d4f22e@iki.fi
List pgsql-hackers
I was performing tests around multixid wraparound, when I ran into this 
assertion:

> TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1353, PID: 920981
 
> postgres: autovacuum worker template0(ExceptionalCondition+0x6e)[0x560a501e866e]
> postgres: autovacuum worker template0(+0x5dce3d)[0x560a50217e3d]
> postgres: autovacuum worker template0(ForwardSyncRequest+0x8e)[0x560a4ffec95e]
> postgres: autovacuum worker template0(RegisterSyncRequest+0x2b)[0x560a50091eeb]
> postgres: autovacuum worker template0(+0x187b0a)[0x560a4fdc2b0a]
> postgres: autovacuum worker template0(SlruDeleteSegment+0x101)[0x560a4fdc2ab1]
> postgres: autovacuum worker template0(TruncateMultiXact+0x2fb)[0x560a4fdbde1b]
> postgres: autovacuum worker template0(vac_update_datfrozenxid+0x4b3)[0x560a4febd2f3]
> postgres: autovacuum worker template0(+0x3adf66)[0x560a4ffe8f66]
> postgres: autovacuum worker template0(AutoVacWorkerMain+0x3ed)[0x560a4ffe7c2d]
> postgres: autovacuum worker template0(+0x3b1ead)[0x560a4ffecead]
> postgres: autovacuum worker template0(+0x3b620e)[0x560a4fff120e]
> postgres: autovacuum worker template0(+0x3b3fbb)[0x560a4ffeefbb]
> postgres: autovacuum worker template0(+0x2f724e)[0x560a4ff3224e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f62cc642c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f62cc642d45]
> postgres: autovacuum worker template0(_start+0x21)[0x560a4fd16f31]
> 2024-06-14 13:11:02.025 EEST [920971] LOG:  server process (PID 920981) was terminated by signal 6: Aborted
> 2024-06-14 13:11:02.025 EEST [920971] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13407
(to prevent wraparound)
 

The attached Python script reproduces this pretty reliably. It's a
reduced version of a larger test script I was working on; it could
probably be simplified further for this particular issue.

Looking at the code, it's pretty clear how it happens:

1. TruncateMultiXact does START_CRIT_SECTION();

2. In the critical section, it calls PerformMembersTruncation() -> 
SlruDeleteSegment() -> SlruInternalDeleteSegment() -> 
RegisterSyncRequest() -> ForwardSyncRequest()

3. If the fsync request queue is full, it calls 
CompactCheckpointerRequestQueue(), which calls palloc0. Pallocs are not 
allowed in a critical section.

A straightforward fix is to add a check to
CompactCheckpointerRequestQueue() to bail out without compacting if
it's called in a critical section. That would also cover any other cases
like this, where RegisterSyncRequest() is called in a critical section. I
haven't searched for other such cases.
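
As a rough illustration of that bail-out, here is a toy model in plain C.
It is not the attached patch and not PostgreSQL source: crit_section_count,
queue, compact_queue() and forward_request() are hypothetical stand-ins for
CritSectionCount, the fsync request queue,
CompactCheckpointerRequestQueue() and ForwardSyncRequest(), and calloc()
stands in for palloc0().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of this is PostgreSQL source. */
#define QUEUE_CAPACITY 4

static int    crit_section_count = 0;   /* models CritSectionCount */
static int    queue[QUEUE_CAPACITY];    /* models the fsync request queue */
static size_t queue_len = 0;

/*
 * Drop duplicate requests to make room.  Compaction needs scratch memory,
 * so it refuses to run inside a critical section.  Returns true only if
 * at least one slot was freed.
 */
static bool
compact_queue(void)
{
    /* The proposed check: never allocate in a critical section. */
    if (crit_section_count > 0)
        return false;
    if (queue_len == 0)
        return false;

    /* Mirrors the assertion that fired in mcxt.c; now unreachable. */
    assert(crit_section_count == 0);
    bool *skip = calloc(queue_len, sizeof(bool));
    if (skip == NULL)
        return false;

    /* Mark every entry that repeats an earlier one. */
    for (size_t i = 1; i < queue_len; i++)
        for (size_t j = 0; j < i; j++)
            if (queue[j] == queue[i])
            {
                skip[i] = true;
                break;
            }

    /* Slide the surviving entries to the front. */
    size_t kept = 0;
    for (size_t i = 0; i < queue_len; i++)
        if (!skip[i])
            queue[kept++] = queue[i];

    free(skip);
    bool freed = kept < queue_len;
    queue_len = kept;
    return freed;
}

/* Queue a request; returns false if the queue is full and stays full. */
static bool
forward_request(int id)
{
    if (queue_len >= QUEUE_CAPACITY && !compact_queue())
        return false;
    queue[queue_len++] = id;
    return true;
}
```

In this model, a caller inside a critical section simply sees
forward_request() return false when the queue is full, instead of hitting
the assertion, and has to retry later.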

But wait, there is more!

After applying that fix in CompactCheckpointerRequestQueue(), the test
script often gets stuck. There's a deadlock between the checkpointer
and the autovacuum backend trimming the SLRUs:

1. TruncateMultiXact does this:

         MyProc->delayChkptFlags |= DELAY_CHKPT_START;

2. It then makes that call to PerformMembersTruncation() and 
RegisterSyncRequest(). If it cannot queue the request, it sleeps a 
little and retries. But the checkpointer is stuck waiting for the 
autovacuum backend, because of delayChkptFlags, and will never clear the 
queue.

To fix, I propose to add AbsorbSyncRequests() calls to the wait-loops in 
CreateCheckPoint().
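
To make the deadlock and the proposed remedy concrete, here is a small
single-threaded simulation in plain C. It is only a sketch under stated
assumptions, not the attached patch: struct sim, backend_step() and
checkpointer_wait() are hypothetical names, the delay_chkpt flag models
DELAY_CHKPT_START, and emptying the queue models what
AbsorbSyncRequests() would do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of this is PostgreSQL source. */
#define QUEUE_CAPACITY 2

struct sim
{
    int  queued;            /* requests sitting in the shared queue */
    int  backend_pending;   /* requests the backend still has to queue */
    bool delay_chkpt;       /* models DELAY_CHKPT_START being set */
};

/*
 * One scheduling slice of the truncating backend: try to queue one
 * request; if the queue is full, do nothing (sleep and retry later).
 * The delay flag is cleared once everything has been queued.
 */
static void
backend_step(struct sim *s)
{
    if (s->backend_pending > 0 && s->queued < QUEUE_CAPACITY)
    {
        s->queued++;
        s->backend_pending--;
    }
    if (s->backend_pending == 0)
        s->delay_chkpt = false;
}

/*
 * The checkpointer's wait loop.  With absorb = false it only waits for
 * the flag; with absorb = true it also drains the queue on every pass.
 * Returns true if the flag cleared within max_iters passes.
 */
static bool
checkpointer_wait(struct sim *s, bool absorb, int max_iters)
{
    assert(max_iters > 0);
    for (int i = 0; i < max_iters; i++)
    {
        if (!s->delay_chkpt)
            return true;
        if (absorb)
            s->queued = 0;      /* models AbsorbSyncRequests() */
        backend_step(s);
    }
    return !s->delay_chkpt;
}
```

Starting from a full queue with requests still pending, the wait never
finishes when absorb is false, because neither side can make progress.
With absorb set to true, the backend gets a free slot on each pass and
eventually clears the flag.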


Attached patch fixes both of those issues.

I can't help thinking that TruncateMultiXact() should perhaps not have 
such a long critical section. TruncateCLOG() doesn't do that. But it was 
added for good reasons in commit 4f627f897367, and this fix seems 
appropriate for the stable branches anyway, even if we come up with 
something better for master.

-- 
Heikki Linnakangas
Neon (https://neon.tech)