Re: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC - Mailing list pgsql-hackers

From Andy Fan
Subject Re: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC
Date
Msg-id 874im44mi9.fsf@163.com
Whole thread Raw
In response to Re: raise ERROR between EndPrepare and PostPrepare_Locks causes ROLLBACK 2pc PAINC  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:

Hi,

Thanks for showing interests for this topic!

> On Wed, Mar 25, 2026 at 08:39:07AM +0800, Andy Fan wrote:
>> I found a similar but not exactly same case at 2014 [1] which 
>> might be helpful to recall a boarder understanding on this area. 
>> 
>> [1] https://www.postgresql.org/message-id/534AF601.1030007%40vmware.com
>
> Incorrect shared state when an ERROR happens at an arbitrary location
> is usually bad, yes.
>
> For this one, your suggestion of delaying the end of the critical
> section started at StartPrepare() and ending in EndPrepare() is not an
> acceptable solution as far as I can see, unfortunately: it would mean
> doing a SyncRepWaitForLSN() while in a critical section, and I doubt
> we'd want to do that.

I do have a more general question: (Q1). what is critical section
designed to? (Q2). What is the badness if we put more code into it
except the ERROR->PANIC logic?

Then (Q3) will SyncRepWaitForLSN be possible to raise ERROR? if no for
now and possible in future, then my Q2 raise. 

> Anyway, I doubt that this one is worth caring for. The current locking
> 2PC scheme means, as far as I remember, that it is not really possible
> to interact with an external command in a specific session between
> the EndPrepare() and the PostPrepare_Locks() 
> calls.

Then Q3 comes. The deeper answer might be Q2 or Q1.

> To put it in other words, let's imagine that we use a breakpoint
> between these two calls (or a wait injection point if you automate
> that).  Is it possible for a second backend to mess with the state of
> the first backend waiting until its locks are transfered to the dummy
> PGPROC entry?  That's what the 2014 thread is about: there was a race
> condition reachable between two sessions.

This is true, so the issue 2014 thread is more critical than the
current one and which has been fixed. 

> If the answer to this question is yes, I'd agree that this is
> something that deserves a closer lookup.

Generally yes.. But I can't stop thinking the Q3 -> Q2 -> Q1 when I want
to accept this asnwer. 

> And before you ask: attempting to interact with a 2PC
> state from a second session with a first session waiting between these
> two points would not work: the 2PC entry is locked, cleaned up after
> EndPrepare() and PostPrepare_Locks() at PostPrepare_Twophase().
> Trying to request an access to this entry fails, as the first backend
> is marked as locking it.  A second backend attempting to lock it would
> fail, complaining that the 2PC entry with a GXID is "busy".

I can understand what you are saying now, but what does it make
difference on the above case?

> SyncRepWaitForLSN() would be a problematic pattern between the
> EndPrepare() and the PostPrepare_Locks(), but we never ERROR there on 
> purpose: even if we cancel while waiting for a transaction commit we'd
> just get a WARNING, meaning that we'd be able to transfer our locks
> anyway.

again  Q2 -> Q3. 

> Or perhaps you have a realistic scenario where it is possible to mess
> up with the shared state, outside a elog(ERROR) forced between these
> two points?

No really. I just inject some exception on some predefined places. I even
don't know why people defined this places before.  As for me, I prefer
to know MORE design points for the CRITIAL SECTION besides what I side
before. 

-- 
Best Regards
Andy Fan




pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: relfilenode statistics
Next
From: Bharath Rupireddy
Date:
Subject: Re: Reduce log level of some logical decoding messages to DEBUG1