Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers
Date
Msg-id CAD21AoCfYEeO75NWxoRHi25+_j6LqgQMs2spfmmkacewijBo2w@mail.gmail.com
Whole thread Raw
In response to Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Responses Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers
List pgsql-hackers
On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 26/12/17 11:13, Masahiko Sawada wrote:
>> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>
>>>>
>>>> It's not a problem on crash restart because StartupReorderBuffer already
>>>> does the required delete.
>>>>
>>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
>>>> to have any guard to ensure that the segment files don't already exist
>>>> when it goes to serialize a snapshot. Adding one there would probably be
>>>> expensive; we don't know the last lsn of the txn yet, so to be really
>>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-*
>>>> entries.
>>>>
>>>> So to fix, I suggest that we should do a
>>>> slot-specific StartupReorderBuffer-style deletion when we start a new
>>>> decoding session on a slot, per attached patch.
>>>>
>>>> It might be nice to also add a hook on proc exit, so we don't have stale
>>>> buffers lying around until next decoding session, but I didn't want to
>>>> add new global state to reorderbuffer.c so I've left that for now.
>>>
>>>
>>> Hmm, can't we simply call the new cleanup function in
>>> ReplicationSlotRelease()? That's called during process exit and we know
>>> there about slot so no extra global variables are needed.
>>>
>>
>> I guess that ReplicationSlotRelease() currently might not get called
>> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
>> called by some functions such as WalSndErrorCleanup(), but at least in
>> the case where wal sender exits due to failed to write data to socket,
>> ReplicationSlotRelease() didn't get called as far as I tested.
>>
>
> Are you sure about that testing? Did you test it with replication slot
> active? ReplicationSlotRelease() is called from ProcKill() if the
> process is using a slot and should be called for any kind of exit except
> for outright crash (the kind that makes postgres kill all backends). If
> it wasn't we'd never unlock the replication slot used by the exiting
> walsender.
>

Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pgsql-hackers by date:

Previous
From: Lelisa Diriba
Date:
Subject: Ethiopian calendar year(DATE TYPE) are different from the Gregoriancalendar year
Next
From: Pavel Stehule
Date:
Subject: Re: Ethiopian calendar year(DATE TYPE) are different from theGregorian calendar year