Re: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: long-standing data loss bug in initial sync of logical replication
Date
Msg-id c1e5ccd0-9681-4959-8c8a-ad4853064e98@enterprisedb.com
In response to Re: long-standing data loss bug in initial sync of logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: long-standing data loss bug in initial sync of logical replication
List pgsql-hackers
On 6/25/24 07:04, Amit Kapila wrote:
> On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 6/24/24 12:54, Amit Kapila wrote:
>>> ...
>>>>
>>>>>> I'm not sure there are any cases where using SRE instead of AE would cause
>>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very
>>>>>> surprised if just using SRE would not lead to corrupted cache contents in some
>>>>>> situations. The cases where a lower lock level is ok are ones where we just
>>>>>> don't care that the cache is coherent in that moment.
>>>>
>>>>> Are you saying it might break cases that are not corrupted now? How
>>>>> could obtaining a stronger lock have such effect?
>>>>
>>>> No, I mean that I don't know if using SRE instead of AE would have negative
>>>> consequences for logical decoding. I.e. whether, from a logical decoding POV,
>>>> it'd suffice to increase the lock level to just SRE instead of AE.
>>>>
>>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question.
>>>>
>>>
>>> We lost track of this thread and the bug is still open. IIUC, the
>>> conclusion is to use SRE in OpenTableList() to fix the reported issue.
>>> Andres, Tomas, please let me know if my understanding is wrong,
>>> otherwise, let's proceed and fix this issue.
>>>
>>
>> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
>> don't think we 'lost track' of it, but it's true we haven't done much
>> progress recently.
>>
> 
> Okay, thanks for pointing to the CF entry. Would you like to take care
> of this? Are you seeing anything more than the simple fix to use SRE
> in OpenTableList()?
> 

I did not find a simpler fix than taking the SRE lock, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details (I'd have to relearn them), but IIRC the main challenge
for me was convincing myself that it's a sufficient and reliable fix
(and not one that works simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


