Re: Possible bug in logical replication. - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Possible bug in logical replication.
Date
Msg-id CAD21AoA+5Tz0Z8zHOmD=sU5F=cygoEjHs7WvbBDL07fH9ayVaw@mail.gmail.com
Whole thread Raw
In response to Re: Possible bug in logical replication.  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Possible bug in logical replication.  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
On Fri, May 18, 2018 at 2:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 17 May 2018 13:54:07 +0300, Arseny Sher <a.sher@postgrespro.ru> wrote in <87in7md034.fsf@ars-thinkpad>
>>
>> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>>
>> > I think that using restart_lsn instead of confirmed_flush is not right approach.
>> > If restart_lsn is not available and confirmed_flush is pointing to page
>> > boundary, then in any case we should somehow handle this case and adjust
>> > startlsn to point on the valid record position (by jjust adding page header
>> > size?).
>>
>> Well, restart_lsn is always available on live slot: it is initially set
>> in ReplicationSlotReserveWal during slot creation.
>
> restart_lsn stays at the beginning of a transaction until the
> transaction ends so just using restart_lsn allows repeated
> decoding of a transaction, in short, rewinding occurs. The
> function works only for inactive slot so the current code works
> fine on this point. Addition to that restart_lsn also can be on a
> page bounary.
>
>
> We can see the problem easily.
>
> 1. Just create a logical replication slot with setting current LSN.
>
>   select pg_create_logical_replication_slot('s1', 'pgoutput');
>
> 2. Advance LSN by two or three pages by doing anyting.
>
> 3. Advance the slot to a page bounadry.
>
>   e.g. select pg_replication_slot_advance('s1', '0/9624000');
>
> 4. advance the slot further, then crash.
>
> So directly set ctx->reader->EndRecPtr by startlsn fixes the
> problem, but I found another problem here.

I confirmed that the attached patch fixes this problem as well as the
same problem reported on another thread.

I'm not sure it's a good approach to change the state of xlogreader
directly in the replication slot codes because it also means that we
have to care about this code as well when xlogreader code is changed.
Another possible way might be to make XLogFindNextRecord valid in
backend code and move startlsn to the first valid record with an lsn
>= startlsn by using that function. Please find attached patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: SCRAM with channel binding downgrade attack
Next
From: Magnus Hagander
Date:
Subject: Re: SCRAM with channel binding downgrade attack