Re: Segmentation Fault in logical decoding get/peek API - Mailing list pgsql-bugs

From Jeremy Finzel
Subject Re: Segmentation Fault in logical decoding get/peek API
Date
Msg-id CAMa1XUivSJ2idopJUhMi-CWundycnH4OdwWOiZ_D+12BvMLObg@mail.gmail.com
Whole thread Raw
In response to Re: Segmentation Fault in logical decoding get/peek API  (Jeremy Finzel <finzelj@gmail.com>)
List pgsql-bugs
On Wed, Feb 27, 2019 at 8:34 AM Jeremy Finzel <finzelj@gmail.com> wrote:
On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.

> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.

Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively.  Seems mighty
tedious though.  Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.

                        regards, tom lane

I have confirmed which commit fixed the segfault:

cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Mar 6 15:57:20 2018 -0300

    Refrain from duplicating data in reorderbuffers

    If a walsender exits leaving data in reorderbuffers, the next walsender
    that tries to decode the same transaction would append its decoded data
    in the same spill files without truncating it first, which effectively
    duplicate the data.  Avoid that by removing any leftover reorderbuffer
    spill files when a walsender starts.

    Backpatch to 9.4; this bug has been there from the very beginning of
    logical decoding.

    Author: Craig Ringer, revised by me
    Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada 


I'm very curious if anyone more familiar with the code base understands how the old code was vulnerable to a segfault, but not the new code.  I don't think in my situation a walsender was exiting at all, so I do think the specific reason for this code change accidentally fixed a potential segfault unrelated to an exiting walsender leaving data in reorderbuffers.

I still can't explain precisely what reproduces it.  As I have said before, some unique things about my setup are lots of INSERT ... ON CONFLICT UPDATE queries, as well as btree gist exclusion indexes, but I suspect this has to do with the former somehow, although I haven't been able to create a reliably reproducing crash.

I suggest at least adding something to the 10.4 (and corollary major version) release notes around this.

Here is the release note now:
  • In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
Perhaps some addition like this:
  • In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
    This change also had the effect of fixing a potential segfault vulnerability in the logical decoding peek/get_changes API in ReorderBufferCommit

Thanks,
Jeremy

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #15666: Seemingly incorrect error reporting/logging for violation of non-null constraint
Next
From: Amit Langote
Date:
Subject: Re: BUG #15662: Row update in foreign partition does not move rowbased on partition key