Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.
> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.
Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively. Seems mighty
tedious though. Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.
regards, tom lane
I have confirmed which commit fixed the segfault:
cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Date: Tue Mar 6 15:57:20 2018 -0300
Refrain from duplicating data in reorderbuffers
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
I'm very curious if anyone more familiar with the code base understands how the old code was vulnerable to a segfault, but not the new code. I don't think in my situation a walsender was exiting at all, so I do think the specific reason for this code change accidentally fixed a potential segfault unrelated to an exiting walsender leaving data in reorderbuffers.
I still can't explain precisely what reproduces it. As I have said before, some unique things about my setup are lots of INSERT ... ON CONFLICT UPDATE queries, as well as btree gist exclusion indexes, but I suspect this has to do with the former somehow, although I haven't been able to create a reliably reproducing crash.
I suggest at least adding something to the 10.4 (and corollary major version) release notes around this.
Here is the release note now:
- In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
Perhaps some addition like this:
- In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
This change also had the effect of fixing a potential segfault vulnerability in the logical decoding peek/get_changes API in ReorderBufferCommit
Thanks,
Jeremy