Re: subscriptionCheck failures on nightjar - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: subscriptionCheck failures on nightjar
Msg-id 20190826132904.3ayuw36qzl2c4ktr@development
In response to Re: subscriptionCheck failures on nightjar  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:
>On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
>> I'm not yet sure that that's actually something that's supposed to
>> happen, I got to spend some time analysing how this actually
>> happens. Normally the contents of the slot should actually prevent it
>> from being removed (as they're newer than
>> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
>> a bug in the drop logic in newer releases.
>
>In the same context, could it be a consequence of 9915de6c which has
>introduced a conditional variable to control slot operations?  This
>could have exposed more easily a pre-existing race condition.
>--

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
December, that's mostly because the PANIC patch went in shortly before.
We've however seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit from the 2018-07
CF that seems related to logical decoding / snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Jun 26 16:38:34 2018 -0400

    Fix "base" snapshot handling in logical decoding

    ...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see REL_11_STABLE failing on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).


[1] https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail.gmail.com

[2] https://commitfest.postgresql.org/18/1650/

[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


