Re: walsender vs. XLogBackgroundFlush during shutdown - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: walsender vs. XLogBackgroundFlush during shutdown
Date
Msg-id 20190502123545.bxq5c6ziefdcjewu@development
Whole thread Raw
In response to Re: walsender vs. XLogBackgroundFlush during shutdown  (Alexander Kukushkin <cyberdemn@gmail.com>)
Responses Re: walsender vs. XLogBackgroundFlush during shutdown
List pgsql-hackers
On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote:
>Hi,
>
>On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> OK, so that seems like a separate issue, somewhat unrelated to the issue I
>> reported. And I'm not sure it's a walsender issue - it seems it might be a
>> psycopg2 issue in not reporting the flush properly, no?
>
>Agree, it is a different issue, but I am unsure what to blame,
>postgres or psycopg2.
>Right now in the psycopg2 we confirm more or less every XLogData
>message, but at the same time LSN on the primary is moving forward and
>we get updates with keepalive messages.
>I perfectly understand the need to periodically send updates of flush
>= walEnd (which comes from keepalive). It might happen that there is
>no transaction activity but WAL is still generated and as a result
>replication slot will prevent WAL from being cleaned up.
>From the client side perspective, it confirmed everything that it
>should, but from the postgres side, this is not enough to shut down
>cleanly. Maybe it is possible to change the check (sentPtr ==
>replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr) or
>it would be unsafe?
>

I don't know.

In general I think it's a bit strange that we're waiting for walsender
processes to catch up even in fast shutdown mode, instead of just aborting
them like other backends. But I assume there are reasons for that. OTOH it
makes us vulnerable to issues like this, when a (presumably) misbehaving
downstream prevents a shutdown.

>> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
>> >#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
>> >#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>> >#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>> >#5  0x000055cb93dac30b in send_message_to_frontend
>> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>> >#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>> >#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>> >#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>> >"sending replication keepalive") at elog.c:1376
>> >#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>> >walsender.c:3358
>>
>> Is it stuck in the send() call forever, or did you happen to grab
>> this bracktrace?
>
>No, it didn't stuck there. During the shutdown postgres starts sending
>a few thousand keepalive messages per second and receives back so many
>feedback message, therefore the chances of interrupting somewhere in
>the send are quite high.

Uh, that seems a bit broken, perhaps?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Fixing order of resowner cleanup in 12, for Windows
Next
From: Laurenz Albe
Date:
Subject: Bad canonicalization for dateranges with 'infinity' bounds