Re: walsender vs. XLogBackgroundFlush during shutdown - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: walsender vs. XLogBackgroundFlush during shutdown |
Date | |
Msg-id | 20190502123545.bxq5c6ziefdcjewu@development Whole thread Raw |
In response to | Re: walsender vs. XLogBackgroundFlush during shutdown (Alexander Kukushkin <cyberdemn@gmail.com>) |
Responses |
Re: walsender vs. XLogBackgroundFlush during shutdown
|
List | pgsql-hackers |
On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote: >Hi, > >On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> OK, so that seems like a separate issue, somewhat unrelated to the issue I >> reported. And I'm not sure it's a walsender issue - it seems it might be a >> psycopg2 issue in not reporting the flush properly, no? > >Agree, it is a different issue, but I am unsure what to blame, >postgres or psycopg2. >Right now in the psycopg2 we confirm more or less every XLogData >message, but at the same time LSN on the primary is moving forward and >we get updates with keepalive messages. >I perfectly understand the need to periodically send updates of flush >= walEnd (which comes from keepalive). It might happen that there is >no transaction activity but WAL is still generated and as a result >replication slot will prevent WAL from being cleaned up. >From the client side perspective, it confirmed everything that it >should, but from the postgres side, this is not enough to shut down >cleanly. Maybe it is possible to change the check (sentPtr == >replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr) or >it would be unsafe? > I don't know. In general I think it's a bit strange that we're waiting for walsender processes to catch up even in fast shutdown mode, instead of just aborting them like other backends. But I assume there are reasons for that. OTOH it makes us vulnerable to issues like this, when a (presumably) misbehaving downstream prevents a shutdown. >> >ptr=0x55cb958dca08, len=94) at be-secure.c:318 >> >#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, >> >ptr=0x55cb958dca08, len=94) at be-secure.c:265 >> >#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433 >> >#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409 >> >#5 0x000055cb93dac30b in send_message_to_frontend >> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317 >> >#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481 >> >#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481 >> >#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 >> >"sending replication keepalive") at elog.c:1376 >> >#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at >> >walsender.c:3358 >> >> Is it stuck in the send() call forever, or did you happen to grab >> this bracktrace? > >No, it didn't stuck there. During the shutdown postgres starts sending >a few thousand keepalive messages per second and receives back so many >feedback message, therefore the chances of interrupting somewhere in >the send are quite high. Uh, that seems a bit broken, perhaps? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: