Re: [HACKERS] Restricting maximum keep segments by repslots - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [HACKERS] Restricting maximum keep segments by repslots
Date
Msg-id 20200403231403.GA14901@alvherre.pgsql
Whole thread Raw
In response to Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
So, the more I look at this patch, the less I like the way the slots are
handled.

* I think it's a mistake to try to do anything in KeepLogSeg itself;
  that function is merely in charge of some arithmetic.  I propose to
  make that function aware of the new size limitation (so that it
  doesn't trust the slot's LSNs completely), but to make the function
  have no side effects.  The attached patch does that, I hope.
  To replace that responsibility, let's add another function.  I named it
  InvalidateObsoleteReplicationSlots().  In CreateCheckPoint and
  CreateRestartPoint, we call the new function just before removing
  segments.  Note: the one in this patch doesn't actually work or even
  compile.
  The new function must:

  1. mark the slot as "invalid" somehow.  Maybe it would make sense to
  add a new flag in the on-disk struct for this; but for now I'm just
  thinking that changing the slot's restart_lsn is sufficient.
  (Of course, I haven't tested this, so there might be side-effects that
  mean that this idea doesn't work).

  2. send SIGTERM to a walsender that's using such a slot.

  3. Send the warning message.  Instead of trying to construct a message
  with a list of slots, send one message per slot.  (I think this is
  better from a translatability point of view, and also from a
  monitoring PoV).

* GetWalAvailability seems too much in competition with
  DistanceToWalRemoval.  Which is weird, because both functions do
  pretty much the same thing.  I think a better design is to make the
  former function return the distance as an out parameter.

* Andres complained that the "distance" column was not a great value to
  expose (20171106132050.6apzynxrqrzghb4r@alap3.anarazel.de).  That's
  right: it changes both by the insertion LSN as well as the slot's
  consumption.  Maybe we can expose the earliest live LSN (start of the
  earliest segment?) as a new column.  It'll be the same for all slots,
  I suppose, but we don't care, do we?

I attach a rough sketch, which as I said before doesn't work and doesn't
compile.  Sadly I have reached the end of my day here so I won't be able
to work on this for today anymore.  I'll be glad to try again tomorrow,
but in the meantime I thought it was better to send it over and see
whether you had any thoughts about this proposed design (maybe you know
it doesn't work for some reason), or better yet, you have the chance to
actually complete the code or at least move it a little further.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: backup manifests and contemporaneous buildfarm failures
Next
From: Peter Geoghegan
Date:
Subject: Re: vacuum_defer_cleanup_age inconsistently applied on replicas