Thread: [HACKERS] Restricting maximum keep segments by repslots
Hello.

Although a replication slot is helpful to avoid unwanted WAL deletion, it can also cause a disastrous situation by keeping WAL segments without limit. Removing the offending repslot resolves the situation, but that is not doable while the standby is active. We have to take rather complex and forcible steps to relieve the situation, especially in an automatic manner. (As for me, specifically in an HA cluster.)

This patch adds a GUC to put a limit on the number of segments that replication slots can keep. Hitting the limit during a checkpoint emits a warning, and the segments older than the limit are removed.

> WARNING: restart LSN of replication slots is ignored by checkpoint
> DETAIL: Some replication slots lose required WAL segments to continue.

Another measure would be automatic deletion or inactivation of the culprit slot, but that seems too complex for the problem.

As we have already postponed some patches in the triage for the last commit fest, this should perhaps be postponed to PG11.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 28, 2017 at 12:27 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Although a replication slot is helpful to avoid unwanted WAL
> deletion, it can also cause a disastrous situation by keeping WAL
> segments without limit. Removing the offending repslot resolves the
> situation, but that is not doable while the standby is active. We
> have to take rather complex and forcible steps to relieve the
> situation, especially in an automatic manner. (As for me,
> specifically in an HA cluster.)
>
> This patch adds a GUC to put a limit on the number of segments
> that replication slots can keep. Hitting the limit during a
> checkpoint emits a warning, and the segments older than the limit
> are removed.
>
>> WARNING: restart LSN of replication slots is ignored by checkpoint
>> DETAIL: Some replication slots lose required WAL segments to continue.
>
> Another measure would be automatic deletion or inactivation of
> the culprit slot, but that seems too complex for the problem.
>
> As we have already postponed some patches in the triage for the
> last commit fest, this should perhaps be postponed to PG11.

Please no. Replication slots are designed the current way because we don't want to have to use something like wal_keep_segments as it is a wart, and this applies as well for replication slots in my opinion. If a slot is bloating WAL and you care about your Postgres instance, I would recommend instead that you use a background worker that does monitoring of the situation based on max_wal_size for example, killing the WAL sender associated with the slot if there is something connected but it is frozen or it cannot keep up with the pace of WAL generation, and then dropping the slot. You may want to issue a checkpoint in this case as well to ensure that segments get recycled. But anyway, if you reach this point of WAL bloat, perhaps that's for the best, as users would know about it because backups would get in danger. For some applications that is acceptable, but you could always rely on monitoring slots and killing them on sight if needed. That's also more flexible than having a parameter that is basically just a synonym of max_wal_size.
-- 
Michael
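[A minimal sketch of the monitor-kill-drop sequence Michael describes, assuming a slot named 's1' and an arbitrary 1GB threshold; nothing here is part of the proposed patch, and the function names are the v10 spellings.]

-- find slots holding back more than ~1GB of WAL
SELECT slot_name, active_pid, restart_lsn,
       pg_current_wal_lsn() - restart_lsn AS retained_bytes
  FROM pg_replication_slots
 WHERE pg_current_wal_lsn() - restart_lsn > 1073741824;

-- terminate the walsender using the slot (if any), then drop the slot
SELECT pg_terminate_backend(active_pid)
  FROM pg_replication_slots
 WHERE slot_name = 's1' AND active_pid IS NOT NULL;
SELECT pg_drop_replication_slot('s1');

-- optionally force a checkpoint so the freed segments are recycled sooner
CHECKPOINT;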
Thank you for the opinion.

At Tue, 28 Feb 2017 12:42:32 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQm0QetoShggQnn4bLFd9oXKKHG7RafBP3Krno62=ORww@mail.gmail.com>
> Please no. Replication slots are designed the current way because we
> don't want to have to use something like wal_keep_segments as it is a
> wart, and this applies as well for replication slots in my opinion. If
> a slot is bloating WAL and you care about your Postgres instance, I
> would recommend instead that you use a background worker that does
> monitoring of the situation based on max_wal_size for example, killing
> the WAL sender associated with the slot if there is something connected
> but it is frozen or it cannot keep up with the pace of WAL generation,
> and then dropping the slot.

That is doable without a plugin, and we are currently planning to do it that way (maybe such a plugin would be unacceptable..). Killing the walsender (which one?), removing the slot, and handling the case where that fails.. these are the "rather complex steps", and they are fragile.

> You may want to issue a checkpoint in this
> case as well to ensure that segments get recycled. But anyway, if you
> reach this point of WAL bloat, perhaps that's for the best, as users
> would know about it because backups would get in danger.

Yes, but in the end that is better than the server simply stopping with a PANIC.

> For some applications that is acceptable, but you could always
> rely on monitoring slots and killing them on sight if
> needed.

Another solution would be to have removing a slot kill the corresponding walsender. What do you think about this?

pg_drop_replication_slot(name, *force*)

force = true kills the walsender running on the slot.

> That's also more flexible than having a parameter
> that is basically just a synonym of max_wal_size.

I thought of the same thing at first: max_wal_size_hard, which limits the WAL size including the extra segments (other than those needed for the two checkpoint cycles).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 28/02/17 04:27, Kyotaro HORIGUCHI wrote:
> Hello.
>
> Although a replication slot is helpful to avoid unwanted WAL
> deletion, it can also cause a disastrous situation by keeping WAL
> segments without limit. Removing the offending repslot resolves the
> situation, but that is not doable while the standby is active. We
> have to take rather complex and forcible steps to relieve the
> situation, especially in an automatic manner. (As for me,
> specifically in an HA cluster.)
>

I agree that it should be possible to limit how much WAL a slot keeps.

> This patch adds a GUC to put a limit on the number of segments
> that replication slots can keep. Hitting the limit during a
> checkpoint emits a warning, and the segments older than the limit
> are removed.
>
>> WARNING: restart LSN of replication slots is ignored by checkpoint
>> DETAIL: Some replication slots lose required WAL segments to continue.
>

However, this is dangerous, as a logical replication slot does not consider it an error when a too-old LSN is requested, so we'd continue replication and hide data loss.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 28, 2017 at 1:16 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> That is doable without a plugin, and we are currently planning to do
> it that way (maybe such a plugin would be unacceptable..). Killing
> the walsender (which one?), removing the slot, and handling the case
> where that fails..

The PID and restart_lsn associated with each slot offer enough information for monitoring.

> these are the "rather complex steps", and they are fragile.

The handling of slot drop is not complex. Ensuring that WAL segments get recycled on time so as to avoid a full bloat is, though.

>> That's also more flexible than having a parameter
>> that is basically just a synonym of max_wal_size.
>
> I thought of the same thing at first: max_wal_size_hard, which limits
> the WAL size including the extra segments (other than those needed
> for the two checkpoint cycles).

It would make more sense to just switch max_wal_size from a soft to a hard limit. The current behavior is not cool with activity spikes.
-- 
Michael
On Tue, Feb 28, 2017 at 10:04 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > It would make more sense to just switch max_wal_size from a soft to a > hard limit. The current behavior is not cool with activity spikes. Having a hard limit on WAL size would be nice, but that's a different problem from the one being discussed here. If max_wal_size becomes a hard limit, and a standby with a replication slot dies, then the master eventually starts refusing all writes. I guess that's better than a PANIC, but it's not likely to make users very happy. I think it's entirely reasonable to want a behavior where the master is willing to retain up to X amount of extra WAL for the benefit of some standby, but after that the health of the master takes priority. You can't really get that behavior today. Either you can retain as much WAL as might be necessary through archiving or a slot, or you can retain a fixed amount of WAL whether it's actually needed or not. There's currently no approach that retains min(wal_needed, configured_value). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
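[Restated as a formula, a paraphrase of Robert's point rather than wording from any patch: the desired behavior keeps WAL back to

    oldest LSN kept = max(oldest restart_lsn among slots, current write LSN - configured limit)

so a slot is honored while it lags by less than the limit, and the master's health wins once the slot lags by more than that.]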
Hi, On 2017-02-28 12:42:32 +0900, Michael Paquier wrote: > Please no. Replication slots are designed the current way because we > don't want to have to use something like wal_keep_segments as it is a > wart, and this applies as well for replication slots in my opinion. I think a per-slot option to limit the amount of retention would make sense. - Andres
On 2/27/17 23:27, Petr Jelinek wrote:
>>> WARNING: restart LSN of replication slots is ignored by checkpoint
>>> DETAIL: Some replication slots lose required WAL segments to continue.
> However, this is dangerous, as a logical replication slot does not
> consider it an error when a too-old LSN is requested, so we'd continue
> replication and hide data loss.

In general, we would need a much more evident and strict way to discover when this condition is hit. Like a "full" column in pg_stat_replication_slot, and refusing connections to the slot until it is cleared.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/27/17 22:27, Kyotaro HORIGUCHI wrote: > This patch adds a GUC to put a limit to the number of segments > that replication slots can keep. Please measure it in size, not in number of segments. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 1 Mar 2017 08:06:10 -0800, Andres Freund <andres@anarazel.de> wrote in <20170301160610.wc7ez3vihmialntd@alap3.anarazel.de>
> On 2017-02-28 12:42:32 +0900, Michael Paquier wrote:
> > Please no. Replication slots are designed the current way because we
> > don't want to have to use something like wal_keep_segments as it is a
> > wart, and this applies as well for replication slots in my opinion.
>
> I think a per-slot option to limit the amount of retention would make
> sense.

I started from that, but I found that all slots refer to the same location as the origin of the distance, that is, the last segment number that KeepLogSeg returns. As a result, the whole logic became the following. This is one reason for the proposed patch.

- Take the maximum of the per-slot maximum-retained-LSN amounts.
- Apply that maximum during the calculation in KeepLogSeg.
- (These steps run only when at least one slot exists.)

The other reason was, as Robert restated, that I think this is a matter of system-wide (or DB-cluster-wide) health, and it works a bit differently from what the name "max_wal_size_hard" suggests.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 1 Mar 2017 12:17:43 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <dc7faead-61c4-402e-a6dc-534192833d77@2ndquadrant.com>
> On 2/27/17 23:27, Petr Jelinek wrote:
> >>> WARNING: restart LSN of replication slots is ignored by checkpoint
> >>> DETAIL: Some replication slots lose required WAL segments to continue.
> > However, this is dangerous, as a logical replication slot does not
> > consider it an error when a too-old LSN is requested, so we'd continue
> > replication and hide data loss.
>
> In general, we would need a much more evident and strict way to discover
> when this condition is hit. Like a "full" column in
> pg_stat_replication_slot, and refusing connections to the slot until it
> is cleared.

Anyway, if preserving WAL for replication has priority over the master's health, this feature does nothing as long as 'max_wal_keep_segments' is left at 0.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 1 Mar 2017 12:18:07 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <98538b00-42ae-6a6b-f852-50b3c937ade4@2ndquadrant.com>
> On 2/27/17 22:27, Kyotaro HORIGUCHI wrote:
> > This patch adds a GUC to put a limit on the number of segments
> > that replication slots can keep.
>
> Please measure it in size, not in number of segments.

It was difficult to decide which is reasonable, but I named it after wal_keep_segments because it has a similar effect.

In bytes (or LSN):
  max_wal_size
  min_wal_size
  wal_write_flush_after

In segments:
  wal_keep_segments

But certainly max_slot_wal_keep_segments works to keep disk space, so bytes would be reasonable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 3/1/17 19:54, Kyotaro HORIGUCHI wrote:
>> Please measure it in size, not in number of segments.
> It was difficult to decide which is reasonable, but I named it
> after wal_keep_segments because it has a similar effect.
>
> In bytes (or LSN):
>   max_wal_size
>   min_wal_size
>   wal_write_flush_after
>
> In segments:
>   wal_keep_segments

We have been moving away from measuring in segments. For example, checkpoint_segments was replaced by max_wal_size.

Also, with the proposed patch that allows changing the segment size more easily, this will become more important. (I wonder if that will require wal_keep_segments to change somehow.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for the comment.

At Fri, 3 Mar 2017 14:47:20 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <ac510b45-7805-7ccc-734c-1b38a6645f3e@2ndquadrant.com>
> On 3/1/17 19:54, Kyotaro HORIGUCHI wrote:
> >> Please measure it in size, not in number of segments.
> > It was difficult to decide which is reasonable, but I named it
> > after wal_keep_segments because it has a similar effect.
> >
> > In bytes (or LSN):
> >   max_wal_size
> >   min_wal_size
> >   wal_write_flush_after
> >
> > In segments:
> >   wal_keep_segments
>
> We have been moving away from measuring in segments. For example,
> checkpoint_segments was replaced by max_wal_size.
>
> Also, with the proposed patch that allows changing the segment size more
> easily, this will become more important. (I wonder if that will require
> wal_keep_segments to change somehow.)

Agreed. It is 'max_slot_wal_keep_size' in the new version.

wal_keep_segments should perhaps be removed someday.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
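[As a usage sketch only: the GUC exists only with this patch applied, and both its name and unit were still under discussion at this point. A size-based setting would behave like any other size GUC; with the default 16MB segments, 1GB of slack corresponds to 64 extra segments.]

=# ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';  -- 64 extra 16MB segments kept for slots
=# SELECT pg_reload_conf();                          -- the patch marks the GUC PGC_SIGHUP, so a reload suffices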
On 28 February 2017 at 12:27, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>> This patch adds a GUC to put a limit on the number of segments
>> that replication slots can keep. Hitting the limit during a
>> checkpoint emits a warning, and the segments older than the limit
>> are removed.
>>
>>> WARNING: restart LSN of replication slots is ignored by checkpoint
>>> DETAIL: Some replication slots lose required WAL segments to continue.
>>
>
> However, this is dangerous, as a logical replication slot does not
> consider it an error when a too-old LSN is requested, so we'd continue
> replication and hide data loss.

That skipping only happens if you request a start point older than confirmed_flush_lsn. It doesn't apply to this situation. The client cannot control where we start decoding; it's always restart_lsn, and if we can't find a needed WAL segment we'll ERROR.

So this is safe, though the error will be something about being unable to find a WAL segment, which users might not directly associate with having set this option. It won't say "slot disabled because needed WAL has been discarded due to [setting]" or anything like that.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
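[For reference, the walsender-side failure Craig describes surfaces to the client roughly as follows; the message wording is from memory and the segment name is invented, so treat it as approximate.]

ERROR:  requested WAL segment 000000010000000000000028 has already been removed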
Hello, I'll add this to CF2017-09. At Mon, 06 Mar 2017 18:20:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170306.182006.172683338.horiguchi.kyotaro@lab.ntt.co.jp> > Thank you for the comment. > > At Fri, 3 Mar 2017 14:47:20 -0500, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <ac510b45-7805-7ccc-734c-1b38a6645f3e@2ndquadrant.com> > > On 3/1/17 19:54, Kyotaro HORIGUCHI wrote: > > >> Please measure it in size, not in number of segments. > > > It was difficult to dicide which is reaaonable but I named it > > > after wal_keep_segments because it has the similar effect. > > > > > > In bytes(or LSN) > > > max_wal_size > > > min_wal_size > > > wal_write_flush_after > > > > > > In segments > > > wal_keep_segments > > > > We have been moving away from measuring in segments. For example, > > checkpoint_segments was replaced by max_wal_size. > > > > Also, with the proposed patch that allows changing the segment size more > > easily, this will become more important. (I wonder if that will require > > wal_keep_segments to change somehow.) > > Agreed. It is 'max_slot_wal_keep_size' in the new version. > > wal_keep_segments might should be removed someday. - Following to min/max_wal_size, the variable was renamed to "max_slot_wal_keep_size_mb" and used as ConvertToXSegs(x)" - Stopped warning when checkpoint doesn't flush segments required by slots even if max_slot_wal_keep_size have worked. - Avoided subraction that may be negative. regards, *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** *** 105,110 **** int wal_level = WAL_LEVEL_MINIMAL; --- 105,111 ---- int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings =5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; + int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; *************** *** 9353,9361 **** KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) --- 9354,9385 ---- if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb); XLByteToSeg(keep, slotSegNo); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *************** *** 2366,2371 **** static struct config_int ConfigureNamesInt[] = --- 2366,2382 ---- }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time towait for WAL replication."), NULL, *** a/src/backend/utils/misc/postgresql.conf.sample --- b/src/backend/utils/misc/postgresql.conf.sample *************** *** 235,240 **** --- 235,241 ---- #max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables + #max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** *** 97,102 **** extern bool reachedConsistency; --- 97,103 ---- extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; + extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval;
I'm still concerned about how the critical situation is handled. Your patch just prints a warning to the log and then goes on -- doing what? The warning rolls off the log, and then you have no idea what happened, or how to recover. I would like a flag in pg_replication_slots, and possibly also a numerical column that indicates how far away from the critical point each slot is. That would be great for a monitoring system. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello,

At Fri, 1 Sep 2017 23:49:21 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <751e09c4-93e0-de57-edd2-e64c4950f5e3@2ndquadrant.com>
> I'm still concerned about how the critical situation is handled. Your
> patch just prints a warning to the log and then goes on -- doing what?
>
> The warning rolls off the log, and then you have no idea what happened,
> or how to recover.

The victims should be complaining in their log files, but, yes, I must admit that it closely resembles /dev/null. And the catastrophe comes suddenly.

> I would like a flag in pg_replication_slots, and possibly also a
> numerical column that indicates how far away from the critical point
> each slot is. That would be great for a monitoring system.

Great! I'll do that right now.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,

At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170907.141212.227032666.horiguchi.kyotaro@lab.ntt.co.jp>
> > I would like a flag in pg_replication_slots, and possibly also a
> > numerical column that indicates how far away from the critical point
> > each slot is. That would be great for a monitoring system.
>
> Great! I'll do that right now.

Done. In the attached patch, on top of the previous patch, I added two columns to pg_replication_slots, "live" and "distance". The first indicates whether the slot will stay "live" after the next checkpoint. The second shows how many bytes the checkpoint LSN can advance before the slot "dies", or how many bytes the slot has lost after "death".

Setting wal_keep_segments = 1 and max_slot_wal_keep_size = 16MB:

=# select slot_name, restart_lsn, pg_current_wal_lsn(), live, distance from pg_replication_slots;
 slot_name | restart_lsn | pg_current_wal_lsn | live | distance
-----------+-------------+--------------------+------+-----------
 s1        | 0/162D388   | 0/162D3C0          | t    | 0/29D2CE8

This shows that the checkpoint can advance 0x29d2ce8 bytes before the slot dies, even if the connection stalls.

 s1        | 0/4001180   | 0/6FFF2B8          | t    | 0/DB8

Just before the slot loses sync.

 s1        | 0/4001180   | 0/70008A8          | f    | 0/FFEE80

The checkpoint after this removes some required segments.

2017-09-07 19:04:07.677 JST [13720] WARNING:  restart LSN of replication slots is ignored by checkpoint
2017-09-07 19:04:07.677 JST [13720] DETAIL:  Some replication slots have lost required WAL segments to continue by up to 1 segments.

If max_slot_wal_keep_size is not set (0), live is always true and distance is NULL.

 slot_name | restart_lsn | pg_current_wal_lsn | live | distance
-----------+-------------+--------------------+------+-----------
 s1        | 0/4001180   | 0/73117A8          | t    |

- The names (or the content) of the new columns are arguable.
- The pg_replication_slots view takes an LWLock on ControlFile and a spinlock on XLogCtl for every slot, but it seems difficult to reduce that..
- distance seems to mistakenly become 0/0 under certain conditions..
- The result seems almost right, but a more precise check is needed. (Anyway it cannot be perfectly exact.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 07 Sep 2017 21:59:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170907.215956.110216588.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, > > At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20170907.141212.227032666.horiguchi.kyotaro@lab.ntt.co.jp> > > > I would like a flag in pg_replication_slots, and possibly also a > > > numerical column that indicates how far away from the critical point > > > each slot is. That would be great for a monitoring system. > > > > Great! I'll do that right now. > > Done. The CF status of this patch turned into "Waiting on Author". This is because the second patch is posted separately from the first patch. I repost them together after rebasing to the current master. regards, -- Kyotaro Horiguchi NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** *** 105,110 **** int wal_level = WAL_LEVEL_MINIMAL; --- 105,111 ---- int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings =5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; + int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; *************** *** 9365,9373 **** KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) --- 9366,9397 ---- if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb); XLByteToSeg(keep, slotSegNo); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *************** *** 2371,2376 **** static struct config_int ConfigureNamesInt[] = --- 2371,2387 ---- }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time towait for WAL replication."), NULL, *** a/src/backend/utils/misc/postgresql.conf.sample --- b/src/backend/utils/misc/postgresql.conf.sample *************** *** 235,240 **** --- 235,241 ---- #max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables + #max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** *** 97,102 **** extern bool reachedConsistency; --- 97,103 ---- extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; + extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** *** 9336,9341 **** CreateRestartPoint(int flags) --- 9336,9420 ---- } /* + * Check if the record on the given lsn will be preserved at the next + * checkpoint. + * + * Returns true if it will be preserved. If distance is given, the distance + * from origin to the beginning of the first segment kept at the next + * checkpoint. It means margin when this function returns true and gap of lost + * records when false. + * + * This function should return the consistent result with KeepLogSeg. + */ + bool + GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance) + { + XLogRecPtr currpos; + XLogRecPtr tailpos; + uint64 currSeg; + uint64 restByteInSeg; + uint64 restartSeg; + uint64 tailSeg; + uint64 keepSegs; + + currpos = GetXLogWriteRecPtr(); + + LWLockAcquire(ControlFileLock, LW_SHARED); + tailpos = ControlFile->checkPointCopy.redo; + LWLockRelease(ControlFileLock); + + /* Move the pointer to the beginning of the segment*/ + XLByteToSeg(currpos, currSeg); + XLByteToSeg(restartLSN, restartSeg); + XLByteToSeg(tailpos, tailSeg); + restByteInSeg = 0; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + /* + * WAL are removed by the unit of segment. + */ + keepSegs = wal_keep_segments + ConvertToXSegs(max_slot_wal_keep_size_mb); + + /* + * If the latest checkpoint's redo point is older than the current head + * minus keep segments, the next checkpoint keeps the redo point's + * segment. Elsewise use current head minus number of segments to keep. 
+ */ + if (currSeg < tailSeg + keepSegs) + { + if (currSeg < keepSegs) + tailSeg = 0; + else + tailSeg = currSeg - keepSegs; + + /* In this case, the margin will be the bytes to the next segment */ + restByteInSeg = XLogSegSize - (currpos % XLogSegSize); + } + + /* Required sements will be removed at the next checkpoint */ + if (restartSeg < tailSeg) + { + /* Calculate how may bytes the slot have lost */ + if (distance) + { + uint64 restbytes = (restartSeg + 1) * XLogSegSize - restartLSN; + *distance = + (tailSeg - restartSeg - 1) * XLogSegSize + + restbytes; + } + return false; + } + + /* Margin at the next checkpoint before the slot lose sync */ + if (distance) + *distance = (restartSeg - tailSeg) * XLogSegSize + restByteInSeg; + + return true; + } + + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replicationslots. * *** a/src/backend/catalog/system_views.sql --- b/src/backend/catalog/system_views.sql *************** *** 793,799 **** CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, ! L.confirmed_flush_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid= D.oid); --- 793,801 ---- L.xmin, L.catalog_xmin, L.restart_lsn, ! L.confirmed_flush_lsn, ! L.live, ! L.distance FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); *** a/src/backend/replication/slotfuncs.c --- b/src/backend/replication/slotfuncs.c *************** *** 182,188 **** pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { ! #define PG_GET_REPLICATION_SLOTS_COLS 11 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; --- 182,188 ---- Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { ! #define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; *************** *** 304,309 **** pg_get_replication_slots(PG_FUNCTION_ARGS) --- 304,323 ---- else nulls[i++] = true; + if (max_slot_wal_keep_size_mb > 0 && restart_lsn != InvalidXLogRecPtr) + { + uint64 distance; + + values[i++] = BoolGetDatum(GetMarginToSlotSegmentLimit(restart_lsn, + &distance)); + values[i++] = Int64GetDatum(distance); + } + else + { + values[i++] = BoolGetDatum(true); + nulls[i++] = true; + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** *** 267,272 **** extern void ShutdownXLOG(int code, Datum arg); --- 267,273 ---- extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(intflags); + extern bool GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance); extern void XLogPutNextOid(Oid nextOid);extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 5347,5353 **** DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physicalreplication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 02278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a replicationslot"); ! 
DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use");DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); --- 5347,5353 ---- DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSPPGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null__null_ )); DESCR("drop a replication slot"); ! DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,16,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,live,distance}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use");DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
At Wed, 13 Sep 2017 11:43:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170913.114306.67844218.horiguchi.kyotaro@lab.ntt.co.jp> horiguchi.kyotaro> At Thu, 07 Sep 2017 21:59:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>wrote in <20170907.215956.110216588.horiguchi.kyotaro@lab.ntt.co.jp> > > Hello, > > > > At Thu, 07 Sep 2017 14:12:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20170907.141212.227032666.horiguchi.kyotaro@lab.ntt.co.jp> > > > > I would like a flag in pg_replication_slots, and possibly also a > > > > numerical column that indicates how far away from the critical point > > > > each slot is. That would be great for a monitoring system. > > > > > > Great! I'll do that right now. > > > > Done. > > The CF status of this patch turned into "Waiting on Author". > This is because the second patch is posted separately from the > first patch. I repost them together after rebasing to the current > master. Hmm. I was unconsciously careless of regression test since it is in a tentative shape. This must pass the regression.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** *** 105,110 **** int wal_level = WAL_LEVEL_MINIMAL; --- 105,111 ---- int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings =5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; + int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; *************** *** 9365,9373 **** KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) --- 9366,9397 ---- if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb); XLByteToSeg(keep, slotSegNo); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *************** *** 2371,2376 **** static struct config_int ConfigureNamesInt[] = --- 2371,2387 ---- }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time towait for WAL replication."), NULL, *** a/src/backend/utils/misc/postgresql.conf.sample --- b/src/backend/utils/misc/postgresql.conf.sample *************** *** 235,240 **** --- 235,241 ---- #max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables + #max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** *** 97,102 **** extern bool reachedConsistency; --- 97,103 ---- extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; + extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** *** 9336,9341 **** CreateRestartPoint(int flags) --- 9336,9420 ---- } /* + * Check if the record on the given lsn will be preserved at the next + * checkpoint. + * + * Returns true if it will be preserved. If distance is given, the distance + * from origin to the beginning of the first segment kept at the next + * checkpoint. It means margin when this function returns true and gap of lost + * records when false. + * + * This function should return the consistent result with KeepLogSeg. + */ + bool + GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance) + { + XLogRecPtr currpos; + XLogRecPtr tailpos; + uint64 currSeg; + uint64 restByteInSeg; + uint64 restartSeg; + uint64 tailSeg; + uint64 keepSegs; + + currpos = GetXLogWriteRecPtr(); + + LWLockAcquire(ControlFileLock, LW_SHARED); + tailpos = ControlFile->checkPointCopy.redo; + LWLockRelease(ControlFileLock); + + /* Move the pointer to the beginning of the segment*/ + XLByteToSeg(currpos, currSeg); + XLByteToSeg(restartLSN, restartSeg); + XLByteToSeg(tailpos, tailSeg); + restByteInSeg = 0; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + /* + * WAL are removed by the unit of segment. + */ + keepSegs = wal_keep_segments + ConvertToXSegs(max_slot_wal_keep_size_mb); + + /* + * If the latest checkpoint's redo point is older than the current head + * minus keep segments, the next checkpoint keeps the redo point's + * segment. Elsewise use current head minus number of segments to keep. 
+ */ + if (currSeg < tailSeg + keepSegs) + { + if (currSeg < keepSegs) + tailSeg = 0; + else + tailSeg = currSeg - keepSegs; + + /* In this case, the margin will be the bytes to the next segment */ + restByteInSeg = XLogSegSize - (currpos % XLogSegSize); + } + + /* Required sements will be removed at the next checkpoint */ + if (restartSeg < tailSeg) + { + /* Calculate how may bytes the slot have lost */ + if (distance) + { + uint64 restbytes = (restartSeg + 1) * XLogSegSize - restartLSN; + *distance = + (tailSeg - restartSeg - 1) * XLogSegSize + + restbytes; + } + return false; + } + + /* Margin at the next checkpoint before the slot lose sync */ + if (distance) + *distance = (restartSeg - tailSeg) * XLogSegSize + restByteInSeg; + + return true; + } + + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replicationslots. * *** a/src/backend/catalog/system_views.sql --- b/src/backend/catalog/system_views.sql *************** *** 793,799 **** CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, ! L.confirmed_flush_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid= D.oid); --- 793,801 ---- L.xmin, L.catalog_xmin, L.restart_lsn, ! L.confirmed_flush_lsn, ! L.live, ! L.distance FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); *** a/src/backend/replication/slotfuncs.c --- b/src/backend/replication/slotfuncs.c *************** *** 182,188 **** pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { ! #define PG_GET_REPLICATION_SLOTS_COLS 11 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; --- 182,188 ---- Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { ! #define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; *************** *** 304,309 **** pg_get_replication_slots(PG_FUNCTION_ARGS) --- 304,323 ---- else nulls[i++] = true; + if (max_slot_wal_keep_size_mb > 0 && restart_lsn != InvalidXLogRecPtr) + { + uint64 distance; + + values[i++] = BoolGetDatum(GetMarginToSlotSegmentLimit(restart_lsn, + &distance)); + values[i++] = Int64GetDatum(distance); + } + else + { + values[i++] = BoolGetDatum(true); + nulls[i++] = true; + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); *** a/src/include/access/xlog.h --- b/src/include/access/xlog.h *************** *** 267,272 **** extern void ShutdownXLOG(int code, Datum arg); --- 267,273 ---- extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(intflags); + extern bool GetMarginToSlotSegmentLimit(XLogRecPtr restartLSN, uint64 *distance); extern void XLogPutNextOid(Oid nextOid);extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); *** a/src/include/catalog/pg_proc.h --- b/src/include/catalog/pg_proc.h *************** *** 5347,5353 **** DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physicalreplication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 02278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a replicationslot"); ! 
DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use");DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); --- 5347,5353 ---- DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSPPGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null__null_ )); DESCR("drop a replication slot"); ! DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,16,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,live,distance}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use");DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); *** a/src/test/regress/expected/rules.out --- b/src/test/regress/expected/rules.out *************** *** 1451,1458 **** pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, ! l.confirmed_flush_lsn ! FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, --- 1451,1460 ---- l.xmin, l.catalog_xmin, l.restart_lsn, ! l.confirmed_flush_lsn, ! l.live, ! l.distance ! FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, live, distance) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles|SELECT pg_authid.rolname, pg_authid.rolsuper, -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hello, this is a rebased version. It gets a change of the meaning of monitoring value along with rebasing. In previous version, the "live" column mysteriously predicts the necessary segments will be kept or lost by the next checkpoint and the "distance" offered a still more mysterious value. In this version the meaning of the two columns became clear and informative. pg_replication_slots - live : true the slot have not lost necessary segments. - distance: how many bytes LSN can advance before the margin defined by max_slot_wal_keep_size (and wal_keep_segments)is exhasuted, or how many bytes this slot have lost xlog from restart_lsn. There is a case where live = t and distance = 0. The slot is currently having all the necessary segments but will start to lose them at most two checkpoint passes. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 57eaa2b878d30bfcebb093cca0e772fe7a9bff0e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 28 Feb 2017 11:39:48 +0900 Subject: [PATCH 1/2] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. ---src/backend/access/transam/xlog.c | 24 ++++++++++++++++++++++++src/backend/utils/misc/guc.c | 11 +++++++++++src/backend/utils/misc/postgresql.conf.sample | 1 +src/include/access/xlog.h | 1+4 files changed, 37 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dd028a1..f79cefb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL;int CommitDelay = 0; /* precommit delay inmicroseconds */int CommitSiblings = 5; /* # concurrent xacts needed to sleep */int wal_retrieve_retry_interval= 5000; +int max_slot_wal_keep_size_mb = 0;#ifdef WAL_DEBUGbool XLOG_DEBUG = false; @@ -9432,9 +9433,32 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb); XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 65372d7..511023a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2368,6 +2368,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to waitfor WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 368b280..e76c73a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -234,6 +234,7 @@#max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments= 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables#wal_sender_timeout = 60s # in milliseconds; 0 disables#max_replication_slots= 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 0f2b8bd..f0c0255 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size;extern int min_wal_size_mb;extern int max_wal_size_mb;extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb;extern int XLOGbuffers;extern int XLogArchiveTimeout;extern int wal_retrieve_retry_interval; -- 2.9.2 From 37749d46ba97de38e4593f141bb8a82c67fc0af5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Sep 2017 19:13:22 +0900 Subject: [PATCH 2/2] Add monitoring aid for max_replication_slots. Adds two columns "live" and "distance" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows how long a slot can live on or how many bytes a slot have lost if max_slot_wal_keep_size is set. ---src/backend/access/transam/xlog.c | 137 ++++++++++++++++++++++++++++++++++-src/backend/catalog/system_views.sql | 4 +-src/backend/replication/slotfuncs.c | 16 +++-src/include/access/xlog.h | 1 +src/include/catalog/pg_proc.h | 2 +-src/test/regress/expected/rules.out | 6 +-6 files changed, 160 insertions(+),6 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f79cefb..a9203ff 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9402,6 +9402,140 @@ CreateRestartPoint(int flags) return true;} + +/* + * Returns the segment number of the oldest file in XLOG directory. 
+ */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If distance is given, it receives the + * distance to the point where the margin defined by max_slot_wal_keep_size_mb + * and wal_keep_segments will be exhausted or how many bytes we have lost + * after restartLSN. + * + * true and distance = 0 means that restartLSN will be lost by at most two + * checkpoints. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, uint64 *distance) +{ + XLogRecPtr currpos; + XLogSegNo currSeg; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + uint64 keepSegs; + uint64 restbytes; + + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + /* no need to calculate distance. very easy. */ + if (!distance) + return oldestSeg <= restartSeg; + + /* This must perform the same thing as KeepLogSeg. */ + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* + * calculate the oldest segment that will be kept by wal_keep_segments and + * max_slot_wal_keep_size_mb + */ + if (currSeg < keepSegs) + tailSeg = 0; + else + tailSeg = currSeg - keepSegs; + + if (restartSeg < oldestSeg) + { + /* + * restartSeg has been removed. Calculate how many bytes from the + * restartLSN have lost. + */ + restbytes = (restartSeg + 1) * wal_segment_size - restartLSN; + *distance = + (oldestSeg - (restartSeg + 1)) * wal_segment_size + restbytes; + + return false; + } + + if (tailSeg <= restartSeg) + { + /* Return how many bytes we can advance before the slot loses margin */ + restbytes = wal_segment_size - (currpos % wal_segment_size); + *distance = (restartSeg - tailSeg) * wal_segment_size + restbytes; + } + else + { + /* the margin ran out */ + *distance = 0; + } + + return true; +} +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. 
@@ -9433,7 +9567,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; - int slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb); + int slotlimitsegs = + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); XLByteToSeg(keep, slotSegNo, wal_segment_size); diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index dc40cde..c55c88b 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.live, + L.distance FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ab776e8..107da1a 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -182,7 +182,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)Datumpg_get_replication_slots(PG_FUNCTION_ARGS){ -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -304,6 +304,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (max_slot_wal_keep_size_mb > 0 && restart_lsn != InvalidXLogRecPtr) + { + uint64 distance; + + values[i++] = BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &distance)); + values[i++] = Int64GetDatum(distance); + } + else + { + values[i++] = BoolGetDatum(true); + nulls[i++] = true; + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f0c0255..a7a1e4d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg);extern void InitXLOGAccess(void);extern void CreateCheckPoint(intflags);extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, uint64 *distance);extern void XLogPutNextOid(Oid nextOid);extern XLogRecPtrXLogRestorePoint(const char *rpName);extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 93c031a..0913e56 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5340,7 +5340,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0DESCR("create a physicalreplication slot");DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 02278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));DESCR("drop a replicationslot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,16,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,live,distance}" 
_null__null_ pg_get_replication_slots _null_ _null_ _null_ ));DESCR("information about replication slots currently in use");DATA(insertOID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ ));DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index f1c1b44..16a99d8 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.live, + l.distance + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, live, distance) LEFT JOIN pg_database d ON ((l.datoid = d.oid)));pg_roles|SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 31, 2017 at 10:43 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, this is a rebased version. Hello Horiguchi-san, I think the "ddl" test under contrib/test_decoding also needs to be updated because it looks at pg_replication_slots and doesn't expect your new columns. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 31 October 2017 at 17:43, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, this is a rebased version. > > It gets a change of the meaning of monitoring value along with > rebasing. > > In previous version, the "live" column mysteriously predicts the > necessary segments will be kept or lost by the next checkpoint > and the "distance" offered a still more mysterious value. Would it make sense to teach xlogreader how to fetch from WAL archive, too? That way if there's an archive, slots could continue to be used even after we purge from local pg_xlog, albeit at a performance cost. I'm thinking of this mainly for logical slots. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-11-06 11:07:04 +0800, Craig Ringer wrote: > Would it make sense to teach xlogreader how to fetch from WAL archive, > too? That way if there's an archive, slots could continue to be used > even after we purge from local pg_xlog, albeit at a performance cost. > > I'm thinking of this mainly for logical slots. That seems more like a page read callback's job than xlogreader's. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi, On 2017-10-31 18:43:10 +0900, Kyotaro HORIGUCHI wrote: > - distance: > how many bytes LSN can advance before the margin defined by > max_slot_wal_keep_size (and wal_keep_segments) is exhasuted, > or how many bytes this slot have lost xlog from restart_lsn. I don't think 'distance' is a good metric - that's going to continually change. Why not store the LSN that's available and provide a function that computes this? Or just rely on the lsn - lsn operator? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
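[Editorial note: for illustration only, not part of any patch in this thread. The lsn - lsn operator mentioned above already works against the existing pg_replication_slots columns, so the retained WAL per slot can be recomputed on demand; a minimal sketch:

  -- bytes of WAL currently retained on behalf of each slot
  select slot_name,
         pg_current_wal_lsn() - restart_lsn as retained_bytes
    from pg_replication_slots
   where restart_lsn is not null;
]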
Hello,

At Mon, 6 Nov 2017 05:20:50 -0800, Andres Freund <andres@anarazel.de> wrote in <20171106132050.6apzynxrqrzghb4r@alap3.anarazel.de>
> Hi,
>
> On 2017-10-31 18:43:10 +0900, Kyotaro HORIGUCHI wrote:
> > - distance:
> >   how many bytes LSN can advance before the margin defined by
> >   max_slot_wal_keep_size (and wal_keep_segments) is exhasuted,
> >   or how many bytes this slot have lost xlog from restart_lsn.
>
> I don't think 'distance' is a good metric - that's going to continually
> change. Why not store the LSN that's available and provide a function
> that computes this? Or just rely on the lsn - lsn operator?

It seems reasonable. The 'secured minimum LSN' is common among all slots, so showing it in the view may look a bit stupid, but I don't find another suitable place for it. In the previous patch, distance = 0 meant the state where the slot is still alive but insecure, and that information is lost by changing 'distance' to 'min_secure_lsn'.

Thus I changed the 'live' column to 'status' and show that status in text representation.

status: secured | insecured | broken

So this looks like the following (max_slot_wal_keep_size = 8MB, which is half of the default segment size). A rough monitoring query over these columns is sketched after the attached patch.

-- slots whose required WAL is surely available
select restart_lsn, status, min_secure_lsn, pg_current_wal_lsn() from pg_replication_slots;
 restart_lsn | status  | min_secure_lsn | pg_current_wal_lsn
-------------+---------+----------------+--------------------
 0/1A000060  | secured | 0/1A000000     | 0/1B42BC78

-- slots whose required WAL is still available but insecure
 restart_lsn | status    | min_secure_lsn | pg_current_wal_lsn
-------------+-----------+----------------+--------------------
 0/1A000060  | insecured | 0/1C000000     | 0/1D76C948

-- slots whose required WAL is lost
# We should have seen the log 'Some replication slots have lost...'
 restart_lsn | status | min_secure_lsn | pg_current_wal_lsn
-------------+--------+----------------+--------------------
 0/1A000060  | broken | 0/1C000000     | 0/1D76C9F0

I noticed that I had been ignoring the segment fragment of max_slot_wal_keep_size in these calculations. The current patch honors the fragment part of max_slot_wal_keep_size.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 109f056e257aba70dddc8d466767ed0a1db371e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Feb 2017 11:39:48 +0900
Subject: [PATCH 1/2] Add WAL releaf vent for replication slots

Adds a capability to limit the number of segments kept by replication slots by a GUC variable.
---src/backend/access/transam/xlog.c | 39 +++++++++++++++++++++++++++src/backend/utils/misc/guc.c | 11 ++++++++src/backend/utils/misc/postgresql.conf.sample | 1 +src/include/access/xlog.h | 1+4 files changed, 52 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dd028a1..cfdae39 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL;int CommitDelay = 0; /* precommit delay inmicroseconds */int CommitSiblings = 5; /* # concurrent xacts needed to sleep */int wal_retrieve_retry_interval= 5000; +int max_slot_wal_keep_size_mb = 0;#ifdef WAL_DEBUGbool XLOG_DEBUG = false; @@ -9432,9 +9433,47 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs; + uint64 recptroff; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + + recptroff = XLogSegmentOffset(recptr, wal_segment_size); + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + + /* calculate segments to keep by max_slot_wal_keep_size_mb */ + slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb, + wal_segment_size); + /* honor the fragment */ + if (recptroff < slotlimitfragment) + slotlimitsegs++; XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 65372d7..511023a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2368,6 +2368,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to waitfor WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 368b280..e76c73a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -234,6 +234,7 @@#max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments= 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables#wal_sender_timeout = 60s # in milliseconds; 0 disables#max_replication_slots= 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 0f2b8bd..f0c0255 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size;extern int min_wal_size_mb;extern int max_wal_size_mb;extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb;extern int XLOGbuffers;extern int XLogArchiveTimeout;extern int wal_retrieve_retry_interval; -- 2.9.2 From 03eb40b7af4df41e0d755c1c00af1dfa5a71b09a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Sep 2017 19:13:22 +0900 Subject: [PATCH 2/2] Add monitoring aid for max_replication_slots. Adds two columns "live" and "distance" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows how long a slot can live on or how many bytes a slot have lost if max_slot_wal_keep_size is set. ---src/backend/access/transam/xlog.c | 122 ++++++++++++++++++++++++++++++++++-src/backend/catalog/system_views.sql | 4 +-src/backend/replication/slotfuncs.c | 27 +++++++-src/include/access/xlog.h | 1 +src/include/catalog/pg_proc.h | 2 +-src/test/regress/expected/rules.out | 6 +-6 files changed, 156 insertions(+),6 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cfdae39..8ce7044 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9402,6 +9402,122 @@ CreateRestartPoint(int flags) return true;} + +/* + * Returns the segment number of the oldest file in XLOG directory. 
+ */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If distance is given, it receives the + * distance to the point where the margin defined by max_slot_wal_keep_size_mb + * and wal_keep_segments will be exhausted or how many bytes we have lost + * after restartLSN. + * + * true and distance = 0 means that restartLSN will be lost by at most two + * checkpoints. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN) +{ + XLogRecPtr currpos; + XLogSegNo currSeg; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + uint64 keepSegs; + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minSecureLSN) + { + uint64 slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + uint64 slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + uint64 currposoff = XLogSegmentOffset(currpos, wal_segment_size); + + /* Calculate keep segments. Must be in sync with KeepLogSeg. */ + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that will be kept by wal_keep_segments and + * max_slot_wal_keep_size_mb + */ + if (currSeg < keepSegs) + tailSeg = 0; + else + tailSeg = currSeg - keepSegs; + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minSecureLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9429,7 +9545,11 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) segno = segno - wal_keep_segments; } - /* then check whether slots limit removal further */ + /* + * then check whether slots limit removal further + * should be consistent with IsLsnStillAvaiable(). 
+ */ + if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index dc40cde..6512ac3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.status, + L.min_secure_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ab776e8..0dca618 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -182,7 +182,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)Datumpg_get_replication_slots(PG_FUNCTION_ARGS){ -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -304,6 +304,31 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (max_slot_wal_keep_size_mb > 0 && restart_lsn != InvalidXLogRecPtr) + { + XLogRecPtr min_secure_lsn; + char *status = "unknown"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_secure_lsn))) + { + if (min_secure_lsn <= restart_lsn) + status = "secured"; + else + status = "insecured"; + } + else + status = "broken"; + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_secure_lsn); + } + else + { + values[i++] = BoolGetDatum(true); + nulls[i++] = true; + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f0c0255..a316ead 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg);extern void InitXLOGAccess(void);extern void CreateCheckPoint(intflags);extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN);extern void XLogPutNextOid(Oid nextOid);externXLogRecPtr XLogRestorePoint(const char *rpName);extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 93c031a..d03fd6f 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5340,7 +5340,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0DESCR("create a physicalreplication slot");DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 02278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));DESCR("drop a replicationslot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,status,min_secure_lsn}" _null__null_ pg_get_replication_slots 
_null_ _null_ _null_ ));DESCR("information about replication slots currently in use");DATA(insertOID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ ));DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index f1c1b44..d9d74a3 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.status, + l.min_secure_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, status, min_secure_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid)));pg_roles|SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
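[Editorial note: as referenced above, a rough monitoring query over the columns added by this version of the patch might look like the following. It is a sketch, not part of the patch; 'status' and 'min_secure_lsn' are the column names proposed in this message.

  select slot_name, status, restart_lsn, min_secure_lsn,
         pg_current_wal_lsn() - restart_lsn as retained_bytes
    from pg_replication_slots
   where status <> 'secured';
]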
Oops! The previous patch forgets the default case and crashes.

At Wed, 08 Nov 2017 13:14:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171108.131431.170534842.horiguchi.kyotaro@lab.ntt.co.jp>
> > I don't think 'distance' is a good metric - that's going to continually
> > change. Why not store the LSN that's available and provide a function
> > that computes this? Or just rely on the lsn - lsn operator?
>
> It seems reasonable. The 'secured minimum LSN' is common among
> all slots, so showing it in the view may look a bit stupid, but I
> don't find another suitable place for it. In the previous patch,
> distance = 0 meant the state where the slot is still alive but
> insecure, and that information is lost by changing 'distance' to
> 'min_secure_lsn'.
>
> Thus I changed the 'live' column to 'status' and show that status
> in text representation.
>
> status: secured | insecured | broken
>
> So this looks like the following (max_slot_wal_keep_size = 8MB,
> which is half of the default segment size)
>
> -- slots whose required WAL is surely available
> select restart_lsn, status, min_secure_lsn, pg_current_wal_lsn() from pg_replication_slots;
>  restart_lsn | status  | min_secure_lsn | pg_current_wal_lsn
> -------------+---------+----------------+--------------------
>  0/1A000060  | secured | 0/1A000000     | 0/1B42BC78
>
> -- slots whose required WAL is still available but insecure
>  restart_lsn | status    | min_secure_lsn | pg_current_wal_lsn
> -------------+-----------+----------------+--------------------
>  0/1A000060  | insecured | 0/1C000000     | 0/1D76C948
>
> -- slots whose required WAL is lost
> # We should have seen the log 'Some replication slots have lost...'
>
>  restart_lsn | status | min_secure_lsn | pg_current_wal_lsn
> -------------+--------+----------------+--------------------
>  0/1A000060  | broken | 0/1C000000     | 0/1D76C9F0
>
> I noticed that I had been ignoring the segment fragment of
> max_slot_wal_keep_size in these calculations. The current patch
> honors the fragment part of max_slot_wal_keep_size.

I changed IsLsnStillAvailable to return meaningful values regardless of whether max_slot_wal_keep_size is set or not.

# I had been forgetting to count the version for the latest several
# patches. I give this version '4' - as the next of the last
# numbered patch.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 109f056e257aba70dddc8d466767ed0a1db371e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Feb 2017 11:39:48 +0900
Subject: [PATCH 1/2] Add WAL releaf vent for replication slots

Adds a capability to limit the number of segments kept by replication slots by a GUC variable.
---src/backend/access/transam/xlog.c | 39 +++++++++++++++++++++++++++src/backend/utils/misc/guc.c | 11 ++++++++src/backend/utils/misc/postgresql.conf.sample | 1 +src/include/access/xlog.h | 1+4 files changed, 52 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dd028a1..cfdae39 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL;int CommitDelay = 0; /* precommit delay inmicroseconds */int CommitSiblings = 5; /* # concurrent xacts needed to sleep */int wal_retrieve_retry_interval= 5000; +int max_slot_wal_keep_size_mb = 0;#ifdef WAL_DEBUGbool XLOG_DEBUG = false; @@ -9432,9 +9433,47 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; + int slotlimitsegs; + uint64 recptroff; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + + recptroff = XLogSegmentOffset(recptr, wal_segment_size); + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + + /* calculate segments to keep by max_slot_wal_keep_size_mb */ + slotlimitsegs = ConvertToXSegs(max_slot_wal_keep_size_mb, + wal_segment_size); + /* honor the fragment */ + if (recptroff < slotlimitfragment) + slotlimitsegs++; XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * ignore slots if too many wal segments are kept. + * max_slot_wal_keep_size is just accumulated on wal_keep_segments. + */ + if (max_slot_wal_keep_size_mb > 0 && slotSegNo + slotlimitsegs < segno) + { + segno = segno - slotlimitsegs; /* must be positive */ + + /* + * warn only if the checkpoint flushes the required segment. + * we assume here that *logSegNo is calculated keep location. + */ + if (slotSegNo < *logSegNo) + ereport(WARNING, + (errmsg ("restart LSN of replication slots is ignored by checkpoint"), + errdetail("Some replication slots have lost required WAL segnents to continue by up to %ld segments.", + (segno < *logSegNo ? 
segno : *logSegNo) - slotSegNo))); + + /* emergency vent */ + slotSegNo = segno; + } + if (slotSegNo <= 0) segno = 1; else if (slotSegNo < segno) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 65372d7..511023a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2368,6 +2368,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to waitfor WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 368b280..e76c73a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -234,6 +234,7 @@#max_wal_senders = 10 # max number of walsender processes # (change requires restart)#wal_keep_segments= 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables#wal_sender_timeout = 60s # in milliseconds; 0 disables#max_replication_slots= 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 0f2b8bd..f0c0255 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size;extern int min_wal_size_mb;extern int max_wal_size_mb;extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb;extern int XLOGbuffers;extern int XLogArchiveTimeout;extern int wal_retrieve_retry_interval; -- 2.9.2 From 67f73c35b0c1c97bd2fff80139bfd3b7142f6bee Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Sep 2017 19:13:22 +0900 Subject: [PATCH 2/2] Add monitoring aid for max_replication_slots. Adds two columns "live" and "distance" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows how long a slot can live on or how many bytes a slot have lost if max_slot_wal_keep_size is set. ---src/backend/access/transam/xlog.c | 128 ++++++++++++++++++++++++++++++++++-src/backend/catalog/system_views.sql | 4 +-src/backend/replication/slotfuncs.c | 25 ++++++-src/include/access/xlog.h | 1 +src/include/catalog/pg_proc.h | 2 +-src/test/regress/expected/rules.out | 6 +-6 files changed, 160 insertions(+),6 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cfdae39..be53e0f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9402,6 +9402,128 @@ CreateRestartPoint(int flags) return true;} + +/* + * Returns the segment number of the oldest file in XLOG directory. 
+ */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minSecureLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN) +{ + XLogRecPtr currpos; + XLogSegNo currSeg; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + uint64 keepSegs; + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minSecureLSN) + { + if (max_slot_wal_keep_size_mb > 0) + { + uint64 slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + uint64 slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + uint64 currposoff = XLogSegmentOffset(currpos, wal_segment_size); + + /* Calculate keep segments. Must be in sync with KeepLogSeg. */ + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that will be kept by + * wal_keep_segments and max_slot_wal_keep_size_mb + */ + if (currSeg < keepSegs) + tailSeg = 0; + else + tailSeg = currSeg - keepSegs; + + } + else + { + /* all requred segments are secured in this case */ + XLogRecPtr keep = XLogGetReplicationSlotMinimumLSN(); + XLByteToSeg(keep, tailSeg, wal_segment_size); + } + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minSecureLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9429,7 +9551,11 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) segno = segno - wal_keep_segments; } - /* then check whether slots limit removal further */ + /* + * then check whether slots limit removal further + * should be consistent with IsLsnStillAvaiable(). 
+ */ + if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) { XLogSegNo slotSegNo; diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index dc40cde..6512ac3 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.status, + L.min_secure_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ab776e8..200a478 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -182,7 +182,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS)Datumpg_get_replication_slots(PG_FUNCTION_ARGS){ -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -304,6 +304,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_secure_lsn; + char *status = "borken"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_secure_lsn))) + { + if (min_secure_lsn <= restart_lsn) + status = "secured"; + else + status = "insecured"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_secure_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f0c0255..a316ead 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg);extern void InitXLOGAccess(void);extern void CreateCheckPoint(intflags);extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN);extern void XLogPutNextOid(Oid nextOid);externXLogRecPtr XLogRestorePoint(const char *rpName);extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 93c031a..d03fd6f 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5340,7 +5340,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0DESCR("create a physicalreplication slot");DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 02278 "19" _null_ _null_ _null_ _null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ ));DESCR("drop a replicationslot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,status,min_secure_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ 
));DESCR("information about replication slots currently in use");DATA(insertOID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 1916" "{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ ));DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index f1c1b44..d9d74a3 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.status, + l.min_secure_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, status, min_secure_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid)));pg_roles|SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Nov 9, 2017 at 5:31 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > # I had been forgetting to count the version for latestst several > # patches. I give the version '4' - as the next of the last > # numbered patch. With all the changes that have happened in the documentation lately, I suspect that this is going to need a rework.. Moved to next CF per lack of reviews, with waiting on author as status. -- Michael
At Thu, 30 Nov 2017 12:44:16 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS4bhSsDm_47GVjQno=iU6thx13MQVwwXXKBHQwfwwNCA@mail.gmail.com> > On Thu, Nov 9, 2017 at 5:31 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > # I had been forgetting to count the version for latestst several > > # patches. I give the version '4' - as the next of the last > > # numbered patch. > > With all the changes that have happened in the documentation lately, I > suspect that this is going to need a rework.. Moved to next CF per > lack of reviews, with waiting on author as status. I refactored this patch so that almost-same don't appear twice. And added recovery TAP test for this. New function GetMinSecuredSegment() calculates the segment number considering wal_keep_segments and max_slot_wal_keep_size. KeepLogSeg and IsLsnStillAvailable no longer have the code block that should be in "sync". I think the new code is far understandable than the previous one. The new third patch contains a TAP test to check max_slot_wal_keep_size and relevant stats view are surely working. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From b80345f1d600ecb427fec8e0a03bb4ed0f1ec7ba Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/3] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 114 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 107 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3e9a12d..723a983 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetMinSecuredSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9348,6 +9350,74 @@ CreateRestartPoint(int flags) } /* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetMinSecuredSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. 
+ */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. * @@ -9359,34 +9429,38 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetMinSecuredSegment(recptr, slotminptr); - XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) + { + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The most affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } + if (minSegNo < segno) + segno = minSegNo; + /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) *logSegNo = segno; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index e32901d..97d83f3 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2395,6 +2395,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 69f40f0..c7335b6 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -234,6 +234,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index dd7d8b5..45eb51a 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.9.2 From c972f22a2697f54cda71b6b4e7b7f0eac477e9af Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/3] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. --- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 +++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.h | 2 +- src/test/regress/expected/rules.out | 6 ++- 6 files changed, 126 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 723a983..b630224 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9349,6 +9349,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. 
+ */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minSecureLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minSecureLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetMinSecuredSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minSecureLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 394aea8..4167146 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.status, + L.min_secure_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ab776e8..3880807 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -182,7 +182,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -304,6 +304,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_secure_lsn; + char *status = "broken"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_secure_lsn))) + { + if (min_secure_lsn <= restart_lsn) + status = "secured"; + else + status = "insecured"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_secure_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 45eb51a..542df28 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index c969375..1157438 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5340,7 +5340,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null__null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a replication slot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,status,min_secure_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use"); 
DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16""{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index f1c1b44..d9d74a3 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.status, + l.min_secure_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, status, min_secure_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 From badd89a8c167cc7887349564c6f8fb3007d158f1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/3] TAP test for the slot limit feature --- src/test/recovery/t/014_replslot_limit.pl | 162 ++++++++++++++++++++++++++++++ 1 file changed, 162 insertions(+) create mode 100644 src/test/recovery/t/014_replslot_limit.pl diff --git a/src/test/recovery/t/014_replslot_limit.pl b/src/test/recovery/t/014_replslot_limit.pl new file mode 100644 index 0000000..41b828d --- /dev/null +++ b/src/test/recovery/t/014_replslot_limit.pl @@ -0,0 +1,162 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, status, min_secure_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|secured|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. 
+$node_master->safe_psql('postgres', "CHECKPOINT;"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, status, min_secure_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|secured|$start_lsn", 'check slot is securing all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, status FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|insecured", 'check some segments became insecured'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check no replication failure is caused by insecure state'); + +# Advance WAL again +advance_wal($node_master, 10); + +my $logstart = get_log_size($node_master); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, status FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|broken", 'check overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by ten segments (= 160MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.9.2
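[Editorial note: the TAP test above drives this scenario programmatically. For manual experimentation, roughly the same sequence can be reproduced from psql on the master; this is a sketch assuming these patches are applied, a physical slot 'rep1' exists, and its standby is stopped.

  alter system set max_slot_wal_keep_size = '32MB';
  select pg_reload_conf();
  -- repeat the next line enough times to advance WAL well past the limit
  create table t (a int); drop table t; select pg_switch_wal();
  -- old segments are only removed at a checkpoint
  checkpoint;
  select restart_lsn, status, min_secure_lsn
    from pg_replication_slots where slot_name = 'rep1';
]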
Hello
I think limiting the WAL kept by replication slots is useful in some cases. But at first I was confused by the proposed terminology secured/insecured/broken/unknown state.
patch -p1 gives some "Stripping trailing CRs from patch" messages for me, but it applied to current HEAD and builds. After a little testing I understood the difference in the secured/insecured/broken terminology: secured means the WAL is guaranteed to be kept, insecured means the WAL may be deleted at the next checkpoint, and broken means the WAL has already been deleted.
I think we may split "secured" into "streaming" and... hmm... "waiting"? "keeping"? - according to the active flag, to make the "status" field clearer and more readable.
regards, Sergei
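For concreteness, checking those states on a running server only takes a query against the extended view. The sketch below assumes the patch version reviewed above, whose columns are named "status" and "min_secure_lsn" (they are renamed later in the thread); the slot name 'rep1' is only an example:

    SELECT slot_name,
           active,
           restart_lsn,
           status,          -- 'secured', 'insecured', 'broken' or 'unknown' in this revision
           min_secure_lsn   -- oldest LSN that the next checkpoint will still retain
      FROM pg_replication_slots
     WHERE slot_name = 'rep1';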
Hello. Thank you for the comment. (And sorry for the absence.) At Fri, 22 Dec 2017 15:04:20 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <337571513944260@web55j.yandex.ru>
> I think limiting the WAL kept by replication slots is useful in some
> cases. But at first I was confused by the proposed terminology
> secured/insecured/broken/unknown state.
I'm not confident about the terminology. Suggestions are welcome on wording that makes more sense.
> patch -p1 gives some "Stripping trailing CRs from patch" messages for
> me, but it applied to current HEAD and builds.
Hmm. I wonder why I get that complaint so often. (Is it rather common, or is it caused by the MIME format of my mail?) I'd say with some confidence that it is because you retrieved the patch file with a Windows mailer.
> After a little testing I understood the difference in the
> secured/insecured/broken terminology: secured means the WAL is
> guaranteed to be kept, insecured means the WAL may be deleted at the
> next checkpoint, and broken means the WAL has already been deleted.
Right. I'm sorry that I hadn't written that clearly anywhere and made you confirm it yourself. I added documentation as the fourth patch.
> I think we may split "secured" into "streaming" and... hmm...
> "waiting"? "keeping"? - according to the active flag, to make the
> "status" field clearer and more readable.
streaming / keeping and lost? (and unknown) Also, "status" surely carries a somewhat obscure meaning. Would wal_status (or (wal_)availability) and min_keep_lsn make more sense? The additional fields in pg_replication_slots have been changed as follows in the attached patch.
 confirmed_flush_lsn:
+ wal_status : (streaming | keeping | lost | unknown)
+ min_keep_lsn : <The oldest LSN that is available in WAL files>
The documentation changes can be seen in the following HTML files.
doc/src/sgml/html/warm-standby.html#STREAMING-REPLICATION-SLOTS
doc/src/sgml/html/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE
doc/src/sgml/html/view-pg-replication-slots.html
One annoyance is that min_keep_lsn always has the same value for all slots.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
From 95e2a5eec2cfd9cbcafbac8617bd5ccdecbed6d2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable.
--- src/backend/access/transam/xlog.c | 114 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 107 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e42b828..bdb7156 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetMinKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9348,6 +9350,74 @@ CreateRestartPoint(int flags) } /* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetMinKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. + */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. 
* @@ -9359,34 +9429,38 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetMinKeepSegment(recptr, slotminptr); - XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * warn if the checkpoint flushes the segments required by replication + * slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) + { + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The most affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } + if (minSegNo < segno) + segno = minSegNo; + /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) *logSegNo = segno; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 72f6be3..7bfadcf 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2395,6 +2395,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 69f40f0..c7335b6 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -234,6 +234,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d..12cd0d1 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.9.2 From f77b61a72eb6ac7c1775a3314c70159c6b6d834d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. 
Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. --- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 +++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.h | 2 +- src/test/regress/expected/rules.out | 6 ++- 6 files changed, 126 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index bdb7156..a8423f7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9349,6 +9349,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetMinKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5652e9e..cd714cc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index b02df59..84f4154 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -182,7 +182,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -304,6 +304,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1..52e64f3 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 298e0ae..93f62d7 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5345,7 +5345,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null__null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a replication slot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use"); DATA(insert OID 
= 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16""{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index f1c1b44..75d44af 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 From dce2d9d4fa5d09d23e4b43734e6e5e2d9e59fe9a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/014_replslot_limit.pl | 162 ++++++++++++++++++++++++++++++ 1 file changed, 162 insertions(+) create mode 100644 src/test/recovery/t/014_replslot_limit.pl diff --git a/src/test/recovery/t/014_replslot_limit.pl b/src/test/recovery/t/014_replslot_limit.pl new file mode 100644 index 0000000..d397049 --- /dev/null +++ b/src/test/recovery/t/014_replslot_limit.pl @@ -0,0 +1,162 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. 
+$node_master->safe_psql('postgres', "CHECKPOINT;"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check slot is securing all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check no replication failure is caused by insecure state'); + +# Advance WAL again +advance_wal($node_master, 10); + +my $logstart = get_log_size($node_master); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by ten segments (= 160MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.9.2 From 97c6a15b28334173ea500b880b06c59d3c177f71 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-) diff 
--git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3f02202..249b336 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9791,6 +9791,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e4a0169..efa7d7b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2977,6 +2977,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 46bf198..7bf5cc7 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.9.2
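As a rough usage sketch of the patch set above (not part of the patch itself): since max_slot_wal_keep_size is a SIGHUP parameter, it can be changed without a restart, and the new view columns can then be watched for slots that are close to the limit. The '8GB' value here is only an example.

    ALTER SYSTEM SET max_slot_wal_keep_size = '8GB';
    SELECT pg_reload_conf();

    -- 'keeping' means some required segments will be removed by the next
    -- checkpoint; 'lost' means they are already gone.
    SELECT slot_name, active, restart_lsn, wal_status, min_keep_lsn
      FROM pg_replication_slots;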
Hello
>> patch -p1 gives some "Stripping trailing CRs from patch"
>> messages for me, but it applied to current HEAD and builds.
>
> Hmm. I wonder why I get that complaint so often. (Is it rather common,
> or is it caused by the MIME format of my mail?) I'd say with some
> confidence that it is because you retrieved the patch file with a
> Windows mailer.
I use Debian and a web-based mailer. Hm, I wget the patches from the links here https://www.postgresql.org/message-id/flat/20180111.155910.26212237.horiguchi.kyotaro%40lab.ntt.co.jp - they apply cleanly for both the last and the previous messages. It's strange.
The updated patches build OK, but I found one failed test in make check-world: contrib/test_decoding/sql/ddl.sql runs SELECT * FROM pg_replication_slots; at the end, whose result has of course changed.
And I still have no better idea for naming. I'm thinking of something like
if (min_keep_lsn <= restart_lsn)
    if (active_pid != 0)
        status = "streaming";
    else
        status = "keeping";
else
    status = "may_lost";
This duplicates the existing active field, but I think it is useful as a slot status description. wal_status streaming/keeping/lost/unknown, as described in the docs patch, is also acceptable to me. Maybe someone else has a better idea?
Regards, Sergei
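Just to illustrate the proposal above (this is not what the patch implements): the suggested three-way naming could be prototyped on top of the columns the patch already exposes with a query like the following, where may_lost corresponds to the patch's "lost" state:

    SELECT slot_name,
           CASE
             WHEN restart_lsn IS NULL                    THEN 'unknown'
             WHEN min_keep_lsn <= restart_lsn AND active THEN 'streaming'
             WHEN min_keep_lsn <= restart_lsn            THEN 'keeping'
             ELSE 'may_lost'
           END AS proposed_status
      FROM pg_replication_slots;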
On 11 January 2018 at 09:55, Sergei Kornilov <sk@zsrv.org> wrote: > if (active_pid != 0) > status = "streaming"; > else > status = "keeping"; Perhaps "idle" by analogy to a pg_stat_activity entry for a backend that's connected but not doing anything. > status = "may_lost"; Perhaps "stale" or "expired"? Is this patch in bike-shed territory? Are there any questions about whether we want the basic shape to look like this? Fwiw I think there's a real need for this feature so I would like to get it in for Postgres 11. -- greg
Hello, At Thu, 11 Jan 2018 12:55:27 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <2798121515664527@web40g.yandex.ru> > Hello > > >> patch -p1 gives some "Stripping trailing CRs from patch" > >> messages for me, but applied to current HEAD and builds. After > > > > Hmm. I wonder why I get that complaint so often. (It's rather > > common? or caused by the MIME format of my mail?) I'd say with > > confidence that it is because you retrieved the patch file on > > Windows mailer. > I use Debian and web based mailer. Hm, i wget patches from links here https://www.postgresql.org/message-id/flat/20180111.155910.26212237.horiguchi.kyotaro%40lab.ntt.co.jp- applies clean bothlast and previous messages. Its strange. Thanks for the information. The cause I suppose is that *I* attached the files in *text* MIME type. I taught my mailer application to use "Application/Octet-stream" instead and that should make most (or all) people here happy. > Updated patches builds ok, but i found one failed test in make check-world: contrib/test_decoding/sql/ddl.sql at the endmakes SELECT * FROM pg_replication_slots; which result of course was changed Mmm. Good catch. check-world (contribs) was out of my sight. It is fixed locally. > And i still have no better ideas for naming. I think on something like > if (min_keep_lsn <= restart_lsn) > if (active_pid != 0) > status = "streaming"; > else > status = "keeping"; > else > status = "may_lost"; > This duplicates an existing active field, but I think it's useful as slot status description. > wal_status streaming/keeping/lost/unknown as described in docs patch is also acceptable for me. Maybe anyone else has betteridea? I'll fix this after the discussion. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, At Thu, 11 Jan 2018 13:56:14 +0000, Greg Stark <stark@mit.edu> wrote in <CAM-w4HOVYZkCbCdFt8N8zwAAcuETFimwOB_Db+jgFajn-iYHEQ@mail.gmail.com>
> On 11 January 2018 at 09:55, Sergei Kornilov <sk@zsrv.org> wrote:
> > if (active_pid != 0)
> > status = "streaming";
> > else
> > status = "keeping";
>
> Perhaps "idle" by analogy to a pg_stat_activity entry for a backend
> that's connected but not doing anything.
The state "keeping" means "some segments that are needed by a slot still exist but are to be removed by the next checkpoint". The three states are analogous to green/yellow/red in traffic lights. "idle" doesn't feel right.
> > status = "may_lost";
>
> Perhaps "stale" or "expired"?
Some random thoughts on this topic: reading the field as "the WAL record at restart_lsn is/has been $(status)", "expired" fits there. "safe"/"critical"/("stale" and "expired") would fit "restart_lsn is $(status)"? If we merged the second state into the red side, a boolean column named "wal_preserved" or "wal_available" might work. But I believe the second state is crucial.
> Is this patch in bike-shed territory? Are there any questions about
> whether we want the basic shape to look like this?
FWIW, the summary history of this patch follows.
- added the monitoring feature,
- GUC in bytes, not in segments,
- show "min_keep_lsn" instead of the "spare amount of available WAL (distance)" (*1)
- changed the words used to show the status (still under discussion),
- added documentation.
I didn't adopt a per-slot setting since the keep amount is not measured from the slot's restart_lsn, but from the checkpoint LSN.
*1: As I mentioned upthread, I think that at least "pg_replication_slots.min_keep_lsn" is arguable since it shows the same value for all slots, and I haven't found any other appropriate place for it.
> Fwiw I think there's a real need for this feature so I would like to
> get it in for Postgres 11.
It encourages me a lot. Thanks.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 15, 2018 at 1:05 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> patch -p1 gives some "Stripping trailing CRs from patch" >> >> messages for me, but applied to current HEAD and builds. After >> > >> > Hmm. I wonder why I get that complaint so often. (It's rather >> > common? or caused by the MIME format of my mail?) I'd say with >> > confidence that it is because you retrieved the patch file on >> > Windows mailer. >> I use Debian and web based mailer. Hm, i wget patches from links here https://www.postgresql.org/message-id/flat/20180111.155910.26212237.horiguchi.kyotaro%40lab.ntt.co.jp- applies clean bothlast and previous messages. Its strange. > > Thanks for the information. The cause I suppose is that *I* > attached the files in *text* MIME type. I taught my mailer > application to use "Application/Octet-stream" instead and that > should make most (or all) people here happy. Since the "Stripping trailing CRs from patch" message is totally harmless, I'm not sure why you should need to devote any effort to avoiding it. Anyone who gets it should just ignore it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Since the "Stripping trailing CRs from patch" message is totally > harmless, I'm not sure why you should need to devote any effort to > avoiding it. Anyone who gets it should just ignore it. Not sure, but that might be another situation in which "patch" works and "git apply" doesn't. (Feeling too lazy to test it...) regards, tom lane
I'm digressing... At Mon, 15 Jan 2018 21:45:34 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <26718.1516070734@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Since the "Stripping trailing CRs from patch" message is totally
> > harmless, I'm not sure why you should need to devote any effort to
> > avoiding it. Anyone who gets it should just ignore it.
I know that and totally agree with Robert, but I still wonder why I sometimes receive such complaints (and am annoyed by them), or even an accusation that I sent a patch that doesn't follow the convention, and I was afraid that my setting is not actually common. For this reason I roughly counted the CT/CTEs that people here use for patches in my mailbox and got the following numbers. (Counted on attachments named "*.patch/diff".)
Rank : Freq : CT/CTE
 1: 3308: application/octet-stream:base64
 2: 1642: text/x-patch;charset=us-ascii:base64
 3: 1286: text/x-diff;charset=us-ascii:7bit *
 4:  997: text/x-patch;charset=us-ascii:7bit
 5:  497: text/x-diff;charset=us-ascii:base64
 6:  406: text/x-diff:quoted-printable
 7:  403: text/plain;charset=us-ascii:7bit
 8:  389: text/x-diff:base64
 9:  321: application/x-gzip:base64
10:  281: text/plain;charset=us-ascii:base64
<snip>
Total: attachments=11461 / mails=158121
The most common setting is application/octet-stream:base64, but text/x-patch;charset=us-ascii:7bit is also fairly common. I'm convinced that my original setting is not so problematic, so I reverted it.
> Not sure, but that might be another situation in which "patch"
> works and "git apply" doesn't. (Feeling too lazy to test it...)
I was also afraid of that, as I wrote upthread, but that also seems to be a needless fear.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Jan 11, 2018 at 7:59 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > [new patch set] FYI this is still broken: test ddl ... FAILED You could see that like this: cd contrib/test_decoding make check -- Thomas Munro http://www.enterprisedb.com
Thank you for kindly noticing me of that. At Mon, 29 Jan 2018 11:07:31 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=3nOUqNWyKQ83StGUeCB9LUsTw66w=Sy6H+xKfSbcRu3Q@mail.gmail.com> > On Thu, Jan 11, 2018 at 7:59 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > [new patch set] > > FYI this is still broken: > > test ddl ... FAILED > > You could see that like this: > > cd contrib/test_decoding > make check I guess I might somehow have sent a working version of 0002. While rechecking the patch, I fixed the message issued on losing segments in 0001, revised the TAP test since I found that it was unstable. The attached files are the correct version of the latest patch. Thanks. -- Kyotaro Horiguchi NTT Open Source Software Center From 20e2ab6a6fdccac0381e42a1db56e63024279757 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 114 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 107 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e42b828..55694cb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetMinKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9348,6 +9350,74 @@ CreateRestartPoint(int flags) } /* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetMinKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. 
+ */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. * @@ -9359,34 +9429,38 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetMinKeepSegment(recptr, slotminptr); - XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) + { + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } + if (minSegNo < segno) + segno = minSegNo; + /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) *logSegNo = segno; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 5884fa9..0eb3f46 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2397,6 +2397,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index abffde6..db4ae2b 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -236,6 +236,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d..12cd0d1 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.9.2 From 7097ecf5d817eb6a5d4c2efda137b489a49a830d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 ++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.h | 2 +- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 128 insertions(+), 7 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 1e22c1e..92cd56a 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -702,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 55694cb..cbece1b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9349,6 +9349,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetMinKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5652e9e..cd714cc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index cf2195b..e6e0386 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1..52e64f3 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index f01648c..0712e60 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5345,7 +5345,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null__null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a 
replication slot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use"); DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16""{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 5433944..c5240c9 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 From 730386aab64fc5cbfb1aae57fcb0c5d80453fc1d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/014_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/014_replslot_limit.pl diff --git a/src/test/recovery/t/014_replslot_limit.pl b/src/test/recovery/t/014_replslot_limit.pl new file mode 100644 index 0000000..9e96714 --- /dev/null +++ b/src/test/recovery/t/014_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + 
+my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by ten segments (= 160MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.9.2 From fd3e6660ef344799ff6d0b4eb510afa9ac3ac9a3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 71e20f2..a7c36a0 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9814,6 +9814,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f951ddb..4456d72 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3011,6 +3011,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 46bf198..7bf5cc7 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.9.2
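For anyone trying the patch set above, a minimal monitoring query is sketched below. It assumes the wal_status and min_keep_lsn columns exactly as they are named in this version of the 0002 patch (they are not part of any released PostgreSQL and may be renamed before commit); pg_current_wal_lsn(), pg_wal_lsn_diff() and pg_size_pretty() are existing functions.

-- sketch only: requires the 0001 and 0002 patches applied
SELECT slot_name,
       active,
       restart_lsn,
       wal_status,    -- 'streaming', 'keeping', 'lost' or 'unknown' in this patch version
       min_keep_lsn,  -- oldest LSN the next checkpoint would still retain
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL
ORDER BY restart_lsn;

The last column uses only existing functions, so it can also be used today to see how much WAL a slot is pinning, independently of the patch.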
Hello, At Mon, 29 Jan 2018 19:26:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180129.192634.217484965.horiguchi.kyotaro@lab.ntt.co.jp> > While rechecking the patch, I fixed the message issued on losing > segments in 0001, revised the TAP test since I found that it was > unstable. > > The attached files are the correct version of the latest patch. The name of the new function GetMinKeepSegment seems giving wrong meaning. I renamed it to GetOlestKeepSegment. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 162dec1f6de1449047c4856722a276e1dcc14b63 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 114 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 107 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e42b828..c01d0b3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9348,6 +9350,74 @@ CreateRestartPoint(int flags) } /* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. 
+ */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + +/* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. * @@ -9359,34 +9429,38 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); - XLByteToSeg(keep, slotSegNo, wal_segment_size); + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) + { + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } + if (minSegNo < segno) + segno = minSegNo; + /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) *logSegNo = segno; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 5884fa9..0eb3f46 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2397,6 +2397,17 @@ static struct config_int ConfigureNamesInt[] = }, { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), NULL, diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index abffde6..db4ae2b 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -236,6 +236,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d..12cd0d1 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.9.2 From ec7f5d8bba2bed2787e786a05d347ccd2acf557e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 ++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.h | 2 +- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 128 insertions(+), 7 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 1e22c1e..92cd56a 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -702,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c01d0b3..099d89f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9349,6 +9349,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5652e9e..cd714cc 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -793,7 +793,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index cf2195b..e6e0386 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1..52e64f3 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index f01648c..0712e60 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5345,7 +5345,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 1 0 2278 "19" _null_ _null_ _null__null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); DESCR("drop a 
replication slot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use"); DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 2249 "19 19 16""{19,19,16,25,3220}" "{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 5433944..c5240c9 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.9.2 From 6f852d2c6c9e16b4cac62a4c0e622cdfb21370f6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/014_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/014_replslot_limit.pl diff --git a/src/test/recovery/t/014_replslot_limit.pl b/src/test/recovery/t/014_replslot_limit.pl new file mode 100644 index 0000000..9e96714 --- /dev/null +++ b/src/test/recovery/t/014_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + 
+my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by ten segments (= 160MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.9.2 From a512843d9613c254e6ce11fc38f79605c7450ac3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 71e20f2..a7c36a0 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9814,6 +9814,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f951ddb..4456d72 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3011,6 +3011,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 46bf198..7bf5cc7 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.9.2
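As a rough usage sketch (max_slot_wal_keep_size comes from the 0001 patch above and is not an existing parameter; in this version it is declared with PGC_SIGHUP and GUC_UNIT_MB, with 0 meaning no limit), the limit would be enabled from psql roughly like this:

-- sketch only: the GUC exists only with the 0001 patch applied
ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';  -- 0 (the default) keeps the current unlimited behaviour
SELECT pg_reload_conf();                          -- SIGHUP is enough, no restart needed
SHOW max_slot_wal_keep_size;

The TAP test does the equivalent by appending max_slot_wal_keep_size = 32MB to postgresql.conf and reloading the server.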
At Mon, 29 Jan 2018 19:40:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180129.194023.228030941.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, > > At Mon, 29 Jan 2018 19:26:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20180129.192634.217484965.horiguchi.kyotaro@lab.ntt.co.jp> > > While rechecking the patch, I fixed the message issued on losing > > segments in 0001, revised the TAP test since I found that it was > > unstable. > > > > The attached files are the correct version of the latest patch. > > The name of the new function GetMinKeepSegment seems giving wrong > meaning. I renamed it to GetOlestKeepSegment. I found that fd1a421fe6 and df411e7c66 hit this . Rebased to the current HEAD. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From dbb5ca5bb79e7910f00bff20e8295e2fa3005d2d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 116 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 108 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 47a6c4d895..542e1d78fe 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9344,6 +9346,74 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. 
+ */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9356,33 +9426,37 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 7a7ac479c1..de43c7139b 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2406,6 +2406,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 048bf4cccd..7d5171c32c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -239,6 +239,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.2 From a4b6ae2ec3acfb8de4f702450dcc2960842d8fd5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 ++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.h | 2 +- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 128 insertions(+), 7 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 1e22c1eefc..92cd56a5f0 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -702,7 +702,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 542e1d78fe..529aee9014 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9346,6 +9346,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5e6e8a64f6..9284175f7d 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -795,7 +795,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index e873dd1f81..16575fa411 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..52e64f392d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index 0fdb42f639..e8e32c1a97 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -5385,7 +5385,7 @@ DATA(insert OID = 3779 ( pg_create_physical_replication_slot PGNSP PGUID 12 1 0 DESCR("create a physical replication slot"); DATA(insert OID = 3780 ( pg_drop_replication_slot PGNSP PGUID 12 1 0 0 0 f f f t f v u 1 0 2278 "19" _null_ _null_ _null__null_ _null_ pg_drop_replication_slot _null_ _null_ _null_ )); 
DESCR("drop a replication slot"); -DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220}""{o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); +DATA(insert OID = 3781 ( pg_get_replication_slots PGNSP PGUID 12 1 10 0 0 f f f f t s s 0 0 2249 "" "{19,19,25,26,16,16,23,28,28,3220,3220,25,3220}""{o,o,o,o,o,o,o,o,o,o,o,o,o}" "{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}" _null__null_ pg_get_replication_slots _null_ _null_ _null_ )); DESCR("information about replication slots currently in use"); DATA(insert OID = 3786 ( pg_create_logical_replication_slot PGNSP PGUID 12 1 0 0 0 f f f t f v u 3 0 2249 "19 19 16" "{19,19,16,25,3220}""{i,i,i,o,o}" "{slot_name,plugin,temporary,slot_name,lsn}" _null_ _null_ pg_create_logical_replication_slot_null_ _null_ _null_ )); DESCR("set up a logical replication slot"); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 5e0597e091..3944e36681 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.2 From 079ba54bda2570709c51ff6da195c693cc39d6b7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/015_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/015_replslot_limit.pl diff --git a/src/test/recovery/t/015_replslot_limit.pl b/src/test/recovery/t/015_replslot_limit.pl new file mode 100644 index 0000000000..9e96714d39 --- /dev/null +++ b/src/test/recovery/t/015_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + 
+my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by ten segments (= 160MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.2 From 1908e5c2bbe0e6b2945acd3ff32b0f72e72477e7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 30e6741305..7a6f4540f1 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9802,6 +9802,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f18d2b3353..8715dee1ed 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 46bf198a2a..7bf5cc7f79 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.2
Hello. This is the rebased version of the slot-limit feature. This patch limits the maximum number of WAL segments kept by replication slots. A replication slot is useful to avoid desync with replicas after a temporary disconnection, but it is dangerous when some of the replicas are lost: the WAL space can be exhausted and, in the worst case, the server can PANIC. This patch prevents that worst case, while keeping the benefit of replication slots, by means of a new GUC variable max_slot_wal_keep_size. This is a feature mentioned in the documentation. https://www.postgresql.org/docs/current/static/warm-standby.html#STREAMING-REPLICATION-SLOTS
> In lieu of using replication slots, it is possible to prevent the > removal of old WAL segments using wal_keep_segments, or by > storing the segments in an archive using > archive_command. However, these methods often result in retaining > more WAL segments than required, whereas replication slots retain > only the number of segments known to be needed. An advantage of > these methods is that they bound the space requirement for > pg_wal; there is currently no way to do this using replication > slots.
The previous patch files don't have version numbers, so I let the attached latest version be v2.
v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch The body of the limiting feature
v2-0002-Add-monitoring-aid-for-max_replication_slots.patch Shows the status of WAL retention in the pg_replication_slots view
v2-0003-TAP-test-for-the-slot-limit-feature.patch TAP test for this feature
v2-0004-Documentation-for-slot-limit-feature.patch Documentation, as the name says.
regards. -- Kyotaro Horiguchi NTT Open Source Software Center
From 9fe29d9fef53891a40b81ed255ca9060f8af4ea1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..3ab67f0bdd 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9881,6 +9881,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7bfbc87109..967a73236f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 46bf198a2a..7bf5cc7f79 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3 From a2b59e66217cb2d01e3b8a716010bd0cba7f1c20 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/015_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/015_replslot_limit.pl diff --git a/src/test/recovery/t/015_replslot_limit.pl b/src/test/recovery/t/015_replslot_limit.pl new file mode 100644 index 0000000000..05a1113a67 --- /dev/null +++ b/src/test/recovery/t/015_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 2346e5a2da79646e23c2c683dc04fded73664271 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. 
And the LSN back to where the next checkpoint will secure. --- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 93 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 ++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +-- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 130 insertions(+), 9 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..276f7f6efd 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cf48ed06af..048f55ab77 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9379,6 +9379,99 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8cd8bf40ac..1664a086e9 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..f13aa4d455 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..52e64f392d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 40d54ed030..ef8f0eab91 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index ae0cd253d5..93f6bff77e 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 89417a56b25c19d28838c09b559c346d75fe74c2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 116 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 108 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1a419aa49b..cf48ed06af 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9377,6 +9379,74 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. 
+ */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. + */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9389,33 +9459,37 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 859ef931e7..2a183c0a4c 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2530,6 +2530,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9e39baf466..0e605a1765 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -239,6 +239,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3
Hello. At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180626.162659.223208514.horiguchi.kyotaro@lab.ntt.co.jp> > The previous patche files doesn't have version number so I let > the attached latest version be v2. > > > v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch > The body of the limiting feature > > v2-0002-Add-monitoring-aid-for-max_replication_slots.patch > Shows the status of WAL rataining in pg_replication_slot view > > v2-0003-TAP-test-for-the-slot-limit-feature.patch > TAP test for this feature > > v2-0004-Documentation-for-slot-limit-feature.patch > Documentation, as the name. Travis (test_decoding test) showed that GetOldestXLogFileSegNo added by 0002 forgets to close temporarily opened pg_wal directory. This is the fixed version v3. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 5ef0e221ee29a185743576cbdc93ca79f649d1d3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 116 +++++++++++++++++++++----- src/backend/utils/misc/guc.c | 11 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 108 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0981657801..959d18e029 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9377,6 +9379,74 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr) +{ + uint64 keepSegs; + XLogSegNo currSeg; + XLogSegNo tailSeg; + uint64 slotlimitbytes; + uint64 slotlimitfragment; + uint64 currposoff; + XLogRecPtr slotpos = minSlotPtr; + XLogSegNo slotSeg; + + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); + + XLByteToSeg(currpos, currSeg, wal_segment_size); + XLByteToSeg(slotpos, slotSeg, wal_segment_size); + + /* + * wal_keep_segments keeps more segments than slot, slotpos is no longer + * useful. Don't perform subtraction to keep values positive. 
+ */ + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { + /* avoid underflow, don't go below 1 */ + if (currSeg <= wal_keep_segments) + return 1; + + return currSeg - wal_keep_segments; + } + + /* just return slotSeg if we don't put a limit */ + if (max_slot_wal_keep_size_mb == 0) + return slotSeg; + + /* + * Slot limit is defined and slot gives the oldest segment to keep, + * calculate the oldest segment that should not be removed + */ + slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb; + slotlimitfragment = XLogSegmentOffset(slotlimitbytes, + wal_segment_size); + currposoff = XLogSegmentOffset(currpos, wal_segment_size); + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (currposoff < slotlimitfragment) + keepSegs++; + + /* + * calculate the oldest segment that is kept by wal_keep_segments and + * max_slot_wal_keep_size. + */ + if (currSeg <= keepSegs) + tailSeg = 1; + else + tailSeg = currSeg - keepSegs; + + return tailSeg; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9389,33 +9459,37 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + minSegNo - slotSegNo))); } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index b05fb209bb..25e688db33 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2530,6 +2530,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9e39baf466..0e605a1765 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -239,6 +239,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From cf2044e2c6729d450c8fd9b7a7603254418bb6d5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 95 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 ++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +-- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 132 insertions(+), 9 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..276f7f6efd 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 959d18e029..5c16750e89 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9379,6 +9379,101 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); + + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8cd8bf40ac..1664a086e9 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..f13aa4d455 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..52e64f392d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 40d54ed030..ef8f0eab91 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index ae0cd253d5..93f6bff77e 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 53168c7aa0bcd356d1463b9212020a0f32c6ea36 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/015_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/015_replslot_limit.pl diff --git a/src/test/recovery/t/015_replslot_limit.pl b/src/test/recovery/t/015_replslot_limit.pl new file mode 100644 index 0000000000..05a1113a67 --- /dev/null +++ b/src/test/recovery/t/015_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. 
+my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From fad916da9dcb4293b56961822e95cd52031965b9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi 
<horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 24 ++++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 60 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..3ab67f0bdd 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9881,6 +9881,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed in the next + checkpoint. <literal>lost</literal> means that some of them have been + removed. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5b913f00c1..70b88ed5db 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files kept in + the <filename>pg_wal</filename> directory at checkpoint time, even in + case some of them are still claimed by + <link linkend="streaming-replication-slots">replication + slots</link>. If <varname>max_slot_wal_keep_size</varname> is zero + (the default), replication slots retain unlimited size of WAL + files. + </para> + + <para> + This size is counted apart from + <xref linkend="guc-wal-keep-segments"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 934eb9052d..50ebb23c23 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. 
An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3
On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20180626.162659.223208514.horiguchi.kyotaro@lab.ntt.co.jp> >> The previous patche files doesn't have version number so I let >> the attached latest version be v2. >> >> >> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch >> The body of the limiting feature >> >> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch >> Shows the status of WAL rataining in pg_replication_slot view >> >> v2-0003-TAP-test-for-the-slot-limit-feature.patch >> TAP test for this feature >> >> v2-0004-Documentation-for-slot-limit-feature.patch >> Documentation, as the name. > > Travis (test_decoding test) showed that GetOldestXLogFileSegNo > added by 0002 forgets to close temporarily opened pg_wal > directory. This is the fixed version v3. > Thank you for updating the patch! I looked at v3 patches. Here is review comments. --- + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, Maybe MAX_KILOBYTES would be better instead of INT_MAX like wal_max_size. --- Once the following WARNING emitted this message is emitted whenever we execute CHECKPOINT even if we don't change anything. Is that expected behavior? I think it would be better to emit this message only when we remove WAL segements that are required by slots. WARNING: some replication slots have lost required WAL segments DETAIL: The mostly affected slot has lost 153 segments. --- + Assert(wal_keep_segments >= 0); + Assert(max_slot_wal_keep_size_mb >= 0); These assertions are not meaningful because these parameters are ensured >= 0 by those definition. --- + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { Perhaps XLogRecPtrIsInvalid(slotpos) is better. --- + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) + slotpos = InvalidXLogRecPtr; + + /* slots aren't useful, consider only wal_keep_segments */ + if (slotpos == InvalidXLogRecPtr) + { This logic is hard to read to me. The slotpos can be any of: valid, valid but then become invalid in halfway or invalid from beginning of this function. Can we convert this logic to following? if (XLogRecPtrIsInvalid(slotpos) || currSeg <= slotSeg + wal_keep_segments) --- + keepSegs = wal_keep_segments + + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); Why do we need to keep (wal_keep_segment + max_slot_wal_keep_size) WAL segments? I think what this feature does is, if wal_keep_segments is not useful (that is, slotSeg < (currSeg - wal_keep_segment) then we normally choose slotSeg as lower boundary but max_slot_wal_keep_size restrict the lower boundary so that it doesn't get lower than the threshold. So I thought what this function should do is to calculate min(currSeg - wal_keep_segment, max(currSeg - max_slot_wal_keep_size, slotSeg)), I might be missing something though. --- + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); We can use XLogGetLastRemovedSegno() instead. 
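For what it's worth, the lower-boundary calculation proposed a few comments above, min(currSeg - wal_keep_segments, max(currSeg - max_slot_wal_keep_size, slotSeg)), can be sketched as a tiny standalone program. The names and numbers below are invented for illustration; this is not the patch code.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/*
 * min(currSeg - wal_keep_segments, max(currSeg - limitSegs, slotSeg)),
 * clamped so it never goes below segment 1.  Illustration only.
 */
static XLogSegNo
oldest_keep_segment(XLogSegNo currSeg, XLogSegNo slotSeg,
					uint64_t wal_keep_segments, uint64_t limitSegs)
{
	XLogSegNo	by_limit = currSeg > limitSegs ? currSeg - limitSegs : 1;
	XLogSegNo	by_keep = currSeg > wal_keep_segments ?
		currSeg - wal_keep_segments : 1;
	XLogSegNo	lower;

	/* the slot may not hold WAL back further than the size limit ... */
	lower = slotSeg > by_limit ? slotSeg : by_limit;
	/* ... but wal_keep_segments is honoured independently of slots */
	if (by_keep < lower)
		lower = by_keep;
	return lower > 0 ? lower : 1;
}

int
main(void)
{
	/*
	 * Current write position in segment 100, slot stuck at segment 60,
	 * wal_keep_segments = 5, max_slot_wal_keep_size worth 32 segments:
	 * the slot asks for 60, but the limit caps it at 100 - 32 = 68.
	 */
	printf("keep WAL back to segment %llu\n",
		   (unsigned long long) oldest_keep_segment(100, 60, 5, 32));
	return 0;
}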
--- + xldir = AllocateDir(XLOGDIR); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open write-ahead log directory \"%s\": %m", + XLOGDIR))); Looking at other code allocating a directory we don't check xldir == NULL because it will be detected by ReadDir() function and raise an error in that function. So maybe we don't need to check it just after allocation. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
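GetOldestXLogFileSegNo(), which the last two comments are about, boils down to scanning pg_wal and taking the smallest segment number encoded in the file names. Outside the backend that logic looks roughly like the following POSIX sketch; it is illustration only, assumes the default 16MB segment size, and skips .partial files. The backend version relies on AllocateDir()/ReadDir() reporting open failures itself, which is the point of the last comment above.

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <stdint.h>
#include <inttypes.h>

int
main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : "pg_wal";
	DIR		   *d = opendir(dir);
	struct dirent *de;
	uint64_t	oldest = 0;

	if (d == NULL)	/* ReadDir() performs the equivalent check inside the backend */
	{
		perror(dir);
		return 1;
	}

	while ((de = readdir(d)) != NULL)
	{
		unsigned int tli, log, seg;
		uint64_t	segno;

		/* WAL segment names are 24 hex digits: TTTTTTTTXXXXXXXXYYYYYYYY */
		if (strlen(de->d_name) != 24 ||
			sscanf(de->d_name, "%8x%8x%8x", &tli, &log, &seg) != 3)
			continue;

		segno = (uint64_t) log * 256 + seg;	/* 256 segments per xlogid at 16MB */

		/* take the minimum, ignoring the timeline part */
		if (oldest == 0 || segno < oldest)
			oldest = segno;
	}
	closedir(d);

	printf("oldest segment number: %" PRIu64 "\n", oldest);
	return 0;
}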
Hello. Sawada-san. Thank you for the comments. At Thu, 5 Jul 2018 15:43:56 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDiiA4qHj0thqw80Bt=vefSQ9yGpZnr0kuLTXszbrV9iQ@mail.gmail.com> > On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > > > At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20180626.162659.223208514.horiguchi.kyotaro@lab.ntt.co.jp> > >> The previous patche files doesn't have version number so I let > >> the attached latest version be v2. > >> > >> > >> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch > >> The body of the limiting feature > >> > >> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch > >> Shows the status of WAL rataining in pg_replication_slot view > >> > >> v2-0003-TAP-test-for-the-slot-limit-feature.patch > >> TAP test for this feature > >> > >> v2-0004-Documentation-for-slot-limit-feature.patch > >> Documentation, as the name. > > > > Travis (test_decoding test) showed that GetOldestXLogFileSegNo > > added by 0002 forgets to close temporarily opened pg_wal > > directory. This is the fixed version v3. > > > > Thank you for updating the patch! I looked at v3 patches. Here is > review comments. > > --- > + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, > + gettext_noop("Sets the maximum size of extra > WALs kept by replication slots."), > + NULL, > + GUC_UNIT_MB > + }, > + &max_slot_wal_keep_size_mb, > + 0, 0, INT_MAX, > + NULL, NULL, NULL > + }, > > Maybe MAX_KILOBYTES would be better instead of INT_MAX like wal_max_size. The MAX_KILOBYTES is maximum value of size in kB, which fits long or Size/size_t variables after convreted into bytes. Applying the limit there means that we assume that the _mb variable can be converted not into bytes but kB. So applying it to max/min_wal_size seems somewhat wrong but doesn't harm since they are not acutually converted into bytes. max_slot_wal_keep_size is not converted into bytes so capping with INT_MAX is no problem. However it doesn't need to be larger than MAX_KILOBYTES, I follow that in order to make it same with max/min_wal_size. > --- > Once the following WARNING emitted this message is emitted whenever we > execute CHECKPOINT even if we don't change anything. Is that expected > behavior? I think it would be better to emit this message only when we > remove WAL segements that are required by slots. > > WARNING: some replication slots have lost required WAL segments > DETAIL: The mostly affected slot has lost 153 segments. I didn't consider the situation the number of lost segments doesn't change. Changed to mute the message when the number of lost segments is not changed. > --- > + Assert(wal_keep_segments >= 0); > + Assert(max_slot_wal_keep_size_mb >= 0); > > These assertions are not meaningful because these parameters are > ensured >= 0 by those definition. Yeah, that looks a bit being paranoid. Removed. > --- > + /* slots aren't useful, consider only wal_keep_segments */ > + if (slotpos == InvalidXLogRecPtr) > + { > > Perhaps XLogRecPtrIsInvalid(slotpos) is better. Agreed. It is changed to "slotpos != InvalidXLogRecPtr" after changing the function by the comments below. I think that the double negation !XLogRecPtrInvalid() is not fine. 
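The de-duplication described above, warning only when the number of lost segments changes, is small enough to show in isolation. A minimal sketch with invented names and plain stderr in place of ereport():

#include <stdio.h>

/*
 * Emit the checkpoint warning only when the reported count changes,
 * mirroring the prev_lost_segs logic added to KeepLogSeg().  Sketch only.
 */
static void
report_lost_segments(unsigned long lost_segs)
{
	static unsigned long prev_lost_segs = 0;

	if (lost_segs > 0 && lost_segs != prev_lost_segs)
		fprintf(stderr,
				"WARNING:  some replication slots have lost required WAL segments\n"
				"DETAIL:  The mostly affected slot has lost %lu segments.\n",
				lost_segs);
	prev_lost_segs = lost_segs;
}

int
main(void)
{
	report_lost_segments(153);	/* warns */
	report_lost_segments(153);	/* muted: same count as the last checkpoint */
	report_lost_segments(160);	/* warns again: the count moved */
	report_lost_segments(0);	/* nothing lost any more, resets the memory */
	return 0;
}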
> --- > + if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments) > + slotpos = InvalidXLogRecPtr; > + > + /* slots aren't useful, consider only wal_keep_segments */ > + if (slotpos == InvalidXLogRecPtr) > + { > > This logic is hard to read to me. The slotpos can be any of: valid, > valid but then become invalid in halfway or invalid from beginning of > this function. Can we convert this logic to following? > > if (XLogRecPtrIsInvalid(slotpos) || > currSeg <= slotSeg + wal_keep_segments) Right. But it is removed. > --- > + keepSegs = wal_keep_segments + > + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); > > Why do we need to keep (wal_keep_segment + max_slot_wal_keep_size) WAL > segments? I think what this feature does is, if wal_keep_segments is > not useful (that is, slotSeg < (currSeg - wal_keep_segment) then we > normally choose slotSeg as lower boundary but max_slot_wal_keep_size > restrict the lower boundary so that it doesn't get lower than the > threshold. So I thought what this function should do is to calculate > min(currSeg - wal_keep_segment, max(currSeg - max_slot_wal_keep_size, > slotSeg)), I might be missing something though. You're right that wal_keep_segments should not been added, but should give lower limit of segments to keep as the current KeepLogSeg() does. Fixed that. Since the amount is specified in mega bytes, silently rounding down to segment bounds may not be proper in general and this feature used to use the fragments to show users something. But there's no loner a place where the fragments are perceptible to users and anyway the fragments are way smaller than the expected total WAL size. As the result, I removed the fragment calculation at all as you suggested. It gets way smaller and simpler. > --- > + SpinLockAcquire(&XLogCtl->info_lck); > + oldestSeg = XLogCtl->lastRemovedSegNo; > + SpinLockRelease(&XLogCtl->info_lck); > > We can use XLogGetLastRemovedSegno() instead. It is because I thought that it is for external usage, spcifically by slot.c since CheckXLogRemoved() is reading it directly. I leave it alone and they would have to be fixed at once if we decide to use it internally. > --- > + xldir = AllocateDir(XLOGDIR); > + if (xldir == NULL) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not open write-ahead log directory \"%s\": %m", > + XLOGDIR))); > > Looking at other code allocating a directory we don't check xldir == > NULL because it will be detected by ReadDir() function and raise an > error in that function. So maybe we don't need to check it just after > allocation. Thanks. I found that in the comment of ReadDir(). This doesn't need a special error handling so I leave it to ReadDir there. Addition to that, documentation is fixed. Attached is the v4 files. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 7e80eb619a82d9fd0a4ed9c91186fe4c21016622 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. 
--- src/backend/access/transam/xlog.c | 100 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 ++++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 94 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 44017d33e4..c27a589e89 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9424,6 +9426,51 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo slotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, slotSeg, wal_segment_size); + + /* + * Calcualte keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && slotSeg <= currSeg) + keepSegs = currSeg - slotSeg; + + /* + * slot keep segments is limited by max_slot_wal_keep_size, fragment of a + * segment is ignored + */ + if (max_slot_wal_keep_size_mb > 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep larger than wal_segment_size if any*/ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9436,33 +9483,46 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; + prev_lost_segs = 0; } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 17292e04fe..01b8c8edec 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2530,6 +2530,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9e39baf466..0e605a1765 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -239,6 +239,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From 4e3a0ca744620c129c512b47e2656e697762c42f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_replication_slots. Adds two columns "status" and "min_secure_lsn" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows that a slot can be reconnected or not, or about to lose required WAL segments. And the LSN back to where the next checkpoint will secure. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 89 ++++++++++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 25 +++++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +-- src/test/regress/expected/rules.out | 6 ++- 7 files changed, 126 insertions(+), 9 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..276f7f6efd 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | min_keep_lsn +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c27a589e89..577be4ecd9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9426,6 +9426,95 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignorig timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given restartLSN is present in XLOG files. + * + * Returns true if it is present. If minKeepLSN is given, it receives the + * LSN at the beginning of the oldest existing WAL segment. + */ +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ + XLogRecPtr currpos; + XLogSegNo restartSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(restartLSN)); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. 
+ */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } + + return oldestSeg <= restartSeg; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8cd8bf40ac..1664a086e9 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_keep_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..f13aa4d455 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,29 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + XLogRecPtr min_keep_lsn; + char *status = "lost"; + + if (BoolGetDatum(IsLsnStillAvaiable(restart_lsn, + &min_keep_lsn))) + { + if (min_keep_lsn <= restart_lsn) + status = "streaming"; + else + status = "keeping"; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(min_keep_lsn); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..52e64f392d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern bool IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minSecureLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 40d54ed030..ef8f0eab91 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_keep_lsn}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index ae0cd253d5..93f6bff77e 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_keep_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_keep_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 6a2bd69eba11b5aa611c3194711e2039744a624c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..05a1113a67 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. 
+my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_keep_lsn FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|$start_lsn", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 04687424be04ca31de4ddd8707d159cb25a7e380 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi 
<horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 29 +++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 22 ++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 59 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..3d5bc666b9 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9881,6 +9881,35 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them have been removed. The + last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5b913f00c1..33a623729f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,28 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to reatin in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> + <para> + This parameter is used being rounded down to the multiples of WAL file + size. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 934eb9052d..50ebb23c23 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. 
An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3
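The classification behind wal_status in the attached 0002 compares the slot's restart segment against two boundaries: the oldest segment the next checkpoint will keep, and the oldest segment still physically present. Reduced to a sketch with invented names (not the patch code itself):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/*
 * targetSeg: segment containing the slot's restart_lsn
 * tailSeg:   oldest segment the next checkpoint will keep
 * oldestSeg: oldest segment still present in pg_wal
 */
static const char *
wal_status(XLogSegNo targetSeg, XLogSegNo tailSeg, XLogSegNo oldestSeg)
{
	if (tailSeg <= targetSeg)
		return "streaming";	/* still reserved for the slot */
	if (oldestSeg <= targetSeg)
		return "keeping";	/* on disk, but the next checkpoint may remove it */
	return "lost";			/* already removed */
}

int
main(void)
{
	/* slot needs segment 40, checkpoint keeps from 38, oldest on disk is 30 */
	printf("%s\n", wal_status(40, 38, 30));	/* streaming */
	/* the limit moved the keep boundary to 45, but segment 40 still exists */
	printf("%s\n", wal_status(40, 45, 38));	/* keeping */
	/* segment 40 has already been recycled: oldest remaining is 48 */
	printf("%s\n", wal_status(40, 45, 48));	/* lost */
	return 0;
}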
On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. Sawada-san. > > Thank you for the comments. > Thank you for updating the patch! > At Thu, 5 Jul 2018 15:43:56 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDiiA4qHj0thqw80Bt=vefSQ9yGpZnr0kuLTXszbrV9iQ@mail.gmail.com> >> On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI >> --- >> + SpinLockAcquire(&XLogCtl->info_lck); >> + oldestSeg = XLogCtl->lastRemovedSegNo; >> + SpinLockRelease(&XLogCtl->info_lck); >> >> We can use XLogGetLastRemovedSegno() instead. > > It is because I thought that it is for external usage, > spcifically by slot.c since CheckXLogRemoved() is reading it > directly. I leave it alone and they would have to be fixed at > once if we decide to use it internally. Agreed. I noticed that after commented. Here is review comments of v4 patches. + if (minKeepLSN) + { + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); + Assert(!XLogRecPtrIsInvalid(slotPtr)); + + tailSeg = GetOldestKeepSegment(currpos, slotPtr); + + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size); + } The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the destination at 4th argument the wal_segment_size will be changed in the above expression. The regression tests by PostgreSQL Patch Tester seem passed but I got the following assertion failure in recovery/t/010_logical_decoding_timelines.pl TRAP: FailedAssertion("!(XLogRecPtrToBytePos(*StartPos) == startbytepos)", File: "xlog.c", Line: 1277) ---- + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + + + if (minKeepLSN) There is an extra empty line. ---- + /* but, keep larger than wal_segment_size if any*/ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; You meant wal_keep_segments in the above comment rather than wal_segment_size? Also, the above comment need a whitespace just after "any". ---- +bool +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) +{ I think restartLSN is a word used for replication slots. Since this function is defined in xlog.c it would be better to change the argument name to more generic name, for example recptr. ---- + /* + * Calcualte keep segments by slots first. The second term of the + * condition is just a sanity check. + */ s/calcualte/calculate/ ---- + /* get minimum segment ignorig timeline ID */ s/ignorig/ignoring/ ---- min_keep_lsn in pg_replication_slots currently shows the same value in every slots but I felt that the value seems not easy to understand intuitively for users because users will have to confirm that value and to compare the current LSN in order to check if replication slots will become the "lost" status. So how about showing values that indicate how far away from the point where we become "lost" for individual slots? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
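The last suggestion, showing how far each slot is from becoming "lost", is essentially byte arithmetic over max_slot_wal_keep_size. A rough worked sketch of that distance, assuming 16MB segments and invented numbers (the patch may of course count it somewhat differently):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t segsz = 16 * 1024 * 1024;	/* wal_segment_size */
	const uint64_t limit = 64 * 1024 * 1024;	/* max_slot_wal_keep_size */
	uint64_t	restart_lsn = 5 * segsz + 1000;	/* slot still needs WAL from here */
	uint64_t	current_lsn = 7 * segsz + 4096;	/* current write position */

	/* segment-granular, since the checkpointer removes whole segments */
	uint64_t	kept_segs = current_lsn / segsz - restart_lsn / segsz;	/* 2 */
	uint64_t	limit_segs = limit / segsz;								/* 4 */

	/* whole segments of headroom plus the rest of the current segment */
	uint64_t	remain = (limit_segs - kept_segs) * segsz
		+ (segsz - current_lsn % segsz);

	printf("roughly %llu more bytes of WAL before the slot loses segments\n",
		   (unsigned long long) remain);
	return 0;
}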
Hello. At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCFtW6+SN_eVTszDAjQeeU2sSea2VpCEx08ejNafk8H9w@mail.gmail.com> > On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: .. > Here is review comments of v4 patches. > > + if (minKeepLSN) > + { > + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); > + Assert(!XLogRecPtrIsInvalid(slotPtr)); > + > + tailSeg = GetOldestKeepSegment(currpos, slotPtr); > + > + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, > wal_segment_size); > + } > > The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the > destination at 4th argument the wal_segment_size will be changed in > the above expression. The regression tests by PostgreSQL Patch Tester I'm not sure I get this correctly, the definition of the macro is as follows. | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \ | (dest) = (segno) * (wal_segsz_bytes) + (offset) The destination is the *3rd* parameter and the forth is segment size which is not to be written. > seem passed but I got the following assertion failure in > recovery/t/010_logical_decoding_timelines.pl > > TRAP: FailedAssertion("!(XLogRecPtrToBytePos(*StartPos) == > startbytepos)", File: "xlog.c", Line: 1277) Hmm. I don't see a relation with this patch, but how did you cause the failure? The failure means inconsistency between existing XLogBytePosToRecPtr and XLogRecPtrToBytePos, which doesn't seem to happen without modifying the two functions. > ---- > + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); > + > + > + if (minKeepLSN) > There is an extra empty line. > > ---- > + /* but, keep larger than wal_segment_size if any*/ > + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) > + keepSegs = wal_keep_segments; > > You meant wal_keep_segments in the above comment rather than > wal_segment_size? Also, the above comment need a whitespace just after > "any". Ouch! You're right. Fixed. > ---- > +bool > +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN) > +{ > > I think restartLSN is a word used for replication slots. Since this > function is defined in xlog.c it would be better to change the > argument name to more generic name, for example recptr. Agreed. I used "target" instead. > ---- > + /* > + * Calcualte keep segments by slots first. The second term of the > + * condition is just a sanity check. > + */ > > s/calcualte/calculate/ Fixed. > ---- > + /* get minimum segment ignorig timeline ID */ > > s/ignorig/ignoring/ Fixed. # One of my fingers is literally fatter with bandaid than usual.. > ---- > min_keep_lsn in pg_replication_slots currently shows the same value in > every slots but I felt that the value seems not easy to understand > intuitively for users because users will have to confirm that value > and to compare the current LSN in order to check if replication slots > will become the "lost" status. So how about showing values that > indicate how far away from the point where we become "lost" for > individual slots? Yeah, that is what I did in the first cut of this patch from the same thought. pg_replication_slots have two additional columns "live" and "distance". https://www.postgresql.org/message-id/20171031.184310.182012625.horiguchi.kyotaro@lab.ntt.co.jp Thre current design is changed following a comment. 
https://www.postgresql.org/message-id/20171108.131431.170534842.horiguchi.kyotaro%40lab.ntt.co.jp > > I don't think 'distance' is a good metric - that's going to continually > > change. Why not store the LSN that's available and provide a function > > that computes this? Or just rely on the lsn - lsn operator? > > It seems reasonable.,The 'secured minimum LSN' is common among > all slots so showing it in the view may look a bit stupid but I > don't find another suitable place for it. distance = 0 meant the > state that the slot is living but insecured in the previous patch > and that information is lost by changing 'distance' to > 'min_secure_lsn'. As I reconsidered this, I noticed that "lsn - lsn" doesn't make sense here. The correct formula for the value is "max_slot_wal_keep_size * 1024 * 1024 - ((oldest LSN to keep) - restart_lsn). It is not a simple formula to write by hand but doesn't seem general enough. I re-changed my mind to show the "distance" there again. pg_replication_slots now has the column "remain" instaed of "min_keep_lsn", which shows an LSN when wal_status is "streaming" and otherwise "0/0". In a special case, "remain" can be "0/0" while "wal_status" is "streaming". It is the reason for the tristate return value of IsLsnStillAvaialbe(). wal_status | remain streaming | 0/19E3C0 -- WAL is reserved streaming | 0/0 -- Still reserved but on the boundary keeping | 0/0 -- About to be lost. lost | 0/0 -- Lost. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From c0cd29e0bb568834cc8889d69d3e6081236c5784 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 100 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 ++++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 94 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4049deb968..df6b5e89e6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9462,6 +9464,51 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. 
+ */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo slotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, slotSeg, wal_segment_size); + + /* + * Calcualte keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && slotSeg <= currSeg) + keepSegs = currSeg - slotSeg; + + /* + * slot keep segments is limited by max_slot_wal_keep_size, fragment of a + * segment is ignored + */ + if (max_slot_wal_keep_size_mb > 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep larger than wal_segment_size if any*/ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9474,33 +9521,46 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; + prev_lost_segs = 0; } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 17292e04fe..01b8c8edec 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2530,6 +2530,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 657c3f81f8..23af9ea274 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From 32b4fa4556dda77bb6b6563692d19f0b9556d85e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaing bytes of WAL that can be written until the slot loses reserving WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 135 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 +++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 172 insertions(+), 16 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index df6b5e89e6..9bf648dc17 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9464,12 +9464,110 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is alredy removed. + * 1 means that WAL record at tagetLSN is availble. + * 2 means that WAL record at tagetLSN is availble but about to be removed by + * the next checkpoint. 
+ */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any + * slot loses reserving a WAL record. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; @@ -9479,26 +9577,49 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) XLByteToSeg(minSlotLSN, slotSeg, wal_segment_size); /* - * Calcualte keep segments by slots first. The second term of the + * Calculate keep segments by slots first. The second term of the * condition is just a sanity check. */ if (minSlotLSN != InvalidXLogRecPtr && slotSeg <= currSeg) keepSegs = currSeg - slotSeg; + if (restBytes) + *restBytes = 0; + /* - * slot keep segments is limited by max_slot_wal_keep_size, fragment of a - * segment is ignored + * Calculate number of segments to keep ignoring segment fragment. If + * requested, return remaining LSN bytes to advance until the slot gives + * up to reserve WAL records. */ if (max_slot_wal_keep_size_mb > 0) { uint64 limitSegs; limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (limitSegs < keepSegs) + { + /* This slot gave up to retain reserved WAL records. */ keepSegs = limitSegs; + } + else if (restBytes) + { + /* calculate return rest bytes until this slot loses WAL */ + uint64 fragbytes; + + /* If wal_keep_segments may be larger than slot limit. However + * it's a rather useless configuration, we should consider the + * case anyway. + */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + *restBytes = (limitSegs - keepSegs) * wal_segment_size + fragbytes; + } } - /* but, keep larger than wal_segment_size if any*/ + /* but, keep at least wal_keep_segments segments if any */ if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) keepSegs = wal_keep_segments; @@ -9533,7 +9654,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checktpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = GetOldestKeepSegment(recptr, slotminptr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..d28896dc58 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 450f73759f..bf7fbb7833 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = LSNGetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a14651010f..18acf1f8ef 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff 
--git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index ae0cd253d5..fe7a675e1e 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From c12b68ee828ade7ed587e74d9f354f08ba39828d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..401e3b1bd0 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 98fbdc59a4e8079de0ea1ca6bb2e09bf0ddfdcc9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 29 +++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 22 ++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 59 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml 
b/doc/src/sgml/catalogs.sgml index 4851bc2e24..ce1eeb68bb 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9882,6 +9882,35 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them have been removed. The + last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e307bb4e8e..1db0736dc5 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,28 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to reatin in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> + <para> + This parameter is used being rounded down to the multiples of WAL file + size. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 934eb9052d..50ebb23c23 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3
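For illustration, assuming the patch series above is applied, a minimal health-check query over the proposed columns might look like this (a sketch only; wal_status, remain and the state names are the ones introduced by the patch):

-- Sketch: list slots whose reserved WAL is at risk, per the proposed wal_status column.
SELECT slot_name, restart_lsn, wal_status, remain
  FROM pg_replication_slots
 WHERE wal_status IN ('keeping', 'lost');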
On Fri, Jul 13, 2018 at 5:40 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCFtW6+SN_eVTszDAjQeeU2sSea2VpCEx08ejNafk8H9w@mail.gmail.com> >> On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > .. >> Here is review comments of v4 patches. >> >> + if (minKeepLSN) >> + { >> + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); >> + Assert(!XLogRecPtrIsInvalid(slotPtr)); >> + >> + tailSeg = GetOldestKeepSegment(currpos, slotPtr); >> + >> + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, >> wal_segment_size); >> + } >> >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the >> destination at 4th argument the wal_segment_size will be changed in >> the above expression. The regression tests by PostgreSQL Patch Tester > > I'm not sure I get this correctly, the definition of the macro is > as follows. > > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \ > | (dest) = (segno) * (wal_segsz_bytes) + (offset) > > The destination is the *3rd* parameter and the forth is segment > size which is not to be written. Please see commit a22445ff0b which flipped input and output arguments. So maybe you need to rebase the patches to current HEAD. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hello. At Tue, 17 Jul 2018 13:37:59 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCAdDfXNwVhoAKhBtpmrY-0tfQoQh2NiTX_Ji15msNPew@mail.gmail.com> > >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the > >> destination at 4th argument the wal_segment_size will be changed in > >> the above expression. The regression tests by PostgreSQL Patch Tester > > > > I'm not sure I get this correctly, the definition of the macro is > > as follows. > > > > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \ > > | (dest) = (segno) * (wal_segsz_bytes) + (offset) > > > > The destination is the *3rd* parameter and the forth is segment > > size which is not to be written. > > Please see commit a22445ff0b which flipped input and output arguments. > So maybe you need to rebase the patches to current HEAD. Mmm. Thanks. I didn't realize such a change had happened; it was accidentally dropped in the latest patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jul 13, 2018 at 5:40 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCFtW6+SN_eVTszDAjQeeU2sSea2VpCEx08ejNafk8H9w@mail.gmail.com> >> On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI >> ---- >> min_keep_lsn in pg_replication_slots currently shows the same value in >> every slots but I felt that the value seems not easy to understand >> intuitively for users because users will have to confirm that value >> and to compare the current LSN in order to check if replication slots >> will become the "lost" status. So how about showing values that >> indicate how far away from the point where we become "lost" for >> individual slots? > > Yeah, that is what I did in the first cut of this patch from the > same thought. pg_replication_slots have two additional columns > "live" and "distance". > > https://www.postgresql.org/message-id/20171031.184310.182012625.horiguchi.kyotaro@lab.ntt.co.jp > > Thre current design is changed following a comment. > > https://www.postgresql.org/message-id/20171108.131431.170534842.horiguchi.kyotaro%40lab.ntt.co.jp > >> > I don't think 'distance' is a good metric - that's going to continually >> > change. Why not store the LSN that's available and provide a function >> > that computes this? Or just rely on the lsn - lsn operator? >> >> It seems reasonable.,The 'secured minimum LSN' is common among >> all slots so showing it in the view may look a bit stupid but I >> don't find another suitable place for it. distance = 0 meant the >> state that the slot is living but insecured in the previous patch >> and that information is lost by changing 'distance' to >> 'min_secure_lsn'. > > As I reconsidered this, I noticed that "lsn - lsn" doesn't make > sense here. The correct formula for the value is > "max_slot_wal_keep_size * 1024 * 1024 - ((oldest LSN to keep) - > restart_lsn). It is not a simple formula to write by hand but > doesn't seem general enough. I re-changed my mind to show the > "distance" there again. > > pg_replication_slots now has the column "remain" instaed of > "min_keep_lsn", which shows an LSN when wal_status is "streaming" > and otherwise "0/0". In a special case, "remain" can be "0/0" > while "wal_status" is "streaming". It is the reason for the > tristate return value of IsLsnStillAvaialbe(). > > wal_status | remain > streaming | 0/19E3C0 -- WAL is reserved > streaming | 0/0 -- Still reserved but on the boundary > keeping | 0/0 -- About to be lost. > lost | 0/0 -- Lost. > The "remain" column still shows same value at all rows as follows because you always compare between the current LSN and the minimum LSN of replication slots. Is that you expected? My comment was to show the distance from the restart_lsn of individual slots to the critical point where it will lost WAL. That way, we can easily find out which slots is about to get lost. postgres(1:126712)=# select pg_current_wal_lsn(), slot_name, restart_lsn, remain from pg_replication_slots ; pg_current_wal_lsn | slot_name | restart_lsn | remain --------------------+-----------+-------------+------------ 0/4000108 | 5 | 0/1645CA0 | 0/3DFFFEF8 0/4000108 | 4 | 0/40000D0 | 0/3DFFFEF8 (2 rows) Also, I'm not sure it's a good way to show the distance as LSN. LSN is a monotone increasing value but in your patch, a value of the "remain" column can get decreased. As an alternative way I'd suggest to show it as the number of segments. 
Attached is a patch on top of your v5 patch that changes this so that the column shows how many WAL segments each slot has left before it loses WAL. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
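As a point of reference, the per-slot distance being discussed here can already be eyeballed with the existing lsn - lsn operator (a rough sketch; it only shows how much WAL each slot currently pins and knows nothing about max_slot_wal_keep_size or wal_keep_segments, which is exactly what the proposed column is meant to account for):

-- Rough per-slot view of pinned WAL using only existing facilities.
SELECT slot_name,
       restart_lsn,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS pinned_wal
  FROM pg_replication_slots;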
On Fri, Jul 20, 2018 at 10:13:58AM +0900, Masahiko Sawada wrote: > Also, I'm not sure it's a good way to show the distance as LSN. LSN is > a monotone increasing value but in your patch, a value of the "remain" > column can get decreased. If that can happen, I think that this is a very, very bad idea. A couple of code paths, including segment recycling and the new WAL advancing, rely on such monotonic properties. That would also be very confusing for any monitoring job looking at pg_replication_slots. -- Michael
Hello. At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> > > As I reconsidered this, I noticed that "lsn - lsn" doesn't make > > sense here. The correct formula for the value is > > "max_slot_wal_keep_size * 1024 * 1024 - ((oldest LSN to keep) - > > restart_lsn). It is not a simple formula to write by hand but > > doesn't seem general enough. I re-changed my mind to show the > > "distance" there again. > > > > pg_replication_slots now has the column "remain" instaed of > > "min_keep_lsn", which shows an LSN when wal_status is "streaming" > > and otherwise "0/0". In a special case, "remain" can be "0/0" > > while "wal_status" is "streaming". It is the reason for the > > tristate return value of IsLsnStillAvaialbe(). > > > > wal_status | remain > > streaming | 0/19E3C0 -- WAL is reserved > > streaming | 0/0 -- Still reserved but on the boundary > > keeping | 0/0 -- About to be lost. > > lost | 0/0 -- Lost. > > > > The "remain" column still shows same value at all rows as follows > because you always compare between the current LSN and the minimum LSN > of replication slots. Is that you expected? My comment was to show the Ouch! Sorry for the silly mistake. GetOldestKeepSegment should calculate restBytes based on the distance from the cutoff LSN to restart_lsn, not to minSlotLSN. The attached fixed v6 correctly shows the distance individually. > Also, I'm not sure it's a good way to show the distance as LSN. LSN is > a monotone increasing value but in your patch, a value of the "remain" > column can get decreased. As an alternative way I'd suggest to show it The LSN of WAL won't be decreased but an LSN is just a position in a WAL stream. Since the representation of LSN is composed of the two components 'file number' and 'offset', it's quite natural to show the difference in the same unit. The distance between the points at "6m" and "10m" is "4m". > as the number of segments. Attached patch is a patch for your v5 patch > that changes it so that the column shows how many WAL segments of > individual slots are remained until they get lost WAL. Segment size varies by configuration, so segment number is not intuitive to show distance. I think it is the most significant reason we move to "bytes" from "segments" about WAL sizings like max_wal_size. More than anything, it's too coarse. The required segments may be lasts for the time to consume a whole segment or may be removed just after. We could calculate the fragment bytes but it requires some internal knowledge. Instead, I made the field be shown in flat "bytes" using bigint, which can be nicely shown using pg_size_pretty; =# select pg_current_wal_lsn(), restart_lsn, wal_status, pg_size_pretty(remain) as remain from pg_replication_slots ; pg_current_wal_lsn | restart_lsn | wal_status | remain --------------------+-------------+------------+-------- 0/DD3B188 | 0/CADD618 | streaming | 19 MB 0/DD3B188 | 0/DD3B188 | streaming | 35 MB regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 8cc6fe3106f58ae8cfe3ad8d4b25b5774ac6ec05 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL releaf vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. 
--- src/backend/access/transam/xlog.c | 100 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 ++++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 94 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 335b4a573d..4bf1536d8f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9491,6 +9493,51 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo slotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, slotSeg, wal_segment_size); + + /* + * Calcualte keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && slotSeg <= currSeg) + keepSegs = currSeg - slotSeg; + + /* + * slot keep segments is limited by max_slot_wal_keep_size, fragment of a + * segment is ignored + */ + if (max_slot_wal_keep_size_mb > 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep larger than wal_segment_size if any*/ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9503,33 +9550,46 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); + + /* + * We should keep certain number of WAL segments after this checktpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; + prev_lost_segs = 0; } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; - } + if (minSegNo < segno) + segno = minSegNo; /* don't delete WAL segments newer than the calculated segment */ if (segno < *logSegNo) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index a88ea6cfc9..63e6d8d9b1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2537,6 +2537,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c0d3fb8491..cb5b2bcc89 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From d41fb5b4e457c787b3763aca92b8932be550b48e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaing bytes of WAL that can be written until the slot loses reserving WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 152 +++++++++++++++++++++++++++++---- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 ++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 181 insertions(+), 24 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4bf1536d8f..1b9cc619f1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, XLogRecPtr restartLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9493,44 +9493,165 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is alredy removed. + * 1 means that WAL record at tagetLSN is availble. + * 2 means that WAL record at tagetLSN is availble but about to be removed by + * the next checkpoint. 
+ */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checktpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any + * slot loses reserving a WAL record. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, XLogRecPtr restartLSN, uint64 *restBytes) { uint64 keepSegs = 0; + uint64 limitSegs = 0; XLogSegNo currSeg; - XLogSegNo slotSeg; + XLogSegNo minSlotSeg; XLByteToSeg(currLSN, currSeg, wal_segment_size); - XLByteToSeg(minSlotLSN, slotSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); /* - * Calcualte keep segments by slots first. The second term of the + * Calculate keep segments by slots first. The second term of the * condition is just a sanity check. */ - if (minSlotLSN != InvalidXLogRecPtr && slotSeg <= currSeg) - keepSegs = currSeg - slotSeg; + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; - /* - * slot keep segments is limited by max_slot_wal_keep_size, fragment of a - * segment is ignored - */ + /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb > 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keeping segments */ if (limitSegs < keepSegs) keepSegs = limitSegs; } - /* but, keep larger than wal_segment_size if any*/ + /* but, keep at least wal_keep_segments segments if any */ if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) keepSegs = wal_keep_segments; + /* + * Return remaining LSN bytes to advance until the slot gives up reserving + * WAL records if requested. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo restartSeg; + + *restBytes = 0; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + if (limitSegs > 0 && currSeg <= restartSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses restart_lsn. 
+ */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + *restBytes = + (restartSeg + limitSegs - currSeg) * wal_segment_size + + fragbytes; + } + } + /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) return 1; @@ -9562,7 +9683,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checktpoint. */ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..d28896dc58 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..d9ed9e8cf2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a14651010f..4a096c9478 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => 
'{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 744d501e31..dcd5f19644 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 9af60de1bc2c172ddb0b1d21fe506bd5fa179fe9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..401e3b1bd0 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. 
+my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'insecured' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From a878a619751d7a229706ee76542f47633eabe013 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 
15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 29 +++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 22 ++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 59 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index fffb79f713..6d76da97dc 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9886,6 +9886,35 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them have been removed. The + last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>min_keep_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The address (<literal>LSN</literal>) back to which is available + to the replication slot. The user of the slot can no longer continue + streaming if this exceeds restart_lsn. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4d48d93305..8c2c1cf345 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,28 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to reatin in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> + <para> + This parameter is used being rounded down to the multiples of WAL file + size. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 934eb9052d..50ebb23c23 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. 
On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3
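To make the arithmetic in GetOldestKeepSegment() concrete, here is a hypothetical worked example of the "remain" value (all numbers invented: 16MB segments, max_slot_wal_keep_size = 64MB so limitSegs = 4, the slot's restart_lsn in segment 10, and the current write position 4MB into segment 12; the wal_keep_segments adjustment is ignored):

-- remain = (restartSeg + limitSegs - currSeg) * wal_segment_size + fragbytes,
-- where fragbytes = wal_segment_size - (currLSN % wal_segment_size).
SELECT pg_size_pretty(
         ((10 + 4 - 12) * 16 * 1024 * 1024                 -- whole segments of headroom
          + (16 * 1024 * 1024 - 4 * 1024 * 1024))::bigint  -- fragment left in the current segment
       ) AS remain;                                        -- => 44 MB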
On Mon, Jul 23, 2018 at 4:16 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> >> > As I reconsidered this, I noticed that "lsn - lsn" doesn't make >> > sense here. The correct formula for the value is >> > "max_slot_wal_keep_size * 1024 * 1024 - ((oldest LSN to keep) - >> > restart_lsn). It is not a simple formula to write by hand but >> > doesn't seem general enough. I re-changed my mind to show the >> > "distance" there again. >> > >> > pg_replication_slots now has the column "remain" instaed of >> > "min_keep_lsn", which shows an LSN when wal_status is "streaming" >> > and otherwise "0/0". In a special case, "remain" can be "0/0" >> > while "wal_status" is "streaming". It is the reason for the >> > tristate return value of IsLsnStillAvaialbe(). >> > >> > wal_status | remain >> > streaming | 0/19E3C0 -- WAL is reserved >> > streaming | 0/0 -- Still reserved but on the boundary >> > keeping | 0/0 -- About to be lost. >> > lost | 0/0 -- Lost. >> > >> >> The "remain" column still shows same value at all rows as follows >> because you always compare between the current LSN and the minimum LSN >> of replication slots. Is that you expected? My comment was to show the > > Ouch! Sorry for the silly mistake. GetOldestKeepSegment should > calculate restBytes based on the distance from the cutoff LSN to > restart_lsn, not to minSlotLSN. The attached fixed v6 correctly > shows the distance individually. > >> Also, I'm not sure it's a good way to show the distance as LSN. LSN is >> a monotone increasing value but in your patch, a value of the "remain" >> column can get decreased. As an alternative way I'd suggest to show it > > The LSN of WAL won't be decreased but an LSN is just a position > in a WAL stream. Since the representation of LSN is composed of > the two components 'file number' and 'offset', it's quite natural > to show the difference in the same unit. The distance between the > points at "6m" and "10m" is "4m". > >> as the number of segments. Attached patch is a patch for your v5 patch >> that changes it so that the column shows how many WAL segments of >> individual slots are remained until they get lost WAL. > > Segment size varies by configuration, so segment number is not > intuitive to show distance. I think it is the most significant > reason we move to "bytes" from "segments" about WAL sizings like > max_wal_size. More than anything, it's too coarse. The required > segments may be lasts for the time to consume a whole segment or > may be removed just after. We could calculate the fragment bytes > but it requires some internal knowledge. > > Instead, I made the field be shown in flat "bytes" using bigint, > which can be nicely shown using pg_size_pretty; Thank you for updating. I agree showing the remain in bytes. Here is review comments for v6 patch. 
@@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- + node_a_slot | physical | | | f | | | | unknown | 0/1000000 This funk should be updated. ----- +/* + * Returns minimum segment number the next checktpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any + * slot loses reserving a WAL record. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, XLogRecPtr restartLSN, uint64 *restBytes) +{ You're assuming that the minSlotLSN is the minimum LSN of replication slots but it's not mentioned anywhere. Since you check minSlotSeg <= currSeg but not force it, if a caller sets a wrong value to minSlotLSN this function will return a wrong value with no complaints. Similarly there is not explanation about the resetartLSN, so you can add it. I'm not sure the augment name restartLSN is suitable for the function in xlog.c but I'd defer it to committers. Since this function assumes that both restartLSN and *restBytes are valid or invalid (and NULL) it's better to add assertions for safety. The current code accepts even the case where only either argment is valid. ----- + if (limitSegs > 0 && currSeg <= restartSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses restart_lsn. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + *restBytes = + (restartSeg + limitSegs - currSeg) * wal_segment_size + + fragbytes; + } + } This code doesn't consider the case where wal_keep_segments > max_slot_keep_size. In the case I think we should use (currSeg - wal_keep_segments) as the lower bound in order to avoid showing "streaming" in the wal_status although the remain is 0. ----- + *restBytes = + (restartSeg + limitSegs - currSeg) * wal_segment_size + + fragbytes; Maybe you can use XLogSegNoOffsetToRecPtr instead. ----- + * 0 means that WAL record at targetLSN is alredy removed. + * 1 means that WAL record at tagetLSN is availble. + * 2 means that WAL record at tagetLSN is availble but about to be removed by s/alredy/already/ s/tagetLSN/targetLSN/ s/availble/available/ ----- + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any + * slot loses reserving a WAL record. s/resetBytes/restBytes/ ----- + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to reatin in the <filename>pg_wal</filename> s/reatin/retain/ Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hello. At Tue, 24 Jul 2018 16:47:41 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD0rChq7wQE=_o95quopcQGjcVG9omwdH07nT5cm81hzg@mail.gmail.com> > On Mon, Jul 23, 2018 at 4:16 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > > > At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> .. > > Instead, I made the field be shown in flat "bytes" using bigint, > > which can be nicely shown using pg_size_pretty; > > Thank you for updating. I agree showing the remain in bytes. > > Here is review comments for v6 patch. > > @@ -967,9 +969,9 @@ postgres=# SELECT * FROM > pg_create_physical_replication_slot('node_a_slot'); > node_a_slot | > > postgres=# SELECT * FROM pg_replication_slots; > - slot_name | slot_type | datoid | database | active | xmin | > restart_lsn | confirmed_flush_lsn > --------------+-----------+--------+----------+--------+------+-------------+--------------------- > - node_a_slot | physical | | | f | | | > + slot_name | slot_type | datoid | database | active | xmin | > restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn > +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- > + node_a_slot | physical | | | f | | > | | unknown | 0/1000000 > > This funk should be updated. Perhaps you need a fresh database cluster. > ----- > +/* > + * Returns minimum segment number the next checktpoint must leave considering > + * wal_keep_segments, replication slots and max_slot_wal_keep_size. > + * > + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any > + * slot loses reserving a WAL record. > + */ > +static XLogSegNo > +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, > XLogRecPtr restartLSN, uint64 *restBytes) > +{ > > You're assuming that the minSlotLSN is the minimum LSN of replication > slots but it's not mentioned anywhere. Since you check minSlotSeg <= I added description for parameters in the function comment. > currSeg but not force it, if a caller sets a wrong value to minSlotLSN > this function will return a wrong value with no complaints. Similarly I don't think such case can happen on a sane system. Even that happenes it behaves in the same way as minSlotLSN being invalid in the case. KeepLogSeg() also behaves in the same way and the wal recycling will be performed as pg_replication_losts predicted. Nothing can improve the behavior and I don't think placing assertion there is overkill. > there is not explanation about the resetartLSN, so you can add it. I'm > not sure the augment name restartLSN is suitable for the function in > xlog.c but I'd defer it to committers. Done. > Since this function assumes that both restartLSN and *restBytes are > valid or invalid (and NULL) it's better to add assertions for safety. > The current code accepts even the case where only either argment is > valid. > ----- > + if (limitSegs > 0 && currSeg <= restartSeg + limitSegs) > + { Even if the caller gives InvalidRecPtr as restartLSN, which is an insane situation, the function just treats the value as zero and reuturns the "correct" value for the restartLSN, which doesn't harm anything. > + /* > + * This slot still has all required segments. > Calculate how many > + * LSN bytes the slot has until it loses restart_lsn. 
> + */ > + fragbytes = wal_segment_size - (currLSN % > wal_segment_size); > + *restBytes = > + (restartSeg + limitSegs - currSeg) * > wal_segment_size > + + fragbytes; > + } > + } > > This code doesn't consider the case where wal_keep_segments > > max_slot_keep_size. In the case I think we should use (currSeg - > wal_keep_segments) as the lower bound in order to avoid showing > "streaming" in the wal_status although the remain is 0. Thanks. It should use keepSegs instead of limitSegs. Fixed. > ----- > + *restBytes = > + (restartSeg + limitSegs - currSeg) * > wal_segment_size > + + fragbytes; > > Maybe you can use XLogSegNoOffsetToRecPtr instead. Indeed. I'm not sure it is easier to read, though. (Maybe the functions should use wal_segment_size out-of-band. (That is, not passed as a parameter)). > ----- > + * 0 means that WAL record at targetLSN is alredy removed. > + * 1 means that WAL record at tagetLSN is availble. > + * 2 means that WAL record at tagetLSN is availble but about to be removed by > > s/alredy/already/ > s/tagetLSN/targetLSN/ > s/availble/available/ > ----- > + * If resetBytes is not NULL, returns remaining LSN bytes to advance until any > + * slot loses reserving a WAL record. > > s/resetBytes/restBytes/ Ugggh! Sorry that my fingers are extra-fat.. Fixed. I rechecked through the whole patch and found one more typo. > ----- > + Specify the maximum size of WAL files > + that <link linkend="streaming-replication-slots">replication > + slots</link> are allowed to reatin in the <filename>pg_wal</filename> > > s/reatin/retain/ Thank you. I also found other leftovers in catalogs.sgml and high-availability.sgml. # The latter file seems needing amendment for v11. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 83284a5e3f23ad45492ce54421c6af9c86e1d598 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 22 ++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 14 ++++++++------ 3 files changed, 58 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index fffb79f713..5acee4d0e8 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9886,6 +9886,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them have been removed. The + last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until the + slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bee4afbe4e..9d190c3daa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,28 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> + <para> + This parameter is used being rounded down to the multiples of WAL file + size. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 934eb9052d..39068b8f82 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); node_a_slot | postgres=# SELECT * FROM pg_replication_slots; - slot_name | slot_type | datoid | database | active | xmin | restart_lsn | confirmed_flush_lsn --------------+-----------+--------+----------+--------+------+-------------+--------------------- - node_a_slot | physical | | | f | | | + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- + node_a_slot | | physical | | | f | f | | | | unknown | | unknown | 0 (1 row) </programlisting> To configure the standby to use this slot, <varname>primary_slot_name</varname> -- 2.16.3 From 3e589f127df280f56bc8382426f9f3b1f16f21f1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 161 ++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..15820e049e --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,161 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 7; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); + +# All segments still must be secured after a checkpoint. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming", 'check that slot is keeping all segments'); + +# The stanby can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Advance WAL again +advance_wal($node_master, 10); + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 32; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Some segments become 'keeping' +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|lost", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } + + $node->safe_psql('postgres', "CHECKPOINT;"); +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 712dda58cf51abc5ea64110d566c323e807aa8dd Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. 
The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be written until the slot loses reserving WAL records. --- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 130 ++++++++++++++++++++++++++++++++- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 +++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 171 insertions(+), 12 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 982eedad32..c9f28fd890 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, XLogRecPtr restartLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9490,15 +9490,114 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is already removed. + * 1 means that WAL record at targetLSN is available. + * 2 means that WAL record at targetLSN is available but about to be removed by + * the next checkpoint. 
+ */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * restartLSN is restart_lsn of a slot. + + * If restBytes is not NULL, returns remaining LSN bytes to advance until the + * segment which contains restart_LSN is will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, XLogRecPtr restartLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; @@ -9530,6 +9629,30 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) keepSegs = wal_keep_segments; + /* + * Return remaining LSN bytes to advance until the slot gives up reserving + * WAL records if requested. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo restartSeg; + + *restBytes = 0; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + if (keepSegs > 0 && currSeg <= restartSeg + keepSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses restart_lsn. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(restartSeg + keepSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } + /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) return 1; @@ -9558,7 +9681,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..5db294f64e 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..d9ed9e8cf2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a14651010f..4a096c9478 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', 
provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 744d501e31..dcd5f19644 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From d29a5c6d53dbb88077f00c8a30d7a1ca0a5a24d5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 106 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 95 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 493f1db7b9..982eedad32 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9488,6 +9490,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. 
+ */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb > 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keeping segments */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9499,38 +9548,45 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c5ba149996..dd65f8f17c 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2538,6 +2538,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c0d3fb8491..cb5b2bcc89 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ 
b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3
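Assuming the wal_status and remain columns added by the patch set above (they exist only in these patches, not in any released server), a monitoring query on the primary could look like the following sketch; pg_size_pretty() and pg_size_bytes() are existing core functions:

SELECT slot_name,
       wal_status,
       pg_size_pretty(remain) AS remain
  FROM pg_replication_slots
 WHERE wal_status IN ('keeping', 'lost')   -- already at or past the limit
    OR remain < pg_size_bytes('1GB');      -- getting close to it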
On Tue, Jun 26, 2018 at 04:26:59PM +0900, Kyotaro HORIGUCHI wrote: > Hello. This is the reabased version of slot-limit feature. > > This patch limits maximum WAL segments to be kept by replication > slots. Replication slot is useful to avoid desync with replicas > after temporary disconnection but it is dangerous when some of > replicas are lost. The WAL space can be exhausted and server can > PANIC in the worst case. This can prevent the worst case having a > benefit from replication slots using a new GUC variable > max_slot_wal_keep_size. Have you considered just using a boolean to control if max_wal_size honors WAL preserved by replication slots, rather than creating the new GUC max_slot_wal_keep_size? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 2018-07-31 15:11:52 -0400, Bruce Momjian wrote: > On Tue, Jun 26, 2018 at 04:26:59PM +0900, Kyotaro HORIGUCHI wrote: > > Hello. This is the reabased version of slot-limit feature. > > > > This patch limits maximum WAL segments to be kept by replication > > slots. Replication slot is useful to avoid desync with replicas > > after temporary disconnection but it is dangerous when some of > > replicas are lost. The WAL space can be exhausted and server can > > PANIC in the worst case. This can prevent the worst case having a > > benefit from replication slots using a new GUC variable > > max_slot_wal_keep_size. > > Have you considered just using a boolean to control if max_wal_size > honors WAL preserved by replication slots, rather than creating the new > GUC max_slot_wal_keep_size? That seems like a bad idea. max_wal_size influences checkpoint scheduling - there's no good reason to conflate that with retention? Greetings, Andres Freund
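The distinction is visible on any stock server: max_wal_size only drives checkpoint scheduling, so pg_wal on disk can already grow well past it while a slot pins old segments. A rough check (PostgreSQL 10 or later, no patch involved; pg_ls_waldir() needs suitable privileges):

SELECT (SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir())    AS wal_on_disk,
       current_setting('max_wal_size')                           AS max_wal_size,
       (SELECT min(restart_lsn) FROM pg_replication_slots)       AS oldest_slot_restart_lsn;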
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2018-07-31 15:11:52 -0400, Bruce Momjian wrote: > > On Tue, Jun 26, 2018 at 04:26:59PM +0900, Kyotaro HORIGUCHI wrote: > > > Hello. This is the reabased version of slot-limit feature. > > > > > > This patch limits maximum WAL segments to be kept by replication > > > slots. Replication slot is useful to avoid desync with replicas > > > after temporary disconnection but it is dangerous when some of > > > replicas are lost. The WAL space can be exhausted and server can > > > PANIC in the worst case. This can prevent the worst case having a > > > benefit from replication slots using a new GUC variable > > > max_slot_wal_keep_size. > > > > Have you considered just using a boolean to control if max_wal_size > > honors WAL preserved by replication slots, rather than creating the new > > GUC max_slot_wal_keep_size? > > That seems like a bad idea. max_wal_size influences checkpoint > scheduling - there's no good reason to conflate that with retention? I agree that we shouldn't conflate checkpointing and retention. What I wonder about though is what value will wal_keep_segments have once this new GUC exists..? I wonder if we could deprecate it... I wish we had implemented repliation slots from the start with wal_keep_segments capping the max WAL retained but that ship has sailed and changing it now would break existing configurations. Thanks! Stephen
On 2018-07-31 15:21:27 -0400, Stephen Frost wrote: > Greetings, > > * Andres Freund (andres@anarazel.de) wrote: > > On 2018-07-31 15:11:52 -0400, Bruce Momjian wrote: > > > On Tue, Jun 26, 2018 at 04:26:59PM +0900, Kyotaro HORIGUCHI wrote: > > > > Hello. This is the reabased version of slot-limit feature. > > > > > > > > This patch limits maximum WAL segments to be kept by replication > > > > slots. Replication slot is useful to avoid desync with replicas > > > > after temporary disconnection but it is dangerous when some of > > > > replicas are lost. The WAL space can be exhausted and server can > > > > PANIC in the worst case. This can prevent the worst case having a > > > > benefit from replication slots using a new GUC variable > > > > max_slot_wal_keep_size. > > > > > > Have you considered just using a boolean to control if max_wal_size > > > honors WAL preserved by replication slots, rather than creating the new > > > GUC max_slot_wal_keep_size? > > > > That seems like a bad idea. max_wal_size influences checkpoint > > scheduling - there's no good reason to conflate that with retention? > > I agree that we shouldn't conflate checkpointing and retention. What I > wonder about though is what value will wal_keep_segments have once this > new GUC exists..? I wonder if we could deprecate it... Don't think that's a good idea. It's entirely conceivable to have a wal_keep_segments much lower than max_slot_wal_keep_size. For some throwaway things it can be annoying to have to slots, and if you remove wal_keep_segments there's no alternative. Greetings, Andres Freund
At Tue, 31 Jul 2018 12:24:13 -0700, Andres Freund <andres@anarazel.de> wrote in <20180731192413.7lr4qbc4qbyoim5y@alap3.anarazel.de> > On 2018-07-31 15:21:27 -0400, Stephen Frost wrote: > > Greetings, > > > > * Andres Freund (andres@anarazel.de) wrote: > > > On 2018-07-31 15:11:52 -0400, Bruce Momjian wrote: > > > > On Tue, Jun 26, 2018 at 04:26:59PM +0900, Kyotaro HORIGUCHI wrote: > > > > > Hello. This is the reabased version of slot-limit feature. > > > > > > > > > > This patch limits maximum WAL segments to be kept by replication > > > > > slots. Replication slot is useful to avoid desync with replicas > > > > > after temporary disconnection but it is dangerous when some of > > > > > replicas are lost. The WAL space can be exhausted and server can > > > > > PANIC in the worst case. This can prevent the worst case having a > > > > > benefit from replication slots using a new GUC variable > > > > > max_slot_wal_keep_size. > > > > > > > > Have you considered just using a boolean to control if max_wal_size > > > > honors WAL preserved by replication slots, rather than creating the new > > > > GUC max_slot_wal_keep_size? > > > > > > That seems like a bad idea. max_wal_size influences checkpoint > > > scheduling - there's no good reason to conflate that with retention? > > > > I agree that we shouldn't conflate checkpointing and retention. What I > > wonder about though is what value will wal_keep_segments have once this > > new GUC exists..? I wonder if we could deprecate it... > > Don't think that's a good idea. It's entirely conceivable to have a > wal_keep_segments much lower than max_slot_wal_keep_size. For some > throwaway things it can be annoying to have to slots, and if you remove > wal_keep_segments there's no alternative. I thought it's to be deprecated for some reason so I'm leaving wal_keep_segments in '# of segments' even though the new GUC is in MB. I'm a bit uneasy that the two similar settings are in different units. Couldn't we turn it into MB taking this opportunity if we will keep wal_keep_segments, changing its name to min_wal_keep_size? max_slot_wal_keep_size could be changed to just max_wal_keep_size along with it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
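To make the unit mismatch concrete: a segment count always has a direct size equivalent because the segment size is fixed per cluster (16MB by default), so a size-based min_wal_keep_size would express the same thing. A throwaway conversion check:

-- wal_keep_segments = 64 under the default 16MB segment size
SELECT pg_size_pretty(64 * pg_size_bytes('16MB'));
-- => 1024 MB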
On Tue, Jul 31, 2018 at 9:52 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I thought it's to be deprecated for some reason so I'm leaving > wal_keep_segments in '# of segments' even though the new GUC is > in MB. I'm a bit uneasy that the two similar settings are in > different units. Couldn't we turn it into MB taking this > opportunity if we will keep wal_keep_segments, changing its name > to min_wal_keep_size? max_slot_wal_keep_size could be changed to > just max_wal_keep_size along with it. This seems like it's a little bit of a separate topic from what this thread is about, but FWIW, +1 for standardizing on MB. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
At Thu, 2 Aug 2018 09:05:33 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYVrKY0W0jigJymFZo0ewkQoWGfLLpiTSgJLQN3tcHGTg@mail.gmail.com> > On Tue, Jul 31, 2018 at 9:52 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I thought it's to be deprecated for some reason so I'm leaving > > wal_keep_segments in '# of segments' even though the new GUC is > > in MB. I'm a bit uneasy that the two similar settings are in > > different units. Couldn't we turn it into MB taking this > > opportunity if we will keep wal_keep_segments, changing its name > > to min_wal_keep_size? max_slot_wal_keep_size could be changed to > > just max_wal_keep_size along with it. > > This seems like it's a little bit of a separate topic from what this > thread is about, but FWIW, +1 for standardizing on MB. Thanks. OK, I'll raise this separately after this thread. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Thank you for updating the patch. On Tue, Jul 31, 2018 at 6:11 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Tue, 24 Jul 2018 16:47:41 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD0rChq7wQE=_o95quopcQGjcVG9omwdH07nT5cm81hzg@mail.gmail.com> >> On Mon, Jul 23, 2018 at 4:16 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello. >> > >> > At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> > .. >> > Instead, I made the field be shown in flat "bytes" using bigint, >> > which can be nicely shown using pg_size_pretty; >> >> Thank you for updating. I agree showing the remain in bytes. >> >> Here is review comments for v6 patch. >> >> @@ -967,9 +969,9 @@ postgres=# SELECT * FROM >> pg_create_physical_replication_slot('node_a_slot'); >> node_a_slot | >> >> postgres=# SELECT * FROM pg_replication_slots; >> - slot_name | slot_type | datoid | database | active | xmin | >> restart_lsn | confirmed_flush_lsn >> --------------+-----------+--------+----------+--------+------+-------------+--------------------- >> - node_a_slot | physical | | | f | | | >> + slot_name | slot_type | datoid | database | active | xmin | >> restart_lsn | confirmed_flush_lsn | wal_status | min_keep_lsn >> +-------------+-----------+--------+----------+--------+------+-------------+---------------------+------------+-------------- >> + node_a_slot | physical | | | f | | >> | | unknown | 0/1000000 >> >> This funk should be updated. > > Perhaps you need a fresh database cluster. I meant this was a doc update in 0004 patch but it's fixed in v7 patch. While testing the v7 patch, I got the following result with max_slot_wal_keep_size = 5GB and without wal_keep_segments setting. =# select pg_current_wal_lsn(), slot_name, restart_lsn, confirmed_flush_lsn, wal_status, remain, pg_size_pretty(remain) from pg_replication_slots ; pg_current_wal_lsn | slot_name | restart_lsn | confirmed_flush_lsn | wal_status | remain | pg_size_pretty --------------------+-----------+-------------+---------------------+------------+----------+---------------- 2/A30000D8 | l1 | 1/AC000910 | 1/AC000948 | streaming | 16777000 | 16 MB (1 row) The actual distance between the slot limit and the slot 'l1' is about 1GB(5GB - (2/A30000D8 - 1/AC000910)) but the system view says the remain is only 16MB. For the calculation of resetBytes in GetOldestKeepSegment(), the current patch seems to calculate the distance between the minSlotLSN and restartLSN when (curLSN - max_slot_wal_keep_size) < minSlotLSN. However, I think that the actual remained bytes until the slot lost the required WAL is (restartLSN - (currLSN - max_slot_wal_keep_size)) in that case. Also, 0004 patch needs to be rebased on the current HEAD. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
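Plugging the figures from the report above into the suggested formula, restart_lsn - (current LSN - max_slot_wal_keep_size), gives the value one would expect the view to show; pg_wal_lsn_diff() and pg_size_pretty() are existing core functions, and only the two LSNs and the 5GB setting are taken from the report:

SELECT pg_size_pretty(pg_size_bytes('5GB')
                      - pg_wal_lsn_diff('2/A30000D8', '1/AC000910'));
-- => 1168 MB, i.e. roughly 1 GB rather than the 16 MB the view reported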
At Mon, 3 Sep 2018 18:14:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBgCMc9bp2cADMFm40qoEXxbomdu1dtj5EaFSAS4BtAyw@mail.gmail.com> > Thank you for updating the patch! > > On Tue, Jul 31, 2018 at 6:11 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > > > At Tue, 24 Jul 2018 16:47:41 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD0rChq7wQE=_o95quopcQGjcVG9omwdH07nT5cm81hzg@mail.gmail.com> > >> On Mon, Jul 23, 2018 at 4:16 PM, Kyotaro HORIGUCHI > >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> > Hello. > >> > > >> > At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> > >> This funk should be updated. > > > > Perhaps you need a fresh database cluster. > > I meant this was a doc update in 0004 patch but it's fixed in v7 patch. Wow.. > While testing the v7 patch, I got the following result with > max_slot_wal_keep_size = 5GB and without wal_keep_segments setting. > > =# select pg_current_wal_lsn(), slot_name, restart_lsn, > confirmed_flush_lsn, wal_status, remain, pg_size_pretty(remain) from > pg_replication_slots ; > pg_current_wal_lsn | slot_name | restart_lsn | confirmed_flush_lsn | > wal_status | remain | pg_size_pretty > --------------------+-----------+-------------+---------------------+------------+----------+---------------- > 2/A30000D8 | l1 | 1/AC000910 | 1/AC000948 | > streaming | 16777000 | 16 MB > (1 row) > > The actual distance between the slot limit and the slot 'l1' is about > 1GB(5GB - (2/A30000D8 - 1/AC000910)) but the system view says the > remain is only 16MB. For the calculation of resetBytes in > GetOldestKeepSegment(), the current patch seems to calculate the > distance between the minSlotLSN and restartLSN when (curLSN - > max_slot_wal_keep_size) < minSlotLSN. However, I think that the actual > remained bytes until the slot lost the required WAL is (restartLSN - > (currLSN - max_slot_wal_keep_size)) in that case. Oops! That's a silly thinko or rather a typo. It's apparently wrong that keepSegs instead of limitSegs is involved in making the calculation of restBytes. Just using limitSegs makes it sane. It's a pity that I removed the remain from regression test. Fixed that and added an item for remain calculation in the TAP test. I expect that pg_size_pretty() adds some robustness to the test. > Also, 0004 patch needs to be rebased on the current HEAD. Done. Please find the v8 attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From a380a7fffcf01eb869035113da451c670ec52772 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. 
--- src/backend/access/transam/xlog.c | 106 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 95 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 493f1db7b9..982eedad32 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = 0; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9488,6 +9490,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb > 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keeping segments */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9499,38 +9548,45 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail("The mostly affected slot has lost %ld segments.", + lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 0625eff219..897fb72e15 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2538,6 +2538,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + 0, 0, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 7486d20a34..7d7f04aa51 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = 0 # measured in bytes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From 19d421790bb03cf93f2c944fdd70e38b0d710b3d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be written until the slot loses reserving WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 133 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 +++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 172 insertions(+), 14 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 982eedad32..1934f2165b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, XLogRecPtr restartLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9490,19 +9490,119 @@ CreateRestartPoint(int flags) return true; } + +/* + * Returns the segment number of the oldest file in XLOG directory. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is already removed. + * 1 means that WAL record at targetLSN is available. + * 2 means that WAL record at targetLSN is available but about to be removed by + * the next checkpoint. 
+ */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* + * oldestSeg is zero before at least one segment has been removed since + * startup. Use oldest segno taken from file names. + */ + if (oldestSeg == 0) + { + static XLogSegNo oldestFileSeg = 0; + + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + /* let it have the same meaning with lastRemovedSegNo here */ + oldestSeg = oldestFileSeg - 1; + } + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * restartLSN is restart_lsn of a slot. + + * If restBytes is not NULL, returns remaining LSN bytes to advance until the + * segment which contains restart_LSN is will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, XLogRecPtr restartLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; XLogSegNo minSlotSeg; + uint64 limitSegs = 0; XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9517,8 +9617,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb > 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Apply max_slot_wal_keep_size to keeping segments */ @@ -9530,6 +9628,30 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) keepSegs = wal_keep_segments; + /* + * Return remaining LSN bytes to advance until the slot gives up reserving + * WAL records if requested. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo restartSeg; + + *restBytes = 0; + + XLByteToSeg(restartLSN, restartSeg, wal_segment_size); + if (limitSegs > 0 && currSeg <= restartSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses restart_lsn. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } + /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) return 1; @@ -9558,7 +9680,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..5db294f64e 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..d9ed9e8cf2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a14651010f..4a096c9478 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', 
provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 078129f251..02286cdfe8 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From f7770eee8533c91953bed6d37612c068b73d30e2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 173 ++++++++++++++++++++++++++++++ 1 file changed, 173 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..d198ca3054 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,173 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 9; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# All segments still must be secured after a checkpoint. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is keeping all segments'); + +# The stanby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 48; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# Segments are still secured. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|64 MB", 'check that remaining byte are calculated'); + +# Advance WAL again with checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Segments are still secured. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|32 MB", 'remaining byte should be reduced by 32MB'); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Segments become 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that some segments are about to removed'); + +# The stanby still can connect master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments", + $logstart), + 'check that the warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat 
$node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From def2508a74f4f921da329c82efcacef73fcfc1c7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 22 ++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 55 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 07e8b3325f..7e31267d68 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9887,6 +9887,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them have been removed. The + last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is not zero. If the slot + doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until the + slot may lose required WAL records. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bee4afbe4e..9d190c3daa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,28 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> + <para> + This parameter is used being rounded down to the multiples of WAL file + size. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 8cb77f85ec..04cdccb10d 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3
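As a quick way to see the new columns in action once the patches are applied, a query along the following lines should list the slots that are no longer safely reserving WAL. The wal_status and remain columns, and the status strings used in the filter, come from patch 2; the query itself is only a sketch.

SELECT slot_name, restart_lsn, wal_status, pg_size_pretty(remain) AS remain
  FROM pg_replication_slots
 WHERE wal_status IN ('keeping', 'lost');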
On Tue, Sep 4, 2018 at 7:52 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Mon, 3 Sep 2018 18:14:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBgCMc9bp2cADMFm40qoEXxbomdu1dtj5EaFSAS4BtAyw@mail.gmail.com> >> Thank you for updating the patch! >> >> On Tue, Jul 31, 2018 at 6:11 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello. >> > >> > At Tue, 24 Jul 2018 16:47:41 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD0rChq7wQE=_o95quopcQGjcVG9omwdH07nT5cm81hzg@mail.gmail.com> >> >> On Mon, Jul 23, 2018 at 4:16 PM, Kyotaro HORIGUCHI >> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> > Hello. >> >> > >> >> > At Fri, 20 Jul 2018 10:13:58 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDayePWwu4t=VPP5P1QFDSBvks1d8j76bXp5rbXoPbZcA@mail.gmail.com> >> >> This funk should be updated. >> > >> > Perhaps you need a fresh database cluster. >> >> I meant this was a doc update in 0004 patch but it's fixed in v7 patch. > > Wow.. > >> While testing the v7 patch, I got the following result with >> max_slot_wal_keep_size = 5GB and without wal_keep_segments setting. >> >> =# select pg_current_wal_lsn(), slot_name, restart_lsn, >> confirmed_flush_lsn, wal_status, remain, pg_size_pretty(remain) from >> pg_replication_slots ; >> pg_current_wal_lsn | slot_name | restart_lsn | confirmed_flush_lsn | >> wal_status | remain | pg_size_pretty >> --------------------+-----------+-------------+---------------------+------------+----------+---------------- >> 2/A30000D8 | l1 | 1/AC000910 | 1/AC000948 | >> streaming | 16777000 | 16 MB >> (1 row) >> >> The actual distance between the slot limit and the slot 'l1' is about >> 1GB(5GB - (2/A30000D8 - 1/AC000910)) but the system view says the >> remain is only 16MB. For the calculation of resetBytes in >> GetOldestKeepSegment(), the current patch seems to calculate the >> distance between the minSlotLSN and restartLSN when (curLSN - >> max_slot_wal_keep_size) < minSlotLSN. However, I think that the actual >> remained bytes until the slot lost the required WAL is (restartLSN - >> (currLSN - max_slot_wal_keep_size)) in that case. > > Oops! That's a silly thinko or rather a typo. It's apparently > wrong that keepSegs instead of limitSegs is involved in making > the calculation of restBytes. Just using limitSegs makes it > sane. It's a pity that I removed the remain from regression test. > > Fixed that and added an item for remain calculation in the TAP > test. I expect that pg_size_pretty() adds some robustness to the > test. > >> Also, 0004 patch needs to be rebased on the current HEAD. > > Done. Please find the v8 attached. > Thank you for updating! Here is the review comment for v8 patch. + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses restart_lsn. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); For the calculation of fragbytes, I think we should calculate the fragment bytes of restartLSN instead. The the formula "restartSeg + limitSegs - currSeg" means the # of segment between restartLSN and the limit by the new parameter. I don't think that the summation of it and fragment bytes of currenLSN is correct. As the following result (max_slot_wal_keep_size is 128MB) shows, the remain column shows the actual remains + 16MB (get_bytes function returns the value of max_slot_wal_keep_size in bytes). 
postgres(1:29447)=# select pg_current_wal_lsn(), slot_name, restart_lsn, wal_status, remain, pg_size_pretty(remain), pg_size_pretty(get_bytes('max_slot_wal_keep_size') - (pg_current_wal_lsn() - restart_lsn)) from pg_replication_slots ; pg_current_wal_lsn | slot_name | restart_lsn | wal_status | remain | pg_size_pretty | pg_size_pretty --------------------+-----------+-------------+------------+-----------+----------------+---------------- 0/1D0001F0 | l1 | 0/1D0001B8 | streaming | 150994448 | 144 MB | 128 MB (1 row) --- If the wal_keeps_segments is greater than max_slot_wal_keep_size, the wal_keep_segments doesn't affect the value of the remain column. postgres(1:48422)=# show max_slot_wal_keep_size ; max_slot_wal_keep_size ------------------------ 128MB (1 row) postgres(1:48422)=# show wal_keep_segments ; wal_keep_segments ------------------- 5000 (1 row) postgres(1:48422)=# select slot_name, wal_status, remain, pg_size_pretty(remain) as remain from pg_replication_slots ; slot_name | wal_status | remain | remain -----------+------------+-----------+-------- l1 | streaming | 150994728 | 144 MB (1 row) *** After consumed over 128MB WAL *** postgres(1:48422)=# select slot_name, wal_status, remain, pg_size_pretty(remain) as remain from pg_replication_slots ; slot_name | wal_status | remain | remain -----------+------------+--------+--------- l1 | streaming | 0 | 0 bytes (1 row) --- For the cosmetic stuff there are code where need the line break. static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, XLogRecPtr restartLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); and +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, XLogRecPtr restartLSN, uint64 *restBytes) +{ Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
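In round numbers the discrepancy above amounts to one extra segment. With 16MB segments, a 128MB (eight-segment) limit and restart_lsn caught up to the insert position, the formula in the patch counts the full eight segments plus the unwritten tail of the current segment, whereas the plain LSN distance gives just the configured limit. Purely illustrative arithmetic:

SELECT pg_size_pretty(((8 + 1) * 16 * 1024 * 1024)::bigint) AS formula_in_patch, -- 144 MB
       pg_size_pretty((128 * 1024 * 1024)::bigint)          AS lsn_distance;     -- 128 MB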
Thank you for the comment. At Wed, 5 Sep 2018 14:31:10 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB-HJvL+uKsv40Gb8Dymh9uBBQUXTucqv4MDtH_AGKh4g@mail.gmail.com> > On Tue, Sep 4, 2018 at 7:52 PM, Kyotaro HORIGUCHI > Thank you for updating! Here is the review comment for v8 patch. > > + /* > + * This slot still has all required segments. Calculate how many > + * LSN bytes the slot has until it loses restart_lsn. > + */ > + fragbytes = wal_segment_size - (currLSN % wal_segment_size); > + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, > fragbytes, > + wal_segment_size, *restBytes); > > For the calculation of fragbytes, I think we should calculate the > fragment bytes of restartLSN instead. The the formula "restartSeg + > limitSegs - currSeg" means the # of segment between restartLSN and the > limit by the new parameter. I don't think that the summation of it and > fragment bytes of currenLSN is correct. As the following result > (max_slot_wal_keep_size is 128MB) shows, the remain column shows the > actual remains + 16MB (get_bytes function returns the value of > max_slot_wal_keep_size in bytes). Since a oldest segment is removed after the current LSN moves to the next segmen, current LSN naturally determines the fragment bytes. Maybe you're concerning that the number of segments looks too much by one segment. One arguable point of the feature is how max_slot_wal_keep_size works exactly. I assume that even though the name is named as "max_", we actually expect that "at least that bytes are kept". So, for example, with 16MB of segment size and 50MB of max_s_w_k_s, I designed this so that the size of preserved WAL doesn't go below 50MB, actually (rounding up to multples of 16MB of 50MB), and loses the oldest segment when it reaches 64MB + 16MB = 80MB as you saw. # I believe that the difference is not so significant since we # have around hunderd or several hundreds of segments in common # cases. Do you mean that we should define the GUC parameter literally as "we won't have exactly that many bytes of WAL segmetns"? That is, we have at most 48MB preserved WAL records for 50MB of max_s_w_k_s setting. This is the same to how max_wal_size is counted but I don't think max_slot_wal_keep_size will be regarded in the same way. The another design would be that we remove the oldest segnent when WAL reaches to 64MB and reduces to 48MB after deletion. > --- > For the cosmetic stuff there are code where need the line break. > > static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); > +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr > minSlotPtr, XLogRecPtr restartLSN, uint64 *restBytes); > static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); > static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); > > and > > +static XLogSegNo > +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, > XLogRecPtr restartLSN, uint64 *restBytes) > +{ Thanks, I folded the parameter list in my working repository. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
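The rounding in that example can be spelled out as plain arithmetic, assuming 16MB segments and max_slot_wal_keep_size = 50MB (numbers are only illustrative):

SELECT ceil(50 / 16.0) * 16      AS kept_floor_mb,        -- 64: the limit rounded up to whole segments
       ceil(50 / 16.0) * 16 + 16 AS removal_threshold_mb; -- 80: the oldest segment is dropped once retained WAL reaches this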
On Thu, Sep 6, 2018 at 4:10 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for the comment. > > At Wed, 5 Sep 2018 14:31:10 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB-HJvL+uKsv40Gb8Dymh9uBBQUXTucqv4MDtH_AGKh4g@mail.gmail.com> >> On Tue, Sep 4, 2018 at 7:52 PM, Kyotaro HORIGUCHI >> Thank you for updating! Here is the review comment for v8 patch. >> >> + /* >> + * This slot still has all required segments. Calculate how many >> + * LSN bytes the slot has until it loses restart_lsn. >> + */ >> + fragbytes = wal_segment_size - (currLSN % wal_segment_size); >> + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, >> fragbytes, >> + wal_segment_size, *restBytes); >> >> For the calculation of fragbytes, I think we should calculate the >> fragment bytes of restartLSN instead. The the formula "restartSeg + >> limitSegs - currSeg" means the # of segment between restartLSN and the >> limit by the new parameter. I don't think that the summation of it and >> fragment bytes of currenLSN is correct. As the following result >> (max_slot_wal_keep_size is 128MB) shows, the remain column shows the >> actual remains + 16MB (get_bytes function returns the value of >> max_slot_wal_keep_size in bytes). > > Since a oldest segment is removed after the current LSN moves to > the next segmen, current LSN naturally determines the fragment > bytes. Maybe you're concerning that the number of segments looks > too much by one segment. > > One arguable point of the feature is how max_slot_wal_keep_size > works exactly. I assume that even though the name is named as > "max_", we actually expect that "at least that bytes are > kept". So, for example, with 16MB of segment size and 50MB of > max_s_w_k_s, I designed this so that the size of preserved WAL > doesn't go below 50MB, actually (rounding up to multples of 16MB > of 50MB), and loses the oldest segment when it reaches 64MB + > 16MB = 80MB as you saw. > > # I believe that the difference is not so significant since we > # have around hunderd or several hundreds of segments in common > # cases. > > Do you mean that we should define the GUC parameter literally as > "we won't have exactly that many bytes of WAL segmetns"? That is, > we have at most 48MB preserved WAL records for 50MB of > max_s_w_k_s setting. This is the same to how max_wal_size is > counted but I don't think max_slot_wal_keep_size will be regarded > in the same way. I might be missing something but what I'm expecting to this feature is to restrict the how much WAL we can keep at a maximum for replication slots. In other words, the distance between the current LSN and the minimum restart_lsn of replication slots doesn't over the value of max_slot_wal_keep_size. It's similar to wal_keep_segments except for that this feature affects only replication slots. And wal_keep_segments cannot restrict WAL that replication slots are holding. For example, with 16MB of segment size and 50MB of max_slot_wal_keep_size, we can keep at most 50MB WAL for replication slots. However, once we consumed more than 50MB WAL while not advancing any restart_lsn the required WAL might be lost by the next checkpoint, which depends on the min_wal_size. On the other hand, if we mostly can advance restart_lsn to approximately the current LSN the size of preserved WAL for replication slots can go below 50MB. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
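The quantity under discussion, the distance between the current insert LSN and each slot's restart_lsn, can already be watched without the patch with something like:

SELECT slot_name,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal
  FROM pg_replication_slots
 ORDER BY pg_current_wal_lsn() - restart_lsn DESC;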
This documentation + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is zero (the default), + replication slots retain unlimited size of WAL files. + </para> doesn't say anything about what happens when the limit is exceeded. Does the system halt until the WAL is fetched from the slots? Do the slots get invalidated? Also, I don't think 0 is a good value for the default behavior. 0 would mean that a slot is not allowed to retain any more WAL than already exists anyway. Maybe we don't want to support that directly, but it's a valid configuration. So maybe use -1 for infinity. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
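For comparison, under the -1 convention suggested here the interesting settings would be spelled roughly as follows (a sketch of the proposed semantics; the parameter exists only with the patch applied, and a reload is needed afterwards since it is a SIGHUP-level setting):

ALTER SYSTEM SET max_slot_wal_keep_size = -1;    -- no limit, the proposed default
ALTER SYSTEM SET max_slot_wal_keep_size = 0;     -- retain nothing beyond what already exists
ALTER SYSTEM SET max_slot_wal_keep_size = '1GB'; -- cap WAL retained by slots at roughly 1GB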
Hello. At Thu, 6 Sep 2018 19:55:39 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAZCdvdMN-vG4D_653vb_FN-AaMAP5+GXgF1JRjy+LeyA@mail.gmail.com> > On Thu, Sep 6, 2018 at 4:10 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Thank you for the comment. > > > > At Wed, 5 Sep 2018 14:31:10 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB-HJvL+uKsv40Gb8Dymh9uBBQUXTucqv4MDtH_AGKh4g@mail.gmail.com> > >> On Tue, Sep 4, 2018 at 7:52 PM, Kyotaro HORIGUCHI > >> Thank you for updating! Here is the review comment for v8 patch. > >> > >> + /* > >> + * This slot still has all required segments. Calculate how many > >> + * LSN bytes the slot has until it loses restart_lsn. > >> + */ > >> + fragbytes = wal_segment_size - (currLSN % wal_segment_size); > >> + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, > >> fragbytes, > >> + wal_segment_size, *restBytes); > >> > >> For the calculation of fragbytes, I think we should calculate the > >> fragment bytes of restartLSN instead. The the formula "restartSeg + > >> limitSegs - currSeg" means the # of segment between restartLSN and the > >> limit by the new parameter. I don't think that the summation of it and > >> fragment bytes of currenLSN is correct. As the following result > >> (max_slot_wal_keep_size is 128MB) shows, the remain column shows the > >> actual remains + 16MB (get_bytes function returns the value of > >> max_slot_wal_keep_size in bytes). > > > > Since a oldest segment is removed after the current LSN moves to > > the next segmen, current LSN naturally determines the fragment > > bytes. Maybe you're concerning that the number of segments looks > > too much by one segment. > > > > One arguable point of the feature is how max_slot_wal_keep_size > > works exactly. I assume that even though the name is named as > > "max_", we actually expect that "at least that bytes are > > kept". So, for example, with 16MB of segment size and 50MB of > > max_s_w_k_s, I designed this so that the size of preserved WAL > > doesn't go below 50MB, actually (rounding up to multples of 16MB > > of 50MB), and loses the oldest segment when it reaches 64MB + > > 16MB = 80MB as you saw. > > > > # I believe that the difference is not so significant since we > > # have around hunderd or several hundreds of segments in common > > # cases. > > > > Do you mean that we should define the GUC parameter literally as > > "we won't have exactly that many bytes of WAL segmetns"? That is, > > we have at most 48MB preserved WAL records for 50MB of > > max_s_w_k_s setting. This is the same to how max_wal_size is > > counted but I don't think max_slot_wal_keep_size will be regarded > > in the same way. > > I might be missing something but what I'm expecting to this feature is > to restrict the how much WAL we can keep at a maximum for replication > slots. In other words, the distance between the current LSN and the > minimum restart_lsn of replication slots doesn't over the value of > max_slot_wal_keep_size. Yes, it's one possible design, the same with "we won't have more than exactly that many bytes of WAL segmetns" above ("more than" is added, which I meant). But anyway we cannot keep the limit strictly since WAL segments are removed only at checkpoint time. So If doing so, we can reach the lost state before the max_slot_wal_keep_size is filled up meanwhile WAL can exceed the size by a WAL flood. We can define it precisely at most as "wal segments are preserved at most aorund the value". 
So I choosed the definition so that we can tell about this as "we don't guarantee more than that bytes". # Uuuu. sorry for possiblly hard-to-read sentence.. > It's similar to wal_keep_segments except for > that this feature affects only replication slots. And It defines the *extra* segments to be kept, that is, if we set it to 2, at least 3 segments are present. If we set max_slot_wal_keep_size to 32MB (= 2 segs here), we have at most 3 segments since 32MB range before the current LSN almost always spans over 3 segments. Doesn't this seemingly in a similar way with wal_keep_segments? If the current LSN is at the very last of a segment and restart_lsn is catching up to the current LSN, the "remain" is equal to max_slot_wal_keep_size as the guaranteed size. If very beginning of a segments, it gets extra 16MB. > wal_keep_segments cannot restrict WAL that replication slots are > holding. For example, with 16MB of segment size and 50MB of > max_slot_wal_keep_size, we can keep at most 50MB WAL for replication > slots. However, once we consumed more than 50MB WAL while not > advancing any restart_lsn the required WAL might be lost by the next > checkpoint, which depends on the min_wal_size. I don't get the last phrase. With small min_wal_size, we don't recycle most of the "removed" segments. If large, we recycle more of them. It doesn't affect up to where the checkpoint removes WAL files. But it is right that LSN advance with max_slot_wal_keep_size bytes immediately leands to breaking a slot and it is intended behavior. > On the other hand, if > we mostly can advance restart_lsn to approximately the current LSN the > size of preserved WAL for replication slots can go below 50MB. Y..eah.. That's right. It is just how this works. But I don't understand how this is related to the intepretation of the "max" of the GUC variable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hello. At Thu, 6 Sep 2018 22:32:21 +0200, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <29bbd79d-696b-509e-578a-0fc38a3b9405@2ndquadrant.com> > This documentation > > + <para> > + Specify the maximum size of WAL files > + that <link linkend="streaming-replication-slots">replication > + slots</link> are allowed to retain in the > <filename>pg_wal</filename> > + directory at checkpoint time. > + If <varname>max_slot_wal_keep_size</varname> is zero (the default), > + replication slots retain unlimited size of WAL files. > + </para> > > doesn't say anything about what happens when the limit is exceeded. > Does the system halt until the WAL is fetched from the slots? Do the > slots get invalidated? Thanks for pointing that. That's a major cause of confusion. Does the following make sense? > Specify the maximum size of WAL files that <link > linkend="streaming-replication-slots">replication slots</link> > are allowed to retain in the <filename>pg_wal</filename> > directory at checkpoint time. If > <varname>max_slot_wal_keep_size</varname> is zero (the > default), replication slots retain unlimited size of WAL files. + If restart_lsn of a replication slot gets behind more than that + bytes from the current LSN, the standby using the slot may not + be able to reconnect due to removal of required WAL records. And the following sentense is wrong now. I'll remove it in the coming version 9. > <para> > This parameter is used being rounded down to the multiples of WAL file > size. > </para> > Also, I don't think 0 is a good value for the default behavior. 0 would > mean that a slot is not allowed to retain any more WAL than already > exists anyway. Maybe we don't want to support that directly, but it's a > valid configuration. So maybe use -1 for infinity. In realtion to the reply just sent to Sawada-san, remain of a slot can be at most 16MB in the 0 case with the default segment size. So you're right in this sense. Will fix in the coming version. Thanks. =# show max_slot_wal_keep_size; max_slot_wal_keep_size ------------------------ 0 (1 row) =# select pg_current_wal_lsn(), restart_lsn, remain, pg_size_pretty(remain) as remain from pg_replication_slots ; pg_current_wal_lsn | restart_lsn | remain | remain --------------------+-------------+----------+-------- 0/4000000 | 0/4000000 | 16777216 | 16 MB (1 row) .... =# select pg_current_wal_lsn(), restart_lsn, remain, pg_size_pretty(remain) as remain from pg_replication_slots ; pg_current_wal_lsn | restart_lsn | remain | remain --------------------+-------------+--------+-------- 0/4FF46D8 | 0/4FF46D8 | 47400 | 46 kB (1 row) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Sep 10, 2018 at 7:19 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Thu, 6 Sep 2018 19:55:39 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAZCdvdMN-vG4D_653vb_FN-AaMAP5+GXgF1JRjy+LeyA@mail.gmail.com> >> On Thu, Sep 6, 2018 at 4:10 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Thank you for the comment. >> > >> > At Wed, 5 Sep 2018 14:31:10 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB-HJvL+uKsv40Gb8Dymh9uBBQUXTucqv4MDtH_AGKh4g@mail.gmail.com> >> >> On Tue, Sep 4, 2018 at 7:52 PM, Kyotaro HORIGUCHI >> >> Thank you for updating! Here is the review comment for v8 patch. >> >> >> >> + /* >> >> + * This slot still has all required segments. Calculate how many >> >> + * LSN bytes the slot has until it loses restart_lsn. >> >> + */ >> >> + fragbytes = wal_segment_size - (currLSN % wal_segment_size); >> >> + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, >> >> fragbytes, >> >> + wal_segment_size, *restBytes); >> >> >> >> For the calculation of fragbytes, I think we should calculate the >> >> fragment bytes of restartLSN instead. The the formula "restartSeg + >> >> limitSegs - currSeg" means the # of segment between restartLSN and the >> >> limit by the new parameter. I don't think that the summation of it and >> >> fragment bytes of currenLSN is correct. As the following result >> >> (max_slot_wal_keep_size is 128MB) shows, the remain column shows the >> >> actual remains + 16MB (get_bytes function returns the value of >> >> max_slot_wal_keep_size in bytes). >> > >> > Since a oldest segment is removed after the current LSN moves to >> > the next segmen, current LSN naturally determines the fragment >> > bytes. Maybe you're concerning that the number of segments looks >> > too much by one segment. >> > >> > One arguable point of the feature is how max_slot_wal_keep_size >> > works exactly. I assume that even though the name is named as >> > "max_", we actually expect that "at least that bytes are >> > kept". So, for example, with 16MB of segment size and 50MB of >> > max_s_w_k_s, I designed this so that the size of preserved WAL >> > doesn't go below 50MB, actually (rounding up to multples of 16MB >> > of 50MB), and loses the oldest segment when it reaches 64MB + >> > 16MB = 80MB as you saw. >> > >> > # I believe that the difference is not so significant since we >> > # have around hunderd or several hundreds of segments in common >> > # cases. >> > >> > Do you mean that we should define the GUC parameter literally as >> > "we won't have exactly that many bytes of WAL segmetns"? That is, >> > we have at most 48MB preserved WAL records for 50MB of >> > max_s_w_k_s setting. This is the same to how max_wal_size is >> > counted but I don't think max_slot_wal_keep_size will be regarded >> > in the same way. >> >> I might be missing something but what I'm expecting to this feature is >> to restrict the how much WAL we can keep at a maximum for replication >> slots. In other words, the distance between the current LSN and the >> minimum restart_lsn of replication slots doesn't over the value of >> max_slot_wal_keep_size. > > Yes, it's one possible design, the same with "we won't have more > than exactly that many bytes of WAL segmetns" above ("more than" > is added, which I meant). But anyway we cannot keep the limit > strictly since WAL segments are removed only at checkpoint > time. Agreed. It should be something like a soft limit. 
> So If doing so, we can reach the lost state before the > max_slot_wal_keep_size is filled up meanwhile WAL can exceed the > size by a WAL flood. We can define it precisely at most as "wal > segments are preserved at most aorund the value". So I choosed > the definition so that we can tell about this as "we don't > guarantee more than that bytes". Agreed. > > # Uuuu. sorry for possiblly hard-to-read sentence.. > >> It's similar to wal_keep_segments except for >> that this feature affects only replication slots. And > > It defines the *extra* segments to be kept, that is, if we set it > to 2, at least 3 segments are present. If we set > max_slot_wal_keep_size to 32MB (= 2 segs here), we have at most 3 > segments since 32MB range before the current LSN almost always > spans over 3 segments. Doesn't this seemingly in a similar way > with wal_keep_segments Yeah, that's fine with me. The wal_keep_segments works regardless existence of replication slots. If we have replication slots and set both settings we can reserve extra WAL as much as max(wal_keep_segments, max_slot_wal_keep_size). > > If the current LSN is at the very last of a segment and > restart_lsn is catching up to the current LSN, the "remain" is > equal to max_slot_wal_keep_size as the guaranteed size. If very > beginning of a segments, it gets extra 16MB. Agreed. > >> wal_keep_segments cannot restrict WAL that replication slots are >> holding. For example, with 16MB of segment size and 50MB of >> max_slot_wal_keep_size, we can keep at most 50MB WAL for replication >> slots. However, once we consumed more than 50MB WAL while not >> advancing any restart_lsn the required WAL might be lost by the next >> checkpoint, which depends on the min_wal_size. > > I don't get the last phrase. With small min_wal_size, we don't > recycle most of the "removed" segments. If large, we recycle more > of them. It doesn't affect up to where the checkpoint removes WAL > files. But it is right that LSN advance with > max_slot_wal_keep_size bytes immediately leands to breaking a > slot and it is intended behavior. Sorry I was wrong. Please ignore the last sentence. What I want to say is that there is no guarantees that the required WAL is kept once the reserved extra WAL by replication slots exceeds the threshold. > >> On the other hand, if >> we mostly can advance restart_lsn to approximately the current LSN the >> size of preserved WAL for replication slots can go below 50MB. > > Y..eah.. That's right. It is just how this works. But I don't > understand how this is related to the intepretation of the "max" > of the GUC variable. When I wrote this I understood that the following sentence is that we regularly keep at least max_slot_wal_keep_size byte regardless the progress of the minimum restart_lsn, I might have been misunderstanding though. >> > I assume that even though the name is named as >> > "max_", we actually expect that "at least that bytes are >> > kept". Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
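The max() of the two settings mentioned above could be read back from a patched server roughly as follows; max_slot_wal_keep_size exists only with the patch applied and is assumed positive here, and 16MB segments are assumed for the wal_keep_segments term:

SELECT pg_size_pretty(greatest(
         current_setting('wal_keep_segments')::bigint * 16 * 1024 * 1024,
         pg_size_bytes(current_setting('max_slot_wal_keep_size'))
       )) AS effective_extra_wal;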
Hello. Thank you for the comments, Sawada-san, Peter. At Mon, 10 Sep 2018 19:52:24 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180910.195224.22629595.horiguchi.kyotaro@lab.ntt.co.jp> > At Thu, 6 Sep 2018 22:32:21 +0200, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <29bbd79d-696b-509e-578a-0fc38a3b9405@2ndquadrant.com> > Thanks for pointing that. That's a major cause of confusion. Does > the following make sense? > > > Specify the maximum size of WAL files that <link > > linkend="streaming-replication-slots">replication slots</link> > > are allowed to retain in the <filename>pg_wal</filename> > > directory at checkpoint time. If > > <varname>max_slot_wal_keep_size</varname> is zero (the > > default), replication slots retain unlimited size of WAL files. > + If restart_lsn of a replication slot gets behind more than that > + bytes from the current LSN, the standby using the slot may not > + be able to reconnect due to removal of required WAL records. ... > > Also, I don't think 0 is a good value for the default behavior. 0 would > > mean that a slot is not allowed to retain any more WAL than already > > exists anyway. Maybe we don't want to support that directly, but it's a > > valid configuration. So maybe use -1 for infinity. > > In realtion to the reply just sent to Sawada-san, remain of a > slot can be at most 16MB in the 0 case with the default segment > size. So you're right in this sense. Will fix in the coming > version. Thanks. I did the following thinkgs in the new version. - Changed the disable (or infinite) and default value of max_slot_wal_keep_size to -1 from 0. (patch 1, 2. guc.c, xlog.c: GetOldestKeepSegment()) - Fixed documentation for max_slot_wal_keep_size tomention what happnes when WAL exceeds the size, and additional rewrites. (patch 4, catalogs.sgml, config.sgml) - Folded parameter list of GetOldestKeepSegment(). (patch 1, 2. xlog.c) - Provided the plural form of errdetail of checkpoint-time warning. (patch 1, xlog.c: KeepLogSeg()) - Some cosmetic change and small refactor. (patch 1, 2, 3) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From ee8ddfa69b6fb6832307d15374ea5f2446bda85f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. 
--- src/backend/access/transam/xlog.c | 108 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 97 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 85a7b285ec..deda43607d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9505,6 +9507,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keepSegs */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9516,38 +9565,47 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "The mostly affected slot has lost %ld segment.", + "The mostly affected slot has lost %ld segments.", + lost_segs, lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 0625eff219..7edff8aca8 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2538,6 +2538,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 7486d20a34..b4e027c1df 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From b4a555e4ddb1ce770ed1356e3b4da54e4fbeaf12 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be written until the slot loses reserving WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 141 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 +++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 180 insertions(+), 14 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index deda43607d..bad9db51b3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,8 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9507,19 +9508,126 @@ CreateRestartPoint(int flags) return true; } + +/* + * Finds the segment number of the oldest file in XLOG directory. + * + * This function is intended to be used only when we haven't removed a WAL + * segment. Read XLogCtl->lastRemovedSegNo if any. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is already removed. + * 1 means that WAL record at targetLSN is available. + * 2 means that WAL record at targetLSN is available but about to be removed by + * the next checkpoint. 
+ */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + if (oldestSeg != 0) + { + /* oldest segment is just after the last removed segment */ + oldestSeg++; + } + else + { + /* + * We haven't removed a WAL segment since startup. Get the number + * looking WAL files. + */ + static XLogSegNo oldestFileSeg = 0; + + /* Must do it the hard way for the first time */ + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + + oldestSeg = oldestFileSeg; + } + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; XLogSegNo minSlotSeg; + uint64 limitSegs = 0; XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9534,8 +9642,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Apply max_slot_wal_keep_size to keepSegs */ @@ -9547,6 +9653,30 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) keepSegs = wal_keep_segments; + /* + * If requested, return remaining LSN bytes to advance until the slot + * gives up reserving WAL records. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo restartSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, restartSeg, wal_segment_size); + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= restartSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(restartSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } + /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) return 1; @@ -9575,7 +9705,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..5db294f64e 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..d9ed9e8cf2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 860571440a..2c7cdbb66e 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9796,9 +9796,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', 
provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 078129f251..02286cdfe8 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 0453e17be4e04c108f8989455ba069ab14242a17 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 174 ++++++++++++++++++++++++++++++ 1 file changed, 174 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..f5a87b6617 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,174 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 9; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is unconditionally "safe" with the default setting. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is keeping all segments'); + +# The stanby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 48; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|64 MB", 'check that remaining byte is calculated'); + +# Advance WAL again then checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|32 MB", 'remaining byte should be reduced by 32MB'); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Slot gets to 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that some segments are about to removed'); + +# The stanby still can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again, the slot loses some segments. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*The mostly affected slot has lost 5 segments.", + $logstart), + 'check that warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 79751f21b586c570ec41ef9b2aca37f7d707d53a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 56 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 0179deea2e..84a937e1fe 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9879,6 +9879,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them are no longer + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the + slot doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until the + slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bee4afbe4e..edd5419ec6 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3116,6 +3116,29 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited size of WAL files. If restart_lsn + of a replication slot gets behind more than that bytes from the + current LSN, the standby using the slot may no longer be able to + reconnect due to removal of required WAL records. You can see the WAL + availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 8cb77f85ec..04cdccb10d 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3
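For quick reference while reviewing, this is the kind of query the monitoring patch is aimed at. It is only a sketch against the view as extended by patch 2; the wal_status and remain column names follow this patch version and may still change.

  -- requires patches 1 and 2; wal_status and remain are the new columns
  SELECT slot_name,
         active,
         restart_lsn,
         wal_status,                       -- streaming / keeping / lost / unknown
         pg_size_pretty(remain) AS remain  -- WAL bytes the slot can still fall behind
    FROM pg_replication_slots
   ORDER BY remain;

Per the documentation in patch 4, "streaming" means the claimed records are all still available, "keeping" means some of them will be removed by the next checkpoint, and "lost" means some are already gone; the last two states appear only when max_slot_wal_keep_size is non-negative.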
On Thu, Sep 13, 2018 at 6:30 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > Thank you for the comments, Sawada-san, Peter. > > At Mon, 10 Sep 2018 19:52:24 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20180910.195224.22629595.horiguchi.kyotaro@lab.ntt.co.jp> > > At Thu, 6 Sep 2018 22:32:21 +0200, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <29bbd79d-696b-509e-578a-0fc38a3b9405@2ndquadrant.com> > > Thanks for pointing that. That's a major cause of confusion. Does > > the following make sense? > > > > > Specify the maximum size of WAL files that <link > > > linkend="streaming-replication-slots">replication slots</link> > > > are allowed to retain in the <filename>pg_wal</filename> > > > directory at checkpoint time. If > > > <varname>max_slot_wal_keep_size</varname> is zero (the > > > default), replication slots retain unlimited size of WAL files. > > + If restart_lsn of a replication slot gets behind more than that > > + bytes from the current LSN, the standby using the slot may not > > + be able to reconnect due to removal of required WAL records. > ... > > > Also, I don't think 0 is a good value for the default behavior. 0 would > > > mean that a slot is not allowed to retain any more WAL than already > > > exists anyway. Maybe we don't want to support that directly, but it's a > > > valid configuration. So maybe use -1 for infinity. > > > > In realtion to the reply just sent to Sawada-san, remain of a > > slot can be at most 16MB in the 0 case with the default segment > > size. So you're right in this sense. Will fix in the coming > > version. Thanks. > > I did the following thinkgs in the new version. > > - Changed the disable (or infinite) and default value of > max_slot_wal_keep_size to -1 from 0. > (patch 1, 2. guc.c, xlog.c: GetOldestKeepSegment()) > > - Fixed documentation for max_slot_wal_keep_size tomention what > happnes when WAL exceeds the size, and additional rewrites. > (patch 4, catalogs.sgml, config.sgml) > > - Folded parameter list of GetOldestKeepSegment(). > (patch 1, 2. xlog.c) > > - Provided the plural form of errdetail of checkpoint-time > warning. (patch 1, xlog.c: KeepLogSeg()) > > - Some cosmetic change and small refactor. > (patch 1, 2, 3) > Sorry for the late response. The patch still can be applied to the curent HEAD so I reviewed the latest patch. The value of 'remain' and 'wal_status' might not be correct. Although 'wal_stats' shows 'lost' but we can get changes from the slot. I've tested it with the following steps. =# alter system set max_slot_wal_keep_size to '64MB'; -- while wal_keep_segments is 0 =# select pg_reload_conf(); =# select slot_name, wal_status, remain, pg_size_pretty(remain) as remain_pretty from pg_replication_slots ; slot_name | wal_status | remain | remain_pretty -----------+------------+----------+--------------- 1 | streaming | 83885648 | 80 MB (1 row) ** consume 80MB WAL, and do CHECKPOINT ** =# select slot_name, wal_status, remain, pg_size_pretty(remain) as remain_pretty from pg_replication_slots ; slot_name | wal_status | remain | remain_pretty -----------+------------+--------+--------------- 1 | lost | 0 | 0 bytes (1 row) =# select count(*) from pg_logical_slot_get_changes('1', NULL, NULL); count ------- 15 (1 row) ----- I got the following result with setting of wal_keep_segments > max_slot_keep_size. The 'wal_status' shows 'streaming' although the 'remain' is 0. 
=# select slot_name, wal_status, remain from pg_replication_slots limit 1; slot_name | wal_status | remain -----------+------------+-------- 1 | streaming | 0 (1 row) + XLByteToSeg(targetLSN, restartSeg, wal_segment_size); + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= restartSeg + limitSegs) + { You use limitSegs here but shouldn't we use keepSeg instead? Actually I've commented this point for v6 patch before[1], and this had been fixed in the v7 patch. However you're using limitSegs again from v8 patch again. I might be missing something though. Changed the status to 'Waiting on Author'. [1] https://www.postgresql.org/message-id/CAD21AoD0rChq7wQE%3D_o95quopcQGjcVG9omwdH07nT5cm81hzg%40mail.gmail.com [2] https://www.postgresql.org/message-id/20180904.195250.144186960.horiguchi.kyotaro%40lab.ntt.co.jp Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATIONNTT Open Source Software Center
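The second observation, wal_keep_segments larger than max_slot_wal_keep_size, can be reproduced with settings along these lines. This is only a sketch; the exact remain value depends on the segment size and the current insert position.

  -- wal_keep_segments retains more WAL than max_slot_wal_keep_size
  -- allows the slot (16MB segments assumed)
  ALTER SYSTEM SET max_slot_wal_keep_size = '64MB';
  ALTER SYSTEM SET wal_keep_segments = 10;   -- 160MB
  SELECT pg_reload_conf();

  SELECT slot_name, wal_status, remain,
         pg_size_pretty(remain) AS remain_pretty
    FROM pg_replication_slots;

With the v8 code this reports 'streaming' while remain stays 0, which is the inconsistency pointed out above.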
Hello.

At Mon, 22 Oct 2018 19:35:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBdfoLSgujPZ_TpnH5zdQz0jg-Y8OXtZ=TCO787Sey-=w@mail.gmail.com>
> On Thu, Sep 13, 2018 at 6:30 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Sorry for the late response. The patch still can be applied to the

It's alright. Thanks.

> curent HEAD so I reviewed the latest patch.
> The value of 'remain' and 'wal_status' might not be correct. Although
> 'wal_stats' shows 'lost' but we can get changes from the slot. I've
> tested it with the following steps.
>
> =# alter system set max_slot_wal_keep_size to '64MB'; -- while
> wal_keep_segments is 0
> =# select pg_reload_conf();
> =# select slot_name, wal_status, remain, pg_size_pretty(remain) as
> remain_pretty from pg_replication_slots ;
>  slot_name | wal_status |  remain  | remain_pretty
> -----------+------------+----------+---------------
>  1         | streaming  | 83885648 | 80 MB
> (1 row)
>
> ** consume 80MB WAL, and do CHECKPOINT **
>
> =# select slot_name, wal_status, remain, pg_size_pretty(remain) as
> remain_pretty from pg_replication_slots ;
>  slot_name | wal_status | remain | remain_pretty
> -----------+------------+--------+---------------
>  1         | lost       |      0 | 0 bytes
> (1 row)
> =# select count(*) from pg_logical_slot_get_changes('1', NULL, NULL);
>  count
> -------
>     15
> (1 row)

Mmm. The function looks into the segment that was already open before the segment was lost in the file system (precisely, before its directory entry was deleted), so losing just one segment doesn't matter. Please try losing one more segment.

=# select * from pg_logical_slot_get_changes('s1', NULL, NULL);
ERROR:  unexpected pageaddr 0/29000000 in log segment 000000010000000000000023, offset 0

Or, instead, just restarting lets the opened segment be forgotten.

...
> 1 | lost | 0 | 0 bytes

(just restart)

> =# select * from pg_logical_slot_get_changes('s1', NULL, NULL);
> ERROR:  requested WAL segment pg_wal/000000010000000000000029 has already been removed

I'm not sure this counts as a bug...

> -----
> I got the following result with setting of wal_keep_segments >
> max_slot_keep_size. The 'wal_status' shows 'streaming' although the
> 'remain' is 0.
>
> =# select slot_name, wal_status, remain from pg_replication_slots limit 1;
>  slot_name | wal_status | remain
> -----------+------------+--------
>  1         | streaming  |      0
> (1 row)
>
> +        XLByteToSeg(targetLSN, restartSeg, wal_segment_size);
> +        if (max_slot_wal_keep_size_mb >= 0 && currSeg <=
> restartSeg + limitSegs)
> +        {
>
> You use limitSegs here but shouldn't we use keepSeg instead? Actually
> I've commented this point for v6 patch before[1], and this had been
> fixed in the v7 patch. However you're using limitSegs again from v8
> patch again. I might be missing something though.

No. keepSegs is the number of segments *actually* kept around, so reverting to keepSegs would just resurrect the bug you pointed out upthread. What is needed here is the maximum number of segments that can be kept, so raising limitSegs by wal_keep_segments fixes that. Sorry for the sequence of silly bugs. A TAP test for the case has been added.

> Changed the status to 'Waiting on Author'.
>
> [1] https://www.postgresql.org/message-id/CAD21AoD0rChq7wQE%3D_o95quopcQGjcVG9omwdH07nT5cm81hzg%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/20180904.195250.144186960.horiguchi.kyotaro%40lab.ntt.co.jp

regards.
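To put numbers on this, the added TAP test uses max_slot_wal_keep_size = 48MB (three 16MB segments) and then raises wal_keep_segments to 6 (96MB), so the larger of the two becomes the effective keep limit. A sketch of the check, with the values the test expects:

  -- slot 'rep1' and the 48MB limit are already in place in the test
  ALTER SYSTEM SET wal_keep_segments TO 6;
  SELECT pg_reload_conf();

  SELECT restart_lsn, wal_status, pg_size_pretty(remain) AS remain
    FROM pg_replication_slots
   WHERE slot_name = 'rep1';
  -- expected: streaming | 80 MB  (it was 32 MB with the 48MB limit alone)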
-- Kyotaro Horiguchi NTT Open Source Software Center From ecb7a8dd6a59ededd24578920bc1b4fbaf481b10 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/4] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 108 ++++++++++++++++++++------ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 4 files changed, 97 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7375a78ffc..814cd01f79 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -867,6 +868,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9513,6 +9515,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keepSegs */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. 
@@ -9524,38 +9573,47 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "The mostly affected slot has lost %ld segment.", + "The mostly affected slot has lost %ld segments.", + lost_segs, lost_segs))); + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 2317e8be6b..b26c9abec0 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2538,6 +2538,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 4e61bc6521..0fe00d99a9 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -238,6 +238,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 421ba6d775..12cd0d1d10 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; 
extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; -- 2.16.3 From 52ab262b1a8fc6c95e2113e9a330999b273c4ef8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/4] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns shows whether the slot can be reconnected or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be written until the slot loses reserving WAL records. --- contrib/test_decoding/expected/ddl.out | 4 +- src/backend/access/transam/xlog.c | 154 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 32 ++++++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 7 files changed, 190 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..6b6a2df213 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -706,7 +706,7 @@ SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- + slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn | wal_status | remain +-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+-------- (0 rows) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 814cd01f79..e9327d0e76 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -868,7 +868,8 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9515,19 +9516,126 @@ CreateRestartPoint(int flags) return true; } + +/* + * Finds the segment number of the oldest file in XLOG directory. + * + * This function is intended to be used only when we haven't removed a WAL + * segment. Read XLogCtl->lastRemovedSegNo if any. 
+ */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* get minimum segment ignoring timeline ID */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Check if the record on the given targetLSN is present in XLOG files. + * + * Returns three kind of values. + * 0 means that WAL record at targetLSN is already removed. + * 1 means that WAL record at targetLSN is available. + * 2 means that WAL record at targetLSN is available but about to be removed by + * the next checkpoint. + */ +int +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + if (oldestSeg != 0) + { + /* oldest segment is just after the last removed segment */ + oldestSeg++; + } + else + { + /* + * We haven't removed a WAL segment since startup. Get the number + * looking WAL files. + */ + static XLogSegNo oldestFileSeg = 0; + + /* Must do it the hard way for the first time */ + if (oldestFileSeg == 0) + oldestFileSeg = GetOldestXLogFileSegNo(); + + oldestSeg = oldestFileSeg; + } + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return 1; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return 2; + + /* targetSeg has gone */ + return 0; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. 
*/ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; XLogSegNo minSlotSeg; + uint64 limitSegs = 0; XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9542,8 +9650,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Apply max_slot_wal_keep_size to keepSegs */ @@ -9551,9 +9657,40 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* also, limitSegs should be raised if wal_keep_segments is larger */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, return remaining LSN bytes to advance until the slot + * gives up reserving WAL records. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= targetSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9583,7 +9720,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index a03b005f73..1d680e7ed8 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -797,7 +797,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..d9ed9e8cf2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,36 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + char *status; + + switch (IsLsnStillAvaiable(restart_lsn, &remaining_bytes)) + { + case 0: + status = "lost"; + break; + case 1: + status = "streaming"; + break; + case 2: + status = "keeping"; + break; + default: + status = "unknown"; + break; + } + + values[i++] = CStringGetTextDatum(status); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 12cd0d1d10..ad9d1dec29 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -269,6 +269,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern int IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index cff58ed2d8..2253c780ba 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9599,9 +9599,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', 
provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 735dd37acf..13e2b51376 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 5a131ba043da3e60eeb4837409b68fd74b08e33e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 3/4] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 184 ++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..a7285d94c0 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,184 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 10; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 48MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); +$node_standby->append_conf('recovery.conf', qq( +primary_slot_name = 'rep1' +)); +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be secured. +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check initial state of standby'); + +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is unconditionally "safe" with the default setting. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is keeping all segments'); + +# The stanby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 48; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|64 MB", 'check that remaining byte is calculated'); + +# Advance WAL again then checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|32 MB", 'remaining byte should be reduced by 32MB'); + + +# wal_keep_segments can override +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|80 MB", 'check that wal_keep_segments override'); + +# restore wal_keep_segments (no test) +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Slot gets to 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that some segments are about to be removed'); + +# The stanby still can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again, the slot loses some segments. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +print "### $logstart\n"; +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*The mostly affected slot has lost 5 segments.", + $logstart), + 'check that warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (a int); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From fc3a69b6fa82c9f6ecdc11b0885d34a39f5e5884 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 4/4] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 56 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 9edba96fab..fc4cbc9239 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9881,6 +9881,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by the + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them are no longer + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the + slot doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until the + slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7554cba3f9..f3e504862c 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3117,6 +3117,29 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited size of WAL files. If restart_lsn + of a replication slot gets behind more than that bytes from the + current LSN, the standby using the slot may no longer be able to + reconnect due to removal of required WAL records. You can see the WAL + availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index ebcb3daaed..15a98340a6 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -927,9 +927,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allotted + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3
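The TAP test drives the warning path by repeatedly switching WAL segments. Done by hand, one round of its advance_wal() helper and the follow-up check look roughly like this (slot 'rep1' and 16MB segments as in the test):

  -- one iteration of advance_wal(); repeat about ten times
  CREATE TABLE t (a int);
  DROP TABLE t;
  SELECT pg_switch_wal();

  -- then let the checkpoint make the removal decision and check the slot
  CHECKPOINT;
  SELECT restart_lsn, wal_status, pg_size_pretty(remain) AS remain
    FROM pg_replication_slots
   WHERE slot_name = 'rep1';

Once enough segments have been consumed, the checkpoint should emit the WARNING added in patch 1 ("some replication slots have lost required WAL segments") and wal_status should turn to 'lost'.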
At Thu, 25 Oct 2018 21:55:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181025.215518.189844649.horiguchi.kyotaro@lab.ntt.co.jp> > > =# alter system set max_slot_wal_keep_size to '64MB'; -- while > > wal_keep_segments is 0 > > =# select pg_reload_conf(); > > =# select slot_name, wal_status, remain, pg_size_pretty(remain) as > > remain_pretty from pg_replication_slots ; > > slot_name | wal_status | remain | remain_pretty > > -----------+------------+----------+--------------- > > 1 | streaming | 83885648 | 80 MB > > (1 row) > > > > ** consume 80MB WAL, and do CHECKPOINT ** > > > > =# select slot_name, wal_status, remain, pg_size_pretty(remain) as > > remain_pretty from pg_replication_slots ; > > slot_name | wal_status | remain | remain_pretty > > -----------+------------+--------+--------------- > > 1 | lost | 0 | 0 bytes > > (1 row) > > =# select count(*) from pg_logical_slot_get_changes('1', NULL, NULL); > > count > > ------- > > 15 > > (1 row) > > Mmm. The function looks into the segment already open before > losing the segment in the file system (precisely, its direcotory > entry has been deleted). So just 1 lost segment doesn't > matter. Please try losing more one segment. I considered this a bit more and the attached patch let XLogReadRecord() check for segment removal every time it is called and emits the following error in the case. > =# select * from pg_logical_slot_get_changes('s1', NULL, NULL); > ERROR: WAL record at 0/870001B0 no longer available > DETAIL: The segment for the record has been removed. The reason for doing that in the fucntion is it can happen also for physical replication when walsender is active but far behind. The removed(renamed)-but-still-open segment may be recycled and can be overwritten while reading, and it will be caught by page/record validation. It is substantially lost in that sense. I don't think the strictness is useful for anything.. Thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 775f6366d78ac6818023cc158e37c70119246e19 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 5/5] Check removal of in-read segment file. Checkpoint can remove or recycle a segment file while it is being read by ReadRecord. This patch checks for the case and error out immedaitely. Reading recycled file is basically safe and inconsistenty caused by overwrites as new segment will be caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. --- src/backend/access/transam/xlogreader.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 0768ca7822..a6c97cf260 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -217,6 +217,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) { XLogRecord *record; XLogRecPtr targetPagePtr; + XLogSegNo targetSegNo; bool randAccess; uint32 len, total_len; @@ -270,6 +271,18 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; + /* + * checkpoint can remove the segment currently looking for. make sure the + * current segment is still exists. We check this only once per record. 
+ */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + ereport(ERROR, + (errcode(ERRCODE_NO_DATA), + errmsg("WAL record at %X/%X no longer available", + (uint32)(RecPtr >> 32), (uint32) RecPtr), + errdetail("The segment for the record has been removed."))); + /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
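To lose more than one segment as suggested above, something along these lines should work (just a sketch mirroring the advance_wal helper in the TAP test; wal_filler is an arbitrary throwaway table name):

    DO $$
    BEGIN
      FOR i IN 1..3 LOOP
        -- generate a little WAL, then force a segment switch
        EXECUTE 'CREATE TABLE wal_filler ()';
        EXECUTE 'DROP TABLE wal_filler';
        PERFORM pg_switch_wal();
      END LOOP;
    END
    $$;
    CHECKPOINT;

The final CHECKPOINT is what actually removes the segments that are no longer protected by the slot.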
On Thu, Oct 25, 2018 at 9:56 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > At Mon, 22 Oct 2018 19:35:04 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBdfoLSgujPZ_TpnH5zdQz0jg-Y8OXtZ=TCO787Sey-=w@mail.gmail.com> > > On Thu, Sep 13, 2018 at 6:30 PM Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Sorry for the late response. The patch still can be applied to the > > It's alright. Thanks. > > > curent HEAD so I reviewed the latest patch. > > The value of 'remain' and 'wal_status' might not be correct. Although > > 'wal_stats' shows 'lost' but we can get changes from the slot. I've > > tested it with the following steps. > > > > =# alter system set max_slot_wal_keep_size to '64MB'; -- while > > wal_keep_segments is 0 > > =# select pg_reload_conf(); > > =# select slot_name, wal_status, remain, pg_size_pretty(remain) as > > remain_pretty from pg_replication_slots ; > > slot_name | wal_status | remain | remain_pretty > > -----------+------------+----------+--------------- > > 1 | streaming | 83885648 | 80 MB > > (1 row) > > > > ** consume 80MB WAL, and do CHECKPOINT ** > > > > =# select slot_name, wal_status, remain, pg_size_pretty(remain) as > > remain_pretty from pg_replication_slots ; > > slot_name | wal_status | remain | remain_pretty > > -----------+------------+--------+--------------- > > 1 | lost | 0 | 0 bytes > > (1 row) > > =# select count(*) from pg_logical_slot_get_changes('1', NULL, NULL); > > count > > ------- > > 15 > > (1 row) > > Mmm. The function looks into the segment already open before > losing the segment in the file system (precisely, its direcotory > entry has been deleted). So just 1 lost segment doesn't > matter. Please try losing more one segment. > > =# select * from pg_logical_slot_get_changes('s1', NULL, NULL); > ERROR: unexpected pageaddr 0/29000000 in log segment 000000010000000000000023, offset 0 > > Or, instead just restarting will let the opened segment forgotten. > > ... > > 1 | lost | 0 | 0 bytes > (just restart) > > =# select * from pg_logical_slot_get_changes('s1', NULL, NULL); > > ERROR: requested WAL segment pg_wal/000000010000000000000029 has already been removed > > I'm not sure this is counted to be a bug... > > > > ----- > > I got the following result with setting of wal_keep_segments > > > max_slot_keep_size. The 'wal_status' shows 'streaming' although the > > 'remain' is 0. > > > > =# select slot_name, wal_status, remain from pg_replication_slots limit 1; > > slot_name | wal_status | remain > > -----------+------------+-------- > > 1 | streaming | 0 > > (1 row) > > > > + XLByteToSeg(targetLSN, restartSeg, wal_segment_size); > > + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= > > restartSeg + limitSegs) > > + { > > > > You use limitSegs here but shouldn't we use keepSeg instead? Actually > > I've commented this point for v6 patch before[1], and this had been > > fixed in the v7 patch. However you're using limitSegs again from v8 > > patch again. I might be missing something though. > > No. keepSegs is the number of segments *actually* kept around. So > reverting it to keptSegs just resurrects the bug you pointed > upthread. What needed here is at most how many segments will be > kept. So raising limitSegs by wal_keep_segments fixes that. > Sorry for the sequence of silly bugs. TAP test for the case > added. > Thank you for updating the patch. The 0001 - 0004 patches work fine and looks good to me except for the following comment for the code. 
+ /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; I think we can use an assertion for the second term of the condition instead of just checking it. If the function gets minSlotSeg > currSeg, the return value will be incorrect; in other words, the function requires that condition to always hold. Thoughts? Since this comment can be deferred to committers I've marked this patch as "Ready for Committer". For the 0005 patch, the issue I reported is relatively rare and not critical, so we can discuss it after this patch gets committed. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Oct 26, 2018 at 11:26:36AM +0900, Kyotaro HORIGUCHI wrote: > The reason for doing that in the fucntion is it can happen also > for physical replication when walsender is active but far > behind. The removed(renamed)-but-still-open segment may be > recycled and can be overwritten while reading, and it will be > caught by page/record validation. It is substantially lost in > that sense. I don't think the strictness is useful for anything.. I was just coming by to look a bit at the patch series, and bumped into that: > + /* > + * checkpoint can remove the segment currently looking for. make sure the > + * current segment is still exists. We check this only once per record. > + */ > + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); > + if (targetSegNo <= XLogGetLastRemovedSegno()) > + ereport(ERROR, > + (errcode(ERRCODE_NO_DATA), > + errmsg("WAL record at %X/%X no longer available", > + (uint32)(RecPtr >> 32), (uint32) RecPtr), > + errdetail("The segment for the record has been removed."))); > + ereport should not be called within xlogreader.c as a base rule: * This file is compiled as both front-end and backend code, so it * may not use ereport, server-defined static variables, etc. -- Michael
On Mon, Nov 19, 2018 at 01:39:58PM +0900, Michael Paquier wrote: > I was just coming by to look at bit at the patch series, and bumped > into that: So I have been looking at the last patch series 0001-0004 posted on this thread, and coming from here: https://postgr.es/m/20181025.215518.189844649.horiguchi.kyotaro@lab.ntt.co.jp /* check that the slot is gone */ SELECT * FROM pg_replication_slots It could be an idea to switch to the expanded mode here, not that it matters much still.. +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) You mean Available here, not Avaiable. This function is only used when scanning for slot information with pg_replication_slots, so wouldn't it be better to just return the status string in this case? Not sure I see the point of the "remain" field, which can be found with a simple calculation using the current insertion LSN, the segment size and the amount of WAL that the slot is retaining. It may be interesting to document a query to do that though. GetOldestXLogFileSegNo() has race conditions if WAL recycling runs in parallel, no? How is it safe to scan pg_wal on a process querying pg_replication_slots while another process may manipulate its contents (aka the checkpointer or just the startup process with an end-of-recovery checkpoint.). This routine relies on unsafe assumptions as this is not concurrent-safe. You can avoid problems by making sure instead that lastRemovedSegNo is initialized correctly at startup, which would be normally one segment older than what's in pg_wal, which feels a bit hacky to rely on to track the oldest segment. It seems to me that GetOldestXLogFileSegNo() should also check for segments matching the current timeline, no? + if (prev_lost_segs != lost_segs) + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "The mostly affected slot has lost %ld segment.", + "The mostly affected slot has lost %ld segments.", + lost_segs, lost_segs))); This can become very noisy with the time, and it would be actually useful to know which replication slot is impacted by that. + slot doesn't have valid restart_lsn, this field Missing a determinant here, and restart_lsn should have a <literal> markup. + many WAL segments that they fill up the space allotted s/allotted/allocated/. + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the + slot doesn't have valid restart_lsn, this field + is <literal>unknown</literal>. I am a bit confused by this statement. The last two states are "lost" and "keeping", but shouldn't "keeping" be the state showing up by default as it means that all WAL segments are kept around. +# Advance WAL by ten segments (= 160MB) on master +advance_wal($node_master, 10); +$node_master->safe_psql('postgres', "CHECKPOINT;"); This makes the tests very costly, which is something we should avoid as much as possible. One trick which could be used here, on top of reducing the number of segment switches, is to use initdb --wal-segsize=1. -- Michael
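A rough version of such a query, as a sketch only (it ignores wal_keep_segments and segment-boundary rounding, and assumes max_slot_wal_keep_size is set to a positive value, so it merely approximates what the patch's "remain" column reports):

    SELECT slot_name,
           pg_size_bytes(current_setting('max_slot_wal_keep_size'))
             - pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn) AS approx_remain
      FROM pg_replication_slots
     WHERE restart_lsn IS NOT NULL;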
Thank you for piking this and sorry being late. At Mon, 19 Nov 2018 13:39:58 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181119043958.GE4400@paquier.xyz> > ereport should not be called within xlogreader.c as a base rule: Ouch! I forgot that. Fixed to use report_invalid_record slightly changing the message. The code is not required (or cannot be used) on frontend so #ifndef FRONTENDed the code. At Tue, 20 Nov 2018 14:07:44 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181120050744.GJ4400@paquier.xyz> > On Mon, Nov 19, 2018 at 01:39:58PM +0900, Michael Paquier wrote: > > I was just coming by to look at bit at the patch series, and bumped > > into that: > > So I have been looking at the last patch series 0001-0004 posted on this > thread, and coming from here: > https://postgr.es/m/20181025.215518.189844649.horiguchi.kyotaro@lab.ntt.co.jp > > /* check that the slot is gone */ > SELECT * FROM pg_replication_slots > It could be an idea to switch to the expanded mode here, not that it > matters much still.. No problem doing that. Done. TAP test complains that it still uses recovery.conf. Fixed. On the way doing that I added parameter primary_slot_name to init_from_backup in PostgresNode.pm > +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) > You mean Available here, not Avaiable. This function is only used when > scanning for slot information with pg_replication_slots, so wouldn't it > be better to just return the status string in this case? Mmm. Sure. Auto-completion hid it from my eyes. Fixed the name. The fix sounds reasonable. The function was created as returning boolean and the name doen't fit the current function. I renamed the name to GetLsnAvailability() that returns a string. > Not sure I see the point of the "remain" field, which can be found with > a simple calculation using the current insertion LSN, the segment size > and the amount of WAL that the slot is retaining. It may be interesting > to document a query to do that though. It's not that simple. wal_segment_size, max_slot_wal_keep_size, wal_keep_segments, max_slot_wal_keep_size and the current LSN are invoved in the calculation which including several conditional branches, maybe as you see upthread. We could show "the largest current LSN until WAL is lost" but the "current LSN" is not shown there. So it is showing the "remain". > GetOldestXLogFileSegNo() has race conditions if WAL recycling runs in > parallel, no? How is it safe to scan pg_wal on a process querying > pg_replication_slots while another process may manipulate its contents > (aka the checkpointer or just the startup process with an > end-of-recovery checkpoint.). This routine relies on unsafe > assumptions as this is not concurrent-safe. You can avoid problems by > making sure instead that lastRemovedSegNo is initialized correctly at > startup, which would be normally one segment older than what's in > pg_wal, which feels a bit hacky to rely on to track the oldest segment. Concurrent recycling makes the function's result vary between the segment numbers before and after it. It is unstable but doesn't matter so much. The reason for the timing is to avoid extra startup time by a scan over pg_wal that is unncecessary in most cases. Anyway the attached patch initializes lastRemovedSegNo in StartupXLOG(). > It seems to me that GetOldestXLogFileSegNo() should also check for > segments matching the current timeline, no? RemoveOldXlogFiles() ignores timeline and the function is made to behave the same way (in different manner). 
I added a comment for the behavior in the function. > + if (prev_lost_segs != lost_segs) > + ereport(WARNING, > + (errmsg ("some replication slots have lost > required WAL segments"), > + errdetail_plural( > + "The mostly affected slot has lost %ld > segment.", > + "The mostly affected slot has lost %ld > segments.", > + lost_segs, lost_segs))); > This can become very noisy with the time, and it would be actually > useful to know which replication slot is impacted by that. One message per one segment doen't seem so noisy. The reason for not showing slot identifier individually is just to avoid complexity comes from involving slot details. DBAs will see the details in pg_stat_replication. Anyway I did that in the attached patch. ReplicationSlotsBehind returns the list of the slot names that behind specified LSN. With this patch the messages looks as the follows: WARNING: some replication slots have lost required WAL segments DETAIL: Slot s1 lost 8 segment(s). WARNING: some replication slots have lost required WAL segments DETAIL: Slots s1, s2, s3 lost at most 9 segment(s). > + slot doesn't have valid restart_lsn, this field > Missing a determinant here, and restart_lsn should have a <literal> > markup. structfield? Reworded as below: | non-negative. If <structfield>restart_lsn</structfield> is NULL, this | field is <literal>unknown</literal>. I changed "the slot" with "this slot" in the two added fields (wal_status, remain). > + many WAL segments that they fill up the space allotted > s/allotted/allocated/. Fixed. > + available. The last two states are seen only when > + <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the > + slot doesn't have valid restart_lsn, this field > + is <literal>unknown</literal>. > I am a bit confused by this statement. The last two states are "lost" > and "keeping", but shouldn't "keeping" be the state showing up by > default as it means that all WAL segments are kept around. It's "streaming". I didn't came up with nice words to distinguish the two states. I'm not sure "keep around" exactly means but "keeping" here means rather "just not removed yet". The states could be reworded as the follows: streaming: kept/keeping/(secure, in the first version) keeping : mortal/about to be removed lost/unkown : (lost/unknown) Do you have any better wording? > +# Advance WAL by ten segments (= 160MB) on master > +advance_wal($node_master, 10); > +$node_master->safe_psql('postgres', "CHECKPOINT;"); > This makes the tests very costly, which is something we should avoid as > much as possible. One trick which could be used here, on top of > reducing the number of segment switches, is to use initdb > --wal-segsize=1. That sounds nice. Done. In the new version the number of segments can be reduced and a new test item for the initial unkonwn state as the first item. Please find the attached new version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From c35115eab0148e44b59eb974821de28684899cd6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/6] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. 
--- src/backend/access/transam/xlog.c | 127 +++++++++++++++++++++----- src/backend/replication/slot.c | 57 ++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 174 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c80b14ed97..2a4cec1adf 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -101,6 +101,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -873,6 +874,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9319,6 +9321,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keepSegs */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9330,38 +9379,66 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. + * We don't need show anything here if no affected slot + * remains. + */ + if (slot_names) + { + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 1f2e7139a7..1805e23171 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,63 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns the list of replication slots restart_lsn of whch are behind + * specified LSN. Returs palloc'ed character array stuffed with slot names + * delimited by the givein separator. Returns NULL if no slot matches. If + * pnslots is given, the number of the returned slots is returned there. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + initStringInfo(&retstr); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (s->in_use && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * slot names consist only with lower-case letters. we don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL if the result is an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 6fe1939881..438ff723d5 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2593,6 +2593,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 1fa02d2c93..7b2e07bea1 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -279,6 +279,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f3a7ba4d42..2cf9c9bc98 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 7964ae254f..69e4fccb5e 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.16.3 From 131ae101f3bc78d86c0629bef9f653a2f3b0bb93 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/6] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns show whether the slot is reconnectable or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be advance until the slot loses reserving WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 150 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 16 +++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 172 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..c5f52d6ee8 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -705,8 +705,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index c4b10a4cf9..5040d5e85e 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -374,4 +374,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2a4cec1adf..4a5ab3be40 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -874,7 +874,9 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestXLogFileSegNo(void); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -6654,6 +6656,12 @@ StartupXLOG(void) */ StartupReplicationOrigin(); + /* + * Initialize lastRemovedSegNo looking pg_wal directory. The minimum + * segment number is 1 so no wrap-around can happen. + */ + XLogCtl->lastRemovedSegNo = GetOldestXLogFileSegNo() - 1; + /* * Initialize unlogged LSN. On a clean shutdown, it's restored from the * control file. On recovery, all unlogged relations are blown away, so @@ -9321,19 +9329,115 @@ CreateRestartPoint(int flags) return true; } + +/* + * Finds the segment number of the oldest file in XLOG directory. + * + * This function is intended to be used for initialization of + * XLogCtl->lastRemovedSegNo. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * get minimum segment ignoring timeline ID. 
Since RemoveOldXlog + * works ignoring timeline ID, this function works the same way. + */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Returns availability status of the record at given targetLSN + * + * Returns three kinds of value. + * "streaming" when the WAL record at targetLSN is available. + * "keeping" when still available but about to be removed by the next + * checkpoint. + * "lost" when the WAL record at targetLSN is already removed. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. + */ +char * +GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return "streaming"; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return "keeping"; + + /* targetSeg has gone */ + return "lost"; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; XLogSegNo minSlotSeg; + uint64 limitSegs = 0; XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9348,8 +9452,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Apply max_slot_wal_keep_size to keepSegs */ @@ -9357,9 +9459,40 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* also, limitSegs should be raised if wal_keep_segments is larger */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, return remaining LSN bytes to advance until the slot + * gives up reserving WAL records. 
+ */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= targetSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9389,7 +9522,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. */ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5253837b54..9ed00c7a7b 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -798,7 +798,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 8782bad4a2..a4a028f4d7 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + + values[i++] = CStringGetTextDatum( + GetLsnAvailability(restart_lsn, &remaining_bytes)); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 2cf9c9bc98..4d4d8101f6 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -302,6 +302,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index d0a571ef95..5a587e9685 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9655,9 +9655,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index e384cd2279..956c3c9525 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 845986891fbda7dac41fea9eae76666212a362c4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 8a2c6fc122..daca2e0085 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -672,11 +672,11 @@ sub init_from_backup chmod(0700, $data_path); # Base configuration for this node - $self->append_conf( - 'postgresql.conf', - qq( -port = $port -)); + $self->append_conf('postgresql.conf', qq(port = $port)); + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.16.3 From 5714dcde4d6ca7d75cd0aae27fe1e5fc66d03f5d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 4/6] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 185 ++++++++++++++++++++++++++++++ 1 file changed, 185 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..8d1d3a2275 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,185 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 11; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 3MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state should be known before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "|unknown|0", 'non-reserved slot shows unknown'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be streaming state. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check initial state of standby'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is unconditionally "safe" with the default setting. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is keeping all segments'); + +# The stanby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 3; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaing bytes should be almost just +# (max_slot_wal_keep_size + 1) times as large as the segment size. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|4096 kB", 'check that remaining bytes is calculated'); + +# Advance WAL again then checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|2048 kB", 'remaining byte should be reduced by 2MB'); + + +# wal_keep_segments can override +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that wal_keep_segments overrides'); + +# restore wal_keep_segments (no test) +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Slot gets to 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that some segments are about to be removed'); + +# The stanby still can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again, the slot loses some segments. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*Slot rep1 lost 2 segment\\(s\\)\\.", + $logstart), + 'check that warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 6d456a2b8eeecd34b630816ca3388c7e1d8d68af Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 56 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index af4d0625ea..7ec8764ce5 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9825,6 +9825,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them are no longer + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until + this slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4a7121a51f..3d034ac0d1 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3531,6 +3531,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited size of WAL files. If restart_lsn + of a replication slot gets behind more than that bytes from the + current LSN, the standby using the slot may no longer be able to + reconnect due to removal of required WAL records. You can see the WAL + availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index d8fd195da0..e30eaaeebe 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3 From 44bdb5b92b651c524e6f901be4eaa0184714f04d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 6/6] Check removal of in-reading segment file. Checkpoint can remove or recycle a segment file while it is being read by ReadRecord during logical decoding. This patch checks for the case and error out immedaitely. Reading recycled file is basically safe and inconsistency caused by overwrites as new segment will be caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index c5e019bf77..117710c55b 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -26,6 +26,7 @@ #include "replication/origin.h" #ifndef FRONTEND +#include "access/xlog.h" #include "utils/memutils.h" #endif @@ -224,7 +225,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -270,6 +273,21 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * checkpoint can remove the segment currently looking for. make sure the + * current segment still exists. we check this once per page. This cannot + * happen on frontend. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
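As an illustration of how the three states are meant to be consumed by monitoring, a query against the patched view could look like this (a sketch only):

    SELECT slot_name, active, restart_lsn, wal_status, pg_size_pretty(remain) AS remain
      FROM pg_replication_slots
     WHERE wal_status IN ('keeping', 'lost');

Slots reported as keeping are about to lose required segments at the next checkpoint, and lost ones already have, matching the WARNING shown above.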
At Thu, 20 Dec 2018 16:24:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181220.162438.121484007.horiguchi.kyotaro@lab.ntt.co.jp> > Thank you for piking this and sorry being late. > > At Mon, 19 Nov 2018 13:39:58 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181119043958.GE4400@paquier.xyz> > > ereport should not be called within xlogreader.c as a base rule: > > Ouch! I forgot that. Fixed to use report_invalid_record slightly > changing the message. The code is not required (or cannot be > used) on frontend so #ifndef FRONTENDed the code. > > At Tue, 20 Nov 2018 14:07:44 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181120050744.GJ4400@paquier.xyz> > > On Mon, Nov 19, 2018 at 01:39:58PM +0900, Michael Paquier wrote: > > > I was just coming by to look at bit at the patch series, and bumped > > > into that: > > > > So I have been looking at the last patch series 0001-0004 posted on this > > thread, and coming from here: > > https://postgr.es/m/20181025.215518.189844649.horiguchi.kyotaro@lab.ntt.co.jp > > > > /* check that the slot is gone */ > > SELECT * FROM pg_replication_slots > > It could be an idea to switch to the expanded mode here, not that it > > matters much still.. > > No problem doing that. Done. > > TAP test complains that it still uses recovery.conf. Fixed. On > the way doing that I added parameter primary_slot_name to > init_from_backup in PostgresNode.pm > > > +IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes) > > You mean Available here, not Avaiable. This function is only used when > > scanning for slot information with pg_replication_slots, so wouldn't it > > be better to just return the status string in this case? > > Mmm. Sure. Auto-completion hid it from my eyes. Fixed the name. > The fix sounds reasonable. The function was created as returning > boolean and the name doen't fit the current function. I renamed > the name to GetLsnAvailability() that returns a string. > > > Not sure I see the point of the "remain" field, which can be found with > > a simple calculation using the current insertion LSN, the segment size > > and the amount of WAL that the slot is retaining. It may be interesting > > to document a query to do that though. > > It's not that simple. wal_segment_size, max_slot_wal_keep_size, > wal_keep_segments, max_slot_wal_keep_size and the current LSN are > invoved in the calculation which including several conditional > branches, maybe as you see upthread. We could show "the largest > current LSN until WAL is lost" but the "current LSN" is not shown > there. So it is showing the "remain". > > > GetOldestXLogFileSegNo() has race conditions if WAL recycling runs in > > parallel, no? How is it safe to scan pg_wal on a process querying > > pg_replication_slots while another process may manipulate its contents > > (aka the checkpointer or just the startup process with an > > end-of-recovery checkpoint.). This routine relies on unsafe > > assumptions as this is not concurrent-safe. You can avoid problems by > > making sure instead that lastRemovedSegNo is initialized correctly at > > startup, which would be normally one segment older than what's in > > pg_wal, which feels a bit hacky to rely on to track the oldest segment. > > Concurrent recycling makes the function's result vary between the > segment numbers before and after it. It is unstable but doesn't > matter so much. 
The reason for the timing is to avoid extra > startup time by a scan over pg_wal that is unncecessary in most > cases. > > Anyway the attached patch initializes lastRemovedSegNo in > StartupXLOG(). > > > It seems to me that GetOldestXLogFileSegNo() should also check for > > segments matching the current timeline, no? > > RemoveOldXlogFiles() ignores timeline and the function is made to > behave the same way (in different manner). I added a comment for > the behavior in the function. > > > + if (prev_lost_segs != lost_segs) > > + ereport(WARNING, > > + (errmsg ("some replication slots have lost > > required WAL segments"), > > + errdetail_plural( > > + "The mostly affected slot has lost %ld > > segment.", > > + "The mostly affected slot has lost %ld > > segments.", > > + lost_segs, lost_segs))); > > This can become very noisy with the time, and it would be actually > > useful to know which replication slot is impacted by that. > > One message per one segment doen't seem so noisy. The reason for > not showing slot identifier individually is just to avoid > complexity comes from involving slot details. DBAs will see the > details in pg_stat_replication. > > Anyway I did that in the attached patch. ReplicationSlotsBehind > returns the list of the slot names that behind specified > LSN. With this patch the messages looks as the follows: > > WARNING: some replication slots have lost required WAL segments > DETAIL: Slot s1 lost 8 segment(s). > WARNING: some replication slots have lost required WAL segments > DETAIL: Slots s1, s2, s3 lost at most 9 segment(s). > > > + slot doesn't have valid restart_lsn, this field > > Missing a determinant here, and restart_lsn should have a <literal> > > markup. > > structfield? Reworded as below: > > | non-negative. If <structfield>restart_lsn</structfield> is NULL, this > | field is <literal>unknown</literal>. > > I changed "the slot" with "this slot" in the two added fields > (wal_status, remain). > > > + many WAL segments that they fill up the space allotted > > s/allotted/allocated/. > > Fixed. > > > + available. The last two states are seen only when > > + <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the > > + slot doesn't have valid restart_lsn, this field > > + is <literal>unknown</literal>. > > I am a bit confused by this statement. The last two states are "lost" > > and "keeping", but shouldn't "keeping" be the state showing up by > > default as it means that all WAL segments are kept around. > > It's "streaming". I didn't came up with nice words to > distinguish the two states. I'm not sure "keep around" exactly > means but "keeping" here means rather "just not removed yet". The > states could be reworded as the follows: > > streaming: kept/keeping/(secure, in the first version) > keeping : mortal/about to be removed > lost/unkown : (lost/unknown) > > Do you have any better wording? > > > +# Advance WAL by ten segments (= 160MB) on master > > +advance_wal($node_master, 10); > > +$node_master->safe_psql('postgres', "CHECKPOINT;"); > > This makes the tests very costly, which is something we should avoid as > > much as possible. One trick which could be used here, on top of > > reducing the number of segment switches, is to use initdb > > --wal-segsize=1. > > That sounds nice. Done. In the new version the number of segments > can be reduced and a new test item for the initial unkonwn state > as the first item. > > Please find the attached new version. Rebased. No conflict found since the last version. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center From 270aff9b08ced425b4c4e23b53193285eb2359a6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/6] Add WAL relief vent for replication slots Adds a capability to limit the number of segments kept by replication slots by a GUC variable. --- src/backend/access/transam/xlog.c | 127 +++++++++++++++++++++----- src/backend/replication/slot.c | 57 ++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 174 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 2ab7d804f0..9988ef943c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -100,6 +100,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -872,6 +873,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9325,6 +9327,53 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number the next checkpoint must leave considering + * wal_keep_segments, replication slots and max_slot_wal_keep_size. + * + * currLSN is the current insert location + * minSlotLSN is the minimum restart_lsn of all active slots + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + uint64 keepSegs = 0; + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate keep segments by slots first. The second term of the + * condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Apply max_slot_wal_keep_size to keepSegs */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. 
@@ -9336,38 +9385,66 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * warn if the checkpoint flushes the segments required by replication + * slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. + * We don't need show anything here if no affected slot + * remains. + */ + if (slot_names) + { + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 33b23b6b6d..1a705ca0d3 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,63 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns the list of replication slots restart_lsn of whch are behind + * specified LSN. Returs palloc'ed character array stuffed with slot names + * delimited by the givein separator. Returns NULL if no slot matches. If + * pnslots is given, the number of the returned slots is returned there. 
+ */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + initStringInfo(&retstr); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (s->in_use && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * slot names consist only with lower-case letters. we don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL if the result is an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. * diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 98d75be292..789fabca66 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2616,6 +2616,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index a21865a77f..3768e8d08a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -278,6 +278,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f90a6a9139..b2eb30b779 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index a8f1d66bae..9c3635dc0e 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); 
extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.16.3 From 4be4ce4c671499c373ac5f8318f432db182eb8f4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/6] Add monitoring aid for max_slot_wal_keep_size. Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, long-disconnected slots may lose sync. The two columns show whether the slot is reconnectable or not, or about to lose reserving WAL segments, and the remaining bytes of WAL that can be advance until the slot loses reserving WAL records. --- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 150 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 16 +++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 172 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index b7c76469fc..c5f52d6ee8 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -705,8 +705,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index c4b10a4cf9..5040d5e85e 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -374,4 +374,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 9988ef943c..aaafa6b74f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -873,7 +873,9 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestXLogFileSegNo(void); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -6660,6 +6662,12 @@ StartupXLOG(void) */ StartupReplicationOrigin(); + /* + * Initialize lastRemovedSegNo looking pg_wal directory. The minimum + * segment number is 1 so no wrap-around can happen. + */ + XLogCtl->lastRemovedSegNo = GetOldestXLogFileSegNo() - 1; + /* * Initialize unlogged LSN. On a clean shutdown, it's restored from the * control file. 
On recovery, all unlogged relations are blown away, so @@ -9327,19 +9335,115 @@ CreateRestartPoint(int flags) return true; } + +/* + * Finds the segment number of the oldest file in XLOG directory. + * + * This function is intended to be used for initialization of + * XLogCtl->lastRemovedSegNo. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * get minimum segment ignoring timeline ID. Since RemoveOldXlog + * works ignoring timeline ID, this function works the same way. + */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Returns availability status of the record at given targetLSN + * + * Returns three kinds of value. + * "streaming" when the WAL record at targetLSN is available. + * "keeping" when still available but about to be removed by the next + * checkpoint. + * "lost" when the WAL record at targetLSN is already removed. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. + */ +char * +GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return "streaming"; + + /* targetSeg is not reserved but still available */ + if (oldestSeg <= targetSeg) + return "keeping"; + + /* targetSeg has gone */ + return "lost"; +} + /* * Returns minimum segment number the next checkpoint must leave considering * wal_keep_segments, replication slots and max_slot_wal_keep_size. * * currLSN is the current insert location * minSlotLSN is the minimum restart_lsn of all active slots + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes to advance until the + * segment that contains targetLSN will be removed. 
*/ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { uint64 keepSegs = 0; XLogSegNo currSeg; XLogSegNo minSlotSeg; + uint64 limitSegs = 0; XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9354,8 +9458,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Apply max_slot_wal_keep_size to keepSegs */ @@ -9363,9 +9465,40 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* also, limitSegs should be raised if wal_keep_segments is larger */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, return remaining LSN bytes to advance until the slot + * gives up reserving WAL records. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= targetSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9395,7 +9528,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * warn if the checkpoint flushes the segments required by replication diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index f4d9e9daf7..358df2c183 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -798,7 +798,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 224dd920c8..cac66978ed 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + + values[i++] = CStringGetTextDatum( + GetLsnAvailability(restart_lsn, &remaining_bytes)); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index b2eb30b779..b0cdba6d7a 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -302,6 +302,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 3ecc2e12c3..f7e6d18e35 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9665,9 +9665,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index e384cd2279..956c3c9525 100644 --- a/src/test/regress/expected/rules.out +++ 
b/src/test/regress/expected/rules.out @@ -1451,8 +1451,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 62b60031f7aed53bd176d2296a1a6d36bf7017c9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 8a2c6fc122..daca2e0085 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -672,11 +672,11 @@ sub init_from_backup chmod(0700, $data_path); # Base configuration for this node - $self->append_conf( - 'postgresql.conf', - qq( -port = $port -)); + $self->append_conf('postgresql.conf', qq(port = $port)); + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.16.3 From 254ae4a490f951b235550d7f9c0555cbbbee108e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 4/6] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 185 ++++++++++++++++++++++++++++++ 1 file changed, 185 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..8d1d3a2275 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,185 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slot. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 11; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 3MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state should be known before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "|unknown|0", 'non-reserved slot shows unknown'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using a replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data on the standby +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, currently the slot must be streaming state. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check initial state of standby'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is unconditionally "safe" with the default setting. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is keeping all segments'); + +# The stanby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 3; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaing bytes should be almost just +# (max_slot_wal_keep_size + 1) times as large as the segment size. +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|4096 kB", 'check that remaining bytes is calculated'); + +# Advance WAL again then checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|2048 kB", 'remaining byte should be reduced by 2MB'); + + +# wal_keep_segments can override +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that wal_keep_segments overrides'); + +# restore wal_keep_segments (no test) +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Slot gets to 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that some segments are about to be removed'); + +# The stanby still can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that no replication failure is caused by insecure state'); + +# Advance WAL again, the slot loses some segments. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*Slot rep1 lost 2 segment\\(s\\)\\.", + $logstart), + 'check that warning is correctly logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that overflown segments have been removed'); + +# The stanby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From ad8ae6bb324f3a7c6ac380b0976b66fda08154f5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 56 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index af4d0625ea..7ec8764ce5 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9825,6 +9825,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them are no longer + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until + this slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b6f5822b84..7177c6122a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3537,6 +3537,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited size of WAL files. If restart_lsn + of a replication slot gets behind more than that bytes from the + current LSN, the standby using the slot may no longer be able to + reconnect due to removal of required WAL records. You can see the WAL + availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index bbab7395a2..79901c5f06 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3 From 96991c3d5fb4d5dc2df6e98fd725eb85189e739e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 6/6] Check removal of in-reading segment file. Checkpoint can remove or recycle a segment file while it is being read by ReadRecord during logical decoding. This patch checks for the case and error out immedaitely. Reading recycled file is basically safe and inconsistency caused by overwrites as new segment will be caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 69b6226f8f..cc24afd30d 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -26,6 +26,7 @@ #include "replication/origin.h" #ifndef FRONTEND +#include "access/xlog.h" #include "utils/memutils.h" #endif @@ -224,7 +225,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -270,6 +273,21 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * checkpoint can remove the segment currently looking for. make sure the + * current segment still exists. we check this once per page. This cannot + * happen on frontend. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
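For reference, with the patch set above applied, slot health is meant to be watched through the two new view columns. A sketch of such a query follows; it assumes these patches, since wal_status (streaming/keeping/lost/unknown) and remain only exist with them:

    -- Show slots that are no longer safely reserving WAL on a patched server.
    SELECT slot_name, active, wal_status,
           pg_size_pretty(remain) AS remain
      FROM pg_replication_slots
     WHERE wal_status <> 'streaming';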
Hi, On 2019-01-30 10:42:04 +0900, Kyotaro HORIGUCHI wrote: > From 270aff9b08ced425b4c4e23b53193285eb2359a6 Mon Sep 17 00:00:00 2001 > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> > Date: Thu, 21 Dec 2017 21:20:20 +0900 > Subject: [PATCH 1/6] Add WAL relief vent for replication slots > > Adds a capability to limit the number of segments kept by replication > slots by a GUC variable. Maybe I'm missing something, but how does this prevent issues with active slots that are currently accessing the WAL that this patch now suddenly allows to be removed? Especially for logical slots, that seems not unproblematic. Besides that, this patch needs substantial spelling / language / comment polishing. Horiguchi-san, it'd probably be good if you could make a careful pass, and then maybe a native speaker could go over it? Greetings, Andres Freund
At Fri, 15 Feb 2019 19:13:23 -0800, Andres Freund <andres@anarazel.de> wrote in <20190216031323.t7tfrae4l6zqtseo@alap3.anarazel.de> > Hi, > > On 2019-01-30 10:42:04 +0900, Kyotaro HORIGUCHI wrote: > > From 270aff9b08ced425b4c4e23b53193285eb2359a6 Mon Sep 17 00:00:00 2001 > > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> > > Date: Thu, 21 Dec 2017 21:20:20 +0900 > > Subject: [PATCH 1/6] Add WAL relief vent for replication slots > > > > Adds a capability to limit the number of segments kept by replication > > slots by a GUC variable. > > Maybe I'm missing something, but how does this prevent issues with > active slots that are currently accessing the WAL that this patch now > suddenly allows to be removed? Especially for logical slots, that seems > not unproblematic. No matter whether the slot is logical or physical, when an overwritten page of a recycled/renamed segment file is read, page validation at read time detects that the page belongs to a different segment than expected. Patch 0006 in [1] introduces a more active check for that. [1] https://www.postgresql.org/message-id/20181220.162438.121484007.horiguchi.kyotaro%40lab.ntt.co.jp > Besides that, this patch needs substantial spelling / language / comment > polishing. Horiguchi-san, it'd probably be good if you could make a > careful pass, and then maybe a native speaker could go over it? Thank you for your kind suggestion. As I did for other patches, I'll review it by myself and come up with a new version soon. # I often don't understand what I wrote:( regards. -- Kyotaro Horiguchi NTT Open Source Software Center
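To spell out the failure mode being discussed: once a segment that a slot still needs has been removed, the standby fails with "requested WAL segment ... has already been removed" (the TAP test upthread checks for exactly that message) and the slot cannot catch up again. With the patched view, the usual cleanup would look roughly like this sketch (the slot name is just the example name from the TAP test):

    -- Find slots whose required WAL is already gone, then drop them and
    -- rebuild the affected standby from a fresh base backup.
    SELECT slot_name FROM pg_replication_slots WHERE wal_status = 'lost';
    SELECT pg_drop_replication_slot('rep1');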
At Fri, 22 Feb 2019 10:12:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190222.101251.03333542.horiguchi.kyotaro@lab.ntt.co.jp> horiguchi.kyotaro> At Fri, 15 Feb 2019 19:13:23 -0800, Andres Freund <andres@anarazel.de> wrote in <20190216031323.t7tfrae4l6zqtseo@alap3.anarazel.de> > > Maybe I'm missing something, but how does this prevent issues with > > active slots that are currently accessing the WAL that this patch now > > suddenly allows to be removed? Especially for logical slots, that seems > > not unproblematic. > > No matter whether the slot is logical or physical, when an > overwritten page of a recycled/renamed segment file is read, page > validation at read time detects that the page belongs to a different > segment than expected. Patch 0006 in [1] introduces a more active check > for that. > > [1] https://www.postgresql.org/message-id/20181220.162438.121484007.horiguchi.kyotaro%40lab.ntt.co.jp > > > Besides that, this patch needs substantial spelling / language / comment > > polishing. Horiguchi-san, it'd probably be good if you could make a > > careful pass, and then maybe a native speaker could go over it? > > Thank you for your kind suggestion. As I did for other patches, > I'll review it by myself and come up with a new version soon. I checked the spelling in comments and commit messages and corrected and improved them where I could. I hope they look nicer now. 0006 is kept separate from 0001, since I still doubt its necessity. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 9bc7ca30006ebe0fe13c6ffbf4bfc87e52176876 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/6] Add WAL relief vent for replication slots Replication slots are useful to keep a replication connection alive in configurations where replication is delayed so much that the connection would otherwise be broken. On the other hand, they can retain so many WAL files that the disk fills up and takes the master down. This feature, activated by the GUC "max_slot_wal_keep_size", protects the master from running out of disk space by limiting the number of WAL files reserved by replication slots.
--- src/backend/access/transam/xlog.c | 128 +++++++++++++++++++++----- src/backend/replication/slot.c | 57 ++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 175 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ecd12fc53a..998b779277 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -100,6 +100,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -872,6 +873,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9329,6 +9331,54 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9340,38 +9390,66 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. 
+ */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. + * We don't need show anything here if no affected slot + * remains. + */ + if (slot_names) + { + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 33b23b6b6d..6ef63ae7c0 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,63 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns names of replication slots that their restart_lsn are behind + * specified LSN, in palloc'ed character array stuffed with slot names + * delimited by the given separator. Returns NULL if no slot matches. If + * pnslots is given, the number of the returned slots is returned there. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + initStringInfo(&retstr); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (s->in_use && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * Slot names consist only with lower-case letters. We don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL instead of an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 156d147c85..c5f04fb8a5 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2629,6 +2629,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 194f312096..6f96177bbd 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -282,6 +282,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f90a6a9139..b2eb30b779 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index a8f1d66bae..9c3635dc0e 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.16.3 From 3784220883e1723104f116add9b47161a847da83 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/6] Add monitoring aid for max_slot_wal_keep_size Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, replication connections may lose sync by a long delay. The "status" column shows whether the slot is reconnectable or not, or about to lose reserving WAL segments. The "remain" column shows the remaining bytes of WAL that can be advance until the slot loses required WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 152 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 16 +++- src/include/access/xlog.h | 1 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 174 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2bd28e6d15..9f42dc0991 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index a55086443c..e793ddd366 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 998b779277..9623469a5e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -873,7 +873,9 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestXLogFileSegNo(void); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -6664,6 +6666,12 @@ StartupXLOG(void) */ StartupReplicationOrigin(); + /* + * Initialize lastRemovedSegNo looking pg_wal directory. The minimum + * segment number is 1 so wrap-around cannot happen. + */ + XLogCtl->lastRemovedSegNo = GetOldestXLogFileSegNo() - 1; + /* * Initialize unlogged LSN. On a clean shutdown, it's restored from the * control file. On recovery, all unlogged relations are blown away, so @@ -9331,6 +9339,96 @@ CreateRestartPoint(int flags) return true; } + +/* + * Finds the oldest segment number in XLOG directory. + * + * This function is intended to be used to initialize + * XLogCtl->lastRemovedSegNo. + */ +static XLogSegNo +GetOldestXLogFileSegNo(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = 0; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * Get minimum segment ignoring timeline ID, the same way with + * RemoveOldXlogFiles(). 
+ */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + return segno; +} + +/* + * Returns availability of the record at given targetLSN. + * + * Returns three kinds of value in string. + * "streaming" means the WAL record at targetLSN is available. + * "keeping" means it is still available but about to be removed at the next + * checkpoint. + * "lost" means it is already removed. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. + */ +char * +GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo targetSeg; + XLogSegNo tailSeg; + XLogSegNo oldestSeg; + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + SpinLockAcquire(&XLogCtl->info_lck); + oldestSeg = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + /* oldest segment is just after the last removed segment */ + oldestSeg++; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + slotPtr = XLogGetReplicationSlotMinimumLSN(); + tailSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, restBytes); + + /* targetSeg is being reserved by slots */ + if (tailSeg <= targetSeg) + return "streaming"; + + /* targetSeg is no longer reserved but still available */ + if (oldestSeg <= targetSeg) + return "keeping"; + + /* targetSeg has gone */ + return "lost"; +} + /* * Returns minimum segment number that the next checkpoint must leave * considering wal_keep_segments, replication slots and @@ -9338,13 +9436,19 @@ CreateRestartPoint(int flags) * * currLSN is the current insert location. * minSlotLSN is the minimum restart_lsn of all active slots. + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { XLogSegNo currSeg; XLogSegNo minSlotSeg; uint64 keepSegs = 0; /* # of segments actually kept */ + uint64 limitSegs = 0; /* # of maximum segments possibly kept */ XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9359,8 +9463,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Reduce it if slots already reserves too many. */ @@ -9368,9 +9470,42 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* ditto for limitSegs */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, calculate the remaining LSN bytes until the slot gives up + * keeping WAL records. 
+ */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (max_slot_wal_keep_size_mb >= 0 && currSeg <= targetSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how many + * LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9400,7 +9535,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. */ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * Warn the checkpoint is going to flush the segments required by diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..26a6c3bfd5 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -800,7 +800,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 224dd920c8..cac66978ed 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -185,7 +185,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -307,6 +307,20 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + values[i++] = LSNGetDatum(InvalidXLogRecPtr); + } + else + { + uint64 remaining_bytes; + + values[i++] = CStringGetTextDatum( + GetLsnAvailability(restart_lsn, &remaining_bytes)); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index b2eb30b779..b0cdba6d7a 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -302,6 +302,7 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(XLogRecPtr targetLSN, uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a4e173b484..0014bd18ee 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9685,9 +9685,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 98f417cb57..d0b05a26d5 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1459,8 +1459,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From c09cc4b319c11be17dd33c36bbc95fc984a7e394 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 8a2c6fc122..daca2e0085 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -672,11 +672,11 @@ sub init_from_backup chmod(0700, $data_path); # Base configuration for this node - $self->append_conf( - 'postgresql.conf', - qq( -port = $port -)); + $self->append_conf('postgresql.conf', qq(port = $port)); + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.16.3 From beded18a29ff790b0fee891be05ea641502c5537 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 4/6] TAP test for the slot limit feature --- src/test/recovery/t/016_replslot_limit.pl | 184 ++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 src/test/recovery/t/016_replslot_limit.pl diff --git a/src/test/recovery/t/016_replslot_limit.pl b/src/test/recovery/t/016_replslot_limit.pl new file mode 100644 index 0000000000..e150ca7a54 --- /dev/null +++ b/src/test/recovery/t/016_replslot_limit.pl @@ -0,0 +1,184 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 11; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 3MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state should be the state "unknown" before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "|unknown|0", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "streaming" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that slot is working'); + +# The standby can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 3; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|4096 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working. 
+$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|2048 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint +advance_wal($node_master, 2); + +# Slot gets to 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|0 bytes", 'check that the slot state changes to "keeping"'); + +# The standby still can connect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses some segments. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*Slot rep1 lost 2 segment\\(s\\)\\.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From de0f6d17ff2100615cf127a5bbf79d88811c3a06 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 28 ++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 56 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 0fd792ff1a..ad5931800d 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9865,6 +9865,34 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available. <literal>keeping</literal> means that + some of them are to be removed by the next checkpoint. + <literal>lost</literal> means that some of them are no longer + available. The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until + this slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8bd57f376b..59b5c1b03a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3573,6 +3573,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited size of WAL files. If restart_lsn + of a replication slot gets behind more than that bytes from the + current LSN, the standby using the slot may no longer be able to + reconnect due to removal of required WAL records. You can see the WAL + availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 2b4dcd03c8..d52ffd97a3 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3 From 54890cffa5944aeaefe86c6360af60f182314411 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 6/6] Check removal of in-reading segment file. Checkpoints can recycle a segment file while it is being read by ReadRecord and that leads to an apparently odd error message during logical decoding. This patch explicitly checks that then error out immediately. Reading a recycled file is safe. Inconsistency caused by overwrites as a new segment are caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index cbc7e4e7ea..c7a39ebfc5 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -26,6 +26,7 @@ #include "replication/origin.h" #ifndef FRONTEND +#include "access/xlog.h" #include "utils/memutils.h" #endif @@ -224,7 +225,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -270,6 +273,22 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
Hi all,

Being interested in this feature, I did a patch review.

This feature adds the GUC "max_slot_wal_keep_size". This is the maximum amount
of WAL that can be kept in "pg_wal" by active slots.

If the amount of WAL exceeds this limit, the slot is deactivated and
its status (new field in pg_replication_slots) is set as "lost".

Patching
========

The patch v13-0003 does not apply on HEAD anymore.

The patch v13-0005 applies using "git am --ignore-space-change"

Other patches apply correctly.

Please, find attached the v14 set of patches rebased on master.

Documentation
=============

The documentation explains the GUC and related columns in "pg_replication_slots".

It reflects the current behavior of the patch correctly.

Usability
=========

The patch implements what it describes. It is easy to enable and disable. The
GUC name describes its purpose correctly.

This feature is useful in some HA scenarios where slots are required (eg. no
possible archiving), but where primary availability is more important than
standbys.

In the "pg_replication_slots" view, the new "wal_status" field is misleading.
Consider this sentence and the related behavior from the documentation
(catalogs.sgml):

  <literal>keeping</literal> means that some of them are to be removed by the
  next checkpoint.

"keeping" appears when the current checkpoint will delete some WAL further than
"current_lsn - max_slot_wal_keep_size", but still required by at least one slot.
As some WAL required by some slots will be deleted quite soon, probably before
anyone can react, the "keeping" status is misleading here. We are already in the
red zone.

I would expect this "wal_status" to be:

- streaming: slot lag between 0 and "max_wal_size"
- keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". The
  slot actually protects some WALs from being deleted
- lost: slot lag greater than "max_slot_wal_keep_size". The slot couldn't
  protect some WAL from deletion

The documentation follows with:

  The last two states are seen only when max_slot_wal_keep_size is
  non-negative

This is true with the current behavior. However, if "keeping" is set as soon as
the slot lag is greater than "max_wal_size", this status could be useful even
with "max_slot_wal_keep_size = -1". As soon as a slot is stacking WALs that
should have been removed by a previous checkpoint, it "keeps" them.

Feature tests
=============

I have played with various traffic shaping setups between nodes, to observe how
the columns "active", "wal_status" and "remain" behave in regard to each other,
using:

  while true; do
    sleep 0.3;
    psql -p 5432 -AXtc "
      select now(), active, restart_lsn, wal_status, pg_size_pretty(remain)
      from pg_replication_slots
      where slot_name='slot_limit_st'"
  done

The primary is created using:

  initdb -U postgres -D slot_limit_pr --wal-segsize=1

  cat<<EOF >>slot_limit_pr/postgresql.conf
  port=5432
  max_wal_size = 3MB
  min_wal_size = 2MB
  max_slot_wal_keep_size = 4MB
  logging_collector = on
  synchronous_commit = off
  EOF

WAL activity is generated using a simple pgbench workload. Then, during this
activity, packets on loopback are delayed using:

  tc qdisc add dev lo root handle 1:0 netem delay 140msec

Here is how wal_status behaves.
I removed the timestamps, but the record order is the original one:

 t|1/7B116898|streaming|1872 kB
 t|1/7B1A0000|lost|0 bytes
 t|1/7B320000|keeping|0 bytes
 t|1/7B780000|lost|0 bytes
 t|1/7BB00000|keeping|0 bytes
 t|1/7BE00000|keeping|0 bytes
 t|1/7C100000|lost|0 bytes
 t|1/7C400000|keeping|0 bytes
 t|1/7C700000|lost|0 bytes
 t|1/7CA40000|keeping|0 bytes
 t|1/7CDE0000|lost|0 bytes
 t|1/7D100000|keeping|0 bytes
 t|1/7D400000|keeping|0 bytes
 t|1/7D7C0000|keeping|0 bytes
 t|1/7DB40000|keeping|0 bytes
 t|1/7DE60000|lost|0 bytes
 t|1/7E180000|keeping|0 bytes
 t|1/7E500000|keeping|0 bytes
 t|1/7E860000|lost|0 bytes
 t|1/7EB80000|keeping|0 bytes
 [...x15]
 t|1/80800000|keeping|0 bytes
 t|1/80900000|streaming|940 kB
 t|1/80A00000|streaming|1964 kB

When increasing the network delay to 145ms, the slot has been lost for real.
Note that it has been shown as lost but active twice (during approx 0.6s)
before being deactivated.

 t|1/85700000|streaming|2048 kB
 t|1/85800000|keeping|0 bytes
 t|1/85940000|lost|0 bytes
 t|1/85AC0000|lost|0 bytes
 f|1/85C40000|lost|0 bytes

Finally, at least once, the following message appeared in the primary logs
**before** the "wal_status" changed from "keeping" to "streaming":

  WARNING: some replication slots have lost required WAL segments

So the slot lost one WAL segment, but the standby has been able to catch up
anyway.

My humble opinion about these results:

* after many different tests, the status "keeping" appears only when "remain"
  equals 0. In the current implementation, "keeping" really adds no value...

* "remain" should be NULL if "max_slot_wal_keep_size = -1" or if the slot
  isn't active

* the "lost" status should be a definitive status

* it seems related, but maybe the "wal_status" should be set as "lost"
  only when the slot has been deactivated?

* logs should warn about a failing slot as soon as it is effectively
  deactivated, not before.
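To make the expectation above concrete, here is a rough, purely illustrative
sketch of the classification I have in mind, written as a hypothetical helper
working on the slot lag in bytes. This is not code from the patch, and it
ignores the deactivation refinements listed above:

  #include <stdint.h>

  /* Illustrative only: wal_status as I would expect it, based on slot lag. */
  static const char *
  proposed_wal_status(uint64_t lag_bytes,             /* current_lsn - restart_lsn */
                      uint64_t max_wal_size_bytes,
                      int64_t max_slot_wal_keep_size_bytes)  /* -1 means no limit */
  {
      if (lag_bytes <= max_wal_size_bytes)
          return "streaming";   /* this WAL would be retained anyway */

      if (max_slot_wal_keep_size_bytes < 0 ||
          lag_bytes <= (uint64_t) max_slot_wal_keep_size_bytes)
          return "keeping";     /* only the slot protects this WAL from deletion */

      return "lost";            /* the slot could not protect some required WAL */
  }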
Attachment
- v14-0001-Add-WAL-relief-vent-for-replication-slots.patch
- v14-0002-Add-monitoring-aid-for-max_slot_wal_keep_size.patch
- v14-0003-Add-primary_slot_name-to-init_from_backup-in-TAP-tes.patch
- v14-0004-TAP-test-for-the-slot-limit-feature.patch
- v14-0005-Documentation-for-slot-limit-feature.patch
- v14-0006-Check-removal-of-in-reading-segment-file.patch
Thanks for reviewing!

At Thu, 27 Jun 2019 16:22:56 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in <20190627162256.4f4872b8@firost>
> Hi all,
>
> Being interested in this feature, I did a patch review.
>
> This feature adds the GUC "max_slot_wal_keep_size". This is the maximum
> amount of WAL that can be kept in "pg_wal" by active slots.
>
> If the amount of WAL exceeds this limit, the slot is deactivated and
> its status (new field in pg_replication_slots) is set as "lost".

This patch doesn't deactivate the walsender. A walsender stops by itself when
it finds that it cannot continue ongoing replication.

> Patching
> ========
>
> The patch v13-0003 does not apply on HEAD anymore.
>
> The patch v13-0005 applies using "git am --ignore-space-change"
>
> Other patches apply correctly.
>
> Please, find attached the v14 set of patches rebased on master.

Sorry for missing this for a long time. It was hit by 67b9b3ca32 again, so I
repost a rebased version.

> Documentation
> =============
>
> The documentation explains the GUC and related columns in
> "pg_replication_slots".
>
> It reflects the current behavior of the patch correctly.
>
> Usability
> =========
>
> The patch implements what it describes. It is easy to enable and disable.
> The GUC name describes its purpose correctly.
>
> This feature is useful in some HA scenarios where slots are required (eg. no
> possible archiving), but where primary availability is more important than
> standbys.

Yes. Thanks for the clear explanation of the purpose.

> In the "pg_replication_slots" view, the new "wal_status" field is misleading.
> Consider this sentence and the related behavior from the documentation
> (catalogs.sgml):
>
>   <literal>keeping</literal> means that some of them are to be removed by
>   the next checkpoint.
>
> "keeping" appears when the current checkpoint will delete some WAL further
> than "current_lsn - max_slot_wal_keep_size", but still required by at least
> one slot. As some WAL required by some slots will be deleted quite soon,
> probably before anyone can react, the "keeping" status is misleading here.
> We are already in the red zone.

It may be "losing", which would be less misleading.

> I would expect this "wal_status" to be:
>
> - streaming: slot lag between 0 and "max_wal_size"
> - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". The
>   slot actually protects some WALs from being deleted
> - lost: slot lag greater than "max_slot_wal_keep_size". The slot couldn't
>   protect some WAL from deletion

I agree that comparing to max_wal_size is meaningful. The revised version
behaves that way.

> The documentation follows with:
>
>   The last two states are seen only when max_slot_wal_keep_size is
>   non-negative
>
> This is true with the current behavior. However, if "keeping" is set as soon
> as the slot lag is greater than "max_wal_size", this status could be useful
> even with "max_slot_wal_keep_size = -1". As soon as a slot is stacking WALs
> that should have been removed by a previous checkpoint, it "keeps" them.

I revised the documentation that way. Both view-pg-replication-slots.html and
runtime-config-replication.html are reworded.
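To restate the revised behavior compactly, here is a simplified sketch of how
the attached GetLsnAvailability() now classifies a slot's restart_lsn. The
real code works on segment numbers derived from max_wal_size and the slots'
minimum restart_lsn, and also computes the remaining bytes; the helper name
and parameters below are only illustrative.

  #include <stdint.h>

  /* Simplified sketch only; not the exact code in the attached patch. */
  static const char *
  wal_status_sketch(uint64_t oldestSeg,      /* oldest segment kept by max_wal_size */
                    uint64_t oldestSlotSeg,  /* oldest segment kept for slots */
                    uint64_t targetSeg,      /* segment holding the slot's restart_lsn */
                    int walsender_pid)       /* walsender using the slot, or 0 */
  {
      if (oldestSeg <= targetSeg)
          return "streaming";   /* restart_lsn is still within max_wal_size */

      if (oldestSlotSeg <= targetSeg)
          return "keeping";     /* beyond max_wal_size, retained only for slots */

      if (walsender_pid != 0)
          return "losing";      /* segment already removed, but a walsender is
                                 * still running; a delayed ack may rescue it */

      return "lost";            /* removed and no walsender is running */
  }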
> Feature tests
> =============
>
> I have played with various traffic shaping setups between nodes, to observe
> how the columns "active", "wal_status" and "remain" behave in regard to each
> other, using:
>
> while true; do
> <removed testing details>
>
> Finally, at least once, the following message appeared in the primary logs
> **before** the "wal_status" changed from "keeping" to "streaming":
>
>   WARNING: some replication slots have lost required WAL segments
>
> So the slot lost one WAL segment, but the standby has been able to catch up
> anyway.

Thanks for the intensive test run. It is quite helpful.

> My humble opinion about these results:
>
> * after many different tests, the status "keeping" appears only when "remain"
>   equals 0. In the current implementation, "keeping" really adds no value...

Hmm. I agree, given that the "lost" (or "losing" in the patch) state is not a
definite state. That is, the slot may recover from that state.

> * "remain" should be NULL if "max_slot_wal_keep_size = -1" or if the slot
>   isn't active

The revised version shows the following statuses:

  streaming / NULL     max_slot_wal_keep_size is -1
  unknown   / NULL     max_slot_wal_keep_size >= 0 and restart_lsn is invalid
  <status>  / <bytes>  otherwise

> * the "lost" status should be a definitive status
> * it seems related, but maybe the "wal_status" should be set as "lost"
>   only when the slot has been deactivated?

Agreed. While replication is active, even if the required segments seem to be
lost once, a delayed walreceiver ack can advance restart_lsn back into the
"safe" zone later. So, in the revised version, if the segment for restart_lsn
has been removed, GetLsnAvailability() returns "losing" if a walsender is
active on the slot and "lost" if not.

> * logs should warn about a failing slot as soon as it is effectively
>   deactivated, not before.

Agreed. Slots on which a walsender is running are excluded from the output of
ReplicationSlotsEnumerateBehinds. As a result, the "some replication slots
lost..." warning is emitted only after the related walsender stops.

I attach the revised patch. I'll repost the polished version sooner.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 87b35cb3c1c1a50218563037a97a368d86451040 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Dec 2017 21:20:20 +0900
Subject: [PATCH 1/6] Add WAL relief vent for replication slots

Replication slot is useful to maintain replication connection in the
configurations where replication is so delayed that connection is
broken. On the other hand so many WAL files can fill up disk that the
master downs by a long delay. This feature, which is activated by a
GUC "max_slot_wal_keep_size", protects master servers from suffering
disk full by limiting the number of WAL files reserved by replication
slots.
--- src/backend/access/transam/xlog.c | 128 +++++++++++++++++++++----- src/backend/replication/slot.c | 58 ++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 176 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f553523857..fcb076100f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -104,6 +104,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -872,6 +873,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9288,6 +9290,54 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9299,38 +9349,66 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. 
+ */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. + * We don't need show anything here if no affected slot + * remains. + */ + if (slot_names) + { + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 62342a69cb..24b8d42eab 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,64 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns names of replication slots that their restart_lsn are behind + * specified LSN, in palloc'ed character array stuffed with slot names + * delimited by the given separator. Returns NULL if no slot matches. If + * pnslots is given, the number of the returned slots is returned there. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + initStringInfo(&retstr); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + /* Exclude active walsenders */ + if (s->in_use && s->active_pid == 0 && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * Slot names consist only with lower-case letters. We don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL instead of an empty string */ + return retstr.data[0] ? 
retstr.data : NULL; +} + /* * Flush all replication slots to disk. * diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index fc463601ff..f8e796b6c1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2654,6 +2654,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index cfad86c02a..aadbc76d85 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -286,6 +286,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index d519252aad..b355452072 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 8fbddea78f..e0fee0663c 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.16.3 From 839a102791aadef5dd7af28623f55be411b9374b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/6] Add monitoring aid for max_slot_wal_keep_size Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, replication connections may lose sync by a long delay. The "status" column shows whether the slot is reconnectable or not, or about to lose reserving WAL segments. The "remain" column shows the remaining bytes of WAL that can be advance until the slot loses required WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 117 ++++++++++++++++++++++++++++++--- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 21 +++++- src/include/access/xlog.h | 2 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 144 insertions(+), 18 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2c999fd3eb..cf0318f697 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index 856495c952..0f2b9992f7 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fcb076100f..d0cc2e0f6d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -873,7 +873,8 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, uint64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9290,6 +9291,63 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns availability of the record at given targetLSN. + * + * Returns three kinds of value in string. + * "streaming" means the WAL record at targetLSN is available. + * "keeping" means it is still available but about to be removed at the next + * checkpoint. + * "losing" means it is already removed. This state is not definite since + * delayed ack from walreceiver can advance restart_lsn later. + * "lost" means it is already removed and no walsenders are running on it. The + * slot is no longer recoverable. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. 
+ */ +char * +GetLsnAvailability(pid_t walsender_pid, XLogRecPtr targetLSN, uint64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo targetSeg; /* segid of targetLSN */ + XLogSegNo oldestSeg; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + + Assert(!XLogRecPtrIsInvalid(targetLSN)); + Assert(restBytes); + + currpos = GetXLogWriteRecPtr(); + + /* oldest segment currently needed by slots */ + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr, targetLSN, + restBytes); + + /* oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + oldestSeg = currSeg - + ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + /* targetSeg is within max_wal_size */ + if (oldestSeg <= targetSeg) + return "streaming"; + + /* targetSeg is being retained by slots */ + if (oldestSlotSeg <= targetSeg) + return "keeping"; + + /* targetSeg is no longer protected. We ignore the possible availability */ + + if (walsender_pid != 0) + return "losing"; + + return "lost"; +} + /* * Returns minimum segment number that the next checkpoint must leave * considering wal_keep_segments, replication slots and @@ -9297,13 +9355,19 @@ CreateRestartPoint(int flags) * * currLSN is the current insert location. * minSlotLSN is the minimum restart_lsn of all active slots. + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, uint64 *restBytes) { XLogSegNo currSeg; XLogSegNo minSlotSeg; uint64 keepSegs = 0; /* # of segments actually kept */ + uint64 limitSegs = 0; /* # of maximum segments possibly kept */ XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9318,8 +9382,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Reduce it if slots already reserves too many. */ @@ -9327,9 +9389,45 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* ditto for limitSegs */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, calculate the remaining LSN bytes until the slot gives up + * keeping WAL records. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + if (max_slot_wal_keep_size_mb >= 0) + { + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (currSeg <= targetSeg + limitSegs) + { + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. 
+ */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, fragbytes, + wal_segment_size, *restBytes); + } + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9359,7 +9457,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. */ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = + GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, NULL); /* * Warn the checkpoint is going to flush the segments required by @@ -9393,7 +9492,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) if (slot_names) { ereport(WARNING, - (errmsg ("some replication slots have lost required WAL segments"), + (errmsg ("some replication slots lost required WAL segments"), errdetail_plural( "Slot %s lost %ld segment(s).", "Slots %s lost at most %ld segment(s).", diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index ea4c85e395..6a9491e64a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -849,7 +849,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 808a6f5b83..2229a46154 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -221,7 +221,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -343,6 +343,25 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + if (max_slot_wal_keep_size_mb < 0) + { + values[i++] = CStringGetTextDatum("streaming"); + nulls[i++] = true; + } + else if (restart_lsn == InvalidXLogRecPtr) + { + values[i++] = CStringGetTextDatum("unknown"); + nulls[i++] = true; + } + else + { + uint64 remaining_bytes; + + values[i++] = CStringGetTextDatum( + GetLsnAvailability(active_pid, restart_lsn, &remaining_bytes)); + values[i++] = Int64GetDatum(remaining_bytes); + } + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index b355452072..68ca21f780 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -305,6 +305,8 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(pid_t walsender_pid, XLogRecPtr targetLSN, + uint64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 0902dce5f1..300d868980 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9844,9 +9844,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => 
'{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 210e9cd146..74c44891a4 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1461,8 +1461,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From a7d798cbe4142a2f04393671acfc032917c4cd11 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 6019f37f91..c7e138c121 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -694,6 +694,10 @@ port = $port $self->append_conf('postgresql.conf', "unix_socket_directories = '$host'"); } + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.16.3 From 07b28b280aeb4f10273057c1b528ab33adddd5ad Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 4/6] TAP test for the slot limit feature --- src/test/recovery/t/018_replslot_limit.pl | 198 ++++++++++++++++++++++++++++++ 1 file changed, 198 insertions(+) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..2bd6fdf39c --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,198 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 3MB +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state should be the state "unknown" before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHEREslot_name = 'rep1'"); +is($result, "|unknown|0", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "streaming" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|streaming|0", 'check that within max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain FROM pg_replication_slots WHERE slot_name= 'rep1'"); +is($result, "$start_lsn|keeping|0", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 4; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|3072 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 2 MB. +advance_wal($node_master, 2); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*Slot rep1 lost 1 segment\\(s\\)\\.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|0 bytes", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 320af84bf916ba7b70cf01147dab234c8bf318cb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 30 ++++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 58 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 68ad5071ca..dc9679283a 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9975,6 +9975,36 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>lost</literal> + or <literal>unknown</literal>. <literal>streaming</literal> means that + the claimed records are available within + max_wal_size. <literal>keeping</literal> means max_wal_size is exceeded + but still required records are held by replication slots or + wal_keep_segments. + <literal>lost</literal> means that some of them are on the verge of + removal or no longer available. This state is seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is <literal>unknown</literal>. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes that WAL location (LSN) can advance until + this slot may lose required WAL records. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c91e3e1550..c345538c8f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3650,6 +3650,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL records. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 543691dad4..ae8c3a2aca 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3 From ab129736e06b23f4e251cbb65e1b841670ba924a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 6/6] Check removal of in-reading segment file. Checkpoints can recycle a segment file while it is being read by ReadRecord and that leads to an apparently odd error message during logical decoding. This patch explicitly checks that then error out immediately. Reading a recycled file is safe. Inconsistency caused by overwrites as a new segment are caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 33ccfc1553..4999892932 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -27,6 +27,7 @@ #ifndef FRONTEND #include "miscadmin.h" +#include "access/xlog.h" #include "utils/memutils.h" #endif @@ -225,7 +226,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -271,6 +274,22 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
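For reference, the monitoring side of this patch set comes down to the two new pg_replication_slots columns, wal_status and remain. A query along the following lines (illustrative only; the column names follow this version of patch 2/6 and the state names are still under discussion) is how an operator would watch slots approach the limit:

    -- illustrative query against the view shape added by this version of patch 2/6
    SELECT slot_name, active, restart_lsn, wal_status,
           pg_size_pretty(remain) AS remain
    FROM pg_replication_slots
    ORDER BY remain;

Ordering by remain surfaces the slots with the least headroom; note that in this version a never-activated slot reports "unknown" with remain 0, as the TAP test above asserts.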
At Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190730.213045.221405075.horikyota.ntt@gmail.com> > I attach the revised patch. I'll repost the polished version > sooner. (Mainly TAP test and documentation, code comments will be revised) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190730.213045.221405075.horikyota.ntt@gmail.com> > I attach the revised patch. I'll repost the polished version > sooner. This is the revised patch. - Status criteria has been changed. "streaming" : restart_lsn is within max_wal_size. (and kept) "keeping" : restart_lsn is behind max_wal_size but still kept by max_slot_wal_keep_size or wal_keep_segments. "losing" : The segment for restart_lsn is being lost or has been lost, but active walsender (or session) using the slot is still running. If the walsender caught up before stopped, the state will transfer to "keeping" or "streaming" again. "lost" : The segment for restart_lsn has been lost and the active session on the slot is gone. The standby cannot continue replication using this slot. null : restart_lsn is null (never activated). - remain is null if restart_lsn is null (never activated) or wal_status is "losing" or "lost". - catalogs.sgml is updated. - Refactored GetLsnAvailability and GetOldestKeepSegment and pg_get_replication_slots. - TAP test is fied. But test for "losing" state cannot be done since it needs interactive session. (I think using isolation tester is too much).. reards. -- Kyotaro Horiguchi NTT Open Source Software Center From a7f2efdcf64eae2cd4fd707981658f29090d36ee Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH 1/6] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- src/backend/access/transam/xlog.c | 128 +++++++++++++++++++++----- src/backend/replication/slot.c | 62 +++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 180 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f553523857..3989f6e54a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -104,6 +104,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -872,6 +873,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9288,6 +9290,54 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. 
+ * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9299,38 +9349,66 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. We + * don't need show anything here if no affected slots are + * remaining. 
+ */ + if (slot_names) + { + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 62342a69cb..5bdf1e90fb 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,68 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns names of inactive replication slots that their restart_lsn are + * behind specified LSN for the purpose of error message, in palloc'ed + * character array stuffed with slot names delimited by the given + * separator. Returns NULL if no slot matches. If pnslots is given, the number + * of the returned slots is returned there. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + initStringInfo(&retstr); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + /* + * We are collecting slots that are definitely behind the given target + * LSN. Active slots are exluded since they can catch up later. + */ + if (s->in_use && s->active_pid == 0 && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * Slot names consist only with lower-case letters. We don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL instead of an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index fc463601ff..f8e796b6c1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2654,6 +2654,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index cfad86c02a..aadbc76d85 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -286,6 +286,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index d519252aad..b355452072 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 8fbddea78f..e0fee0663c 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -199,6 +199,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.16.3 From 68232445aca90c9d104c58893f45db6f740354b3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH 2/6] Add monitoring aid for max_slot_wal_keep_size Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, replication connections may lose sync by a long delay. The "status" column shows whether the slot is reconnectable or not, or about to lose reserving WAL segments. The "remain" column shows the remaining bytes of WAL that can be advance until the slot loses required WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 157 +++++++++++++++++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 17 +++- src/include/access/xlog.h | 2 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 181 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2c999fd3eb..cf0318f697 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index 856495c952..0f2b9992f7 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3989f6e54a..f4cab30d5d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -873,7 +873,8 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, int64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9290,6 +9291,96 @@ CreateRestartPoint(int flags) return true; } +/* + * Detect availability of the record at given targetLSN. + * + * targetLSN is restart_lsn of a slot. + * walsender_pid is the slot's walsender PID. + * restBytes is the pointer to uint64 variable, to store the remaining bytes + * until the slot goes into "losing" state. + * + * Returns in for kinds of strings. + * + * "streaming" means targetLSN is available because it is in the range of + * max_wal_size. + * + * "keeping" means it is still available by preserving extra segments beyond + * max_wal_size. + * + * "losing" means it is being removed or already removed but the walsender + * that using the given slot is keeping repliation stream yet. The state may + * return to "keeping" or "streaming" state if the walsender advances + * restart_lsn. + * + * "lost" means it is definitly lost. The walsender worked on the slot has + * been stopped. + * + * returns NULL if restart_lsn is invalid. + * + * -1 is stored to restBytes if the values is useless. 
+ */ +char * +GetLsnAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid, + int64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + + Assert(restBytes); + + /* the case where the slot has never been activated */ + if (XLogRecPtrIsInvalid(restart_lsn)) + { + *restBytes = -1; + return NULL; + } + + /* + * slot limitation is not activated, WAL files are kept unlimitedlllly in + * the case. + */ + if (max_slot_wal_keep_size_mb < 0) + { + *restBytes = -1; + return "streaming"; + } + + currpos = GetXLogWriteRecPtr(); + + /* oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr, restart_lsn, + restBytes); + + /* oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + oldestSeg = currSeg - + ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + /* restartSeg is within max_wal_size */ + if (oldestSeg <= restartSeg) + return "streaming"; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return "keeping"; + + /* it is useless for the states below */ + *restBytes = -1; + + /* no longer protected, but the working walsender can advance restart_lsn */ + if (walsender_pid != 0) + return "losing"; + + /* definitely lost. stopped walsender can no longer restart */ + return "lost"; +} + /* * Returns minimum segment number that the next checkpoint must leave * considering wal_keep_segments, replication slots and @@ -9297,13 +9388,19 @@ CreateRestartPoint(int flags) * * currLSN is the current insert location. * minSlotLSN is the minimum restart_lsn of all active slots. + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, int64 *restBytes) { XLogSegNo currSeg; XLogSegNo minSlotSeg; uint64 keepSegs = 0; /* # of segments actually kept */ + uint64 limitSegs = 0; /* # of maximum segments possibly kept */ XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9318,8 +9415,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Reduce it if slots already reserves too many. */ @@ -9327,9 +9422,54 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* ditto for limitSegs */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, calculate the remaining LSN bytes until the slot gives up + * keeping WAL records. 
+ */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (currSeg <= targetSeg + limitSegs) + { + uint64 restbytes; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + *restBytes = restbytes; + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9359,7 +9499,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. */ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, + NULL); /* * Warn the checkpoint is going to flush the segments required by diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index ea4c85e395..6a9491e64a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -849,7 +849,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 808a6f5b83..5c65c116d2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -221,7 +221,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -276,6 +276,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + char *walstate; + int64 remaining_bytes; int i; if (!slot->in_use) @@ -343,6 +345,19 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = + GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes); + + if (walstate) + values[i++] = CStringGetTextDatum(walstate); + else + nulls[i++] = true; + + if (remaining_bytes >= 0) + values[i++] = Int64GetDatum(remaining_bytes); + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index b355452072..b021d12835 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -305,6 +305,8 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid, + int64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 0902dce5f1..300d868980 100644 
--- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9844,9 +9844,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 210e9cd146..74c44891a4 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1461,8 +1461,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.16.3 From 306381b4ca2ec2335f73d7f7ebfc3c4a96d53dd9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. 
--- src/test/perl/PostgresNode.pm | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 6019f37f91..c7e138c121 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -694,6 +694,10 @@ port = $port $self->append_conf('postgresql.conf', "unix_socket_directories = '$host'"); } + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.16.3 From 19c4243383229ccd200bf0ac30744a37b5b5695e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH 4/6] TAP test for the slot limit feature --- src/test/recovery/t/018_replslot_limit.pl | 202 ++++++++++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..4b41a68faa --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,202 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 3MB +log_checkpoints = yes +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "streaming" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|t", 'check that restart_lsn 
is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 4; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|3072 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 6; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|streaming|5120 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 2 MB. +advance_wal($node_master, 2); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. 
+advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 5); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". + ".*Slot rep1 lost 1 segment\\(s\\)\\.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.16.3 From 531d600ebe100b53568f4b1ddf2defb2722abbd4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 37 +++++++++++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 +++++++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 +++++--- 3 files changed, 65 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 68ad5071ca..3605e34149 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9975,6 +9975,43 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, + <literal>losing</literal> or <literal>lost</literal>. 
+ <literal>streaming</literal> means that the claimed records are + available within max_wal_size. <literal>keeping</literal> means + max_wal_size is exceeded but still held by replication slots or + wal_keep_segments. + <literal>losing</literal> means that some of them are on the verge of + removal but the using session may go furether. + <literal>lost</literal> means that some of them are definitely lost and + the session that used this slot cannot continue replication. This state + also implies the using session has been stopped. + + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes WAL location (LSN) can advance that bytes + until this slot may lose required WAL + records. If <structfield>restart_lsn</structfield> is null + or <structfield>wal_status</structfield> is <literal>losing</literal> + or <literal>lost</literal>, this field is null. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c91e3e1550..c345538c8f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3650,6 +3650,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL records. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 543691dad4..ae8c3a2aca 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. 
On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.16.3 From b5d6d702e0d36fde05323b647baa7b36e86f273b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH 6/6] Check removal of in-reading segment file. Checkpoints can recycle a segment file while it is being read by ReadRecord and that leads to an apparently odd error message during logical decoding. This patch explicitly checks that then error out immediately. Reading a recycled file is safe. Inconsistency caused by overwrites as a new segment are caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. --- src/backend/access/transam/xlogreader.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 33ccfc1553..4999892932 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -27,6 +27,7 @@ #ifndef FRONTEND #include "miscadmin.h" +#include "access/xlog.h" #include "utils/memutils.h" #endif @@ -225,7 +226,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -271,6 +274,22 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.16.3
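To make the numbers in the revised TAP test concrete: with the 1 MB segments and max_slot_wal_keep_size = 4MB used there, a slot that has just caught up should report a remain of roughly the keep-size budget plus whatever is still unfilled in the current segment. A throwaway check of the "5120 kB" expectation, under exactly those assumptions:

    -- assumes wal-segsize=1 and max_slot_wal_keep_size=4MB as in the test script
    SELECT pg_size_pretty(
               (4 * 1024 * 1024)::bigint   -- max_slot_wal_keep_size
             + (1 * 1024 * 1024)::bigint   -- room left in the current segment
           ) AS expected_remain;           -- reports "5120 kB"

After two more 1 MB segments are written (the next advance_wal call), the same arithmetic drops to about 3072 kB, matching the following assertion in the script.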
Hello I have a couple of API-level reservations about this patch series. Firstly, "behind" when used as a noun refers to buttocks. Therefore, the ReplicationSlotsEnumerateBehinds function name seems funny (I think when used as a preposition you wouldn't put it in plural). I don't suggest a substitute name, because the API itself doesn't convince me; I think it would be sufficient to have it return a single slot name, perhaps the one that is behind the most ... or maybe the one that is behind the least? This simplifies a lot of code (in particular you do away with the bunch of statics, right?), and I don't think the warning message loses anything, because for details the user should really look into the monitoring view anyway. I didn't like GetLsnAvailability() returning a string either. It seems more reasonable to me to define an enum with possible return states, and have the enum value be expanded to some string in pg_get_replication_slots(). In the same function, I think that setting restBytes to -1 when "useless" is bad style. I would just leave that variable alone when the returned status is not one that receives the number of bytes. So the caller is only entitled to read the value if the returned enum value is such-and-such ("keeping" and "streaming" I think). I'm somewhat uncomfortable with the API change to GetOldestKeepSegment in 0002. Can't its caller do the math itself instead? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
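On the point that details belong in the monitoring view rather than in the WARNING text, the per-slot list that ReplicationSlotsEnumerateBehinds builds is essentially recoverable with a query like this sketch (the :'cutoff_lsn' psql variable is hypothetical, standing in for the boundary LSN the checkpoint computed):

    -- hypothetical psql sketch; e.g. \set cutoff_lsn '0/5000000'
    SELECT slot_name, restart_lsn
    FROM pg_replication_slots
    WHERE NOT active
      AND restart_lsn < :'cutoff_lsn'::pg_lsn;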
On Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Thanks for reviewing! > > At Thu, 27 Jun 2019 16:22:56 +0200, Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote in <20190627162256.4f4872b8@firost> > > Hi all, > > > > Being interested by this feature, I did a patch review. > > > > This features adds the GUC "max_slot_wal_keep_size". This is the maximum > > amount of WAL that can be kept in "pg_wal" by active slots. > > > > If the amount of WAL is superior to this limit, the slot is deactivated and > > its status (new filed in pg_replication_slot) is set as "lost". > > This patch doesn't deactivate walsender. A walsender stops by > itself when it finds that it cannot continue ongoing replication. Sure, sorry for the confusion, I realize my sentence is ambiguous. Thanks for the clarification. [...] > > In "pg_replication_slots" view, the new "wal_status" field is misleading. > > Consider this sentence and the related behavior from documentation > > (catalogs.sgml): > > > > <literal>keeping</literal> means that some of them are to be removed by > > the next checkpoint. > > > > "keeping" appears when the current checkpoint will delete some WAL further > > than "current_lsn - max_slot_wal_keep_size", but still required by at least > > one slot. As some WAL required by some slots will be deleted quite soon, > > probably before anyone can react, "keeping" status is misleading here. We > > are already in the red zone. > > It may be "losing", which would be less misleading. Indeed, "loosing" is a better match for this state. However, what's the point of this state from the admin point of view? In various situation, the admin will have no time to react immediately and fix whatever could help. How useful is this specific state? > > I would expect this "wal_status" to be: > > > > - streaming: slot lag between 0 and "max_wal_size" > > - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". the > > slot actually protect some WALs from being deleted > > - lost: slot lag superior of max_slot_wal_keep_size. The slot couldn't > > protect some WAL from deletion > > I agree that comparing to max_wal_size is meaningful. The revised > version behaves as that. The v16-0006 patch doesn't apply anymore because of commit 709d003fbd. Here is the fix: --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -304,7 +304,7 - XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); I suppose you might have more refactoring to do in regard with Alvaro's review. I confirm the new patch behaves correctly in my tests in regard with the "wal_status" field. > > Documentation follows with: > > > > The last two states are seen only when max_slot_wal_keep_size is > > non-negative > > > > This is true with the current behavior. However, if "keeping" is set as > > soon as te slot lag is superior than "max_wal_size", this status could be > > useful even with "max_slot_wal_keep_size = -1". As soon as a slot is > > stacking WALs that should have been removed by previous checkpoint, it > > "keeps" them. > > I revised the documentation that way. Both > view-pg-replication-slots.html and > runtime-config-replication.html are reworded. + <entry>Availability of WAL records claimed by this + slot. <literal>streaming</literal>, <literal>keeping</literal>, Slots are keeping WALs, not WAL records. 
Shouldn't it be "Availability of WAL files claimed by this slot"? + <literal>streaming</literal> means that the claimed records are + available within max_wal_size. <literal>keeping</literal> means I wonder if streaming is the appropriate name here. The WALs required might be available for streaming, but the slot not active, thus not "streaming". What about merging with the "active" field, in the same fashion as pg_stat_activity.state? We would have an enum "pg_replication_slots.state" with the following states: * inactive: non active slot * active: activated, required WAL within max_wal_size * keeping: activated, max_wal_size is exceeded but still held by replication slots or wal_keep_segments. * lost: some WAL are definitely lost Thoughts? [...] > > * "remain" should be NULL if "max_slot_wal_keep_size=-1 or if the slot isn't > > active > > The revised version shows the following statuses. > > streaming / NULL max_slot_wal_keep_size is -1 > unkown / NULL mswks >= 0 and restart_lsn is invalid > <status> / <bytes> elsewise Works for me. > > * the "lost" status should be a definitive status > > * it seems related, but maybe the "wal_status" should be set as "lost" > > only when the slot has been deactivate ? > > Agreed. While replication is active, if required segments seems > to be lost once, delayed walreceiver ack can advance restart_lsn > to "safe" zone later. So, in the revised version, if the segment > for restart_lsn has been removed, GetLsnAvailablity() returns > "losing" if walsender is active and "lost" if not. ok. > > * logs should warn about a failing slot as soon as it is effectively > > deactivated, not before. > > Agreed. Slots on which walsender is running are exlucded from the > output of ReplicationSlotsEnumerateBehnds. As theresult the "some > replcation slots lost.." is emitted after related walsender > stops. Once a slot lost WALs and has been deactivated, the following message appears during every checkpoints: WARNING: some replication slots have lost required WAL segments DETAIL: Slot slot_limit_st lost 177 segment(s) I wonder if this is useful to show these messages for slots that were already dead before this checkpoint? Regards,
On Wed, Oct 02, 2019 at 05:08:07PM +0200, Jehan-Guillaume de Rorthais wrote: > I wonder if this is useful to show these messages for slots that were already > dead before this checkpoint? This thread has been waiting for input from the patch author, Horiguchi-san, for a couple of months now, so I am switching it to returned with feedback in the CF app. -- Michael
I'm very sorry for being late to reply. At Wed, 2 Oct 2019 17:08:07 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in > On Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time) > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > In "pg_replication_slots" view, the new "wal_status" field is misleading. > > > Consider this sentence and the related behavior from documentation > > > (catalogs.sgml): > > > > > > <literal>keeping</literal> means that some of them are to be removed by > > > the next checkpoint. > > > > > > "keeping" appears when the current checkpoint will delete some WAL further > > > than "current_lsn - max_slot_wal_keep_size", but still required by at least > > > one slot. As some WAL required by some slots will be deleted quite soon, > > > probably before anyone can react, "keeping" status is misleading here. We > > > are already in the red zone. > > > > It may be "losing", which would be less misleading. > > Indeed, "loosing" is a better match for this state. > > However, what's the point of this state from the admin point of view? In various > situation, the admin will have no time to react immediately and fix whatever > could help. > > How useful is this specific state? If we assume "losing" segments as "lost", a segment once "lost" can return to "keeping" or "streaming" state. That is intuitively impossible. On the other hand if we assume it as "keeping", it should not be removed by the next checkpoint but actually it can be removed. The state "losing" means such a unstable state different from both "lost" and "keeping". > > > I would expect this "wal_status" to be: > > > > > > - streaming: slot lag between 0 and "max_wal_size" > > > - keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size". the > > > slot actually protect some WALs from being deleted > > > - lost: slot lag superior of max_slot_wal_keep_size. The slot couldn't > > > protect some WAL from deletion > > > > I agree that comparing to max_wal_size is meaningful. The revised > > version behaves as that. > > The v16-0006 patch doesn't apply anymore because of commit 709d003fbd. Here is > the fix: > > --- a/src/backend/access/transam/xlogreader.c > +++ b/src/backend/access/transam/xlogreader.c > @@ -304,7 +304,7 > - XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size); > + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); > > I suppose you might have more refactoring to do in regard with Alvaro's > review. > > I confirm the new patch behaves correctly in my tests in regard with the > "wal_status" field. Thanks for testing. I fixed it in the attached patch. > + <entry>Availability of WAL records claimed by this > + slot. <literal>streaming</literal>, <literal>keeping</literal>, > > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of WAL > files claimed by this slot"? I choosed "record" since a slot points a record. I'm not sure but I'm fine with "file". Fixed catalogs.sgml and config.sgml that way. > + <literal>streaming</literal> means that the claimed records are > + available within max_wal_size. <literal>keeping</literal> means > > I wonder if streaming is the appropriate name here. The WALs required might be > available for streaming, but the slot not active, thus not "streaming". What > about merging with the "active" field, in the same fashion as > pg_stat_activity.state? 
We would have an enum "pg_replication_slots.state" with > the following states: > * inactive: non active slot > * active: activated, required WAL within max_wal_size > * keeping: activated, max_wal_size is exceeded but still held by replication > slots or wal_keep_segments. > * lost: some WAL are definitely lost > > Thoughts? In the first place, I realized that I was missed a point about the relationship between max_wal_size and max_slot_wal_keep_size here. Since the v15 of this patch, GetLsnAvailablity returns "streaming" when the restart_lsn is within max_wal_size. That behavior makes sense when max_slot_wal_keep_size > max_wal_size. However, in the contrary case, restart_lsn could be lost even it is withinmax_wal_size. So we would see "streaming" (or "normal") even though restart_lsn is already lost. That is broken. In short, the "streaming/normal" state is useless if max_slot_wal_keep_size < max_wal_size. Finally I used the following wordings. (there's no "inactive" wal_state) * normal: required WAL within max_wal_size when max_slot_wal_keep_size is larger than max_wal_size. * keeping: required segments are held by replication slots or wal_keep_segments. * losing: required segments are about to be removed or may be already removed but streaming is not dead yet. * lost: cannot continue streaming using this slot. > [...] > > > * "remain" should be NULL if "max_slot_wal_keep_size=-1 or if the slot isn't > > > active > > > > The revised version shows the following statuses. > > > > streaming / NULL max_slot_wal_keep_size is -1 > > unkown / NULL mswks >= 0 and restart_lsn is invalid > > <status> / <bytes> elsewise > > Works for me. Thanks. > > > * the "lost" status should be a definitive status > > > * it seems related, but maybe the "wal_status" should be set as "lost" > > > only when the slot has been deactivate ? > > > > Agreed. While replication is active, if required segments seems > > to be lost once, delayed walreceiver ack can advance restart_lsn > > to "safe" zone later. So, in the revised version, if the segment > > for restart_lsn has been removed, GetLsnAvailablity() returns > > "losing" if walsender is active and "lost" if not. > > ok. > > > > * logs should warn about a failing slot as soon as it is effectively > > > deactivated, not before. > > > > Agreed. Slots on which walsender is running are exlucded from the > > output of ReplicationSlotsEnumerateBehnds. As theresult the "some > > replcation slots lost.." is emitted after related walsender > > stops. > > Once a slot lost WALs and has been deactivated, the following message appears > during every checkpoints: > > WARNING: some replication slots have lost required WAL segments > DETAIL: Slot slot_limit_st lost 177 segment(s) > > I wonder if this is useful to show these messages for slots that were already > dead before this checkpoint? Makes sense. I changed KeepLogSeg so that it emits the message only on slot_names changes. The attached v17 patch is changed in the follwing points. - Rebased to the current master. - Change KeepLogSeg not to emit the message "Slot %s lost %ld segment(s)" if the slot list is not changed. - Documentation is fixed following the change of state names. - Change GetLsnAvailability returns more correct state for wider situations. It returned a wrong status when max_slot_wal_keep_size is smaller than max_wal_size, or when max_slot_wal_keep_size is increased so that the new value covers the restart_lsn of a slot that have lost required segments in the old setting. 
Since it is needed by the above change, I revived GetOldestXLogFileSegNo() that was removed in v15 as FindOldestXLogFileSegNo() in a bit different shape. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 094d8c12a9c040afcaefdd9a3e93b575b1e2f504 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH v17 1/6] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- src/backend/access/transam/xlog.c | 137 +++++++++++++++--- src/backend/replication/slot.c | 65 +++++++++ src/backend/utils/misc/guc.c | 12 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 194 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index edee0c0f22..ba6b9b0d4f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -104,6 +104,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -871,6 +872,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9322,6 +9324,54 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. 
*/ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9333,38 +9383,79 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - XLogSegNo slotSegNo; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + static char *prev_slot_names = NULL; + char *slot_names; + int nslots; - XLByteToSeg(keep, slotSegNo, wal_segment_size); + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + /* + * Some of the affected slots could have just been removed. We + * don't need show anything here if no affected slots are + * remaining. + */ + if (slot_names) + { + if (prev_slot_names == NULL || + strcmp(slot_names, prev_slot_names) != 0) + { + MemoryContext oldcxt; + + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + + if (prev_slot_names) + pfree(prev_slot_names); + oldcxt = MemoryContextSwitchTo(TopMemoryContext); + prev_slot_names = pstrdup(slot_names); + MemoryContextSwitchTo(oldcxt); + } + } + } + prev_lost_segs = lost_segs; + } + else + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 21ae8531b3..030d17f0bf 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -49,6 +49,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/memutils.h" /* * Replication slot on-disk data structure. 
@@ -1064,6 +1065,70 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns names of inactive replication slots that their restart_lsn are + * behind specified LSN for the purpose of error message character array + * stuffed with slot names delimited by the given separator. Returns NULL if no + * slot matches. If pnslots is given, the number of the returned slots is + * returned there. The returned array is palloc'ed in TopMemoryContext. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); + initStringInfo(&retstr); + MemoryContextSwitchTo(oldcxt); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + /* + * We are collecting slots that are definitely behind the given target + * LSN. Active slots are exluded since they can catch up later. + */ + if (s->in_use && s->active_pid == 0 && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * Slot names consist only with lower-case letters. We don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL instead of an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8d951ce404..154ac2237e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2682,6 +2682,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 087190ce63..5541a882b6 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -287,6 +287,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 3fea1993bc..c454e9a061 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 3a5763fb07..e07caf9a13 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -198,6 +198,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.23.0 From 183024ca7e9abcf175e0a454fe8d6f9adc9e6089 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH v17 2/6] Add monitoring aid for max_slot_wal_keep_size Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, replication connections may lose sync by a long delay. The "status" column shows whether the slot is reconnectable or not, or about to lose reserving WAL segments. The "remain" column shows the remaining bytes of WAL that can be advance until the slot loses required WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 234 ++++++++++++++++++++++++- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slotfuncs.c | 17 +- src/include/access/xlog.h | 3 + src/include/catalog/pg_proc.dat | 6 +- src/test/regress/expected/rules.out | 6 +- 8 files changed, 259 insertions(+), 17 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2c999fd3eb..cf0318f697 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index 856495c952..0f2b9992f7 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ba6b9b0d4f..bb6bfda529 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -872,7 +872,8 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); -static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr, + XLogRecPtr targetLSN, int64 *restBytes); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -3900,6 +3901,55 @@ XLogGetLastRemovedSegno(void) return lastRemovedSegNo; } +/* + * Return the oldest WAL segment file. + * + * The returned value is XLogGetLastRemovedSegno() + 1 when the function + * returns a valid value. Otherwise this function scans over WAL files and + * finds the oldest segment at the first time, which could be very slow. + */ +XLogSegNo +FindOldestXLogFileSegNo(void) +{ + static XLogSegNo lastFoundOldestSeg = 0; + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = XLogGetLastRemovedSegno(); + + if (segno > 0) + return segno + 1; + + if (lastFoundOldestSeg > 0) + return lastFoundOldestSeg; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * Get minimum segment ignoring timeline ID, the same way with + * RemoveOldXlogFiles(). 
+ */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + lastFoundOldestSeg = segno; + + return segno; +} + /* * Update the last removed segno pointer in shared memory, to reflect * that the given XLOG file has been removed. @@ -9324,6 +9374,124 @@ CreateRestartPoint(int flags) return true; } +/* + * Detect availability of the record at given targetLSN. + * + * targetLSN is restart_lsn of a slot. + * walsender_pid is the slot's walsender PID. + * restBytes is the pointer to uint64 variable, to store the remaining bytes + * until the slot goes into "losing" state. + * + * Returns in four kinds of strings. + * + * "normal" means targetLSN is available because it is in the range of + * max_wal_size. + * + * "keeping" means it is still available by preserving extra segments beyond + * max_wal_size. + * + * "losing" means it is being removed or already removed but the walsender + * using the given slot is keeping repliation stream yet. The state may return + * to "keeping" or "normal" state if the walsender advances restart_lsn. + * + * "lost" means it is definitly lost. The walsender worked on the slot has + * been stopped. + * + * returns NULL if restart_lsn is invalid. + * + * -1 is stored to restBytes if the values is useless. + */ +char * +GetLsnAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid, + int64 *restBytes) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + uint64 keepSegs; + + Assert(restBytes); + + /* the case where the slot has never been activated */ + if (XLogRecPtrIsInvalid(restart_lsn)) + { + *restBytes = -1; + return NULL; + } + + /* + * slot limitation is not activated, WAL files are kept unlimitedlllly in + * the case. + */ + if (max_slot_wal_keep_size_mb < 0) + { + *restBytes = -1; + return "normal"; + } + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr, restart_lsn, + restBytes); + + /* find the oldest segment file actually exists */ + oldestSeg = FindOldestXLogFileSegNo(); + + /* calculate oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. + */ + if (max_slot_wal_keep_size_mb >= max_wal_size_mb && + oldestSegMaxWalSize <= restartSeg) + return "normal"; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return "keeping"; + } + + /* it is useless for the states below */ + *restBytes = -1; + + /* + * The segment is alrady lost or being lost. 
If the oldest segment is just + * after the restartSeg, running walsender may be reading the just removed + * segment. The walsender may safely move to the oldest existing segment in + * that case. + */ + if (oldestSeg == restartSeg + 1 && walsender_pid != 0) + return "losing"; + + /* definitely lost. stopped walsender can no longer restart */ + return "lost"; +} + /* * Returns minimum segment number that the next checkpoint must leave * considering wal_keep_segments, replication slots and @@ -9331,13 +9499,19 @@ CreateRestartPoint(int flags) * * currLSN is the current insert location. * minSlotLSN is the minimum restart_lsn of all active slots. + * targetLSN is used when restBytes is not NULL. + * + * If restBytes is not NULL, sets the remaining LSN bytes until the segment + * for targetLSN will be removed. */ static XLogSegNo -GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, + XLogRecPtr targetLSN, int64 *restBytes) { XLogSegNo currSeg; XLogSegNo minSlotSeg; uint64 keepSegs = 0; /* # of segments actually kept */ + uint64 limitSegs = 0; /* # of maximum segments possibly kept */ XLByteToSeg(currLSN, currSeg, wal_segment_size); XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); @@ -9352,8 +9526,6 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) /* Cap keepSegs by max_slot_wal_keep_size */ if (max_slot_wal_keep_size_mb >= 0) { - uint64 limitSegs; - limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); /* Reduce it if slots already reserves too many. */ @@ -9361,9 +9533,54 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) keepSegs = limitSegs; } - /* but, keep at least wal_keep_segments segments if any */ - if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) - keepSegs = wal_keep_segments; + if (wal_keep_segments > 0) + { + /* but, keep at least wal_keep_segments segments if any */ + if (keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* ditto for limitSegs */ + if (limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + } + + /* + * If requested, calculate the remaining LSN bytes until the slot gives up + * keeping WAL records. + */ + if (restBytes) + { + uint64 fragbytes; + XLogSegNo targetSeg; + + *restBytes = 0; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (currSeg <= targetSeg + limitSegs) + { + uint64 restbytes; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + *restBytes = restbytes; + } + } /* avoid underflow, don't go below 1 */ if (currSeg <= keepSegs) @@ -9393,7 +9610,8 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) /* * We should keep certain number of WAL segments after this checkpoint. 
*/ - minSegNo = GetOldestKeepSegment(recptr, slotminptr); + minSegNo = GetOldestKeepSegment(recptr, slotminptr, InvalidXLogRecPtr, + NULL); /* * Warn the checkpoint is going to flush the segments required by diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index f7800f01a6..2fe346461d 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -854,7 +854,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 46e6dd4d12..7db48aaa14 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -221,7 +221,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -276,6 +276,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + char *walstate; + int64 remaining_bytes; int i; if (!slot->in_use) @@ -343,6 +345,19 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = + GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes); + + if (walstate) + values[i++] = CStringGetTextDatum(walstate); + else + nulls[i++] = true; + + if (remaining_bytes >= 0) + values[i++] = Int64GetDatum(remaining_bytes); + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index c454e9a061..1f92e87a25 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -268,6 +268,7 @@ extern int XLogFileOpen(XLogSegNo segno); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void); +extern XLogSegNo FindOldestXLogFileSegNo(void); extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn); @@ -304,6 +305,8 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern char *GetLsnAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid, + int64 *restBytes); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index ac8f64b219..3887eb3ce0 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9873,9 +9873,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => 
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 80a07825b9..6fc5251536 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1461,8 +1461,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.23.0 From b698117b0987ed7545620cb36923012e3e5d474f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH v17 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 270bd6c856..20a586245b 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -698,6 +698,10 @@ port = $port $self->append_conf('postgresql.conf', "unix_socket_directories = '$host'"); } + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.23.0 From a7286ab37010239b5c44cfc2ae8e47ecb8406072 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH v17 4/6] TAP test for the slot limit feature --- src/test/recovery/t/018_replslot_limit.pl | 202 ++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..87080647c5 --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,202 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 4MB +log_checkpoints = yes +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 4 MB. +advance_wal($node_master, 4); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "some replication slots have lost required WAL segments\n". 
+ ".*Slot rep1 lost 1 segment\\(s\\)\\.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.23.0 From 11c8693e2037858e149a27e4726c6c0cca7a2064 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH v17 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 37 +++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 ++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 ++++--- 3 files changed, 65 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 55694c4368..ec178d77cd 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9969,6 +9969,43 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this + slot. <literal>normal</literal>, <literal>keeping</literal>, + <literal>losing</literal> or <literal>lost</literal>. + <literal>normal</literal> means that the claimed files are + available within max_wal_size. <literal>keeping</literal> means + max_wal_size is exceeded but still held by replication slots or + wal_keep_segments. + <literal>losing</literal> means that some of them are on the verge of + removal but the using session may go further. + <literal>lost</literal> means that some of them are definitely lost and + the session that used this slot cannot continue replication. This state + also implies the using session has been stopped. + + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes WAL location (LSN) can advance that bytes + until this slot may lose required WAL + files. 
If <structfield>restart_lsn</structfield> is null + or <structfield>wal_status</structfield> is <literal>losing</literal> + or <literal>lost</literal>, this field is null. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d83ce8842f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3730,6 +3730,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index bc4d98fe03..328464c240 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.23.0 From b1866015543b3cd91bc9a7a408c9d019003e6e09 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH v17 6/6] Check removal of in-reading segment file. Checkpoints can recycle a segment file while it is being read by ReadRecord and that leads to an apparently odd error message during logical decoding. This patch explicitly checks that then error out immediately. Reading a recycled file is safe. Inconsistency caused by overwrites as a new segment are caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 67418b05f1..f98ce0fe48 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -246,7 +246,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -292,6 +294,22 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.23.0
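As a rough usage sketch of what the v17 patch set aims at: the following assumes the max_slot_wal_keep_size GUC and the wal_status/remain view columns added by these patches; the slot name (rep1, as in the TAP test) and the size are example values only.

  -- cap the WAL that replication slots may retain (example value)
  ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';
  SELECT pg_reload_conf();

  -- as WAL accumulates, a lagging slot is expected to move from 'normal'
  -- through 'keeping' and 'losing' to 'lost'; remain shows how much more
  -- WAL can be written before the slot starts to lose required segments
  SELECT slot_name, restart_lsn, wal_status, pg_size_pretty(remain) AS remain_pretty
    FROM pg_replication_slots
   WHERE slot_name = 'rep1';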
At Tue, 24 Dec 2019 21:26:14 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > The attached v17 patch is changed in the follwing points. > > - Rebased to the current master. > > - Change KeepLogSeg not to emit the message "Slot %s lost %ld > segment(s)" if the slot list is not changed. > > - Documentation is fixed following the change of state names. > > - Change GetLsnAvailability returns more correct state for wider > situations. It returned a wrong status when max_slot_wal_keep_size > is smaller than max_wal_size, or when max_slot_wal_keep_size is > increased so that the new value covers the restart_lsn of a slot > that have lost required segments in the old setting. > > Since it is needed by the above change, I revived > GetOldestXLogFileSegNo() that was removed in v15 as > FindOldestXLogFileSegNo() in a bit different shape. I'd like to re-enter this patch to the next cf. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, First, it seems you did not reply to Alvaro's concerns in your new set of patch. See: https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql On Tue, 24 Dec 2019 21:26:14 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: [...] > > Indeed, "loosing" is a better match for this state. > > > > However, what's the point of this state from the admin point of view? In > > various situation, the admin will have no time to react immediately and fix > > whatever could help. > > > > How useful is this specific state? > > If we assume "losing" segments as "lost", a segment once "lost" can > return to "keeping" or "streaming" state. That is intuitively > impossible. On the other hand if we assume it as "keeping", it should > not be removed by the next checkpoint but actually it can be > removed. The state "losing" means such a unstable state different from > both "lost" and "keeping". OK, indeed. But I'm still unconfortable with this "unstable" state. It would be better if we could grab a stable state: either "keeping" or "lost". > > + <entry>Availability of WAL records claimed by this > > + slot. <literal>streaming</literal>, <literal>keeping</literal>, > > > > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of > > WAL files claimed by this slot"? > > I choosed "record" since a slot points a record. I'm not sure but I'm > fine with "file". Fixed catalogs.sgml and config.sgml that way. Thanks. > > + <literal>streaming</literal> means that the claimed records are > > + available within max_wal_size. <literal>keeping</literal> means > > > > I wonder if streaming is the appropriate name here. The WALs required might > > be available for streaming, but the slot not active, thus not "streaming". > > What about merging with the "active" field, in the same fashion as > > pg_stat_activity.state? We would have an enum "pg_replication_slots.state" > > with the following states: > > * inactive: non active slot > > * active: activated, required WAL within max_wal_size > > * keeping: activated, max_wal_size is exceeded but still held by replication > > slots or wal_keep_segments. > > * lost: some WAL are definitely lost > > > > Thoughts? > > In the first place, I realized that I was missed a point about the > relationship between max_wal_size and max_slot_wal_keep_size > here. Since the v15 of this patch, GetLsnAvailablity returns > "streaming" when the restart_lsn is within max_wal_size. That behavior > makes sense when max_slot_wal_keep_size > max_wal_size. However, in > the contrary case, restart_lsn could be lost even it is > withinmax_wal_size. So we would see "streaming" (or "normal") even > though restart_lsn is already lost. That is broken. > > In short, the "streaming/normal" state is useless if > max_slot_wal_keep_size < max_wal_size. Good catch! > Finally I used the following wordings. > > (there's no "inactive" wal_state) > > * normal: required WAL within max_wal_size when max_slot_wal_keep_size > is larger than max_wal_size. > * keeping: required segments are held by replication slots or > wal_keep_segments. > > * losing: required segments are about to be removed or may be already > removed but streaming is not dead yet. As I wrote, I'm still uncomfortable with this state. Maybe we should ask other reviewers opinions on this. [...] 
> > WARNING: some replication slots have lost required WAL segments > > DETAIL: Slot slot_limit_st lost 177 segment(s) > > > > I wonder if this is useful to show these messages for slots that were > > already dead before this checkpoint? > > Makes sense. I changed KeepLogSeg so that it emits the message only on > slot_names changes. Thanks. Bellow some code review. In regard with FindOldestXLogFileSegNo(...): > /* > * Return the oldest WAL segment file. > * > * The returned value is XLogGetLastRemovedSegno() + 1 when the function > * returns a valid value. Otherwise this function scans over WAL files and > * finds the oldest segment at the first time, which could be very slow. > */ > XLogSegNo > FindOldestXLogFileSegNo(void) The comment is not clear to me. I suppose "at the first time" might better be expressed as "if none has been removed since last startup"? Moreover, what about patching XLogGetLastRemovedSegno() itself instead of adding a function? In regard with GetLsnAvailability(...): > /* > * Detect availability of the record at given targetLSN. > * > * targetLSN is restart_lsn of a slot. Wrong argument name. It is called "restart_lsn" in the function declaration. > * restBytes is the pointer to uint64 variable, to store the remaining bytes > * until the slot goes into "losing" state. I'm not convinced with this argument name. What about "remainingBytes"? Note that you use remaining_bytes elsewhere in your patch. > * -1 is stored to restBytes if the values is useless. What about returning a true negative value when the slot is really lost? All in all, I feel like this function is on the fence between being generic because of its name and being slot-only oriented because of the first parameter name, use of "max_slot_wal_keep_size_mb", returned status and "slotPtr". I wonder if it should be more generic and stay here or move to xlogfuncs.c with a more specific name? > * slot limitation is not activated, WAL files are kept unlimitedlllly "unlimitedly"? "infinitely"? "unconditionally"? > /* it is useless for the states below */ > *restBytes = -1; This might be set to the real bytes kept, even if status is "losing". > * The segment is alrady lost or being lost. If the oldest segment is just "already" > if (oldestSeg == restartSeg + 1 && walsender_pid != 0) > return "losing"; I wonder if this should be "oldestSeg > restartSeg"? Many segments can be removed by the next or running checkpoint. And a running walsender can send more than one segment in the meantime I suppose? In regard with GetOldestKeepSegment(...): > static XLogSegNo > GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, > XLogRecPtr targetLSN, int64 *restBytes) I wonder if minSlotLSN is really useful as a parameter or if it should be fetched from GetOldestKeepSegment() itself? Currently, XLogGetReplicationSlotMinimumLSN() is always called right before GetOldestKeepSegment() just to fill this argument. > walstate = > GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes); I agree with Alvaro: we might want to return an enum and define the related state string here. Or, if we accept negative remaining_bytes, GetLsnAvailability might even only return remaining_bytes and we deduce the state directly from here. Regards,
Hello, Jehan. At Wed, 22 Jan 2020 17:47:23 +0100, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in > Hi, > > First, it seems you did not reply to Alvaro's concerns in your new set of > patch. See: > > https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql Mmmm. Thank you very much for noticing that, Jehan, and sorry for overlooking it, Alvaro. At Tue, 17 Sep 2019 16:58:00 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > suggest a substitute name, because the API itself doesn't convince me; I > think it would be sufficient to have it return a single slot name, > perhaps the one that is behind the most ... or maybe the one that is > behind the least? This simplifies a lot of code (in particular you do > away with the bunch of statics, right?), and I don't think the warning > messages loses anything, because for details the user should really look > into the monitoring view anyway. Ok, I removed the funnily-named function. The message becomes more or less the following. The DETAIL might not be needed. | WARNING: 2 replication slots have lost required WAL segments by 5 segments | DETAIL: Most affected slot is s1. > I didn't like GetLsnAvailability() returning a string either. It seems > more reasonable to me to define a enum with possible return states, and > have the enum value be expanded to some string in > pg_get_replication_slots(). Agreed. Done. > In the same function, I think that setting restBytes to -1 when > "useless" is bad style. I would just leave that variable alone when the > returned status is not one that receives the number of bytes. So the > caller is only entitled to read the value if the returned enum value is > such-and-such ("keeping" and "streaming" I think). That is the only condition. If max_slot_wal_keep_size = -1, the value is useless for the two states. I added that explanation to the comment of Get(Lsn)Walavailability(). > I'm somewhat uncomfortable with the API change to GetOldestKeepSegment > in 0002. Can't its caller do the math itself instead? Mmm. Finally I found that I had merged two calculations that have little relation to each other. You're right here. Thanks. The attached v18 addressed all of your (Alvaro's) comments. > On Tue, 24 Dec 2019 21:26:14 +0900 (JST) > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > If we assume "losing" segments as "lost", a segment once "lost" can > > return to "keeping" or "streaming" state. That is intuitively > > impossible. On the other hand if we assume it as "keeping", it should > > not be removed by the next checkpoint but actually it can be > > removed. The state "losing" means such a unstable state different from > > both "lost" and "keeping". > > OK, indeed. > > But I'm still unconfortable with this "unstable" state. It would be better if > we could grab a stable state: either "keeping" or "lost". I feel the same, but the being-removed WAL segments remain until the checkpoint runs, and even after removal replication can continue if the walsender is reading the removed-but-already-opened file. I'll put more thought into that. > > In short, the "streaming/normal" state is useless if > > max_slot_wal_keep_size < max_wal_size. > > Good catch! Thanks!:) > > Finally I used the following wordings. > > > > (there's no "inactive" wal_state) > > > > * normal: required WAL within max_wal_size when max_slot_wal_keep_size > > is larger than max_wal_size. > > * keeping: required segments are held by replication slots or > > wal_keep_segments.
> > > > * losing: required segments are about to be removed or may be already > > removed but streaming is not dead yet. > > As I wrote, I'm still uncomfortable with this state. Maybe we should ask > other reviewers opinions on this. > > [...] > > > WARNING: some replication slots have lost required WAL segments > > > DETAIL: Slot slot_limit_st lost 177 segment(s) > > > > > > I wonder if this is useful to show these messages for slots that were > > > already dead before this checkpoint? > > > > Makes sense. I changed KeepLogSeg so that it emits the message only on > > slot_names changes. > > Thanks. > > Bellow some code review. Thank you for the review, I don't have a time right now but address the below comments them soon. > In regard with FindOldestXLogFileSegNo(...): > > > /* > > * Return the oldest WAL segment file. > > * > > * The returned value is XLogGetLastRemovedSegno() + 1 when the function > > * returns a valid value. Otherwise this function scans over WAL files and > > * finds the oldest segment at the first time, which could be very slow. > > */ > > XLogSegNo > > FindOldestXLogFileSegNo(void) > > The comment is not clear to me. I suppose "at the first time" might better be > expressed as "if none has been removed since last startup"? > > Moreover, what about patching XLogGetLastRemovedSegno() itself instead of > adding a function? > > In regard with GetLsnAvailability(...): > > > /* > > * Detect availability of the record at given targetLSN. > > * > > * targetLSN is restart_lsn of a slot. > > Wrong argument name. It is called "restart_lsn" in the function > declaration. > > > * restBytes is the pointer to uint64 variable, to store the remaining bytes > > * until the slot goes into "losing" state. > > I'm not convinced with this argument name. What about "remainingBytes"? Note > that you use remaining_bytes elsewhere in your patch. > > > * -1 is stored to restBytes if the values is useless. > > What about returning a true negative value when the slot is really lost? > > All in all, I feel like this function is on the fence between being generic > because of its name and being slot-only oriented because of the first parameter > name, use of "max_slot_wal_keep_size_mb", returned status and "slotPtr". > > I wonder if it should be more generic and stay here or move to xlogfuncs.c with > a more specific name? > > > * slot limitation is not activated, WAL files are kept unlimitedlllly > > "unlimitedly"? "infinitely"? "unconditionally"? > > > /* it is useless for the states below */ > > *restBytes = -1; > > This might be set to the real bytes kept, even if status is "losing". > > > * The segment is alrady lost or being lost. If the oldest segment is just > > "already" > > > if (oldestSeg == restartSeg + 1 && walsender_pid != 0) > > return "losing"; > > I wonder if this should be "oldestSeg > restartSeg"? > Many segments can be removed by the next or running checkpoint. And a running > walsender can send more than one segment in the meantime I suppose? > > In regard with GetOldestKeepSegment(...): > > > static XLogSegNo > > GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, > > XLogRecPtr targetLSN, int64 *restBytes) > > I wonder if minSlotLSN is really useful as a parameter or if it should be > fetched from GetOldestKeepSegment() itself? Currently, > XLogGetReplicationSlotMinimumLSN() is always called right before > GetOldestKeepSegment() just to fill this argument. 
> > > walstate = > > GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes); > > I agree with Alvaro: we might want to return an enum and define the related > state string here. Or, if we accept negative remaining_bytes, GetLsnAvailability > might even only return remaining_bytes and we deduce the state directly from > here. > > Regards, regards. -- Kyotaro Horiguchi NTT Open Source Software Center From cf652ad945242ec7591c62d76de7cf2f81065f9e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH v18 1/6] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- src/backend/access/transam/xlog.c | 141 ++++++++++++++---- src/backend/replication/slot.c | 65 ++++++++ src/backend/utils/misc/guc.c | 12 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/replication/slot.h | 1 + 6 files changed, 196 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7f4f784c0e..7015300c77 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -104,6 +104,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -871,6 +872,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -9320,6 +9322,54 @@ CreateRestartPoint(int flags) return true; } +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. 
*/ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9331,38 +9381,79 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo slotSegNo; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. + */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; + static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + + XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + + if (slotSegNo < minSegNo) + { + XLogSegNo lost_segs = minSegNo - slotSegNo; + if (prev_lost_segs != lost_segs) + { + /* We have lost a new segment, warn it.*/ + XLogRecPtr minlsn; + static char *prev_slot_names = NULL; + char *slot_names; + int nslots; + + XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); + slot_names = + ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + + /* + * Some of the affected slots could have just been removed. We + * don't need show anything here if no affected slots are + * remaining. + */ + if (slot_names) + { + if (prev_slot_names == NULL || + strcmp(slot_names, prev_slot_names) != 0) + { + MemoryContext oldcxt; + + ereport(WARNING, + (errmsg ("some replication slots have lost required WAL segments"), + errdetail_plural( + "Slot %s lost %ld segment(s).", + "Slots %s lost at most %ld segment(s).", + nslots, slot_names, lost_segs))); + + if (prev_slot_names) + pfree(prev_slot_names); + oldcxt = MemoryContextSwitchTo(TopMemoryContext); + prev_slot_names = pstrdup(slot_names); + MemoryContextSwitchTo(oldcxt); + } + } + } + prev_lost_segs = lost_segs; + } else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + prev_lost_segs = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 976f6479a9..fcaede60d0 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -49,6 +49,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/memutils.h" /* * Replication slot on-disk data structure. 
@@ -1064,6 +1065,70 @@ ReplicationSlotReserveWal(void) } } +/* + * Returns names of inactive replication slots that their restart_lsn are + * behind specified LSN for the purpose of error message character array + * stuffed with slot names delimited by the given separator. Returns NULL if no + * slot matches. If pnslots is given, the number of the returned slots is + * returned there. The returned array is palloc'ed in TopMemoryContext. + */ +char * +ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) +{ + static StringInfoData retstr; + static bool retstr_initialized = false; + bool insert_separator = false; + int i; + int nslots = 0; + + Assert (separator); + if (max_replication_slots <= 0) + return NULL; + + if (!retstr_initialized) + { + MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); + initStringInfo(&retstr); + MemoryContextSwitchTo(oldcxt); + retstr_initialized = true; + } + else + resetStringInfo(&retstr); + + /* construct name list */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + /* + * We are collecting slots that are definitely behind the given target + * LSN. Active slots are exluded since they can catch up later. + */ + if (s->in_use && s->active_pid == 0 && s->data.restart_lsn < target) + { + if (insert_separator) + appendStringInfoString(&retstr, separator); + + /* + * Slot names consist only with lower-case letters. We don't + * bother quoting. + */ + appendStringInfoString(&retstr, NameStr(s->data.name)); + insert_separator = true; + nslots++; + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* return the number of slots in the list if requested */ + if (pnslots) + *pnslots = nslots; + + /* return NULL instead of an empty string */ + return retstr.data[0] ? retstr.data : NULL; +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index e44f71e991..0d01e1f042 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2695,6 +2695,18 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of extra WALs kept by replication slots."), + NULL, + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, + MAX_KILOBYTES, /* XXX: This is in megabytes, like max/min_wal_size */ + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e1048c0047..8a39bf7582 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -287,6 +287,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 98b033fc20..5d117d5cfc 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 3e95b019b3..09b0ab7953 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -198,6 +198,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); -- 2.18.2 From b0fb4d797697fc9d96f88a61b7464613f150cbed Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:23:25 +0900 Subject: [PATCH v18 2/6] Add monitoring aid for max_slot_wal_keep_size Adds two columns "status" and "remain" in pg_replication_slot. Setting max_slot_wal_keep_size, replication connections may lose sync by a long delay. The "status" column shows whether the slot is reconnectable or not, or about to lose reserving WAL segments. The "remain" column shows the remaining bytes of WAL that can be advance until the slot loses required WAL records. 
--- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + src/backend/access/transam/xlog.c | 298 +++++++++++++++++++++---- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slot.c | 64 ------ src/backend/replication/slotfuncs.c | 39 +++- src/include/access/xlog.h | 18 ++ src/include/catalog/pg_proc.dat | 6 +- src/include/replication/slot.h | 1 - src/test/regress/expected/rules.out | 6 +- 10 files changed, 328 insertions(+), 114 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2c999fd3eb..cf0318f697 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index 856495c952..0f2b9992f7 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7015300c77..8a83f87c8a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3900,6 +3900,55 @@ XLogGetLastRemovedSegno(void) return lastRemovedSegNo; } +/* + * Return the oldest WAL segment file. + * + * The returned value is XLogGetLastRemovedSegno() + 1 when the function + * returns a valid value. Otherwise this function scans over WAL files and + * finds the oldest segment at the first time, which could be very slow. + */ +XLogSegNo +FindOldestXLogFileSegNo(void) +{ + static XLogSegNo lastFoundOldestSeg = 0; + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno = XLogGetLastRemovedSegno(); + + if (segno > 0) + return segno + 1; + + if (lastFoundOldestSeg > 0) + return lastFoundOldestSeg; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * Get minimum segment ignoring timeline ID, the same way with + * RemoveOldXlogFiles(). + */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + lastFoundOldestSeg = segno; + + return segno; +} + /* * Update the last removed segno pointer in shared memory, to reflect * that the given XLOG file has been removed. @@ -9322,6 +9371,105 @@ CreateRestartPoint(int flags) return true; } +/* + * Detect availability of the record at given targetLSN. + * + * targetLSN is restart_lsn of a slot. + * walsender_pid is the slot's walsender PID. + * restBytes is pointer to uint64 variable, to store the remaining bytes until + * the slot goes into WAL_BEING_REMOVED state if max_slot_wal_keep_size >= + * 0. 
It is set only when WALAVAIL_NORMAL or WALAVAIL_PRESERVED is returned. + * + * Returns one of the following enum values. + * + * WALAVAIL_NORMAL_ means targetLSN is available because it is in the range of + * max_wal_size. If max_slot_wal_keep_size is smaller than max_wal_size, this + * state is not returned. + * + * WALAVAIL_PRESERVED means it is still available by preserving extra segments + * beyond max_wal_size. + * + * WALAVAIL_BEING_REMOVED means it is being removed or already removed but the + * replication stream on the given slot is live yet. The state may transit to + * WALAVAIL_PRESERVED or WALAVAIL_NORMAL state if the walsender advances + * restart_lsn. + * + * WALAVAIL_REMOVED means it is definitly lost. The replication stream on the + * slot cannot continue. + * + * returns WALAVAIL_NULL if restart_lsn is invalid. + */ +WalAvailability +GetWalAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + uint64 keepSegs; + + /* the case where the slot has never been activated */ + if (XLogRecPtrIsInvalid(restart_lsn)) + return WALAVAIL_INVALID_LSN; + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr); + + /* find the oldest segment file actually exists */ + oldestSeg = FindOldestXLogFileSegNo(); + + /* calculate oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. + */ + if ((max_slot_wal_keep_size_mb <= 0 || + max_slot_wal_keep_size_mb >= max_wal_size_mb) && + oldestSegMaxWalSize <= restartSeg) + return WALAVAIL_NORMAL; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return WALAVAIL_PRESERVED; + } + + /* + * The segment is alrady lost or being lost. If the oldest segment is just + * after the restartSeg, running walsender may be reading the just removed + * segment. The walsender may safely move to the oldest existing segment in + * that case. + */ + if (oldestSeg == restartSeg + 1 && walsender_pid != 0) + return WALAVAIL_BEING_REMOVED; + + /* definitely lost. the walsender can no longer restart */ + return WALAVAIL_REMOVED; +} + /* * Returns minimum segment number that the next checkpoint must leave * considering wal_keep_segments, replication slots and @@ -9370,6 +9518,53 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) return currSeg - keepSegs; } +/* + * Calculate remaining bytes until WAL segment for targetLSN will be removed. 
+ */ +int64 +DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN) +{ + XLogSegNo currSeg; + uint64 limitSegs = 0; + int64 restbytes; + uint64 fragbytes; + XLogSegNo targetSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + + /* Calculate how far back WAL segments are preserved */ + if (max_slot_wal_keep_size_mb >= 0) + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (wal_keep_segments > 0 && limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (targetSeg + limitSegs < currSeg) + return 0; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + + return restbytes; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9381,9 +9576,13 @@ GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { + static XLogSegNo last_lost_segs = 0; + static int last_nslots = 0; + static char *last_slot_name = NULL; XLogRecPtr slotminptr = InvalidXLogRecPtr; XLogSegNo minSegNo; - XLogSegNo slotSegNo; + XLogSegNo minSlotSegNo; + int nslots_affected = 0; if (max_replication_slots > 0) slotminptr = XLogGetReplicationSlotMinimumLSN(); @@ -9399,56 +9598,75 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) */ if (!XLogRecPtrIsInvalid(slotminptr)) { - static XLogSegNo prev_lost_segs = 0; /* avoid duplicate messages */ + Assert (max_replication_slots > 0); - XLByteToSeg(slotminptr, slotSegNo, wal_segment_size); + XLByteToSeg(slotminptr, minSlotSegNo, wal_segment_size); - if (slotSegNo < minSegNo) + if (minSlotSegNo < minSegNo) { - XLogSegNo lost_segs = minSegNo - slotSegNo; - if (prev_lost_segs != lost_segs) + /* Some slots has lost requred segments */ + XLogSegNo lost_segs = minSegNo - minSlotSegNo; + ReplicationSlot *earliest = NULL; + char *earliest_name = NULL; + int i; + + /* Find the most affected slot */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) { - /* We have lost a new segment, warn it.*/ - XLogRecPtr minlsn; - static char *prev_slot_names = NULL; - char *slot_names; - int nslots; + ReplicationSlot *s = + &ReplicationSlotCtl->replication_slots[i]; + XLogSegNo slotSegNo; - XLogSegNoOffsetToRecPtr(minSegNo, 0, wal_segment_size, minlsn); - slot_names = - ReplicationSlotsEnumerateBehinds(minlsn, ", ", &nslots); + XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size); - /* - * Some of the affected slots could have just been removed. We - * don't need show anything here if no affected slots are - * remaining. 
- */ - if (slot_names) + if (s->in_use && s->active_pid == 0 && slotSegNo < minSegNo) { - if (prev_slot_names == NULL || - strcmp(slot_names, prev_slot_names) != 0) - { - MemoryContext oldcxt; + nslots_affected++; - ereport(WARNING, - (errmsg ("some replication slots have lost required WAL segments"), - errdetail_plural( - "Slot %s lost %ld segment(s).", - "Slots %s lost at most %ld segment(s).", - nslots, slot_names, lost_segs))); - - if (prev_slot_names) - pfree(prev_slot_names); - oldcxt = MemoryContextSwitchTo(TopMemoryContext); - prev_slot_names = pstrdup(slot_names); - MemoryContextSwitchTo(oldcxt); - } + if (earliest == NULL || + s->data.restart_lsn < earliest->data.restart_lsn) + earliest = s; } } - prev_lost_segs = lost_segs; + + if (earliest) + { + MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); + earliest_name = pstrdup(NameStr(earliest->data.name)); + MemoryContextSwitchTo(oldcxt); + } + + LWLockRelease(ReplicationSlotControlLock); + + /* Emit WARNING if something has changed */ + if (earliest_name && + (last_lost_segs != lost_segs || last_nslots != nslots_affected)) + { + ereport(WARNING, + (errmsg_plural ("%d replication slot has lost required WAL segments by %lu segments", + "%d replication slots have lost required WAL segments by %lu segments", + nslots_affected, nslots_affected, + lost_segs), + errdetail("Most affected slot is %s.", + earliest_name))); + + if (last_slot_name) + pfree(last_slot_name); + last_slot_name = earliest_name; + last_lost_segs = lost_segs; + last_nslots = nslots_affected; + } } - else - prev_lost_segs = 0; + } + + /* Reset the state if no affected slots remain. */ + if (nslots_affected == 0 && last_slot_name) + { + pfree(last_slot_name); + last_slot_name = NULL; + last_lost_segs = 0; + last_nslots = 0; } /* don't delete WAL segments newer than the calculated segment */ diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index c9e75f4370..a3c7373d4f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -860,7 +860,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index fcaede60d0..bba61fd324 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1065,70 +1065,6 @@ ReplicationSlotReserveWal(void) } } -/* - * Returns names of inactive replication slots that their restart_lsn are - * behind specified LSN for the purpose of error message character array - * stuffed with slot names delimited by the given separator. Returns NULL if no - * slot matches. If pnslots is given, the number of the returned slots is - * returned there. The returned array is palloc'ed in TopMemoryContext. 
- */ -char * -ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots) -{ - static StringInfoData retstr; - static bool retstr_initialized = false; - bool insert_separator = false; - int i; - int nslots = 0; - - Assert (separator); - if (max_replication_slots <= 0) - return NULL; - - if (!retstr_initialized) - { - MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); - initStringInfo(&retstr); - MemoryContextSwitchTo(oldcxt); - retstr_initialized = true; - } - else - resetStringInfo(&retstr); - - /* construct name list */ - LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); - for (i = 0 ; i < max_replication_slots ; i++) - { - ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; - - /* - * We are collecting slots that are definitely behind the given target - * LSN. Active slots are exluded since they can catch up later. - */ - if (s->in_use && s->active_pid == 0 && s->data.restart_lsn < target) - { - if (insert_separator) - appendStringInfoString(&retstr, separator); - - /* - * Slot names consist only with lower-case letters. We don't - * bother quoting. - */ - appendStringInfoString(&retstr, NameStr(s->data.name)); - insert_separator = true; - nslots++; - } - } - LWLockRelease(ReplicationSlotControlLock); - - /* return the number of slots in the list if requested */ - if (pnslots) - *pnslots = nslots; - - /* return NULL instead of an empty string */ - return retstr.data[0] ? retstr.data : NULL; -} - /* * Flush all replication slots to disk. * diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index bb69683e2a..83533ea6c2 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -221,7 +221,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -275,6 +275,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; int i; if (!slot->in_use) @@ -342,6 +343,42 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_BEING_REMOVED: + values[i++] = CStringGetTextDatum("losing"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >=0 && + (walstate == WALAVAIL_NORMAL || + walstate == WALAVAIL_PRESERVED)) + { + XLogRecPtr currptr = GetXLogWriteRecPtr(); + values[i++] = + Int64GetDatum(DistanceToWalRemoval(currptr, restart_lsn)); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 5d117d5cfc..52ff676638 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -255,6 +255,20 @@ typedef struct CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL segment availability status + * + * This is used as the return value of GetWalAvailability. 
+ */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter errror */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_BEING_REMOVED, /* WAL segment is no longer preserved */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -268,6 +282,7 @@ extern int XLogFileOpen(XLogSegNo segno); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void); +extern XLogSegNo FindOldestXLogFileSegNo(void); extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn); @@ -304,6 +319,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern int64 DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index fcf2a1214c..e70e62a657 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9892,9 +9892,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 09b0ab7953..3e95b019b3 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -198,7 +198,6 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); -extern char *ReplicationSlotsEnumerateBehinds(XLogRecPtr target, char *separator, int *pnslots); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 70e1e2f78d..4dec2b1c3d 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1461,8 +1461,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM 
(pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.18.2 From 232e0b6ae9da6ae9fc0cd0fe7b50984eba6bb4d6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH v18 3/6] Add primary_slot_name to init_from_backup in TAP test. It is convenient that priary_slot_name can be specified on taking a base backup. This adds a new parameter of the name to the perl function. --- src/test/perl/PostgresNode.pm | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 2e0cf4a2f3..5f2659c3fc 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -698,6 +698,10 @@ port = $port $self->append_conf('postgresql.conf', "unix_socket_directories = '$host'"); } + $self->append_conf('postgresql.conf', + qq(primary_slot_name = $params{primary_slot_name})) + if (defined $params{primary_slot_name}); + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node) if $params{has_restoring}; return; -- 2.18.2 From 6a77f2c86ca26abd1d2b2da95f8df80256a53ff7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 17:33:53 +0900 Subject: [PATCH v18 4/6] TAP test for the slot limit feature --- src/test/recovery/t/018_replslot_limit.pl | 202 ++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..6688167546 --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,202 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 4MB +log_checkpoints = yes +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 4 MB. +advance_wal($node_master, 4); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "1 replication slot has lost required WAL segments by 1 segments\n". 
+ ".*Most affected slot is rep1.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.18.2 From 84403b9717d4090513a12a3499b5a9b181efe68a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 11 Jan 2018 15:00:32 +0900 Subject: [PATCH v18 5/6] Documentation for slot-limit feature --- doc/src/sgml/catalogs.sgml | 37 +++++++++++++++++++++++++++++ doc/src/sgml/config.sgml | 23 ++++++++++++++++++ doc/src/sgml/high-availability.sgml | 8 ++++--- 3 files changed, 65 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 85ac79f07e..58dd7b6445 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9974,6 +9974,43 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this + slot. <literal>normal</literal>, <literal>keeping</literal>, + <literal>losing</literal> or <literal>lost</literal>. + <literal>normal</literal> means that the claimed files are + available within max_wal_size. <literal>keeping</literal> means + max_wal_size is exceeded but still held by replication slots or + wal_keep_segments. + <literal>losing</literal> means that some of them are on the verge of + removal but the using session may go further. + <literal>lost</literal> means that some of them are definitely lost and + the session that used this slot cannot continue replication. This state + also implies the using session has been stopped. + + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes WAL location (LSN) can advance that bytes + until this slot may lose required WAL + files. 
If <structfield>restart_lsn</structfield> is null + or <structfield>wal_status</structfield> is <literal>losing</literal> + or <literal>lost</literal>, this field is null. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3ccacd528b..3e8884458c 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3730,6 +3730,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index bc4d98fe03..328464c240 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> -- 2.18.2 From b12f9890f129f9cd6e5a811adc656069b429c108 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 26 Oct 2018 10:07:05 +0900 Subject: [PATCH v18 6/6] Check removal of in-reading segment file. Checkpoints can recycle a segment file while it is being read by ReadRecord and that leads to an apparently odd error message during logical decoding. This patch explicitly checks that then error out immediately. Reading a recycled file is safe. Inconsistency caused by overwrites as a new segment are caught by page/record validation. So this is only for keeping consistency with the wal_status shown in pg_replication_slots. 
--- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 3aa68127a3..f6566d17ae 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -246,7 +246,9 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. We only do this if we're reading @@ -292,6 +294,22 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.18.2
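(To illustrate how the feature in the patches above is meant to be used, here is a usage sketch following the v18 naming, which may still change. The GUC is PGC_SIGHUP, so it can be set and reloaded without a restart, as the attached TAP test does.)

  ALTER SYSTEM SET max_slot_wal_keep_size = '6MB';  -- -1 (default) keeps WAL for slots without limit
  SELECT pg_reload_conf();
  SELECT slot_name, restart_lsn, wal_status, pg_size_pretty(remain) AS remain
    FROM pg_replication_slots;

Once a slot's restart_lsn falls further behind than the limit, the next checkpoint removes the needed segments and, with the v18 patch, logs a warning along the lines of the one the TAP test checks for:

  WARNING:  1 replication slot has lost required WAL segments by 1 segments
  DETAIL:  Most affected slot is rep1.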
At Thu, 23 Jan 2020 21:28:54 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > In the same function, I think that setting restBytes to -1 when > > "useless" is bad style. I would just leave that variable alone when the > > returned status is not one that receives the number of bytes. So the > > caller is only entitled to read the value if the returned enum value is > > such-and-such ("keeping" and "streaming" I think). > > That is the only condition. If max_slot_wal_keep_size = -1, The value > is useless for the two states. I added that explanation to the > comment of Get(Lsn)Walavailability(). That reply is now obsolete, since restBytes is no longer a parameter of GetWalAvailability after addressing the next comment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I rebased this patch; it's failing to apply due to minor concurrent changes in PostgresNode.pm. I squashed the patches into a series that made the most sense to me. I have a question about the static variable lastFoundOldestSeg in FindOldestXLogFileSegNo. It may be set the first time the function runs; if it is, the function never does anything again, it just returns that value. In other words, the static value is never reset; it never advances either. Isn't that strange? I think the coding assumes that XLogCtl->lastRemovedSegNo will always be set, so its code will almost never run ... except when the very first WAL file has not been removed yet. This seems weird and pointless. Maybe we should think about this differently -- example: if XLogGetLastRemovedSegno returns zero, then the oldest file is the zeroth one. In what cases is this wrong? Maybe we should fix those. Regarding the PostgresNode change in 0001, I think adding a special parameter for primary_slot_name is limited. I'd like to change the definition so that anything that you give as a parameter that's not one of the recognized keywords (has_streaming, etc) is tested to see if it's a GUC; and if it is, then put it in postgresql.conf. This would have to apply both to PostgresNode::init() as well as PostgresNode::init_from_backup(), obviously, since it would make no sense for the APIs to diverge on this point. So you'd be able to do $node->init_from_backup(allow_streaming => 1, work_mem => "4MB"); without having to add code to init_from_backup to handle work_mem specifically. This could be done by having a Perl hash with all the GUC names, which we could read lazily from "postmaster --describe-config" the first time we see an unrecognized keyword as an option to init() / init_from_backup(). I edited the doc changes a bit. I don't know what to think of 0003 yet. Has this been agreed to be a good idea? I also made a few small edits to the code; all cosmetic so far: * added long_desc to the new GUC; it now reads: {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, gettext_noop("Sets the maximum size of WAL space reserved by replication slots."), gettext_noop("Replication slots will be marked as failed, and segments released " "for deletion or recycling, if this much space is occupied by WAL " "on disk."), * updated the comment to ConvertToXSegs(), which is now being used for this purpose * removed an outdated comment in GetWalAvailability; it was talking about the restBytes parameter that no longer exists -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for looking at this, and for taking the trouble to rebase it! At Mon, 30 Mar 2020 20:03:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > I rebased this patch; it's failing to apply due to minor concurrent > changes in PostgresNode.pm. I squashed the patches in a series that > made the most sense to me. > > I have a question about static variable lastFoundOldestSeg in > FindOldestXLogFileSegNo. It may be set the first time the function > runs; if it is, the function never again does anything, it just returns > that value. In other words, the static value is never reset; it never > advances either. Isn't that strange? I think the coding is to assume > that XLogCtl->lastRemovedSegNo will always be set, so its code will > almost never run ... except when the very first wal file has not been > removed yet. This seems weird and pointless. Maybe we should think > about this differently -- example: if XLogGetLastRemovedSegno returns > zero, then the oldest file is the zeroth one. In what cases this is > wrong? Maybe we should fix those. That's right, but without the static variable, every call to the pg_replication_slots view before the first checkpoint causes a scan of pg_xlog. XLogCtl->lastRemovedSegNo advances only at a checkpoint, so it is actually right that the return value from FindOldestXLogFileSegNo doesn't change until the first checkpoint. We could also set XLogCtl->lastRemovedSegNo at startup, but the scan of pg_xlog would be useless in most cases. I had avoided updating XLogCtl->lastRemovedSegNo directly, but a third way would be: if XLogGetLastRemovedSegno() returns 0, set XLogCtl->lastRemovedSegNo by scanning the WAL directory. The attached patch takes this approach. > Regarding the PostgresNode change in 0001, I think adding a special > parameter for primary_slot_name is limited. I'd like to change the > definition so that anything that you give as a parameter that's not one > of the recognized keywords (has_streaming, etc) is tested to see if it's > a GUC; and if it is, then put it in postgresql.conf. This would have to > apply both to PostgresNode::init() as well as > PostgresNode::init_from_backup(), obviously, since it would make no > sense for the APIs to diverge on this point. So you'd be able to do > $node->init_from_backup(allow_streaming => 1, work_mem => "4MB"); > without having to add code to init_from_backup to handle work_mem > specifically. This could be done by having a Perl hash with all the GUC > names, that we could read lazily from "postmaster --describe-config" the > first time we see an unrecognized keyword as an option to init() / > init_from_backup(). Done that way. We could exclude "known" parameters by explicitly deleting each key as it is read, but I chose to enumerate the known keywords instead. Although the feature could be used more widely, for now only 018_replslot_limit.pl uses it. > I edited the doc changes a bit. > > I don't know what to think of 0003 yet. Has this been agreed to be a > good idea? That is why it is a separate patch. I think it has been neither approved nor rejected. The main objective of the patch is to prevent pg_replication_slots.wal_status from strangely coming back from the "lost" state to other states. However, in the first place I doubt that it is right for logical replication to send the contents of a WAL segment that has already been recycled.
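Incidentally, the same comparison can be eyeballed from SQL (a rough sketch; pg_ls_waldir() requires superuser or suitable monitoring privileges): check the oldest segment file actually present in pg_wal against each slot's restart_lsn.

SELECT s.slot_name, s.restart_lsn,
       (SELECT min(name) FROM pg_ls_waldir()
         WHERE name ~ '^[0-9A-F]{24}$') AS oldest_segment_on_disk
  FROM pg_replication_slots s;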
> I also made a few small edits to the code; all cosmetic so far: > > * added long_desc to the new GUC; it now reads: > > {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, > gettext_noop("Sets the maximum size of WAL space reserved by replication slots."), > gettext_noop("Replication slots will be marked as failed, and segments released " > "for deletion or recycling, if this much space is occupied by WAL " > "on disk."), > > * updated the comment to ConvertToXSegs() which is now being used for > this purpose > > * remove outdated comment to GetWalAvailability; it was talking about > restBytes parameter that no longer exists Thank you for the fixes. All of them look fine. I fixed several typos. (s/requred/required/, s/devinitly/definitely/, s/errror/error/) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From b6793238653e188dc4e57aee268b4ac42cdc18b6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH v20 1/3] Allow arbitrary GUC parameter settings in init and init_from_backup in TAP tests. It is convenient that arbitrary GUC parameters can be specified when initializing a node or taking a backup. --- src/test/perl/PostgresNode.pm | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 1d5450758e..c4fdf6d21a 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -416,6 +416,13 @@ parameter allows_streaming => 'logical' or 'physical' (passing 1 will also suffice for physical replication) depending on type of replication that should be enabled. This is disabled by default. +The keyword parameter extra is appended to the parameters to +initdb. Similarly auth_extra is appended to the parameter list to +pg_regress. + +The keyword parameters other than the above are appended to the +configuration file as configuration parameters. + The new node is set up in a fast but unsafe configuration where fsync is disabled. @@ -494,6 +501,17 @@ sub init print $conf "unix_socket_directories = '$host'\n"; print $conf "listen_addresses = ''\n"; } + + # Finally, append all unknown parameters as configuration parameters. + foreach my $k (keys %params) + { + # Ignore known parameters, which are shown above. + next if (grep { $k eq $_ } + ('has_archiving', 'allows_streaming', 'extra', 'auth_extra')); + + print $conf "$k = \'$params{$k}\'\n"; + } + close $conf; chmod($self->group_access ? 0640 : 0600, "$pgdata/postgresql.conf") @@ -656,6 +674,9 @@ default. If has_restoring is used, standby mode is used by default. To use recovery mode instead, pass the keyword parameter standby => 0. +The keyword parameters other than the above are appended to the +configuration file as configuration parameters. + The backup is copied, leaving the original unmodified. pg_hba.conf is unconditionally set to enable replication connections. @@ -702,6 +723,17 @@ port = $port $self->append_conf('postgresql.conf', "unix_socket_directories = '$host'"); } + + # Translate unknown parameters into configuration parameters. + foreach my $k (keys %params) + { + # Ignore known parameters, which are shown above. 
+ next if (grep { $k eq $_ } + ('has_streaming', 'has_restoring', 'standby')); + + $self->append_conf('postgresql.conf', "$k = \'$params{$k}\'"); + } + $self->enable_streaming($root_node) if $params{has_streaming}; $self->enable_restoring($root_node, $params{standby}) if $params{has_restoring}; return; -- 2.18.2 From bfd3094622d2fac2bb35bc4e4ebb0ed0d40426a0 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 21 Dec 2017 21:20:20 +0900 Subject: [PATCH v20 2/3] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/sql/ddl.sql | 2 + doc/src/sgml/catalogs.sgml | 48 +++ doc/src/sgml/config.sgml | 23 ++ doc/src/sgml/high-availability.sgml | 8 +- src/backend/access/transam/xlog.c | 361 ++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slot.c | 1 + src/backend/replication/slotfuncs.c | 39 +- src/backend/utils/misc/guc.c | 13 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 19 + src/include/catalog/pg_proc.dat | 6 +- src/test/recovery/t/018_replslot_limit.pl | 202 ++++++++++ src/test/regress/expected/rules.out | 6 +- 15 files changed, 699 insertions(+), 38 deletions(-) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index 2c999fd3eb..cf0318f697 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -723,8 +723,8 @@ SELECT pg_drop_replication_slot('regression_slot'); (1 row) /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn| confirmed_flush_lsn ------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+--------------------- (0 rows) +\x diff --git a/contrib/test_decoding/sql/ddl.sql b/contrib/test_decoding/sql/ddl.sql index 856495c952..0f2b9992f7 100644 --- a/contrib/test_decoding/sql/ddl.sql +++ b/contrib/test_decoding/sql/ddl.sql @@ -387,4 +387,6 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc SELECT pg_drop_replication_slot('regression_slot'); /* check that the slot is gone */ +\x SELECT * FROM pg_replication_slots; +\x diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 64614b569c..01a7802ed4 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9907,6 +9907,54 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this slot. 
+ Valid values are: + <simplelist> + <member> + <literal>normal</literal> means that the claimed files + are within <varname>max_wal_size</varname> + </member> + <member> + <literal>keeping</literal> means that <varname>max_wal_size</varname> + is exceeded but still held by replication slots or + <varname>wal_keep_segments</varname> + </member> + <member> + <literal>losing</literal> means that some of the files are on the verge + of deletion, but can still be accessed by a session that's currently + reading it + </member> + <member> + <literal>lost</literal> means that some of them are definitely lost + and the session using this slot cannot continue replication. + This state also implies that the session using this slot has been + stopped. + </member> + </simplelist> + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes of WAL that can be written before this slot + loses required WAL files. + If <structfield>restart_lsn</structfield> is null or + <structfield>wal_status</structfield> is <literal>losing</literal> + or <literal>lost</literal>, this field is null. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 2de21903a1..dc99c6868a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3758,6 +3758,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index b5d32bb720..624e5f94ad 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. 
An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1951103b26..e68a89fded 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -105,6 +105,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -758,7 +759,7 @@ static ControlFileData *ControlFile = NULL; */ #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD) -/* Convert min_wal_size_mb and max_wal_size_mb to equivalent segment count */ +/* Convert values of GUCs measured in megabytes to equiv. segment count */ #define ConvertToXSegs(x, segsize) \ (x / ((segsize) / (1024 * 1024))) @@ -895,6 +896,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -3929,8 +3931,56 @@ XLogGetLastRemovedSegno(void) } /* - * Update the last removed segno pointer in shared memory, to reflect - * that the given XLOG file has been removed. + * Scan the WAL directory then set the lastRemovedSegNo. + * + * In the case we need to know the last removed segment before the first + * checkpoint runs, call this function to initialize the variable by scanning + * the WAL directory. + */ +XLogSegNo +ScanAndSetLastRemovedSegno(void) +{ + DIR *xldir; + struct dirent *xlde; + XLogSegNo segno; + + xldir = AllocateDir(XLOGDIR); + while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL) + { + TimeLineID tli; + XLogSegNo fsegno; + + /* Ignore files that are not XLOG segments */ + if (!IsXLogFileName(xlde->d_name) && + !IsPartialXLogFileName(xlde->d_name)) + continue; + + XLogFromFileName(xlde->d_name, &tli, &fsegno, wal_segment_size); + + /* + * Get minimum segment ignoring timeline ID, the same way with + * RemoveOldXlogFiles(). + */ + if (segno == 0 || fsegno < segno) + segno = fsegno; + } + + FreeDir(xldir); + + /* Update the last removed segno, not making retrogression. */ + SpinLockAcquire(&XLogCtl->info_lck); + if (segno > XLogCtl->lastRemovedSegNo) + XLogCtl->lastRemovedSegNo = segno; + else + segno = XLogCtl->lastRemovedSegNo; + SpinLockRelease(&XLogCtl->info_lck); + + return segno; +} + +/* + * Update the last removed segno pointer in shared memory, to reflect that the + * given XLOG file has been removed. */ static void UpdateLastRemovedPtr(char *filename) @@ -9441,6 +9491,201 @@ CreateRestartPoint(int flags) return true; } +/* + * Detect availability of the record at given targetLSN. + * + * targetLSN is restart_lsn of a slot. + * walsender_pid is the slot's walsender PID. 
+ * + * Returns one of the following enum values. + * + * WALAVAIL_NORMAL means targetLSN is available because it is in the range of + * max_wal_size. If max_slot_wal_keep_size is smaller than max_wal_size, this + * state is not returned. + * + * WALAVAIL_PRESERVED means it is still available by preserving extra segments + * beyond max_wal_size. + * + * WALAVAIL_BEING_REMOVED means it is being removed or already removed but the + * replication stream on the given slot is live yet. The state may transit to + * WALAVAIL_PRESERVED or WALAVAIL_NORMAL state if the walsender advances + * restart_lsn. + * + * WALAVAIL_REMOVED means it is definitely lost. The replication stream on the + * slot cannot continue. + * + * returns WALAVAIL_NULL if restart_lsn is invalid. + */ +WalAvailability +GetWalAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + uint64 keepSegs; + + /* the case where the slot has never been activated */ + if (XLogRecPtrIsInvalid(restart_lsn)) + return WALAVAIL_INVALID_LSN; + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr); + + /* find the oldest extant segment file */ + oldestSeg = XLogGetLastRemovedSegno() + 1; + + /* initialize last removed segno if not yet */ + if (oldestSeg == 1) + oldestSeg = ScanAndSetLastRemovedSegno() + 1; + + /* calculate oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. + */ + if ((max_slot_wal_keep_size_mb <= 0 || + max_slot_wal_keep_size_mb >= max_wal_size_mb) && + oldestSegMaxWalSize <= restartSeg) + return WALAVAIL_NORMAL; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return WALAVAIL_PRESERVED; + } + + /* + * The segment is already lost or being lost. If the oldest segment is just + * after the restartSeg, running walsender may be reading the just removed + * segment. The walsender may safely move to the oldest existing segment in + * that case. + */ + if (oldestSeg == restartSeg + 1 && walsender_pid != 0) + return WALAVAIL_BEING_REMOVED; + + /* definitely lost. the walsender can no longer restart */ + return WALAVAIL_REMOVED; +} + +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. 
+ */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + +/* + * Calculate remaining bytes until WAL segment for targetLSN will be removed. + */ +int64 +DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN) +{ + XLogSegNo currSeg; + uint64 limitSegs = 0; + int64 restbytes; + uint64 fragbytes; + XLogSegNo targetSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + + /* Calculate how far back WAL segments are preserved */ + if (max_slot_wal_keep_size_mb >= 0) + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (wal_keep_segments > 0 && limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (targetSeg + limitSegs < currSeg) + return 0; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * Not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + + return restbytes; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9452,38 +9697,102 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + static XLogSegNo last_lost_segs = 0; + static int last_nslots = 0; + static char *last_slot_name = NULL; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo minSlotSegNo; + int nslots_affected = 0; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; + Assert (max_replication_slots > 0); + + XLByteToSeg(slotminptr, minSlotSegNo, wal_segment_size); + + if (minSlotSegNo < minSegNo) + { + /* Some slots has lost required segments */ + XLogSegNo lost_segs = minSegNo - minSlotSegNo; + ReplicationSlot *earliest = NULL; + char *earliest_name = NULL; + int i; + + /* Find the most affected slot */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = + &ReplicationSlotCtl->replication_slots[i]; + XLogSegNo slotSegNo; + + XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size); + + if (s->in_use && s->active_pid == 0 && slotSegNo < minSegNo) + { + nslots_affected++; + + if (earliest == NULL || + s->data.restart_lsn < earliest->data.restart_lsn) + earliest = s; + } + } + + if (earliest) + { + MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); + earliest_name = pstrdup(NameStr(earliest->data.name)); + MemoryContextSwitchTo(oldcxt); + } + + LWLockRelease(ReplicationSlotControlLock); + + /* Emit WARNING if something has changed */ + if (earliest_name && + (last_lost_segs != lost_segs || last_nslots != nslots_affected)) + { + ereport(WARNING, + (errmsg_plural ("%d replication slot has lost required WAL segments by %lu segments", + "%d replication slots have lost required WAL segments by %lu segments", + nslots_affected, nslots_affected, + lost_segs), + errdetail("Most affected slot is %s.", + earliest_name))); + + if (last_slot_name) + pfree(last_slot_name); + last_slot_name = earliest_name; + last_lost_segs = lost_segs; + last_nslots = nslots_affected; + } + } } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) + /* Reset the state if no affected slots remain. */ + if (nslots_affected == 0 && last_slot_name) { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + pfree(last_slot_name); + last_slot_name = NULL; + last_lost_segs = 0; + last_nslots = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 83d00c6cde..775b8b7f20 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -863,7 +863,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index d90c7235e9..a26f7999aa 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -49,6 +49,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/memutils.h" /* * Replication slot on-disk data structure. 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ce0c9127bc..47cd4375a1 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -234,7 +234,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -288,6 +288,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; int i; if (!slot->in_use) @@ -355,6 +356,42 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_BEING_REMOVED: + values[i++] = CStringGetTextDatum("losing"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >= 0 && + (walstate == WALAVAIL_NORMAL || + walstate == WALAVAIL_PRESERVED)) + { + values[i++] = + Int64GetDatum(DistanceToWalRemoval(GetXLogWriteRecPtr(), + restart_lsn)); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 79bc7ac8ca..54cd5f6420 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2771,6 +2771,19 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of WAL space reserved by replication slots."), + gettext_noop("Replication slots will be marked as failed, and segments released " + "for deletion or recycling, if this much space is occupied by WAL " + "on disk."), + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e9f8ca775d..0b696e7044 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -287,6 +287,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 9ec7b31cce..c1994eec6f 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; @@ -255,6 +256,20 @@ typedef struct CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL 
segment availability status + * + * This is used as the return value of GetWalAvailability. + */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter error */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_BEING_REMOVED, /* WAL segment is no longer preserved */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -268,6 +283,7 @@ extern int XLogFileOpen(XLogSegNo segno); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void); +extern XLogSegNo ScanAndSetLastRemovedSegno(void); extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn); @@ -305,6 +321,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern int64 DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a6a708cca9..2025f34bfd 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9971,9 +9971,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..84a1f3a9dd --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,202 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, + extra => ['--wal-segsize=1'], + min_wal_size => '2MB', + max_wal_size => '4MB', + log_checkpoints => 'yes'); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, + primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 4 MB. +advance_wal($node_master, 4); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "1 replication slot has lost required WAL segments by 1 segments\n". 
+ ".*Most affected slot is rep1.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 7245b0e13b..8688f7138f 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1462,8 +1462,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.18.2 From 417eb0ba58451a02e71377277fda6e4615c8db5b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 31 Mar 2020 12:59:10 +0900 Subject: [PATCH v20 3/3] Allow init and init_from_backup to set arbitrary GUC parameters in TAP test. It is convenient that arbitrary GUC parameters can be specified on initializing a node or taking a base backup. Any non-predefined keyword parameter given to the methods is translated into a parameter setting in the config file. --- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index f3fea5132f..90a9649f61 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -270,7 +270,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. 
We only do this if we're reading @@ -314,6 +316,22 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.18.2
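For completeness, an interactive analogue of what the TAP test above automates (a sketch assuming a physical slot named rep1 already exists and max_slot_wal_keep_size is set): generate a segment's worth of WAL, switch and checkpoint, then watch wal_status and remain change for the slot.

CREATE TABLE t (); DROP TABLE t;
SELECT pg_switch_wal();
CHECKPOINT;
SELECT slot_name, wal_status, pg_size_pretty(remain) AS remain
  FROM pg_replication_slots WHERE slot_name = 'rep1';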
+ */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. + */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + +/* + * Calculate remaining bytes until WAL segment for targetLSN will be removed. + */ +int64 +DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN) +{ + XLogSegNo currSeg; + uint64 limitSegs = 0; + int64 restbytes; + uint64 fragbytes; + XLogSegNo targetSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + + /* Calculate how far back WAL segments are preserved */ + if (max_slot_wal_keep_size_mb >= 0) + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (wal_keep_segments > 0 && limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (targetSeg + limitSegs < currSeg) + return 0; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * Not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + + return restbytes; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9452,38 +9697,102 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + static XLogSegNo last_lost_segs = 0; + static int last_nslots = 0; + static char *last_slot_name = NULL; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo minSlotSegNo; + int nslots_affected = 0; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn the checkpoint is going to flush the segments required by + * replication slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; + Assert (max_replication_slots > 0); + + XLByteToSeg(slotminptr, minSlotSegNo, wal_segment_size); + + if (minSlotSegNo < minSegNo) + { + /* Some slots has lost required segments */ + XLogSegNo lost_segs = minSegNo - minSlotSegNo; + ReplicationSlot *earliest = NULL; + char *earliest_name = NULL; + int i; + + /* Find the most affected slot */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = + &ReplicationSlotCtl->replication_slots[i]; + XLogSegNo slotSegNo; + + XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size); + + if (s->in_use && s->active_pid == 0 && slotSegNo < minSegNo) + { + nslots_affected++; + + if (earliest == NULL || + s->data.restart_lsn < earliest->data.restart_lsn) + earliest = s; + } + } + + if (earliest) + { + MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext); + earliest_name = pstrdup(NameStr(earliest->data.name)); + MemoryContextSwitchTo(oldcxt); + } + + LWLockRelease(ReplicationSlotControlLock); + + /* Emit WARNING if something has changed */ + if (earliest_name && + (last_lost_segs != lost_segs || last_nslots != nslots_affected)) + { + ereport(WARNING, + (errmsg_plural ("%d replication slot has lost required WAL segments by %lu segments", + "%d replication slots have lost required WAL segments by %lu segments", + nslots_affected, nslots_affected, + lost_segs), + errdetail("Most affected slot is %s.", + earliest_name))); + + if (last_slot_name) + pfree(last_slot_name); + last_slot_name = earliest_name; + last_lost_segs = lost_segs; + last_nslots = nslots_affected; + } + } } - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) + /* Reset the state if no affected slots remain. */ + if (nslots_affected == 0 && last_slot_name) { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + pfree(last_slot_name); + last_slot_name = NULL; + last_lost_segs = 0; + last_nslots = 0; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 83d00c6cde..775b8b7f20 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -863,7 +863,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index d90c7235e9..a26f7999aa 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -49,6 +49,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/memutils.h" /* * Replication slot on-disk data structure. 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ce0c9127bc..47cd4375a1 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -234,7 +234,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -288,6 +288,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; int i; if (!slot->in_use) @@ -355,6 +356,42 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_BEING_REMOVED: + values[i++] = CStringGetTextDatum("losing"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >= 0 && + (walstate == WALAVAIL_NORMAL || + walstate == WALAVAIL_PRESERVED)) + { + values[i++] = + Int64GetDatum(DistanceToWalRemoval(GetXLogWriteRecPtr(), + restart_lsn)); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 79bc7ac8ca..54cd5f6420 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2771,6 +2771,19 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum size of WAL space reserved by replication slots."), + gettext_noop("Replication slots will be marked as failed, and segments released " + "for deletion or recycling, if this much space is occupied by WAL " + "on disk."), + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e9f8ca775d..0b696e7044 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -287,6 +287,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 9ec7b31cce..c1994eec6f 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; @@ -255,6 +256,20 @@ typedef struct CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL 
segment availability status + * + * This is used as the return value of GetWalAvailability. + */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter error */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_BEING_REMOVED, /* WAL segment is no longer preserved */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -268,6 +283,7 @@ extern int XLogFileOpen(XLogSegNo segno); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void); +extern XLogSegNo ScanAndSetLastRemovedSegno(void); extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn); @@ -305,6 +321,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern int64 DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a6a708cca9..2025f34bfd 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9971,9 +9971,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..84a1f3a9dd --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,202 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, + extra => ['--wal-segsize=1'], + min_wal_size => '2MB', + max_wal_size => '4MB', + log_checkpoints => 'yes'); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1, + primary_slot_name => 'rep1'); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 4 MB. +advance_wal($node_master, 4); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "1 replication slot has lost required WAL segments by 1 segments\n". 
+ ".*Most affected slot is rep1.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 7245b0e13b..8688f7138f 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1462,8 +1462,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.18.2 From b01043efdbec8f52306c4d0dea4e71d7a93cae63 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 31 Mar 2020 12:59:10 +0900 Subject: [PATCH v20 3/3] Allow init and init_from_backup to set arbitrary GUC parameters in TAP test. It is convenient that arbitrary GUC parameters can be specified on initializing a node or taking a base backup. Any non-predefined keyword parameter given to the methods is translated into a parameter setting in the config file. --- src/backend/access/transam/xlogreader.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index f3fea5132f..90a9649f61 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -270,7 +270,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) uint32 pageHeaderSize; bool gotheader; int readOff; - +#ifndef FRONTEND + XLogSegNo targetSegNo; +#endif /* * randAccess indicates whether to verify the previous-record pointer of * the record we're reading. 
We only do this if we're reading @@ -314,6 +316,22 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ); targetRecOff = RecPtr % XLOG_BLCKSZ; +#ifndef FRONTEND + /* + * Although It's safe that the current segment is recycled as a new + * segment since we check the page/record header at reading, it leads to + * an apparently strange error message when logical replication, which can + * be prevented by explicitly checking if the current segment is removed. + */ + XLByteToSeg(targetPagePtr, targetSegNo, state->segcxt.ws_segsize); + if (targetSegNo <= XLogGetLastRemovedSegno()) + { + report_invalid_record(state, + "WAL segment for LSN %X/%X has been removed", + (uint32)(RecPtr >> 32), (uint32) RecPtr); + goto err; + } +#endif /* * Read the page containing the record into state->readBuf. Request enough * byte to cover the whole record header, or at least the part of it that -- 2.18.2
On 2020-Mar-31, Kyotaro Horiguchi wrote: > Thank you for looking this and trouble rebasing! > > At Mon, 30 Mar 2020 20:03:27 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > I rebased this patch; it's failing to apply due to minor concurrent > > changes in PostgresNode.pm. I squashed the patches in a series that > > made the most sense to me. > > > > I have a question about static variable lastFoundOldestSeg in > > FindOldestXLogFileSegNo. It may be set the first time the function > > runs; if it is, the function never again does anything, it just returns > > that value. In other words, the static value is never reset; it never > > advances either. Isn't that strange? I think the coding is to assume > > that XLogCtl->lastRemovedSegNo will always be set, so its code will > > almost never run ... except when the very first wal file has not been > > removed yet. This seems weird and pointless. Maybe we should think > > about this differently -- example: if XLogGetLastRemovedSegno returns > > zero, then the oldest file is the zeroth one. In what cases this is > > wrong? Maybe we should fix those. > > That's right, but without the static variable, every call to the > pg_replication_slots view before the fist checkpoint causes scanning > pg_xlog. XLogCtl->lastRemovedSegNo advances only at a checkpoint, so > it is actually right that the return value from > FindOldestXLogFileSegNo doesn't change until the first checkpoint. > > Also we could set XLogCtl->lastRemovedSegNo at startup, but the > scanning on pg_xlog is useless in most cases. > > I avoided to update the XLogCtl->lastRemovedSegNo directlry, but the > third way would be if XLogGetLastRemovedSegno() returned 0, then set > XLogCtl->lastRemovedSegNo by scanning the WAL directory. The attached > takes this way. I'm not sure if I explained my proposal clearly. What if XLogGetLastRemovedSegno returning zero means that every segment is valid? We don't need to scan pg_xlog at all. > > Regarding the PostgresNode change in 0001, I think adding a special > > parameter for primary_slot_name is limited. I'd like to change the > > definition so that anything that you give as a parameter that's not one > > of the recognized keywords (has_streaming, etc) is tested to see if it's > > a GUC; and if it is, then put it in postgresql.conf. This would have to > > apply both to PostgresNode::init() as well as > > PostgresNode::init_from_backup(), obviously, since it would make no > > sense for the APIs to diverge on this point. So you'd be able to do > > $node->init_from_backup(allow_streaming => 1, work_mem => "4MB"); > > without having to add code to init_from_backup to handle work_mem > > specifically. This could be done by having a Perl hash with all the GUC > > names, that we could read lazily from "postmaster --describe-config" the > > first time we see an unrecognized keyword as an option to init() / > > init_from_backup(). > > Done that way. We could exclude "known" parameters by explicitly > delete the key at reading it, but I choosed to enumerate the known > keywords. Although it can be used widely but actually I changed only > 018_repslot_limit.pl to use the feature. Hmm. I like this idea in general, but I'm not sure I want to introduce it in this form right away. For the time being I realized while waking up this morning we can just use $node->append_conf() in the 018_replslot_limit.pl file, like every other place that needs a special config. There's no need to change the test infrastructure for this. I'll go through this again. 
Many thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Mar-31, Alvaro Herrera wrote: > I'm not sure if I explained my proposal clearly. What if > XLogGetLastRemovedSegno returning zero means that every segment is > valid? We don't need to scan pg_xlog at all. I mean this: XLogSegNo FindOldestXLogFileSegNo(void) { XLogSegNo segno = XLogGetLastRemovedSegno(); /* this is the only special case we need to care about */ if (segno == 0) return some-value; return segno + 1; } ... and that point one can further note that a freshly initdb'd system (no file has been removed) has "1" as the first file. So when segno is 0, you can return 1 and all should be well. That means you can reduce the function to this: XLogSegNo FindOldestXLogFileSegNo(void) { return XLogGetLastRemovedSegno() + 1; } The tests still pass with this coding. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Mar-31, Alvaro Herrera wrote: > On 2020-Mar-31, Alvaro Herrera wrote: > > > I'm not sure if I explained my proposal clearly. What if > > XLogGetLastRemovedSegno returning zero means that every segment is > > valid? We don't need to scan pg_xlog at all. > > I mean this: [v21 does it that way. Your typo fixes are included, but not the LastRemoved stuff being discussed here. I also edited the shortdesc in guc.c to better match {min,max}_wal_size.] Hmm ... but if the user runs pg_resetwal to remove WAL segments, then this will work badly for a time (until a segment is removed next). I'm not very worried for that scenario, since surely the user will have to reclone any standbys anyway. I think your v20 behaves better in that case. But I'm not sure we should have that code to cater only to that case ... seems to me that it will go untested 99.999% of the time. Maybe you're aware of some other cases where lastRemovedSegNo is not correct for the purposes of this feature? I pushed the silly test_decoding test adjustment to get it out of the way. /me tries to figure out KeepLogSeg next -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I noticed some other things:

1. KeepLogSeg sends a warning message when slots fall behind. To do this, it searches for "the most affected slot", that is, the slot that lost the most data. But it seems to me that that's a bit pointless; if a slot lost data, it's now useless and anything that was using that slot must be recreated. If you only know what's the most affected slot, it's not possible to see which *other* slots are affected. It doesn't matter if the slot missed one segment or twenty segments or 9999 segments -- the slot is now useless, or it is not useless. I think we should list the slot that was *least* affected, i.e., the slot that lost the minimum number of segments; then the user knows that all slots that are older than that one are *also* affected.

2. KeepLogSeg ignores slots that are active. I guess the logic here is that if a slot is active, then it'll keep going until it catches up and we don't need to do anything about the used disk space. But that seems a false premise, because if a standby is so slow that it cannot keep up, it will eventually run the master out of disk space even if it's active all the time. So I'm not seeing the reasoning that makes it useful to skip checking active slots.

(BTW I don't think you need to keep that many static variables in that function. Just the slot name should be sufficient, I think ... or maybe even the *pointer* to the slot that was last reported.)

I think if a slot is behind and it lost segments, we should kill the walsender that's using it, and unreserve the segments. So maybe something like

    LWLockAcquire( ... );
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogSegNo   slotSegNo;

        XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size);

        if (s->in_use)
        {
            if (s->active_pid)
                pids_to_kill = lappend_int(pids_to_kill, s->active_pid);
            nslots_affected++;
            ... ; /* other stuff */
        }
    }
    LWLockRelease( ... );

    /* release lock before syscalls */
    foreach(l, pids_to_kill)
    {
        kill(lfirst_int(l), SIGTERM);
    }

I sense some attempt to salvage slots that are reading a segment that is "outdated" and removed, but for which the walsender has an open file descriptor. (This appears to be the "losing" state.) This seems dangerous, for example the segment might be recycled and is being overwritten with different data. Trying to keep track of that seems doomed. And even if the walsender can still read that data, it's only a matter of time before the next segment is also removed. So keeping the walsender alive is futile; it only delays the inevitable.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
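To make point 1 above concrete, here is a minimal sketch of how the warning could report the *least* affected slot instead. It assumes it runs inside KeepLogSeg with minSegNo (the cutoff segment computed earlier there) at hand and ReplicationSlotControlLock held in shared mode; the loop and the message wording are only an illustration, not code from the posted patch:

    ReplicationSlot *least_affected = NULL;
    int         nslots_affected = 0;
    int         i;

    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogSegNo   slotSegNo;

        if (!s->in_use || XLogRecPtrIsInvalid(s->data.restart_lsn))
            continue;

        XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size);

        if (slotSegNo < minSegNo)
        {
            nslots_affected++;

            /* "least affected" = the newest restart_lsn among the losers */
            if (least_affected == NULL ||
                s->data.restart_lsn > least_affected->data.restart_lsn)
                least_affected = s;
        }
    }

    if (nslots_affected > 0)
        ereport(WARNING,
                (errmsg("%d replication slot(s) lost required WAL segments",
                        nslots_affected),
                 errdetail("Slot \"%s\" lost the least; it and every older slot are affected.",
                           NameStr(least_affected->data.name))));

Every slot whose restart_lsn is older than the reported one is then known to be broken as well, which is the property point 1 is after.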
On 2020-Mar-31, Alvaro Herrera wrote:

>     /* release lock before syscalls */
>     foreach(l, pids_to_kill)
>     {
>         kill(lfirst_int(l), SIGTERM);
>     }
>
> I sense some attempt to salvage slots that are reading a segment that is
> "outdated" and removed, but for which the walsender has an open file
> descriptor. (This appears to be the "losing" state.) This seems
> dangerous, for example the segment might be recycled and is being
> overwritten with different data. Trying to keep track of that seems
> doomed. And even if the walsender can still read that data, it's only a
> matter of time before the next segment is also removed. So keeping the
> walsender alive is futile; it only delays the inevitable.

I think we should kill(SIGTERM) the walsender using the slot (slot->active_pid), then acquire the slot and set it to some state indicating that it is now useless, no longer reserving WAL; so when the walsender is restarted, it will find the slot cannot be used any longer.

Two ideas come to mind about doing this:

1. set the LSNs and Xmins to Invalid; keep only the slot name, database, plugin, etc. This makes monitoring harder, I think, because as soon as the slot is gone you know nothing at all about it.

2. add a new flag to ReplicationSlotPersistentData to indicate that the slot is dead. This preserves the LSN info for forensics, and might even be easier to code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
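For what idea 2 might look like, a rough sketch follows. The field name "invalidated" is made up for illustration, and the checkpointer would additionally have to acquire the slot before it can mark it dirty and save it; nothing here is from the posted patches:

    /* assumed new member of ReplicationSlotPersistentData */
    bool        invalidated;    /* required WAL has been removed */

    /* checkpointer side, once the slot is known to have lost its WAL */
    SpinLockAcquire(&s->mutex);
    s->data.invalidated = true; /* restart_lsn is kept for forensics */
    SpinLockRelease(&s->mutex);
    /* with the slot acquired: */
    ReplicationSlotMarkDirty();
    ReplicationSlotSave();

    /* walsender side, when a client tries to stream from the slot again */
    if (MyReplicationSlot->data.invalidated)
        ereport(ERROR,
                (errmsg("replication slot \"%s\" can no longer be used",
                        NameStr(MyReplicationSlot->data.name)),
                 errdetail("The WAL required by this slot has already been removed.")));

ReplicationSlotsComputeRequiredLSN() would also have to skip slots with the flag set, so the preserved restart_lsn stays visible in pg_replication_slots without reserving any WAL.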
On 2020-Mar-31, Alvaro Herrera wrote: > I think we should kill(SIGTERM) the walsender using the slot (slot->active_pid), > then acquire the slot and set it to some state indicating that it is now > useless, no longer reserving WAL; so when the walsender is restarted, it > will find the slot cannot be used any longer. Ah, I see ioguix already pointed this out and the response was that the walsender stops by itself. Hmm. I suppose this works too ... it seems a bit fragile, but maybe I'm too sensitive. Do we have other opinions on this point? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
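For reference, the "stops by itself" behaviour relies on the check the walsender already performs after every read from pg_wal, roughly like this (paraphrasing XLogRead() in walsender.c from memory; this is existing code, not part of the patch):

    /* after read(): the segment may have been recycled while we read it */
    XLByteToSeg(startptr, segno, wal_segment_size);
    CheckXLogRemoved(segno, ThisTimeLineID);

CheckXLogRemoved() raises an ERROR of the form "requested WAL segment ... has already been removed" when segno is at or below lastRemovedSegNo, which is also the message the new TAP test greps the standby's log for.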
At Tue, 31 Mar 2020 14:18:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Mar-31, Alvaro Herrera wrote:
> 
> > I'm not sure if I explained my proposal clearly. What if
> > XLogGetLastRemovedSegno returning zero means that every segment is
> > valid? We don't need to scan pg_xlog at all.
> 
> I mean this:
> 
> XLogSegNo
> FindOldestXLogFileSegNo(void)
> {
>     XLogSegNo segno = XLogGetLastRemovedSegno();
> 
>     /* this is the only special case we need to care about */
>     if (segno == 0)
>         return some-value;
> 
>     return segno + 1;
> }
> 
> ... and that point one can further note that a freshly initdb'd system
> (no file has been removed) has "1" as the first file. So when segno is
> 0, you can return 1 and all should be well. That means you can reduce
> the function to this:

If we don't scan the WAL files, then, for example (somewhat artificial), if segments cannot be removed because of a wrong archive_command setting, GetWalAvailability can return a false "removed (lost)" state. If max_slot_wal_keep_size is shrunk and the server is then restarted, the function can return false "normal" or "keeping" states.

By the way, the oldest segment of an initdb'ed cluster was the (14x)th for me. So I think we can treat segno == 1 as an "uncertain" or "unknown" state, but that state lasts until a checkpoint actually removes a segment.

> XLogSegNo
> FindOldestXLogFileSegNo(void)
> {
>     return XLogGetLastRemovedSegno() + 1;
> }
> 
> 
> The tests still pass with this coding.

Mmm. Yeah, but that only makes a difference under abnormal conditions.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 31 Mar 2020 16:59:05 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Mar-31, Alvaro Herrera wrote:
> 
> > On 2020-Mar-31, Alvaro Herrera wrote:
> > 
> > > I'm not sure if I explained my proposal clearly. What if
> > > XLogGetLastRemovedSegno returning zero means that every segment is
> > > valid? We don't need to scan pg_xlog at all.
> > 
> > I mean this:
> 
> [v21 does it that way. Your typo fixes are included, but not the
> LastRemoved stuff being discussed here. I also edited the shortdesc in
> guc.c to better match {min,max}_wal_size.]
> 
> Hmm ... but if the user runs pg_resetwal to remove WAL segments, then
> this will work badly for a time (until a segment is removed next). I'm
> not very worried for that scenario, since surely the user will have to
> reclone any standbys anyway. I think your v20 behaves better in that
> case. But I'm not sure we should have that code to cater only to that
> case ... seems to me that it will go untested 99.999% of the time.

I feel the same. If we allow a bogus or "unknown" status before the first checkpoint, we don't need to scan the directory.

> Maybe you're aware of some other cases where lastRemovedSegNo is not
> correct for the purposes of this feature?

The cases of archive failure (false "removed") and a change of max_slot_wal_keep_size (false "normal/kept") mentioned in another mail.

> I pushed the silly test_decoding test adjustment to get it out of the
> way.
> 
> /me tries to figure out KeepLogSeg next

Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
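If that bogus-status window is considered acceptable, it could at least be made explicit rather than silent. A rough sketch of what GetWalAvailability could do instead, assuming a hypothetical WALAVAIL_UNKNOWN member were added to the WalAvailability enum (it is not in the posted patches):

    XLogSegNo   last_removed = XLogGetLastRemovedSegno();

    /*
     * Until the first checkpoint removes a segment, last_removed stays 0
     * and we cannot tell which files are really gone (e.g. after archive
     * failures, or after max_slot_wal_keep_size was lowered and the
     * server restarted).
     */
    if (last_removed == 0)
        return WALAVAIL_UNKNOWN;    /* hypothetical extra state */

    oldestSeg = last_removed + 1;

pg_get_replication_slots() would then report something like "unknown" in wal_status for that window instead of a possibly wrong "normal" or "lost".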
At Tue, 31 Mar 2020 18:01:36 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > I noticed some other things: > > 1. KeepLogSeg sends a warning message when slots fall behind. To do > this, it searches for "the most affected slot", that is, the slot that > lost the most data. But it seems to me that that's a bit pointless; if > a slot data, it's now useless and anything that was using that slot must > be recreated. If you only know what's the most affected slot, it's not > possible to see which *other* slots are affected. It doesn't matter if > the slot missed one segment or twenty segments or 9999 segments -- the > slot is now useless, or it is not useless. I think we should list the > slot that was *least* affected, i.e., the slot that lost the minimum > amount of segments; then the user knows that all slots that are older > than that one are *also* affected. Mmm. v17-0001 patch [1] shows it as the following: > WARNING: some replication slots have lost required WAL segments > DETAIL: Slot s1 lost 8 segment(s). > WARNING: some replication slots have lost required WAL segments > DETAIL: Slots s1, s2, s3 lost at most 9 segment(s). And it is removed following a comment as [2] :p I restored the feature in simpler shape in v22. [1] https://www.postgresql.org/message-id/flat/20191224.212614.633369820509385571.horikyota.ntt%40gmail.com#cbc193425b95edd166a5c6d42fd579c6 [2] https://www.postgresql.org/message-id/20200123.212854.658794168913258596.horikyota.ntt%40gmail.com > 2. KeepLogSeg ignores slots that are active. I guess the logic here is > that if a slot is active, then it'll keep going until it catches up and > we don't need to do anything about the used disk space. But that seems > a false premise, because if a standby is so slow that it cannot keep up, > it will eventually run the master out of diskspace even if it's active > all the time. So I'm not seeing the reasoning that makes it useful to > skip checking active slots. Right. I unconsciously assumed synchronous replication. It should be removed. Fixed. > (BTW I don't think you need to keep that many static variables in that > function. Just the slot name should be sufficient, I think ... or maybe > even the *pointer* to the slot that was last reported. Agreed. Fixed. > I think if a slot is behind and it lost segments, we should kill the > walsender that's using it, and unreserve the segments. So maybe > something like At Tue, 31 Mar 2020 19:07:49 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > I think we should kill(SIGTERM) the walsender using the slot (slot->active_pid), > > then acquire the slot and set it to some state indicating that it is now > > useless, no longer reserving WAL; so when the walsender is restarted, it > > will find the slot cannot be used any longer. > > Ah, I see ioguix already pointed this out and the response was that the > walsender stops by itself. Hmm. I suppose this works too ... it seems > a bit fragile, but maybe I'm too sensitive. Do we have other opinions > on this point? Yes it the check is performed after every block-read so walsender doesn't seem to send a wrong record. The 0002 added that for per-record basis so it can be said useless. But things get simpler by killing such walsenders under a subtle condition, I think. In the attached, 0002 removed and added walsender-kill code. > I sense some attempt to salvage slots that are reading a segment that is > "outdated" and removed, but for which the walsender has an open file > descriptor. (This appears to be the "losing" state.) 
This seems > dangerous, for example the segment might be recycled and is being > overwritten with different data. Trying to keep track of that seems > doomed. And even if the walsender can still read that data, it's only a > matter of time before the next segment is also removed. So keeping the > walsender alive is futile; it only delays the inevitable. Agreed. The attached is v22, only one patch file. - 0002 is removed - I didn't add "unknown" status in wal_status, because it is quite hard to explain reasonably. Instead, I added the following comment. + * Find the oldest extant segment file. We get 1 until checkpoint removes + * the first WAL segment file since startup, which causes the status being + * wrong under certain abnormal conditions but that doesn't actually harm. - Changed the message in KeepLogSeg as described above. - Don't ignore inactive slots in KeepLogSeg. - Out-of-sync walsenders are killed immediately. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 4afea34a5ad748fddb2061ec6ef0f2fc6e66ba6c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 19 Dec 2018 12:43:57 +0900 Subject: [PATCH v22] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- doc/src/sgml/catalogs.sgml | 48 +++ doc/src/sgml/config.sgml | 23 ++ doc/src/sgml/high-availability.sgml | 8 +- src/backend/access/transam/xlog.c | 326 ++++++++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slot.c | 1 + src/backend/replication/slotfuncs.c | 39 ++- src/backend/utils/misc/guc.c | 13 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 19 + src/include/catalog/pg_proc.dat | 6 +- src/test/recovery/t/018_replslot_limit.pl | 203 +++++++++++ src/test/regress/expected/rules.out | 6 +- 13 files changed, 660 insertions(+), 37 deletions(-) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 64614b569c..01a7802ed4 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9907,6 +9907,54 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this slot. + Valid values are: + <simplelist> + <member> + <literal>normal</literal> means that the claimed files + are within <varname>max_wal_size</varname> + </member> + <member> + <literal>keeping</literal> means that <varname>max_wal_size</varname> + is exceeded but still held by replication slots or + <varname>wal_keep_segments</varname> + </member> + <member> + <literal>losing</literal> means that some of the files are on the verge + of deletion, but can still be accessed by a session that's currently + reading it + </member> + <member> + <literal>lost</literal> means that some of them are definitely lost + and the session using this slot cannot continue replication. + This state also implies that the session using this slot has been + stopped. 
+ </member> + </simplelist> + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>remain</structfield></entry> + <entry><type>bigint</type></entry> + <entry></entry> + <entry>The amount in bytes of WAL that can be written before this slot + loses required WAL files. + If <structfield>restart_lsn</structfield> is null or + <structfield>wal_status</structfield> is <literal>losing</literal> + or <literal>lost</literal>, this field is null. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 2de21903a1..dc99c6868a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3758,6 +3758,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index b5d32bb720..624e5f94ad 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 977d448f50..e43603a2f4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -106,6 +106,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -759,7 +760,7 @@ static ControlFileData *ControlFile = NULL; */ #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD) -/* Convert min_wal_size_mb and max_wal_size_mb to equivalent segment count */ +/* Convert values of GUCs measured in megabytes to equiv. segment count */ #define ConvertToXSegs(x, segsize) \ (x / ((segsize) / (1024 * 1024))) @@ -896,6 +897,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, static void LocalSetXLogInsertAllowed(void); static void CreateEndOfRecoveryRecord(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); +static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr); static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo); static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); @@ -3929,9 +3931,10 @@ XLogGetLastRemovedSegno(void) return lastRemovedSegNo; } + /* - * Update the last removed segno pointer in shared memory, to reflect - * that the given XLOG file has been removed. + * Update the last removed segno pointer in shared memory, to reflect that the + * given XLOG file has been removed. */ static void UpdateLastRemovedPtr(char *filename) @@ -9451,6 +9454,201 @@ CreateRestartPoint(int flags) return true; } +/* + * Detect availability of the record at given targetLSN. + * + * targetLSN is restart_lsn of a slot. + * walsender_pid is the slot's walsender PID. + * + * Returns one of the following enum values. + * + * WALAVAIL_NORMAL means targetLSN is available because it is in the range of + * max_wal_size. If max_slot_wal_keep_size is smaller than max_wal_size, this + * state is not returned. + * + * WALAVAIL_PRESERVED means it is still available by preserving extra segments + * beyond max_wal_size. + * + * WALAVAIL_BEING_REMOVED means it is being removed or already removed but the + * replication stream on the given slot is live yet. The state may transit to + * WALAVAIL_PRESERVED or WALAVAIL_NORMAL state if the walsender advances + * restart_lsn. + * + * WALAVAIL_REMOVED means it is definitely lost. The replication stream on the + * slot cannot continue. + * + * returns WALAVAIL_NULL if restart_lsn is invalid. 
+ */ +WalAvailability +GetWalAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid) +{ + XLogRecPtr currpos; + XLogRecPtr slotPtr; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg;/* oldest segid kept by slot */ + uint64 keepSegs; + + /* the case where the slot has never been activated */ + if (XLogRecPtrIsInvalid(restart_lsn)) + return WALAVAIL_INVALID_LSN; + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + slotPtr = XLogGetReplicationSlotMinimumLSN(); + oldestSlotSeg = GetOldestKeepSegment(currpos, slotPtr); + + /* + * Find the oldest extant segment file. We get 1 until checkpoint removes + * the first WAL segment file since startup, which causes the status being + * wrong under certain abnormal conditions but that doesn't actually harm. + */ + oldestSeg = XLogGetLastRemovedSegno() + 1; + + /* calculate oldest segment by max_wal_size */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. + */ + if ((max_slot_wal_keep_size_mb <= 0 || + max_slot_wal_keep_size_mb >= max_wal_size_mb) && + oldestSegMaxWalSize <= restartSeg) + return WALAVAIL_NORMAL; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return WALAVAIL_PRESERVED; + } + + /* + * The segment is already lost or being lost. If the oldest segment is just + * after the restartSeg, running walsender may be reading the just removed + * segment. The walsender may safely move to the oldest existing segment in + * that case. + */ + if (oldestSeg == restartSeg + 1 && walsender_pid != 0) + return WALAVAIL_BEING_REMOVED; + + /* definitely lost. the walsender can no longer restart */ + return WALAVAIL_REMOVED; +} + +/* + * Returns minimum segment number that the next checkpoint must leave + * considering wal_keep_segments, replication slots and + * max_slot_wal_keep_size. + * + * currLSN is the current insert location. + * minSlotLSN is the minimum restart_lsn of all active slots. + */ +static XLogSegNo +GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN) +{ + XLogSegNo currSeg; + XLogSegNo minSlotSeg; + uint64 keepSegs = 0; /* # of segments actually kept */ + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size); + + /* + * Calculate how many segments are kept by slots first. The second + * term of the condition is just a sanity check. 
+ */ + if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg) + keepSegs = currSeg - minSlotSeg; + + /* Cap keepSegs by max_slot_wal_keep_size */ + if (max_slot_wal_keep_size_mb >= 0) + { + uint64 limitSegs; + + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (limitSegs < keepSegs) + keepSegs = limitSegs; + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && keepSegs < wal_keep_segments) + keepSegs = wal_keep_segments; + + /* avoid underflow, don't go below 1 */ + if (currSeg <= keepSegs) + return 1; + + return currSeg - keepSegs; +} + +/* + * Calculate remaining bytes until WAL segment for targetLSN will be removed. + */ +int64 +DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN) +{ + XLogSegNo currSeg; + uint64 limitSegs = 0; + int64 restbytes; + uint64 fragbytes; + XLogSegNo targetSeg; + + XLByteToSeg(currLSN, currSeg, wal_segment_size); + + /* Calculate how far back WAL segments are preserved */ + if (max_slot_wal_keep_size_mb >= 0) + limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (wal_keep_segments > 0 && limitSegs < wal_keep_segments) + limitSegs = wal_keep_segments; + + XLByteToSeg(targetLSN, targetSeg, wal_segment_size); + + /* avoid underflow */ + if (targetSeg + limitSegs < currSeg) + return 0; + + /* + * This slot still has all required segments. Calculate how + * many LSN bytes the slot has until it loses targetLSN. + */ + fragbytes = wal_segment_size - (currLSN % wal_segment_size); + XLogSegNoOffsetToRecPtr(targetSeg + limitSegs - currSeg, + fragbytes, wal_segment_size, + restbytes); + + /* + * Not realistic, but make sure that it is not out of the + * range of int64. No problem to do so since such large values + * have no significant difference. + */ + if (restbytes > PG_INT64_MAX) + restbytes = PG_INT64_MAX; + + return restbytes; +} + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. @@ -9462,38 +9660,112 @@ CreateRestartPoint(int flags) static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { - XLogSegNo segno; - XLogRecPtr keep; + XLogRecPtr slotminptr = InvalidXLogRecPtr; + XLogSegNo minSegNo; + XLogSegNo minSlotSegNo; + int nslots_affected = 0; - XLByteToSeg(recptr, segno, wal_segment_size); - keep = XLogGetReplicationSlotMinimumLSN(); + if (max_replication_slots > 0) + slotminptr = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) - { - /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) - segno = 1; - else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) + /* + * We should keep certain number of WAL segments after this checkpoint. + */ + minSegNo = GetOldestKeepSegment(recptr, slotminptr); + + /* + * Warn if the checkpoint is going to flush the segments required by + * replication slots. 
+ */ + if (!XLogRecPtrIsInvalid(slotminptr)) { - XLogSegNo slotSegNo; + StringInfoData slot_names; + List *pids_to_kill = NIL; + ListCell *lc; + + initStringInfo(&slot_names); + XLByteToSeg(slotminptr, minSlotSegNo, wal_segment_size); + + if (minSlotSegNo < minSegNo) + { + /* Some slots have lost required segments */ + XLogSegNo lost_segs = minSegNo - minSlotSegNo; + static char *prev_slot_names = NULL; + int i; + + /* Collect affected slot names */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (i = 0 ; i < max_replication_slots ; i++) + { + ReplicationSlot *s = + &ReplicationSlotCtl->replication_slots[i]; + XLogSegNo slotSegNo; + + XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size); + + if (s->in_use && slotSegNo < minSegNo) + { + if (slot_names.data[0]) + appendStringInfoString(&slot_names, ", "); + appendStringInfoString(&slot_names, s->data.name.data); + + /* remember the pid of walsender to kill */ + if (s->active_pid != 0) + pids_to_kill = lappend_int(pids_to_kill, s->active_pid); + + nslots_affected++; + } + } + + LWLockRelease(ReplicationSlotControlLock); + + /* Kill the walsenders that have lost segments to read */ + foreach(lc, pids_to_kill) + { + int pid = lfirst_int(lc); + ereport(LOG, + (errmsg("terminating walsender process (PID %d) due to WAL file removal", pid))); + kill(SIGTERM, pid); + } + + if (nslots_affected == 0) + { + /* No slots are affected, forget about previous state. */ + if (prev_slot_names) + { + pfree(prev_slot_names); + prev_slot_names = NULL; + } + } + /* Emit WARNING if affected slots are changed */ + else if (prev_slot_names == NULL || + strcmp(prev_slot_names, slot_names.data) != 0) + { + MemoryContext cxt; + + if (prev_slot_names) + pfree(prev_slot_names); - XLByteToSeg(keep, slotSegNo, wal_segment_size); + cxt = MemoryContextSwitchTo(TopMemoryContext); + prev_slot_names = pstrdup(slot_names.data); + MemoryContextSwitchTo(cxt); - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + ereport(WARNING, + (errmsg_plural ( + "%d replication slot has lost required WAL segments by %lu segments", + "%d replication slots have lost required WAL segments by %lu segments", + nslots_affected, nslots_affected, lost_segs), + errdetail_plural( + "Slot %s lost required segments.", + "Slots %s lost required segments.", + nslots_affected, prev_slot_names))); + } + } } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) - *logSegNo = segno; + if (minSegNo < *logSegNo) + *logSegNo = minSegNo; } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 83d00c6cde..775b8b7f20 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -863,7 +863,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.remain FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index d90c7235e9..a26f7999aa 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -49,6 +49,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/memutils.h" /* * Replication slot on-disk data structure. 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ce0c9127bc..47cd4375a1 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -234,7 +234,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -288,6 +288,7 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; int i; if (!slot->in_use) @@ -355,6 +356,42 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_BEING_REMOVED: + values[i++] = CStringGetTextDatum("losing"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >= 0 && + (walstate == WALAVAIL_NORMAL || + walstate == WALAVAIL_PRESERVED)) + { + values[i++] = + Int64GetDatum(DistanceToWalRemoval(GetXLogWriteRecPtr(), + restart_lsn)); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 79bc7ac8ca..a4f0a4e0e3 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2771,6 +2771,19 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum WAL size that can be reserved by replication slots."), + gettext_noop("Replication slots will be marked as failed, and segments released " + "for deletion or recycling, if this much space is occupied by WAL " + "on disk."), + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e9f8ca775d..0b696e7044 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -287,6 +287,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 9ec7b31cce..9d29d2263f 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; @@ -255,6 +256,20 @@ typedef struct CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL 
segment availability status + * + * This is used as the return value of GetWalAvailability. + */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter error */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_BEING_REMOVED, /* WAL segment is no longer preserved */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -268,6 +283,7 @@ extern int XLogFileOpen(XLogSegNo segno); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void); +extern XLogSegNo FindOldestXLogFileSegNo(void); extern void XLogSetAsyncXactLSN(XLogRecPtr record); extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn); @@ -305,6 +321,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern int64 DistanceToWalRemoval(XLogRecPtr currLSN, XLogRecPtr targetLSN); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a6a708cca9..2025f34bfd 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9971,9 +9971,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,int8}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,remain}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..4ff2ef3b48 --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,203 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. 
+ +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 4MB +log_checkpoints = yes +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1); +$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'"); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The remaining bytes should be as almost +# (max_slot_wal_keep_size + 1) times large as the segment size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. 
+advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that remaining byte is calculated correctly'); + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|7168 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); + +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# Advance WAL again without checkpoint, reducing remain by 4 MB. +advance_wal($node_master, 4); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(remain) as remain FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|1024 kB", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. +advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "1 replication slot has lost required WAL segments by 1 segments\n". 
+ ".*Slot rep1 lost required segments.", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, remain is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 7245b0e13b..8688f7138f 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1462,8 +1462,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.remain + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, remain) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.18.2
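For reference, the wal_status and remain columns that this version of the patch adds
to pg_replication_slots are meant to be consumed with an ordinary query. A minimal
monitoring sketch, assuming only the columns described above (remain is null when
restart_lsn is null or the slot is already losing or has lost WAL), could look like:

  SELECT slot_name, active, restart_lsn, wal_status,
         pg_size_pretty(remain) AS remain
    FROM pg_replication_slots
   ORDER BY remain ASC NULLS FIRST;

Slots with a null remain (not reserving WAL, or already past the limit) sort first,
followed by the slots with the least room left before they lose required segments.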
So, the more I look at this patch, the less I like the way the slots are handled. * I think it's a mistake to try to do anything in KeepLogSeg itself; that function is merely in charge of some arithmetic. I propose to make that function aware of the new size limitation (so that it doesn't trust the slot's LSNs completely), but to make the function have no side effects. The attached patch does that, I hope. To replace that responsibility, let's add another function. I named it InvalidateObsoleteReplicationSlots(). In CreateCheckPoint and CreateRestartPoint, we call the new function just before removing segments. Note: the one in this patch doesn't actually work or even compile. The new function must: 1. mark the slot as "invalid" somehow. Maybe it would make sense to add a new flag in the on-disk struct for this; but for now I'm just thinking that changing the slot's restart_lsn is sufficient. (Of course, I haven't tested this, so there might be side-effects that mean that this idea doesn't work). 2. send SIGTERM to a walsender that's using such a slot. 3. Send the warning message. Instead of trying to construct a message with a list of slots, send one message per slot. (I think this is better from a translatability point of view, and also from a monitoring PoV). * GetWalAvailability seems too much in competition with DistanceToWalRemoval. Which is weird, because both functions do pretty much the same thing. I think a better design is to make the former function return the distance as an out parameter. * Andres complained that the "distance" column was not a great value to expose (20171106132050.6apzynxrqrzghb4r@alap3.anarazel.de). That's right: it changes both by the insertion LSN as well as the slot's consumption. Maybe we can expose the earliest live LSN (start of the earliest segment?) as a new column. It'll be the same for all slots, I suppose, but we don't care, do we? I attach a rough sketch, which as I said before doesn't work and doesn't compile. Sadly I have reached the end of my day here so I won't be able to work on this for today anymore. I'll be glad to try again tomorrow, but in the meantime I thought it was better to send it over and see whether you had any thoughts about this proposed design (maybe you know it doesn't work for some reason), or better yet, you have the chance to actually complete the code or at least move it a little further. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
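The per-slot handling sketched above (invalidate the slot, signal its walsender, warn
once per slot) also has a manual counterpart that can be useful while testing the
patch: the wal_status column makes the affected slots visible from SQL, and the
existing pg_terminate_backend() and pg_drop_replication_slot() functions can then be
applied to them by hand. A rough sketch, assuming the 'lost' value defined by the
patch:

  -- terminate any walsender still attached to a slot whose WAL is already gone
  SELECT slot_name, active_pid, pg_terminate_backend(active_pid)
    FROM pg_replication_slots
   WHERE wal_status = 'lost' AND active_pid IS NOT NULL;

  -- once the walsender has exited, the dead slot can be dropped
  SELECT pg_drop_replication_slot(slot_name)
    FROM pg_replication_slots
   WHERE wal_status = 'lost' AND NOT active;

This is only illustrative; the point of InvalidateObsoleteReplicationSlots() is to
make the signalling part automatic at checkpoint time.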
At Fri, 3 Apr 2020 20:14:03 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
> So, the more I look at this patch, the less I like the way the slots are
> handled.
>
> * I think it's a mistake to try to do anything in KeepLogSeg itself;
> that function is merely in charge of some arithmetic. I propose to
> make that function aware of the new size limitation (so that it
> doesn't trust the slot's LSNs completely), but to make the function
> have no side effects. The attached patch does that, I hope.
> To replace that responsibility, let's add another function. I named it
> InvalidateObsoleteReplicationSlots(). In CreateCheckPoint and
> CreateRestartPoint, we call the new function just before removing
> segments. Note: the one in this patch doesn't actually work or even
> compile.

Agreed and thanks for the code. The patch is enough to express the
intention. I fixed some compilation errors and cleaned up
KeepLogSeg. InvalidateObsoleteReplicationSlots requires the "oldest
preserved segment" so it should be called before _logSegNo--, not
after.

> The new function must:
>
> 1. mark the slot as "invalid" somehow. Maybe it would make sense to
> add a new flag in the on-disk struct for this; but for now I'm just
> thinking that changing the slot's restart_lsn is sufficient.
> (Of course, I haven't tested this, so there might be side-effects that
> mean that this idea doesn't work).
>
> 2. send SIGTERM to a walsender that's using such a slot.
>
> 3. Send the warning message. Instead of trying to construct a message
> with a list of slots, send one message per slot. (I think this is
> better from a translatability point of view, and also from a
> monitoring PoV).
>
> * GetWalAvailability seems too much in competition with
> DistanceToWalRemoval. Which is weird, because both functions do
> pretty much the same thing. I think a better design is to make the
> former function return the distance as an out parameter.

I agree with all of the above. When a slot is invalidated, the following
message is logged.

LOG: slot rep1 is invalidated at 0/1C00000 due to exceeding max_slot_wal_keep_size

> * Andres complained that the "distance" column was not a great value to
> expose (20171106132050.6apzynxrqrzghb4r@alap3.anarazel.de). That's
> right: it changes both by the insertion LSN as well as the slot's
> consumption. Maybe we can expose the earliest live LSN (start of the
> earliest segment?) as a new column. It'll be the same for all slots,
> I suppose, but we don't care, do we?

I don't care, as long as users can calculate the "remain" of individual
slots (that is, how far the current LSN can advance before the slot
loses data). But the "earliest live LSN" (EL-LSN) is really not
relevant to the safeness of each slot. The distance from EL-LSN to
restart_lsn or the current LSN doesn't generally suggest the safeness
of individual slots. The only relevance would be that, if the distance
from EL-LSN to the current LSN is close to max_slot_wal_keep_size, the
most lagged slot could die in the short term.

FWIW, the relationship between the values is shown below.
(now)>>> <--- past ----------------------------+--------------------future ---> lastRemovedSegment + 1 "earliest_live_lsn" | segment X | | min(restart_lsn) restart_lsn[i] current_lsn | "The LSN X" .+...+................+...............+>>>..............|...+ | <--------max_slot_wal_keep_size------> | <---"remain" --------------->| So the "remain" is calculated using "restart_lsn(pg_lsn)", max_slot_wal_keep_size(int in MB), wal_keep_segments(in segments) and wal_segment_size (int in MB) and pg_current_wal_lsn()(pg_lsn). The formula could be simplified by ignoring the segment size, but anyway we don't have an arithmetic between pg_lsn and int in SQL interface. Anyway in this version I added the "min_safe_lsn". And adjust the TAP tests for that. It can use (pg_current_wal_lsn() - min_safe_lsn) as the alternative index since there is only one slot while the test. > I attach a rough sketch, which as I said before doesn't work and doesn't > compile. Sadly I have reached the end of my day here so I won't be able > to work on this for today anymore. I'll be glad to try again tomorrow, > but in the meantime I thought it was better to send it over and see > whether you had any thoughts about this proposed design (maybe you know > it doesn't work for some reason), or better yet, you have the chance to > actually complete the code or at least move it a little further. WALAVAIL_BEING_REMOVED is removed since walsender is now actively killed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From b2815ce65fd72b5fcb85d785588b4a5adc5f99ae Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Mon, 6 Apr 2020 17:56:46 +0900 Subject: [PATCH v24] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- doc/src/sgml/catalogs.sgml | 39 +++++ doc/src/sgml/config.sgml | 23 +++ doc/src/sgml/high-availability.sgml | 8 +- src/backend/access/transam/xlog.c | 145 +++++++++++++++--- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slot.c | 52 +++++++ src/backend/replication/slotfuncs.c | 38 ++++- src/backend/utils/misc/guc.c | 13 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 17 ++ src/include/catalog/pg_proc.dat | 6 +- src/include/replication/slot.h | 1 + src/test/regress/expected/rules.out | 6 +- 13 files changed, 320 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 64614b569c..de8ca5ccca 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9907,6 +9907,45 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this slot. 
+ Valid values are: + <simplelist> + <member> + <literal>normal</literal> means that the claimed files + are within <varname>max_wal_size</varname> + </member> + <member> + <literal>keeping</literal> means that <varname>max_wal_size</varname> + is exceeded but still held by replication slots or + <varname>wal_keep_segments</varname> + </member> + <member> + <literal>lost</literal> means that some of them are definitely lost + and the session using this slot cannot continue replication. This + state will be hardly seen because walsender that enters this state is + terminated immediately. + </member> + </simplelist> + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>min_safe_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The minimum LSN currently available for walsenders. + </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c4d6ed4bbc..17c18386e2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3753,6 +3753,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index b5d32bb720..624e5f94ad 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. 
</para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 977d448f50..8f28ffaab9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -106,6 +106,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -759,7 +760,7 @@ static ControlFileData *ControlFile = NULL; */ #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD) -/* Convert min_wal_size_mb and max_wal_size_mb to equivalent segment count */ +/* Convert values of GUCs measured in megabytes to equiv. segment count */ #define ConvertToXSegs(x, segsize) \ (x / ((segsize) / (1024 * 1024))) @@ -3929,9 +3930,10 @@ XLogGetLastRemovedSegno(void) return lastRemovedSegNo; } + /* - * Update the last removed segno pointer in shared memory, to reflect - * that the given XLOG file has been removed. + * Update the last removed segno pointer in shared memory, to reflect that the + * given XLOG file has been removed. */ static void UpdateLastRemovedPtr(char *filename) @@ -9049,6 +9051,7 @@ CreateCheckPoint(int flags) */ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); KeepLogSeg(recptr, &_logSegNo); + InvalidateObsoleteReplicationSlots(_logSegNo); _logSegNo--; RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); @@ -9383,6 +9386,7 @@ CreateRestartPoint(int flags) replayPtr = GetXLogReplayRecPtr(&replayTLI); endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr; KeepLogSeg(endptr, &_logSegNo); + InvalidateObsoleteReplicationSlots(_logSegNo); _logSegNo--; /* @@ -9451,48 +9455,143 @@ CreateRestartPoint(int flags) return true; } +/* + * Report availability of WAL for a replication slot + * restart_lsn and active_pid are straight from the slot info + * + * Returns one of the following enum values. + * + * WALAVAIL_NORMAL means targetLSN is available because it is in the range of + * max_wal_size. If max_slot_wal_keep_size is smaller than max_wal_size, this + * state is not returned. + * + * WALAVAIL_PRESERVED means it is still available by preserving extra segments + * beyond max_wal_size. + * + * WALAVAIL_REMOVED means it is definitely lost. The replication stream on the + * slot cannot continue. + * + * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL. + */ +WalAvailability +GetWalAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid) +{ + XLogRecPtr currpos; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg = InvalidXLogRecPtr;/* oldest segid kept by slot */ + uint64 keepSegs; + + /* slot does not reserve WAL. Either deactivated, or has never been active */ + if (XLogRecPtrIsInvalid(restart_lsn)) + return WALAVAIL_INVALID_LSN; + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + KeepLogSeg(currpos, &oldestSlotSeg); + + /* + * Find the oldest extant segment file. We get 1 until checkpoint removes + * the first WAL segment file since startup, which causes the status being + * wrong under certain abnormal conditions but that doesn't actually harm. 
+ */ + oldestSeg = XLogGetLastRemovedSegno() + 1; + + /* calculate oldest segment by max_wal_size and wal_keep_segments */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments), + wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. + */ + if ((max_slot_wal_keep_size_mb <= 0 || + max_slot_wal_keep_size_mb >= max_wal_size_mb) && + oldestSegMaxWalSize <= restartSeg) + return WALAVAIL_NORMAL; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return WALAVAIL_PRESERVED; + } + + /* definitely lost. the walsender can no longer restart */ + return WALAVAIL_REMOVED; +} + + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. * * This is calculated by subtracting wal_keep_segments from the given xlog * location, recptr and by making sure that that result is below the - * requirement of replication slots. + * requirement of replication slots. For the latter criterion we do consider + * the effects of max_slot_wal_keep_size: reserve at most that much space back + * from recptr. */ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { + XLogSegNo currSegNo; XLogSegNo segno; XLogRecPtr keep; - XLByteToSeg(recptr, segno, wal_segment_size); + XLByteToSeg(recptr, currSegNo, wal_segment_size); + segno = currSegNo; + keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * Calculate how many segments are kept by slots first. + */ + /* Cap keepSegs by max_slot_wal_keep_size */ + if (keep != InvalidXLogRecPtr) + { + XLByteToSeg(keep, segno, wal_segment_size); + + /* Reduce it if slots already reserves too many. 
*/ + if (max_slot_wal_keep_size_mb >= 0) + { + XLogRecPtr slot_keep_segs = + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (currSegNo - segno > slot_keep_segs) + segno = currSegNo - slot_keep_segs; + } + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && currSegNo - segno < wal_keep_segments) { /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) + if (currSegNo <= wal_keep_segments) segno = 1; else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + segno = currSegNo - wal_keep_segments; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) + if (XLogRecPtrIsInvalid(*logSegNo) || segno < *logSegNo) *logSegNo = segno; } diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 813ea8bfc3..d406ea8118 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -876,7 +876,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_safe_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index d90c7235e9..86ddff8b9d 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,58 @@ ReplicationSlotReserveWal(void) } } +/* + * Mark any slot that points to an LSN older than the given segment + * as invalid; it requires WAL that's about to be removed. + * + * NB - this runs as part of checkpoint, so avoid raising errors if possible. + */ +void +InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno) +{ + XLogRecPtr oldestLSN; + List *pids = NIL; + ListCell *cell; + + XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN); + + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (int i = 0; i < max_replication_slots; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (!s->in_use || s->data.restart_lsn == InvalidXLogRecPtr) + continue; + + if (s->data.restart_lsn < oldestLSN) + { + elog(LOG, "slot %s is invalidated at %X/%X due to exceeding max_slot_wal_keep_size", + s->data.name.data, + (uint32) (s->data.restart_lsn >> 32), + (uint32) s->data.restart_lsn); + /* mark this slot as invalid */ + SpinLockAcquire(&s->mutex); + s->data.restart_lsn = InvalidXLogRecPtr; + + /* remember PID for killing, if active*/ + if (s->active_pid != 0) + pids = lappend_int(pids, s->active_pid); + SpinLockRelease(&s->mutex); + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* + * Signal any active walsenders to terminate. We do not wait to observe + * them gone. + */ + foreach(cell, pids) + { + /* signal the walsender to terminate */ + (void) kill(lfirst_int(cell), SIGTERM); + } +} + /* * Flush all replication slots to disk. 
* diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ce0c9127bc..dc38b475c5 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -234,7 +234,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -288,6 +288,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; + XLogSegNo last_removed_seg; int i; if (!slot->in_use) @@ -355,6 +357,40 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >= 0 && + (walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_PRESERVED) && + ((last_removed_seg = XLogGetLastRemovedSegno()) != 0)) + { + XLogRecPtr min_safe_lsn; + + XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0, + wal_segment_size, min_safe_lsn); + values[i++] = Int64GetDatum(min_safe_lsn); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 64dc9fbd13..1cfb999748 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2763,6 +2763,19 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum WAL size that can be reserved by replication slots."), + gettext_noop("Replication slots will be marked as failed, and segments released " + "for deletion or recycling, if this much space is occupied by WAL " + "on disk."), + GUC_UNIT_MB + }, + &max_slot_wal_keep_size_mb, + -1, -1, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e904fa7300..507a72b712 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -288,6 +288,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 9ec7b31cce..33812bb3f9 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; @@ -255,6 +256,19 @@ typedef struct 
CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL segment availability status + * + * This is used as the return value of GetWalAvailability. + */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter error */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -305,6 +319,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern XLogRecPtr CalculateMaxmumSafeLSN(void); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a649e44d08..ef808c5c43 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9986,9 +9986,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_safe_lsn}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 3e95b019b3..6e469ea749 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -198,6 +198,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 6eec8ec568..ac31840739 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1462,8 +1462,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_safe_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_safe_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, 
pg_authid.rolsuper, -- 2.18.2
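A side note on the min_safe_lsn column introduced in v24: since there is no SQL
arithmetic between pg_lsn and the GUC value directly, any per-slot margin has to be
computed on the client side with pg_wal_lsn_diff(). A rough sketch of such a query,
assuming the v24 columns and deliberately ignoring wal_keep_segments and
segment-boundary rounding:

  SELECT slot_name, wal_status, restart_lsn, min_safe_lsn,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), min_safe_lsn))
           AS wal_retained_behind_current,
         pg_size_pretty(pg_size_bytes(current_setting('max_slot_wal_keep_size'))
                        - pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS approx_remain
    FROM pg_replication_slots;

The first expression is null whenever min_safe_lsn is; the second is only a rough
reading of the remaining margin and is meaningless with the default
max_slot_wal_keep_size of -1, where nothing is ever removed because of the limit.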
On 2020-Apr-06, Kyotaro Horiguchi wrote:

> > * Andres complained that the "distance" column was not a great value to
> > expose (20171106132050.6apzynxrqrzghb4r@alap3.anarazel.de). That's
> > right: it changes both by the insertion LSN as well as the slot's
> > consumption. Maybe we can expose the earliest live LSN (start of the
> > earliest segment?) as a new column. It'll be the same for all slots,
> > I suppose, but we don't care, do we?
>
> I don't care, as long as users can calculate the "remain" of individual
> slots (that is, how far the current LSN can advance before the slot
> loses data). But the "earliest live LSN" (EL-LSN) is really not
> relevant to the safeness of each slot. The distance from EL-LSN to
> restart_lsn or the current LSN doesn't generally suggest the safeness
> of individual slots. The only relevance would be that, if the distance
> from EL-LSN to the current LSN is close to max_slot_wal_keep_size, the
> most lagged slot could die in the short term.

Thanks for the revised version. Please note that you forgot to "git
add" the test file, so it's not in the patch.

I'm reviewing the patch now.

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-06, Kyotaro Horiguchi wrote:
> At Fri, 3 Apr 2020 20:14:03 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
> Agreed and thanks for the code. The patch is enough to express the
> intention. I fixed some compilation errors and cleaned up
> KeepLogSeg. InvalidateObsoleteReplicationSlots requires the "oldest
> preserved segment" so it should be called before _logSegNo--, not
> after.

Ah, of course, thanks.

> I agree with all of the above. When a slot is invalidated, the following
> message is logged.
>
> LOG: slot rep1 is invalidated at 0/1C00000 due to exceeding max_slot_wal_keep_size

Sounds good.

Here are a couple of further adjustments to your v24. This passes the
existing tests (pg_basebackup exception noted below), but I don't have
the updated 019_replslot_limit.pl, so that still needs to be verified.

First, cosmetic changes in xlog.c.

Second, an unrelated bugfix: ReplicationSlotsComputeLogicalRestartLSN()
is able to return InvalidXLogRecPtr if there's a slot with invalid
restart_lsn. I'm fairly certain that that's bogus. I think this needs
to be backpatched.

Third: The loop in InvalidateObsoleteReplicationSlots was reading
restart_lsn without acquiring the mutex. Split the "continue" line in
two, so in_use is checked without spinlock and restart_lsn is checked
with it. This means we also need to store restart_lsn in a local
variable before logging the message (because we don't want to log with
spinlock held). Also, use ereport() not elog() for that, and add quotes
to the slot name.

Lastly, I noticed that we're now changing the slot's restart_lsn to
Invalid without being the slot's owner, which goes counter to what is
said in slot.h:

 * - Individual fields are protected by mutex where only the backend owning
 * the slot is authorized to update the fields from its own slot. The
 * backend owning the slot does not need to take this lock when reading its
 * own fields, while concurrent backends not owning this slot should take the
 * lock when reading this slot's data.

What this means is that if the slot owner walsender updates the
restart_lsn to a newer value just as we (checkpointer) are trying to set
it to Invalid, the owner's value might persist and our value would be
lost.

AFAICT if we were really stressed about getting this exactly correct,
then we would have to kill the walsender, wait for it to die, then
ReplicationSlotAcquire and *then* update
MyReplicationSlot->data.restart_lsn. But I don't think we want to do
that during checkpoint, and I'm not sure we need to be as strict anyway:
it seems to me that it suffices to check restart_lsn for being invalid
in the couple of places where the slot's owner advances (which are the
two auxiliary functions for ProcessStandbyReplyMessage). I have done so
in the attached. There are other places where the restart_lsn is set,
but those seem to be used only when the slot is created. I don't think
we need to cover for those, but I'm not 100% sure about that.

However, the change in PhysicalConfirmReceivedLocation() breaks the way
slots work for pg_basebackup: apparently the slot is created with a
restart_lsn of Invalid and we only advance it the first time we process
a feedback message from pg_basebackup. I have a vague feeling that
that's bogus, but I'll have to look at the involved code a little bit
more closely to be sure about this.

One last thing: I think we need to ReplicationSlotMarkDirty() and
ReplicationSlotSave() after changing the LSN. My patch doesn't do that.
I noticed that the checkpoint already saved the slot once; maybe it would make more sense to avoid doubly-writing the files by removing CheckPointReplicationSlots() from CheckPointGuts, and instead call it just after doing InvalidateObsoleteReplicationSlots(). But this is not very important, since we don't expect to be modifying slots because of disk-space reasons very frequently anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2020-Apr-06, Alvaro Herrera wrote: > Lastly, I noticed that we're now changing the slot's restart_lsn to > Invalid without being the slot's owner, which goes counter to what is > said in slot.h: > > * - Individual fields are protected by mutex where only the backend owning > * the slot is authorized to update the fields from its own slot. The > * backend owning the slot does not need to take this lock when reading its > * own fields, while concurrent backends not owning this slot should take the > * lock when reading this slot's data. > > What this means is that if the slot owner walsender updates the > restart_lsn to a newer value just as we (checkpointer) are trying to set > it to Invalid, the owner's value might persist and our value would be > lost. > > AFAICT if we were really stressed about getting this exactly correct, > then we would have to kill the walsender, wait for it to die, then > ReplicationSlotAcquire and *then* update > MyReplicationSlot->data.restart_lsn. I had cold feet about the whole business of trying to write a non-owned replication slot, so I tried to implement the "exactly correct" idea above. That's v25 here. I think there's a race condition in this: if we kill a walsender and it restarts immediately before we (checkpointer) can acquire the slot, we will wait for it to terminate on its own. Fixing this requires changing the ReplicationSlotAcquire API so that it knows not to wait but also not to raise an error (so we can use an infinite loop: "acquire; if busy, send signal"). I also include a separate diff for a change that might or might not be necessary, where xmins reserved by slots with restart_lsn=invalid are ignored. I'm not yet sure that we should include this, but we should keep an eye on it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
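For comparison, the kill-the-walsender-then-deal-with-the-slot sequence described above is roughly what an operator can already do by hand in SQL. This is only a hedged sketch (the slot name is a placeholder, and it is the manual, operator-level view rather than what the checkpointer does internally):

    -- terminate whichever backend currently has the slot acquired
    SELECT pg_terminate_backend(active_pid)
      FROM pg_replication_slots
     WHERE slot_name = 'rep1' AND active;

    -- once released, the slot can be dropped or otherwise dealt with
    SELECT pg_drop_replication_slot('rep1');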
On 2020-Apr-06, Alvaro Herrera wrote: > I think there's a race condition in this: if we kill a walsender and it > restarts immediately before we (checkpoint) can acquire the slot, we > will wait for it to terminate on its own. Fixing this requires changing > the ReplicationSlotAcquire API so that it knows not to wait but not > raise error either (so we can use an infinite loop: "acquire, if busy > send signal") I think this should do it, but I didn't test it super-carefully and the usage of the condition variable is not entirely kosher. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
At Mon, 6 Apr 2020 12:54:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > Thanks for the revised version. Please note that you forgot to "git > add" the test file, to it's not in the patch. Oops! I forgot that I was working after just doing patch -p1 on my working directory. This is the version that contains the test script. > I'm reviewing the patch now. Thanks! regards. -- Kyotaro Horiguchi NTT Open Source Software Center From df92156779aaaf882659521863b833d2dbfa08b4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Mon, 6 Apr 2020 17:56:46 +0900 Subject: [PATCH v25] Add WAL relief vent for replication slots Replication slot is useful to maintain replication connection in the configurations where replication is so delayed that connection is broken. On the other hand so many WAL files can fill up disk that the master downs by a long delay. This feature, which is activated by a GUC "max_slot_wal_keep_size", protects master servers from suffering disk full by limiting the number of WAL files reserved by replication slots. --- doc/src/sgml/catalogs.sgml | 39 ++++ doc/src/sgml/config.sgml | 23 ++ doc/src/sgml/high-availability.sgml | 8 +- src/backend/access/transam/xlog.c | 145 ++++++++++-- src/backend/catalog/system_views.sql | 4 +- src/backend/replication/slot.c | 52 +++++ src/backend/replication/slotfuncs.c | 38 ++- src/backend/utils/misc/guc.c | 13 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 17 ++ src/include/catalog/pg_proc.dat | 6 +- src/include/replication/slot.h | 1 + src/test/recovery/t/018_replslot_limit.pl | 217 ++++++++++++++++++ src/test/regress/expected/rules.out | 6 +- 14 files changed, 537 insertions(+), 33 deletions(-) create mode 100644 src/test/recovery/t/018_replslot_limit.pl diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 64614b569c..de8ca5ccca 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9907,6 +9907,45 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </entry> </row> + <row> + <entry><structfield>wal_status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + + <entry>Availability of WAL files claimed by this slot. + Valid values are: + <simplelist> + <member> + <literal>normal</literal> means that the claimed files + are within <varname>max_wal_size</varname> + </member> + <member> + <literal>keeping</literal> means that <varname>max_wal_size</varname> + is exceeded but still held by replication slots or + <varname>wal_keep_segments</varname> + </member> + <member> + <literal>lost</literal> means that some of them are definitely lost + and the session using this slot cannot continue replication. This + state will be hardly seen because walsender that enters this state is + terminated immediately. + </member> + </simplelist> + The last two states are seen only when + <xref linkend="guc-max-slot-wal-keep-size"/> is + non-negative. If <structfield>restart_lsn</structfield> is NULL, this + field is null. + </entry> + </row> + + <row> + <entry><structfield>min_safe_lsn</structfield></entry> + <entry><type>pg_lsn</type></entry> + <entry></entry> + <entry>The minimum LSN currently available for walsenders. 
+ </entry> + </row> + </tbody> </tgroup> </table> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c4d6ed4bbc..17c18386e2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3753,6 +3753,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-max-slot-wal-keep-size" xreflabel="max_slot_wal_keep_size"> + <term><varname>max_slot_wal_keep_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slot_wal_keep_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify the maximum size of WAL files + that <link linkend="streaming-replication-slots">replication + slots</link> are allowed to retain in the <filename>pg_wal</filename> + directory at checkpoint time. + If <varname>max_slot_wal_keep_size</varname> is -1 (the default), + replication slots retain unlimited amount of WAL files. If + restart_lsn of a replication slot gets behind more than that megabytes + from the current LSN, the standby using the slot may no longer be able + to continue replication due to removal of required WAL files. You + can see the WAL availability of replication slots + in <link linkend="view-pg-replication-slots">pg_replication_slots</link>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index b5d32bb720..624e5f94ad 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -925,9 +925,11 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' <xref linkend="guc-archive-command"/>. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments - known to be needed. An advantage of these methods is that they bound - the space requirement for <literal>pg_wal</literal>; there is currently no way - to do this using replication slots. + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. </para> <para> Similarly, <xref linkend="guc-hot-standby-feedback"/> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 977d448f50..8f28ffaab9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -106,6 +106,7 @@ int wal_level = WAL_LEVEL_MINIMAL; int CommitDelay = 0; /* precommit delay in microseconds */ int CommitSiblings = 5; /* # concurrent xacts needed to sleep */ int wal_retrieve_retry_interval = 5000; +int max_slot_wal_keep_size_mb = -1; #ifdef WAL_DEBUG bool XLOG_DEBUG = false; @@ -759,7 +760,7 @@ static ControlFileData *ControlFile = NULL; */ #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD) -/* Convert min_wal_size_mb and max_wal_size_mb to equivalent segment count */ +/* Convert values of GUCs measured in megabytes to equiv. 
segment count */ #define ConvertToXSegs(x, segsize) \ (x / ((segsize) / (1024 * 1024))) @@ -3929,9 +3930,10 @@ XLogGetLastRemovedSegno(void) return lastRemovedSegNo; } + /* - * Update the last removed segno pointer in shared memory, to reflect - * that the given XLOG file has been removed. + * Update the last removed segno pointer in shared memory, to reflect that the + * given XLOG file has been removed. */ static void UpdateLastRemovedPtr(char *filename) @@ -9049,6 +9051,7 @@ CreateCheckPoint(int flags) */ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); KeepLogSeg(recptr, &_logSegNo); + InvalidateObsoleteReplicationSlots(_logSegNo); _logSegNo--; RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); @@ -9383,6 +9386,7 @@ CreateRestartPoint(int flags) replayPtr = GetXLogReplayRecPtr(&replayTLI); endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr; KeepLogSeg(endptr, &_logSegNo); + InvalidateObsoleteReplicationSlots(_logSegNo); _logSegNo--; /* @@ -9451,48 +9455,143 @@ CreateRestartPoint(int flags) return true; } +/* + * Report availability of WAL for a replication slot + * restart_lsn and active_pid are straight from the slot info + * + * Returns one of the following enum values. + * + * WALAVAIL_NORMAL means targetLSN is available because it is in the range of + * max_wal_size. If max_slot_wal_keep_size is smaller than max_wal_size, this + * state is not returned. + * + * WALAVAIL_PRESERVED means it is still available by preserving extra segments + * beyond max_wal_size. + * + * WALAVAIL_REMOVED means it is definitely lost. The replication stream on the + * slot cannot continue. + * + * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL. + */ +WalAvailability +GetWalAvailability(XLogRecPtr restart_lsn, pid_t walsender_pid) +{ + XLogRecPtr currpos; + XLogSegNo currSeg; /* segid of currpos */ + XLogSegNo restartSeg; /* segid of restart_lsn */ + XLogSegNo oldestSeg; /* actual oldest segid */ + XLogSegNo oldestSegMaxWalSize; /* oldest segid kept by max_wal_size */ + XLogSegNo oldestSlotSeg = InvalidXLogRecPtr;/* oldest segid kept by slot */ + uint64 keepSegs; + + /* slot does not reserve WAL. Either deactivated, or has never been active */ + if (XLogRecPtrIsInvalid(restart_lsn)) + return WALAVAIL_INVALID_LSN; + + currpos = GetXLogWriteRecPtr(); + + /* calculate oldest segment currently needed by slots */ + XLByteToSeg(restart_lsn, restartSeg, wal_segment_size); + KeepLogSeg(currpos, &oldestSlotSeg); + + /* + * Find the oldest extant segment file. We get 1 until checkpoint removes + * the first WAL segment file since startup, which causes the status being + * wrong under certain abnormal conditions but that doesn't actually harm. + */ + oldestSeg = XLogGetLastRemovedSegno() + 1; + + /* calculate oldest segment by max_wal_size and wal_keep_segments */ + XLByteToSeg(currpos, currSeg, wal_segment_size); + keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments), + wal_segment_size) + 1; + + if (currSeg > keepSegs) + oldestSegMaxWalSize = currSeg - keepSegs; + else + oldestSegMaxWalSize = 1; + + /* + * If max_slot_wal_keep_size has changed after the last call, the segment + * that would been kept by the current setting might have been lost by the + * previous setting. No point in showing normal or keeping status values if + * the restartSeg is known to be lost. + */ + if (restartSeg >= oldestSeg) + { + /* + * show "normal" when restartSeg is within max_wal_size. If + * max_slot_wal_keep_size is smaller than max_wal_size, there's no + * point in showing the status. 
+ */ + if ((max_slot_wal_keep_size_mb <= 0 || + max_slot_wal_keep_size_mb >= max_wal_size_mb) && + oldestSegMaxWalSize <= restartSeg) + return WALAVAIL_NORMAL; + + /* being retained by slots */ + if (oldestSlotSeg <= restartSeg) + return WALAVAIL_PRESERVED; + } + + /* definitely lost. the walsender can no longer restart */ + return WALAVAIL_REMOVED; +} + + /* * Retreat *logSegNo to the last segment that we need to retain because of * either wal_keep_segments or replication slots. * * This is calculated by subtracting wal_keep_segments from the given xlog * location, recptr and by making sure that that result is below the - * requirement of replication slots. + * requirement of replication slots. For the latter criterion we do consider + * the effects of max_slot_wal_keep_size: reserve at most that much space back + * from recptr. */ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) { + XLogSegNo currSegNo; XLogSegNo segno; XLogRecPtr keep; - XLByteToSeg(recptr, segno, wal_segment_size); + XLByteToSeg(recptr, currSegNo, wal_segment_size); + segno = currSegNo; + keep = XLogGetReplicationSlotMinimumLSN(); - /* compute limit for wal_keep_segments first */ - if (wal_keep_segments > 0) + /* + * Calculate how many segments are kept by slots first. + */ + /* Cap keepSegs by max_slot_wal_keep_size */ + if (keep != InvalidXLogRecPtr) + { + XLByteToSeg(keep, segno, wal_segment_size); + + /* Reduce it if slots already reserves too many. */ + if (max_slot_wal_keep_size_mb >= 0) + { + XLogRecPtr slot_keep_segs = + ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); + + if (currSegNo - segno > slot_keep_segs) + segno = currSegNo - slot_keep_segs; + } + } + + /* but, keep at least wal_keep_segments segments if any */ + if (wal_keep_segments > 0 && currSegNo - segno < wal_keep_segments) { /* avoid underflow, don't go below 1 */ - if (segno <= wal_keep_segments) + if (currSegNo <= wal_keep_segments) segno = 1; else - segno = segno - wal_keep_segments; - } - - /* then check whether slots limit removal further */ - if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) - { - XLogSegNo slotSegNo; - - XLByteToSeg(keep, slotSegNo, wal_segment_size); - - if (slotSegNo <= 0) - segno = 1; - else if (slotSegNo < segno) - segno = slotSegNo; + segno = currSegNo - wal_keep_segments; } /* don't delete WAL segments newer than the calculated segment */ - if (segno < *logSegNo) + if (XLogRecPtrIsInvalid(*logSegNo) || segno < *logSegNo) *logSegNo = segno; } diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 813ea8bfc3..d406ea8118 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -876,7 +876,9 @@ CREATE VIEW pg_replication_slots AS L.xmin, L.catalog_xmin, L.restart_lsn, - L.confirmed_flush_lsn + L.confirmed_flush_lsn, + L.wal_status, + L.min_safe_lsn FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index d90c7235e9..86ddff8b9d 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1064,6 +1064,58 @@ ReplicationSlotReserveWal(void) } } +/* + * Mark any slot that points to an LSN older than the given segment + * as invalid; it requires WAL that's about to be removed. + * + * NB - this runs as part of checkpoint, so avoid raising errors if possible. 
+ */ +void +InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno) +{ + XLogRecPtr oldestLSN; + List *pids = NIL; + ListCell *cell; + + XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN); + + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (int i = 0; i < max_replication_slots; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (!s->in_use || s->data.restart_lsn == InvalidXLogRecPtr) + continue; + + if (s->data.restart_lsn < oldestLSN) + { + elog(LOG, "slot %s is invalidated at %X/%X due to exceeding max_slot_wal_keep_size", + s->data.name.data, + (uint32) (s->data.restart_lsn >> 32), + (uint32) s->data.restart_lsn); + /* mark this slot as invalid */ + SpinLockAcquire(&s->mutex); + s->data.restart_lsn = InvalidXLogRecPtr; + + /* remember PID for killing, if active*/ + if (s->active_pid != 0) + pids = lappend_int(pids, s->active_pid); + SpinLockRelease(&s->mutex); + } + } + LWLockRelease(ReplicationSlotControlLock); + + /* + * Signal any active walsenders to terminate. We do not wait to observe + * them gone. + */ + foreach(cell, pids) + { + /* signal the walsender to terminate */ + (void) kill(lfirst_int(cell), SIGTERM); + } +} + /* * Flush all replication slots to disk. * diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ce0c9127bc..dc38b475c5 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -234,7 +234,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 11 +#define PG_GET_REPLICATION_SLOTS_COLS 13 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -288,6 +288,8 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Oid database; NameData slot_name; NameData plugin; + WalAvailability walstate; + XLogSegNo last_removed_seg; int i; if (!slot->in_use) @@ -355,6 +357,40 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) else nulls[i++] = true; + walstate = GetWalAvailability(restart_lsn, active_pid); + + switch (walstate) + { + case WALAVAIL_INVALID_LSN: + nulls[i++] = true; + break; + + case WALAVAIL_NORMAL: + values[i++] = CStringGetTextDatum("normal"); + break; + + case WALAVAIL_PRESERVED: + values[i++] = CStringGetTextDatum("keeping"); + break; + + case WALAVAIL_REMOVED: + values[i++] = CStringGetTextDatum("lost"); + break; + } + + if (max_slot_wal_keep_size_mb >= 0 && + (walstate == WALAVAIL_NORMAL || walstate == WALAVAIL_PRESERVED) && + ((last_removed_seg = XLogGetLastRemovedSegno()) != 0)) + { + XLogRecPtr min_safe_lsn; + + XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0, + wal_segment_size, min_safe_lsn); + values[i++] = Int64GetDatum(min_safe_lsn); + } + else + nulls[i++] = true; + tuplestore_putvalues(tupstore, tupdesc, values, nulls); } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 64dc9fbd13..1cfb999748 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2763,6 +2763,19 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the maximum WAL size that can be reserved by replication slots."), + gettext_noop("Replication slots will be marked as failed, and segments released " + "for deletion or recycling, if this much space is occupied by WAL " + "on disk."), + GUC_UNIT_MB + }, + 
&max_slot_wal_keep_size_mb, + -1, -1, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"wal_sender_timeout", PGC_USERSET, REPLICATION_SENDING, gettext_noop("Sets the maximum time to wait for WAL replication."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e904fa7300..507a72b712 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -288,6 +288,7 @@ #max_wal_senders = 10 # max number of walsender processes # (change requires restart) #wal_keep_segments = 0 # in logfile segments; 0 disables +#max_slot_wal_keep_size = -1 # measured in bytes; -1 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_replication_slots = 10 # max number of replication slots diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 9ec7b31cce..33812bb3f9 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -108,6 +108,7 @@ extern int wal_segment_size; extern int min_wal_size_mb; extern int max_wal_size_mb; extern int wal_keep_segments; +extern int max_slot_wal_keep_size_mb; extern int XLOGbuffers; extern int XLogArchiveTimeout; extern int wal_retrieve_retry_interval; @@ -255,6 +256,19 @@ typedef struct CheckpointStatsData extern CheckpointStatsData CheckpointStats; +/* + * WAL segment availability status + * + * This is used as the return value of GetWalAvailability. + */ +typedef enum WalAvailability +{ + WALAVAIL_INVALID_LSN, /* parameter error */ + WALAVAIL_NORMAL, /* WAL segment is within max_wal_size */ + WALAVAIL_PRESERVED, /* WAL segment is preserved by repslots */ + WALAVAIL_REMOVED /* WAL segment has been removed */ +} WalAvailability; + struct XLogRecData; extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, @@ -305,6 +319,9 @@ extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); extern void CreateCheckPoint(int flags); extern bool CreateRestartPoint(int flags); +extern WalAvailability GetWalAvailability(XLogRecPtr restart_lsn, + pid_t walsender_pid); +extern XLogRecPtr CalculateMaxmumSafeLSN(void); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a649e44d08..ef808c5c43 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9986,9 +9986,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_safe_lsn}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 3e95b019b3..6e469ea749 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h 
@@ -198,6 +198,7 @@ extern void ReplicationSlotsComputeRequiredLSN(void); extern XLogRecPtr ReplicationSlotsComputeLogicalRestartLSN(void); extern bool ReplicationSlotsCountDBSlots(Oid dboid, int *nslots, int *nactive); extern void ReplicationSlotsDropDBSlots(Oid dboid); +extern void InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno); extern void StartupReplicationSlots(void); extern void CheckPointReplicationSlots(void); diff --git a/src/test/recovery/t/018_replslot_limit.pl b/src/test/recovery/t/018_replslot_limit.pl new file mode 100644 index 0000000000..d4bf970bd3 --- /dev/null +++ b/src/test/recovery/t/018_replslot_limit.pl @@ -0,0 +1,217 @@ +# Test for replication slot limit +# Ensure that max_slot_wal_keep_size limits the number of WAL files to +# be kept by replication slots. + + +use strict; +use warnings; +use File::Path qw(rmtree); +use PostgresNode; +use TestLib; +use Test::More tests => 13; +use Time::HiRes qw(usleep); + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize master node, setting wal-segsize to 1MB +my $node_master = get_new_node('master'); +$node_master->init(allows_streaming => 1, extra => ['--wal-segsize=1']); +$node_master->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 4MB +log_checkpoints = yes +)); +$node_master->start; +$node_master->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); + +# The slot state and remain should be null before the first connection +my $result = $node_master->safe_psql('postgres', "SELECT restart_lsn is NULL, wal_status is NULL, min_safe_lsn is NULL FROMpg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "t|t|t", 'check the state of non-reserved slot is "unknown"'); + + +# Take backup +my $backup_name = 'my_backup'; +$node_master->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = get_new_node('standby_1'); +$node_standby->init_from_backup($node_master, $backup_name, has_streaming => 1); +$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'"); + +$node_standby->start; + +# Wait until standby has replayed enough data +my $start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +# Stop standby +$node_standby->stop; + +# Preparation done, the slot is the state "normal" now +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_safe_lsn is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check the catching-up state'); + +# Advance WAL by five segments (= 5MB) on master +advance_wal($node_master, 1); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when fitting max_wal_size +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_safe_lsn is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that restart_lsn is in max_wal_size'); + +advance_wal($node_master, 4); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is always "safe" when max_slot_wal_keep_size is not set +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_safe_lsn is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|t", 'check that slot is working'); + +# The standby can reconnect to master +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', 
$start_lsn); + +$node_standby->stop; + +# Set max_slot_wal_keep_size on master +my $max_slot_wal_keep_size_mb = 6; +$node_master->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = ${max_slot_wal_keep_size_mb}MB +)); +$node_master->reload; + +# The slot is in safe state. The distance from the min_safe_lsn should +# be as almost (max_slot_wal_keep_size - 1) times large as the segment +# size + +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(pg_current_wal_lsn() - min_safe_lsn) FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|5120 kB", 'check that max_slot_wal_keep_size is working'); + +# Advance WAL again then checkpoint, reducing remain by 2 MB. +advance_wal($node_master, 2); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# The slot is still working +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(pg_current_wal_lsn() - min_safe_lsn)FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|2048 kB", 'check that min_safe_lsn gets close to the current LSN'); + +# The standby can reconnect to master +$node_standby->start; +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); +$node_standby->stop; + +# wal_keep_segments overrides max_slot_wal_keep_size +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 8; SELECT pg_reload_conf();"); +# Advance WAL again then checkpoint, reducing remain by 6 MB. +advance_wal($node_master, 6); +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(pg_current_wal_lsn() - min_safe_lsn)as remain FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|normal|8192 kB", 'check that wal_keep_segments overrides max_slot_wal_keep_size'); +# restore wal_keep_segments +$result = $node_master->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_segments to 0; SELECT pg_reload_conf();"); + +# The standby can reconnect to master +$node_standby->start; +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); +$node_standby->stop; + +# Advance WAL again without checkpoint, reducing remain by 6 MB. +advance_wal($node_master, 6); + +# Slot gets into 'keeping' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(restart_lsn - min_safe_lsn)as remain FROM pg_replication_slots WHERE slot_name = 'rep1'"); +is($result, "$start_lsn|keeping|216 bytes", 'check that the slot state changes to "keeping"'); + +# do checkpoint so that the next checkpoint runs too early +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# Advance WAL again without checkpoint; remain goes to 0. 
+advance_wal($node_master, 1); + +# Slot gets into 'lost' state +$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, min_safe_lsn is NULL FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "$start_lsn|lost|t", 'check that the slot state changes to "lost"'); + +# The standby still can connect to master before a checkpoint +$node_standby->start; + +$start_lsn = $node_master->lsn('write'); +$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn); + +$node_standby->stop; + +ok(!find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed"), + 'check that required WAL segments are still available'); + +# Advance WAL again, the slot loses the oldest segment. +my $logstart = get_log_size($node_master); +advance_wal($node_master, 7); +$node_master->safe_psql('postgres', "CHECKPOINT;"); + +# WARNING should be issued +ok(find_in_log($node_master, + "slot rep1 is invalidated at [0-9A-F/]+ due to exceeding max_slot_wal_keep_size\n", + $logstart), + 'check that the warning is logged'); + +# This slot should be broken +$result = $node_master->safe_psql('postgres', "SELECT slot_name, active, restart_lsn, wal_status, min_safe_lsn FROM pg_replication_slotsWHERE slot_name = 'rep1'"); +is($result, "rep1|f|||", 'check that the slot became inactive'); + +# The standby no longer can connect to the master +$logstart = get_log_size($node_standby); +$node_standby->start; + +my $failed = 0; +for (my $i = 0 ; $i < 10000 ; $i++) +{ + if (find_in_log($node_standby, + "requested WAL segment [0-9A-F]+ has already been removed", + $logstart)) + { + $failed = 1; + last; + } + usleep(100_000); +} +ok($failed, 'check that replication has been broken'); + +$node_standby->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (16 * $n) MB) on master + for (my $i = 0 ; $i < $n ; $i++) + { + $node->safe_psql('postgres', "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } +} + +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = TestLib::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 6eec8ec568..ac31840739 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1462,8 +1462,10 @@ pg_replication_slots| SELECT l.slot_name, l.xmin, l.catalog_xmin, l.restart_lsn, - l.confirmed_flush_lsn - FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn) + l.confirmed_flush_lsn, + l.wal_status, + l.min_safe_lsn + FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, temporary, active, active_pid, xmin, catalog_xmin,restart_lsn, confirmed_flush_lsn, wal_status, min_safe_lsn) LEFT JOIN pg_database d ON ((l.datoid = d.oid))); pg_roles| SELECT pg_authid.rolname, pg_authid.rolsuper, -- 2.18.2
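To make the intended usage of the patch above concrete, here is roughly how the GUC and the new view columns would be exercised by hand; a sketch only, with an illustrative size value:

    -- cap the WAL that replication slots may retain (SIGHUP parameter, so a reload suffices)
    ALTER SYSTEM SET max_slot_wal_keep_size = '500MB';
    SELECT pg_reload_conf();

    -- then watch each slot's status as WAL accumulates:
    --   normal  = claimed WAL is still within max_wal_size
    --   keeping = retained beyond max_wal_size but not yet removed
    --   lost    = required segments have been removed; the slot cannot continue
    SELECT slot_name, wal_status, restart_lsn, min_safe_lsn
      FROM pg_replication_slots;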
On 2020-Apr-07, Kyotaro Horiguchi wrote: > At Mon, 6 Apr 2020 12:54:56 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > Thanks for the revised version. Please note that you forgot to "git > > add" the test file, to it's not in the patch. > > Oops! I forgot that I was working after just doing patch -p1 on my > working directory. This is the version that contains the test script. Thanks! This v26 is what I submitted last (sans the "xmin" business I mentioned), with this test file included, adjusted for the message wording I used. These tests all pass for me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
At Mon, 6 Apr 2020 14:58:39 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > LOG: slot rep1 is invalidated at 0/1C00000 due to exceeding max_slot_wal_keep_size > > Sounds good. Here's a couple of further adjustments to your v24. This > passes the existing tests (pg_basebackup exception noted below), but I > don't have the updated 019_replslot_limit.pl, so that still needs to be > verified. > > First, cosmetic changes in xlog.c. > > Second, an unrelated bugfix: ReplicationSlotsComputeLogicalRestartLSN() > is able to return InvalidXLogRecPtr if there's a slot with invalid > restart_lsn. I'm fairly certain that that's bogus. I think this needs > to be backpatched. Logical slots are not assumed to be in that state, that is, in_use but with an invalid restart_lsn. Maybe we need to define the behavior when restart_lsn is invalid (but confirmed_flush_lsn is valid)? > Third: The loop in InvalidateObsoleteReplicationSlots was reading > restart_lsn without aquiring mutex. Split the "continue" line in two, so > in_use is checked without spinlock and restart_lsn is checked with it. Right. Thanks. > This means we also need to store restart_lsn in a local variable before > logging the message (because we don't want to log with spinlock held). > Also, use ereport() not elog() for that, and add quotes to the slot > name. I omitted the quotes since slot names don't contain whitespace, but, yes, it is quoted in other places. elog is just my bad. > Lastly, I noticed that we're now changing the slot's restart_lsn to > Invalid without being the slot's owner, which goes counter to what is > said in slot.h: > > * - Individual fields are protected by mutex where only the backend owning * the slot is authorized to update the fields from its own slot. The * backend owning the slot does not need to take this lock when reading its * own fields, while concurrent backends not owning this slot should take the * lock when reading this slot's data. > > What this means is that if the slot owner walsender updates the > restart_lsn to a newer value just as we (checkpointer) are trying to set > it to Invalid, the owner's value might persist and our value would be > lost. Right. > AFAICT if we were really stressed about getting this exactly correct, > then we would have to kill the walsender, wait for it to die, then > ReplicationSlotAcquire and *then* update > MyReplicationSlot->data.restart_lsn. But I don't think we want to do > that during checkpoint, and I'm not sure we need to be as strict anyway: Agreed. > it seems to me that it suffices to check restart_lsn for being invalid > in the couple of places where the slot's owner advances (which is the > two auxiliary functions for ProcessStandbyReplyMessage). I have done so > in the attached. There are other places where the restart_lsn is set, > but those seem to be used only when the slot is created. I don't think > we need to cover for those, but I'm not 100% sure about that. StartLogicalReplication does "XLogBeginRead(..., MyReplicationSlot->data.restart_lsn)". If the restart_lsn is invalid, the following call to XLogReadRecord runs into an assertion failure. Walsender (or StartLogicalReplication) should correctly reject reconnection from the subscriber if restart_lsn is invalid. > However, the change in PhysicalConfirmReceivedLocation() breaks > the way slots work for pg_basebackup: apparently the slot is created > with a restart_lsn of Invalid and we only advance it the first time we > process a feedback message from pg_basebackup. 
I have a vague feeling > that that's bogus, but I'll have to look at the involved code a little > bit more closely to be sure about this. Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? > One last thing: I think we need to ReplicationSlotMarkDirty() and > ReplicationSlotSave() after changing the LSN. My patch doesn't do that. Oops. > I noticed that the checkpoint already saved the slot once; maybe it > would make more sense to avoid doubly-writing the files by removing > CheckPointReplicationSlots() from CheckPointGuts, and instead call it > just after doing InvalidateObsoleteReplicationSlots(). But this is not > very important, since we don't expect to be modifying slots because of > disk-space reasons very frequently anyway. Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 07 Apr 2020 12:09:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > it seems to me that it suffices to check restart_lsn for being invalid > > in the couple of places where the slot's owner advances (which is the > > two auxiliary functions for ProcessStandbyReplyMessage). I have done so > > in the attached. There are other places where the restart_lsn is set, > > but those seem to be used only when the slot is created. I don't think > > we need to cover for those, but I'm not 100% sure about that. > > StartLogicalReplcation does > "XLogBeginRead(,MyReplicationSlot->data.restart_lsn)". If the > restart_lsn is invalid, following call to XLogReadRecord runs into > assertion failure. Walsender (or StartLogicalReplication) should > correctly reject reconnection from the subscriber if restart_lsn is > invalid. > > > However, the change in PhysicalConfirmReceivedLocation() breaks > > the way slots work for pg_basebackup: apparently the slot is created > > with a restart_lsn of Invalid and we only advance it the first time we > > process a feedback message from pg_basebackup. I have a vague feeling > > that that's bogus, but I'll have to look at the involved code a little > > bit more closely to be sure about this. > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? I did that in the attached. The invalidated flag is a shared-but-not-saved member of a slot, initialized to false and then irreversibly changed to true when the slot loses a required segment. It is checked by the new function CheckReplicationSlotInvalidated() when acquiring a slot and when updating a slot from a standby reply message. This change stops the walsender without explicitly killing it, but I didn't remove that code. When a logical slot loses a required segment, the publisher complains as follows: [backend ] LOG: slot "s1" is invalidated at 0/370001C0 due to exceeding max_slot_wal_keep_size [walsender] FATAL: terminating connection due to administrator command The subscriber tries to reconnect and that fails as follows: [19350] ERROR: replication slot "s1" is invalidated [19352] ERROR: replication slot "s1" is invalidated ... If the publisher restarts, that message is not seen; the following appears instead. [19372] ERROR: requested WAL segment 000000010000000000000037 has already been removed The check is done in ReplicationSlotAcquire, so some slot-related SQL functions are affected. =# select pg_replication_slot_advance('s1', '0/37000000'); ERROR: replication slot "s1" is invalidated After restarting the publisher, the message changes in the same way as for the walsender. =# select pg_replication_slot_advance('s1', '0/380001C0'); ERROR: requested WAL segment pg_wal/000000010000000000000037 has already been removed Since I didn't touch restart_lsn at all, there is no fear of changing other behavior inadvertently. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center From 3f81c5740ea3554835bbe794820624b56c9c3ea8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 7 Apr 2020 11:09:54 +0900 Subject: [PATCH] further change type 2 --- src/backend/access/transam/xlog.c | 17 +++++---- src/backend/replication/slot.c | 59 ++++++++++++++++++++++++----- src/backend/replication/walsender.c | 2 + src/include/replication/slot.h | 7 ++++ 4 files changed, 68 insertions(+), 17 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f28ffaab9..c5b96126ee 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9559,20 +9559,21 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) XLByteToSeg(recptr, currSegNo, wal_segment_size); segno = currSegNo; + /* + * Calculate how many segments are kept by slots first, adjusting + * for max_slot_wal_keep_size. + */ keep = XLogGetReplicationSlotMinimumLSN(); - - /* - * Calculate how many segments are kept by slots first. - */ - /* Cap keepSegs by max_slot_wal_keep_size */ if (keep != InvalidXLogRecPtr) { XLByteToSeg(keep, segno, wal_segment_size); - /* Reduce it if slots already reserves too many. */ + /* Cap by max_slot_wal_keep_size ... */ if (max_slot_wal_keep_size_mb >= 0) { - XLogRecPtr slot_keep_segs = + XLogRecPtr slot_keep_segs; + + slot_keep_segs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); if (currSegNo - segno > slot_keep_segs) @@ -9580,7 +9581,7 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) } } - /* but, keep at least wal_keep_segments segments if any */ + /* but, keep at least wal_keep_segments if that's set */ if (wal_keep_segments > 0 && currSegNo - segno < wal_keep_segments) { /* avoid underflow, don't go below 1 */ diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 86ddff8b9d..0a28b27607 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -277,6 +277,7 @@ ReplicationSlotCreate(const char *name, bool db_specific, StrNCpy(NameStr(slot->data.name), name, NAMEDATALEN); slot->data.database = db_specific ? MyDatabaseId : InvalidOid; slot->data.persistency = persistency; + slot->invalidated = false; /* and then data only present in shared memory */ slot->just_dirtied = false; @@ -323,6 +324,29 @@ ReplicationSlotCreate(const char *name, bool db_specific, ConditionVariableBroadcast(&slot->active_cv); } + +/* + * Check if the slot is invalidated. + */ +void +CheckReplicationSlotInvalidated(ReplicationSlot *slot) +{ + bool invalidated; + + /* Take lock to read slot name for error message. */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + invalidated = slot->invalidated; + + /* If the slot is invalidated, error out. */ + if (invalidated) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("replication slot \"%s\" is invalidated", + NameStr(slot->data.name)))); + + LWLockRelease(ReplicationSlotControlLock); +} + /* * Find a previously created slot and mark it as used by this backend. */ @@ -412,6 +436,9 @@ retry: /* We made this slot active, so it's ours now. 
*/ MyReplicationSlot = slot; + + /* Finally, check if the slot is invalidated */ + CheckReplicationSlotInvalidated(slot); } /* @@ -1083,25 +1110,39 @@ InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno) for (int i = 0; i < max_replication_slots; i++) { ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + XLogRecPtr restart_lsn = InvalidXLogRecPtr; - if (!s->in_use || s->data.restart_lsn == InvalidXLogRecPtr) + if (!s->in_use) + continue; + + if (s->invalidated) continue; if (s->data.restart_lsn < oldestLSN) { - elog(LOG, "slot %s is invalidated at %X/%X due to exceeding max_slot_wal_keep_size", - s->data.name.data, - (uint32) (s->data.restart_lsn >> 32), - (uint32) s->data.restart_lsn); - /* mark this slot as invalid */ SpinLockAcquire(&s->mutex); - s->data.restart_lsn = InvalidXLogRecPtr; - /* remember PID for killing, if active*/ + /* mark this slot as invalid */ + s->invalidated = true; + + /* remember restart_lsn for logging */ + restart_lsn = s->data.restart_lsn; + + SpinLockRelease(&s->mutex); + + /* remember PID for killing, if active */ if (s->active_pid != 0) pids = lappend_int(pids, s->active_pid); - SpinLockRelease(&s->mutex); } + SpinLockRelease(&s->mutex); + + if (restart_lsn != InvalidXLogRecPtr) + ereport(LOG, + errmsg("slot \"%s\" is invalidated at %X/%X due to exceeding max_slot_wal_keep_size", + NameStr(s->data.name), + (uint32) (restart_lsn >> 32), + (uint32) restart_lsn)); + } LWLockRelease(ReplicationSlotControlLock); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 76ec3c7dd0..c582f34fcc 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1897,6 +1897,8 @@ ProcessStandbyReplyMessage(void) */ if (MyReplicationSlot && flushPtr != InvalidXLogRecPtr) { + CheckReplicationSlotInvalidated(MyReplicationSlot); + if (SlotIsLogical(MyReplicationSlot)) LogicalConfirmReceivedLocation(flushPtr); else diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 6e469ea749..f7531ca495 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -98,6 +98,9 @@ typedef struct ReplicationSlotPersistentData * backend owning the slot does not need to take this lock when reading its * own fields, while concurrent backends not owning this slot should take the * lock when reading this slot's data. + * - The invalidated field is initially false then changed to true + * irreversibly by other than the owner and read by the possible next owner + * process after the termination of the current owner. */ typedef struct ReplicationSlot { @@ -131,6 +134,9 @@ typedef struct ReplicationSlot /* data surviving shutdowns and crashes */ ReplicationSlotPersistentData data; + /* is invalidated ? */ + bool invalidated; + /* is somebody performing io on this slot? */ LWLock io_in_progress_lock; @@ -187,6 +193,7 @@ extern void ReplicationSlotDrop(const char *name, bool nowait); extern void ReplicationSlotAcquire(const char *name, bool nowait); extern void ReplicationSlotRelease(void); extern void ReplicationSlotCleanup(void); +extern void CheckReplicationSlotInvalidated(ReplicationSlot *slot); extern void ReplicationSlotSave(void); extern void ReplicationSlotMarkDirty(void); -- 2.18.2
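Once a slot has been invalidated like this, the standby or subscriber using it cannot resume from it, so recovery generally means dropping the slot and re-creating it after re-syncing the downstream. A rough operator-side sketch, assuming the patch's "lost" wal_status value (with the behavior where restart_lsn is cleared, the column may instead show NULL):

    SELECT slot_name, active, wal_status
      FROM pg_replication_slots
     WHERE wal_status = 'lost';

    -- after confirming the slot is no longer needed:
    SELECT pg_drop_replication_slot(slot_name)
      FROM pg_replication_slots
     WHERE wal_status = 'lost' AND NOT active;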
On 2020-Apr-07, Kyotaro Horiguchi wrote: > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? > > I did that in the attached. The invalidated is shared-but-not-saved > member of a slot and initialized to false then irreversibly changed to > true when the slot loses required segment. > > It is checked by the new function CheckReplicationSlotInvalidated() at > acquireing a slot and at updating slot by standby reply message. This > change stops walsender without explicitly killing but I didn't remove > that code. This change didn't work well with my proposed change to make the checkpointer acquire slots before marking them invalid. When I incorporated your patch in the last version I posted yesterday, there was a problem: when the checkpointer attempted to acquire the slot, it would fail with "the slot is invalidated"; likewise, trying to drop the slot would obviously fail. I think it would work to remove the SlotIsInvalidated check from the Acquire routine, and instead move it to the routines that need it (i.e., not the InvalidateObsolete one, and also not the routine to drop slots). I pushed version 26, with a few further adjustments. I think what we have now is sufficient, but if you want to attempt this "invalidated" flag on top of what I pushed, be my guest. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for committing this. At Tue, 7 Apr 2020 18:45:22 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Apr-07, Kyotaro Horiguchi wrote: > > > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? > > > > I did that in the attached. The invalidated is shared-but-not-saved > > member of a slot and initialized to false then irreversibly changed to > > true when the slot loses required segment. > > > > It is checked by the new function CheckReplicationSlotInvalidated() at > > acquireing a slot and at updating slot by standby reply message. This > > change stops walsender without explicitly killing but I didn't remove > > that code. > > This change didn't work well with my proposed change to make > checkpointer acquire slots before marking them invalid. When I > incorporated your patch in the last version I posted yesterday, there > was a problem that when checkpointer attempted to acquire the slot, it > would fail with "the slot is invalidated"; also if you try to drop the > slot, it would obviously fail. I think it would work to remove the > SlotIsInvalidated check from the Acquire routine, and instead move it to > the routines that need it (ie. not the InvalidateObsolete one, and also > not the routine to drop slots). > > I pushed version 26, with a few further adjustments. > > I think what we have now is sufficient, but if you want to attempt this > "invalidated" flag on top of what I pushed, be my guest. I don't think the invalidation flag is essential but it can prevent unanticipated behavior, in other words, it makes us feel at ease:p On the current master/HEAD, the following steps cause an assertion failure in xlogreader.c. P(ublisher) $ vi $PGDATA/postgresql.conf wal_level=logical max_slot_wal_keep_size=0 ^Z (start publisher and subscriber) P=> create table t(a int); P=> create publication p1 for table t; S=> create table t(a int); P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint; (publisher crashes) 2020-04-08 09:20:16.893 JST [9582] LOG: invalidating slot "s1" because its restart_lsn 0/1571770 exceeds max_slot_wal_keep_size 2020-04-08 09:20:16.897 JST [9496] LOG: database system is ready to accept connections 2020-04-08 09:20:21.472 JST [9597] LOG: starting logical decoding for slot "s1" 2020-04-08 09:20:21.472 JST [9597] DETAIL: Streaming transactions committing after 0/1571770, reading WAL from 0/0. 
TRAP: FailedAssertion("!XLogRecPtrIsInvalid(RecPtr)", File: "xlogreader.c", Line: 235) postgres: walsender horiguti [local] idle(ExceptionalCondition+0xa8)[0xaac4c1] postgres: walsender horiguti [local] idle(XLogBeginRead+0x30)[0x588dbf] postgres: walsender horiguti [local] idle[0x8c938b] postgres: walsender horiguti [local] idle(exec_replication_command+0x311)[0x8c9c75] postgres: walsender horiguti [local] idle(PostgresMain+0x79a)[0x92f091] postgres: walsender horiguti [local] idle[0x87eec3] postgres: walsender horiguti [local] idle[0x87e69a] postgres: walsender horiguti [local] idle[0x87abc2] postgres: walsender horiguti [local] idle(PostmasterMain+0x11cd)[0x87a48f] postgres: walsender horiguti [local] idle[0x7852cb] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fc190958873] postgres: walsender horiguti [local] idle(_start+0x2e)[0x48169e] 2020-04-08 09:20:22.255 JST [9496] LOG: server process (PID 9597) was terminated by signal 6: Aborted 2020-04-08 09:20:22.255 JST [9496] LOG: terminating any other active server processes 2020-04-08 09:20:22.256 JST [9593] WARNING: terminating connection because of crash of another server process 2020-04-08 09:20:22.256 JST [9593] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. I will look at it. On the other hand, physical replication doesn't break on invalidation. Primary: postgres.conf max_slot_wal_keep_size=0 Standby: postgres.conf primary_conninfo='connect to master' primary_slot_name='x1' (start the primary) P=> select pg_create_physical_replication_slot('x1'); (start the standby) P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint; (primary log) 2020-04-08 09:35:09.719 JST [10064] LOG: terminating walsender 10076 because replication slot "x1" is too far behind 2020-04-08 09:35:09.719 JST [10076] FATAL: terminating connection due to administrator command 2020-04-08 09:35:09.720 JST [10064] LOG: invalidating slot "x1" because its restart_lsn 0/B9F2000 exceeds max_slot_wal_keep_size (standby) [10075] 2020-04-08 09:35:09.723 JST FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. [10101] 2020-04-08 09:35:09.734 JST LOG: started streaming WAL from primary at 0/C000000 on timeline 1 It does no harm, but something's strange. I'll look into it, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > I pushed version 26, with a few further adjustments. > > > > I think what we have now is sufficient, but if you want to attempt this > > "invalidated" flag on top of what I pushed, be my guest. > > I don't think the invalidation flag is essential but it can prevent > unanticipated behavior, in other words, it makes us feel at ease:p > > After the current master/HEAD, the following steps causes assertion > failure in xlogreader.c. .. > I will look at it. Just avoiding starting replication when restart_lsn is invalid is sufficient (see the attached, which is equivalent to part of what the invalidated flag did). I think that the error message needs a hint, but on the subscriber side it looks like this: [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated HINT: The slot exceeds the limit by max_slot_wal_keep_size. I don't think it is clean, though. Perhaps the subscriber should remove the trailing line of the message from the publisher? > On the other hand, physical replication doesn't break by invlidation. > > Primary: postgres.conf > max_slot_wal_keep_size=0 > Standby: postgres.conf > primary_conninfo='connect to master' > primary_slot_name='x1' > > (start the primary) > P=> select pg_create_physical_replication_slot('x1'); > (start the standby) > S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint; If we don't mind that the standby can reconnect after a walsender termination due to the invalidation, we don't need to do anything about this. Restricting max_slot_wal_keep_size to be larger than a certain threshold would reduce the chance we see that behavior. I saw another issue: the following sequence on the primary freezes when invalidation happens. =# create table tt(); drop table tt; select pg_switch_wal(); create table tt(); drop table tt; select pg_switch_wal(); create table tt(); drop table tt; select pg_switch_wal(); checkpoint; The last checkpoint command is waiting for CV on CheckpointerShmem->start_cv in RequestCheckpoint(), while Checkpointer is waiting for the next latch at the end of CheckpointerMain. new_started doesn't advance; it stays at the same value as old_started. That freeze didn't happen when I removed ConditionVariableSleep(&s->active_cv) in InvalidateObsoleteReplicationSlots. I continue investigating it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From b3f7e2d94b8ea9b5f3819fcf47c0e1ba57355b87 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 8 Apr 2020 14:03:01 +0900 Subject: [PATCH] walsender crash fix --- src/backend/replication/walsender.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 06e8b79036..707de65f4b 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1170,6 +1170,13 @@ StartLogicalReplication(StartReplicationCmd *cmd) pq_flush(); /* Start reading WAL from the oldest required WAL. */ + if (MyReplicationSlot->data.restart_lsn == InvalidXLogRecPtr) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("replication slot \"%s\" is invalidated", + cmd->slotname), + errhint("The slot exceeds the limit by max_slot_wal_keep_size."))); + XLogBeginRead(logical_decoding_ctx->reader, MyReplicationSlot->data.restart_lsn); -- 2.18.2
At Wed, 08 Apr 2020 14:19:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I saw another issue: the following sequence on the primary freezes
> when invalidation happens.
> 
> =# create table tt(); drop table tt; select pg_switch_wal(); create table tt(); drop table tt; select pg_switch_wal(); create table tt(); drop table tt; select pg_switch_wal(); checkpoint;
> 
> The last checkpoint command is waiting on the CV
> CheckpointerShmem->start_cv in RequestCheckpoint(), while the checkpointer
> is waiting for the next latch event at the end of
> CheckpointerMain. new_started doesn't advance; it stays at the same value
> as old_started.
> 
> That freeze didn't happen when I removed
> ConditionVariableSleep(&s->active_cv) in
> InvalidateObsoleteReplicationSlots.
> 
> I'll continue investigating it.

I understand how it happens.

The latch set by the checkpoint request from the CHECKPOINT command has been absorbed by ConditionVariableSleep() in InvalidateObsoleteReplicationSlots. The attached allows the checkpointer to use MyLatch for purposes other than checkpoint requests while a checkpoint is running.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 6b877f11f557fc76f206e7a71ff7890952bf63d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 8 Apr 2020 16:35:25 +0900
Subject: [PATCH] Allow MyLatch of checkpointer to be used for other purposes

The checkpointer's MyLatch was used only to request a checkpoint.
The checkpointer can miss a request if the latch is used for other
purposes during a checkpoint.  Allow MyLatch to be used for other
purposes, such as condition variables, by recording pending
checkpoint requests.
---
 src/backend/postmaster/checkpointer.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..86c355f035 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -160,6 +160,12 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+/*
+ * Record checkpoint requests.  Since MyLatch is used other than
+ * CheckpointerMain, we need to record pending checkpoint request here.
+ */
+static bool CheckpointRequestPending = false;
+
 /* Prototypes for private functions */
 
 static void HandleCheckpointerInterrupts(void);
@@ -335,6 +341,7 @@ CheckpointerMain(void)
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
+		CheckpointRequestPending = false;
 
 		/*
 		 * Process any requests or signals received recently.
@@ -494,6 +501,10 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
+		/* Don't sleep if pending request exists */
+		if (CheckpointRequestPending)
+			continue;
+
 		/*
 		 * Sleep until we are signaled or it's time for another checkpoint or
 		 * xlog file switch.
@@ -817,6 +828,7 @@ ReqCheckpointHandler(SIGNAL_ARGS)
 	 */
 	SetLatch(MyLatch);
 
+	CheckpointRequestPending = true;
 	errno = save_errno;
 }
 
-- 
2.18.2
At Wed, 08 Apr 2020 16:46:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 08 Apr 2020 14:19:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> The latch set by the checkpoint request from the CHECKPOINT command has
> been absorbed by ConditionVariableSleep() in
> InvalidateObsoleteReplicationSlots. The attached allows the checkpointer
> to use MyLatch for purposes other than checkpoint requests while a
> checkpoint is running.

A checkpoint request that arrives while waiting on the CV causes a spurious wakeup, but that does no harm.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
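To make the failure mode easier to follow, here is a condensed sketch of the main loop of CheckpointerMain() with everything unrelated elided. This is an illustration assembled from the descriptions above, not a quote of the source; the local names do_checkpoint, flags and cur_timeout are the ones used in checkpointer.c.

    for (;;)
    {
        /* (1) clear any pending wakeup before looking at the request flags */
        ResetLatch(MyLatch);

        if (do_checkpoint)
        {
            /*
             * (2) CreateCheckPoint() ends up in
             * InvalidateObsoleteReplicationSlots(), which waits on the
             * slot's active_cv.  ConditionVariableSleep() waits on and
             * resets MyLatch internally, so a SetLatch() coming from
             * ReqCheckpointHandler() in this window is consumed here; only
             * CheckpointerShmem->ckpt_flags still records the new request.
             */
            CreateCheckPoint(flags);
        }

        /*
         * (3) Nothing rechecks ckpt_flags before sleeping, so the wakeup
         * consumed at (2) is lost and the checkpointer sleeps for the full
         * timeout, while the backend that issued CHECKPOINT keeps waiting
         * on CheckpointerShmem->start_cv in RequestCheckpoint().
         */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         cur_timeout * 1000L,
                         WAIT_EVENT_CHECKPOINTER_MAIN);
    }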
At Wed, 08 Apr 2020 14:19:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in me> Just avoiding starting replication when restart_lsn is invalid is me> sufficient (the attached, which is equivalent to a part of what the me> invalidated flag did). I thing that the error message needs a Hint but me> it looks on the subscriber side as: At Wed, 08 Apr 2020 17:02:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in me> > At Wed, 08 Apr 2020 14:19:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in me> > The latch triggered by checkpoint request by CHECKPOINT command has me> > been absorbed by ConditionVariableSleep() in me> > InvalidateObsoleteReplicationSlots. The attached allows checkpointer me> > use MyLatch for other than checkpoint request while a checkpoint is me> > running. me> me> Checkpoint requests happens during waiting for the CV causes spurious me> wake up but that doesn't harm. I added the two above to open items[1] so as not to be forgotten. [1] https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2020-Apr-08, Kyotaro Horiguchi wrote: > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > Just avoiding starting replication when restart_lsn is invalid is > sufficient (the attached, which is equivalent to a part of what the > invalidated flag did). I thing that the error message needs a Hint but > it looks on the subscriber side as: > > [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated > HINT: The slot exceeds the limit by max_slot_wal_keep_size. > > I don't think it is not clean.. Perhaps the subscriber should remove > the trailing line of the message from the publisher? Thanks for the fix! I propose two changes: 1. reword the error like this: ERROR: replication slot "regression_slot3" cannot be advanced DETAIL: This slot has never previously reserved WAL, or has been invalidated 2. use the same error in one other place, to wit pg_logical_slot_get_changes() and pg_replication_slot_advance(). I made the DETAIL part the same in all places, but the ERROR line is adjusted to what each callsite is doing. I do think that this change in test_decoding is a bit unpleasant: -ERROR: cannot use physical replication slot for logical decoding +ERROR: cannot get changes from replication slot "repl" The test is -- check that we're detecting a streaming rep slot used for logical decoding SELECT 'init' FROM pg_create_physical_replication_slot('repl'); SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); > > On the other hand, physical replication doesn't break by invlidation. > > [...] > > If we don't mind that standby can reconnect after a walsender > termination due to the invalidation, we don't need to do something for > this. Restricting max_slot_wal_keep_size to be larger than a certain > threshold would reduce the chance we see that behavior. Yeah, I think you're referring to the fact that StartReplication() doesn't verify the restart_lsn of the slot; and if we do add a check, a few tests that rely on physical replication start to fail. This patch only adds a comment in that spot. But I don't (yet) know what the consequences of this are, or whether it can be fixed by setting a valid restart_lsn ahead of time. This test in pg_basebackup fails, for example: # Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl-X stream -S slot1 pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1" DETAIL: This slot has never previously reserved WAL, or has been invalidated pg_basebackup: error: child process exited with exit code 1 pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl" not ok 95 - pg_basebackup -X stream with replication slot runs # Failed test 'pg_basebackup -X stream with replication slot runs' # at t/010_pg_basebackup.pl line 461. Anyway I think the current patch can be applied as is -- and if we want physical replication to have some other behavior, we can patch for that afterwards. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2020-Apr-08, Kyotaro Horiguchi wrote:

> I understand how it happens.
> 
> The latch set by the checkpoint request from the CHECKPOINT command has
> been absorbed by ConditionVariableSleep() in
> InvalidateObsoleteReplicationSlots. The attached allows the checkpointer
> to use MyLatch for purposes other than checkpoint requests while a
> checkpoint is running.

Hmm, that explanation makes sense, but I couldn't reproduce it with the steps you provided. Perhaps I'm missing something.

Anyway I think this patch should fix it also -- instead of adding a new flag, we just rely on the existing flags (since do_checkpoint must have been set correctly from the flags earlier in that block.)

I think it'd be worth verifying this bugfix in a new test. Would you have time to produce that? I could try in a couple of days ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
At Mon, 27 Apr 2020 18:33:42 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Apr-08, Kyotaro Horiguchi wrote:
> 
> > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > 
> > Just avoiding starting replication when restart_lsn is invalid is
> > sufficient (see the attached, which is equivalent to a part of what the
> > invalidated flag did). I think the error message needs a hint, but on
> > the subscriber side it then looks like this:
> > 
> > [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
> > HINT: The slot exceeds the limit by max_slot_wal_keep_size.
> > 
> > I don't think that is clean. Perhaps the subscriber should remove
> > the trailing line of the message from the publisher?
> 
> Thanks for the fix! I propose two changes:
> 
> 1. reword the error like this:
> 
> ERROR: replication slot "regression_slot3" cannot be advanced
> DETAIL: This slot has never previously reserved WAL, or has been invalidated

Agreed on describing what failed rather than the cause. However, logical replication slots are always "previously reserved" at creation.

> 2. use the same error in one other place, to wit
> pg_logical_slot_get_changes() and pg_replication_slot_advance(). I
> made the DETAIL part the same in all places, but the ERROR line is
> adjusted to what each callsite is doing.
> I do think that this change in test_decoding is a bit unpleasant:
> 
> -ERROR: cannot use physical replication slot for logical decoding
> +ERROR: cannot get changes from replication slot "repl"
> 
> The test is
> -- check that we're detecting a streaming rep slot used for logical decoding
> SELECT 'init' FROM pg_create_physical_replication_slot('repl');
> SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

The message could be understood as "No change has been made since restart_lsn". Does something like the following work?

ERROR: replication slot "repl" is not usable to get changes

By the way, there are some other messages that describe the cause rather than the symptom:

"cannot use physical replication slot for logical decoding"
"replication slot \"%s\" was not created in this database"

Don't they need the same amendment?

> > > On the other hand, physical replication doesn't break on invalidation.
> > > [...]
> > 
> > If we don't mind that the standby can reconnect after a walsender
> > termination due to the invalidation, we don't need to do anything for
> > this. Restricting max_slot_wal_keep_size to be larger than a certain
> > threshold would reduce the chance of seeing that behavior.
> 
> Yeah, I think you're referring to the fact that StartReplication()
> doesn't verify the restart_lsn of the slot; and if we do add a check, a
> few tests that rely on physical replication start to fail. This patch
> only adds a comment in that spot. But I don't (yet) know what the
> consequences of this are, or whether it can be fixed by setting a valid
> restart_lsn ahead of time.
> This test in pg_basebackup fails, for example:
> 
> # Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl -X stream -S slot1
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1"
> DETAIL: This slot has never previously reserved WAL, or has been invalidated
> pg_basebackup: error: child process exited with exit code 1
> pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
> not ok 95 - pg_basebackup -X stream with replication slot runs
> 
> # Failed test 'pg_basebackup -X stream with replication slot runs'
> # at t/010_pg_basebackup.pl line 461.
> 
> 
> Anyway I think the current patch can be applied as is -- and if we want
> physical replication to have some other behavior, we can patch for that
> afterwards.

Agreed here. The false-invalidation doesn't lead to any serious consequences.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 27 Apr 2020 19:40:07 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Apr-08, Kyotaro Horiguchi wrote:
> 
> > I understand how it happens.
> > 
> > The latch set by the checkpoint request from the CHECKPOINT command has
> > been absorbed by ConditionVariableSleep() in
> > InvalidateObsoleteReplicationSlots. The attached allows the checkpointer
> > to use MyLatch for purposes other than checkpoint requests while a
> > checkpoint is running.
> 
> Hmm, that explanation makes sense, but I couldn't reproduce it with the
> steps you provided. Perhaps I'm missing something.

Sorry for the incomplete reproducer. A checkpoint needs to be running simultaneously for the manual CHECKPOINT command to hang. The following is the complete sequence.

1. Build a primary database cluster with the following setup, then start it.

max_slot_wal_keep_size=0
max_wal_size=32MB
min_wal_size=32MB

2. Build a replica from the primary, creating a slot, then start it.

$ pg_basebackup -R -C -S s1 -D...

3. Run the following commands. Retry several times if the checkpoint succeeds without hanging.

=# create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

It is obviously timing-dependent, but it reproduces quite reliably for me.

> Anyway I think this patch should fix it also -- instead of adding a new
> flag, we just rely on the existing flags (since do_checkpoint must have
> been set correctly from the flags earlier in that block.)

Since the added (!do_checkpoint) check is reached with do_checkpoint=false at server start and at archive_timeout intervals, the patch makes the checkpointer run a busy loop at those times, and that loop lasts until a checkpoint is actually executed.

What we need to do here is not to forget the fact that the latch has been set, even if the latch itself gets reset before reaching WaitLatch.

> I think it'd be worth verifying this bugfix in a new test. Would you
> have time to produce that? I could try in a couple of days ...

The attached patch to 019_replslot_limit.pl runs the commands above automatically. It sometimes succeeds but fails in most cases, at least for me.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl index 32dce54522..c8ec4bb363 100644 --- a/src/test/recovery/t/019_replslot_limit.pl +++ b/src/test/recovery/t/019_replslot_limit.pl @@ -8,7 +8,7 @@ use TestLib; use PostgresNode; use File::Path qw(rmtree); -use Test::More tests => 13; +use Test::More tests => 14; use Time::HiRes qw(usleep); $ENV{PGDATABASE} = 'postgres'; @@ -181,6 +181,36 @@ ok($failed, 'check that replication has been broken'); $node_standby->stop; +my $node_master2 = get_new_node('master2'); +$node_master2->init(allows_streaming => 1); +$node_master2->append_conf('postgresql.conf', qq( +min_wal_size = 32MB +max_wal_size = 32MB +log_checkpoints = yes +)); +$node_master2->start; +$node_master2->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')"); +$backup_name = 'my_backup2'; +$node_master2->backup($backup_name); + +$node_master2->stop; +$node_master2->append_conf('postgresql.conf', qq( +max_slot_wal_keep_size = 0 +)); +$node_master2->start; + +$node_standby = get_new_node('standby_2'); +$node_standby->init_from_backup($node_master2, $backup_name, has_streaming => 1); +$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'"); +$node_standby->start; +my @result = + split('\n', $node_master2->safe_psql('postgres', " + CREATE TABLE tt(); DROP TABLE tt; + SELECT pg_switch_wal(); + CHECKPOINT; + SELECT 'finished';", timeout=>'5')); +is($result[1], 'finished', 'check if checkpoint command is not blocked'); + ##################################### # Advance WAL of $node by $n segments sub advance_wal
On 2020-Apr-28, Kyotaro Horiguchi wrote: > At Mon, 27 Apr 2020 18:33:42 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > On 2020-Apr-08, Kyotaro Horiguchi wrote: > > > > > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > Thanks for the fix! I propose two changes: > > > > 1. reword the error like this: > > > > ERROR: replication slot "regression_slot3" cannot be advanced > > DETAIL: This slot has never previously reserved WAL, or has been invalidated > > Agreed to describe what is failed rather than the cause. However, > logical replications slots are always "previously reserved" at > creation. Bah, of course. I was thinking in making the equivalent messages all identical in all callsites, but maybe they should be different when slots are logical. I'll go over them again. > > 2. use the same error in one other place, to wit > > pg_logical_slot_get_changes() and pg_replication_slot_advance(). I > > made the DETAIL part the same in all places, but the ERROR line is > > adjusted to what each callsite is doing. > > I do think that this change in test_decoding is a bit unpleasant: > > > > -ERROR: cannot use physical replication slot for logical decoding > > +ERROR: cannot get changes from replication slot "repl" > > > > The test is > > -- check that we're detecting a streaming rep slot used for logical decoding > > SELECT 'init' FROM pg_create_physical_replication_slot('repl'); > > SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); > > The message may be understood as "No change has been made since > restart_lsn". Does something like the following work? > > ERROR: replication slot "repl" is not usable to get changes That wording seems okay, but my specific point for this error message is that we were trying to use a physical slot to get logical changes; so the fact that the slot has been invalidated is secondary and we should complain about the *type* of slot rather than the restart_lsn. > By the way there are some other messages that doesn't render the > symptom but the cause. > > "cannot use physical replication slot for logical decoding" > "replication slot \"%s\" was not created in this database" > > Don't they need the same amendment? Maybe, but I don't want to start rewording every single message in uses of replication slots ... I prefer to only modify the ones related to the problem at hand. > > > > On the other hand, physical replication doesn't break by invlidation. > > > > [...] > > Anyway I think the current patch can be applied as is -- and if we want > > physical replication to have some other behavior, we can patch for that > > afterwards. > > Agreed here. The false-invalidation doesn't lead to any serious > consequences. But does it? What happens, for example, if we have a slot used to get a pg_basebackup, then time passes before starting to stream from it and is invalidated? I think this "works fine" (meaning that once we try to stream from the slot to replay at the restored base backup, we will raise an error immediately), but I haven't tried. The worst situation would be producing a corrupt replica. I don't think this is possible. The ideal behavior I think would be that pg_basebackup aborts immediately when the slot is invalidated, to avoid wasting more time producing a doomed backup. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-28, Kyotaro Horiguchi wrote:

> > Anyway I think this patch should fix it also -- instead of adding a new
> > flag, we just rely on the existing flags (since do_checkpoint must have
> > been set correctly from the flags earlier in that block.)
> 
> Since the added (!do_checkpoint) check is reached with
> do_checkpoint=false at server start and at archive_timeout intervals,
> the patch makes the checkpointer run a busy loop at those times, and that
> loop lasts until a checkpoint is actually executed.
> 
> What we need to do here is not to forget the fact that the latch has
> been set, even if the latch itself gets reset before reaching
> WaitLatch.

After a few more false starts :-) I think one easy thing we can do without the additional boolean flag is to call SetLatch there in the main loop if we see that ckpt_flags is nonzero.

(I had two issues with the boolean flag. One is that the comment in ReqCheckpointHandler needed an update to, essentially, say exactly the opposite of what it was saying; such a change was making me very uncomfortable. The other is that the place where the flag was reset in CheckpointerMain() was ... not really appropriate; or it could have been appropriate if the flag was called, say, "CheckpointerMainNoSleepOnce", because "RequestPending" was the wrong name to use: if the flag really meant "request pending", then we should reset it inside the "if do_checkpoint" block .. but as I understand it, that would cause the busy-loop behavior you described.)

> The attached patch to 019_replslot_limit.pl runs the commands above
> automatically. It sometimes succeeds but fails in most cases, at least
> for me.

With the additional SetLatch, the test passes reproducibly for me. Before the patch, it failed ten out of ten times I ran it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
I pushed this one. Some closing remarks:

On 2020-Apr-28, Alvaro Herrera wrote:
> On 2020-Apr-28, Kyotaro Horiguchi wrote:
> > Agreed on describing what failed rather than the cause. However,
> > logical replication slots are always "previously reserved" at
> > creation.
> 
> Bah, of course. I was thinking in making the equivalent messages all
> identical in all callsites, but maybe they should be different when
> slots are logical. I'll go over them again.

I changed the ones that can only be logical slots so that they no longer say "previously reserved WAL". The one in pg_replication_slot_advance still uses that wording, because I didn't think it was worth creating two separate error paths.

> > ERROR: replication slot "repl" is not usable to get changes
> 
> That wording seems okay, but my specific point for this error message is
> that we were trying to use a physical slot to get logical changes; so
> the fact that the slot has been invalidated is secondary and we should
> complain about the *type* of slot rather than the restart_lsn.

I moved the check for validity to after CreateDecodingContext, so the other errors are reported preferentially. I also chose a different wording:

	/*
	 * After the sanity checks in CreateDecodingContext, make sure the
	 * restart_lsn is valid. Avoid "cannot get changes" wording in this
	 * errmsg because that'd be confusingly ambiguous about no changes
	 * being available.
	 */
	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("can no longer get changes from replication slot \"%s\"",
						NameStr(*name)),
				 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));

I hope this is sufficiently clear, but if not, feel free to nudge me and we can discuss it further.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-28, Alvaro Herrera wrote:
> On 2020-Apr-28, Kyotaro Horiguchi wrote:
> 
> > > Anyway I think this patch should fix it also -- instead of adding a new
> > > flag, we just rely on the existing flags (since do_checkpoint must have
> > > been set correctly from the flags earlier in that block.)
> > 
> > Since the added (!do_checkpoint) check is reached with
> > do_checkpoint=false at server start and at archive_timeout intervals,
> > the patch makes the checkpointer run a busy loop at those times, and that
> > loop lasts until a checkpoint is actually executed.
> > 
> > What we need to do here is not to forget the fact that the latch has
> > been set, even if the latch itself gets reset before reaching
> > WaitLatch.
> 
> After a few more false starts :-) I think one easy thing we can do
> without the additional boolean flag is to call SetLatch there in the
> main loop if we see that ckpt_flags is nonzero.

I went back to "continue" instead of SetLatch, because it seems less wasteful, but I changed the previous "do_checkpoint" condition to rechecking ckpt_flags. We would not get into the busy loop in that case, because the condition is true when the next loop would take action and false otherwise. So I think this should fix the problem without causing any other issues. But if you do see problems with this, please let us know.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
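Roughly, the shape of the change being described is the following. This is a sketch reconstructed from the messages in this thread rather than a quote of the commit; the surrounding code of CheckpointerMain() is elided and only the part just before the sleep is shown.

    /* ... checkpoint work, CheckArchiveTimeout(), pgstat reporting ... */

    /*
     * If a new checkpoint request arrived while we were busy (for example
     * while the latch wakeup was consumed by ConditionVariableSleep()
     * during slot invalidation), loop around immediately instead of
     * sleeping.  The request flags live in shared memory, so they survive
     * even though the latch was already reset.
     */
    if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
        continue;

    /* Otherwise sleep until we are signaled or the next deadline. */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     cur_timeout * 1000L /* convert to ms */ ,
                     WAIT_EVENT_CHECKPOINTER_MAIN);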
Thank you for polishing and committing this.

At Tue, 28 Apr 2020 20:47:10 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> I pushed this one. Some closing remarks:
> 
> On 2020-Apr-28, Alvaro Herrera wrote:
> 
> > On 2020-Apr-28, Kyotaro Horiguchi wrote:
> > > Agreed on describing what failed rather than the cause. However,
> > > logical replication slots are always "previously reserved" at
> > > creation.
> > 
> > Bah, of course. I was thinking in making the equivalent messages all
> > identical in all callsites, but maybe they should be different when
> > slots are logical. I'll go over them again.
> 
> I changed the ones that can only be logical slots so that they no longer
> say "previously reserved WAL". The one in
> pg_replication_slot_advance still uses that wording, because I didn't
> think it was worth creating two separate error paths.

Agreed.

> > > ERROR: replication slot "repl" is not usable to get changes
> > 
> > That wording seems okay, but my specific point for this error message is
> > that we were trying to use a physical slot to get logical changes; so
> > the fact that the slot has been invalidated is secondary and we should
> > complain about the *type* of slot rather than the restart_lsn.
> 
> I moved the check for validity to after CreateDecodingContext, so the
> other errors are reported preferentially. I also chose a different wording:

Yes, that is what I had in mind. The function checks invariant properties of the slot, and then the following code checks a variable state of the same slot.

> 	/*
> 	 * After the sanity checks in CreateDecodingContext, make sure the
> 	 * restart_lsn is valid. Avoid "cannot get changes" wording in this
> 	 * errmsg because that'd be confusingly ambiguous about no changes
> 	 * being available.
> 	 */
> 	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> 		ereport(ERROR,
> 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> 				 errmsg("can no longer get changes from replication slot \"%s\"",
> 						NameStr(*name)),
> 				 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));
> 
> I hope this is sufficiently clear, but if not, feel free to nudge me and
> we can discuss it further.

It sounds somewhat odd that we can "no longer" get changes from a slot that has "never previously reserved" WAL. More than that, I think we don't reach that code for physical slots, since CreateDecodingContext doesn't accept a physical slot and ERRORs out. (That is the reason for the location of the check.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Apr-30, Kyotaro Horiguchi wrote: > At Tue, 28 Apr 2020 20:47:10 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > > /* > > * After the sanity checks in CreateDecodingContext, make sure the > > * restart_lsn is valid. Avoid "cannot get changes" wording in this > > * errmsg because that'd be confusingly ambiguous about no changes > > * being available. > > */ > > if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) > > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("can no longer get changes from replication slot \"%s\"", > > NameStr(*name)), > > errdetail("This slot has never previously reserved WAL, or has been invalidated."))); > > > > I hope this is sufficiently clear, but if not, feel free to nudge me and > > we can discuss it further. > > That somewhat sounds odd that 'we "no longer" get changes from "never > previously reserved" slots'. More than that, I think we don't reach > there for physical slots, since CreateDecodingContext doesn't accept a > physical slot and ERRORs out. (That is the reason for the location of > the checking.) Oh, right, so we could reword the errdetail to just "This slot has been invalidated." -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 29 Apr 2020 18:58:16 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in > On 2020-Apr-28, Alvaro Herrera wrote: > > > On 2020-Apr-28, Kyotaro Horiguchi wrote: > > > > > > Anyway I think this patch should fix it also -- instead of adding a new > > > > flag, we just rely on the existing flags (since do_checkpoint must have > > > > been set correctly from the flags earlier in that block.) > > > > > > Since the added (!do_checkpoint) check is reached with > > > do_checkpoint=false at server start and at archive_timeout intervals, > > > the patch makes checkpointer run a busy-loop at that timings, and that > > > loop lasts until a checkpoint is actually executed. > > > > > > What we need to do here is not forgetting the fact that the latch has > > > been set even if the latch itself gets reset before reaching to > > > WaitLatch. > > > > After a few more false starts :-) I think one easy thing we can do > > without the additional boolean flag is to call SetLatch there in the > > main loop if we see that ckpt_flags is nonzero. > > I went back to "continue" instead of SetLatch, because it seems less > wasteful, but I changed the previously "do_checkpoint" condition to > rechecking ckpt_flags. We would not get in the busy loop in that case, > because the condition is true when the next loop would take action and > false otherwise. So I think this should fix the problem without causing > any other issues. But if you do see problems with this, please let us > know. Checking ckpt_flags then continue makes sense to me. Thanks for committing. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi,

On 2020-04-29 18:58:16 -0400, Alvaro Herrera wrote:
> On 2020-Apr-28, Alvaro Herrera wrote:
> 
> > On 2020-Apr-28, Kyotaro Horiguchi wrote:
> > 
> > > > Anyway I think this patch should fix it also -- instead of adding a new
> > > > flag, we just rely on the existing flags (since do_checkpoint must have
> > > > been set correctly from the flags earlier in that block.)
> > > 
> > > Since the added (!do_checkpoint) check is reached with
> > > do_checkpoint=false at server start and at archive_timeout intervals,
> > > the patch makes the checkpointer run a busy loop at those times, and that
> > > loop lasts until a checkpoint is actually executed.
> > > 
> > > What we need to do here is not to forget the fact that the latch has
> > > been set, even if the latch itself gets reset before reaching
> > > WaitLatch.
> > 
> > After a few more false starts :-) I think one easy thing we can do
> > without the additional boolean flag is to call SetLatch there in the
> > main loop if we see that ckpt_flags is nonzero.
> 
> I went back to "continue" instead of SetLatch, because it seems less
> wasteful, but I changed the previous "do_checkpoint" condition to
> rechecking ckpt_flags. We would not get into the busy loop in that case,
> because the condition is true when the next loop would take action and
> false otherwise. So I think this should fix the problem without causing
> any other issues. But if you do see problems with this, please let us
> know.

I don't think this is quite sufficient:

Independent of this patch, I added a few additional paths in which the checkpointer's latch is reset, and I found a few shutdowns in regression tests to be extremely slow / timing out. The reason for that is that the only check for interrupts is at the top of the loop. So if the checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we decide to do a checkpoint for other reasons.

I also suspect that it could have harmful consequences to not do an AbsorbSyncRequests() if something "ate" the set latch.

I don't think it's reasonable to expect this much code between a ResetLatch and WaitLatch to never reset a latch. So I think we need to make the coding more robust in the face of that, without having to duplicate the top and the bottom of the loop.

One way to do that would be to move the WaitLatch() call much earlier, and only do a WaitLatch() if do_checkpoint is false. Roughly like in the attached.

Greetings,

Andres Freund
Attachment
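The attachment is not reproduced here, but going by the description above, the restructured loop would look roughly like this. This is a sketch only, under the assumption that the patch reorders the existing code in CheckpointerMain() without changing the helpers it calls; the variable names follow checkpointer.c.

    for (;;)
    {
        /* clear any pending wakeup, then handle requests and interrupts */
        ResetLatch(MyLatch);
        AbsorbSyncRequests();
        HandleCheckpointerInterrupts();   /* notices ShutdownRequestPending promptly */

        /* decide whether a checkpoint or restartpoint is due (flags, timeout);
         * this sets do_checkpoint, flags and cur_timeout */

        if (!do_checkpoint)
        {
            /* nothing to do: sleep until signaled or the next deadline */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             cur_timeout * 1000L,
                             WAIT_EVENT_CHECKPOINTER_MAIN);
            continue;
        }

        /* a checkpoint is due: perform it, then loop around without
         * sleeping, so interrupts and sync requests are processed again
         * right away */
    }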
On 2020-May-16, Andres Freund wrote:

> Independent of this patch, I added a few additional paths in which the
> checkpointer's latch is reset, and I found a few shutdowns in regression
> tests to be extremely slow / timing out. The reason for that is that
> the only check for interrupts is at the top of the loop. So if the
> checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we
> decide to do a checkpoint for other reasons.

Ah, yeah, this seems a genuine bug.

> I also suspect that it could have harmful consequences to not do an
> AbsorbSyncRequests() if something "ate" the set latch.

I traced through this when looking over the previous fix, and given that checkpoint execution itself calls AbsorbSyncRequests frequently, I don't think this one qualifies as a bug.

> I don't think it's reasonable to expect this much code between a
> ResetLatch and WaitLatch to never reset a latch. So I think we need to
> make the coding more robust in the face of that, without having to duplicate
> the top and the bottom of the loop.

That makes sense to me.

> One way to do that would be to move the WaitLatch() call much earlier, and
> only do a WaitLatch() if do_checkpoint is false. Roughly like in the
> attached.

Hm. I'd do "WaitLatch() / continue" in the "!do_checkpoint" block, and put the checkpoint code not in the else block; seems easier to read to me.

While we're here, can we change CreateCheckPoint to return true so that we can do

ckpt_performed = do_restartpoint ? CreateRestartPoint(flags) : CreateCheckPoint(flags);

instead of the mess we have there now? (Also add a comment that CreateCheckPoint must not return false, to avoid messing with the schedule.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-05-16 22:51:50 -0400, Alvaro Herrera wrote:
> On 2020-May-16, Andres Freund wrote:
> 
> > Independent of this patch, I added a few additional paths in which the
> > checkpointer's latch is reset, and I found a few shutdowns in regression
> > tests to be extremely slow / timing out. The reason for that is that
> > the only check for interrupts is at the top of the loop. So if the
> > checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we
> > decide to do a checkpoint for other reasons.
> 
> Ah, yeah, this seems a genuine bug.
> 
> > I also suspect that it could have harmful consequences to not do an
> > AbsorbSyncRequests() if something "ate" the set latch.
> 
> I traced through this when looking over the previous fix, and given that
> checkpoint execution itself calls AbsorbSyncRequests frequently, I
> don't think this one qualifies as a bug.

There's no AbsorbSyncRequests() after CheckPointBuffers(), I think. And e.g. CheckPointTwoPhase() could take a while. Which then would mean that we'd potentially not AbsorbSyncRequests() until checkpoint_timeout causes us to wake up. Am I missing something?

> > One way to do that would be to move the WaitLatch() call much earlier, and
> > only do a WaitLatch() if do_checkpoint is false. Roughly like in the
> > attached.
> 
> Hm. I'd do "WaitLatch() / continue" in the "!do_checkpoint" block, and
> put the checkpoint code not in the else block; seems easier to read to
> me.

Yea, that'd probably be better. I was also pondering if we shouldn't just move the checkpoint code into, gasp, its own function ;)

Greetings,

Andres Freund
On 2020-May-16, Andres Freund wrote: > Hi, > > On 2020-05-16 22:51:50 -0400, Alvaro Herrera wrote: > > On 2020-May-16, Andres Freund wrote: > > > > > I, independent of this patch, added a few additional paths in which > > > checkpointer's latch is reset, and I found a few shutdowns in regression > > > tests to be extremely slow / timing out. The reason for that is that > > > the only check for interrupts is at the top of the loop. So if > > > checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we > > > decide to do a checkpoint for other reasons. > > > > Ah, yeah, this seems a genuine bug. > > > > > I also suspect that it could have harmful consequences to not do a > > > AbsorbSyncRequests() if something "ate" the set latch. > > > > I traced through this when looking over the previous fix, and given that > > checkpoint execution itself calls AbsorbSyncRequests frequently, I > > don't think this one qualifies as a bug. > > There's no AbsorbSyncRequests() after CheckPointBuffers(), I think. And > e.g. CheckPointTwoPhase() could take a while. Which then would mean that > we'd potentially not AbsorbSyncRequests() until checkpoint_timeout > causes us to wake up. Am I missing something? True. There's no delay like CheckpointWriteDelay in that code though, so the "a while" is much smaller. My understanding of these sync requests is that they're not for immediate processing anyway -- I mean it's okay for checkpointer to take a bit of time before syncing ... or am I mistaken? (If another sync request is queued and the queue hasn't been emptied, that would set the latch again, so it's not like this could fill the queue arbitrarily.) > > > One way to do that would be to WaitLatch() call to much earlier, and > > > only do a WaitLatch() if do_checkpoint is false. Roughly like in the > > > attached. > > > > Hm. I'd do "WaitLatch() / continue" in the "!do_checkpoint" block, and > > put the checpkoint code not in the else block; seems easier to read to > > me. > > Yea, that'd probably be better. I was also pondering if we shouldn't > just move the checkpoint code into, gasp, it's own function ;) That might work :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
BTW while you're messing with checkpointer, I propose this patch to simplify things. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
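The attached simplification is not shown inline; from the snippet in the earlier message, it presumably boils down to something like the following sketch. This is an assumption about the patch, since it also requires changing CreateCheckPoint() to return bool, which it currently does not.

    /*
     * Sketch only: if CreateCheckPoint() returned true unconditionally,
     * mirroring CreateRestartPoint(), the call site in CheckpointerMain()
     * would collapse to a single expression.
     */
    ckpt_performed = do_restartpoint ? CreateRestartPoint(flags)
                                     : CreateCheckPoint(flags);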
On Mon, May 18, 2020 at 07:44:59PM -0400, Alvaro Herrera wrote:
> BTW while you're messing with checkpointer, I propose this patch to
> simplify things.

It seems to me that this would have a benefit if we begin to have a code path in CreateCheckPoint() where it makes sense to let the checkpointer know that no checkpoint has happened; right now we assume that a skipped checkpoint is a performed one. As that's not the case now, I would vote for keeping the code as-is.

-- 
Michael
Attachment
On 2020-May-19, Michael Paquier wrote:

> On Mon, May 18, 2020 at 07:44:59PM -0400, Alvaro Herrera wrote:
> > BTW while you're messing with checkpointer, I propose this patch to
> > simplify things.
> 
> It seems to me that this would have a benefit if we begin to have a
> code path in CreateCheckPoint() where it makes sense to let the
> checkpointer know that no checkpoint has happened; right now we assume
> that a skipped checkpoint is a performed one.

Well, my first attempt at this was returning false in that case, until I realized that it would break the scheduling algorithm.

> As that's not the case now, I would vote for keeping the code as-is.

The presented patch doesn't have any functional impact; it just writes the same code in a more concise way. Like you, I wouldn't change this if we didn't have a reason to rewrite this section of code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Minor language tweak: diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7050ce6e2e..08142d64cb 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3800,8 +3800,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows slots</link> are allowed to retain in the <filename>pg_wal</filename> directory at checkpoint time. If <varname>max_slot_wal_keep_size</varname> is -1 (the default), replication slots {+may+} retain {+an+} unlimited amount of WAL files. [-If-]{+Otherwise, if+} restart_lsn of a replication slot [-gets-]{+falls+} behind {+by+} more than [-that megabytes-]{+the given size+} from the current LSN, the standby using the slot may no longer be able to continue replication due to removal of required WAL files. You can see the WAL availability of replication slots