From: Jehan-Guillaume de Rorthais
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2019-06-27
Msg-id: 20190627162256.4f4872b8@firost
In response to: Re: [HACKERS] Restricting maximum keep segments by repslots
  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List: pgsql-hackers
Hi all,

Being interested in this feature, I did a review of the patch.

This feature adds the GUC "max_slot_wal_keep_size". This is the maximum amount
of WAL that can be kept in "pg_wal" by active slots.

If the amount of WAL exceeds this limit, the slot is deactivated and its
status (a new field in "pg_replication_slots") is set to "lost".
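
As a quick illustration, enabling the feature and watching a slot could look
like this (a hypothetical session; the value and the assumption that the
setting is reloadable without a restart are mine):

  -- enable the limit (assuming the GUC can be changed with a reload)
  ALTER SYSTEM SET max_slot_wal_keep_size = '4MB';
  SELECT pg_reload_conf();

  -- "wal_status" and "remain" are the columns added by this patch
  SELECT slot_name, active, wal_status, pg_size_pretty(remain) AS remain
  FROM pg_replication_slots;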


Patching
========

The patch v13-0003 does not apply on HEAD anymore.

The patch v13-0005 applies using "git am --ignore-space-change".

The other patches apply correctly.

Please, find attached the v14 set of patches rebased on master.


Documentation
=============

The documentation explains the GUC and the related columns in
"pg_replication_slots".

It correctly reflects the current behavior of the patch.


Usability
=========

The patch implements what it describes. It is easy to enable and disable. The
GUC name describes its purpose correctly.

This feature is useful in some HA scenarios where slots are required (e.g. when
archiving is not possible), but where primary availability is more important
than the standbys.

In "pg_replication_slots" view, the new "wal_status" field is misleading.
Consider this sentence and the related behavior from documentation
(catalogs.sgml):

  <literal>keeping</literal> means that some of them are to be removed by the
  next checkpoint.

"keeping" appears when the current checkpoint will delete some WAL further than
"current_lsn - max_slot_wal_keep_size", but still required by at least one slot.
As some WAL required by some slots will be deleted quite soon, probably before
anyone can react, "keeping" status is misleading here. We are already in the
red zone.

I would expect this "wal_status" to be:

- streaming: slot lag between 0 and "max_wal_size"
- keeping: slot lag between "max_wal_size" and "max_slot_wal_keep_size"; the
  slot actually protects some WAL from being deleted
- lost: slot lag greater than "max_slot_wal_keep_size"; the slot couldn't
  protect some WAL from deletion

Documentation follows with:

  The last two states are seen only when max_slot_wal_keep_size is
  non-negative

This is true with the current behavior. However, if "keeping" were set as soon
as the slot lag exceeds "max_wal_size", this status could be useful even with
"max_slot_wal_keep_size = -1". As soon as a slot accumulates WAL that would
otherwise have been removed by a previous checkpoint, it "keeps" it. A sketch
of this classification follows.
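
Here is a rough SQL sketch of the states I would expect (hypothetical:
pg_size_bytes() on the GUC values stands in for whatever the patch computes
internally, and the "max_slot_wal_keep_size = -1" case is ignored):

  SELECT slot_name,
         CASE
           WHEN lag IS NULL
             THEN NULL
           WHEN lag <= pg_size_bytes(current_setting('max_wal_size'))
             THEN 'streaming'
           WHEN lag <= pg_size_bytes(current_setting('max_slot_wal_keep_size'))
             THEN 'keeping'
           ELSE 'lost'
         END AS expected_wal_status
  FROM (SELECT slot_name,
               pg_current_wal_lsn() - restart_lsn AS lag
          FROM pg_replication_slots) s;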


Feature tests
=============

I have played with various traffic shaping setups between nodes to observe how
the "active", "wal_status" and "remain" columns behave in regard to each other,
using:

  while true; do
    sleep 0.3
    psql -p 5432 -AXtc "
      select now(), active, restart_lsn, wal_status, pg_size_pretty(remain)
      from pg_replication_slots
      where slot_name='slot_limit_st'"
  done

The primary is created using:

  initdb -U postgres -D slot_limit_pr --wal-segsize=1

  cat<<EOF >>slot_limit_pr/postgresql.conf
  port=5432
  max_wal_size = 3MB
  min_wal_size = 2MB
  max_slot_wal_keep_size = 4MB
  logging_collector = on
  synchronous_commit = off
  EOF
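
(With --wal-segsize=1, each WAL segment is 1MB, so "max_slot_wal_keep_size =
4MB" lets a slot retain only about four extra segments before it is at risk.)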

WAL activity is generated using a simple pgbench workload. Then, during this
activity, packets on the loopback interface are delayed using:

  tc qdisc add dev lo root handle 1:0 netem delay 140msec
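
For reference, a workload along the lines of "pgbench -i -p 5432 postgres"
followed by "pgbench -T 600 -p 5432 postgres" (options here are illustrative)
is enough to keep WAL flowing, and the delay can be removed afterward with:

  tc qdisc del dev lo root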

Here is how "wal_status" behaves. I removed the timestamps, but the record
order is the original one:

     t|1/7B116898|streaming|1872 kB
     t|1/7B1A0000|lost|0 bytes
     t|1/7B320000|keeping|0 bytes
     t|1/7B780000|lost|0 bytes
     t|1/7BB00000|keeping|0 bytes
     t|1/7BE00000|keeping|0 bytes
     t|1/7C100000|lost|0 bytes
     t|1/7C400000|keeping|0 bytes
     t|1/7C700000|lost|0 bytes
     t|1/7CA40000|keeping|0 bytes
     t|1/7CDE0000|lost|0 bytes
     t|1/7D100000|keeping|0 bytes
     t|1/7D400000|keeping|0 bytes
     t|1/7D7C0000|keeping|0 bytes
     t|1/7DB40000|keeping|0 bytes
     t|1/7DE60000|lost|0 bytes
     t|1/7E180000|keeping|0 bytes
     t|1/7E500000|keeping|0 bytes
     t|1/7E860000|lost|0 bytes
     t|1/7EB80000|keeping|0 bytes
     [...x15]
     t|1/80800000|keeping|0 bytes
     t|1/80900000|streaming|940 kB
     t|1/80A00000|streaming|1964 kB

When increasing the network delay to 145ms, the slot was lost for real. Note
that it was shown as "lost" but still active twice (during approx. 0.6s) before
being deactivated:

     t|1/85700000|streaming|2048 kB
     t|1/85800000|keeping|0 bytes
     t|1/85940000|lost|0 bytes
     t|1/85AC0000|lost|0 bytes
     f|1/85C40000|lost|0 bytes

Finally, at least once, the following message appeared in the primary logs
**before** "wal_status" changed from "keeping" to "streaming":

     WARNING:  some replication slots have lost required WAL segments

So the slot lost some WAL, but the standby was able to catch up anyway.
 
My humble opinion about these results:

* after many different tests, the "keeping" status appears only when "remain"
  equals 0. In the current implementation, "keeping" really adds no value...
* "remain" should be NULL if "max_slot_wal_keep_size = -1" or if the slot isn't
  active (see the sketch after this list)
* the "lost" status should be definitive
* it seems related, but maybe "wal_status" should be set to "lost" only once
  the slot has been deactivated?
* logs should warn about a failing slot as soon as it is effectively
  deactivated, not before.
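
The "remain" bullet could be expressed as something like this (a hypothetical
sketch reusing the columns added by the patch):

  SELECT slot_name,
         CASE
           WHEN current_setting('max_slot_wal_keep_size') = '-1'
                OR NOT active
             THEN NULL
           ELSE remain
         END AS remain
  FROM pg_replication_slots;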

