Re: Review for GetWALAvailability() - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Review for GetWALAvailability()
Date
Msg-id 20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
In response to Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers
At Mon, 15 Jun 2020 18:59:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > It was kind of hard to decide. Even when max_slot_wal_keep_size is
> > smaller than max_wal_size, segments beyond
> > max_slot_wal_keep_size are not guaranteed to be kept.  In that case
> > the state transitions NORMAL->LOST, skipping the "RESERVED" state.
> > Putting aside whether the setting is useful or not, I thought that the
> > state transition is somewhat abrupt.
> 
> IMO the direct transition of the state from normal to lost is ok to me
> if each state is clearly defined.
> 
> >> Or, if that condition is really necessary, the document should be
> >> updated so that the note about the condition is added.
> > Does the following make sense?
> > https://www.postgresql.org/docs/13/view-pg-replication-slots.html
> > normal means that the claimed files are within max_wal_size.
> > + If max_slot_wal_keep_size is smaller than max_wal_size, this state
> > + will not appear.
> 
> I don't think this change is enough. For example, when
> max_slot_wal_keep_size is smaller than max_wal_size and the amount of
> WAL files claimed by the slot is smaller than max_slot_wal_keep_size,
> "reserved" is reported. But that is inconsistent with the meaning of
> "reserved" in the docs.

You're right.

> To consider what should be reported in wal_status, could you tell me
> for what purpose and how users are expected to use this information?

I saw "reserved" as the state in which slots are working to
retain segments, and "normal" as the state indicating that "WAL
segments are within max_wal_size", which is orthogonal to the notion
of "reserved".  So it seems useless to me when the retained WAL
segments cannot exceed max_wal_size.

With longer descriptions, the states would be:

"reserved under max_wal_size"
"reserved over max_wal_size"
"lost some segments"

Come to think of it, I realized that my trouble was just the
wording.  Do the following wordings make sense to you?

"reserved"  - retained within max_wal_size
"extended"  - retained over max_wal_size
"lost"      - lost some segments

With these wordings I can live with "not extended"=>"lost". Of course,
more appropriate wording is welcome.
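
To make the three proposed wordings concrete, here is a minimal sketch of
how they could be told apart. The enum, the function name and its
parameters are hypothetical and only illustrate the definitions above;
this is not the code in GetWALAvailability().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status values mirroring the wordings proposed above. */
typedef enum SketchWalStatus
{
    SKETCH_RESERVED,    /* retained within max_wal_size */
    SKETCH_EXTENDED,    /* retained over max_wal_size */
    SKETCH_LOST         /* lost some segments */
} SketchWalStatus;

/*
 * retained_bytes: amount of WAL the slot currently holds back (assumed
 *                 to be computed by the caller).
 * max_wal_size_bytes: max_wal_size expressed in bytes.
 * segment_removed: true if a segment the slot still needs is gone.
 */
static SketchWalStatus
sketch_wal_status(uint64_t retained_bytes,
                  uint64_t max_wal_size_bytes,
                  bool segment_removed)
{
    if (segment_removed)
        return SKETCH_LOST;
    if (retained_bytes <= max_wal_size_bytes)
        return SKETCH_RESERVED;
    return SKETCH_EXTENDED;
}
```

Under these definitions, when max_slot_wal_keep_size is smaller than
max_wal_size the retained amount cannot exceed max_wal_size, so
"extended" is never returned and the slot goes from "reserved" straight
to "lost".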

> Even if walsender is terminated during the state "lost", unless
> checkpointer removes the required WAL files, the state can go back to
> "reserved" after a new replication connection is established. Is this
> the same as what you explained above?

GetWALAvailability checks restart_lsn against lastRemovedSegNo, thus
"lost" cannot be seen unless the checkpointer has actually removed
the segment at restart_lsn (and restart_lsn has not been invalidated).
However, walsenders are killed before those segments are actually
removed, so there are cases where a physical walreceiver reconnects
before RemoveOldXlogFiles removes all the segments, which are then
removed after the reconnection. "lost" can go back to "reserved" in that
case. (A physical walreceiver can connect to a slot with an invalid
restart_lsn.)
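
The check described above can be sketched roughly as follows. This is
only an illustration under simplifying assumptions: the function name and
parameters are hypothetical, LSNs are treated as plain byte offsets, and
an invalid restart_lsn is represented by 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 0 stands for an invalid (unset) LSN. */
#define SKETCH_INVALID_LSN ((uint64_t) 0)

/*
 * Returns true only when the segment containing restart_lsn has already
 * been removed, i.e. its segment number is not greater than the last
 * removed segment number.  An invalidated restart_lsn never reports
 * "lost" here, which matches the observation in the text.
 */
static bool
sketch_segment_lost(uint64_t restart_lsn,
                    uint64_t last_removed_segno,
                    uint64_t wal_segment_size)
{
    uint64_t restart_segno;

    if (restart_lsn == SKETCH_INVALID_LSN)
        return false;

    restart_segno = restart_lsn / wal_segment_size;
    return restart_segno <= last_removed_segno;
}
```

Because the result depends only on what has actually been removed, a slot
whose walsender was killed still reports a non-lost state until the
removal happens.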

I noticed another issue. If some required WAL files are removed, the
slot will be "invalidated", that is, restart_lsn is set to an invalid
value. As a result, we hardly ever see the "lost" state.

It can be "fixed" by remembering the validity of a slot separately
from restart_lsn. Is that worth doing?
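
One possible shape of such a "fix", purely as a sketch: the struct, field
and function names below are hypothetical and do not correspond to the
actual ReplicationSlot layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status values for this sketch only. */
typedef enum SketchSlotStatus
{
    SKETCH_NO_STATUS,   /* slot has never reserved WAL */
    SKETCH_RESERVED,
    SKETCH_LOST
} SketchSlotStatus;

/* Hypothetical slot fields: validity is tracked apart from restart_lsn. */
typedef struct SketchSlot
{
    uint64_t restart_lsn;   /* left untouched on invalidation */
    bool     wal_removed;   /* set once required WAL has been removed */
} SketchSlot;

/* Mark the slot as broken without overwriting restart_lsn. */
static void
sketch_invalidate(SketchSlot *slot)
{
    slot->wal_removed = true;
}

/* With a separate flag, "lost" stays visible after invalidation. */
static SketchSlotStatus
sketch_slot_status(const SketchSlot *slot)
{
    if (slot->wal_removed)
        return SKETCH_LOST;
    if (slot->restart_lsn == 0)
        return SKETCH_NO_STATUS;
    return SKETCH_RESERVED;
}
```

The trade-off is an extra field per slot that has to be kept consistent
with restart_lsn.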

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


