On Wed, Jul 17, 2019 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
> I got surprised by the following behavior from pg_stat_get_wal_senders
> when connecting for example pg_receivewal to a primary:
> =# select application_name, flush_lsn, replay_lsn, flush_lag,
> replay_lag from pg_stat_replication;
>  application_name | flush_lsn | replay_lsn |    flush_lag    |   replay_lag
> ------------------+-----------+------------+-----------------+-----------------
>  receivewal       | null      | null       | 00:09:13.578185 | 00:09:13.578185
> (1 row)
>
> It makes little sense to me, as we are reporting a replay lag on a
> position which has never been reported yet, so it cannot actually be
> used as a comparison base for the lag. Am I missing something or
> should we return NULL for those fields if we have no write, flush or
> apply LSNs like in the attached?

Hmm. It's working as designed, but indeed it's not very newsworthy
information in this case. If you run pg_receivewal --synchronous then
you get sensible looking flush_lag times. Without that, flush_lag
only goes up, and of course replay_lag only goes up, so although it's
telling the truth, I think your proposal makes sense.

One question I had is what would happen with your patch without
--synchronous, once it flushes a whole file and opens a new one; I
wondered if your new boring-information-hiding behaviour would stop
working after one segment file because of that. I tested that and the
column remains NULL when we move to a new file, so that's good.
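
To make the rule under discussion concrete, here is a minimal C sketch
of it. This is not the attached patch and not the walsender code:
lsn_t, INVALID_LSN, lag_column and report_lag() are hypothetical names
made up for illustration, and the sketch assumes that "never reported"
is represented by a zero LSN and "no lag sample" by a negative value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration, not PostgreSQL definitions. */
typedef uint64_t lsn_t;
#define INVALID_LSN ((lsn_t) 0)   /* assumption: 0 means "never reported" */

/* One lag column as it would be shown to the user. */
typedef struct
{
    bool    is_null;   /* true: show NULL instead of a lag */
    int64_t lag_us;    /* lag in microseconds, valid only if !is_null */
} lag_column;

/*
 * Show a lag only when the client has reported the matching position
 * and a lag sample exists (assumption: negative means "no sample").
 */
static lag_column
report_lag(lsn_t reported_lsn, int64_t lag_us)
{
    lag_column col;

    col.is_null = (reported_lsn == INVALID_LSN) || (lag_us < 0);
    col.lag_us = col.is_null ? 0 : lag_us;
    return col;
}
```

Under this rule the pg_receivewal row quoted above would show NULL for
flush_lag and replay_lag, while a client that does report a flush
position would keep its lag value.
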
One thing I noticed in passing is that in --synchronous mode you always
get the same times in the write_lag and flush_lag columns, and the
times update infrequently. That's not the case with regular
replicas; I suspect there is a difference in the timing and frequency of
replies sent to the server, which I guess might make synchronous
commit a bit "lumpier", but I didn't dig further today.
--
Thomas Munro
https://enterprisedb.com