Hi,
On 2021-07-30 10:16:44 -0400, Robert Haas wrote:
> 2021-07-30 09:39:43.579 EDT [63702] LOG: redo starts at 0/14A2F48
> 2021-07-30 09:39:44.129 EDT [63702] LOG: redo done at 0/15F48230
> system usage: CPU: user: 0.25 s, system: 0.25 s, elapsed: 0.55 s
> 2021-07-30 09:39:44.129 EDT [63702] LOG: crash recovery complete:
> wrote 36517 buffers (222.9%); dirtied 52985 buffers; read 7 buffers
>
> Now I really think that information on the number of buffers touched
> and how long it took is way more useful than user and system time.
> Knowing how much user and system time were spent doesn't really tell
> you anything, but a count of buffers touched gives you some meaningful
> idea of how much work recovery did, and whether I/O was slow.
I don't agree with that? If (user+system) << wall then it is very likely
that recovery is IO bound. If system is a large percentage of wall, then
shared buffers is likely too small (or we're replacing the wrong
buffers) because you spend a lot of time copying data in/out of the
kernel page cache. If user is the majority, you're CPU bound.
Without user & system time it's much harder to figure that out - at
least for me.
> In your patch, there's no end-of-recovery checkpoint -- you just
> trigger a checkpoint instead of waiting for it. I think it's probably
> better to make those two cases work the same. The end-of-recovery
> record isn't needed to change the TLI as it is in the promotion case,
> but (1) it seems better to have fewer code paths and (2) it might be
> good for debuggability.
+1
In addition, the end-of-recovery record also good for e.g. hot standby,
logical decoding, etc, because it's a point where no write transactions
can be in progress...
Greetings,
Andres Freund