On Wed, Oct 7, 2020 at 8:58 PM Michael Banck <michael.banck@credativ.de> wrote:
> we had a customer incident recently where they needed to do a PITR.
> Their data directory is on a NetApp NFS and they have several hundred
> databases in their instance. The startup sync (i.e. before the message
> "starting archive recovery" appears) took 20 minutes and during the

Nice data point.

> first try[1] they were wondering what's going on because there is just
> one log message ("database system was interrupted; last known up at
> ...") and the postmaster process is in state 'D'. Attaching strace
> revealed that it was syncing files and due to the NFS performance that
> took a long time.
No objection to adding a message, but see also this other thread,
about potential ways to get rid of that sync completely, or at least
the phase where you have to open all the files one by one:
https://www.postgresql.org/message-id/flat/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
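
For anyone wondering what that file-by-file phase amounts to, and why a progress message would help on slow storage, here is a rough Python sketch. It is not PostgreSQL's actual C code: it just walks a directory, opens and fsyncs each file in turn, and logs progress at an interval. The function name sync_tree, the report_every parameter, and the wording of the messages are all made up for illustration.

```python
import os
import time


def sync_tree(root, report_every=10.0, log=print):
    """Open and fsync every file under root, one by one (illustrative only).

    report_every is a hypothetical knob: the minimum number of seconds
    between progress messages. Returns the number of files synced.
    """
    synced = 0
    start = last_report = time.monotonic()
    log(f"syncing data directory {root}")

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            try:
                os.fsync(fd)  # this is the call that can be very slow on NFS
            except OSError:
                pass  # some platforms refuse fsync on a read-only descriptor
            finally:
                os.close(fd)
            synced += 1

            # Emit a progress line so the operator can see the work is ongoing.
            now = time.monotonic()
            if now - last_report >= report_every:
                log(f"still syncing: {synced} files so far, "
                    f"{now - start:.1f} s elapsed")
                last_report = now

    log(f"synced {synced} files in {time.monotonic() - start:.1f} s")
    return synced
```

With several hundred databases the inner loop runs once per relation file, so the total time is roughly the file count times the per-file open/fsync/close latency, which is why it hurts so much over NFS.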
Also, maybe of interest for PITR use cases, see this other thread
about relaxing the end-of-recovery checkpoint (well, the patch doesn't
do that yet, but it'd be a small step to not wait for it, based on a
GUC, once the checkpointer is running):
https://commitfest.postgresql.org/30/2706/