Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

From Alexander Kukushkin
Subject Re: Standby trying "restore_command" before local WAL
Date
Msg-id CAFh8B==eaXBUe6F6FC1kjZ6cgQLCPzhS8okLQg3mFsDBkkikLA@mail.gmail.com
In response to Re: Standby trying "restore_command" before local WAL  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Standby trying "restore_command" before local WAL
List pgsql-hackers
Hi,

2018-07-31 20:25 GMT+02:00 Stephen Frost <sfrost@snowman.net>:
>
>
> There's still a question here, at least from my perspective, as to which
> is actually going to be faster to perform recovery based off of.  A good
> restore command, which pre-fetches the WAL in parallel and gets it local
> and on the same filesystem, meaning that the restore_command only has to
> execute essentially a 'mv' and return back to PG for the next WAL file,
> is really rather fast, compared to streaming that same data over the
> network with a single TCP connection to the primary.  Of course, there's
> a lot of variables there and it depends on the network speed between the
> various pieces, but I've certainly had cases where a replica catches up
> much faster using restore command than streaming from the primary.


Sure, mv is incredibly fast, but not calling an external script/binary at
all is still faster than calling one.

What about the following cases?
1. The replica host crashed, and pg_wal holds a few thousand WAL files.
2. We are creating a new replica with pg_basebackup -X stream; it
takes a long time and again leaves a few thousand WAL files.

In both cases, if there is no restore_command in recovery.conf,
postgres will happily read WAL files from pg_wal, and only when there
is nothing left will it try to start streaming.

But if restore_command is defined, postgres will always call it,
for every single WAL file it wants to restore.
If the restore_command exits with a nonzero exit code, postgres
happily restores the file from pg_wal!
And only if the file is not there or is not valid does postgres try to
start streaming.
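
To make the lookup order concrete, here is a toy model of the behavior
described above, written in Python. It is only a sketch, not the actual
PostgreSQL code: locate_segment, always_fails and the set standing in
for pg_wal are hypothetical names invented for this illustration.

```python
from typing import Callable, Optional, Set, Tuple


def locate_segment(
    segment: str,
    pg_wal: Set[str],
    restore_command: Optional[Callable[[str], int]] = None,
) -> Tuple[str, int]:
    """Return (source, external_calls) for one WAL segment.

    Toy model of the order described in this mail:
    restore_command (if defined) => pg_wal => streaming.
    """
    external_calls = 0
    if restore_command is not None:
        external_calls += 1
        if restore_command(segment) == 0:  # exit code 0 means success
            return "archive", external_calls
    if segment in pg_wal:
        return "pg_wal", external_calls
    return "streaming", external_calls


def always_fails(segment: str) -> int:
    """Stand-in for a restore_command that exits with a nonzero code."""
    return 1


# A few thousand segments already sitting in pg_wal (made-up names).
local_segments = {f"{i:024X}" for i in range(1, 3001)}

results = [
    locate_segment(seg, local_segments, always_fails)
    for seg in sorted(local_segments)
]
sources = {source for source, _ in results}
total_calls = sum(calls for _, calls in results)
print(sources, total_calls)
```

In this model every segment ends up being read from pg_wal anyway, but
only after one failed external call per segment.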

From my point of view, there is no difference between having no
restore_command and relying only on streaming replication, and having
a restore_command which always fails.
Therefore I don't really understand why we stick to
"restore_command => pg_wal => streaming" and why it is not possible to
change it to "pg_wal => restore_command => streaming" or maybe even
"pg_wal => streaming => restore_command".
I am not sure about the last option, but in any case, before going to
some remote place, postgres should try to find (and try to replay) the
WAL file in pg_wal.
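
A rough way to compare the orderings is to count how often the external
command would run under each one. The Python sketch below does that
under simplifying assumptions: the archive holds every segment,
streaming always succeeds, and replay, archive_restore and the segment
names are hypothetical and not part of PostgreSQL.

```python
from collections import Counter
from typing import Callable, Sequence, Set, Tuple


def replay(
    order: Sequence[str],
    segments: Sequence[str],
    pg_wal: Set[str],
    restore_command: Callable[[str], int],
) -> Tuple[Counter, int]:
    """Try the sources in `order` for each segment.

    Returns (segments served per source, number of restore_command calls).
    """
    served: Counter = Counter()
    calls = 0
    for seg in segments:
        for source in order:
            if source == "pg_wal":
                if seg in pg_wal:
                    served[source] += 1
                    break
            elif source == "restore_command":
                calls += 1
                if restore_command(seg) == 0:  # exit code 0 means success
                    served[source] += 1
                    break
            elif source == "streaming":
                served[source] += 1  # assumed to always succeed here
                break
            else:
                raise ValueError(f"unknown source: {source}")
    return served, calls


# 3000 segments are already local; 10 more exist only in the archive.
segments = [f"{i:024X}" for i in range(1, 3011)]
pg_wal = set(segments[:3000])
archive = set(segments)


def archive_restore(seg: str) -> int:
    """Stand-in for a working restore_command backed by `archive`."""
    return 0 if seg in archive else 1


current = ("restore_command", "pg_wal", "streaming")
proposed = ("pg_wal", "restore_command", "streaming")

current_served, current_calls = replay(current, segments, pg_wal, archive_restore)
proposed_served, proposed_calls = replay(proposed, segments, pg_wal, archive_restore)
print(current_calls, proposed_calls)
```

With these made-up numbers the current order runs the external command
for all 3010 segments, while the proposed order runs it only for the 10
segments that are not already in pg_wal.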

Regards,
--
Alexander Kukushkin

