Thread: In case of network issues, how long before archive_command does retries

In case of network issues, how long before archive_command does retries

From
Koen De Groote
Date:
I've got a setup where archive_command will gzip the wal archive to a directory that is itself an NFS mount.

When the connection is gone or blocked, archive_command fails after the timeout specified by the NFS mount, as expected. (That is for a soft mount; a hard mount hangs, as expected.)
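For readers who want something concrete, here is a minimal sketch of the kind of helper such an archive_command could call. It is not the script used in this setup: the function name archive_wal, the ".tmp" suffix and the decision to refuse overwriting are all assumptions made for illustration. What matters for the retry question is that any I/O error (for example a soft NFS mount timing out) becomes a non-zero result, which PostgreSQL treats as a failed archive attempt.

```python
import gzip
import os
import shutil
from pathlib import Path


def archive_wal(source_path, archive_dir):
    """Gzip one WAL file into archive_dir; return 0 on success, 1 on failure.

    Hypothetical helper: a script used as archive_command would exit with
    this return value.
    """
    source = Path(source_path)
    target = Path(archive_dir) / (source.name + ".gz")
    temporary = target.with_name(target.name + ".tmp")
    try:
        if target.exists():
            # Never overwrite a file that is already in the archive.
            return 1
        with open(source, "rb") as src, gzip.open(temporary, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Publish the file under its final name only once it is complete.
        os.replace(temporary, target)
    except OSError:
        # Covers a missing directory, a full disk, or an NFS soft-mount timeout.
        try:
            temporary.unlink()
        except OSError:
            pass
        return 1
    return 0
```

Writing to a temporary name first means a half-written file never appears under the final segment name if the connection drops mid-copy.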

However, on restoring connection, it's not clear to me how long it takes before the command is retried.

Experience says "a few minutes", but I can't find documentation on an exact algorithm.

To be clear, the question is: if archive_command fails, what are the specifics of retrying? Is there a timeout? How is that timeout defined?

Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.

For detail, I'm using postgres 11, running on Ubuntu 20.

Regards,
Koen
On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
> I've got a setup where archive_command will gzip the wal archive to a directory that is itself an NFS mount.
> 
> When connection is gone or blocked, archive_command fails after the timeout specified by the NFS mount, as expected. (for a soft mount. hard mount hangs, as expected)
> 
> However, on restoring connection, it's not clear to me how long it takes before the command is retried.
> 
> Experience says "a few minutes", but I can't find documentation on an exact algorithm.
> 
> To be clear, the question is: if archive_command fails, what are the specifics of retrying? Is there a timeout? How is that timeout defined?
> 
> Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
> 
> For detail, I'm using postgres 11, running on Ubuntu 20.

You can find the details in "src/backend/postmaster/pgarch.c".

The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at intervals
of one second, then back off until it receives a signal, PostgreSQL shuts down
or a minute has passed.
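As a rough illustration of that schedule, the following sketch computes when each attempt would start if every attempt failed instantly and no signal or shutdown arrived. It is a simplified model and not a translation of pgarch.c: the function name attempt_offsets and the constant RETRY_SLEEP_SECONDS are made up for the example, and the time archive_command itself takes (such as an NFS timeout) is ignored.

```python
NUM_ARCHIVE_RETRIES = 3        # attempts per burst, as named in pgarch.c
RETRY_SLEEP_SECONDS = 1        # pause between attempts inside a burst (assumed name)
PGARCH_AUTOWAKE_INTERVAL = 60  # back-off between bursts, in seconds


def attempt_offsets(total_attempts):
    """Return the start time, in seconds, of each archive attempt.

    Assumes every attempt fails immediately and nothing wakes the archiver early.
    """
    offsets = []
    now = 0
    failures = 0
    while len(offsets) < total_attempts:
        offsets.append(now)
        failures += 1
        if failures >= NUM_ARCHIVE_RETRIES:
            # Burst exhausted: back off for the long interval.
            now += PGARCH_AUTOWAKE_INTERVAL
            failures = 0
        else:
            now += RETRY_SLEEP_SECONDS
    return offsets


print(attempt_offsets(7))  # [0, 1, 2, 62, 63, 64, 124]
```

With a slow failure, such as an NFS soft-mount timeout on every attempt, the gaps grow by however long each failed command runs.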

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: In case of network issues, how long before archive_command does retries

From
Koen De Groote
Date:
Hello Laurenz,

Thanks for the reply. That would mean the source code is here: https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c

Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?

If my understanding of the code is right, then as long as no SIGTERM or other signal arrives, a WAL file is never skipped, even if archive_command fails too many times or takes too long? It will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point where it stops trying, waiting for the next time archive_command is invoked?

I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no ".done" file for that WAL file under pg_wal/archive_status, it will keep trying forever (or until someone forcefully switches the timeline, for instance with a base backup)?

Apologies, I already sent this message once, but only to Laurenz. Sending again to have it in the archives.

Regards,
Koen

On Thu, 2022-05-19 at 15:43 +0200, Koen De Groote wrote:
> On Thu, May 19, 2022 at 9:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
> > > When connection is gone or blocked, archive_command fails after the timeout specified
> > > by the NFS mount, as expected. (for a soft mount. hard mount hangs, as expected)
> > > 
> > > However, on restoring connection, it's not clear to me how long it takes before the command is retried.
> > > 
> > > Experience says "a few minutes", but I can't find documentation on an exact algorithm.
> > > 
> > > To be clear, the question is: if archive_command fails, what are the specifics of retrying?
> > > Is there a timeout? How is that timeout defined?
> > > 
> > > Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
> > > 
> > > For detail, I'm using postgres 11, running on Ubuntu 20.
> > 
> > You can find the details in "src/backend/postmaster/pgarch.c".
> > 
> > The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at intervals
> > of one second, then back off until it receives a signal, PostgreSQL shuts down
> > or a minute has passed.
>
> Thanks for the reply. That would mean the source code is here:
> https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c

For release 11.0, yes.

> Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?

No, that is an operating system signal.
PostgreSQL processes communicate by sending signals to each other, and if anybody
wakes up the archiver, it will try again.
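To make "wakes up the archiver" a bit more tangible, here is a small analogy in Python. It is only a model: a threading.Event stands in for the mechanism the archiver sleeps on, and the name wait_for_wakeup is invented for this example. The point is that the sleep ends either when the timeout expires or when someone else sets the event, whichever comes first.

```python
import threading
import time


def wait_for_wakeup(latch, timeout_seconds):
    """Sleep until the latch is set or the timeout expires.

    Returns (woken_early, seconds_waited). The latch is a threading.Event
    that models another process signalling the sleeper.
    """
    started = time.monotonic()
    woken_early = latch.wait(timeout_seconds)
    latch.clear()  # reset so the next wait blocks again
    return woken_early, time.monotonic() - started


latch = threading.Event()
# Model another process signalling the sleeper shortly after it goes to sleep.
threading.Timer(0.05, latch.set).start()
woken, waited = wait_for_wakeup(latch, 2.0)
print(woken, waited < 2.0)
```

In the archiver's case the timeout is the one-minute interval mentioned earlier in the thread, and the wake-up comes from another PostgreSQL process.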
 
> If my understanding of the code is right, if no SIGTERM or other signal arrives, it won't ever happen
> that a walarchive is skipped if the archive_command fails too many times or takes too long? It
> will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point
> where it stops trying, waiting for the next time archive_command is invoked?

Even if a signal arrives, PostgreSQL will keep trying to archive that same WAL segment
that failed until it is done.

This is a potential sequence of events:

  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 60 seconds
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 60 seconds -> get woken up by a signal after 30 seconds
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  get shutdown request -> exit

When PostgreSQL restarts, it will continue trying to archive the same segment.
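The sequence above can be reproduced with a small simulation. This is a sketch under simplifying assumptions and not how pgarch.c is written: archive_command always fails and takes no time, a wake-up signal only matters if it arrives during the long sleep, and the function name simulate and its parameters are invented for the example.

```python
NUM_ARCHIVE_RETRIES = 3
RETRY_SLEEP = 1          # seconds between attempts within a burst
AUTOWAKE_INTERVAL = 60   # seconds of back-off after a failed burst


def simulate(signal_times, shutdown_time):
    """Return (time, event) tuples for an archive_command that always fails."""
    events = []
    signals = sorted(signal_times)
    now = 0
    while True:
        for attempt in range(1, NUM_ARCHIVE_RETRIES + 1):
            if now >= shutdown_time:
                events.append((shutdown_time, "shutdown request -> exit"))
                return events
            events.append((now, "try to archive -> fail"))
            if attempt < NUM_ARCHIVE_RETRIES:
                now += RETRY_SLEEP
        # Long sleep: ends at the timeout or at the first signal, if earlier.
        wake_at = now + AUTOWAKE_INTERVAL
        signals = [t for t in signals if t > now]  # simplification: drop stale signals
        if signals and signals[0] < wake_at:
            wake_at = signals.pop(0)
        if shutdown_time <= wake_at:
            events.append((shutdown_time, "shutdown request -> exit"))
            return events
        now = wake_at


# One signal 30 seconds into the second long sleep, shutdown shortly afterwards.
for when, what in simulate(signal_times=[94], shutdown_time=95.5):
    print(when, what)
```

The attempt times come out as 0, 1, 2, 62, 63, 64, 94 and 95, followed by the shutdown, which matches the listing.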
 
> I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no
> ".done" file for that walarchive under pg_wal/archive_status, it will keep trying forever (or until
> someone forcefully switches the timeline with for instance a basebackup)?

Yes, it will keep trying, and a timeline switch won't change that.
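If you want to see what the archiver still has to work through, the status files can be listed directly. A segment waiting to be archived has a ".ready" file in pg_wal/archive_status, which becomes ".done" once archiving succeeds. The sketch below assumes read access to the data directory; the function name pending_segments and the example path are placeholders, not anything PostgreSQL ships.

```python
from pathlib import Path


def pending_segments(data_dir):
    """Return the names of WAL files still waiting to be archived.

    data_dir is the PostgreSQL data directory (assumed readable by the caller).
    """
    status_dir = Path(data_dir) / "pg_wal" / "archive_status"
    # "000000010000000000000001.ready" -> "000000010000000000000001"
    return sorted(path.stem for path in status_dir.glob("*.ready"))


# Hypothetical usage; the path depends on the installation:
# print(pending_segments("/var/lib/postgresql/11/main"))
```

A list that keeps growing while archive_command fails is the visible sign of the retry behaviour described above.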

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: In case of network issues, how long before archive_command does retries

From
Koen De Groote
Date:
Thank you for your thorough explanation.

On Thu, May 19, 2022 at 5:47 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote: