Re: In case of network issues, how long before archive_command does retries - Mailing list pgsql-general

From Laurenz Albe
Subject Re: In case of network issues, how long before archive_command does retries
Msg-id f992ec94fe4fbc0b7e39a8f26e1e397c98c929a0.camel@cybertec.at
In response to Re: In case of network issues, how long before archive_command does retries  (Koen De Groote <kdg.dev@gmail.com>)
Responses Re: In case of network issues, how long before archive_command does retries
List pgsql-general
On Thu, 2022-05-19 at 15:43 +0200, Koen De Groote wrote:
> On Thu, May 19, 2022 at 9:10 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > On Wed, 2022-05-18 at 22:51 +0200, Koen De Groote wrote:
> > > When the connection is gone or blocked, archive_command fails after the timeout specified
> > > by the NFS mount, as expected (for a soft mount; a hard mount hangs, as expected).
> > > 
> > > However, on restoring connection, it's not clear to me how long it takes before the command is retried.
> > > 
> > > Experience says "a few minutes", but I can't find documentation on an exact algorithm.
> > > 
> > > To be clear, the question is: if archive_command fails, what are the specifics of retrying?
> > > Is there a timeout? How is that timeout defined?
> > > 
> > > Is this detailed somewhere? Perhaps in the source code? I couldn't find it in the documentation.
> > > 
> > > For detail, I'm using postgres 11, running on Ubuntu 20.
> > 
> > You can find the details in "src/backend/postmaster/pgarch.c".
> > 
> > The archiver will try to archive three times (NUM_ARCHIVE_RETRIES) at intervals
> > of one second, then back off until it receives a signal, PostgreSQL shuts down
> > or a minute has passed.
>
> Thanks for the reply. That would mean the source code is here:
> https://github.com/postgres/postgres/blob/REL_11_0/src/backend/postmaster/pgarch.c

For release 11.0, yes.

> Just to be sure, the "signal" you speak of, this is the result of the command executed by archive_command?

No, that is an operating system signal.
PostgreSQL processes communicate by sending signals to each other, and if anybody
wakes up the archiver, it will try again.
 
> If my understanding of the code is right, then if no SIGTERM or other signal arrives, a WAL segment
> will never be skipped because archive_command fails too many times or takes too long? It
> will simply check again every 60 seconds (PGARCH_AUTOWAKE_INTERVAL)? Or is the 60 seconds the point
> where it stops trying, waiting for the next time archive_command is invoked?

Even if a signal arrives, PostgreSQL will keep trying to archive that same WAL segment
that failed until it is done.

This is a potential sequence of events:

  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 60 seconds
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  sleep 60 seconds -> get woken up by a signal after 30 seconds
  try to archive -> fail
  sleep 1 second
  try to archive -> fail
  get shutdown request -> exit

When PostgreSQL restarts, it will continue trying to archive the same segment.
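
To make the schedule above concrete, here is a rough simulation in Python. It is
only a sketch of the timing described in this message, not the code from pgarch.c:
the function name simulate_archiver, its parameters and the value of
RETRY_SLEEP_SECONDS as a named constant are made up for illustration, and it uses
a simulated clock instead of really sleeping.

```python
NUM_ARCHIVE_RETRIES = 3        # attempts per burst, as named in the thread
RETRY_SLEEP_SECONDS = 1        # pause between attempts in a burst (hypothetical name)
PGARCH_AUTOWAKE_INTERVAL = 60  # longest back-off before the next burst


def simulate_archiver(try_archive, max_bursts=3, wake_after=None):
    """Simulate the retry schedule for a single WAL segment.

    try_archive: callable returning True on success; a stand-in for
                 running archive_command.
    wake_after:  optional dict mapping a burst index to the number of
                 seconds after which a signal cuts that back-off short.
    Returns (events, elapsed_seconds), where events is a list of
    (simulated_time, "fail" | "archived") tuples.
    """
    wake_after = wake_after or {}
    events = []
    clock = 0
    for burst in range(max_bursts):
        for attempt in range(NUM_ARCHIVE_RETRIES):
            if try_archive():
                events.append((clock, "archived"))
                return events, clock
            events.append((clock, "fail"))
            # Sleep one second between attempts, but not after the last one.
            if attempt < NUM_ARCHIVE_RETRIES - 1:
                clock += RETRY_SLEEP_SECONDS
        # Back off until a signal arrives or the autowake interval passes.
        backoff = wake_after.get(burst, PGARCH_AUTOWAKE_INTERVAL)
        clock += min(backoff, PGARCH_AUTOWAKE_INTERVAL)
    return events, clock
```

Calling simulate_archiver(lambda: False, max_bursts=2, wake_after={1: 30}) mimics
the first two bursts with the second back-off cut short by a signal after 30
seconds. The simulation gives up after max_bursts, which the real archiver does not.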
 
> I'm assuming that as long as the file is still in the pg_wal directory and as long as there is no
> ".done" file for that WAL segment under pg_wal/archive_status, it will keep trying forever (or until
> someone forcefully switches the timeline with, for instance, a base backup)?

Yes, it will keep trying, and a timeline switch won't change that.
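
If you want to see which segments are still waiting, one option is to look at the
marker files in pg_wal/archive_status. The sketch below assumes the usual
convention that a segment waiting to be archived has a ".ready" marker, which is
replaced by the ".done" marker mentioned above once archiving succeeds. The helper
name pending_segments is made up for this example, and the directory path is
whatever your data directory uses.

```python
from pathlib import Path


def pending_segments(archive_status_dir):
    """Return the sorted names of WAL files that still have a .ready marker."""
    status = Path(archive_status_dir)
    # Path.stem drops only the final ".ready" suffix, so a name such as
    # "00000002.history.ready" becomes "00000002.history".
    return sorted(p.stem for p in status.glob("*.ready"))
```

The first entry of the returned list is normally the segment the archiver is
currently stuck on, under the assumption that older segments sort first by name.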

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com