Thread: Basebackup fails without useful error message
Having run a basebackup, I'm getting this output at the very end:
pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"
Is there a way to get more information as to what exactly happened?
I'd like to look into fixing this or doing whatever is required so that it doesn't happen again, but this just isn't enough info. Where do I start looking?
Regards,
Koen De Groote
On 9/29/24 07:48, Koen De Groote wrote: > Having run a basebackup, I'm getting this output at the very end: > > pg_basebackup: child process exited with error 1 > pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/" What is the complete command you are using? > > Is there a way to get more information as to what exactly happened? Have you looked at the Postgres log? Is --verbose being used? > > I'd like to look into fixing this or doing whatever is required so that > it doesn't happen again, but this just isn't enough info. Where do I > start looking? > > Regards, > Koen De Groote -- Adrian Klaver adrian.klaver@aklaver.com
> What is the complete command you are using?
The full command is:
pg_basebackup -h localhost -p 5432 -U basebackup_user -D /mnt/base_backup/dir -Ft -z -P
So output Format as tar, gzipped, and with progress being printed.
> Have you looked at the Postgres log?
> Is --verbose being used?
> Is --verbose being used?
This is straight from the logs, it's the only output besides the % progress counter.
Will have a look at --verbose.
Regards,
Koen De Groote
On Sun, Sep 29, 2024 at 5:01 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 9/29/24 07:48, Koen De Groote wrote:
> Having run a basebackup, I'm getting this output at the very end:
>
> pg_basebackup: child process exited with error 1
> pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"
What is the complete command you are using?
>
> Is there a way to get more information as to what exactly happened?
Have you looked at the Postgres log?
Is --verbose being used?
>
> I'd like to look into fixing this or doing whatever is required so that
> it doesn't happen again, but this just isn't enough info. Where do I
> start looking?
>
> Regards,
> Koen De Groote
--
Adrian Klaver
adrian.klaver@aklaver.com
On 9/29/24 08:57, Koen De Groote wrote: > > What is the complete command you are using? > > The full command is: > > pg_basebackup -h localhost -p 5432 -U basebackup_user -D > /mnt/base_backup/dir -Ft -z -P > > So output Format as tar, gzipped, and with progress being printed. > > > Have you looked at the Postgres log? > > > Is --verbose being used? > > This is straight from the logs, it's the only output besides the % > progress counter. > > Will have a look at --verbose. When you report on that and if it does not report the error then what is?: Postgres version. OS and version. Anything special about the cluster like tablespaces, extensions, replication, etc. > > Regards, > Koen De Groote > -- Adrian Klaver adrian.klaver@aklaver.com
Hi,
Also examine recent postgres logs
Verify backup directory permissions
You can enable verbose to get more detail about what is happening during backup
pg_basebackup -h your_host -U your_user -D /mnt/base_backup/dir/ -v
Verify backup directory permissions
On Sun, 29 Sept 2024 at 19:49, Koen De Groote <kdg.dev@gmail.com> wrote:
Having run a basebackup, I'm getting this output at the very end:pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"Is there a way to get more information as to what exactly happened?I'd like to look into fixing this or doing whatever is required so that it doesn't happen again, but this just isn't enough info. Where do I start looking?Regards,Koen De Groote
Hello Adrian, and everyone else.
> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
It has finally happened, the backup ran into an error again, and the verbose output set me on the right path.
I'm getting this error message:
> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
Combined with the main server logging:
> terminating walsender process due to replication timeout
Now, the server is set up with an archive_command which gzips the WAL files and writes them to a network filesystem.
From looking at machine metrics at the time, my conclusion is the following:
At the time of the error, the remote filesystem experienced a very high queue size for new writes.
So I'm assuming the process of writing WAL files, if there is an archive_command set, is only considered to be finished after the archive is written, not just when the WAL file is written in pg_wal.
I'm also seeing in the documentation that the default WAL method for pg_basebackup is "stream", which waits for these WAL files as they are produced.
I suspect that I have 2 possible paths at this point:
1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none since my restore_command is set up to explicitly go to the very same network storage to get the archived WAL files.
I'm going to be testing this. If someone could confirm that this is how writing WAL files works, that being: that it is only considered "done" when the archive_command is done, that would be great.
Regards,
Koen De Groote
On Sun, Sep 29, 2024 at 6:08 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 9/29/24 08:57, Koen De Groote wrote:
> > What is the complete command you are using?
>
> The full command is:
>
> pg_basebackup -h localhost -p 5432 -U basebackup_user -D
> /mnt/base_backup/dir -Ft -z -P
>
> So output Format as tar, gzipped, and with progress being printed.
>
> > Have you looked at the Postgres log?
>
> > Is --verbose being used?
>
> This is straight from the logs, it's the only output besides the %
> progress counter.
>
> Will have a look at --verbose.
When you report on that and if it does not report the error then what is?:
Postgres version.
OS and version.
Anything special about the cluster like tablespaces, extensions,
replication, etc.
>
> Regards,
> Koen De Groote
>
--
Adrian Klaver
adrian.klaver@aklaver.com
On 10/20/24 14:03, Koen De Groote wrote: > So I'm assuming the process of writing WAL files, if there is an > archive_command set, is only considered to be finished after the archive > is written, not just when the WAL file is written in pg_wal. https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL "It is important that the archive command return zero exit status if and only if it succeeds. Upon getting a zero result, PostgreSQL will assume that the file has been successfully archived, and will remove or recycle it. However, a nonzero status tells PostgreSQL that the file was not archived; it will try again periodically until it succeeds." > Regards, > Koen De Groote > > -- Adrian Klaver adrian.klaver@aklaver.com
See, I'm reading that, and my conclusion is "Ah great, it will try again, everything will be fine."
There's a link between the archive_command, pg_basebackup and wal_sender_timeout, but that link isn't clear from just reading all these documentation pages when learning about their concepts separately.
It's just too much information to read and then "just know" in your head that these must be the relations. Especially if you have dozens of other things to do.
That's my 2 cents.
Regards,
Koen De Groote
On Sun, Oct 20, 2024 at 11:12 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 10/20/24 14:03, Koen De Groote wrote:
> So I'm assuming the process of writing WAL files, if there is an
> archive_command set, is only considered to be finished after the archive
> is written, not just when the WAL file is written in pg_wal.
https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL
"It is important that the archive command return zero exit status if and
only if it succeeds. Upon getting a zero result, PostgreSQL will assume
that the file has been successfully archived, and will remove or recycle
it. However, a nonzero status tells PostgreSQL that the file was not
archived; it will try again periodically until it succeeds."
> Regards,
> Koen De Groote
>
>
--
Adrian Klaver
adrian.klaver@aklaver.com
On 10/21/24 13:49, Koen De Groote wrote: > See, I'm reading that, and my conclusion is "Ah great, it will try > again, everything will be fine." Unless it never completes and you hit the timeout. > > There's a link between the archive_command, pg_basebackup and > wal_sender_timeout, but that link isn't clear from just reading all > these documentation pages when learning about their concepts separately. > > It's just too much information to read and then "just know" in your head > that these must be the relations. Especially if you have dozens of other > things to do. > > That's my 2 cents. > > Regards, > Koen De Groote > -- Adrian Klaver adrian.klaver@aklaver.com
On Sunday, October 20, 2024, Koen De Groote <kdg.dev@gmail.com> wrote:
I'm going to be testing this. If someone could confirm that this is how writing WAL files works, that being: that it is only considered "done" when the archive_command is done, that would be great.
The archiving of WAL files by the primary does not involve a replication connection of any sort and thus the “WAL sender” settings are not relevant to it; or, here, whether or not you are archiving your WAL is immaterial since you are streaming it as it gets produced.
If you are streaming WAL it seems highly unusual that you’d end up in a situation where the connection goes idle long enough that it gets killed, especially if the backup is still happening. I’d probably go with performing the backup under a disabled (or extremely large?) timeout though and move on to other things.
That isn’t to say I fully understand what actually is happening here…
David J.
Hello David,
I saw the backup fail. The backup logged that it terminated the walsender, and correlating the moment it failed to the metrics of my storage, shows the storage at that time was facing a huge IOWAIT. And this was a network mounted storage.
The backup process continued, but because of a failure to stream WAL without error(due to a local issue) the entire backup was marked as failed. At the end, pg_basebackup will delete the backup, in this case. There's no flag to control this final behavior.
I'll be testing restore soon without streaming WAL, since the actual restore I perform doesn't use the pg_wal.tar.gz file. It gets the archived WAL At least I think it doesn't need it, hence the need for testing.
Regards,
Koen De Groote
On Tue, Oct 22, 2024 at 12:34 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Sunday, October 20, 2024, Koen De Groote <kdg.dev@gmail.com> wrote:I'm going to be testing this. If someone could confirm that this is how writing WAL files works, that being: that it is only considered "done" when the archive_command is done, that would be great.The archiving of WAL files by the primary does not involve a replication connection of any sort and thus the “WAL sender” settings are not relevant to it; or, here, whether or not you are archiving your WAL is immaterial since you are streaming it as it gets produced.If you are streaming WAL it seems highly unusual that you’d end up in a situation where the connection goes idle long enough that it gets killed, especially if the backup is still happening. I’d probably go with performing the backup under a disabled (or extremely large?) timeout though and move on to other things.That isn’t to say I fully understand what actually is happening here…David J.