Re: Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Hard limit on WAL space used (because PANIC sucks)
Date
Msg-id CAMkU=1wCsTvMt=XrqHroaRCynYJZBQKW6hqL948UAVk=mLkr5g@mail.gmail.com
In response to Re: Hard limit on WAL space used (because PANIC sucks)  ("Joshua D. Drake" <jd@commandprompt.com>)
List pgsql-hackers
On Sat, Jun 8, 2013 at 11:07 AM, Joshua D. Drake <jd@commandprompt.com> wrote:

On 06/08/2013 07:36 AM, MauMau wrote:

1. If the machine or postgres crashes while archive_command is copying a
WAL file, later archive recovery fails.
This is because cp leaves a file of less than 16MB in archive area, and
postgres refuses to start when it finds such a small archive WAL file.

Should that be changed?  If the file is 16MB but it turns to gibberish after 3MB, recovery proceeds up to the gibberish.  Given that, why should it refuse to start if the file is only 3MB to start with?
 
The solution, which IIRC Tomas san told me here, is to do like "cp %p
/archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".


This will overwrite /archive/dir/%f if it already exists, which is usually recommended against, although I don't know that it necessarily should be.  One common problem with archiving is for a network glitch to occur during the archive command, so the archive command fails and is retried later.  But the later tries will always fail, because the target file was already created before or during the glitch.  Perhaps a more full-featured archive command would detect and rename an existing file, rather than either overwriting it or failing.

If we have no compunction about overwriting the file, then I don't see a reason to use the cp + mv combination.  If the simple cp fails to copy the entire file, it will be tried again until it succeeds.
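
The "detect and rename an existing file" idea could be sketched roughly as below. This is only an illustration, assuming a POSIX shell with cmp and date available; the function name archive_wal and the .tmp / .conflict suffixes are made up for the example and are not anything PostgreSQL ships. It does not fsync anything, so it is not a complete answer to crash safety.

```shell
# Hypothetical helper: archive_wal SRC_PATH FILE_NAME ARCHIVE_DIR
# SRC_PATH and FILE_NAME correspond to %p and %f in archive_command.
archive_wal() {
    src=$1
    dest="$3/$2"
    tmp="$dest.tmp.$$"

    if [ -e "$dest" ]; then
        # Same bytes already archived (e.g. a retry after a glitch):
        # report success instead of failing forever.
        if cmp -s "$src" "$dest"; then
            return 0
        fi
        # Different content (e.g. a partial copy): move it aside
        # rather than overwriting it or refusing to proceed.
        mv "$dest" "$dest.conflict.$(date +%s).$$" || return 1
    fi

    # Copy under a temporary name, then rename, so a reader of the
    # archive never sees a half-written file under the final name.
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$dest"
}
```

Saved in a script that ends with a call such as archive_wal "$@", it could be wired in as archive_command = '/path/to/script %p %f /archive/dir', where the script path is a placeholder.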


Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that.

Actually, we describe the requirements that archive_command needs to fulfill, and tell people to use something that meets them.  The example with cp is explicitly given as an example, not a recommendation.
 



3. You cannot know the reason of archive_command failure (e.g. archive
area full) if you don't use PostgreSQL's server logging.
This is because archive_command failure is not logged in syslog/eventlog.

Wait, what? Is this true (someone else?)


It is kind of true.  PostgreSQL does not automatically arrange for the stderr of the archive_command to be sent to syslog.  But archive_command can do whatever it wants, including arranging for its own failure messages to go to syslog. 
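
As a rough sketch of that last point, a wrapper can capture the stderr of whatever command does the copying and hand it to syslog through the standard logger utility when the command fails. The function name run_logged, the archive_command tag, and the user.err priority are illustrative choices, not PostgreSQL conventions.

```shell
# Hypothetical wrapper: run_logged COMMAND [ARGS...]
# Runs the command; on failure, sends its exit status and stderr to syslog.
run_logged() {
    errfile=$(mktemp) || return 1
    status=0
    "$@" 2>"$errfile" || status=$?
    if [ "$status" -ne 0 ]; then
        # logger reads the message from stdin when none is given as arguments.
        { echo "exit status $status"; cat "$errfile"; } |
            logger -t archive_command -p user.err
    fi
    rm -f "$errfile"
    # Preserve the original exit status so the server still sees the failure.
    return "$status"
}
```

For example, run_logged cp "$1" "/archive/dir/$2" inside an archive script would leave a message such as "No space left on device" in syslog while still returning cp's nonzero status to the server.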

Cheers,

Jeff
