Thread: [RFC] What should we do for reliable WAL archiving?

[RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

16 March 2014, 10:25:04

Hello,

The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on 
Windows) as an example for archive_command.  However, cp/copy does not sync 
the copied data to disk.  As a result, the completed WAL segments would be 
lost in the following sequence:

1. A WAL segment fills up.

2. The archiver process archives the just filled WAL segment using 
archive_command.  That is, cp/copy reads the WAL segment file from pg_xlog/ 
and writes to the archive area.  At this point, the WAL file is not 
persisted to the archive area yet, because cp/copy doesn't sync the writes.

3. The checkpoint processing removes the WAL segment file from pg_xlog/.

4. The OS crashes.  The filled WAL segment doesn't exist anywhere any more.

Considering the "reliable" image of PostgreSQL and widespread use in 
enterprise systems, I think something should be done.  Could you give me 
your opinions on the right direction?  Although the doc certainly escapes by 
saying "(This is an example, not a recommendation, and might not work on all 
platforms.)", it seems from pgsql-xxx MLs that many people are following 
this example.

* Improve the example in the documentation.
But what command can we use to reliably sync just one file?

* Provide some command, say pg_copy, which copies a file synchronously by 
using fsync(), and describes in the doc something like "for simple use 
cases, you can use pg_copy as the standard reliable copy command."

Related to this topic, pg_basebackup doesn't fsync the backed up files.  I'm 
afraid this too is different from what the users expect --- I guess they 
would expect the backup is certainly available after pg_basebackup completes 
even if the machine crashes.

Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

Amit Kapila

Date:

16 March 2014, 11:21:27

On Sun, Mar 16, 2014 at 3:53 PM, MauMau <maumau307@gmail.com> wrote:
> Hello,
>
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command.  However, cp/copy does not sync
> the copied data to disk.  As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command.  That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area.  At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>
> 4. The OS crashes.  The filled WAL segment doesn't exist anywhere any more.
>
> Considering the "reliable" image of PostgreSQL and widespread use in
> enterprise systems, I think something should be done.  Could you give me
> your opinions on the right direction?

How about using pg_receivexlog for archiving purpose?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

16 March 2014, 14:24:14

From: "Amit Kapila" <amit.kapila16@gmail.com>
> How about using pg_receivexlog for archiving purpose?

pg_receivexlog is good in that it does fsync().  But it seems difficult to 
use correctly, and I'm not sure if I can catch all WAL segments without any 
loss.  pg_receivexlog must be started with postmaster and monitored with 
some measures.  This won't be very easy at least on Windows.

The pg_receivexlog reference page suggests another difficulty:

Notes
When using pg_receivexlog instead of archive_command, the server will 
continue to recycle transaction log files even if the backups are not 
properly archived, since there is no command that fails. This can be worked 
around by having an archive_command that fails when the file has not been 
properly archived yet, for example:
archive_command = 'sleep 5 && test -f /mnt/server/archivedir/%f'

This suggestion is not correct, because it only checks the existence of the 
file.  What if the file size is less than 16MB?  How can we check if the 
file is completely archived?

Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

Greg Stark

Date:

16 March 2014, 14:48:15

On Sun, Mar 16, 2014 at 10:23 AM, MauMau <maumau307@gmail.com> wrote:
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command.  However, cp/copy does not sync
> the copied data to disk

I'm actually a lot less concerned about fsyncing the backup than I am
about fsyncing the restore. The backup is just a bunch of files for
the user to make use of. They might copy them around, compress them,
move them onto tape or other storage. They need to be aware of the
persistence of whatever storage system they're putting them in.

But when they do a restore they just untar or whatever other
extraction tool and then hand those files to Postgres to maintain. I
bet the number of people who fsync or call sync the data files after
untarring their backups is vanishingly small and problems could
manifest later after Postgres has been running.

WAL-e recently changed to fsync each data file and the directories
containing them after restore. But perhaps Postgres should open and
fsync each file in the database when it starts up?

In most file systems files written to are guaranteed to be synced
within a configurable amount of time (in some systems unless the
filesystem can't keep up). So the practical risk may be small. But in
theory a database that wasn't synced when it was restored could
suddenly lose files days or months later when a crash occurs and some
data files weren't touched by the database in the intervening time.

-- 
greg

Re: [RFC] What should we do for reliable WAL archiving?

From

Andreas Karlsson

Date:

16 March 2014, 18:59:45

On 03/16/2014 03:23 PM, MauMau wrote:
> From: "Amit Kapila" <amit.kapila16@gmail.com>
>> How about using pg_receivexlog for archiving purpose?
>
> pg_receivexlog is good in that it does fsync().  But it seems difficult
> to use correctly, and I'm not sure if I can catch all WAL segments
> without any loss.  pg_receivexlog must be started with postmaster and
> monitored with some measures.  This won't be very easy at least on Windows.

Replication slots should solve the issue of making sure to catch all of 
the WAL.

-- 
Andreas Karlsson

Re: [RFC] What should we do for reliable WAL archiving?

From

Robert Haas

Date:

17 March 2014, 01:20:22

On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307@gmail.com> wrote:
> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
> Windows) as an example for archive_command.  However, cp/copy does not sync
> the copied data to disk.  As a result, the completed WAL segments would be
> lost in the following sequence:
>
> 1. A WAL segment fills up.
>
> 2. The archiver process archives the just filled WAL segment using
> archive_command.  That is, cp/copy reads the WAL segment file from pg_xlog/
> and writes to the archive area.  At this point, the WAL file is not
> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>
> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>
> 4. The OS crashes.  The filled WAL segment doesn't exist anywhere any more.
>
> Considering the "reliable" image of PostgreSQL and widespread use in
> enterprise systems, I think something should be done.  Could you give me
> your opinions on the right direction?  Although the doc certainly escapes by
> saying "(This is an example, not a recommendation, and might not work on all
> platforms.)", it seems from pgsql-xxx MLs that many people are following
> this example.
>
> * Improve the example in the documentation.
> But what command can we use to reliably sync just one file?
>
> * Provide some command, say pg_copy, which copies a file synchronously by
> using fsync(), and describes in the doc something like "for simple use
> cases, you can use pg_copy as the standard reliable copy command."

+1.  This won't obviate the need for tools to manage replication, but
it would make it possible to get the simplest case right without
guessing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [RFC] What should we do for reliable WAL archiving?

From

Amit Kapila

Date:

17 March 2014, 04:10:20

On Sun, Mar 16, 2014 at 7:53 PM, MauMau <maumau307@gmail.com> wrote:
> From: "Amit Kapila" <amit.kapila16@gmail.com>
>
>> How about using pg_receivexlog for archiving purpose?
>
>
> pg_receivexlog is good in that it does fsync().  But it seems difficult to
> use correctly, and I'm not sure if I can catch all WAL segments without any
> loss.  pg_receivexlog must be started with postmaster and monitored with
> some measures.  This won't be very easy at least on Windows.
>
> The pg_receivexlog reference page suggests another difficulty:
>
> Notes
> When using pg_receivexlog instead of archive_command, the server will
> continue to recycle transaction log files even if the backups are not
> properly archived, since there is no command that fails. This can be worked
> around by having an archive_command that fails when the file has not been
> properly archived yet, for example:
> archive_command = 'sleep 5 && test -f /mnt/server/archivedir/%f'
>
> This suggestion is not correct, because it only checks the existence of the
> file.  What if the file size is less than 16MB?  How can we check if the
> file is completely archived?

The most probable reasons for un-successful archiving could be:
1. Disk space got full - pg_receivexlog makes sure while open/create new
segment file that the size of new file should be 16MB (open_walfile()). So due
to this reason there should not be a problem to above command.

2. Permission got denied - I think this will lead to failure of above archive
command mentioned by you.

3. n/w connection broken - This will also lead to failure of above command, but
here I think there is a possibility that it might have checked the existence of
in-complete wal file on archive location and consider it archived, but I think
wal_keep_segments can avoid this problem.

Also if you are on 9.4, then may be --slot parameter can help you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [RFC] What should we do for reliable WAL archiving?

From

Fujii Masao

Date:

17 March 2014, 12:13:00

On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307@gmail.com> wrote:
>> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
>> Windows) as an example for archive_command.  However, cp/copy does not sync
>> the copied data to disk.  As a result, the completed WAL segments would be
>> lost in the following sequence:
>>
>> 1. A WAL segment fills up.
>>
>> 2. The archiver process archives the just filled WAL segment using
>> archive_command.  That is, cp/copy reads the WAL segment file from pg_xlog/
>> and writes to the archive area.  At this point, the WAL file is not
>> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>>
>> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>>
>> 4. The OS crashes.  The filled WAL segment doesn't exist anywhere any more.
>>
>> Considering the "reliable" image of PostgreSQL and widespread use in
>> enterprise systems, I think something should be done.  Could you give me
>> your opinions on the right direction?  Although the doc certainly escapes by
>> saying "(This is an example, not a recommendation, and might not work on all
>> platforms.)", it seems from pgsql-xxx MLs that many people are following
>> this example.
>>
>> * Improve the example in the documentation.
>> But what command can we use to reliably sync just one file?
>>
>> * Provide some command, say pg_copy, which copies a file synchronously by
>> using fsync(), and describes in the doc something like "for simple use
>> cases, you can use pg_copy as the standard reliable copy command."
>
> +1.  This won't obviate the need for tools to manage replication, but
> it would make it possible to get the simplest case right without
> guessing.

+1, too.

And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
archived file after the copy? Also It might be good idea to support the direct
copy of the file to avoid wasting the file cache.

Regards,

-- 
Fujii Masao

Re: [RFC] What should we do for reliable WAL archiving?

From

Mitsumasa KONDO

Date:

17 March 2014, 13:44:44

2014-03-17 21:12 GMT+09:00 Fujii Masao <masao.fujii@gmail.com>:

On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307@gmail.com> wrote:
>> The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
>> Windows) as an example for archive_command. However, cp/copy does not sync
>> the copied data to disk. As a result, the completed WAL segments would be
>> lost in the following sequence:
>>
>> 1. A WAL segment fills up.
>>
>> 2. The archiver process archives the just filled WAL segment using
>> archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/
>> and writes to the archive area. At this point, the WAL file is not
>> persisted to the archive area yet, because cp/copy doesn't sync the writes.
>>
>> 3. The checkpoint processing removes the WAL segment file from pg_xlog/.
>>
>> 4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.
>>
>> Considering the "reliable" image of PostgreSQL and widespread use in
>> enterprise systems, I think something should be done. Could you give me
>> your opinions on the right direction? Although the doc certainly escapes by
>> saying "(This is an example, not a recommendation, and might not work on all
>> platforms.)", it seems from pgsql-xxx MLs that many people are following
>> this example.
>>
>> * Improve the example in the documentation.
>> But what command can we use to reliably sync just one file?
>>
>> * Provide some command, say pg_copy, which copies a file synchronously by
>> using fsync(), and describes in the doc something like "for simple use
>> cases, you can use pg_copy as the standard reliable copy command."
>
> +1. This won't obviate the need for tools to manage replication, but
> it would make it possible to get the simplest case right without
> guessing.

+1, too.

And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
archived file after the copy? Also It might be good idea to support the direct
copy of the file to avoid wasting the file cache.

Use direct_cp.

http://directcp.sourceforge.net/direct_cp.html

Regards,

Mitsumasa KONDO

NTT Open Source Software Center

Re: [RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

21 March 2014, 11:56:00

From: "Mitsumasa KONDO" <kondo.mitsumasa@gmail.com>
> 2014-03-17 21:12 GMT+09:00 Fujii Masao <masao.fujii@gmail.com>:
>
>> On Mon, Mar 17, 2014 at 10:20 AM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > On Sun, Mar 16, 2014 at 6:23 AM, MauMau <maumau307@gmail.com> wrote:
>> >> * Improve the example in the documentation.
>> >> But what command can we use to reliably sync just one file?
>> >>
>> >> * Provide some command, say pg_copy, which copies a file synchronously
>> by
>> >> using fsync(), and describes in the doc something like "for simple use
>> >> cases, you can use pg_copy as the standard reliable copy command."
>> >
>> > +1.  This won't obviate the need for tools to manage replication, but
>> > it would make it possible to get the simplest case right without
>> > guessing.
>>
>> +1, too.
>>
>> And, what about making pg_copy call posix_fadvise(DONT_NEED) against the
>> archived file after the copy? Also It might be good idea to support the
>> direct
>> copy of the file to avoid wasting the file cache.
>
> Use direct_cp.
> http://directcp.sourceforge.net/direct_cp.html

Thank you all for giving favorable responses and interesting ideas.
Then, I think I'll do:

* Create pg_copy in C so that it can be used on Windows as well as on 
UNIX/Linux.  It just copies one file.  Its source code is located in 
src/bin/pg_copy/.  Please recommend a better name if you have one in mind.

* Add a reference page for pg_copy in the chapter "Server applications". 
Modify the section for continuous archiving to recommend pg_copy for simple 
use cases as the standard command.

* pg_copy calls posix_fadvise(DONT_NEED) on the destination file.

* pg_copy passes O_DIRECT flag when opening the destination file 
when --directio or -d option is specified.  O_DIRECT is not used by default 
because it may not be available on some file systems, as well as it might 
cause trouble on older platforms such as RHEL4/5.  pg_copy does not use 
O_DIRECT for the source file so that it can copy the data from the 
filesystem cache, which is just written by postgres.

Could you give me your opinions before starting the work, including the 
following?

* Should I refactor the functions (copy_file, copydir, etc.) in 
src/backend/storage/file/copydir.c so that they can also be used for 
frontends?  If so, which of src/port or src/common/ is the right place to 
put copydir.c in?

* Should I complete the work before 9.4 beta so that it will be available 
starting with 9.4?  I think so because it is a basic capability to archive 
transaction logs safely (although the time may not allow me to do this).

Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

Michael Paquier

Date:

21 March 2014, 13:24:12

On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307@gmail.com> wrote:
> * Create pg_copy in C so that it can be used on Windows as well as on
> UNIX/Linux.  It just copies one file.  Its source code is located in
> src/bin/pg_copy/.  Please recommend a better name if you have one in mind.
I'd rather see that as a part of contrib/ if possible. Is there any
portion of the code you have in mind that makes mandatory putting it
in src/bin?

> * Should I complete the work before 9.4 beta so that it will be available
> starting with 9.4?  I think so because it is a basic capability to archive
> transaction logs safely (although the time may not allow me to do this).
Pursing efforts on a utility like that is worth the shot IMO (I would
use it for sure if it has reliable cross-platform support to unify
sets of replication scripts), but including it in 9.4 is out of scope.
A saner target would be the 1st commit fest of 9.5.

Regards,
-- 
Michael

Re: [RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

21 March 2014, 15:03:24

From: "Michael Paquier" <michael.paquier@gmail.com>
> On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307@gmail.com> wrote:
>> * Create pg_copy in C so that it can be used on Windows as well as on
>> UNIX/Linux.  It just copies one file.  Its source code is located in
>> src/bin/pg_copy/.  Please recommend a better name if you have one in 
>> mind.
> I'd rather see that as a part of contrib/ if possible. Is there any
> portion of the code you have in mind that makes mandatory putting it
> in src/bin?

Archiving transaction logs reliably is a basic responsibility of DBMS, so I 
think it should be treated as part of the core.  It is not a desirable 
feature but actually a mandatory one to persist transaction logs.  Even if 
it were a "better to have" feature, it can be put in the core like 
pg_basebackup and pg_isready, which are not mandatory tools.

>> * Should I complete the work before 9.4 beta so that it will be available
>> starting with 9.4?  I think so because it is a basic capability to 
>> archive
>> transaction logs safely (although the time may not allow me to do this).
> Pursing efforts on a utility like that is worth the shot IMO (I would
> use it for sure if it has reliable cross-platform support to unify
> sets of replication scripts), but including it in 9.4 is out of scope.
> A saner target would be the 1st commit fest of 9.5.

OK, I don't mind if it should be targeted at 9.4 or 9.5.  If someone wants 
it for 9.4, I try to hurry.

Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

21 March 2014, 15:03:57

From: "Michael Paquier" <michael.paquier@gmail.com>
> On Fri, Mar 21, 2014 at 8:54 PM, MauMau <maumau307@gmail.com> wrote:
>> * Create pg_copy in C so that it can be used on Windows as well as on
>> UNIX/Linux.  It just copies one file.  Its source code is located in
>> src/bin/pg_copy/.  Please recommend a better name if you have one in 
>> mind.
> I'd rather see that as a part of contrib/ if possible. Is there any
> portion of the code you have in mind that makes mandatory putting it
> in src/bin?

Archiving transaction logs reliably is a basic responsibility of DBMS, so I 
think it should be treated as part of the core.  It is not a desirable 
feature but actually a mandatory one to persist transaction logs.  Even if 
it were a "better to have" feature, it can be put in the core like 
pg_basebackup and pg_isready, which are not mandatory tools.

>> * Should I complete the work before 9.4 beta so that it will be available
>> starting with 9.4?  I think so because it is a basic capability to 
>> archive
>> transaction logs safely (although the time may not allow me to do this).
> Pursing efforts on a utility like that is worth the shot IMO (I would
> use it for sure if it has reliable cross-platform support to unify
> sets of replication scripts), but including it in 9.4 is out of scope.
> A saner target would be the 1st commit fest of 9.5.

OK, I don't mind if it should be targeted at 9.4 or 9.5.  If someone wants 
it for 9.4, I try to hurry.

Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

Jeff Janes

Date:

21 March 2014, 20:16:15

On Sun, Mar 16, 2014 at 3:23 AM, MauMau <maumau307@gmail.com> wrote:

Hello,

The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on Windows) as an example for archive_command. However, cp/copy does not sync the copied data to disk. As a result, the completed WAL segments would be lost in the following sequence:

1. A WAL segment fills up.

2. The archiver process archives the just filled WAL segment using archive_command. That is, cp/copy reads the WAL segment file from pg_xlog/ and writes to the archive area. At this point, the WAL file is not persisted to the archive area yet, because cp/copy doesn't sync the writes.

3. The checkpoint processing removes the WAL segment file from pg_xlog/.

Note that it takes two checkpoints for this to happen, at least as currently coded.

Also, if the system crashed badly enough to need media recovery, rather than just automatic crash recovery, some lost transactions are expected. Although this could silently break your PITR chain, of a crash happened and automatic recover used the copy in pg_xlog (which of course was synced) , while copy in the archive was not synced.

4. The OS crashes. The filled WAL segment doesn't exist anywhere any more.

Considering the "reliable" image of PostgreSQL and widespread use in enterprise systems, I think something should be done. Could you give me your opinions on the right direction? Although the doc certainly escapes by saying "(This is an example, not a recommendation, and might not work on all platforms.)", it seems from pgsql-xxx MLs that many people are following this example.

I use this as an example, kind of, but what I am copying to is a network mount, so any attempts to fsync it there would probably need unavailable hooks into the remote file system.

Do people really just copy the files from one directory of local storage to another directory of local storage? I don't see the point of that. But it seems like this is an area where there are hundreds of use cases, and often one doesn't see the point of other people's, making it hard to come up with good examples.

* Improve the example in the documentation.
But what command can we use to reliably sync just one file?

* Provide some command, say pg_copy, which copies a file synchronously by using fsync(), and describes in the doc something like "for simple use cases, you can use pg_copy as the standard reliable copy command."

The recommendation is to refuse to overwrite an existing file of the same name, and exit with failure. Which essentially brings archiving to a halt, because it keeps trying but it will keep failing. If we make a custom version, one thing it should do is determine if the existing archived file is just a truncated version of the attempting-to-be archived file, and if so overwrite it. Because if the first archival command fails with a network glitch, it can leave behind a partial file.

Cheers,

Jeff

Re: [RFC] What should we do for reliable WAL archiving?

From

"MauMau"

Date:

21 March 2014, 21:23:57

From: "Jeff Janes" <jeff.janes@gmail.com>
> Do people really just copy the files from one directory of local storage 
> to
> another directory of local storage?  I don't see the point of that.

It makes sense to archive WAL to a directory of local storage for media 
recovery.  Here, the local storage is a different disk drive which is 
directly attached to the database server or directly connected through SAN.


> The recommendation is to refuse to overwrite an existing file of the same
> name, and exit with failure.  Which essentially brings archiving to a 
> halt,
> because it keeps trying but it will keep failing.  If we make a custom
> version, one thing it should do is determine if the existing archived file
> is just a truncated version of the attempting-to-be archived file, and if
> so overwrite it.  Because if the first archival command fails with a
> network glitch, it can leave behind a partial file.

What I'm trying to address is just an alternative to cp/copy which fsyncs a 
file.  It just overwrites an existing file.

Yes, you're right, the failed archive attempt leaves behind a partial file 
which causes subsequent attempts to fail, if you follow the PG manual. 
That's another undesirable point in the current doc.  To overcome this, 
someone on this ML recommended me to do "cp %p /archive/dir/%f.tmp && mv 
/archive/dir/%f.tmp /archive/dir/%f".  Does this solve your problem?


Regards
MauMau

Re: [RFC] What should we do for reliable WAL archiving?

From

Bruce Momjian

Date:

21 March 2014, 23:02:30

On Fri, Mar 21, 2014 at 01:16:08PM -0700, Jeff Janes wrote:
> On Sun, Mar 16, 2014 at 3:23 AM, MauMau <maumau307@gmail.com> wrote:
> 
>     Hello,
> 
>     The PostgreSQL documentation describes cp (on UNIX/Linux) or copy (on
>     Windows) as an example for archive_command.  However, cp/copy does not sync
>     the copied data to disk.  As a result, the completed WAL segments would be
>     lost in the following sequence:
> 
>     1. A WAL segment fills up.
> 
>     2. The archiver process archives the just filled WAL segment using
>     archive_command.  That is, cp/copy reads the WAL segment file from pg_xlog/
>     and writes to the archive area.  At this point, the WAL file is not
>     persisted to the archive area yet, because cp/copy doesn't sync the writes.
> 
>     3. The checkpoint processing removes the WAL segment file from pg_xlog/.
> 
> 
> Note that it takes two checkpoints for this to happen, at least as currently
> coded.
> 
> Also, if the system crashed badly enough to need media recovery, rather than
> just automatic crash recovery, some lost transactions are expected.  Although
> this could silently break your PITR chain, of a crash happened and automatic
> recover used the copy in pg_xlog (which of course was synced) , while copy in
> the archive was not synced.

That is one good reason to keep checkpoint_warning=30, so the typical
file system sync that happens every 30 seconds warns that those files
might not on permanent storage.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: [RFC] What should we do for reliable WAL archiving?

From

Martijn van Oosterhout

Date:

22 March 2014, 09:21:18

On Sat, Mar 22, 2014 at 06:22:37AM +0900, MauMau wrote:
> From: "Jeff Janes" <jeff.janes@gmail.com>
> >Do people really just copy the files from one directory of local
> >storage to
> >another directory of local storage?  I don't see the point of that.
>
> It makes sense to archive WAL to a directory of local storage for
> media recovery.  Here, the local storage is a different disk drive
> which is directly attached to the database server or directly
> connected through SAN.

I'm one of those peope. They are archived into a local directory in
preparation for an rsync over ssh.

> >The recommendation is to refuse to overwrite an existing file of the same
> >name, and exit with failure.  Which essentially brings archiving
> >to a halt,
> >because it keeps trying but it will keep failing.  If we make a custom
> >version, one thing it should do is determine if the existing archived file
> >is just a truncated version of the attempting-to-be archived file, and if
> >so overwrite it.  Because if the first archival command fails with a
> >network glitch, it can leave behind a partial file.
>
> What I'm trying to address is just an alternative to cp/copy which
> fsyncs a file.  It just overwrites an existing file.

I ran into a related problem with cp, where halfway the copy the disk
was full and I was left with half a WAL file. This caused the rsync to
copy only half a file and the replication broke. This is clearly a
recoverable situation, but it didn't recover in this case.

> Yes, you're right, the failed archive attempt leaves behind a
> partial file which causes subsequent attempts to fail, if you follow
> the PG manual. That's another undesirable point in the current doc.
> To overcome this, someone on this ML recommended me to do "cp %p
> /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f".
> Does this solve your problem?

This would probably have handled it, but I find it odd that there's
program to handle restoring of archives properly, but on the archiving
side you have to cobble together your own shell scripts which fail in
various corner cases.

I'd love a program that just Did The Right Thing.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: [RFC] What should we do for reliable WAL archiving?

From

Jeff Janes

Date:

29 March 2014, 21:38:13

On Fri, Mar 21, 2014 at 2:22 PM, MauMau <maumau307@gmail.com> wrote:

From: "Jeff Janes" <jeff.janes@gmail.com>

Do people really just copy the files from one directory of local storage to
another directory of local storage? I don't see the point of that.

It makes sense to archive WAL to a directory of local storage for media recovery. Here, the local storage is a different disk drive which is directly attached to the database server or directly connected through SAN.

For a SAN I guess we have different meanings of "local" :)

(I have no doubt yours is correct--the fine art of IT terminology is not my thing.)

The recommendation is to refuse to overwrite an existing file of the same
name, and exit with failure. Which essentially brings archiving to a halt,
because it keeps trying but it will keep failing. If we make a custom
version, one thing it should do is determine if the existing archived file
is just a truncated version of the attempting-to-be archived file, and if
so overwrite it. Because if the first archival command fails with a
network glitch, it can leave behind a partial file.

What I'm trying to address is just an alternative to cp/copy which fsyncs a file. It just overwrites an existing file.

Yes, you're right, the failed archive attempt leaves behind a partial file which causes subsequent attempts to fail, if you follow the PG manual. That's another undesirable point in the current doc. To overcome this, someone on this ML recommended me to do "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". Does this solve your problem?

As written is doesn't solve it, as it just unconditionally overwrites the file. If you wanted that you could just do the single-statement unconditional overwrite.

You could make it so that the .tmp gets overwritten unconditionally, but the move of it will not overwrite an existing permanent file. That would solve the problem where a glitch in the network leaves in incomplete file behind that blocks the next attempt, *except* that mv on (at least some) network file systems is really a copy, and not an atomic rename, so is still subject to leaving behind incomplete crud.

But, it is hard to tell what the real solution is, because the doc doesn't explain why it should refuse (and fail) to overwrite an existing file. The only reason I can think of to make that recommendation is because it is easy to accidentally configure two clusters to attempt to archive to the same location, and having them overwrite each others files should be guarded against. If I am right, it seems like this reason should be added to the docs, so people know what they are defending against. And if I am wrong, it seems even more important that the (correct) reason is added to the docs.

Cheers,

Jeff

Re: [RFC] What should we do for reliable WAL archiving?

From

Tom Lane

Date:

29 March 2014, 21:44:19

Jeff Janes <jeff.janes@gmail.com> writes:
> But, it is hard to tell what the real solution is, because the doc doesn't
> explain why it should refuse (and fail) to overwrite an existing file.  The
> only reason I can think of to make that recommendation is because it is
> easy to accidentally configure two clusters to attempt to archive to the
> same location, and having them overwrite each others files should be
> guarded against.  If I am right, it seems like this reason should be added
> to the docs, so people know what they are defending against.  And if I am
> wrong, it seems even more important that the (correct) reason is added to
> the docs.

If memory serves, that is the reason ... and I thought it *was* explained
somewhere in the docs.
        regards, tom lane

Re: [RFC] What should we do for reliable WAL archiving?

From

Jeff Janes

Date:

29 March 2014, 23:10:16

On Saturday, March 29, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:
> But, it is hard to tell what the real solution is, because the doc doesn't
> explain why it should refuse (and fail) to overwrite an existing file. The
> only reason I can think of to make that recommendation is because it is
> easy to accidentally configure two clusters to attempt to archive to the
> same location, and having them overwrite each others files should be
> guarded against. If I am right, it seems like this reason should be added
> to the docs, so people know what they are defending against. And if I am
> wrong, it seems even more important that the (correct) reason is added to
> the docs.

If memory serves, that is the reason ... and I thought it *was* explained
somewhere in the docs.

You are right, and it has been there for a decade. I don't know how I missed that the last several times I read it. I remember clearly the paragraph below it, just not that one.

Sorry,

Jeff