Thread: Basic question on recovery and disk snapshotting

Basic question on recovery and disk snapshotting

From: Yang Zhang
We're running on EBS volumes on EC2.  We're interested in leveraging
EBS snapshotting for backups.  However, does this mean we'd need to
ensure our pg_xlog is on the same EBS volume as our data?

(I believe) the usual reasoning for separating pg_xlog onto a separate
volume is for performance.  However, if they are on different volumes,
the snapshots may be out of sync.

Thanks.


Re: Basic question on recovery and disk snapshotting

From: Jov
Are you sure the EBS snapshot is consistent? If the snapshot is not consistent, even on the same volume, you will have problems with your backup.

One method you can try is running pg_start_backup() before taking the snapshot.
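
For illustration, a minimal sketch of that idea (Python, using psycopg2 and boto3; the connection string, volume ID, and backup label are placeholders, not anything from this thread):

# Hypothetical sketch: bracket an EBS snapshot with an exclusive base backup.
import boto3
import psycopg2

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_start_backup(%s)", ("ebs-snapshot",))  # mark backup start
try:
    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="pg base backup")
    print("snapshot started:", snap["SnapshotId"])
finally:
    cur.execute("SELECT pg_stop_backup()")  # always close the backup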




2013/4/27 Yang Zhang <yanghatespam@gmail.com>
> We're running on EBS volumes on EC2.  We're interested in leveraging
> EBS snapshotting for backups.  However, does this mean we'd need to
> ensure our pg_xlog is on the same EBS volume as our data?
>
> (I believe) the usual reasoning for separating pg_xlog onto a separate
> volume is for performance.  However, if they are on different volumes,
> the snapshots may be out of sync.
>
> Thanks.






--
Jov

Re: Basic question on recovery and disk snapshotting

From: Yang Zhang
On Sat, Apr 27, 2013 at 4:25 AM, Jov <amutu@amutu.com> wrote:
> Are you sure the EBS snapshot is consistent? If the snapshot is not
> consistent, even on the same volume, you will have problems with your backup.

I think so.  EBS gives you "point-in-time consistent snapshots"
(https://aws.amazon.com/ebs/), but maybe you're using the term
differently.

Even so, it's impossible to take snapshots of two different volumes at
exactly the same time so they won't be consistent with each other,
hence my question.

My question really boils down to: if we're interested in using COW
snapshotting (a common feature of modern filesystems and hosting
environments), would we necessarily need to ensure the data and
pg_xlog are on the same snapshotted volume?  If not, how should we be
taking the snapshots - should we be using pg_start_backup() and then
taking the snapshot of one before the other?  (What order?)  What if
we have tablespaces, do we take snapshots of those, followed by the
cluster directory, followed by pg_xlog?

I read through http://www.postgresql.org/docs/9.1/static/continuous-archiving.html
and it doesn't touch on these questions.

>
> One method you can try is running pg_start_backup() before taking the snapshot.
>
>
>
>
> 2013/4/27 Yang Zhang <yanghatespam@gmail.com>
>>
>> We're running on EBS volumes on EC2.  We're interested in leveraging
>> EBS snapshotting for backups.  However, does this mean we'd need to
>> ensure our pg_xlog is on the same EBS volume as our data?
>>
>> (I believe) the usual reasoning for separating pg_xlog onto a separate
>> volume is for performance.  However, if they are on different volumes,
>> the snapshots may be out of sync.
>>
>> Thanks.
>>
>>
>>
>
>
>
> --
> Jov
> blog: http:amutu.com/blog



--
Yang Zhang
http://yz.mit.edu/


Re: Basic question on recovery and disk snapshotting

From: Tom Lane
Yang Zhang <yanghatespam@gmail.com> writes:
> My question really boils down to: if we're interested in using COW
> snapshotting (a common feature of modern filesystems and hosting
> environments), would we necessarily need to ensure the data and
> pg_xlog are on the same snapshotted volume?

Yeah, I think so.  It's possible to imagine schemes that would let
a WAL-snapshot-shortly-after-the-data-snapshot work, but you would
(at least) need to disable WAL file recycling, which there's no
provision for at the moment.

The usual approach is to use a COW snapshot only for making a base
backup of the data area, and rely on WAL streaming/archiving to copy
the WAL.
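
A sketch of what the archiving half could look like: a hypothetical archive_command helper in Python (the archive directory and install path are placeholders; wal_level must be at least 'archive' for this to be useful):

#!/usr/bin/env python
# Hypothetical helper, wired up in postgresql.conf roughly as:
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.py "%p" "%f"'
# PostgreSQL substitutes %p (path of the finished WAL segment) and %f (its name).
import os
import shutil
import sys

ARCHIVE_DIR = "/mnt/wal_archive"  # placeholder destination

def main():
    src, name = sys.argv[1], sys.argv[2]
    dst = os.path.join(ARCHIVE_DIR, name)
    if os.path.exists(dst):
        sys.exit(1)               # never overwrite an already-archived segment
    shutil.copy2(src, dst + ".tmp")
    os.rename(dst + ".tmp", dst)  # copy then rename so readers never see a partial file
    sys.exit(0)                   # zero exit status tells PostgreSQL the segment is safe

if __name__ == "__main__":
    main()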

            regards, tom lane


Re: Basic question on recovery and disk snapshotting

From: Jeff Janes
On Sat, Apr 27, 2013 at 10:40 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Sat, Apr 27, 2013 at 4:25 AM, Jov <amutu@amutu.com> wrote:
>> Are you sure the EBS snapshot is consistent? If the snapshot is not
>> consistent, even on the same volume, you will have problems with your backup.
>
> I think so.  EBS gives you "point-in-time consistent snapshots"
> (https://aws.amazon.com/ebs/), but maybe you're using the term
> differently.

I would not trust any data that I care about based on the description on that page.  They mention "consistent" once and "atomic" not at all.  Which is not to say it won't work.

This thread seems to indicate the file system on top of the EBS volume would need to be quiescent in order for the snapshot to be reliable:


Although I don't think that strength of consistency would actually be needed.  As long as the snapshot reflects all writes that were, at the time the snapshot was initiated, reported back to PG as being successfully synced, and contains no writes which were done after the snapshot was reported as complete, that should be consistent enough for PG.  Unless the file system itself got scrambled.

 
> Even so, it's impossible to take snapshots of two different volumes at
> exactly the same time so they won't be consistent with each other,
> hence my question.
>
> My question really boils down to: if we're interested in using COW
> snapshotting (a common feature of modern filesystems and hosting
> environments), would we necessarily need to ensure the data and
> pg_xlog are on the same snapshotted volume?

That would certainly make it easier.  But it shouldn't be necessary, as long as the xlog snapshot is taken after the cluster snapshot, and also as long as no xlog files which were written to after the last completed checkpoint prior to the cluster snapshot got recycled before the xlog snapshot.   As long as the snapshots run quickly and promptly one after the other, this should not be a problem, but you should certainly validate that a snapshot collection has all the xlogs it needs before accepting it as being good.  If you find some necessary xlog files are missing, you can turn up wal_keep_segments and try again.

 
> If not, how should we be
> taking the snapshots - should we be using pg_start_backup() and then
> taking the snapshot of one before the other?  (What order?)  What if
> we have tablespaces, do we take snapshots of those, followed by the
> cluster directory, followed by pg_xlog?

First the cluster directory (where "pg_control" is), then tablespaces, then pg_xlog.  pg_start_backup() shouldn't be necessary, unless you are running with full_page_writes off.  But it won't hurt, and if you don't use pg_start_backup you should probably run a checkpoint of your own immediately before starting.
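
A rough sketch of that ordering, for concreteness (hypothetical volume IDs; CHECKPOINT is issued first since pg_start_backup() is being skipped):

import boto3
import psycopg2

# Snapshot order per the advice above: cluster directory (pg_control) first,
# then tablespaces, then pg_xlog.  Volume IDs are placeholders.
VOLUMES_IN_ORDER = [
    ("cluster directory", "vol-aaaa1111"),
    ("tablespace ts1",    "vol-bbbb2222"),
    ("pg_xlog",           "vol-cccc3333"),
]

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
conn.cursor().execute("CHECKPOINT")  # instead of pg_start_backup()

ec2 = boto3.client("ec2")
for label, vol in VOLUMES_IN_ORDER:
    snap = ec2.create_snapshot(VolumeId=vol, Description="pg backup: " + label)
    print(label, "->", snap["SnapshotId"])

After restoring, verify that every WAL segment from that checkpoint's REDO location onward made it into the pg_xlog snapshot before trusting the backup.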


> I read through http://www.postgresql.org/docs/9.1/static/continuous-archiving.html
> and it doesn't touch on these questions.

Your goal seems to be to *avoid* continuous archiving, so I wouldn't expect that part of the docs to touch on your issues.   But see the section "Standalone Hot Backups" which would allow you to use snapshots for the cluster "copy" part, and normal archiving for just the xlogs.  The volume of pg_xlog should be fairly small, so this seems to me like an attractive option.

If you really don't want to use archiving, even just during the duration of the cluster snapshotting, then this is the part that addresses your questions:

http://www.postgresql.org/docs/9.1/static/backup-file.html

Cheers,

Jeff

Re: Basic question on recovery and disk snapshotting

From: Yang Zhang
On Sat, Apr 27, 2013 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Apr 27, 2013 at 10:40 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>> My question really boils down to: if we're interested in using COW
>> snapshotting (a common feature of modern filesystems and hosting
>> environments), would we necessarily need to ensure the data and
>> pg_xlog are on the same snapshotted volume?
>
>
> That would certainly make it easier.  But it shouldn't be necessary, as long
> as the xlog snapshot is taken after the cluster snapshot, and also as long
> as no xlog files which were written to after the last completed checkpoint
> prior to the cluster snapshot got recycled before the xlog snapshot.   As
> long as the snapshots run quickly and promptly one after the other, this
> should not be a problem, but you should certainly validate that a snapshot
> collection has all the xlogs it needs before accepting it as being good.  If
> you find some necessary xlog files are missing, you can turn up
> wal_keep_segments and try again.

This information is gold, thank you.

How do I validate that a snapshot collection has all the xlogs it needs?

>
>
>>
>>  If not, how should we be
>> taking the snapshots - should we be using pg_start_backup() and then
>> taking the snapshot of one before the other?  (What order?)  What if
>> we have tablespaces, do we take snapshots of those, followed by the
>> cluster directory, followed by pg_xlog?
>
>
> First the cluster directory (where "pg_control" is), then tablespaces, then
> pg_xlog.  pg_start_backup() shouldn't be necessary, unless you are running
> with full_page_writes off.  But it won't hurt, and if you don't use
> pg_start_backup you should probably run a checkpoint of your own immediately
> before starting.
>
>>
>> I read through
>> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html
>> and it doesn't touch on these questions.
>
>
> Your goal seems to be to *avoid* continuous archiving, so I wouldn't expect
> that part of the docs to touch on your issues.   But see the section
> "Standalone Hot Backups" which would allow you to use snapshots for the
> cluster "copy" part, and normal archiving for just the xlogs.  The volume of
> pg_xlog should be fairly small, so this seems to me like an attractive
> option.

Just to validate my understanding, are the two options as follows?

a. Checkpoint (optional but helps with time window?), snapshot
tablespaces/cluster/xlog, validate all necessary xlogs present.

b. Set wal_level/archive_mode/archive_command, pg_start_backup,
snapshot tablespaces/cluster, pg_stop_backup to archive xlog.

(a) sounds more appealing since it's treating recovery as crash
recovery rather than backup restore, and as such seems simpler and
lower-overhead (e.g. WAL verbosity, though I don't know how much that
overhead is).  However, I'm not sure how complex that validation step
is.

>
> If you really don't want to use archiving, even just during the duration of
> the cluster snapshotting, then this is the part that addresses your
> questions:
>
> http://www.postgresql.org/docs/9.1/static/backup-file.html

I'm still interested in online backups, though - stopping the DB is a
no-go unfortunately.

--
Yang Zhang
http://yz.mit.edu/


Re: Basic question on recovery and disk snapshotting

From: Ben Chobot
On Apr 27, 2013, at 10:40 AM, Yang Zhang wrote:

> My question really boils down to: if we're interested in using COW
> snapshotting (a common feature of modern filesystems and hosting
> environments), would we necessarily need to ensure the data and
> pg_xlog are on the same snapshotted volume?  If not, how should we be
> taking the snapshots - should we be using pg_start_backup() and then
> taking the snapshot of one before the other?  (What order?)  What if
> we have tablespaces, do we take snapshots of those, followed by the
> cluster directory, followed by pg_xlog?

We do this, using XFS to take advantage of being able to freeze the filesystem (because we're also using software RAID). The process looks like:

1. pg_start_backup()
2. xfs_freeze both the data and xlog filesystems.
3. snapshot all volumes.
4. unfreeze
5. stop backup
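
A sketch of that sequence (hypothetical mount points and volume IDs; xfs_freeze -f freezes a filesystem, -u thaws it):

import subprocess

import boto3
import psycopg2

DATA_FS, XLOG_FS = "/pg/data", "/pg/xlog"    # placeholder mount points
VOLUMES = ["vol-aaaa1111", "vol-bbbb2222"]   # placeholder EBS volumes

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_start_backup(%s)", ("frozen-snapshot",))    # 1. start backup
try:
    subprocess.check_call(["xfs_freeze", "-f", DATA_FS])           # 2. freeze data fs
    subprocess.check_call(["xfs_freeze", "-f", XLOG_FS])           #    and xlog fs
    try:
        ec2 = boto3.client("ec2")
        for vol in VOLUMES:                                        # 3. snapshot all volumes
            ec2.create_snapshot(VolumeId=vol, Description="frozen pg backup")
    finally:
        subprocess.check_call(["xfs_freeze", "-u", XLOG_FS])       # 4. unfreeze promptly
        subprocess.check_call(["xfs_freeze", "-u", DATA_FS])
finally:
    cur.execute("SELECT pg_stop_backup()")                         # 5. stop backup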

Re: Basic question on recovery and disk snapshotting

From: Jeff Janes
On Saturday, April 27, 2013, Yang Zhang wrote:
> On Sat, Apr 27, 2013 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Sat, Apr 27, 2013 at 10:40 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>>> My question really boils down to: if we're interested in using COW
>>> snapshotting (a common feature of modern filesystems and hosting
>>> environments), would we necessarily need to ensure the data and
>>> pg_xlog are on the same snapshotted volume?
>>
>> That would certainly make it easier.  But it shouldn't be necessary, as long
>> as the xlog snapshot is taken after the cluster snapshot, and also as long
>> as no xlog files which were written to after the last completed checkpoint
>> prior to the cluster snapshot got recycled before the xlog snapshot.   As
>> long as the snapshots run quickly and promptly one after the other, this
>> should not be a problem, but you should certainly validate that a snapshot
>> collection has all the xlogs it needs before accepting it as being good.  If
>> you find some necessary xlog files are missing, you can turn up
>> wal_keep_segments and try again.
>
> This information is gold, thank you.
>
> How do I validate that a snapshot collection has all the xlogs it needs?


I've always validated my backups by practicing restoring them.  It seems like the most rigorous way, and I figure I need the practice.  If that isn't feasible, the backup_label file created by pg_start_backup() will tell you by name which xlog is the first one you need.  If you didn't use pg_start_backup(), you can use pg_controldata to figure that out based on "Latest checkpoint's REDO location", keeping in mind that the part after the / is not zero-padded on the left, so if it is "short" you have to take that into account.  I have no actual experience in doing this in practice.
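
As an illustration of the pg_controldata route, a hypothetical sketch (it assumes the default 16 MB WAL segment size and a placeholder data directory, and parses the REDO location as integers so the missing zero padding doesn't matter):

import re
import subprocess

WAL_SEG_SIZE = 16 * 1024 * 1024   # default segment size

out = subprocess.check_output(["pg_controldata", "/pg/data"]).decode()
tli = int(re.search(r"Latest checkpoint's TimeLineID:\s+(\d+)", out).group(1))
m = re.search(r"Latest checkpoint's REDO location:\s+([0-9A-Fa-f]+)/([0-9A-Fa-f]+)", out)
xlogid, offset = int(m.group(1), 16), int(m.group(2), 16)

# WAL segment file names are timeline + xlogid + segment-within-xlogid,
# eight hex digits each.
first_needed = "%08X%08X%08X" % (tli, xlogid, offset // WAL_SEG_SIZE)
print("first WAL segment the snapshot set must contain:", first_needed)

Every segment from that one up to the newest must be present in the pg_xlog snapshot for the backup to be recoverable.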

 
>>
>> Your goal seems to be to *avoid* continuous archiving, so I wouldn't expect
>> that part of the docs to touch on your issues.   But see the section
>> "Standalone Hot Backups" which would allow you to use snapshots for the
>> cluster "copy" part, and normal archiving for just the xlogs.  The volume of
>> pg_xlog should be fairly small, so this seems to me like an attractive
>> option.

> Just to validate my understanding, are the two options as follows?
>
> a. Checkpoint (optional but helps with time window?), snapshot
> tablespaces/cluster/xlog, validate all necessary xlogs present.

Yes.  And doing the checkpoint immediately before has two good effects: it makes the recovery faster, and it maximizes the time until the WAL logs you need start to be recycled.
 

> b. Set wal_level/archive_mode/archive_command, pg_start_backup,
> snapshot tablespaces/cluster, pg_stop_backup to archive xlog.
>
> (a) sounds more appealing since it's treating recovery as crash
> recovery rather than backup restore, and as such seems simpler and
> lower-overhead (e.g. WAL verbosity, though I don't know how much that
> overhead is).

That brings up another point to consider.  If wal_level is minimal, then tables which you bulk load in the same transaction as you created them or truncated them will not get any WAL records written.  (That is the main reason the WAL verbosity is reduced).  But that also means that if any of those operations is happening while you are taking your snapshot, those operations will be corrupted.  If the data and xlogs were part of the same atomic snapshot, this would not be a problem, as either the operation completed, or it never happened.  But with different snapshots, the data can get partially but not completely into the data-snapshot, but then the xlog record which says the data was completely written does get into the xlog snapshot.

If the system is very busy with different people doing different operations at the same time, it could be hard to coordinate a time to take the backup.
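
For concreteness, this is the kind of statement pattern being described (a hypothetical example; with wal_level = minimal, the COPY below skips WAL for the loaded rows because the table was created in the same transaction, so it is exactly the case that a non-atomic pair of snapshots can break):

import psycopg2

conn = psycopg2.connect("dbname=postgres")
cur = conn.cursor()
# CREATE TABLE and the bulk load happen in one transaction, so under
# wal_level = minimal the loaded rows go straight to the table files
# and are not written to WAL.
cur.execute("CREATE TABLE staging (id int, payload text)")
with open("load.csv") as f:
    cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)
conn.commit()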

 
> However, I'm not sure how complex that validation step
> is.


The more I think about it, the more I doubt it is worth it.  It would be very easy to come up with a method that sort of works, some of the time, and then blows up spectacularly for no apparent reason.  Using the archiving, at least over the period during which the backup is being taken, is the truly supported way of doing it.  You can still use a snapshot of the data directory, but you are really just treating it as a fast copy operation.  That it is atomic is not important to the process.
 
>>
>> If you really don't want to use archiving, even just during the duration of
>> the cluster snapshotting, then this is the part that addresses your
>> questions:
>>
>> http://www.postgresql.org/docs/9.1/static/backup-file.html
>
> I'm still interested in online backups, though - stopping the DB is a
> no-go unfortunately.

I wonder if initiating an EBS snapshot of a large volume isn't going to freeze the database up for a while anyway, just by starving it of IO.  It would be interesting to hear back on your experiences.

Cheers,

Jeff
 

Re: Basic question on recovery and disk snapshotting

From: Yang Zhang
On Wed, May 1, 2013 at 4:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> That brings up another point to consider.  If wal_level is minimal, then
> tables which you bulk load in the same transaction as you created them or
> truncated them will not get any WAL records written.  (That is the main
> reason the WAL verbosity is reduced).  But that also means that if any of
> those operations is happening while you are taking your snapshot, those
> operations will be corrupted.  If the data and xlogs were part of the same
> atomic snapshot, this would not be a problem, as either the operation
> completed, or it never happened.  But with different snapshots, the data can
> get partially but not completely into the data-snapshot, but then the xlog
> record which says the data was completely written does get into the xlog
> snapshot.

Come to think of it, I'm no longer sure that EBS snapshots, which are
on the block device level, are OK, even if all your data is on a
single volume, since base backups (as documented) are supposed to be
taken via the FS (e.g. normal read operations, or even FS snapshots).
Block device level copies are not mentioned.

Can anyone confirm or refute?