Thread: LVM snapshots

LVM snapshots

From
"Matt Clark"
Date:
Has anyone tried taking an LVM snapshot of a running DB?  And then reverting to the snapshot?  I presume one would need to take a low level backup (i.e. using dd, not tar or cp) of the snapshot to preserve the precise FS layout, but are there any other issues?
 
I'm thinking of trying this slightly loony experiment on Postgres v 7.2.3 running on RedHat 7.3, with ext3 as the filesystem.  The only 'gotchas' I can think of at the moment are:
 
1.    the need to preserve the FS layout precisely
2.    the minor pauses that occur while LVM freezes the FS to bring the snapshot online and offline.
3.    the need to have the data directory and the pg_xlog directory on the same logical volume so they get snapshotted together
4.    the fact that the on-disk data is much larger than the output of pg_dump, so backing up the snapshot will require quite a bit of IO
 
Any others I've missed?  If it works then it'd be a fairly neat way to keep a recent consistent copy of the DB around without having to run pg_dump every 10 minutes, which chews CPU like nobody's business.
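Concretely, the cycle I have in mind would look something like this (volume group, LV, mount and backup path names are all invented, and the flags are from LVM1-era tools, so may differ):

```shell
# Snapshot the LV holding $PGDATA (pg_xlog lives on the same LV, so it is
# captured in the same instant), take a low-level copy, drop the snapshot.
lvcreate --snapshot --size 2G --name pgsnap /dev/vg0/pgdata
dd if=/dev/vg0/pgsnap of=/backup/pgdata.img bs=1M   # block copy preserves FS layout
lvremove -f /dev/vg0/pgsnap
```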
 
Thanks in advance for any comments.
 
Matt
 

Matt Clark
Ymogen Ltd
matt@ymogen.net
corp.ymogen.net

 

Re: LVM snapshots

From
"David F. Skoll"
Date:
On Fri, 14 Mar 2003, Matt Clark wrote:

> Has anyone tried taking an LVM snapshot of a running DB?

I don't think there's a guarantee that a snapshot of the file system
corresponds to a consistent database, even if the snapshot corresponds
to a single point-in-time.

--
David.

Re: LVM snapshots

From
"Trewern, Ben"
Date:

I don't see why not.  What happens if your server crashes?  Hopefully what's on the disk is consistent (one of the good reasons to use Postgresql).

BTW I've heard of someone else doing something like this and using rsync.  He was running it just before the snapshot and then again against the snapshot.  This should then reduce the amount of time the actual snapshot needs to be kept.
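Sketched as commands, the two-pass idea might look like this (hypothetical host, volume and path names): the first pass does the bulk of the copying against the live volume, so the pass over the snapshot is quick and the snapshot can be dropped sooner.

```shell
# Pass 1: rough copy from the live data directory (may be inconsistent).
rsync -a /var/lib/pgsql/data/ backuphost:/backup/pgdata/
# Freeze a consistent image, then let pass 2 fix up the differences.
lvcreate --snapshot --size 1G --name pgsnap /dev/vg0/pgdata
mount -o ro /dev/vg0/pgsnap /mnt/pgsnap
rsync -a --delete /mnt/pgsnap/ backuphost:/backup/pgdata/
umount /mnt/pgsnap
lvremove -f /dev/vg0/pgsnap   # snapshot only lived for the second pass
```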

Regards,

Ben


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

*****************************************************************************
This email and any attachments transmitted with it are confidential
and intended solely for the use of the individual or entity to whom
they are addressed. If you have received this email in error please
notify the sender and do not store, copy or disclose the content
to any other person.

It is the responsibility of the recipient to ensure that opening this
message and/or any of its attachments will not adversely affect
its systems. No responsibility is accepted by the Company.
*****************************************************************************

Re: LVM snapshots

From
"Matt Clark"
Date:
That's what I thought.  The docs are a bit coy on this point.  Section 9.2 says: "The database server must be shut down in order to get a usable backup. Half-way measures such as disallowing all connections will not work as there is always some buffering going on. For this reason it is also not advisable to trust file systems that claim to support 'consistent snapshots'."
 
But LVM calls fsync_dev_lockfs() before the snapshot is taken, and unlockfs() afterwards, so there are guaranteed to be no dirty buffers lying around - the tradeoff is that calls to write() will block until all the buffers are written to disk and the snapshot set up, but that shouldn't take long.
 
BTW I see I was wrong in my original question about tar &c being unsuitable - I don't know where I got that idea from.  So taking a snapshot every 10 mins and doing an rsync should be pretty cheap in terms of write IO, though it will increase the CPU demands above those for a simple copy, and doesn't save anything on read IO.
 
Is there some other interpretation of Section 9.2 of the docs that means the reasoning on fsync_dev_lockfs above is wrong or inadequate?
 
 

Re: LVM snapshots

From
"Matt Clark"
Date:
I *think* it is guaranteed with WAL, otherwise the system could fail to recover after a system crash.  At least that's how I read
section 11.1.1 of the docs:

http://www.postgresql.org/docs/view.php?version=7.2&idoc=1&file=wal.html

Could freezing of the filesystem result in the WAL being fsynced at an 'inappropriate' time?  Is there ever an inappropriate time to
fsync the WAL?






Re: LVM snapshots

From
Tom Lane
Date:
"Matt Clark" <matt@ymogen.net> writes:
> Has anyone tried taking an LVM snapshot of a running DB?  And then reverting
> to the snapshot?  I presume one would need to take a low level backup (i.e.
> using dd, not tar or cp) of the snapshot to preserve the precise FS layout,
> but are there any other issues?

This was discussed only a couple weeks ago (in the context of a similar
facility in NetApp servers, but I believe LVM was mentioned).  See the
archives.

            regards, tom lane

Re: LVM snapshots

From
"Matt Clark"
Date:
A search for 'netapp' or 'lvm' doesn't seem to throw anything relevant up on any of the lists, and I can't *see* any likely threads
in this group for Jan, Feb or March.  I'm probably just being blind - can anyone recall the Subject: of the thread?

Cheers

Matt





Re: LVM snapshots

From
"Trewern, Ben"
Date:

Try - [GENERAL] A few questions to real pgsql gurus

Ben


Re: LVM snapshots

From
Larry Rosenman
Date:
Here is the subject from my post on 2/24/03:

Re: [GENERAL] A few questions to real pgsql gurus



--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749




Re: LVM snapshots

From
"Matt Clark"
Date:
Thanks all.
 
The conclusion there seemed to be that it ought to work just fine, but should be tested.  Well, I'll test it and see if anything interesting comes up.  If anything, LVM snapshots will be less tricky than NetApp snapshots, as LVM has access to the OS's local cache buffers, whereas the NetApp doesn't (though presumably the OS shouldn't/doesn't do any write caching when writing to a network FS).
 
 
 
 

Re: LVM snapshots

From
Tom Lane
Date:
"Matt Clark" <matt@ymogen.net> writes:
> Could freezing of the filesystem result in the WAL being fsynced at an
> 'inappropriate' time?  Is there ever an inappropriate time to fsync
> the WAL?

If there were it'd be a bug ...

            regards, tom lane

Re: LVM snapshots

From
Murthy Kambhampaty
Date:
Here's the script I use to mount a snapshot and rsync to a remote server;
postgresql is started and stopped on the remote server as part of the backup
procedure to make sure that the backup log contains any errors that might
show up. The restore strategy is to start the postmaster on the remote
server and dump restore like so:

pg_dump -Fc -h <backup-server> -U <user> -d <database> -t <table> | pg_restore
-h <production-server> -U <user> -d <database> -t <table> [-c]

Note that this is not particularly safe with 7.3, since pg_restore will try
to restore every table in the database with the given name. If you use the
schema feature that allows identically named tables in different schemas,
you're in for trouble when you recover it.

Backup restore has been tested so far by restoring a table and comparing the
contents; they were identical. Thanks to rsync, performance is fantastic
compared to pg_dump to a file (my past e-mail to the list will give some idea
of the problems we've had with that). I'd be interested in comments on the
strategy and the script itself.

Cheers,
    Murthy




Attachment

Re: LVM snapshots

From
Matt
Date:
Murthy Kambhampaty wrote:

>Backup restore has been tested so far by restoring a table and comparing the
>contents; they were identical. Thanks to rsync performance is fantastic
>compared to pgdump to a file (my past e-mail to the list will give some idea
>of the problems we've had with that). I'd be interested in comments on the
>strategy and the script itself.
>
>
>
Many thanks for the detailed info!

As I see it the discussion so far says:
1.   The strategy should theoretically work (coy comments in the docs
notwithstanding)
2.   It works for you in practice
3.   It works very efficiently when using rsync

The only potential missing piece that I can see is some thorough
stress-testing, such as running say 50 updates/second against a
multi-table DB while the snapshotting and backup takes place, then
testing the snapshotted DB.  Is the DB you have been backing up under
significant load when you take the snapshot?

If the theory can be robustly demonstrated to work in practice (which
you may have already done) then I say this strategy needs to be
recognised in the docs (or at least the techdocs) as a valid and useful one.




Re: LVM snapshots

From
Peter Eisentraut
Date:
Matt writes:

> If the theory can be robustly demonstrated to work in practice (which
> you may have already done) then I say this strategy needs to be
> recognised in the docs (or at least the techdocs) as a valid and useful one.

The note in the documentation is rather old.  If you can conclusively
disprove it, then we'll be happy to alter it.

--
Peter Eisentraut   peter_e@gmx.net


Re: LVM snapshots

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Matt writes:
>> If the theory can be robustly demonstrated to work in practice (which
>> you may have already done) then I say this strategy needs to be
>> recognised in the docs (or at least the techdocs) as a valid and useful one.

> The note in the documentation is rather old.  If you can conclusively
> disprove it, then we'll be happy to alter it.

More to the point, the note is intended to discourage people from using
simple methods like tar or rsync duplication, which will most definitely
not work.

A snapshot implemented with help from something that sits between
Postgres and the disk theoretically could work.  Being conservative
database weenies, we'd like to see some proof that it actually works
in practical cases before we endorse it ;-).

BTW: aside from the question of the integrity of the snapshot backup,
I'd be inclined to ask hard questions about the effects on database
performance from using the disk-freezing feature in the first place.
What is the LVM layer doing with the writes that occur while the
frozen snapshot is maintained??  If they're not getting to disk at
all, then "commits" that occur during the freeze aren't really commits
(and one can be pardoned for wondering whether write ordering
constraints will be honored when the floodgate is reopened).  If they
are getting to disk, this seems to imply data journaling is being used,
which is probably a net loss for overall database performance.  You
might be willing to pay that cost to be able to use this backup
technique --- but you should find out what that cost is ...

            regards, tom lane

Re: LVM snapshots

From
Ragnar Kjørstad
Date:
On Sun, Mar 16, 2003 at 12:50:58PM -0500, Tom Lane wrote:
> BTW: aside from the question of the integrity of the snapshot backup,
> I'd be inclined to ask hard questions about the effects on database
> performance from using the disk-freezing feature in the first place.
> What is the LVM layer doing with the writes that occur while the
> frozen snapshot is maintained??  If they're not getting to disk at
> all, then "commits" that occur during the freeze aren't really commits
> (and one can be pardoned for wondering whether write ordering
> constraints will be honored when the floodgate is reopened).  If they
> are getting to disk, this seems to imply data journaling is being used,
> which is probably a net loss for overall database performance.  You
> might be willing to pay that cost to be able to use this backup
> technique --- but you should find out what that cost is ...

All sync writes will hang while the snapshot is being taken. Not sure
if async writes are buffered in memory or will hang too.

So, basically your database will be unresponsive for the time it takes to
take a snapshot. The time will depend on the size of the device, the
size of the physical extents (the LVM "blocks") and the raw disk
performance, but typically we're talking about periods shorter than a
second.


--
Ragnar Kjørstad
Zet.no

OT: mail encoding - Re: LVM snapshots

From
Martin Kutschker
Date:
Hi!

Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes on
Mon, 17 Mar 2003 13:31:32 +0100 (MET):

> [snip]


The mail has been marked using 8859-15 but apparently is using utf-8.

Looks odd in my not so very good webmailer.

Masi

Re: LVM snapshots

From
Tom Lane
Date:
Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes:
> So, basically your database will be unresponsive for the time it takes to
> take a snapshot. The time will depend on the size of the device, the
> size of the physical extents (the LVM "blocks") and the raw disk
> performance, but typically we're talking about periods shorter than a
> second.

We are?  On what size database?  Surely it takes longer than a second to
copy a reasonable-size database.

            regards, tom lane

Re: LVM snapshots

From
"Matt Clark"
Date:
> Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes:
> > So, basically your database will be unresponsive for the time it takes to
> > take a snapshot. The time will depend on the size of the device, the
> > size of the physical extents (the LVM "blocks") and the raw disk
> > performance, but typically we're talking about periods shorter than a
> > second.
>
> We are?  On what size database?  Surely it takes longer than a second to
> copy a reasonable-size database.

Ah, a misconception.  The terminology is confusing.  What is stored on the 'snapshot volume' is not a copy of the 'frozen volume',
but only the writes to the frozen volume.  When the 'snapshot volume' is unmounted those updates then get copied back to the 'frozen
volume', which is then unfrozen.  The 'snapshot volume' therefore only needs to be big enough to store as much data as will be
written to (or rather as many blocks as will be altered on) the frozen volume while the snapshot is in place.

So when you take a snapshot there is no data to be copied, and when you release a snapshot some data needs to be written back.
Since the data to be copied is nice whole disk blocks and no metadata has to be altered, that's a pretty fast operation.




Re: LVM snapshots

From
Tom Lane
Date:
"Matt Clark" <matt@ymogen.net> writes:
> So when you take a snapshot there is no data to be copied, and when
> you release a snapshot some data needs to be written back.

Does the write-back preserve the original write ordering?  In general,
what happens if there's a system crash in this state?  What of errors
encountered during write-back, eg out of disk space (or is disk space
accounting still done in terms of the original volume)?

            regards, tom lane

Re: LVM snapshots

From
"Matt Clark"
Date:
Many apologies, I got things completely the wrong way round I think!  My fault entirely.

The snapshot volume holds the *original* data from the frozen volume, using copy-on-write, so neither taking nor releasing the
snapshot involves any copying of data.

If you run out of room on the snapshot volume because you've written too much data to the frozen volume then you are buggered, pure
and simple.






Re: LVM snapshots

From
Tom Lane
Date:
"Matt Clark" <matt@ymogen.net> writes:
> The snapshot volume holds the *original* data from the frozen volume,
> using copy-on-write, so neither taking nor releasing the snapshot
> involve any copying of data.

Ah, I see: so while the freeze is on, writes to the frozen volume cost
about 3x normal (1 write -> 1 read + 2 writes), but establishing and
discontinuing the freeze cost nothing.  And if you have a crash the
volume is still in the proper state.  Good.

> If you run out of room on the snapshot volume because you've written
> too much data to the frozen volume then you are buggered, pure and
> simple.

No, you only have to drop the freeze; the original volume is not broken.

I would wonder though about the datastructure that LVM uses to remember
which blocks of the frozen volume have been copied (and where) on the
snapshot area.  I assume this table lives in kernel memory --- what
happens if it gets large?

            regards, tom lane

Re: LVM snapshots

From
"Matt Clark"
Date:
> Ah, I see: so while the freeze is on, writes to the frozen volume cost
> about 3x normal (1 write -> 1 read + 2 writes), but establishing and
> discontinuing the freeze cost nothing.  And if you have a crash the
> volume is still in the proper state.  Good.

I think that should be 1 read + 2 writes for the first write to a given block, but just 1 write from then on.

> No, you only have to drop the freeze; the original volume is not broken.

Yeah, only the snapshot is damaged.

>
> I would wonder though about the datastructure that LVM uses to remember
> which blocks of the frozen volume have been copied (and where) on the
> snapshot area.  I assume this table lives in kernel memory --- what
> happens if it gets large?
>

/* copy on write tables in disk format */
typedef struct lv_COW_table_disk_v1 {
        uint64_t pv_org_number;
        uint64_t pv_org_rsector;
        uint64_t pv_snap_number;
        uint64_t pv_snap_rsector;
} lv_COW_table_disk_t;

Looks like 256 bytes per disk block.  At a default LVM block size of 64KB that's a 4MB table for 1GB of dirty blocks, plus whatever
hash table overhead there might be.  Not much at any rate.
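Though, on a second look, the struct above is four uint64 fields, i.e. 32 bytes (256 bits - maybe where the 256 figure crept in) per entry, which would make the table rather smaller.  A back-of-envelope check, assuming the default 64KB chunk size:

```shell
# COW table size: one 32-byte entry per dirty 64KB chunk of the origin volume.
dirty=$((1024 * 1024 * 1024))   # 1GB of dirty blocks on the origin volume
chunk=$((64 * 1024))            # LVM_SNAPSHOT_DEF_CHUNK = 64KB
entry=32                        # sizeof(lv_COW_table_disk_t) = 4 * 8 bytes
entries=$((dirty / chunk))
table_kb=$((entries * entry / 1024))
echo "${entries} entries, ${table_kb}KB table"   # 16384 entries, 512KB table
```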



Re: LVM snapshots

From
Ragnar Kjørstad
Date:
On Mon, Mar 17, 2003 at 09:59:27AM -0500, Tom Lane wrote:
> Ah, I see: so while the freeze is on, writes to the frozen volume cost
> about 3x normal (1 write -> 1 read + 2 writes), but establishing and
> discontinuing the freeze cost nothing.  And if you have a crash the
> volume is still in the proper state.  Good.

I would say it costs practically nothing. The buffers are flushed to
disk, and some metadata updated, so there is a small cost.

I don't think the 3x write-cost is accurate?
What is the read for? There shouldn't be any reads as far as I can tell.
(assuming the lvm-metadata is in memory - I think it always is)

When you write to a block for the first time (after taking the snapshot)
there should be 2 writes - one to write the data, and one to update
lvm-metadata. The next time you write to the same "block" there should
only be one write. If you use large LVM-blocks (extents) there will be
only one write most of the time. (if you do sequential writes). On the
other hand, if you do random writes, large extents will waste a lot of
disk-space for your snapshot.

So, I think the write-cost should be somewhere in the range 1x-2x of
non-snapshot cost, depending on your usage.

> I would wonder though about the datastructure that LVM uses to remember
> which blocks of the frozen volume have been copied (and where) on the
> snapshot area.  I assume this table lives in kernel memory --- what
> happens if it gets large?

I believe it has a fixed size.
It should take 4 bytes * (volume_size / block_size).

LVM-blocksizes are relatively large, 4 MB by default, and for large
volumes it's common to use even bigger blocks (say 128 MB). For a 2TB
volume and 128 MB blocks the datastructure should take 32KB memory.


--
Ragnar Kjørstad
Zet.no

Re: LVM snapshots

From
Ragnar Kjørstad
Date:
On Mon, Mar 17, 2003 at 03:16:19PM -0000, Matt Clark wrote:
> > Ah, I see: so while the freeze is on, writes to the frozen volume cost
> > about 3x normal (1 write -> 1 read + 2 writes), but establishing and
> > discontinuing the freeze cost nothing.  And if you have a crash the
> > volume is still in the proper state.  Good.
>
> I think that should be 1 read + 2 writes for the first write to a given block, but just 1 write from then on.

Ah, the read is needed if you're not writing the full block, right?

Discard my email on this topic, then.

> > I would wonder though about the datastructure that LVM uses to remember
> > which blocks of the frozen volume have been copied (and where) on the
> > snapshot area.  I assume this table lives in kernel memory --- what
> > happens if it gets large?
>
> /* copy on write tables in disk format */
> typedef struct lv_COW_table_disk_v1 {
>         uint64_t pv_org_number;
>         uint64_t pv_org_rsector;
>         uint64_t pv_snap_number;
>         uint64_t pv_snap_rsector;
> } lv_COW_table_disk_t;
>
> Looks like 256 bytes per disk block.  At a default LVM block size of 64KB that's a 4MB table for 1GB of dirty blocks, plus whatever
> hash table overhead there might be.  Not much at any rate.

I'm confused.

First I thought LVM used 32bit block-numbers.
Second, I thought 4 MB was the default extent-size.
Third, I thought this was a fixed mapping table, not a dynamic table.

I must admit it has been a long time since I read the lvm-source, so I
may have forgotten or they may have changed it.

Is this from LVM1 or LVM2?
Is the COW-table different from the regular LVM mapping table?


--
Ragnar Kjørstad
Zet.no

Re: LVM snapshots

From
"Matt Clark"
Date:
> LVM-blocksizes are relatively large, 4 MB by default, and for large
> volumes it's common to use even bigger blocks (say 128 MB). For a 2TB
> volume and 128 MB blocks the datastructure should take 32KB memory.

4MB is the size of a physical extent, not a block.  Snapshots use 'chunks' not whole extents, and I don't know what (if any)
relationship there is between chunks and blocks.

#define LVM_SNAPSHOT_MAX_CHUNK  1024    /* 1024 KB */
#define LVM_SNAPSHOT_DEF_CHUNK  64      /* 64  KB */
#define LVM_SNAPSHOT_MIN_CHUNK  (PAGE_SIZE/1024)        /* 4 or 8 KB */

Anyway, you definitely *don't* force LVM to copy 4MB of data the first time you touch just 1 bit on a snapshotted volume!



Re: LVM snapshots

From
Ragnar Kjørstad
Date:
On Mon, Mar 17, 2003 at 03:35:53PM -0000, Matt Clark wrote:
> > LVM-blocksizes are relatively large, 4 MB by default, and for large
> > volumes it's common to use even bigger blocks (say 128 MB). For a 2TB
> > volume and 128 MB blocks the datastructure should take 32KB memory.
>
> 4MB is the size of a physical extent, not a block.  Snapshots use 'chunks' not whole extents, and I don't know what (if any)
> relationship there is between chunks and blocks.
>
> #define LVM_SNAPSHOT_MAX_CHUNK  1024    /* 1024 KB */
> #define LVM_SNAPSHOT_DEF_CHUNK  64      /* 64  KB */
> #define LVM_SNAPSHOT_MIN_CHUNK  (PAGE_SIZE/1024)        /* 4 or 8 KB */

Ah, that explains it.
Sorry for adding to the confusion.

> Anyway, you definitely *don't* force LVM to copy 4MB of data the first time you touch just 1 bit on a snapshotted volume!

Probably not.


--
Ragnar Kjørstad
Zet.no

Re: OT: mail encoding - Re: LVM snapshots

From
Ragnar Kjørstad
Date:
On Mon, Mar 17, 2003 at 02:09:15PM +0100, Martin Kutschker wrote:
> Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes on
> Mon, 17 Mar 2003 13:31:32 +0100 (MET):
>
> > [snip]
>
> The mail has been marked using 8859-15 but apparently is using utf-8.

Oops - sorry about that.

> Looks odd in my not so very good webmailer.

Probably looks odd in most mailers :-/

I _think_ I have fixed it now.


--
Ragnar Kjørstad
Zet.no

Re: LVM snapshots

From
Murthy Kambhampaty
Date:
Here's what I have so far:

The attachment includes a revised version of a script for backing up the
database cluster, and the logs from a couple of instances of it; in these
instances the server was under the following workload:
A script that
1. Copies a roughly 380,000-row, 85-column file into a newly created
table called tbl1
2. Creates 3 single-column indexes on tbl1
3. In 3 parallel subshells, 3 separate insert ... from select ... queries
put data from tbl1 into tbl2, tbl3, and tbl4
4. Upon completion of step 3, tbl1 is dropped and the process is repeated
for 12 different text files with distinct data and similar size
5. 3 single-column indexes are added to each of tbl2, tbl3 and tbl4 in 3
parallel subshells


Server hardware:
Quad PIII Xeon 500 MHz/2MB L2, 6 GB RAM, 168 GB HW RAID5 (data), free space
on second 168 GB HW RAID5 for snapshots on a dual channel Mylex Acceleraid
352, with XFS filesystem logs on a LINUX MD1 connected with an Adaptec
39320-R.

Column sums on the larger tables on the production server and on the backup
server gave identical values. Using a SAS System facility called PROC
COMPARE on the smaller tables, they were found to be identical. Note,
however, (i) my rsync options get the WHOLE FILE each time a table changes,
and (ii) because the logs attached are from tables that are being updated on
the production server, the copies of these tables have not been verified for
"accuracy", only recoverability (vacuum analyze completes successfully).
Therefore, users in transaction processing environments will have to do
their own testing. (I have verified that doing incremental rsyncs until the
data stabilize is also reliable; I'm just being paranoid on this one, and
the cost is not too high.)

I have also successfully run the backup simultaneously with the above
workload on a Dual PIII 1GHz/512KB L2 with 528 MB RAM, a 60 GB IDE disk
for the data and a 6 GB IDE disk for the snapshots, both on the SAME IDE
channel.

This version uses a two-phase rsync: the first rsync copies the live $PGDATA
directory, excluding the pg_xlog/ folder, from the production server to the
backup server; the second is run on an LVM snapshot of the $PGDATA folder.
The postmaster is started and stopped on the backup server, so that any
problems can be identified right away. (Notice the "ReadRecord: unexpected
pageaddr 13C/98EDA000 in log file 317, segment 11, offset 15572992" in the
later log. This seems to be a non-critical error; VACUUM ANALYZE gave a
short series of:
NOTICE:  Rel <table name>: Uninitialized page 54300 - fixing
...
VACUUM
).

Others have suggested the two-phase rsync, but there the second phase
required shutting down the postmaster. Besides the obvious advantage, it
turns out that because snapshots are fast -- especially if fsync=true --
the second rsync has little to do (if you turn off --exclude='pg_xlog/'
in the first phase, you may find that the backup is fully complete and
the snapshot is removed almost immediately after creation). Thus, the
time that the snapshot is up and dragging down filesystem operations is
minimized.
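For reference, the two-phase scheme looks roughly like this (a sketch only:
the hostname bkhost, the mount point /mnt/snap, the volume /dev/vg0/pgvol
and the 2G snapshot size are placeholder assumptions; adapt to your layout):

```shell
# Rough, untested sketch of the two-phase rsync backup described above.
# PGDATA, bkhost, /dev/vg0/pgvol and /mnt/snap are placeholder names.
PGDATA=/var/lib/pgsql/data

two_phase_backup() {
    # Phase 1: rsync the live cluster, excluding pg_xlog/. This copy is
    # inconsistent, but it moves the bulk of the data with no snapshot
    # active, so the snapshot's lifetime stays short.
    rsync -a --delete --exclude='pg_xlog/' "$PGDATA/" bkhost:/backup/data/

    # Phase 2: freeze the filesystem, take the snapshot, thaw immediately.
    xfs_freeze -f "$PGDATA"
    lvcreate -s -L 2G -n snap_pgdata /dev/vg0/pgvol
    xfs_freeze -u "$PGDATA"

    # rsync from the snapshot; little changed since phase 1, so it's quick.
    mount -o ro,nouuid /dev/vg0/snap_pgdata /mnt/snap
    rsync -a --delete /mnt/snap/ bkhost:/backup/data/
    umount /mnt/snap
    lvremove -f /dev/vg0/snap_pgdata
}
```

The function is only defined here, not run; it needs root and an LVM
volume group laid out as assumed.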

ADDITIONAL NOTES:

Data recovery from backup is done with:
pg_dump -Fc -h bkHost -U pguser -d db-to-recover [ -t <tbl-to-recover> ] | \
    pg_restore -h pgHost -U pguser -c

As the 7.4dev docs say:
"WAL offers the opportunity for a new method for database on-line backup and
restore (BAR). To use this method, one would have to make periodic saves of
data files to another disk, a tape or another host and also archive the WAL
log files. The database file copy and the archived log files could be used
to restore just as if one were restoring after a crash. Each time a new
database file copy was made the old log files could be removed. Implementing
this facility will require the logging of data file and index creation and
deletion; it will also require development of a method for copying the data
files (operating system copy commands are not suitable)."

If you can stand doing it with filesystem level tools (LVM or EVMS on linux;
hw snapshots on other OS?), you can do this already, as shown here. Besides
BAR, here are a couple of applications to consider:

1. A limited form of PITR: You can recover any object to its state at the
last snapshot backup. (Of course, log roll-forward is missing.)
From Oracle 9i documentation:
"Introduction to RMAN TSPITR
Recovery Manager (RMAN) automated tablespace point-in-time recovery (TSPITR)
enables you to quickly recover one or more tablespaces to a time that is
different from that of the rest of the database.

RMAN TSPITR is most useful for recovering the following:

    * An erroneous DROP TABLE or TRUNCATE TABLE statement
    * A table that has become logically corrupted
    * An incorrect batch job or other DML statement that has affected only a
subset of the database
    ...
"
We can do that here.

2. It is much easier to upgrade versions. Initdb the new version on pgHost,
then (pg_dumpall -h bkHost | psql) from the upgraded server. (Note that
pg_dumpall produces plain SQL, so it is restored with psql, not pg_restore.)
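Sketched out, the upgrade path might look like this (directory names,
hosts and users are placeholder assumptions):

```shell
# Hypothetical sketch of upgrading via the backup server. All paths,
# hosts and users are placeholder assumptions; only defined, not run.
upgrade_from_backup() {
    # Initialize a cluster with the new server version on pgHost.
    initdb -D /var/lib/pgsql/new-data
    pg_ctl -D /var/lib/pgsql/new-data -l /tmp/pg.log start

    # Restore everything from the backup server into the new cluster;
    # pg_dumpall emits plain SQL, so psql does the restore.
    pg_dumpall -h bkHost -U pguser | psql -h pgHost -U pguser template1
}
```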

3. If $PGDATA/pg_xlog is on a different set of disks than PGDATA, you should
be able to adapt with:
xfs_freeze -f $PGXLOG_FS
xfs_freeze -f $PGDATA_FS
lvcreate -s -L <size> -n snap_pgdata <the $PGDATA_FS device>
lvcreate -s -L <size> -n snap_pgxlog <the $PGXLOG_FS device>
xfs_freeze -u $PGDATA_FS
xfs_freeze -u $PGXLOG_FS
(on the assumption that the postmaster isn't doing anything while waiting
for pg_xlog/ writes to complete). You can then mount the snapshots in the
proper places, rsync and go. This has not been tested, though it's on my
agenda. Could the postgresql developers comment on this?
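Mounting the pair of snapshots in the proper places and backing them up
together might then look like this (equally untested; device paths, mount
points and hostname are placeholder assumptions):

```shell
# Untested sketch: back up the data and pg_xlog snapshots together so
# the backup sees a single coherent $PGDATA tree. Device paths, mount
# points and bkhost are placeholder assumptions; only defined, not run.
backup_split_cluster() {
    mount -o ro,nouuid /dev/vg0/snap_pgdata /mnt/snap/data
    mount -o ro,nouuid /dev/vg0/snap_pgxlog /mnt/snap/data/pg_xlog
    rsync -a --delete /mnt/snap/data/ bkhost:/backup/data/
    umount /mnt/snap/data/pg_xlog
    umount /mnt/snap/data
    lvremove -f /dev/vg0/snap_pgdata /dev/vg0/snap_pgxlog
}
```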

As Matt points out, this strategy is a useful addition to the postgresql
administrator's tools. It seems useful to update the documentation, and to
point out that the WAL design already gives the benefit of online backup and
restore when combined with externally available snapshot facilities.

QED. Cheers,
    Murthy


-----Original Message-----
From: Peter Eisentraut [mailto:peter_e@gmx.net]
Sent: Sunday, March 16, 2003 09:05
To: Matt
Cc: Murthy Kambhampaty; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] LVM snapshots


Matt writes:

> If the theory can be robustly demonstrated to work in practice (which
> you may have already done) then I say this strategy needs to be
> recognised in the docs (or at least the techdocs) as a valid and useful one.

The note in the documentation is rather old.  If you can conclusively
disprove it, then we'll be happy to alter it.

--
Peter Eisentraut   peter_e@gmx.net


Attachment

Re: LVM snapshots

From
"Matt Clark"
Date:
Thanks Murthy, that's exceptionally helpful!

Does anyone know what (in general) would cause the notices that Murthy spotted in the logs as per the snippet below?


> The postmaster is started and stopped on the backup server, so that any
> problems can be identified right away. (Notice the "ReadRecord: unexpected
> pageaddr 13C/98EDA000 in log file 317, segment 11, offset 15572992" in the
> later log. This seems to be a non-critical error; VACUUM ANALYZE gave a
> short series of:
> NOTICE:  Rel <table name>: Uninitialized page 54300 - fixing
> ...
> VACUUM
> ).


Re: LVM snapshots

From
Tom Lane
Date:
"Matt Clark" <matt@ymogen.net> writes:
> Does anyone know what (in general) would cause the notices that Murthy spotted in the logs as per the snippet below?

>> NOTICE:  Rel <table name>: Uninitialized page 54300 - fixing

There is a known mechanism for this to occur on certain key system catalogs,
see http://fts.postgresql.org/db/mw/msg.html?mid=1357214

We've seen some reports suggesting it can happen on user tables too, but
I don't know what the mechanism is or whether there's any risk of data
loss.  If you can make this happen reproducibly (or at least with decent
probability) on tables other than pg_class, pg_type, pg_attribute, pg_proc,
then I'd be real interested to see a test case.

            regards, tom lane


Re: LVM snapshots

From
Murthy Kambhampaty
Date:
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:

>>> NOTICE:  Rel <table name>: Uninitialized page 54300 - fixing
>
>There is a known mechanism for this to occur on certain key system
catalogs,
>see http://fts.postgresql.org/db/mw/msg.html?mid=1357214
>
>We've seen some reports suggesting it can happen on user tables too, but
>I don't know what the mechanism is or whether there's any risk of data
>loss.  If you can make this happen reproducibly (or at least with decent
>probability) on tables other than pg_class, pg_type, pg_attribute, pg_proc,
>then I'd be real interested to see a test case.
>
>

The tables on which this happened were the ones called tbl2 and tbl3 in my
example: i.e., tables into which data were being added with "insert into
tbl[2,3] from select * from tbl1 where ..." running in parallel subshells.
If somebody wants to try to replicate this:
1. Create a table with several hundred thousand records
2. Start an "insert into tbl2 from select ..." (multiple ones in parallel
maximize your chance of hitting this)
3. Freeze the filesystem, create a snapshot, unfreeze the filesystem, mount
the snapshot and copy $PGDATA somewhere else
4. Start a postmaster on the copy of the database cluster; run VACUUM
ANALYZE on tbl2
    you should see a series of
    NOTICE:  Rel tbl2: Uninitialized page 'nnnnn' - fixing
    messages
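Scripted, the recipe might look roughly like this (an untested sketch:
the database name, device paths, ports and the elided where-clause are
placeholders, not a tested procedure):

```shell
# Untested sketch of the reproduction recipe above. Database, device
# and path names are placeholder assumptions; only defined, not run.
reproduce_uninit_pages() {
    # Step 2: kick off parallel bulk inserts (several running at once
    # maximize the chance of catching a zeroed page in the snapshot).
    for i in 1 2 3; do
        psql -d testdb -c "insert into tbl2 select * from tbl1 where ..." &
    done

    # Step 3: freeze, snapshot, unfreeze, then copy the cluster out.
    xfs_freeze -f /var/lib/pgsql/data
    lvcreate -s -L 2G -n snap_pgdata /dev/vg0/pgvol
    xfs_freeze -u /var/lib/pgsql/data
    mount -o ro,nouuid /dev/vg0/snap_pgdata /mnt/snap
    cp -a /mnt/snap/. /var/lib/pgsql/copy
    umount /mnt/snap && lvremove -f /dev/vg0/snap_pgdata

    # Step 4: start a postmaster on the copy and vacuum; watch the log
    # for "Uninitialized page ... - fixing" NOTICEs.
    pg_ctl -D /var/lib/pgsql/copy -o '-p 5433' start
    psql -p 5433 -d testdb -c "vacuum analyze tbl2"
}
```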

I might have a chance to play with this later, but right now I'm swamped.

Cheers,
    Murthy


Re: LVM snapshots

From
Tom Lane
Date:
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> If somebody wants to try to replicate this:
> 1. Create a table with several hundred thousand records
> 2. Start an "insert into tbl2 from select ..." (multiple ones in parallel
> maximize your chance of hitting this)
> 3. Freeze the filesystem, create a snapshot, unfreeze the filesystem, mount
> the snapshot and copy $PGDATA somewhere else
> 4. Start a postmaster on the copy of the database cluster; run VACUUM
> ANALYZE on tbl2
>     you should see a series of
>     NOTICE:  Rel tbl2: Uninitialized page 'nnnnn' - fixing
>     messages

Hmm.  This amounts to testing what happens after a Postgres crash (the
copied PGDATA will look exactly like the system had gone down at the
instant of the freeze).  In that scenario zeroed pages aren't too
improbable, although I wouldn't expect to see a large number of them.
Are you running with a nondefault WAL_BUFFERS setting?

The way it could happen is like this: you have backend(s) busily
extending the table and filling each page with new tuples.  Each time we
extend a table, we actually write a page of zeroes to the kernel ---
this is so that we'll find out right away if we're out of disk space.
Then, *inside the Postgres shared-buffer arena*, the page is initialized
with correct header data and filled with tuples.  Unless you are short
on shared buffers, it will probably not be written out to the kernel
until the next checkpoint.  So there is an interval where the kernel
thinks that this page of the file exists but contains zeroes.

Meanwhile, WAL entries for the new tuples are being written into the
WAL buffers arena.  But they aren't going to be forced to disk until
there's a transaction commit (or the WAL buffers become full).

If the first WAL entry describing a tuple insertion into the new page
gets out to disk (or out to the kernel at least) before the filesystem
freeze or Postgres crash, then WAL replay will initialize the zeroed
page and you won't see any complaints.  But there is clearly an interval
where a zeroed page can be added to a table and not initialized by WAL
replay.

This is all okay and no data loss is possible through this mechanism.
We can only lose uncommitted tuples, since a commit would have forced
out the WAL entries.  But it's hard to see how you could have more than
approximately WAL_BUFFERS uninitialized pages present due to this
scenario --- a WAL entry can remain unwritten only as long as less than
WAL_BUFFERS worth of subsequent WAL traffic has occurred, so the
physical extensions of the file can't get all that far ahead of what WAL
replay will be able to see.

If you're just seeing a few of these notices per try, then I don't think
there's anything to worry about (especially if they reference pages very
near the end of their tables).  Otherwise, there may be some other
behavior involved, and we'd better look more closely.

            regards, tom lane


Re: LVM snapshots

From
Matt
Date:
Tom Lane wrote:

> A lot of detailed and informative stuff essentially equating a
> snapshot backup with a system crash

And I have to agree with what Tom said.  An FS snapshot should indeed
look just like an instantaneous halt in processing.  Murthy's testing
does involve a pretty heavy load, and it is in a sense reassuring that
it manages to generate the same kind of issues, and that PostgreSQL's
ability to recover from such crashes is so good.

Maybe one could use FS snapshots as a way to test crash recovery without
needing to crash the system!

Anyway, I was about to respond to Tom's comment that it looked like a
snapshotting bug, but can we take it from your subsequent post, Tom, that
you no longer think so?


Re: LVM snapshots

From
Tom Lane
Date:
Matt <matt@ymogen.net> writes:
> Anyway, I was about to respond to Tom's comment that it looked like a
> snapshotting bug, but can we take it from your subsequent post, Tom, that
> you no longer think so?

Right, assuming that what Murthy is seeing matches the behavior I
described (ie, no more than a few zeroed pages, all near the ends of
tables that were undergoing extension).

            regards, tom lane


Re: LVM snapshots

From
Murthy Kambhampaty
Date:
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:
>>     NOTICE:  Rel tbl2: Uninitialized page 'nnnnn' - fixing
>Hmm.  This amounts to testing what happens after a Postgres crash (the
>copied PGDATA will look exactly like the system had gone down at the
>instant of the freeze).  In that scenario zeroed pages aren't too
>improbable, although I wouldn't expect to see a large number of them.
It turns out there aren't many of them, after all, and they seem to be
truncated by vacuum.
Here's the vacuum report from the log, for the two files "affected":
2003-04-02 18:33:46 [4657]   DEBUG:  --Relation tbl2--
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54300 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54301 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54302 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54303 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54304 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54305 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54306 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54307 - fixing
2003-04-02 18:34:33 [4657]   NOTICE:  Rel tbl2: Uninitialized page 54308 - fixing
2003-04-02 18:34:34 [4657]   DEBUG:  Removed 49173 tuples in 3946 pages.
        CPU 0.27s/1.07u sec elapsed 1.34 sec.
2003-04-02 18:34:34 [4657]   DEBUG:  Pages 54309: Changed 54300, Empty 0; Tup 620366: Vac 49173, Keep 0, UnUsed 0.
        Total CPU 9.78s/2.35u sec elapsed 47.74 sec.
2003-04-02 18:34:35 [4657]   DEBUG:  Truncated 54309 --> 50355 pages.
        CPU 0.38s/0.07u sec elapsed 0.44 sec.


2003-04-02 18:34:55 [4657]   DEBUG:  --Relation tbl3--
2003-04-02 18:35:31 [4657]   NOTICE:  Rel tbl3: Uninitialized page 33172 - fixing
2003-04-02 18:35:31 [4657]   NOTICE:  Rel tbl3: Uninitialized page 33173 - fixing
2003-04-02 18:35:31 [4657]   NOTICE:  Rel tbl3: Uninitialized page 33174 - fixing
2003-04-02 18:35:31 [4657]   NOTICE:  Rel tbl3: Uninitialized page 33175 - fixing
2003-04-02 18:35:32 [4657]   DEBUG:  Removed 38874 tuples in 3230 pages.
        CPU 0.20s/0.88u sec elapsed 1.08 sec.
2003-04-02 18:35:32 [4657]   DEBUG:  Pages 33176: Changed 33172, Empty 0; Tup 354316: Vac 38874, Keep 0, UnUsed 0.
        Total CPU 5.66s/1.71u sec elapsed 36.95 sec.
2003-04-02 18:35:32 [4657]   DEBUG:  Truncated 33176 --> 29943 pages.
        CPU 0.41s/0.02u sec elapsed 0.42 sec.


>Are you running with a nondefault WAL_BUFFERS setting?
WAL_BUFFERS=64


> [ Snip ]
Thanks for the explanation, and for taking a second look ;-)


Re: LVM snapshots

From
Tom Lane
Date:
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
>> Are you running with a nondefault WAL_BUFFERS setting?
> WAL_BUFFERS=64

Okay, a dozen or so uninitialized pages seem consistent with that
setting.  I think we understand this one then ...

            regards, tom lane