Thread: LVM snapshots
Matt Clark
Ymogen Ltd
matt@ymogen.net
corp.ymogen.net
On Fri, 14 Mar 2003, Matt Clark wrote:
> Has anyone tried taking an LVM snapshot of a running DB?

I don't think there's a guarantee that a snapshot of the file system corresponds to a consistent database, even if the snapshot corresponds to a single point-in-time.

--
David.
I don't see why not. What happens if your server crashes? Hopefully what's on the disk is consistent (one of the good reasons to use PostgreSQL).
BTW I've heard of someone else doing something like this and using rsync. He was running it just before the snapshot and then again against the snapshot. This should then reduce the amount of time the actual snapshot needs to be kept.
Regards,
Ben
I *think* it is guaranteed with WAL, otherwise the system could fail to recover after a system crash. At least that's how I read section 11.1.1 of the docs: http://www.postgresql.org/docs/view.php?version=7.2&idoc=1&file=wal.html

Could freezing of the filesystem result in the WAL being fsynced at an 'inappropriate' time? Is there ever an inappropriate time to fsync the WAL?
"Matt Clark" <matt@ymogen.net> writes: > Has anyone tried taking an LVM snapshot of a running DB? And then reverting > to the snapshot? I presume one would need to take a low level backup (i.e. > using dd, not tar or cp) of the snapshot to preserve the precise FS layout, > but are there any other issues? This was discussed only a couple weeks ago (in the context of a similar facility in NetApp servers, but I believe LVM was mentioned). See the archives. regards, tom lane
A search for 'netapp' or 'lvm' doesn't seem to throw anything relevant up on any of the lists, and I can't *see* any likely threads in this group for Jan, Feb or March. I'm probably just being blind - can anyone recall the Subject: of the thread?

Cheers

Matt
Try - [GENERAL] A few questions to real pgsql gurus
Ben
Here is the subject from my post on 2/24/03:

Re: [GENERAL] A few questions to real pgsql gurus

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
"Matt Clark" <matt@ymogen.net> writes: > Could freezing of the filesystem result in the WAL being fsynced at an > 'inappropriate' time? Is there ever an inappropriate time to fsync > the WAL? If there were it'd be a bug ... regards, tom lane
Here's the script I use to mount a snapshot and rsync to a remote server; postgresql is started and stopped on the remote server as part of the backup procedure, to make sure that the backup log contains any errors that might show up. The restore strategy is to start the postmaster on the remote server and dump/restore like so:

pg_dump -Fc -h <backup-server> -U <user> -d <database> -t <table> | pg_restore -h <production-server> -U <user> -d <database> -t <table> [-c]

Note that this is not particularly safe with 7.3, since pg_restore will try to restore every table in the database with the given name. If you use the schema feature that allows identically named tables in different schemas, you're in for trouble when you recover it.

Backup/restore has been tested so far by restoring a table and comparing the contents; they were identical. Thanks to rsync, performance is fantastic compared to pg_dump to a file (my past e-mail to the list will give some idea of the problems we've had with that). I'd be interested in comments on the strategy and the script itself.

Cheers,
        Murthy

-----Original Message-----
From: Matt Clark [mailto:matt@ymogen.net]
Sent: Friday, March 14, 2003 13:22
To: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] LVM snapshots

Thanks all. The conclusion there seemed to be that it ought to work just fine, but should be tested. Well, I'll test it and see if anything interesting comes up. If anything LVM snapshots will be less tricky than NetApp snapshots, as LVM has access to the OS's local cache buffers, whereas the NetApp doesn't (though presumably the OS shouldn't/doesn't do any write caching when writing to a network FS).
[Attachment: backup script]
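(The script itself traveled as the attachment and isn't reproduced in the archive. A minimal sketch of the procedure Murthy describes -- snapshot, mount, rsync to the backup host, then cycle the postmaster there -- with hypothetical hostnames, volume names and paths throughout:)

#!/bin/sh
# Assumptions: $PGDATA lives on an XFS filesystem on /dev/vg0/pgdata
# mounted at /var/lib/pgsql, and the backup host keeps a copy of the
# cluster under /backup/data. All names are illustrative.
BK=backuphost

# Optional first pass against the live directory (Ben's earlier
# suggestion): inconsistent, but it moves most of the bytes, so the
# snapshot only needs to live for a short time.
rsync -a --delete /var/lib/pgsql/data/ $BK:/backup/data/

# Quiesce the filesystem, snapshot it, and thaw.
xfs_freeze -f /var/lib/pgsql
lvcreate -s -L 4G -n pgsnap /dev/vg0/pgdata
xfs_freeze -u /var/lib/pgsql

# Copy the frozen, self-consistent image from the snapshot.
mount -o ro,nouuid /dev/vg0/pgsnap /mnt/pgsnap
rsync -a --delete /mnt/pgsnap/data/ $BK:/backup/data/
umount /mnt/pgsnap
lvremove -f /dev/vg0/pgsnap

# Start and stop a postmaster on the copy so that any WAL-replay
# problems show up in the backup log right away.
ssh $BK 'pg_ctl -D /backup/data -w start && pg_ctl -D /backup/data stop'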
Murthy Kambhampaty wrote:
> Backup/restore has been tested so far by restoring a table and comparing
> the contents; they were identical. [...] I'd be interested in comments on
> the strategy and the script itself.

Many thanks for the detailed info!

As I see it the discussion so far says:

1. The strategy should theoretically work (coy comments in the docs notwithstanding)
2. It works for you in practice
3. It works very efficiently when using rsync

The only potential missing piece that I can see is some thorough stress-testing, such as running say 50 updates/second against a multi-table DB while the snapshotting and backup takes place, then testing the snapshotted DB. Is the DB you have been backing up under significant load when you take the snapshot?

If the theory can be robustly demonstrated to work in practice (which you may have already done) then I say this strategy needs to be recognised in the docs (or at least the techdocs) as a valid and useful one.
Matt writes:
> If the theory can be robustly demonstrated to work in practice (which
> you may have already done) then I say this strategy needs to be
> recognised in the docs (or at least the techdocs) as a valid and useful one.

The note in the documentation is rather old. If you can conclusively disprove it, then we'll be happy to alter it.

--
Peter Eisentraut   peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes:
> The note in the documentation is rather old. If you can conclusively
> disprove it, then we'll be happy to alter it.

More to the point, the note is intended to discourage people from using simple methods like tar or rsync duplication, which will most definitely not work. A snapshot implemented with help from something that sits between Postgres and the disk theoretically could work. Being conservative database weenies, we'd like to see some proof that it actually works in practical cases before we endorse it ;-).

BTW: aside from the question of the integrity of the snapshot backup, I'd be inclined to ask hard questions about the effects on database performance from using the disk-freezing feature in the first place. What is the LVM layer doing with the writes that occur while the frozen snapshot is maintained? If they're not getting to disk at all, then "commits" that occur during the freeze aren't really commits (and one can be pardoned for wondering whether write ordering constraints will be honored when the floodgate is reopened). If they are getting to disk, this seems to imply data journaling is being used, which is probably a net loss for overall database performance. You might be willing to pay that cost to be able to use this backup technique --- but you should find out what that cost is ...

regards, tom lane
On Sun, Mar 16, 2003 at 12:50:58PM -0500, Tom Lane wrote:
> BTW: aside from the question of the integrity of the snapshot backup,
> I'd be inclined to ask hard questions about the effects on database
> performance from using the disk-freezing feature in the first place.
> What is the LVM layer doing with the writes that occur while the
> frozen snapshot is maintained?

All sync writes will hang while the snapshot is being taken. Not sure if async writes are buffered in memory or will hang too. So, basically your database will be unresponsive for the time it takes to take a snapshot. The time will depend on the size of the device, the size of the physical extents (the LVM "blocks") and the raw disk performance, but typically we're talking about periods shorter than a second.

--
Ragnar Kjørstad
Zet.no
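(A quick way to put a number on that stall, sketched with hypothetical volume and database names: run a write-heavy pgbench against the cluster and time just the snapshot creation.)

#!/bin/sh
# Generate continuous write traffic in the background.
pgbench -c 10 -t 100000 bench &

# Writes to the volume stall for roughly the duration of this command.
time lvcreate -s -L 1G -n stalltest /dev/vg0/pgdata

# Clean up: drop the snapshot and wait for pgbench to finish.
lvremove -f /dev/vg0/stalltest
wait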
Hi!

Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes on Mon, 17 Mar 2003 13:31:32 +0100 (MET):
> [snip]

The mail has been marked as using 8859-15 but apparently is using utf-8. Looks odd in my not so very good webmailer.

Masi
Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes:
> So, basically your database will be unresponsive for the time it takes to
> take a snapshot. The time will depend on the size of the device, the
> size of the physical extents (the LVM "blocks") and the raw disk
> performance, but typically we're talking about periods shorter than a
> second.

We are? On what size database? Surely it takes longer than a second to copy a reasonable-size database.

regards, tom lane
> Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes:
> > So, basically your database will be unresponsive for the time it takes
> > to take a snapshot. ...
>
> We are? On what size database? Surely it takes longer than a second to
> copy a reasonable-size database.

Ah, a misconception. The terminology is confusing. What is stored on the 'snapshot volume' is not a copy of the 'frozen volume', but only the writes to the frozen volume. When the 'snapshot volume' is unmounted those updates then get copied back to the 'frozen volume', which is then unfrozen. The 'snapshot volume' therefore only needs to be big enough to store as much data as will be written to (or rather as many blocks as will be altered on) the frozen volume while the snapshot is in place.

So when you take a snapshot there is no data to be copied, and when you release a snapshot some data needs to be written back. Since the data to be copied is nice whole disk blocks and no metadata has to be altered that's a pretty fast operation.
"Matt Clark" <matt@ymogen.net> writes: > So when you take a snapshot there is no data to be copied, and when > you release a snapshot some data needs to be written back. Does the write-back preserve the original write ordering? In general, what happens if there's a system crash in this state? What of errors encountered during write-back, eg out of disk space (or is disk space accounting still done in terms of the original volume)? regards, tom lane
Many apologies, I got things completely the wrong way round I think! My fault entirely.

The snapshot volume holds the *original* data from the frozen volume, using copy-on-write, so neither taking nor releasing the snapshot involve any copying of data.

If you run out of room on the snapshot volume because you've written too much data to the frozen volume then you are buggered, pure and simple.
"Matt Clark" <matt@ymogen.net> writes: > The snapshot volume holds the *original* data from the frozen volume, > using copy-on-write, so neither taking nor releasing the snapshot > involve any copying of data. Ah, I see: so while the freeze is on, writes to the frozen volume cost about 3x normal (1 write -> 1 read + 2 writes), but establishing and discontinuing the freeze cost nothing. And if you have a crash the volume is still in the proper state. Good. > If you run out of room on the snapshot volume because you've written > too much data to the frozen volume then you are buggered, pure and > simple. No, you only have to drop the freeze; the original volume is not broken. I would wonder though about the datastructure that LVM uses to remember which blocks of the frozen volume have been copied (and where) on the snapshot area. I assume this table lives in kernel memory --- what happens if it gets large? regards, tom lane
> Ah, I see: so while the freeze is on, writes to the frozen volume cost
> about 3x normal (1 write -> 1 read + 2 writes), but establishing and
> discontinuing the freeze cost nothing. And if you have a crash the
> volume is still in the proper state. Good.

I think that should be 1 read + 2 writes for the first write to a given block, but just 1 write from then on.

> No, you only have to drop the freeze; the original volume is not broken.

Yeah, only the snapshot is damaged.

> I would wonder though about the datastructure that LVM uses to remember
> which blocks of the frozen volume have been copied (and where) on the
> snapshot area. I assume this table lives in kernel memory --- what
> happens if it gets large?

/* copy on write tables in disk format */
typedef struct lv_COW_table_disk_v1 {
    uint64_t pv_org_number;
    uint64_t pv_org_rsector;
    uint64_t pv_snap_number;
    uint64_t pv_snap_rsector;
} lv_COW_table_disk_t;

Looks like 32 bytes per entry (4 x uint64_t). At a default LVM chunk size of 64KB that's 16384 entries, or a 512KB table, for 1GB of dirty blocks, plus whatever hash table overhead there might be. Not much at any rate.
On Mon, Mar 17, 2003 at 09:59:27AM -0500, Tom Lane wrote:
> Ah, I see: so while the freeze is on, writes to the frozen volume cost
> about 3x normal (1 write -> 1 read + 2 writes), but establishing and
> discontinuing the freeze cost nothing. And if you have a crash the
> volume is still in the proper state. Good.

I would say it costs practically nothing. The buffers are flushed to disk, and some metadata updated, so there is a small cost.

I don't think the 3x write-cost is accurate? What is the read for? There shouldn't be any reads as far as I can tell (assuming the lvm-metadata is in memory - I think it always is). When you write to a block for the first time (after taking the snapshot) there should be 2 writes - one to write the data, and one to update lvm-metadata. The next time you write to the same "block" there should only be one write.

If you use large LVM-blocks (extents) there will be only one write most of the time (if you do sequential writes). On the other hand, if you do random writes, large extents will waste a lot of disk-space for your snapshot. So, I think the write-cost should be somewhere in the range 1x-2x of non-snapshot cost, depending on your usage.

> I would wonder though about the datastructure that LVM uses to remember
> which blocks of the frozen volume have been copied (and where) on the
> snapshot area. I assume this table lives in kernel memory --- what
> happens if it gets large?

I believe it has a fixed size. It should take 4 bytes * (volume_size / block_size). LVM-blocksizes are relatively large, 4 MB by default, and for large volumes it's common to use even bigger blocks (say 128 MB). For a 2TB volume and 128 MB blocks the datastructure should take 64KB memory.

--
Ragnar Kjørstad
Zet.no
On Mon, Mar 17, 2003 at 03:16:19PM -0000, Matt Clark wrote:
> I think that should be 1 read + 2 writes for the first write to a given
> block, but just 1 write from then on.

Ah, the read is needed if you're not writing the full block, right? Discard my email on this topic, then.

> /* copy on write tables in disk format */
> typedef struct lv_COW_table_disk_v1 {
>     uint64_t pv_org_number;
>     uint64_t pv_org_rsector;
>     uint64_t pv_snap_number;
>     uint64_t pv_snap_rsector;
> } lv_COW_table_disk_t;

I'm confused. First I thought LVM used 32bit block-numbers. Second, I thought 4 MB was the default extent-size. Third, I thought this was a fixed mapping table, not a dynamic table. I must admit it has been a long time since I read the lvm-source, so I may have forgotten or they may have changed it. Is this from LVM1 or LVM2? Is the COW-table different from the regular LVM mapping table?

--
Ragnar Kjørstad
Zet.no
> LVM-blocksizes are relatively large, 4 MB by default, and for large
> volumes it's common to use even bigger blocks (say 128 MB). For a 2TB
> volume and 128 MB blocks the datastructure should take 64KB memory.

4MB is the size of a physical extent, not a block. Snapshots use 'chunks', not whole extents, and I don't know what (if any) relationship there is between chunks and blocks.

#define LVM_SNAPSHOT_MAX_CHUNK    1024    /* 1024 KB */
#define LVM_SNAPSHOT_DEF_CHUNK    64      /* 64 KB */
#define LVM_SNAPSHOT_MIN_CHUNK    (PAGE_SIZE/1024)  /* 4 or 8 KB */

Anyway, you definitely *don't* force LVM to copy 4MB of data the first time you touch just 1 bit on a snapshotted volume!
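(For what it's worth, the chunk size is selectable when the snapshot is created; a hypothetical invocation, assuming a volume group named vg0 -- exact flag syntax may vary between LVM versions:)

# create a 2GB snapshot of /dev/vg0/pgdata using 256KB COW chunks
lvcreate -s -L 2G -c 256k -n pgsnap /dev/vg0/pgdata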
On Mon, Mar 17, 2003 at 03:35:53PM -0000, Matt Clark wrote:
> 4MB is the size of a physical extent, not a block. Snapshots use
> 'chunks', not whole extents, and I don't know what (if any) relationship
> there is between chunks and blocks.

Ah, that explains it. Sorry for adding to the confusion.

> Anyway, you definitely *don't* force LVM to copy 4MB of data the first
> time you touch just 1 bit on a snapshotted volume!

Probably not.

--
Ragnar Kjørstad
Zet.no
On Mon, Mar 17, 2003 at 02:09:15PM +0100, Martin Kutschker wrote:
> The mail has been marked as using 8859-15 but apparently is using utf-8.

Ups - sorry about that.

> Looks odd in my not so very good webmailer.

Probably looks odd in most mailers :-/ I _think_ I have fixed it now.

--
Ragnar Kjørstad
Zet.no
Here's what I have so far:

The attachment includes a revised version of a script for backing up the database cluster, and the logs from a couple of instances of it; in these instances the server was under the following workload. A script that:

1. Copies a roughly 380,000 row and 85 column file into a newly created table called tbl1
2. Creates 3 single column indexes on tbl1
3. In 3 parallel subshells, 3 separate "insert ... select ..." queries put data from tbl1 into tbl2, tbl3, and tbl4
4. Upon completion of step 3, tbl1 is dropped and the process is repeated for 12 different text files with distinct data and similar size
5. 3 single column indexes are added to each of tbl2, tbl3 and tbl4 in 3 parallel subshells

Server hardware: Quad PIII Xeon 500 MHz/2MB L2, 6 GB RAM, 168 GB HW RAID5 (data), free space on a second 168 GB HW RAID5 for snapshots, on a dual channel Mylex Acceleraid 352, with XFS filesystem logs on a Linux MD1 connected with an Adaptec 39320-R.

Column sums on the larger tables on the production server and on the backup server gave identical values. Using a SAS System facility called PROC COMPARE on the smaller tables, they were found to be identical. Note, however, that (i) my rsync options get the WHOLE FILE each time a table changes, and (ii) because the logs attached are from tables that are being updated on the production server, the copies of these tables have not been verified for "accuracy", only recoverability (vacuum analyze completes successfully). Therefore, users in transaction processing environments will have to do their own testing. (I have verified that doing incremental rsyncs until the data stabilize is also reliable; I'm just being paranoid on this one, and the cost is not too high.)

I have also successfully run the backup simultaneously with the above workload on a Dual PIII 1GHz/512MB L2, 528 MB RAM, with 60 GB IDE (data) and 6 GB IDE (snapshots) on the SAME IDE channel.

This version uses a two-phase rsync: the first rsync copies the live $PGDATA directory, excluding the pg_xlog/ folder, from the production server to the backup server; the second is run on an LVM snapshot of the $PGDATA folder. The postmaster is started and stopped on the backup server, so that any problems can be identified right away. (Notice the "ReadRecord: unexpected pageaddr 13C/98EDA000 in log file 317, segment 11, offset 15572992" in the later log. This seems to be a non-critical error; VACUUM ANALYZE gave a short series of:

NOTICE: Rel <table name>: Uninitialized page 54300 - fixing
...
VACUUM
).

Others have suggested the two-phase rsync, but there the second phase required shutting down the postmaster. Besides the obvious advantage, it turns out that because snapshots are fast -- especially if fsync=true -- the second rsync has little to do (if you turn off --exclude='pg_xlog/' in the first phase, you may find that the backup is fully complete and the snapshot is removed almost immediately after creation). Thus, the time that the snapshot is up and dragging down filesystem operations is minimized.

ADDITIONAL NOTES:

Data recovery from backup is done with:

pg_dump -Fc -h bkHost -U pguser -d db-to-recover [ -t <tbl-to-recover> ] | pg_restore -h pgHost -U pguser -c

As the 7.4dev docs say: "WAL offers the opportunity for a new method for database on-line backup and restore (BAR). To use this method, one would have to make periodic saves of data files to another disk, a tape or another host and also archive the WAL log files. The database file copy and the archived log files could be used to restore just as if one were restoring after a crash. Each time a new database file copy was made the old log files could be removed. Implementing this facility will require the logging of data file and index creation and deletion; it will also require development of a method for copying the data files (operating system copy commands are not suitable)."

If you can stand doing it with filesystem level tools (LVM or EVMS on Linux; hw snapshots on other OS?), you can do this already, as shown here. Besides BAR, here are a couple of applications to consider:

1. A limited form of PITR: You can recover any object to its state at the last snapshot backup. (Of course, log roll-forward is missing.) From the Oracle 9i documentation: "Introduction to RMAN TSPITR: Recovery Manager (RMAN) automated tablespace point-in-time recovery (TSPITR) enables you to quickly recover one or more tablespaces to a time that is different from that of the rest of the database. RMAN TSPITR is most useful for recovering the following: * An erroneous DROP TABLE or TRUNCATE TABLE statement * A table that has become logically corrupted * An incorrect batch job or other DML statement that has affected only a subset of the database ..." Can do here.

2. It is much easier to upgrade versions. Initdb the new version on pgHost, then (pg_dumpall -h bkHost | psql) from the upgraded server.

3. If $PGDATA/pg_xlog is on a different set of disks than $PGDATA, you should be able to adapt with:

xfs_freeze -f $PGXLOG_FS
xfs_freeze -f $PGDATA_FS
lvcreate -s -L <size> -n snap_pgdata <the $PGDATA_FS device>
lvcreate -s -L <size> -n snap_pgxlog <the $PGXLOG_FS device>
xfs_freeze -u $PGDATA_FS
xfs_freeze -u $PGXLOG_FS

(on the assumption that the postmaster isn't doing anything while waiting for pg_xlog/ writes to complete). You can then mount the snapshots in the proper places, rsync and go. This has not been tested, though it's on my agenda. Could the postgresql developers comment on this?

As Matt points out, this strategy is a useful addition to the postgresql administrator's tools. It seems useful to update the documentation, and to point out that the WAL design already gives the benefit of online backup and restore when combined with externally available snapshot facilities. QED.

Cheers,
        Murthy
[Attachment: revised backup script and logs]
Thanks Murthy, that's exceptionally helpful!

Does anyone know what (in general) would cause the notices that Murthy spotted in the logs, as per the snippet below?

> The postmaster is started and stopped on the backup server, so that any
> problems can be identified right away. (Notice the "ReadRecord: unexpected
> pageaddr 13C/98EDA000 in log file 317, segment 11, offset 15572992" in the
> later log. This seems to be a non-critical error; VACUUM ANALYZE gave a
> short series of:
> NOTICE: Rel <table name>: Uninitialized page 54300 - fixing
> ...
> VACUUM
> ).
"Matt Clark" <matt@ymogen.net> writes: > Does anyone know what (in general) would cause the notices that Murthy spotted in the logs as per the snippet below? >> NOTICE: Rel <table name>: Uninitialized page 54300 - fixing There is a known mechanism for this to occur on certain key system catalogs, see http://fts.postgresql.org/db/mw/msg.html?mid=1357214 We've seen some reports suggesting it can happen on user tables too, but I don't know what the mechanism is or whether there's any risk of data loss. If you can make this happen reproducibly (or at least with decent probability) on tables other than pg_class, pg_type, pg_attribute, pg_proc, then I'd be real interested to see a test case. regards, tom lane
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:
>>> NOTICE: Rel <table name>: Uninitialized page 54300 - fixing
>
> There is a known mechanism for this to occur on certain key system
> catalogs, see http://fts.postgresql.org/db/mw/msg.html?mid=1357214
>
> We've seen some reports suggesting it can happen on user tables too, but
> I don't know what the mechanism is or whether there's any risk of data
> loss. If you can make this happen reproducibly (or at least with decent
> probability) on tables other than pg_class, pg_type, pg_attribute,
> pg_proc, then I'd be real interested to see a test case.

The tables on which this happened were the ones called tbl2 and tbl3 in my example: i.e., tables into which data were being added with "insert into tbl[2,3] select * from tbl1 where ..." running in parallel subshells.

If somebody wants to try to replicate this:

1. Create a table with several hundred thousand records
2. Start an "insert into tbl2 select ..." (multiple ones in parallel maximize your chance of hitting this)
3. Freeze the filesystem, create a snapshot, unfreeze the filesystem, mount the snapshot and copy $PGDATA somewhere else
4. Start a postmaster on the copy of the database cluster; run VACUUM ANALYZE on tbl2

You should see a series of

NOTICE: Rel tbl2: Uninitialized page 'nnnnn' - fixing

messages. I might have a chance to play with this late next week but right now I'm swamped.

Cheers,
        Murthy
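(A minimal scripted version of Murthy's recipe above, with hypothetical paths, volume and database names throughout; LVM and XFS assumed, as in his setup, and the copied cluster assumed to be configured to run on its own port:)

#!/bin/sh
# Step 2: parallel bulk inserts keep table extension busy.
for t in tbl2 tbl3 tbl4; do
    psql -d testdb -c "insert into $t select * from tbl1" &
done

# Step 3: freeze, snapshot, thaw, then copy the cluster off the snapshot.
xfs_freeze -f /var/lib/pgsql
lvcreate -s -L 2G -n pgsnap /dev/vg0/pgdata
xfs_freeze -u /var/lib/pgsql
mount -o ro,nouuid /dev/vg0/pgsnap /mnt/pgsnap
cp -a /mnt/pgsnap/data /copy/
umount /mnt/pgsnap
lvremove -f /dev/vg0/pgsnap

# Step 4: run a postmaster on the copy and vacuum; watch the log for
# "Uninitialized page" notices.
pg_ctl -D /copy/data -w start
psql -d testdb -c "vacuum analyze tbl2"
pg_ctl -D /copy/data stop
wait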
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
> If somebody wants to try to replicate this:
> 1. Create a table with several hundred thousand records
> 2. Start an "insert into tbl2 select ..." (multiple ones in parallel
>    maximize your chance of hitting this)
> 3. Freeze the filesystem, create a snapshot, unfreeze the filesystem,
>    mount the snapshot and copy $PGDATA somewhere else
> 4. Start a postmaster on the copy of the database cluster; run VACUUM
>    ANALYZE on tbl2
> You should see a series of
> NOTICE: Rel tbl2: Uninitialized page 'nnnnn' - fixing
> messages.

Hmm. This amounts to testing what happens after a Postgres crash (the copied PGDATA will look exactly like the system had gone down at the instant of the freeze). In that scenario zeroed pages aren't too improbable, although I wouldn't expect to see a large number of them. Are you running with a nondefault WAL_BUFFERS setting?

The way it could happen is like this: you have backend(s) busily extending the table and filling each page with new tuples. Each time we extend a table, we actually write a page of zeroes to the kernel --- this is so that we'll find out right away if we're out of disk space. Then, *inside the Postgres shared-buffer arena*, the page is initialized with correct header data and filled with tuples. Unless you are short on shared buffers, it will probably not be written out to the kernel until the next checkpoint. So there is an interval where the kernel thinks that this page of the file exists but contains zeroes.

Meanwhile, WAL entries for the new tuples are being written into the WAL buffers arena. But they aren't going to be forced to disk until there's a transaction commit (or the WAL buffers become full). If the first WAL entry describing a tuple insertion into the new page gets out to disk (or out to the kernel at least) before the filesystem freeze or Postgres crash, then WAL replay will initialize the zeroed page and you won't see any complaints. But there is clearly an interval where a zeroed page can be added to a table and not initialized by WAL replay.

This is all okay and no data loss is possible through this mechanism. We can only lose uncommitted tuples, since a commit would have forced out the WAL entries. But it's hard to see how you could have more than approximately WAL_BUFFERS uninitialized pages present due to this scenario --- a WAL entry can remain unwritten only as long as less than WAL_BUFFERS worth of subsequent WAL traffic has occurred, so the physical extensions of the file can't get all that far ahead of what WAL replay will be able to see.

If you're just seeing a few of these notices per try, then I don't think there's anything to worry about (especially if they reference pages very near the end of their tables). Otherwise, there may be some other behavior involved, and we'd better look more closely.

regards, tom lane
Tom Lane wrote:
> A lot of detailed and informative stuff essentially equating a snapshot
> backup with a system crash

And I have to agree with what Tom said. An FS snapshot should indeed look just like an instantaneous halt in processing. Murthy's testing does involve a pretty heavy load, and it is in a sense reassuring that it manages to generate the same kind of issues, and that PostgreSQL's ability to recover from such crashes is so good. Maybe one could use FS snapshots as a way to test crash recovery without needing to crash the system!

Anyway, I was about to respond to Tom's comment that it looked like a snapshotting bug, but can we take it from yr subsequent post Tom that you no longer think so?
Matt <matt@ymogen.net> writes:
> Anyway, I was about to respond to Tom's comment that it looked like a
> snapshotting bug, but can we take it from yr subsequent post Tom that
> you no longer think so?

Right, assuming that what Murthy is seeing matches the behavior I described (ie, no more than a few zeroed pages, all near the ends of tables that were undergoing extension).

regards, tom lane
Tom Lane [mailto:tgl@sss.pgh.pa.us] writes:
>> NOTICE: Rel tbl2: Uninitialized page 'nnnnn' - fixing

> Hmm. This amounts to testing what happens after a Postgres crash (the
> copied PGDATA will look exactly like the system had gone down at the
> instant of the freeze). In that scenario zeroed pages aren't too
> improbable, although I wouldn't expect to see a large number of them.

It turns out there aren't many of them, after all, and they seem to be truncated by vacuum. Here's the vacuum report from the log, for the two files "affected":

2003-04-02 18:33:46 [4657] DEBUG: --Relation tbl2--
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54300 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54301 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54302 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54303 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54304 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54305 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54306 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54307 - fixing
2003-04-02 18:34:33 [4657] NOTICE: Rel tbl2: Uninitialized page 54308 - fixing
2003-04-02 18:34:34 [4657] DEBUG: Removed 49173 tuples in 3946 pages. CPU 0.27s/1.07u sec elapsed 1.34 sec.
2003-04-02 18:34:34 [4657] DEBUG: Pages 54309: Changed 54300, Empty 0; Tup 620366: Vac 49173, Keep 0, UnUsed 0. Total CPU 9.78s/2.35u sec elapsed 47.74 sec.
2003-04-02 18:34:35 [4657] DEBUG: Truncated 54309 --> 50355 pages. CPU 0.38s/0.07u sec elapsed 0.44 sec.
2003-04-02 18:34:55 [4657] DEBUG: --Relation tbl3--
2003-04-02 18:35:31 [4657] NOTICE: Rel tbl3: Uninitialized page 33172 - fixing
2003-04-02 18:35:31 [4657] NOTICE: Rel tbl3: Uninitialized page 33173 - fixing
2003-04-02 18:35:31 [4657] NOTICE: Rel tbl3: Uninitialized page 33174 - fixing
2003-04-02 18:35:31 [4657] NOTICE: Rel tbl3: Uninitialized page 33175 - fixing
2003-04-02 18:35:32 [4657] DEBUG: Removed 38874 tuples in 3230 pages. CPU 0.20s/0.88u sec elapsed 1.08 sec.
2003-04-02 18:35:32 [4657] DEBUG: Pages 33176: Changed 33172, Empty 0; Tup 354316: Vac 38874, Keep 0, UnUsed 0. Total CPU 5.66s/1.71u sec elapsed 36.95 sec.
2003-04-02 18:35:32 [4657] DEBUG: Truncated 33176 --> 29943 pages. CPU 0.41s/0.02u sec elapsed 0.42 sec.

> Are you running with a nondefault WAL_BUFFERS setting?

WAL_BUFFERS=64

> [ Snip ]

Thanks for the explanation, and for taking a second look ;-)
Murthy Kambhampaty <murthy.kambhampaty@goeci.com> writes:
>> Are you running with a nondefault WAL_BUFFERS setting?

> WAL_BUFFERS=64

Okay, a dozen or so uninitialized pages seem consistent with that setting. I think we understand this one then ...

regards, tom lane