Thread: PITR Recovery Question
Hi Florian, I'm moving this discussion to pgsql-admin. To give a picture of my original question, it is given below, so that other users in this mailing list will understand my original problem statement. > If you point it at a cluster's own pg_xlog directory, it won't work. > You might want to re-ead the section on the recovery process in the PTITR documentation, at http://www.postgresql.org/docs/8.4/interactive/continuous-archiving.html#BAC KUP-PITR-RECOVERY OK. My pg_xlog/ and walarchive/ directory locations are "/usr/local/pgsql/data/pg_xlog" and "/mnt/pitr/walarchive" respectively. In case, if I decide to clean the old WAL archives and set right PITR from today onwards by taking base backup, so that I can start managing and maintaining atleast from now onwards, what is the correct way/method of removing files from pg_xlog/, pg_xlog/archive_status/ and /mnt/pitr/walarchive/ directories? 1. How do I clean up WAL archives that are accumulated in pg_xlog/ directory? Does it harm the database in anyway if I remove WAL segments manually? 2. Few important command outputs: [root@dbserver data]# pwd /usr/local/pgsql/data [root@dbserver data]# ls -l pg_xlog |wc -l 14438 [root@dbserver data]# ls -l pg_xlog/archive_status |wc -l 14436 [root@dbserver data]# ls -l /mnt/zoniacpitr/walarchive | wc -l 1783 NOTE: I can see all the files under pg_xlog/archive_status/ directory having suffixed with "00*.ready". -----Original Message----- From: Gnanakumar [mailto:gnanam@zoniac.com] Sent: Thursday, June 03, 2010 7:07 PM To: pgsql-hackers@postgresql.org Subject: PITR Recovery Question Hi, My production server is running PostgreSQL v8.2.3 on CentOS release 5.2 (Final). I've setup PITR in my production server. For some reason, after setting up PITR, we're not able to manage and maintain it. Because of this our WAL archive drive become full (100% use) approximately after 1 month. PITR SETUP DETAILS We've 2 drives. Primary drive (pgsql/data/ directory resides) is 400 GB and secondary drive (WAL archive) is 30 GB. All WAL archives are written to secondary drive. Base backup taken on: Aug03, 2009 WAL archive drive become full (100% use) on: Sep05, 2009 Because this WAL archive drive has become full, all WAL archive segments to be archived are accumulated into pg_xlog/ directory itself. Eventually, 9 months (as of today from Sep05, 2009) of WAL archives are residing in pg_xlog/ directory. My question is, in case if I would like to perform recovery process as it is in this situation, will this work out? That is, I'm seeing/finding out whether recovery process would perform successfully anywhere between the date range Aug03, 2009 (my base backup date) and as of today - Jun03, 2009. Reason I'm asking this is still all my WAL archives are residing in pg_xlog/ directory. Experts advice/idea/suggestion on this appreciated. Regards, Gnanam
"Gnanakumar" <gnanam@zoniac.com> wrote: > My pg_xlog/ and walarchive/ directory locations are > "/usr/local/pgsql/data/pg_xlog" and "/mnt/pitr/walarchive" > respectively. > > In case, if I decide to clean the old WAL archives and set right > PITR from today onwards by taking base backup, so that I can start > managing and maintaining atleast from now onwards, what is the > correct way/method of removing files from pg_xlog/, > pg_xlog/archive_status/ and /mnt/pitr/walarchive/ directories? It is generally unsafe to delete any WAL files from pg_xlog. If they are there because your archive command has been failing, you need to turn off archiving or (probably more convenient) allow the archive script to return success until things clear. One trick would be to temporarily change your archive_command to 'true', delete all files from your archive, and then change the command back. Doing that without exposing yourself to a period where you have no backup might be tricky, though. If the only problem with the archive command is that the archive fs is full, I would copy the contents of the archive directory to tape or whatever medium you have for long-term storage, delete the contents, and let archive succeed. The pg_xlog directory will eventually clear, and then I would get a fresh PITR base backup (following all the documented steps for doing so). You really want to see WAL files flowing to your archive location before you start the process of getting a new base backup. If there's some other reason that the archive command has been failing, what is it? -Kevin
On Jun 4, 2010, at 13:54 , Gnanakumar wrote: > In case, if I decide to clean the old WAL archives and set right PITR from > today onwards by taking base backup, so that I can start managing and > maintaining atleast from now onwards, what is the correct way/method of > removing files from pg_xlog/, pg_xlog/archive_status/ and > /mnt/pitr/walarchive/ directories? You should *never*, I repeat, *never* remove files from pg_xlog manually unless you know *exactly* what you're doing! If you remove those, and postgres crashes, you *will* lose your whole database! About the same holds for pg_xlog/archive_status. Removing files from there won't prevent postgres from starting, but it *will*interfere with the WAL archiving process, and may lead to unusable base backups! Postgres will recycle old WAL segments automatically, once they're no longer needed for crash recovery *and* after they'vebeen archived successfully (if archiving is enabled). Once a WAL segment is recycled, it's corresponding archive_status/file(s) will be removed too. The correct way to clean out pg_xlog therefore is to either disable WAL archiving, or to make sure your archive_command succeedseventually. A common way to use PITR is the following. 1) You setup an archive_command. 2) You take a base backup. That also triggers the creation of a backup history file (<number1>.<number2>.backup) in the archivedirectory. 3) You remove old base backups. You probably want to keep more than one, though, just in case. How long you retain your basebackups determines how far back in time you'll be able to go in the event of a disaster. The furthest point you can go back to is the time pg_stop_backup()was called while taking the oldest base backup you retained. 4) You remove all WAL segments that predate the remaining base backups. For that, you find the backup history file in thearchive directory that corresponds to the oldest remaining base backup and then remove all WAL segments whose name isnumerically smaller than the <number1> from that backup history file. Keeping older WAL segments buys you nothing - WALfiles without a base backup that *predates* them are worthless. 5) Your archive directory will now slowly grow. At some point the cost of storing all those segments will out-weight thecost of taking a new base backup. Whether that happens after an hour, a day, a week or a year depends on the size of yourdatabase vs. the amount of modifications it receives. However, at some point or another you will reach that cutoff point,and that is when you continue with (2). http://www.postgresql.org/docs/8.2/static/continuous-archiving.html explains all of this pretty exhaustively. best regards, Florian Pflug
Hi Kevin, > It is generally unsafe to delete any WAL files from pg_xlog. If > they are there because your archive command has been failing, you > need to turn off archiving or (probably more convenient) allow the > archive script to return success until things clear. One trick > would be to temporarily change your archive_command to 'true', > delete all files from your archive, and then change the command > back. Doing that without exposing yourself to a period where you > have no backup might be tricky, though. I'm trying to see and understand your view point, but I couldn't able to get this particular step clearly: "One trick would be to temporarily change your archive_command to 'true', delete all files from your archive, and then change the command back ". Can you please clarify and explain on this? When you say *temporarily changing archive_command to 'true' *, do you mean enabling/disabling of WAL archiving here? Per documentation, "If this is an empty string (the default), WAL archiving is disabled.". And when you say "change the command back", I understood it as *disabling*. Is my understanding correct? > If the only problem with the archive command is that the archive fs > is full, I would copy the contents of the archive directory to tape > or whatever medium you have for long-term storage, delete the > contents, and let archive succeed. The pg_xlog directory will > eventually clear, and then I would get a fresh PITR base backup > (following all the documented steps for doing so). You really want > to see WAL files flowing to your archive location before you start > the process of getting a new base backup. Yes, probably I should go ahead and proceed as you suggested above, that is allowing archive script to run successfully until things are completely clear. > If there's some other reason that the archive command has been > failing, what is it? No other reason. It was failing only because my WAL archive drive was full. Regards, Gnanam
Hi Florian, Thanks for your valuable suggestion and a detailed step on common way to use PITR. Things are very clear now except that I've some other question in connection to this. > The correct way to clean out pg_xlog therefore is to either disable WAL archiving, or to make sure your archive_command succeeds eventually. Probably I would go with the 2nd option, that is allowing archive command to run successfully until things are completely clear. But this question is for my understanding: In case if I decide to go with 1st option, that is disable WAL archiving for a while, will it completely clean out files from pg_xlog/ and pg_xlog/archive_status/ directories, so that I can start the PITR by taking base backup by enabling WAL archiving later? > A common way to use PITR is the following. > 4) You remove all WAL segments that predate the remaining base backups. For that, you find the backup history file in the archive directory that corresponds to the oldest remaining base backup and then remove all WAL segments whose name is numerically smaller than the <number1> from that backup history file. Keeping older WAL segments buys you nothing - WAL files without a base backup that *predates* them are worthless. Can you share with me any automated shell script that takes care of this removal automatically? Or can you share any systematic way (steps) of doing things if I want to do this manually? Regards, Gnanam
"Gnanakumar" wrote: > I couldn't able to get this particular step clearly: "One trick > would be to temporarily change your archive_command to 'true', > delete all files from your archive, and then change the command > back ". Can you please clarify and explain on this? Based on other statements you've made, this isn't a trick you want to use; just make space in the archive directory, let archiving catch up, and then take a fresh base backup. That said, this trick is a way to tell PostgreSQL the archive was successful, even though it wasn't actually copied. This is occassionally a useful trick to clear out a backlog of WAL files very quickly, at the cost of creating a gap in your WAL archive. Your OS likely has an executable and/or a shell builtin named "true" which does nothing except return the "success" exit code of zero. If you have such a command on your OS and you set your archive command to that, PostgreSQL will blast through cleaning up old WAL files. kevin@kevin-desktop:~$ true kevin@kevin-desktop:~$ echo $? 0 But since you said you can copy off the contents of your archive directory and delete to make room, that's clearly the way to go. -Kevin
On Jun 5, 2010, at 9:05 , Gnanakumar wrote: > Thanks for your valuable suggestion and a detailed step on common way to use > PITR. Things are very clear now except that I've some other question in > connection to this. > >> The correct way to clean out pg_xlog therefore is to either disable WAL archiving, or to make sure your archive_commandsucceeds eventually. > > Probably I would go with the 2nd option, that is allowing archive command to > run successfully until things are completely clear. > > But this question is for my understanding: In case if I decide to go with > 1st option, that is disable WAL archiving for a while, will it completely > clean out files from pg_xlog/ and pg_xlog/archive_status/ directories, so > that I can start the PITR by taking base backup by enabling WAL archiving > later? If you disable WAL archiving by setting archive_command to 'true', it'll surely clean out the files, since postgresql willactually believe it archived them successfully. I not sure what happens if you set archive_command to '' - that mightdisable the archiving process completely, and hence prevent the cleanup. >> A common way to use PITR is the following. > >> 4) You remove all WAL segments that predate the remaining base backups. For that, you find the backup history file inthe archive directory that corresponds to the oldest remaining base backup and then remove all WAL segments whose nameis numerically smaller than the <number1> from that backup history file. Keeping older WAL segments buys you nothing- WAL files without a base backup that *predates* them are worthless. > > Can you share with me any automated shell script that takes care of this > removal automatically? Or can you share any systematic way (steps) of doing > things if I want to do this manually? Sorry, I don't have a script for this at hand. But a quick search through the pgsql-admin archive brings up this post, whichcontains such a script. http://archives.postgresql.org/pgsql-admin/2006-03/msg00337.php best regards, Florian Pflug
Here's a script to make your backup and rsync it to a remote destination: #!/bin/bash echo "checkpoint" echo "CHECKPOINT;" | /local/pkg/bin/psql template1 echo "start backup" echo "SELECT pg_start_backup('cisoradr:/cis/pgsql/katana7/backup');" | /local/pkg/bin/psql template1 echo "rsync" /local/pkg/bin/rsync --delete -azxH /local/app/postgres/data pgsql@cisoradr-ext:/cis/pgsql/katana7/backup/. echo "Stop backup" echo "SELECT pg_stop_backup();" | /local/pkg/bin/psql template1 Sam -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Florian Pflug Sent: Sunday, 6 June 2010 10:11 PM To: gnanam@zoniac.com Cc: pgsql-admin@postgresql.org Subject: Re: [ADMIN] PITR Recovery Question On Jun 5, 2010, at 9:05 , Gnanakumar wrote: > Thanks for your valuable suggestion and a detailed step on common way to use > PITR. Things are very clear now except that I've some other question in > connection to this. > >> The correct way to clean out pg_xlog therefore is to either disable WAL archiving, or to make sure your archive_commandsucceeds eventually. > > Probably I would go with the 2nd option, that is allowing archive command to > run successfully until things are completely clear. > > But this question is for my understanding: In case if I decide to go with > 1st option, that is disable WAL archiving for a while, will it completely > clean out files from pg_xlog/ and pg_xlog/archive_status/ directories, so > that I can start the PITR by taking base backup by enabling WAL archiving > later? If you disable WAL archiving by setting archive_command to 'true', it'll surely clean out the files, since postgresql willactually believe it archived them successfully. I not sure what happens if you set archive_command to '' - that mightdisable the archiving process completely, and hence prevent the cleanup. >> A common way to use PITR is the following. > >> 4) You remove all WAL segments that predate the remaining base backups. For that, you find the backup history file inthe archive directory that corresponds to the oldest remaining base backup and then remove all WAL segments whose nameis numerically smaller than the <number1> from that backup history file. Keeping older WAL segments buys you nothing- WAL files without a base backup that *predates* them are worthless. > > Can you share with me any automated shell script that takes care of this > removal automatically? Or can you share any systematic way (steps) of doing > things if I want to do this manually? Sorry, I don't have a script for this at hand. But a quick search through the pgsql-admin archive brings up this post, whichcontains such a script. http://archives.postgresql.org/pgsql-admin/2006-03/msg00337.php best regards, Florian Pflug -- Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin
Hi Kevin, Thanks for the clarification. > kevin@kevin-desktop:~$ true > kevin@kevin-desktop:~$ echo $? > 0 Yes, my OS also has got this executable and is working. Regards, Gnanam -----Original Message----- From: Kevin Grittner [mailto:Kevin.Grittner@wicourts.gov] Sent: Saturday, June 05, 2010 7:39 PM To: pgsql-admin@postgresql.org; gnanam@zoniac.com Cc: fgp@phlo.org Subject: RE: [ADMIN] PITR Recovery Question "Gnanakumar" wrote: > I couldn't able to get this particular step clearly: "One trick > would be to temporarily change your archive_command to 'true', > delete all files from your archive, and then change the command > back ". Can you please clarify and explain on this? Based on other statements you've made, this isn't a trick you want to use; just make space in the archive directory, let archiving catch up, and then take a fresh base backup. That said, this trick is a way to tell PostgreSQL the archive was successful, even though it wasn't actually copied. This is occassionally a useful trick to clear out a backlog of WAL files very quickly, at the cost of creating a gap in your WAL archive. Your OS likely has an executable and/or a shell builtin named "true" which does nothing except return the "success" exit code of zero. If you have such a command on your OS and you set your archive command to that, PostgreSQL will blast through cleaning up old WAL files. kevin@kevin-desktop:~$ true kevin@kevin-desktop:~$ echo $? 0 But since you said you can copy off the contents of your archive directory and delete to make room, that's clearly the way to go. -Kevin
Hi Florian, Thanks for the clarification and a link to a post on automated script. On Jun 5, 2010, at 9:05 , Gnanakumar wrote: > Thanks for your valuable suggestion and a detailed step on common way to use > PITR. Things are very clear now except that I've some other question in > connection to this. > >> The correct way to clean out pg_xlog therefore is to either disable WAL archiving, or to make sure your archive_command succeeds eventually. > > Probably I would go with the 2nd option, that is allowing archive command to > run successfully until things are completely clear. > > But this question is for my understanding: In case if I decide to go with > 1st option, that is disable WAL archiving for a while, will it completely > clean out files from pg_xlog/ and pg_xlog/archive_status/ directories, so > that I can start the PITR by taking base backup by enabling WAL archiving > later? If you disable WAL archiving by setting archive_command to 'true', it'll surely clean out the files, since postgresql will actually believe it archived them successfully. I not sure what happens if you set archive_command to '' - that might disable the archiving process completely, and hence prevent the cleanup. >> A common way to use PITR is the following. > >> 4) You remove all WAL segments that predate the remaining base backups. For that, you find the backup history file in the archive directory that corresponds to the oldest remaining base backup and then remove all WAL segments whose name is numerically smaller than the <number1> from that backup history file. Keeping older WAL segments buys you nothing - WAL files without a base backup that *predates* them are worthless. > > Can you share with me any automated shell script that takes care of this > removal automatically? Or can you share any systematic way (steps) of doing > things if I want to do this manually? Sorry, I don't have a script for this at hand. But a quick search through the pgsql-admin archive brings up this post, which contains such a script. http://archives.postgresql.org/pgsql-admin/2006-03/msg00337.php best regards, Florian Pflug
Hi Sam, Thank your sharing this script. > Here's a script to make your backup and rsync it to a remote destination: > #!/bin/bash > echo "checkpoint" > echo "CHECKPOINT;" | /local/pkg/bin/psql template1 > echo "start backup" > echo "SELECT pg_start_backup('cisoradr:/cis/pgsql/katana7/backup');" | /local/pkg/bin/psql template1 > echo "rsync" > /local/pkg/bin/rsync --delete -azxH /local/app/postgres/data pgsql@cisoradr-ext:/cis/pgsql/katana7/backup/. > echo "Stop backup" > echo "SELECT pg_stop_backup();" | /local/pkg/bin/psql template1
Hi, My production server is running PostgreSQL v8.2.3 on CentOS release 5.2 (Final). I've setup PITR in my production server. For some reason, after setting up PITR, we're not able to manage and maintain it. PITR SETUP DETAILS We've 2 drives. Primary drive (pgsql/data/ directory resides) is 400 GB and secondary drive (WAL archive) is 300 GB. All WAL archives are written to secondary drive. Base backup taken on: Aug03, 2009 WAL archive drive become full (100% use) on: Jul01, 2010 Because this WAL archive drive has become full, all WAL archive segments to be archived are accumulated into pg_xlog/ directory itself. Eventually, almost 4 months (as of today from Jul01, 2010) of WAL archives are residing in pg_xlog/ directory. ISSUE We then decided to stop PITR in our database. Steps which I followed are: 1. service postgresql stop 2. Detached (unmounted) 300 GB secondary drive (WAL archive) 3. Disabled WAL archiving by commenting (#) 2 postgresql.conf parameters: "archive_command" and "archive_timeout". 4. service postgresql start Now, the database has started without any issues. But when I tried to stop database using 'service postgresql stop' within few minutes after it's been started, it was not shutting down and responded with the following message after few minutes: "Stopping PostgreSQL: pg_ctl: server does not shut down" I even checked serverlog for whether any errors/issues are reported, but I don't find any. Also, when I tried to login to my database at this time, it says: psql mydb mydbuser psql: FATAL: the database system is shutting down but not actually shutting down ever. I'm clueless why it is happening so and not shutting down ever. My questions are: 1. What could be the root cause of PostgreSQL server not shutting down ever? 2. Am I following a different/incorrect way of disabling PITR? 3. Considering this situation, now how do I safely disable PITR in my database? Experts advice/idea/suggestion on this are appreciated. Regards, Gnanam
Gnanakumar wrote: > 1. What could be the root cause of PostgreSQL server not shutting > down ever? > 2. Am I following a different/incorrect way of disabling PITR? > You can check if PITR is disabled by executing: SHOW archive_command; At a psql prompt. If that's empty, then you turned it off, and it isn't involved in the shutdown issue you have anymore. Normally when the server doesn't want to shut down, that's because of a client that won't exit normally, not something in the main server. Take a look at what processes are still running against the database and see if there are clients attached after the fast shutdown attempt. If so, those are your problem, not something to PITR. If you know every client has been stopped or is doing nothing useful, you can do an immediate shutdown of the server and kick everyone off. You'll get messages about unclean shutdown, but no data should be lost. The "service" scripts don't do that though. You'll need to do something like this instead, presuming you start as root: su postgres pg_ctl -D /var/lib/pgsql/data -m immedate stop -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
> You can check if PITR is disabled by executing: > SHOW archive_command; > At a psql prompt. If that's empty, then you turned it off, and it isn't involved in the shutdown issue you have anymore. Yes, I could see empty value here. > Take a look at what processes are still running against the database and see if there are clients attached after the fast shutdown attempt. > If so, those are your problem, not something to PITR. Yes, I can see the following processes still running. I used the command 'ps -ef |grep postgres' to list. postgres 3376 1 0 07:20 ? 00:00:00 /mnt/database/pgsql/bin/postgres -D /usr/local/pgsql/data postgres 3381 3376 0 07:20 ? 00:00:00 postgres: logger process postgres 3383 3376 0 07:20 ? 00:00:00 postgres: writer process But why logger and writer process are still running/showing up here for a very long time after issuing 'service postgresql stop'. What could be the problem? Also, am sure here that no other clients (database connections) are connected at this time.
>> You can check if PITR is disabled by executing: >> SHOW archive_command; >> At a psql prompt. If that's empty, then you turned it off, and it isn't >> involved in the shutdown issue you have anymore. > Yes, I could see empty value here. >> Take a look at what processes are still running against the database and >> see if there are clients attached after the fast shutdown attempt. >> If so, those are your problem, not something to PITR. > Yes, I can see the following processes still running. I used the command 'ps > -ef |grep postgres' to list. > postgres 3376 1 0 07:20 ? 00:00:00 > /mnt/database/pgsql/bin/postgres -D /usr/local/pgsql/data > postgres 3381 3376 0 07:20 ? 00:00:00 postgres: logger process > postgres 3383 3376 0 07:20 ? 00:00:00 postgres: writer process > But why logger and writer process are still running/showing up here for a > very long time after issuing 'service postgresql stop'. What could be the > problem? Also, am sure here that no other clients (database connections) > are connected at this time. I can CONSISTENTLY reproduce this problem (PITR walarchive drive is not attached/mounted) and I still don't know the reason behind this issue: 1) If I disable PITR and start postgresql service, consistently server does not shut down on 'service postgresql stop' command. I have tried and checked this upto 4 times. NOTE: After disabling PITR, on checking "SHOW archive_command" from psql, it is showing empty value. But on 'ps -ef |grep postgres', it is always showing up logger and writer process running. 2) But if I enable back PITR and start postgresql service, consistently server shuts down properly on 'service postgresql stop' command. Again I have tried this cycle also upto 4 times. When I mean enable/disable PITR, it is about enabling/disabling 2 configuration parameters: "archive_command" and "archive_timeout". Since this is our Production server, experts assistance to solve this issue are highly appreciated.