Thread: Warm standby recovery failure
I'm new to the list, so hello everyone. This morning my monitoring
system sent me an alert that my warm standby database had exited
recovery mode. I checked on it when I woke up and, sure enough, it had
stopped looking for new WAL files and had become available for
queries. After some digging, I uncovered these details, in
chronological order:

*NAGIOS*
Date/Time: Thu Jan 29 12:36:04 UTC 2009
DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%):

** MYSTERY PERIOD **
(two crontabs to do WAL maintenance are scheduled for 04:00, not
applicable to this situation)

*NAGIOS*
Date/Time: Thu Jan 29 12:49:05 UTC 2009
DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%)

* me working in a shell on the host *

# date
Thu Jan 29 15:33:35 UTC 2009
# df -h /dev/sdc1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1        50G  3.3G   44G   7% /var/lib/postgresql

/var/log/postgresql/postgresql-8.3-main.log:

2009-01-29 12:41:36 UTC LOG:  restored log file "000000010000002F000000A9" from archive
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA'
cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
2009-01-29 12:48:14 UTC LOG:  could not read from log file 47, segment 170, offset 3129344: Success
2009-01-29 12:48:14 UTC LOG:  redo done at 2F/AA2FBE08
2009-01-29 12:48:14 UTC LOG:  last completed transaction was at log time 2009-01-29 12:42:50.572406+00
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
cp: overwrite `pg_xlog/./000000010000002F000000AA'?
2009-01-29 12:48:14 UTC LOG:  restored log file "000000010000002F000000AA" from archive
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  selected new timeline ID: 2
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  archive recovery complete

# /etc/init.d/postgresql-8.3 start
...
# ps ax | grep postgres
 3499 ?  Ss  0:00 postgres: startup process waiting for 000000020000002F000000AA
 3503 ?  S   0:00 /bin/bash /var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG 000000020000002F000000AA

WTF? (before I knew about timelines)

# ls -l /mnt/pg_xlog/000000020000002F000000AA
ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file or directory
# updatedb
# locate 000000020000002F000000AA
(nothing)
# rm /var/lib/postgresql/8.3/main/pg_xlog/0*
(perhaps a bad move? I still have these in /mnt/pg_xlog)
# /etc/init.d/postgresql-8.3 restart

Same problem. WTF?

# rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready
# /etc/init.d/postgresql-8.3 restart
# ps ax | grep postgres
 3499 ?  Ss  0:00 postgres: startup process waiting for 000000020000002F000000AA
 3503 ?  S   0:00 /bin/bash /var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG 000000020000002F000000AA

Why is it looking for 000000020000002F000000AA? It should be looking
for 000000010000002F000000AA. This seems to be a timeline-history
problem: recovery branched onto a new timeline, which is where it is
now. I need to force it back to the previous timeline so it can resume
recovery.

I'm building a clean recovery database now, but I'd like to get to the
bottom of this and bring the failed recovery database back into
recovery mode, so I can learn how to get out of this situation more
gracefully. Thanks for any help you can offer.

_____________
Lee Azzarello
Staff Hacker
drop.io
Lee Azzarello <lee@dropio.com> writes:
> cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
> 2009-01-29 12:48:14 UTC LOG: could not read from log file 47, segment
> 170, offset 3129344: Success
> 2009-01-29 12:48:14 UTC LOG: redo done at 2F/AA2FBE08

The bottom line here seems to be that your logfile transfer process
passed a damaged (partial) WAL file to the slave postmaster, which
interpreted the damage as being end of WAL. I'm not sure whether we
could find a way to prevent it from doing that without causing bigger
problems. In any case, the most useful solution is going to involve
fixing the transfer script so that it notices failures from 'cp' and
doesn't pass the broken file on to the slave.

We probably should add a caution about this to the manual's discussion
of how to write archiving scripts.

			regards, tom lane
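[A restore-side script along the lines Tom describes might look like
the sketch below. This is not Lee's actual recover.sh; the
`safe_restore` name and the `ARCHIVE` location are illustrative. The
point is that a partial copy, such as one hitting a full disk, must
never be reported to the postmaster as success.]

```shell
# safe_restore SEGMENT DEST -- hypothetical restore helper (a sketch,
# assuming $ARCHIVE points at the WAL archive, e.g. /mnt/pg_xlog).
safe_restore() {
    seg="$1"                          # WAL file name (%f)
    dest="$2"                         # destination path (%p)

    # Segment not archived yet: fail, so the server keeps waiting.
    [ -f "$ARCHIVE/$seg" ] || return 1

    # A failed cp (e.g. "No space left on device") must not leave a
    # truncated file behind for the server to mistake for end of WAL.
    if ! cp "$ARCHIVE/$seg" "$dest"; then
        rm -f "$dest"
        return 1
    fi

    # Belt and braces: verify nothing was truncated on the way.
    src_size=$(wc -c < "$ARCHIVE/$seg")
    dst_size=$(wc -c < "$dest")
    if [ "$src_size" -ne "$dst_size" ]; then
        rm -f "$dest"
        return 1
    fi
    return 0
}
```

[From recovery.conf such a helper would be wired in via something like
restore_command = '/path/to/safe_restore.sh %f %p'.]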
On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We probably should add a caution about this to the manual's discussion
> of how to write archiving scripts.

I presume you mean the copy/transfer process did not do its job
correctly, Tom. Therefore, I would advise using a script that compares
the MD5/SHA1 sums of the origin and destination files and retries the
copy/transfer process before returning an exit status to the server
(the archive_command call). Do you think this would suffice?

--
Jaume Sabater
http://linuxsilo.net/

"Ubi sapientas ibi libertas"
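[Jaume's checksum-and-retry idea, on the archive_command side, could
be sketched roughly as below. The function name, the `ARCHIVE_DIR`
default, and the retry count of 3 are all illustrative assumptions,
not anything from the thread.]

```shell
# archive_with_checksum SRC SEGNAME -- hypothetical archive helper:
# copy the WAL file, compare MD5 sums, retry a few times, and only
# then report success back to the server.
archive_with_checksum() {
    src="$1"                               # full path of the WAL file (%p)
    dst="${ARCHIVE_DIR:-/mnt/pg_xlog}/$2"  # file name in the archive (%f)
    tries=3
    while [ "$tries" -gt 0 ]; do
        if cp "$src" "$dst"; then
            a=$(md5sum < "$src" | awk '{print $1}')
            b=$(md5sum < "$dst" | awk '{print $1}')
            [ "$a" = "$b" ] && return 0    # verified copy: success
        fi
        rm -f "$dst"                       # discard the bad copy and retry
        tries=$((tries - 1))
    done
    return 1    # nonzero exit: PostgreSQL keeps the WAL and retries later
}
```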
Jaume Sabater <jsabater@gmail.com> writes:
> On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We probably should add a caution about this to the manual's discussion
>> of how to write archiving scripts.

> I presume you mean the copy/transfer process did not do its job
> correctly, Tom. Therefore, I would advise using a script that compares
> the MD5/SHA1 sums of the origin and destination files and retries the
> copy/transfer process before returning an exit status to the server
> (archive_command call). Do you think this would suffice?

That seems like using a sledgehammer to swat a fly. The problem as
I saw it was that the script was probably simply not checking for
an error result from 'cp'.

			regards, tom lane
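[The fly-swatter version of the fix is even shorter: just don't mask
cp's exit status. A sketch, with an assumed `ARCHIVE_DIR`; the likely
bug in the original script was ending with something like `exit 0`
regardless of what cp returned.]

```shell
# simple_archive SRC SEGNAME -- minimal archive helper: do the copy and
# let cp's own exit status be the result, instead of swallowing it.
simple_archive() {
    # If cp fails (e.g. "No space left on device"), its nonzero status
    # propagates; PostgreSQL then keeps the WAL segment and retries.
    cp "$1" "${ARCHIVE_DIR:-/mnt/pg_xlog}/$2"
}
```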
Thank you all for your help. It was indeed my copy script that
destroyed the database. I now understand that PostgreSQL shouldn't be
concerned with validating a WAL segment; that's the responsibility of
the script that hands the segment to postgres.

-lee

On Fri, Jan 30, 2009 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That seems like using a sledgehammer to swat a fly. The problem as
> I saw it was that the script was probably simply not checking for
> an error result from 'cp'.
>
>			regards, tom lane
On Fri, Jan 30, 2009 at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That seems like using a sledgehammer to swat a fly. The problem as
> I saw it was that the script was probably simply not checking for
> an error result from 'cp'.

Plus, if you're using a tool like rsync, it uses checksums to make
sure the transferred data is correct if you use the -c switch (I think
that's the one).

--
When fascism comes to America, it will be the intolerant selling it as
diversity.
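[For reference, the rsync variant might look like the fragment below.
The host name and archive path are illustrative. -c is indeed
--checksum, which makes rsync compare checksums rather than size and
mtime when deciding what to transfer; rsync also returns a nonzero
exit status on a failed transfer, which PostgreSQL treats as an
archiving failure and retries.]

```shell
# Hypothetical postgresql.conf fragment -- a sketch, not a tested setup:
archive_command = 'rsync -c %p standby:/mnt/pg_xlog/%f'
```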