Thread: Warm standby recovery failure

Warm standby recovery failure

From
Lee Azzarello
Date:
I'm new to the list. So hello everyone. This morning my monitoring
system sent me an alert that my warm standby database had exited
recovery mode. I checked on it when I woke up and sure enough, it had
stopped looking for new WAL files and became available for queries.
After some digging, I uncovered these details, in a chronological
order:

*NAGIOS*
Date/Time: Thu Jan 29 12:36:04 UTC 2009
DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%):

** MYSTERY PERIOD **
(two crontabs to do WAL maintance are scheduled for 04:00, not
applicable to this situation)

*NAGIOS*
Date/Time: Thu Jan 29 12:49:05 UTC 2009
DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%)

* me working in a shell on the host *
# date
Thu Jan 29 15:33:35 UTC 2009
# df -h /dev/sdc1
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc1              50G  3.3G   44G   7% /var/lib/postgresql

/var/log/postgresql/postgresql-8.3-main.log
2009-01-29 12:41:36 UTC LOG:  restored log file
"000000010000002F000000A9" from archive
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA'
cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
2009-01-29 12:48:14 UTC LOG:  could not read from log file 47, segment
170, offset 3129344: Success
2009-01-29 12:48:14 UTC LOG:  redo done at 2F/AA2FBE08
2009-01-29 12:48:14 UTC LOG:  last completed transaction was at log
time 2009-01-29 12:42:50.572406+00
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
cp: overwrite `pg_xlog/./000000010000002F000000AA'? 2009-01-29
12:48:14 UTC LOG:  restored log file "000000010000002F000000AA" from
archive
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  selected new timeline ID: 2
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG:  archive recovery complete

# /etc/init.d/postgresql-8.3 start
...
# ps ax | grep postgres
 3499 ?        Ss     0:00 postgres: startup process   waiting for
000000020000002F000000AA
 3503 ?        S      0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

WTF? (before I knew about timelines)

# ls -l /mnt/pg_xlog/000000020000002F000000AA
ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file
or directory

# updatedb
# locate 000000020000002F000000AA
(nothing)
# rm /var/lib/postgresql/8.3/main/pg_xlog/0*
(perhaps a bad move? I still have these in /mnt/pg_xlog)
# /etc/init.d/postgresql-8.3 restart

Same problem. WTF?

# rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready
# /etc/init.d/postgresql-8.3 restart
# ps ax | grep postgres
 3499 ?        Ss     0:00 postgres: startup process   waiting for
000000020000002F000000AA
 3503 ?        S      0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA

Why is it looking for 000000020000002F000000AA? It should be looking
for 000000010000002F000000AA.
This seems to be a history problem with timelines. Recovery branched
into a new timeline, which is where it is now. I need to force it back
to the previous timeline so it can resume recovery. I'm building a
clean recovery database now but I'd like to get to the bottom of this
and bring the failed recovery database back into recovery mode so I
can learn how to get out of this situation more gracefully.

Thanks for any help you can offer.
_____________
Lee Azzarello
Staff Hacker
drop.io

Re: Warm standby recovery failure

From
Tom Lane
Date:
Lee Azzarello <lee@dropio.com> writes:
> cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
> 2009-01-29 12:48:14 UTC LOG:  could not read from log file 47, segment
> 170, offset 3129344: Success
> 2009-01-29 12:48:14 UTC LOG:  redo done at 2F/AA2FBE08

The bottom line here seems to be that your logfile transfer process
passed a damaged (partial) WAL file to the slave postmaster, which
interpreted the damage as being end of WAL.  I'm not sure whether we
could find a way to prevent it from doing that without causing bigger
problems.  In any case the most useful solution is going to involve
fixing the transfer script so that it notices failures from 'cp' and
doesn't pass the broken file on to the slave.

We probably should add a caution about this to the manual's discussion
of how to write archiving scripts.

            regards, tom lane

Re: Warm standby recovery failure

From
Jaume Sabater
Date:
On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> We probably should add a caution about this to the manual's discussion
> of how to write archiving scripts.

I presume you mean the copy/transfer process did not do its job
correctly, Tom. Therefore, I would advise using a script that compares
the MD5/SHA1 sums of the origin and destination files and retries the
copy/transfer process before returning an exit status to the server
(archive_command call). Do you think this would suffice?

--
Jaume Sabater
http://linuxsilo.net/

"Ubi sapientas ibi libertas"

Re: Warm standby recovery failure

From
Tom Lane
Date:
Jaume Sabater <jsabater@gmail.com> writes:
> On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We probably should add a caution about this to the manual's discussion
>> of how to write archiving scripts.

> I presume you mean the copy/transfer process did not do its job
> correctly, Tom. Therefore, I would advise using a script that compares
> the MD5/SHA1 sums of the origin and destination files and retries the
> copy/transfer process before returning an exit status to the server
> (archive_command call). Do you think this would suffice?

That seems like using a sledgehammer to swat a fly.  The problem as
I saw it was that the script was probably simply not checking for
an error result from 'cp'.

            regards, tom lane

Re: Warm standby recovery failure

From
Lee Azzarello
Date:
thank you all for your help. It was indeed my copy script that
destroyed the database. I now understand that postgresql shouldn't be
concerned with validating a WAL segment, that's the responsibility of
the script that hands the segment to postgres.

-lee

On Fri, Jan 30, 2009 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jaume Sabater <jsabater@gmail.com> writes:
>> On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We probably should add a caution about this to the manual's discussion
>>> of how to write archiving scripts.
>
>> I presume you mean the copy/transfer process did not do its job
>> correctly, Tom. Therefore, I would advise using a script that compares
>> the MD5/SHA1 sums of the origin and destination files and retries the
>> copy/transfer process before returning an exit status to the server
>> (archive_command call). Do you think this would suffice?
>
> That seems like using a sledgehammer to swat a fly.  The problem as
> I saw it was that the script was probably simply not checking for
> an error result from 'cp'.
>
>                        regards, tom lane
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin
>

Re: Warm standby recovery failure

From
Scott Marlowe
Date:
On Fri, Jan 30, 2009 at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jaume Sabater <jsabater@gmail.com> writes:
>> On Fri, Jan 30, 2009 at 2:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We probably should add a caution about this to the manual's discussion
>>> of how to write archiving scripts.
>
>> I presume you mean the copy/transfer process did not do its job
>> correctly, Tom. Therefore, I would advise using a script that compares
>> the MD5/SHA1 sums of the origin and destination files and retries the
>> copy/transfer process before returning an exit status to the server
>> (archive_command call). Do you think this would suffice?
>
> That seems like using a sledgehammer to swat a fly.  The problem as
> I saw it was that the script was probably simply not checking for
> an error result from 'cp'.

Plus, if you're using a system like rsync it uses checksums to make
sure the data transferred is correct if you use the -c (think that's
the one) switch.
--
When fascism comes to America, it will be the intolerant selling it as
diversity.