Thread: Minimizing Recovery Time (wal replication)
I have two hot-spare databases that use WAL archiving and continuous recovery mode. I want to minimize recovery time when we have to fail over to one of our hot spares. Right now, I'm seeing the following behavior, which makes a quick recovery seem problematic:

(1) The hot spare applies 70 to 75 WAL files (~1.1 GB) in a 2 to 3 minute period.

(2) The hot spare pauses for 15 to 20 minutes; during this period pdflush consumes 99% of I/O (per iotop). Dirty (from /proc/meminfo) spikes to ~760 MB, stays at that level for the first 10 minutes, and then slowly ticks down to 0 over the second 10 minutes.

(3) goto 1

My concern is that if the database has been in recovery mode for some time, even if it's caught up, going live sometime during (1) could mean a recovery time of upwards of 20 minutes. We've experienced delays during failover in the past (not 20 minutes, but long enough to make me second-guess what we are doing). I want to better understand what is going on so that I can determine what, if anything, I can do to minimize downtime when we fail over to one of our hot spares.

Here are my current settings:

postgres (v8.3.7):
shared_buffers = 2GB                 (15GB total RAM)
effective_cache_size = 12GB          (15GB total RAM)
checkpoint_segments = 10
checkpoint_completion_target = 0.7
(other checkpoint/bgwriter settings left at their default values)

sysctl:
kernel.shmmax = 2684354560
vm.dirty_background_ratio = 1
vm.dirty_ratio = 5

Thanks,
Bryan
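For context, this is roughly what the standby side of an 8.3-era warm-standby setup like the one described looks like. It is only a sketch: the data directory and paths are assumptions, and the restore script name stands in for the kind of waiting script Bryan posts later in this thread (contrib/pg_standby would also work).

#!/bin/bash
# Sketch only -- hypothetical paths; adjust for the real installation.
PGDATA=/var/lib/postgresql/8.3/main    # assumed standby data directory

# recovery.conf keeps the server in continuous recovery.  PostgreSQL calls
# restore_command for each segment it needs; a waiting restore script blocks
# until the segment arrives, and its non-zero exit (e.g. on a trigger file)
# is what ends recovery and brings the database up.
cat > "$PGDATA/recovery.conf" <<'EOF'
restore_command = '/usr/local/bin/dbrestore /srv/logs/%f %p'
EOF

pg_ctl -D "$PGDATA" start              # starts in recovery and stays there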
On Thu, 9 Apr 2009, Bryan Murphy wrote:

> (1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period

Yeah, if you ever let this many files queue up you're facing a long recovery time. You really need to get into a position where you're applying WAL files regularly enough that you don't ever fall this far behind.

> (2) hot spare pauses for 15 to 20 minutes, during this period pdflush
> consumes 99% IO (iotop). Dirty (from /proc/meminfo) spikes to ~760mb,
> remains at that level for the first 10 minutes, and then slowly ticks
> down to 0 for the second 10 minutes.

What does vmstat say about the bi/bo during this time period? It sounds like the volume of random I/O produced by recovery is just backing up as expected. Some quick math:

15GB RAM * 5% dirty_ratio = 750MB; there's where your measured 760MB bottleneck is coming from.

750MB / 10 minutes = 1.25MB/s; that's in the normal range for random writes with a single disk.

Therefore my bet is that "vmstat 1" will show bo~=1250 the whole time you're waiting there, with matching figures from the iostat to the database disk during that period.

Basically your options here are:

1) Decrease the maximum possible segment backlog so you can never get this far behind

2) Increase the rate at which random I/O can be flushed to disk by either
   a) Improving things with a [better] battery-backed controller disk cache
   b) Stripe across more disks

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Thu, Apr 9, 2009 at 6:38 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> What does vmstat say about the bi/bo during this time period? It sounds
> like the volume of random I/O produced by recovery is just backing up as
> expected. Some quick math:

I'll have to capture this; unfortunately I won't be able to do that until tomorrow. The machine I was looking at has already been failed over and I'm currently creating a new snapshot. I won't have a new hot spare to replace it until the morning.

> 15GB RAM * 5% dirty_ratio = 750MB; there's where your measured 760MB
> bottleneck is coming from.

That was what I thought; good to have it confirmed by somebody else.

> 750MB / 10 minutes = 1.25MB/s; that's in the normal range for random writes
> with a single disk.

Yes, this is an interesting problem I'm having; more on it below...

> Therefore my bet is that "vmstat 1" will show bo~=1250 the whole time you're
> waiting there, with matching figures from the iostat to the database disk
> during that period.
>
> Basically your options here are:
>
> 1) Decrease the maximum possible segment backlog so you can never get this
> far behind

I understand conceptually what you are saying, but I don't know how to practically realize this. :)  Do you mean lower checkpoint_segments?

> 2) Increase the rate at which random I/O can be flushed to disk by either
>    a) Improving things with a [better] battery-backed controller disk cache
>    b) Stripe across more disks

This is the problem that has been my nightmare for the past few months. It actually is an 8-drive RAID 10, BUT it's on virtualized infrastructure up in Amazon's cloud, running on 8 EBS volumes. I've found performance to be... inconsistent at best. Sometimes it's great, sometimes it's not so great.

We have a legacy database (~120 GB) which grew in our old data center on very powerful hardware. We moved it up to Amazon's cloud a few months ago and have been scrambling ever since. I wouldn't change what we're doing; the benefits so far have outweighed the pain, and we're actively working on the software to make better use of the cloud infrastructure (i.e. many smaller databases instead of one big database, lots of caching, the usual stuff). Unfortunately, that takes time, and I'm trying to limp along as best I can with the legacy database until we can get everything migrated.

So, to recap: I've RAIDed up the volumes, thrown as much RAM and CPU at the process as is available, and just can't seem to tease any more performance out.

Thanks,
Bryan
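A minimal sketch of the kind of capture Greg is asking for, to be run on the standby during the pause; the output paths are assumptions, and iostat here logs all devices rather than guessing the name of the database volume.

#!/bin/bash
# Sketch: log vmstat, extended iostat, and the kernel's Dirty counter while
# the standby is replaying WAL, so the pause can be correlated with write-back.
OUT=/tmp/standby-io              # assumed output prefix

vmstat 1    > "$OUT.vmstat" &    # watch the bi/bo columns Greg mentions
iostat -x 1 > "$OUT.iostat" &    # per-device write rates and utilisation

# Sample the dirty-page counter every 5 seconds until interrupted.
while true; do
    echo "$(date '+%H:%M:%S') $(grep '^Dirty:' /proc/meminfo)" >> "$OUT.dirty"
    sleep 5
done

If Greg's guess is right, the iostat output for the database device should show it busy nearly the whole time at only around 1-1.5 MB/s of writes during the pause.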
On Thu, 9 Apr 2009, Bryan Murphy wrote:

>> 1) Decrease the maximum possible segment backlog so you can never get this
>> far behind
>
> I understand conceptually what you are saying, but I don't know how to
> practically realize this. :)  Do you mean lower checkpoint_segments?

Theoretically, every time the archive_command writes a new segment out you can immediately move that to your standby, and set up the standby to regularly look for those and apply them as they come in. The fact that you're getting so many of them queued up suggests there's something in that path that isn't moving that pipeline along aggressively enough; without knowing more about what you're doing it's hard to say where that is.

> It actually is an 8 drive raid 10, BUT, it's on virtualized
> infrastructure up in Amazon's cloud running on 8 EBS volumes. I've
> found performance to be... inconsistent at best.

Yeah, EBS is not exactly a high-performance or predictable database storage solution, particularly when you get to where you're calling fsync a lot--which is exactly what is happening during the period you note your system is frozen.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
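One rough way to see where in that pipeline the segments are piling up is to compare the newest segment on the primary, the newest segment that has been shipped, and the last segment the standby actually restored. A sketch, with hypothetical hostnames, directories, and server log location:

# Newest WAL segment the primary is currently writing:
ssh primary 'ls /var/lib/postgresql/8.3/main/pg_xlog | grep -E "^[0-9A-F]{24}$" | sort | tail -1'

# Newest segment that has been shipped to the standby:
ls /srv/logs/ | grep -E '^[0-9A-F]{24}' | sort | tail -1

# Last segment the standby actually restored, taken from its server log
# ("restored log file ... from archive" is logged for each segment):
grep 'restored log file' /var/log/postgresql/postgresql-8.3-main.log | tail -1

If the shipped files keep pace with the primary but the restored segment lags far behind, the bottleneck is replay and flushing on the standby (which is what the pdflush behaviour above points to); if the shipped files themselves lag, the archive/copy path is the problem.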
On Thu, Apr 9, 2009 at 7:33 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>>> 1) Decrease the maximum possible segment backlog so you can never get
>>> this far behind
>>
>> I understand conceptually what you are saying, but I don't know how to
>> practically realize this. :)  Do you mean lower checkpoint_segments?
>
> Theoretically, every time the archive_command writes a new segment out you
> can immediately move that to your standby, and set up the standby to
> regularly look for those and apply them as they come in. The fact that
> you're getting so many of them queued up suggests there's something in that
> path that isn't moving that pipeline along aggressively enough; without
> knowing more about what you're doing it's hard to say where that is.

This is our archiving command:

#!/bin/bash
# $1 = WAL segment to archive (%p), $2 = its file name (%f)
echo "archiving $2.bz2"
# Compress locally, then copy to the standby under a temporary name and
# rename it, so the standby never picks up a partially transferred file.
bzip2 -k -9 -c "$1" > "/srv/pg_logs/archive/$2.bz2.tmp" || exit $?
mv "/srv/pg_logs/archive/$2.bz2.tmp" "/srv/pg_logs/archive/$2.bz2" || exit $?
scp "/srv/pg_logs/archive/$2.bz2" "w.x.y.z:/srv/logs/$2.bz2.tmp" || exit $?
ssh w.x.y.z "mv /srv/logs/$2.bz2.tmp /srv/logs/$2.bz2" || exit $?
rm "/srv/pg_logs/archive/$2.bz2" || exit $?

And this is our restore command:

#!/bin/bash
# $1 = archived segment to look for (without .bz2), $2 = where PostgreSQL
# wants the restored segment written (%p)
if [ "$1" == "" ] || [ "$2" == "" ]; then
    echo "dbrestore [source] [destination]"
    exit 1
fi

echo "`date`: restoring $1"

while true
do
    # Segment has arrived: decompress it and hand it to the server.
    if [ -f "$1.bz2" ]; then
        echo "`date`: restore $1.bz2 -> $2"
        bunzip2 -d -c "$1.bz2" > "$2.tmp"
        mv "$2.tmp" "$2"
        exit 0
    fi

    # Don't block waiting for timeline history files, which usually don't exist.
    if [[ "$1" =~ ".history" ]]; then
        echo "`date`: skipping $1"
        exit 1
    fi

    # Touching /tmp/golive ends recovery: the non-zero exit tells
    # PostgreSQL there is no more WAL, so the standby comes up live.
    if [ -f "/tmp/golive" ]; then
        echo "`date`: going live"
        rm -f "/tmp/golive"
        exit 2
    fi

    sleep 5s
done

Essentially, what we do is bzip2 the file, scp it to the backup server, and then rename it over ssh. The bzip2 is legacy from when we were uploading to Amazon via the public internet and can go away now. The rename can happen in the restore script instead, which is something I should probably change anyway; one less thing for the master database to do.

We create file system snapshots of the hot spares, and I periodically purge the old log files after I've verified that we can bring the most recent snapshot live.

We've used NFS in the past, but we're currently investigating other distribution alternatives (primarily londiste and pgpool2). We've used Slony in the past, but find it introduces too much administrative overhead and is too brittle for our tastes.

Thanks again!
Bryan
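Following Bryan's own notes above, here is a sketch of what the archiver could shrink to once the compression and the separate ssh rename are dropped. The host and paths are the placeholders from his scripts, and rsync is only one way to do it: it copies to a temporary name on the receiving side and renames the file into place when the transfer completes, preserving the never-see-a-partial-file property in a single step.

#!/bin/bash
# Hypothetical simplified archive script, e.g. called from postgresql.conf as
#   archive_command = '/usr/local/bin/dbarchive %p %f'
# $1 = path to the WAL segment (%p), $2 = segment file name (%f)

echo "archiving $2"
# rsync writes to a dot-temporary file on the standby and renames it into
# place once the copy finishes, so no separate ssh rename is needed.
rsync -a "$1" "w.x.y.z:/srv/logs/$2" || exit $?

The restore script would then look for "$1" instead of "$1.bz2" and copy it straight to "$2" rather than running bunzip2.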
On Thu, 2009-04-09 at 14:27 -0500, Bryan Murphy wrote:
> I have two hot-spare databases that use wal archiving and continuous
> recovery mode. I want to minimize recovery time when we have to fail
> over to one of our hot spares. Right now, I'm seeing the following
> behavior which makes a quick recovery seem problematic:
>
> (1) hot spare applies 70 to 75 wal files (~1.1g) in 2 to 3 min period
>
> (2) hot spare pauses for 15 to 20 minutes, during this period pdflush
> consumes 99% IO (iotop). Dirty (from /proc/meminfo) spikes to ~760mb,
> remains at that level for the first 10 minutes, and then slowly ticks
> down to 0 for the second 10 minutes.
>
> (3) goto 1

The database is performing too frequent restartpoints.

This has been optimised in PostgreSQL 8.4 by the addition of the bgwriter running during recovery. This will mean that your hot spare will not pause while waiting for restartpoint to complete, so will be significantly faster.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sun, Apr 12, 2009 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The database is performing too frequent restartpoints.
>
> This has been optimised in PostgreSQL 8.4 by the addition of the
> bgwriter running during recovery. This will mean that your hot spare
> will not pause while waiting for restartpoint to complete, so will be
> significantly faster.

8.4 is already looking like it's going to be a great upgrade for us; this would be another win.

Thanks,
Bryan
If you're using full_page_writes = off, prefetching data pages referenced by the WAL speeds up recovery too. I proposed this patch, but it was carried over to 8.5. I'm planning to release it as an external tool, which will not work well with Hot Standby but will work in normal recovery. I'm not sure how soon I can do it, but I hope it works.

2009/4/13 Bryan Murphy <bmurphy1976@gmail.com>:
> On Sun, Apr 12, 2009 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The database is performing too frequent restartpoints.
>>
>> This has been optimised in PostgreSQL 8.4 by the addition of the
>> bgwriter running during recovery. This will mean that your hot spare
>> will not pause while waiting for restartpoint to complete, so will be
>> significantly faster.
>
> 8.4 is already looking like it's going to be a great upgrade for us;
> this would be another win.
>
> Thanks,
> Bryan

--
------
Koichi Suzuki