Thread: Slow PITR restore

Slow PITR restore

From
"Joshua D. Drake"
Date:
Hello,

I have an interesting thing going on here...

2007-12-11 21:10:36 EST     LOG:  restored log file
"00000001000007CC00000012" from archive
2007-12-11 21:10:47 EST     LOG:  restored log file
"00000001000007CC00000013" from archive
2007-12-11 21:11:02 EST     LOG:  restored log file
"00000001000007CC00000014" from archive
2007-12-11 21:11:14 EST     LOG:  restored log file
"00000001000007CC00000015" from archive
2007-12-11 21:11:26 EST     LOG:  restored log file
"00000001000007CC00000016" from archive
2007-12-11 21:11:39 EST     LOG:  restored log file
"00000001000007CC00000017" from archive
2007-12-11 21:11:54 EST     LOG:  restored log file
"00000001000007CC00000018" from archive
2007-12-11 21:12:13 EST     LOG:  restored log file
"00000001000007CC00000019" from archive
2007-12-11 21:12:27 EST     LOG:  restored log file
"00000001000007CC0000001A" from archive
2007-12-11 21:12:42 EST     LOG:  restored log file
"00000001000007CC0000001B" from archive
2007-12-11 21:12:54 EST     LOG:  restored log file
"00000001000007CC0000001C" from archive
2007-12-11 21:13:08 EST     LOG:  restored log file
"00000001000007CC0000001D" from archive
2007-12-11 21:13:21 EST     LOG:  restored log file
"00000001000007CC0000001E" from archive
2007-12-11 21:13:36 EST     LOG:  restored log file
"00000001000007CC0000001F" from archive
2007-12-11 21:13:45 EST     LOG:  restored log file
"00000001000007CC00000020" from archive
2007-12-11 21:13:52 EST     LOG:  restored log file
"00000001000007CC00000021" from archive
2007-12-11 21:14:01 EST     LOG:  restored log file
"00000001000007CC00000022" from archive

Now I understand that restoring log files can be slow, but this is a big
machine.

32 GB of RAM
28 disks (two volumes, each RAID 10)
Two tablespaces (one over each volume)
8 Opterons.

The machine has a production configuration that can do several
hundred thousand transactions an hour without so much as having to
stretch first. It seems to me that this machine should be *cranking*
through these transactions. Am I just being my usual uptight, impatient
self?

Sincerely,

Joshua D. Drake

Re: Slow PITR restore

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> ... Now I understand that restoring log files can be slow but this is a big
> machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

Heikki has improved matters a bit in 8.3, but I suspect it's still
not great.

            regards, tom lane

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> ... Now I understand that restoring log files can be slow but this is a big
>> machine.
>
> Yeah, restoring is known to be less than speedy, because essentially
> zero optimization work has been done on it.
>
> Heikki has improved matters a bit in 8.3, but I suspect it's still
> not great.

Wow, o.k. Well, it is something we (the community) really should look at
for 8.4. I am surprised that it is slower than just walking through the
xlogs on recovery. I am sure there is a reason; I'm just surprised.

Thanks for answering.

Sincerely,

Joshua D. Drake


>
>             regards, tom lane
>


Re: Slow PITR restore

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Wow, o.k. well it is something we (the community) really should look at for
> 8.4. I am surprised that it is slower than just walking through the xlogs on
> recovery. I am sure there is a reason just surprised.

Well in the worst case it has to do nearly as much work as the original
database did. And it only gets to use 1 cpu so it can only have one i/o
request pending.

bgwriter is started already when doing recovery, right? Perhaps things could
be helped by telling bgwriter to behave differently during recovery.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Slow PITR restore

From
Simon Riggs
Date:
On Wed, 2007-12-12 at 08:55 +0000, Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
> > Wow, o.k. well it is something we (the community) really should look at for
> > 8.4. I am surprised that it is slower than just walking through the xlogs on
> > recovery. I am sure there is a reason just surprised.

It's the same speed because it is the same recovery code.

> Well in the worst case it has to do nearly as much work as the original
> database did. And it only gets to use 1 cpu so it can only have one i/o
> request pending.

That need only be slow in certain circumstances.

> bgwriter is started already when doing recovery, right? Perhaps things could
> be helped by telling bgwriter to behave differently during recovery.

It would certainly help if bgwriter came up earlier. I've been looking
at that just recently. The main issue is how to structure the code to
tell bgwriter when it can start processing in recovery mode and then
move into normal mode.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> bgwriter is started already when doing recovery, right? Perhaps things could
> be helped by telling bgwriter to behave differently during recovery.

No.

--
Alvaro Herrera                         http://www.flickr.com/photos/alvherre/
"Amanece.                                               (Ignacio Reyes)
 El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"

Re: Slow PITR restore

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > ... Now I understand that restoring log files can be slow but this is a big
> > machine.
>
> Yeah, restoring is known to be less than speedy, because essentially
> zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Simon Riggs wrote:
> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> ... Now I understand that restoring log files can be slow but this is a big
>>> machine.
>> Yeah, restoring is known to be less than speedy, because essentially
>> zero optimization work has been done on it.
>
> If there was a patch to improve this, would it be applied to 8.3?
>

Sheesh Simon, you are really pushing this release :). I would love to see
a patch to resolve this, especially since it appears to be a fairly
glaring oversight. We can't really expect people to use PITR if they knew
it would take hours to recover, even on the size of machine I was working on.

On the other hand... we are about to go to RC1 :)

Joshua D. Drake

Re: Slow PITR restore

From
Simon Riggs
Date:
On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:

> We can't really expect people to use PITR if they knew
> it would take hours to recover even on the size of machine I was working on.

That's not a true statement in all cases, and it can often be improved
with some monitoring and tuning. Just get your client to call me :-)

Are you doing replication, or a PITR for another reason?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Simon Riggs wrote:
> On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:
>
>> We can't really expect people to use PITR if they knew
>> it would take hours to recover even on the size of machine I was working on.
>
> That's not a true statement in all cases, and it can often be improved
> with some monitoring and tuning. Just get your client to call me :-)
>

Uhh.. right.

> Are you doing replication, or a PITR for another reason?

Warm standby. Normally we pull every 5 minutes, which is why we hadn't
noticed this before. However, last night we pulled a full sync and
recovery, and that is when we noticed it.

8 seconds for a single archive recovery is very slow considering this
machine. Even single-threaded, that seems slow.

Sincerely,

Joshua D. Drake



Re: Slow PITR restore

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>> Yeah, restoring is known to be less than speedy, because essentially
>> zero optimization work has been done on it.

> If there was a patch to improve this, would it be applied to 8.3?

Good grief, no.  We have not even done the research to find out where
the bottleneck(s) is/are.  We're not holding up 8.3 while we go back
into development mode, especially not when this problem has existed
for seven or eight years (even if JD failed to notice before) and
there are already some improvements for it in 8.3.

            regards, tom lane

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>>> Yeah, restoring is known to be less than speedy, because essentially
>>> zero optimization work has been done on it.
>
>> If there was a patch to improve this, would it be applied to 8.3?
>
> Good grief, no.  We have not even done the research to find out where
> the bottleneck(s) is/are.  We're not holding up 8.3 while we go back
> into development mode, especially not when this problem has existed
> for seven or eight years (even if JD failed to notice before) and
> there are already some improvements for it in 8.3.

I would also note that this "problem" is only going to be noticeable on
the highest-velocity databases. This is certainly a 10%-of-users
issue. It would be great to get it fixed, but there are ways around it
(namely making sure you are running pg_standby and pushing logs at
smaller intervals).
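A minimal sketch of that workaround, with hypothetical paths (the archive directory, trigger file, and log name below are examples, not from this thread):

```ini
# recovery.conf on the warm standby -- paths are examples only.
# pg_standby: -l links rather than copies, -s 2 polls every 2 seconds,
# -t names the failover trigger file.
restore_command = 'pg_standby -l -s 2 -t /tmp/pgsql.trigger /mnt/archive %f %p 2>> standby.log'

# postgresql.conf on the primary: force a segment switch (and hence an
# archive push) at least once a minute, even under light traffic.
archive_timeout = 60
```

With archive_timeout kept low, each segment carries at most a minute or so of traffic, so a failover replays a short tail of WAL rather than hours of it.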

Sincerely,

Joshua D. Drake


>
>             regards, tom lane
>


Re: Slow PITR restore

From
Simon Riggs
Date:
On Wed, 2007-12-12 at 12:13 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
> >> Yeah, restoring is known to be less than speedy, because essentially
> >> zero optimization work has been done on it.
>
> > If there was a patch to improve this, would it be applied to 8.3?
>
> Good grief, no.

Understood.

Just checking to see if you were fishing for a patch from me, so I'm
relieved actually.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
Jeff Trout
Date:
On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:

> 8 seconds for a single archive recovery is very slow in
> consideration of this machine. Even single threaded that seems slow.
>

I've seen this on my PITR restores (thankfully, they were for
fetching some old data, not because we exploded).  On a 2.4GHz Opteron
it took 5-50 seconds per WAL segment, and there were a LOT of
segments (replay took hours and hours).  I asked a few folks and was
told it is the nature of the beast.  Hopefully something in 8.4 can
be done.

--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/




Re: Slow PITR restore

From
Tom Lane
Date:
Jeff Trout <threshar@threshar.is-a-geek.com> writes:
> I've seen this on my PITR restores (thankfully, they were for
> fetching some old data, not because we exploded).  On a 2.4GHz Opteron
> it took 5-50 seconds per wal segment, and there were a LOT of
> segments (replay took hours and hours).  I asked a few folks and was
> told it is the nature of the beast.  Hopefully something in 8.4 can
> be done.

Before we get all panicked about that, someone should try to measure the
restore speed on 8.3.  It's possible that this patch already
alleviated the problem:
http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

            regards, tom lane

Re: Slow PITR restore

From
Gregory Stark
Date:
"Jeff Trout" <threshar@torgo.978.org> writes:

> On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:
>
>> 8 seconds for a single archive recovery is very slow in  consideration of
>> this machine. Even single threaded that seems slow.
>
> I've seen this on my PITR restores (thankfully, they were for  fetching some
> old data, not because we exploded).  On a 2.4GHz Opteron  it took 5-50 seconds
> per wal segment,

I'm not sure what you guys' expectations are, but if you're restoring 5
minutes' worth of database traffic in 8 seconds I wouldn't be complaining.

Depending on your transaction mix and what percentage of it is read-only
select queries, you might reasonably expect the restore to take as long as it
took to generate them...

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
On Wed, 12 Dec 2007 18:02:39 +0000
Gregory Stark <stark@enterprisedb.com> wrote:

> I'm not sure what you guys' expectations are, but if you're restoring
> 5 minutes worth of database traffic in 8 seconds I wouldn't be
> complaining.

I would be. This is a database that is doing nothing but restoring.
Zero concurrency. This thing should be flying.

> 
> Depending on your transaction mix and what percentage of it is
> read-only select queries you might reasonably expect the restore to
> take as long as it took to generate them

We archive selects?

Joshua D. Drake

-- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



Re: Slow PITR restore

From
Jeff Trout
Date:
On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:
>
> I'm not sure what you guys' expectations are, but if you're
> restoring 5
> minutes worth of database traffic in 8 seconds I wouldn't be
> complaining.
>
> Depending on your transaction mix and what percentage of it is read-
> only
> select queries you might reasonably expect the restore to take as
> long as it
> took to generate them...
>

in this case it was 24hrs of data - about 1500 WAL segments.  During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

I'll have to artificially build up some db traffic on an 8.3 instance
to see if the patch Tom mentioned helps.


--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/




Re: Slow PITR restore

From
Simon Riggs
Date:
On Wed, 2007-12-12 at 10:08 -0800, Joshua D. Drake wrote:

> > Depending on your transaction mix and what percentage of it is
> > read-only select queries you might reasonably expect the restore to
> > take as long as it took to generate them
>
> We archive selects?

No, but that is exactly his point.

Whatever proportion of the workload is selects will *not* have to be
reproduced on the client.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
Simon Riggs
Date:
On Wed, 2007-12-12 at 12:56 -0500, Tom Lane wrote:
> Jeff Trout <threshar@threshar.is-a-geek.com> writes:
> > I've seen this on my PITR restores (thankfully, they were for
> > fetching some old data, not because we exploded).  On a 2.4GHz Opteron
> > it took 5-50 seconds per wal segment, and there were a LOT of
> > segments (replay took hours and hours).  I asked a few folks and was
> > told it is the nature of the beast.  Hopefully something in 8.4 can
> > be done.
>
> Before we get all panicked about that, someone should try to measure the
> restore speed on 8.3.  It's possible that this patch already
> alleviated the problem:
> http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

Twas this feature, amongst others, that I had asked to be put back onto
the release notes, so people knew it had improved. I don't really want
to replicate that discussion, but if we're talking about it here then it
probably should be in the notes.

It roughly doubles performance, but there's still more required in some
cases. My recommendation is always to measure the performance first.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
On Wed, 12 Dec 2007 13:13:35 -0500
Jeff Trout <threshar@threshar.is-a-geek.com> wrote:

> 
> On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:
> >
> > I'm not sure what you guys' expectations are, but if you're  
> > restoring 5
> > minutes worth of database traffic in 8 seconds I wouldn't be  
> > complaining.
> >
> > Depending on your transaction mix and what percentage of it is
> > read- only
> > select queries you might reasonably expect the restore to take as  
> > long as it
> > took to generate them...
> >
> 
> in this case it was 24hrs of data - about 1500 WAL segments.  During
> this time the machine was nearly completely idle and there wasn't very
> much IO going on (a few megs/sec).

Exactly, which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

Tom is also correct; we should test this on 8.3.

Joshua D. Drake



-- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



Re: Slow PITR restore

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> On Wed, 12 Dec 2007 18:02:39 +0000
> Gregory Stark <stark@enterprisedb.com> wrote:
>
>> I'm not sure what you guys' expectations are, but if you're restoring
>> 5 minutes worth of database traffic in 8 seconds I wouldn't be
>> complaining.
>
> I would be. This is a database that is doing nothing but restoring.
> Zero concurrency. This thing should be flying.

Well, you say that like concurrency is a bad thing. The lack of concurrency is
the big handicap recovery has. It has to wait while it loads one buffer so it
can twiddle some bits before it reads the next buffer and twiddles bits there.
During normal operation those two buffers were twiddled by two different
transactions in two different processes. Even if they weren't in two different
processes, they could have been context-switched onto the same processor while
the I/O was in progress.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
>> On Wed, 12 Dec 2007 18:02:39 +0000
>> Gregory Stark <stark@enterprisedb.com> wrote:
>>
>>> I'm not sure what you guys' expectations are, but if you're restoring
>>> 5 minutes worth of database traffic in 8 seconds I wouldn't be
>>> complaining.
>> I would be. This is a database that is doing nothing but restoring.
>> Zero concurrency. This thing should be flying.
>
> Well you say that like concurrency is a bad thing. The lack of concurrency is
> the big handicap recovery has. It has to wait while it loads one buffer so it
> can twiddle some bits before it reads the next buffer and twiddles bits there.
> During normal operation those two buffers were twiddled by two different
> transactions in two different processes. Even if they weren't on two different
> processes they could have been context switched onto the same processor while
> the i/o was in progress.

Please see my point about saturation in another post.

Joshua D. Drake




Re: Slow PITR restore

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> Jeff Trout <threshar@threshar.is-a-geek.com> wrote:
>> in this case it was 24hrs of data - about 1500 wal segments.  During
>> this time the machine was nearly completely idle and there wasn't very
>> much IO going on (few megs/sec).

> Exactly. Which is the point I am making. Five minutes of transactions
> is nothing (speaking generally).. In short, if we are in recovery, and
> we are not saturating the I/O and at least a single CPU, there is a huge
> amount of optimization *somewhere* to be done.

You sure about that?  I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high, running pgbench for awhile, and
then killing the bgwriter to force a recovery cycle over all the WAL
generated by the pgbench run.  What I saw was that the machine was 100%
disk write bound.  Increasing shared_buffers helped, not in that the
write rate got less according to vmstat, but the completion time did.

shared_buffers                 32MB     100MB

pgbench -c 5 -t 40000 bench    7m23s    2m20s
subsequent recovery            4m26s    2m21s

Typical "vmstat 1" lines during recovery:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  9  70024  29232  19876 824368    0    0     0  3152 1447  233  0  1  0 99  0
 0  9  70024  29232  19876 824368    0    0     0  3660 1474  252  0  1  0 99  0
 0  8  70024  28960  19876 824404    0    0     0  3176 1448  265  0  2  1 97  0

I don't see the machine sitting around doing nothing ...

            regards, tom lane

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Jeff Trout <threshar@threshar.is-a-geek.com> wrote:
>>> in this case it was 24hrs of data - about 1500 wal segments.  During
>>> this time the machine was nearly complete idle and there wasn't very
>>> much IO going on (few megs/sec).
>
>> Exactly. Which is the point I am making. Five minutes of transactions
>> is nothing (speaking generally).. In short, if we are in recovery, and
>> we are not saturating the I/O and at least a single CPU, there is a huge
>> amount of optimization *somewhere* to be done.
>
> You sure about that?  I tested CVS HEAD just now, by setting the
> checkpoint_ parameters really high,

Well I haven't tested CVS HEAD yet, but on 8.2:

> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            0.12    0.00    0.30   12.22   87.35
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda             172.26      2927.74        46.31      14668        232
> sda1            182.63      2927.74        46.31      14668        232
> sdb               1.40         0.00        41.52          0        208
> sdb1             10.38         0.00        41.52          0        208
> sdb2              0.00         0.00         0.00          0          0
> sdc              96.21       946.11        14.37       4740         72
> sdc1             99.20       946.11        14.37       4740         72
> sdd               0.00         0.00         0.00          0          0
> sdd1              0.00         0.00         0.00          0          0

And:

> 2007-12-13 00:55:20 EST     LOG:  restored log file "00000001000007E10000006B" from archive
> 2007-12-13 00:55:34 EST     LOG:  restored log file "00000001000007E10000006C" from archive
> 2007-12-13 00:55:48 EST     LOG:  restored log file "00000001000007E10000006D" from archive
> 2007-12-13 00:56:01 EST     LOG:  restored log file "00000001000007E10000006E" from archive
> 2007-12-13 00:56:14 EST     LOG:  restored log file "00000001000007E10000006F" from archive
> 2007-12-13 00:56:28 EST     LOG:  restored log file "00000001000007E100000070" from archive
> 2007-12-13 00:56:42 EST     LOG:  restored log file "00000001000007E100000071" from archive

This is 14 spindles in RAID 10 on sda and sdc. For comparison, the exact
same hardware (and OS) under an off-peak production load of 2.25 million
xacts per hour (at this data point), 3 million per hour at peak:

> avg-cpu:  %user   %nice    %sys %iowait   %idle
>           23.30    0.00   15.28   33.07   28.35
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda            2078.80     23693.60     10628.00     118468      53140
> sda1           4139.80     23693.60     10628.00     118468      53140
> sdb              24.00        94.40       318.40        472       1592
> sdb1             87.40        94.40       318.40        472       1592
> sdc             694.00     22207.20      4113.60     111036      20568
> sdc1           1575.00     22207.20      4113.60     111036      20568
> sdd             826.80         3.20      8691.20         16      43456
> sdd1            828.60         3.20      8691.20         16      43456

I would expect recovery to be faster than prod. Maybe I am off
my rocker, but the warm standby isn't doing anything but restoring the logs.

This is going to be hard to test with 8.3 with this data set, but I will
see if I can get some other numbers with 8.3.

Sincerely,

Joshua D. Drake


Re: Slow PITR restore

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
>> Exactly. Which is the point I am making. Five minutes of transactions
>> is nothing (speaking generally).. In short, if we are in recovery, and
>> we are not saturating the I/O and at least a single CPU, there is a huge
>> amount of optimization *somewhere* to be done.
>
> You sure about that?  I tested CVS HEAD just now, by setting the
> checkpoint_ parameters really high, running pgbench for awhile, and
> then killing the bgwriter to force a recovery cycle over all the WAL
> generated by the pgbench run.  What I saw was that the machine was 100%
> disk write bound.  Increasing shared_buffers helped, not in that the
> write rate got less according to vmstat, but the completion time did.

There are at least three definitions of "saturating the I/O" and it sounds
like you two are using two different ones.

1) The processor is waiting on I/O all the time
2) The hard drives are all always handling a request
3) The hard drives are all handling the full bandwidth they're capable of

You would expect (1) and (2) to be the same for a single drive -- though in
practice there seems to be a gap even between them. But for a raid array there
can be a large difference, and the wider the raid stripe the larger the
difference.

In Tom's results:

> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  9  70024  29232  19876 824368    0    0     0  3152 1447  233  0  1  0 99  0
>  0  9  70024  29232  19876 824368    0    0     0  3660 1474  252  0  1  0 99  0
>  0  8  70024  28960  19876 824404    0    0     0  3176 1448  265  0  2  1 97  0
>
> I don't see the machine sitting around doing nothing ...

Note that even though the processor is 99% in wait state, the drive is only
handling about 3 MB/s. That translates into about 2.2ms per request, which is
actually a pretty fast seek time. So if this is a single drive, (1) and (2)
seem to be pretty much the same here.

But note that if this were a RAID array, Postgres wouldn't be getting any
better results. A RAID array wouldn't improve I/O latency at all, and since
it's already 99% waiting for I/O, Postgres is not going to be able to issue any
more. But only one drive in the array will be busy at a time, which would
be far less than the maximum random-access I/O the array is capable of.

Heikki proposed a while back to use posix_fadvise() when processing logs to
read ahead blocks which the recovery will need before actually attempting to
replay them. On a RAID array that would bring the 3 MB/s above up to the
maximum number of random accesses the array can handle (ie, definition
(2) above).

That's still going to be a far cry from the maximum bandwidth the hard drive
can handle; even that single drive can probably do 60 MB/s of sequential I/O.
That's probably the source of the unrealistic expectations people have: it's
easy to watch the bandwidth number as the headline figure for I/O
utilization.
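For concreteness, the latency figure quoted above can be rederived from the vmstat output. This is a back-of-envelope sketch, not a measurement; it assumes 8 kB pages and the ~3660 kB/s write rate shown:

```shell
# ~3660 kB/s of writes at 8 kB per page is only a few hundred
# synchronous random writes per second, i.e. roughly 2.2 ms per write.
pages_per_sec=$((3660 / 8))   # integer arithmetic: 457 pages/s
ms_per_write=$(awk 'BEGIN { printf "%.1f", 1000 / (3660 / 8.0) }')
echo "${pages_per_sec} pages/s, ${ms_per_write} ms per write"
# prints: 457 pages/s, 2.2 ms per write
```

A multi-spindle array could absorb far more random writes per second than that, which is why a recovery process with only one outstanding request leaves such an array looking mostly idle.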


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: Slow PITR restore

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> Tom Lane wrote:
>> You sure about that?  I tested CVS HEAD just now, by setting the
>> checkpoint_ parameters really high,

> ... And:

>> 2007-12-13 00:55:20 EST     LOG:  restored log file "00000001000007E10000006B" from archive

Hmm --- I was testing a straight crash-recovery scenario, not restoring
from archive.  Are you sure your restore_command script isn't
responsible for a lot of the delay?

            regards, tom lane
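One quick way to check that, sketched here as a hypothetical wrapper (the function name and log path are made up for illustration): time the archive fetch itself, so the copy cost can be compared against the 8-15 second gaps between "restored log file" messages.

```shell
#!/bin/sh
# Hypothetical timing wrapper for a restore_command's fetch step.
# Logs how long copying each WAL segment takes, separately from replay.
fetch_wal() {
    # $1 = archive directory, $2 = WAL file name (%f), $3 = destination (%p)
    start=$(date +%s)
    cp "$1/$2" "$3" || return 1
    end=$(date +%s)
    echo "fetched $2 in $((end - start))s" >> "${TIMING_LOG:-/tmp/restore_timing.log}"
}
```

recovery.conf would then point restore_command at a small script that calls fetch_wal with the archive directory, %f, and %p. If the logged fetch times account for most of each gap, the restore_command, not WAL replay, is the bottleneck.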

Re: Slow PITR restore

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Tom Lane wrote:
>>> You sure about that?  I tested CVS HEAD just now, by setting the
>>> checkpoint_ parameters really high,
>
>> ... And:
>
>>> 2007-12-13 00:55:20 EST     LOG:  restored log file "00000001000007E10000006B" from archive
>
> Hmm --- I was testing a straight crash-recovery scenario, not restoring
> from archive.  Are you sure your restore_command script isn't
> responsible for a lot of the delay?

Now that's an interesting thought. I will review in the morning when I
have some more IQ points back.

Joshua D. Drake

>
>             regards, tom lane
>


Re: Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> >
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally).. In short, if we are in recovery, and
> >> we are not saturating the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> >
> > You sure about that?  I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run.  What I saw was that the machine was 100%
> > disk write bound.  Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
>
> There are at least three definitions of "saturating the I/O" and it sounds
> like you two are using two different ones.
>
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable of
>
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a raid array there
> can be a large difference, and the wider the raid stripe the larger the
> difference.
>
> In Tom's results:
>
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> >  0  9  70024  29232  19876 824368    0    0     0  3152 1447  233  0  1  0 99  0
> >  0  9  70024  29232  19876 824368    0    0     0  3660 1474  252  0  1  0 99  0
> >  0  8  70024  28960  19876 824404    0    0     0  3176 1448  265  0  2  1 97  0
> >
> > I don't see the machine sitting around doing nothing ...
>
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms which is
> actually pretty fast. So if this is a single drive (1) and (2) seem to be
> pretty much the same here.
>
> But note that if this were a raid array, Postgres wouldn't be getting any
> better results. A Raid array wouldn't improve i/o latency at all and since
> it's already 99% waiting for i/o Postgres is not going to be able to issue any
> more. But only one drive in the raid array will be busy at a time which would
> be far less than the maximum random access i/o the raid array is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the
I/O scalability issue could be such a large problem. Koichi showed me
some results on a much larger server that illustrated the I/O problem.

But let's remember it's only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but even smaller than Josh's 10%.

> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recover will need before actually attempting to
> recover them. On a raid array that would bring the 3MB/s above up to the
> maximum number of random accesses the raid array can handle (ie, definition
> (2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.
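For reference, the fadvise idea quoted above can be sketched in a few lines. This is an illustrative Python stand-in for what would be backend C code, with a made-up block list; `os.posix_fadvise` is only present on platforms that support the call:

```python
import os

BLOCK_SIZE = 8192  # default PostgreSQL page size

def prefetch_blocks(path, block_numbers):
    """Hint the kernel to read the given blocks ahead of time, so a
    later sequential redo pass finds them already in the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # not available everywhere
            for blkno in block_numbers:
                os.posix_fadvise(fd, blkno * BLOCK_SIZE, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

In the real proposal the block list would come from scanning WAL records ahead of the redo pointer, so the hints are issued well before replay reaches them.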

Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the
AV solution where we have lots of recovery workers, with one recovery
master reading the WAL files. We can then distribute WAL records to
workers in some manner.

It's true that many WAL records depend upon each other, but it's also
true that the performance problems only occur when the WAL records don't
depend upon each other. If they did, they would touch the same blocks,
and those blocks would be cached. So as long as we have a safe mechanism
for splitting up the work, everything is fine.

We can divide up the WAL records by looking at the rmgr field, plus
looking deeper into the records themselves, so that workers touch
different relations/blocks.
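One hedged sketch of such a split (illustrative Python, not backend code, and ignoring records that touch more than one block, such as index splits): hash each record's target so that all records touching the same (relation, block) land on the same worker, which preserves their relative ordering:

```python
def assign_to_workers(wal_records, n_workers):
    """Partition WAL records across recovery workers by target block.

    Records that touch the same (relation, block) always go to the
    same worker, so their original LSN ordering is preserved within
    that worker's queue. Each record here is a dict with 'lsn', 'rel'
    and 'blkno' keys, a stand-in for real rmgr-specific decoding.
    """
    queues = [[] for _ in range(n_workers)]
    for rec in wal_records:
        worker = hash((rec["rel"], rec["blkno"])) % n_workers
        queues[worker].append(rec)
    return queues
```

The master reading the WAL files would fill these queues; commit records and other cross-block dependencies would still need a synchronization point, which this sketch leaves out.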

So I'm planning to review this *after* Florian has introduced his patch,
so we can build upon it.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
>> Heikki proposed a while back to use posix_fadvise() when processing logs to
>> read-ahead blocks which the recover will need before actually attempting to
>> recover them. On a raid array that would bring the 3MB/s above up to the
>> maximum number of random accesses the raid array can handle (ie, definition
>> (2) above).
>
> It's a good idea, but it will require more complex code. I prefer the
> simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Slow PITR restore

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise. But it's also more
> ambitious. It would allow us to use not only the full random access i/o
> bandwidth but also allow us to use more cpu. In cases where the database fits
> entirely in ram and we're recovering many many operations modifying the same
> blocks over and over that might help a lot.

Actually, if you are modifying the same blocks over and over it will
help *less*, because applying each record needs to occur only after the
previous records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that record applies to a block that's not locked
(which means you have to remember the skipped record and apply it
sometime in the future), or you put the process to sleep until the lock
has been released.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)

Re: [HACKERS] Slow PITR restore

From
Simon Riggs
Date:
On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >
> > > It's a good idea, but it will require more complex code. I prefer the
> > > simpler solution of using more processes to solve the I/O problem.
> >
> > Huh, I forgot about that idea. Ironically that was what I suggested when
> > Heikki described the problem.
> >
> > I think it's more complex than using posix_fadvise. But it's also more
> > ambitious. It would allow us to use not only the full random access i/o
> > bandwidth but also allow us to use more cpu. In cases where the database fits
> > entirely in ram and we're recovering many many operations modifying the same
> > blocks over and over that might help a lot.
>
> Actually, if you are modifying the same blocks over and over it will
> help *less*, because applying each record needs to occur only after the
> previous records that modify the same block have been applied.
>
> So you have two possibilities: you skip that record and try to apply the
> next one, hoping that record applies to a block that's not locked
> (which means you have to remember the skipped record and apply it
> sometime in the future), or you put the process to sleep until the lock
> has been released.

Ah, OK, I can see we're on the same lines of thought there. Just posted
a reply to Heikki about this sort of idea.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Slow PITR restore

From
"Joshua D. Drake"
Date:

On Wed, 12 Dec 2007 23:08:35 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> Tom Lane wrote:
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> >> Tom Lane wrote:
> >>> You sure about that?  I tested CVS HEAD just now, by setting the
> >>> checkpoint_ parameters really high,
> > 
> >> ... And:
> > 
> >>> 2007-12-13 00:55:20 EST     LOG:  restored log file
> >>> "00000001000007E10000006B" from archive
> > 
> > Hmm --- I was testing a straight crash-recovery scenario, not
> > restoring from archive.  Are you sure your restore_command script
> > isn't responsible for a lot of the delay?
> 
> Now that's an interesting thought, I will review in the morning when
> I have some more IQ points back.

As promised :)... I took a look at this today and I think I found a
couple of things. It appears that once the logs are archived, the
recovery command copies the archive file to a recovery location and
then restores the file. 

If that is correct that could explain some of the latency I am seeing
here. Even with the speed of these devices, it is still a 16 MB file.
That could take 1-2 seconds to copy.

There is also the execution of pg_standby each time as the recovery
command, which, although I haven't timed it, is going to add overhead.

Based on the logs I pasted, we are showing delays of 6, 14, 3, 13, 4,
and then another 6 seconds.
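One way to confirm where that per-segment time goes is to wrap the restore step in something that logs its own elapsed time, separating copy cost from replay cost. A hypothetical Python wrapper (the /archive path and log location are made up for illustration):

```python
import shutil
import time

def timed_restore(src, dst, log_path):
    """Copy one archived WAL segment into place and record how long
    the copy alone took, so copy time can be separated from the time
    the server spends replaying the segment."""
    start = time.monotonic()
    shutil.copy(src, dst)
    elapsed = time.monotonic() - start
    with open(log_path, "a") as log:
        log.write("%s: %.3f s\n" % (src, elapsed))
    return elapsed

# Hypothetically hooked up via recovery.conf, using the documented
# %f (file name) and %p (target path) substitutions:
#   restore_command = 'python timed_restore.py /archive/%f %p'
# where the script passes its two arguments to timed_restore().
```

If the logged copy times are a second or two per 16 MB segment, that accounts for a chunk of the gaps between "restored log file" messages; whatever remains is replay.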

When are fsyncs called on the recovery process?

At these kinds of delays, even speeding up the process by 2 seconds per
log is going to be significant.

Sincerely,

Joshua D. Drake


--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Wes
Date:
Seems like it would be a common question, but I'm having trouble finding
an answer to this in the archives...

I have a large database (now about 2 billion records), and about once a year
I have been dropping and recreating the indexes.  Recreating the foreign key
constraints takes as long or longer than the index creation.  Last year the
whole process took about 72 hours.  This year the DB is bigger.

I'm running 8.1.4.  Assume I have exclusive access to the DB.

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE.  I seem to remember DROP/CREATE had some advantages, but can't
find the information.  Is there a performance hit with REINDEX during
creation because of locking issues?

2. I'm assuming REINDEX would avoid the time involved in recreating the
foreign key constraints?

3. With a REINDEX DATABASE, how can I monitor progress?

Wes



Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Tom Lane
Date:
Wes <wespvp@msg.bt.com> writes:
> I'm running 8.1.4.  Assume I have exclusive access to the DB.

You really ought to update to 8.1.something-newer, but I digress.

> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
> DATABASE.

No, not if you don't mind exclusive locks.  DROP together with CREATE
INDEX CONCURRENTLY might be nicer if you were trying to do this without
completely shutting down the DB, but if you aren't running normal
operations then just use REINDEX.

> 2. I'm assuming REINDEX would avoid the time involved in recreating the
> foreign key constraints?

Right, that's one reason to do it that way.

> 3. With a REINDEX DATABASE, how can I monitor progress?

It should give you a NOTICE after each table.

BTW, what have you got maintenance_work_mem set to?

            regards, tom lane

Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Wes
Date:
On 1/24/08 12:48 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Wes <wespvp@msg.bt.com> writes:
>> I'm running 8.1.4.  Assume I have exclusive access to the DB.
>
> You really ought to update to 8.1.something-newer, but I digress.

I was planning on upgrading to 8.x at the same time as this reindex - just
do a dump/reload, but...

I guess I've been asleep at the wheel and didn't realize 8.1.11 was out.
Since that wouldn't require a DB reload, I guess that would be highly
recommended?

>> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
>> DATABASE.
>
> No, not if you don't mind exclusive locks.  DROP together with CREATE
> INDEX CONCURRENTLY might be nicer if you were trying to do this without
> completely shutting down the DB, but if you aren't running normal
> operations then just use REINDEX.

I was thinking there was some cleanup that didn't happen with REINDEX,
related to disk space allocation maybe? Perhaps it was just something I
vaguely recall from back under 7.x, when I was having some corruption
issues.

>> 3. With a REINDEX DATABASE, how can I monitor progress?
>
> It should give you a NOTICE after each table.

Is there anything that shows up in ps for each index it is working on?

> BTW, what have you got maintenance_work_mem set to?

It is currently set to 983025.  Not sure where I got that strange number
from.  It's a 2 GB machine.  I've been trying to get more, but when it runs
fine day to day, it's kind of hard to justify.  Lots of disks, not so much
memory.

I guess I should also turn off fsync for the duration.

Wes



Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Gregory Stark
Date:
"Wes" <wespvp@msg.bt.com> writes:

> I guess I should also turn off fsync for the duration.

It shouldn't make a big difference. fsync only happens at the end of a
transaction or at a checkpoint.

Since you're concerned with very long operations, the slowdown at the
end of the transaction won't make a big difference.

Checkpoints could be an issue, but you would be better off just raising
checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
more often than once every few minutes when you're doing large operations like
this.
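For example (values are illustrative, not recommendations; checkpoint_segments and integer-seconds settings are as in the 8.x releases discussed here):

```
# postgresql.conf -- spread checkpoints out during bulk operations
checkpoint_segments = 64      # WAL segments between forced checkpoints
checkpoint_timeout = 900      # seconds; upper bound between checkpoints
checkpoint_warning = 300      # warn if checkpoints arrive faster than this
```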

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Wes
Date:
On 1/25/08 5:40 AM, "Gregory Stark" <stark@enterprisedb.com> wrote:

> It shouldn't make a big difference. fsync only happens at the end of a
> transaction or at a checkpoint.
>
> Since you're concerned with very long operations the slowdown at the end of
> the transaction won't make a big difference.

Ok, I didn't realize that.

> Checkpoints could be an issue, but you would be better off just raising
> checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
> more often than once every few minutes when you're doing large operations like
> this.

I have checkpoint_segments set to 60, and no warnings showing up in the log.

Wes



Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Wes
Date:
Just a follow-up on this...  The REINDEX took about 2 1/2 days.  I didn't
gain much disk space back - a full backup takes just as long as before, but
the vacuum time dropped from 30 hours to 3 hours.

Wes

>> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
>> DATABASE.
>
> No, not if you don't mind exclusive locks.  DROP together with CREATE
> INDEX CONCURRENTLY might be nicer if you were trying to do this without
> completely shutting down the DB, but if you aren't running normal
> operations then just use REINDEX.
>
>> 2. I'm assuming REINDEX would avoid the time involved in recreating the
>> foreign key constraints?
>
> Right, that's one reason to do it that way.



Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Vivek Khera
Date:
On Feb 4, 2008, at 10:00 AM, Wes wrote:

> Just a follow-up on this...  The REINDEX took about 2 1/2 days.  I didn't
> gain much disk space back - a full backup takes just as long as before,
> but the vacuum time dropped from 30 hours to 3 hours.

What you need to do is compare relpages from the pg_class table for
that index before and after.

If you didn't get much disk space back, make sure you have no
long-running transactions that may have kept some older files open.
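For reference, `SELECT relpages FROM pg_class WHERE relname = '<index name>'` gives the page count, and relpages counts 8 kB pages at the default block size, so the before/after difference converts directly to disk space. A small illustrative helper:

```python
BLOCK_SIZE = 8192  # default PostgreSQL block size

def relpages_delta_mb(relpages_before, relpages_after):
    """Convert a drop in pg_class.relpages into megabytes reclaimed."""
    return (relpages_before - relpages_after) * BLOCK_SIZE / (1024 * 1024)
```

So an index that shrinks by 1280 pages has given back 10 MB on disk.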


Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Wes
Date:
On 2/4/08 9:53 AM, "Vivek Khera" <vivek@khera.org> wrote:

> what you need to do is compare the relpages from the pg_class table
> for that index before and after.
>
> if you didn't get much disk space back, make sure you have no long
> running transactions that may have kept some older files open.

I can check that next time, but I only reindex about once a year.  There
definitely should be no outstanding transactions.

The reason for the huge change in the vacuum time is that the indexes are
scanned in index order instead of disk order.  I understand that is fixed in
8.2 or 8.3 (don't recall which I saw it in), but have never gotten
confirmation from anyone on that.

Wes



Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

From
Tom Lane
Date:
Wes <wespvp@msg.bt.com> writes:
> The reason for the huge change in the vacuum time is that the indexes are
> scanned in index order instead of disk order.  I understand that is fixed in
> 8.2 or 8.3 (don't recall which I saw it in), but have never gotten
> confirmation from anyone on that.

Yeah, 8.2.

            regards, tom lane