Thread: Slow PITR restore
Hello, I have an interesting thing going on here...

2007-12-11 21:10:36 EST LOG: restored log file "00000001000007CC00000012" from archive
2007-12-11 21:10:47 EST LOG: restored log file "00000001000007CC00000013" from archive
2007-12-11 21:11:02 EST LOG: restored log file "00000001000007CC00000014" from archive
2007-12-11 21:11:14 EST LOG: restored log file "00000001000007CC00000015" from archive
2007-12-11 21:11:26 EST LOG: restored log file "00000001000007CC00000016" from archive
2007-12-11 21:11:39 EST LOG: restored log file "00000001000007CC00000017" from archive
2007-12-11 21:11:54 EST LOG: restored log file "00000001000007CC00000018" from archive
2007-12-11 21:12:13 EST LOG: restored log file "00000001000007CC00000019" from archive
2007-12-11 21:12:27 EST LOG: restored log file "00000001000007CC0000001A" from archive
2007-12-11 21:12:42 EST LOG: restored log file "00000001000007CC0000001B" from archive
2007-12-11 21:12:54 EST LOG: restored log file "00000001000007CC0000001C" from archive
2007-12-11 21:13:08 EST LOG: restored log file "00000001000007CC0000001D" from archive
2007-12-11 21:13:21 EST LOG: restored log file "00000001000007CC0000001E" from archive
2007-12-11 21:13:36 EST LOG: restored log file "00000001000007CC0000001F" from archive
2007-12-11 21:13:45 EST LOG: restored log file "00000001000007CC00000020" from archive
2007-12-11 21:13:52 EST LOG: restored log file "00000001000007CC00000021" from archive
2007-12-11 21:14:01 EST LOG: restored log file "00000001000007CC00000022" from archive

Now I understand that restoring log files can be slow, but this is a big machine:

32 GB of RAM
28 disks (two volumes, each RAID 10)
Two different tablespaces (one over each volume)
8 Opterons

The machine has a production configuration that can do several hundred thousand transactions an hour without so much as having to stretch first. It seems to me that this machine should be *cranking* through these transactions. Am I just being my usual uptight, impatient self?

Sincerely,

Joshua D. Drake
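A quick way to quantify what a log like the above is showing is to diff the timestamps on consecutive "restored log file" lines. A minimal sketch, assuming GNU date, a server log named postgresql.log, and the log format shown above:

    grep 'restored log file' postgresql.log |
    while read d t tz rest; do
        now=$(date -d "$d $t" +%s)                  # seconds since epoch
        [ -n "$prev" ] && echo "$((now - prev))s between segments"
        prev=$now
    done

For the excerpt above this works out to roughly 7-19 seconds per 16 MB WAL segment.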
"Joshua D. Drake" <jd@commandprompt.com> writes: > ... Now I understand that restoring log files can be slow but this is a big > machine. Yeah, restoring is known to be less than speedy, because essentially zero optimization work has been done on it. Heikki has improved matters a bit in 8.3, but I suspect it's still not great. regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> ... Now I understand that restoring log files can be slow but this is a big >> machine. > > Yeah, restoring is known to be less than speedy, because essentially > zero optimization work has been done on it. > > Heikki has improved matters a bit in 8.3, but I suspect it's still > not great. Wow, o.k. well it is something we (the community) really should look at for 8.4. I am surprised that it is slower than just walking through the xlogs on recovery. I am sure there is a reason just surprised. Thanks for answering. Sincerely, Joshua D. Drake > > regards, tom lane >
"Joshua D. Drake" <jd@commandprompt.com> writes: > Wow, o.k. well it is something we (the community) really should look at for > 8.4. I am surprised that it is slower than just walking through the xlogs on > recovery. I am sure there is a reason just surprised. Well in the worst case it has to do nearly as much work as the original database did. And it only gets to use 1 cpu so it can only have one i/o request pending. bgwriter is started already when doing recovery, right? Perhaps things could be helped by telling bgwriter to behave differently during recovery. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
On Wed, 2007-12-12 at 08:55 +0000, Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > > Wow, o.k. well it is something we (the community) really should look at for > > 8.4. I am surprised that it is slower than just walking through the xlogs on > > recovery. I am sure there is a reason just surprised. It's the same speed because it is the same recovery code. > Well in the worst case it has to do nearly as much work as the original > database did. And it only gets to use 1 cpu so it can only have one i/o > request pending. That need only be slow in certain circumstances. > bgwriter is started already when doing recovery, right? Perhaps things could > be helped by telling bgwriter to behave differently during recovery. It would certainly help if bgwriter came up earlier. I've been looking at that just recently. The main issue is how to structure the code to tell bgwriter when it can start processing in recovery mode and then move into normal mode. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Gregory Stark wrote: > bgwriter is started already when doing recovery, right? Perhaps things could > be helped by telling bgwriter to behave differently during recovery. No. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "Amanece. (Ignacio Reyes) El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"
On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > ... Now I understand that restoring log files can be slow but this is a big > > machine. > > Yeah, restoring is known to be less than speedy, because essentially > zero optimization work has been done on it. If there was a patch to improve this, would it be applied to 8.3? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote: >> "Joshua D. Drake" <jd@commandprompt.com> writes: >>> ... Now I understand that restoring log files can be slow but this is a big >>> machine. >> Yeah, restoring is known to be less than speedy, because essentially >> zero optimization work has been done on it. > > If there was a patch to improve this, would it be applied to 8.3? > Sheesh Simon you are really pushing this release :). I would love to see a patch to resolve this, especially since it appears to be a fairly glaring oversight. We can't really expect people to use PITR if they knew it would take hours to recover even on the size of machine I was working on. On the other hand... we are about to go to RC1 :) Joshua D. Drake
On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote: > We can't really expect people to use PITR if they new > it would take hours to recover even on the size of machine I was working on. That's not a true statement in all cases, and it can often be improved with some monitoring and tuning. Just get your client to call me :-) Are you doing replication, or a PITR for another reason? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote: > >> We can't really expect people to use PITR if they new >> it would take hours to recover even on the size of machine I was working on. > > That's not true statement in all cases and can often be improved with > some monitoring and tuning. Just get your client to call me :-) > Uhh.. right. > Are you doing replication, or a PITR for another reason? Warm standby. Normally we pull every 5 minutes, which is why we hadn't noticed this before. However, last night we pulled a full sync and recovery and that is when we noticed it. 8 seconds for a single archive recovery is very slow considering this machine. Even single threaded that seems slow. Sincerely, Joshua D. Drake
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote: >> Yeah, restoring is known to be less than speedy, because essentially >> zero optimization work has been done on it. > If there was a patch to improve this, would it be applied to 8.3? Good grief, no. We have not even done the research to find out where the bottleneck(s) is/are. We're not holding up 8.3 while we go back into development mode, especially not when this problem has existed for seven or eight years (even if JD failed to notice before) and there are already some improvements for it in 8.3. regards, tom lane
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: >> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote: >>> Yeah, restoring is known to be less than speedy, because essentially >>> zero optimization work has been done on it. > >> If there was a patch to improve this, would it be applied to 8.3? > > Good grief, no. We have not even done the research to find out where > the bottleneck(s) is/are. We're not holding up 8.3 while we go back > into development mode, especially not when this problem has existed > for seven or eight years (even if JD failed to notice before) and > there are already some improvements for it in 8.3. I would also note that this "problem" is only going to be noticeable on the highest-velocity databases. This is certainly a 10%-of-users issue. It would be great to get it fixed, but there are ways around it (namely making sure you are running pg_standby and pushing logs at smaller intervals). Sincerely, Joshua D. Drake > > regards, tom lane >
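For context, the warm-standby arrangement being described pairs an archive_command on the primary with pg_standby as the standby's restore_command. A minimal sketch; the paths and the trigger-file option are illustrative assumptions rather than the configuration from this thread, and %r (which lets pg_standby trim old segments) requires 8.3:

    # primary: postgresql.conf
    archive_command = 'cp %p /archive/%f'

    # standby: recovery.conf
    restore_command = 'pg_standby -t /tmp/pgsql.trigger /archive %f %p %r'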
On Wed, 2007-12-12 at 12:13 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote: > >> Yeah, restoring is known to be less than speedy, because essentially > >> zero optimization work has been done on it. > > > If there was a patch to improve this, would it be applied to 8.3? > > Good grief, no. Understood. Just checking to see if you were fishing for a patch from me, so I'm relieved actually. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote: > 8 seconds for a single archive recovery is very slow in > consideration of this machine. Even single threaded that seems slow. > I've seen this on my PITR restores (thankfully, they were for fetching some old data, not because we exploded). On a 2.4ghz opteron it took 5-50 seconds per wal segment, and there were a LOT of segments (replay took hours and hours). I asked a few folks and was told it is the nature of the beast. Hopefully something in 8.4 can be done. -- Jeff Trout <jeff@jefftrout.com> http://www.dellsmartexitin.com/ http://www.stuarthamm.net/
Jeff Trout <threshar@threshar.is-a-geek.com> writes: > I've seen this on my PITR restores (thankfully, they were for > fetching some old data, not because we expoded). On a 2.4ghz opteron > it took 5-50 seconds per wal segment, and there were a LOT of > segments (replay took hours and hours). I asked a few folks and was > told it is the nature of the beast. Hopefully something in 8.4 can > be done. Before we get all panicked about that, someone should try to measure the restore speed on 8.3. It's possible that this patch already alleviated the problem: http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php regards, tom lane
"Jeff Trout" <threshar@torgo.978.org> writes: > On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote: > >> 8 seconds for a single archive recovery is very slow in consideration of >> this machine. Even single threaded that seems slow. > > I've seen this on my PITR restores (thankfully, they were for fetching some > old data, not because we expoded). On a 2.4ghz opteron it took 5-50 seconds > per wal segment, I'm not sure what you guys' expectations are, but if you're restoring 5 minutes worth of database traffic in 8 seconds I wouldn't be complaining. Depending on your transaction mix and what percentage of it is read-only select queries you might reasonably expect the restore to take as long as it took to generate them... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Wed, 12 Dec 2007 18:02:39 +0000 Gregory Stark <stark@enterprisedb.com> wrote: > I'm not sure what you guys' expectations are, but if you're restoring > 5 minutes worth of database traffic in 8 seconds I wouldn't be > complaining. I would be. This is a database that is doing nothing but restoring. Zero concurrency. This thing should be flying. > > Depending on your transaction mix and what percentage of it is > read-only select queries you might reasonably expect the restore to > take as long as it took to generate t We archive selects? Joshua D. Drake
On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote: > > I'm not sure what you guys' expectations are, but if you're restoring 5 > minutes worth of database traffic in 8 seconds I wouldn't be complaining. > > Depending on your transaction mix and what percentage of it is read-only > select queries you might reasonably expect the restore to take as long as it > took to generate them... > in this case it was 24hrs of data - about 1500 wal segments. During this time the machine was nearly completely idle and there wasn't very much IO going on (a few megs/sec). I'll have to artificially build up some db traffic on an 8.3 instance to see if the patch Tom mentioned helps. -- Jeff Trout <jeff@jefftrout.com> http://www.dellsmartexitin.com/ http://www.stuarthamm.net/
On Wed, 2007-12-12 at 10:08 -0800, Joshua D. Drake wrote: > > Depending on your transaction mix and what percentage of it is > > read-only select queries you might reasonably expect the restore to > > take as long as it took to generate t > > We archive selects? No, but that is exactly his point. Whatever proportion of the workload is selects will *not* have to be reproduced on the client. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Wed, 2007-12-12 at 12:56 -0500, Tom Lane wrote: > Jeff Trout <threshar@threshar.is-a-geek.com> writes: > > I've seen this on my PITR restores (thankfully, they were for > > fetching some old data, not because we expoded). On a 2.4ghz opteron > > it took 5-50 seconds per wal segment, and there were a LOT of > > segments (replay took hours and hours). I asked a few folks and was > > told it is the nature of the beast. Hopefully something in 8.4 can > > be done. > > Before we get all panicked about that, someone should try to measure the > restore speed on 8.3. It's possible that this patch already > alleviated the problem: > http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php Twas this feature, amongst others, that I had asked to be put back onto the release notes, so people knew it had improved. I don't really want to replicate that discussion, but if we're talking about it here then it probably should be in the notes. It roughly doubles performance, but there's still more required in some cases. My recommendation is always measure the performance first. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Wed, 12 Dec 2007 13:13:35 -0500 Jeff Trout <threshar@threshar.is-a-geek.com> wrote: > > On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote: > > > > I'm not sure what you guys' expectations are, but if you're > > restoring 5 > > minutes worth of database traffic in 8 seconds I wouldn't be > > complaining. > > > > Depending on your transaction mix and what percentage of it is > > read-only > > select queries you might reasonably expect the restore to take as > > long as it > > took to generate them... > > > > in this case it was 24hrs of data - about 1500 wal segments. During > this time the machine was nearly complete idle and there wasn't very > much IO going on (few megs/sec). Exactly. Which is the point I am making. Five minutes of transactions is nothing (speaking generally). In short, if we are in recovery, and we are not saturating the I/O and at least a single CPU, there is a huge amount of optimization *somewhere* to be done. Tom is also correct, we should test this on 8.3. Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > On Wed, 12 Dec 2007 18:02:39 +0000 > Gregory Stark <stark@enterprisedb.com> wrote: > >> I'm not sure what you guys' expectations are, but if you're restoring >> 5 minutes worth of database traffic in 8 seconds I wouldn't be >> complaining. > > I would be. This is a database that is doing nothing but restoring. > Zero concurrency. This thing should be flying. Well you say that like concurrency is a bad thing. The lack of concurrency is the big handicap recovery has. It has to wait while it loads one buffer so it can twiddle some bits before it reads the next buffer and twiddles bits there. During normal operation those two buffers were twiddled by two different transactions in two different processes. Even if they weren't on two different processes they could have been context switched onto the same processor while the i/o was in progress. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> On Wed, 12 Dec 2007 18:02:39 +0000 >> Gregory Stark <stark@enterprisedb.com> wrote: >> >>> I'm not sure what you guys' expectations are, but if you're restoring >>> 5 minutes worth of database traffic in 8 seconds I wouldn't be >>> complaining. >> I would be. This is a database that is doing nothing but restoring. >> Zero concurrency. This thing should be flying. > > Well you say that like concurrency is a bad thing. The lack of concurrency is > the big handicap recovery has. It has to wait while it loads one buffer so it > can twiddle some bits before it reads the next buffer and twiddles bits there. > During normal operation those two buffers were twiddled by two different > transactions in two different processes. Even if they weren't on two different > processes they could have been context switched onto the same processor while > the i/o was in progress. Please see my point about saturation in another post. Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Jeff Trout <threshar@threshar.is-a-geek.com> wrote: >> in this case it was 24hrs of data - about 1500 wal segments. During >> this time the machine was nearly complete idle and there wasn't very >> much IO going on (few megs/sec). > Exactly. Which is the point I am making. Five minutes of transactions > is nothing (speaking generally).. In short, if we are in recovery, and > we are not saturated the I/O and at least a single CPU, there is a huge > amount of optimization *somewhere* to be done. You sure about that? I tested CVS HEAD just now, by setting the checkpoint_ parameters really high, running pgbench for awhile, and then killing the bgwriter to force a recovery cycle over all the WAL generated by the pgbench run. What I saw was that the machine was 100% disk write bound. Increasing shared_buffers helped, not in that the write rate got less according to vmstat, but the completion time did. shared_buffers 32MB 100MB pgbench -c 5 -t 40000 bench 7m23s 2m20s subsequent recovery 4m26s 2m21s Typical "vmstat 1" lines during recovery: procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0 I don't see the machine sitting around doing nothing ... regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Jeff Trout <threshar@threshar.is-a-geek.com> wrote: >>> in this case it was 24hrs of data - about 1500 wal segments. During >>> this time the machine was nearly complete idle and there wasn't very >>> much IO going on (few megs/sec). > >> Exactly. Which is the point I am making. Five minutes of transactions >> is nothing (speaking generally).. In short, if we are in recovery, and >> we are not saturated the I/O and at least a single CPU, there is a huge >> amount of optimization *somewhere* to be done. > > You sure about that? I tested CVS HEAD just now, by setting the > checkpoint_ parameters really high, Well I haven't tested CVS HEAD yet, but on 8.2: > avg-cpu: %user %nice %sys %iowait %idle > 0.12 0.00 0.30 12.22 87.35 > > Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn > sda 172.26 2927.74 46.31 14668 232 > sda1 182.63 2927.74 46.31 14668 232 > sdb 1.40 0.00 41.52 0 208 > sdb1 10.38 0.00 41.52 0 208 > sdb2 0.00 0.00 0.00 0 0 > sdc 96.21 946.11 14.37 4740 72 > sdc1 99.20 946.11 14.37 4740 72 > sdd 0.00 0.00 0.00 0 0 > sdd1 0.00 0.00 0.00 0 0 And: > 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive > 2007-12-13 00:55:34 EST LOG: restored log file "00000001000007E10000006C" from archive > 2007-12-13 00:55:48 EST LOG: restored log file "00000001000007E10000006D" from archive > 2007-12-13 00:56:01 EST LOG: restored log file "00000001000007E10000006E" from archive > 2007-12-13 00:56:14 EST LOG: restored log file "00000001000007E10000006F" from archive > 2007-12-13 00:56:28 EST LOG: restored log file "00000001000007E100000070" from archive > 2007-12-13 00:56:42 EST LOG: restored log file "00000001000007E100000071" from archive This is a 14 Spindles in RAID 10 on sda and sdc. As a correlation, exact same hardware (and OS) and off peak production load of 2.25 million xacts per hour (at data point), 3million per hour at peak.: > avg-cpu: %user %nice %sys %iowait %idle > 23.30 0.00 15.28 33.07 28.35 > > Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn > sda 2078.80 23693.60 10628.00 118468 53140 > sda1 4139.80 23693.60 10628.00 118468 53140 > sdb 24.00 94.40 318.40 472 1592 > sdb1 87.40 94.40 318.40 472 1592 > sdc 694.00 22207.20 4113.60 111036 20568 > sdc1 1575.00 22207.20 4113.60 111036 20568 > sdd 826.80 3.20 8691.20 16 43456 > sdd1 828.60 3.20 8691.20 16 43456 I would expect that recovery would be faster than prod. Maybe I am off my rocker but the warm standby isn't doing anything but restoring the logs. This is going to be hard to test with 8.3 with this data set but I will see if I can get some other numbers with 8.3. Sincerely, Joshua D. Drake
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> Exactly. Which is the point I am making. Five minutes of transactions >> is nothing (speaking generally).. In short, if we are in recovery, and >> we are not saturated the I/O and at least a single CPU, there is a huge >> amount of optimization *somewhere* to be done. > > You sure about that? I tested CVS HEAD just now, by setting the > checkpoint_ parameters really high, running pgbench for awhile, and > then killing the bgwriter to force a recovery cycle over all the WAL > generated by the pgbench run. What I saw was that the machine was 100% > disk write bound. Increasing shared_buffers helped, not in that the > write rate got less according to vmstat, but the completion time did. There are at least three definitions of "saturating the I/O" and it sounds like you two are using two different ones. 1) The processor is waiting on I/O all the time 2) The hard drives are all always handling a request 3) The hard drives are all handling the full bandwidth they're capable You would expect (1) and (2) to be the same for a single drive -- though in practice there seems to be a gap even between them. But for a raid array there can be a large difference, and the wider the raid stripe the larger the difference. In Tom's results: > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0 > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0 > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0 > > I don't see the machine sitting around doing nothing ... Note that even though the processor is 99% in wait state the drive is only handling about 3 MB/s. That translates into a seek time of 2.2ms which is actually pretty fast. So if this is a single drive (1) and (2) seem to be pretty much the same here. But note that if this were a raid array Postgres's wouldn't be getting any better results. A Raid array wouldn't improve i/o latency at all and since it's already 99% waiting for i/o Postgres is not going to be able to issue any more. But only one drive in the raid array will be busy at a time which would be far less than the maximum random access i/o the raid array is capable of. Heikki proposed a while back to use posix_fadvise() when processing logs to read-ahead blocks which the recover will need before actually attempting to recover them. On a raid array that would bring the 3MB/s above up to the maximum number of random accesses the raid array can handle (ie, definition (2) above). That's still going to be a far cry from the maximum bandwidth the hard drive can handle. Even that single drive can probably handle 60MB/s sequential I/O. That's probably the source of the unrealistic expectations people have. It's easy to watch the bandwidth number as the headline number to measure i/o utilization. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane wrote: >> You sure about that? I tested CVS HEAD just now, by setting the >> checkpoint_ parameters really high, > ... And: >> 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive Hmm --- I was testing a straight crash-recovery scenario, not restoring from archive. Are you sure your restore_command script isn't responsible for a lot of the delay? regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Tom Lane wrote: >>> You sure about that? I tested CVS HEAD just now, by setting the >>> checkpoint_ parameters really high, > >> ... And: > >>> 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive > > Hmm --- I was testing a straight crash-recovery scenario, not restoring > from archive. Are you sure your restore_command script isn't > responsible for a lot of the delay? Now that's an interesting thought, I will review in the morning when I have some more IQ points back. Joshua D. Drake > > regards, tom lane >
On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> >
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally).. In short, if we are in recovery, and
> >> we are not saturated the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> >
> > You sure about that? I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run. What I saw was that the machine was 100%
> > disk write bound. Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
>
> There are at least three definitions of "saturating the I/O" and it sounds
> like you two are using two different ones.
>
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable
>
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a raid array there
> can be a large difference, and the wider the raid stripe the larger the
> difference.
>
> In Tom's results:
>
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> >  0  9  70024  29232  19876 824368    0    0     0  3152 1447  233  0  1  0 99  0
> >  0  9  70024  29232  19876 824368    0    0     0  3660 1474  252  0  1  0 99  0
> >  0  8  70024  28960  19876 824404    0    0     0  3176 1448  265  0  2  1 97  0
> >
> > I don't see the machine sitting around doing nothing ...
>
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms which is
> actually pretty fast. So if this is a single drive (1) and (2) seem to be
> pretty much the same here.
>
> But note that if this were a raid array Postgres's wouldn't be getting any
> better results. A Raid array wouldn't improve i/o latency at all and since
> it's already 99% waiting for i/o Postgres is not going to be able to issue any
> more. But only one drive in the raid array will be busy at a time which would
> be far less than the maximum random access i/o the raid array is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I hadn't performed them on a big enough system and so didn't realise the I/O scalability issue could be such a large problem. Koichi showed me some results on a much larger server that illustrated the I/O problem. But let's remember it's only a problem on large servers with a heavy write workload and a large random I/O requirement. That's an important set of people, but much less than Josh's 10% of people even.

> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recover will need before actually attempting to
> recover them. On a raid array that would bring the 3MB/s above up to the
> maximum number of random accesses the raid array can handle (ie, definition
> (2) above).

It's a good idea, but it will require more complex code. I prefer the simpler solution of using more processes to solve the I/O problem.
Florian's code for Hot Standby introduces a separate recovery process, similar to an autovacuum launcher. I propose a mechanism similar to the AV solution where we have lots of recovery workers, with one recovery master reading the WAL files. We can then distribute WAL records to workers in some manner. It's true that many WAL records depend upon each other, but it's also true that the performance problems only occur when the WAL records don't depend upon each other. If they did, they would touch the same blocks and it would be cached. So as long as we have a safe mechanism for splitting up the work, everything is fine. We can divide up the WAL records by looking at the rmgr field, plus looking deeper into the records themselves so we can touch different relations/blocks. So I'm planning to review this *after* Florian has introduced his patch, so we can build upon it. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote: >> Heikki proposed a while back to use posix_fadvise() when processing logs to >> read-ahead blocks which the recover will need before actually attempting to >> recover them. On a raid array that would bring the 3MB/s above up to the >> maximum number of random accesses the raid array can handle (ie, definition >> (2) above). > > It's a good idea, but it will require more complex code. I prefer the > simpler solution of using more processes to solve the I/O problem. Huh, I forgot about that idea. Ironically that was what I suggested when Heikki described the problem. I think it's more complex than using posix_fadvise. But it's also more ambitious. It would allow us to use not only the full random access i/o bandwidth but also allow us to use more cpu. In cases where the database fits entirely in ram and we're recovering many many operations modifying the same blocks over and over that might help a lot. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > It's a good idea, but it will require more complex code. I prefer the > > simpler solution of using more processes to solve the I/O problem. > > Huh, I forgot about that idea. Ironically that was what I suggested when > Heikki described the problem. > > I think it's more complex than using posix_fadvise. But it's also more > ambitious. It would allow us to use not only the full random access i/o > bandwidth but also allow us to use more cpu. In cases where the database fits > entirely in ram and we're recovering many many operations modifying the same > blocks over and over that might help a lot. Actually, if you are modifying the same blocks over and over it will help *less*, because applying each record needs to occur only after the previous records that modify the same block have been applied. So you have two possibilities: you skip that record and try to apply the next one, hoping that that record applies to a block that's not locked, (which means you have to remember the skipped record and apply it sometime in the future), or you put the process to sleep until the lock has been released. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote: > Gregory Stark wrote: > > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > > > It's a good idea, but it will require more complex code. I prefer the > > > simpler solution of using more processes to solve the I/O problem. > > > > Huh, I forgot about that idea. Ironically that was what I suggested when > > Heikki described the problem. > > > > I think it's more complex than using posix_fadvise. But it's also more > > ambitious. It would allow us to use not only the full random access i/o > > bandwidth but also allow us to use more cpu. In cases where the database fits > > entirely in ram and we're recovering many many operations modifying the same > > blocks over and over that might help a lot. > > Actually, if you are modifying the same blocks over and over it will > help *less*, because applying each record needs to occur only after the > previous records that modify the same block have been applied. > > So you have two possibilities: you skip that record and try to apply the > next one, hoping that that record applies to a block that's not locked, > (which means you have to remember the skipped record and apply it > sometime in the future), or you put the process to sleep until the lock > has been released. Ah, OK, I can see we're on the same lines of thought there. Just posted a reply to Heikki about this sort of idea. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Wed, 12 Dec 2007 23:08:35 -0800 "Joshua D. Drake" <jd@commandprompt.com> wrote: > Tom Lane wrote: > > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> Tom Lane wrote: > >>> You sure about that? I tested CVS HEAD just now, by setting the > >>> checkpoint_ parameters really high, > > > >> ... And: > > > >>> 2007-12-13 00:55:20 EST LOG: restored log file > >>> "00000001000007E10000006B" from archive > > > > Hmm --- I was testing a straight crash-recovery scenario, not > > restoring from archive. Are you sure your restore_command script > > isn't responsible for a lot of the delay? > > Now that's an interesting thought, I will review in the morning when > I have some more IQ points back. As promised :)... I took a look at this today and I think I found a couple of things. It appears that once the logs are archived, the recovery command copies the archive file to a recovery location and then restores the file. If that is correct, that could explain some of the latency I am seeing here. Even with the speed of these devices, it is still a 16 MB file; that could take 1-2 seconds to copy. There is also the execution of pg_standby each time as the recovery command, which, although I haven't timed it, is going to add overhead. Based on the logs I pasted we are showing a delay of 6, 14, 3, 13, 4 and then another 6 seconds. When are fsyncs called on the recovery process? At these kinds of delays, even speeding the process up by 2 seconds per log is going to be significant. Sincerely, Joshua D. Drake
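One way to tell how much of each gap is the restore_command itself (the archive copy and pg_standby overhead) versus actual replay is to wrap the command in a small timing script. A sketch with made-up paths and script name; it assumes the archive is a local directory and that %f/%p are passed through as usual:

    #!/bin/sh
    # restore_timed.sh -- used as: restore_command = '/usr/local/bin/restore_timed.sh %f %p'
    start=$(date +%s)
    cp "/archive/$1" "$2"
    status=$?
    end=$(date +%s)
    # log how long the fetch took; the rest of the gap between segments is replay
    echo "$(date) $1 fetched in $((end - start))s (status $status)" >> /tmp/restore_timing.log
    exit $status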
Seems like it would be a common question, but I'm having problems finding an answer in the archives on this... I have a large database (now about 2 billion records), and about once a year I have been dropping and recreating the indexes. Recreating the foreign key constraints takes as long or longer than the index creation. Last year the whole process took about 72 hours. This year the DB is bigger. I'm running 8.1.4. Assume I have exclusive access to the DB.

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX DATABASE. I seem to remember DROP/CREATE had some advantages, but can't find the information. Is there a performance hit with REINDEX during creation because of locking issues?

2. I'm assuming REINDEX would avoid the time involved in recreating the foreign key constraints?

3. With a REINDEX DATABASE, how can I monitor progress?

Wes
Wes <wespvp@msg.bt.com> writes: > I'm running 8.1.4. Assume I have exclusive access to the DB. You really ought to update to 8.1.something-newer, but I digress. > 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX > DATABASE. No, not if you don't mind exclusive locks. DROP together with CREATE INDEX CONCURRENTLY might be nicer if you were trying to do this without completely shutting down the DB, but if you aren't running normal operations then just use REINDEX. > 2. I'm assuming REINDEX would avoid the time involved in recreating the > foreign key constraints? Right, that's one reason to do it that way. > 3. With a REINDEX DATABASE, how can I monitor progress? It should give you a NOTICE after each table. BTW, what have you got maintenance_work_mem set to? regards, tom lane
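For reference, the two approaches being contrasted look roughly like this; the database, table, and index names are placeholders, not from the thread:

    # exclusive access: one command, keeps FK constraints, takes exclusive locks
    psql -d mydb -c "REINDEX DATABASE mydb;"

    # live system: rebuild an index without blocking writes (CREATE INDEX CONCURRENTLY, 8.2+),
    # building the replacement first so the table is never without an index
    psql -d mydb <<'SQL'
    CREATE INDEX CONCURRENTLY my_index_new ON my_table (my_column);
    DROP INDEX my_index;
    ALTER INDEX my_index_new RENAME TO my_index;
    SQL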
On 1/24/08 12:48 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Wes <wespvp@msg.bt.com> writes: >> I'm running 8.1.4. Assume I have exclusive access to the DB. > > You really ought to update to 8.1.something-newer, but I digress. I was planning on upgrading to 8.x at the same time as this reindex - just do a dump/reload, but... I guess I've been asleep at the wheel and didn't realize 8.1.11 was out. Since that wouldn't require a DB reload, I guess that would be highly recommended? >> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX >> DATABASE. > > No, not if you don't mind exclusive locks. DROP together with CREATE > INDEX CONCURRENTLY might be nicer if you were trying to do this without > completely shutting down the DB, but if you aren't running normal > operations then just use REINDEX. I was thinking there was some clean up that didn't happen with REINDEX, related to disk space allocation maybe? Perhaps this was just on older 7.x versions - something I vaguely recall back under 7.x when I was having some corruption issues. >> 3. With a REINDEX DATABASE, how can I monitor progress? > > It should give you a NOTICE after each table. Is there anything that shows up in ps for each index it is working on? > BTW, what have you got maintenance_work_mem set to? It is currently set to 983025. Not sure where I got that strange number from. It's a 2 GB machine. I've been trying to get more, but when it runs fine day to day, it's kind of hard to justify. Lots of disks, not so much memory. I guess I should also turn off fsync for the duration. Wes
"Wes" <wespvp@msg.bt.com> writes: > I guess I should also turn off fsync for the duration. It shouldn't make a big difference. fsync only happens at the end of a transaction or at a checkpoint. Since you're concerned with very long operations the slowdown at the end of the transaction won't make a big difference. Checkpoints could be an issue, but you would be better off just raising checkpoint_segments and/or checkpoint_timeout to make sure you don't get one more often than once every few minutes when you're doing large operations like this. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On 1/25/08 5:40 AM, "Gregory Stark" <stark@enterprisedb.com> wrote: > It shouldn't make a big difference. fsync only happens at the end of a > transaction or at a checkpoint. > > Since you're concerned with very long operations the slowdown at the end of > the transaction won't make a big difference. Ok, I didn't realize that. > Checkpoints could be an issue, but you would be better off just raising > checkpoint_segments and/or checkpoint_timeout to make sure you don't get one > more often than once every few minutes when you're doing large operations like > this. I have checkpoint_segments set to 60, and no warnings showing up in the log. Wes
Just a follow-up on this... The REINDEX took about 2 1/2 days. I didn't gain much disk space back - a full backup takes just as long as before, but the vacuum time dropped from 30 hours to 3 hours. Wes >> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX >> DATABASE. > > No, not if you don't mind exclusive locks. DROP together with CREATE > INDEX CONCURRENTLY might be nicer if you were trying to do this without > completely shutting down the DB, but if you aren't running normal > operations then just use REINDEX. > >> 2. I'm assuming REINDEX would avoid the time involved in recreating the >> foreign key constraints? > > Right, that's one reason to do it that way.
On Feb 4, 2008, at 10:00 AM, Wes wrote: > Just a follow-up on this... The REINDEX took about 2 1/2 days. I > didn't > gain much disk space back - a full backup takes just as long as > before, but > the vacuum time dropped from 30 hours to 3 hours. what you need to do is compare the relpages from the pg_class table for that index before and after. if you didn't get much disk space back, make sure you have no long running transactions that may have kept some older files open.
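A quick way to do that before/after comparison; this assumes the default 8 kB block size, and note that relpages is only as accurate as the last VACUUM or ANALYZE of the table:

    psql -d mydb -c "
      SELECT c.relname,
             c.relpages,
             pg_size_pretty(c.relpages::bigint * 8192) AS approx_size
        FROM pg_class c
       WHERE c.relkind = 'i'
       ORDER BY c.relpages DESC
       LIMIT 20;"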
On 2/4/08 9:53 AM, "Vivek Khera" <vivek@khera.org> wrote: > what you need to do is compare the relpages from the pg_class table > for that index before and after. > > if you didn't get much disk space back, make sure you have no long > running transactions that may have kept some older files open. I can check that next time, but I only reindex about once a year. There definitely should be no outstanding transactions. The reason for the huge change in the vacuum time is that the indexes are scanned in index order instead of disk order. I understand that is fixed in 8.2 or 8.3 (don't recall which I saw it in), but have never gotten confirmation from anyone on that. Wes
Wes <wespvp@msg.bt.com> writes: > The reason for the huge change in the vacuum time is that the indexes are > scanned in index order instead of disk order. I understand that is fixed in > 8.2 or 8.3 (don't recall which I saw it in), but have never gotten > confirmation from anyone on that. Yeah, 8.2. regards, tom lane