Thread: PG8 Tuning
Hi all, we're running PG8 on a Sun V250 with 8GB RAM and 2*1.3GHz SPARC CPUs running Solaris 10. The DB cluster is on an external fibre-attached Sun T3 array that has 9*36GB drives configured as a single RAID5 LUN.

The system is for the sole use of a couple of data warehouse developers, hence we are keen to use 'aggressive' tuning options to maximise performance.

So far we have made the following changes and measured the impact on our test suite:

1) Increase checkpoint_segments from 3 to 64. This made a 10x improvement in some cases.

2) Increase work_mem from 1,024 to 524,288.

3) Increase shared_buffers from 1,000 to 262,143 (2 GB). This required setting SHMMAX=4294967295 (4 GB) in /etc/system and re-booting the box.

Question - can Postgres only use 2GB RAM, given that shared_buffers can only be set as high as 262,143 (8K pages)?

So far so good...

4) Move /pg_xlog to an internal disk within the V250. This has had a severe *negative* impact on performance. The copy job has gone from 2 mins to 12 mins, and a simple SQL job has gone from 1 min to 7 mins. We haven't even run the long SQL jobs yet.

I'm guessing that this is because pg_xlog has gone from a 9-spindle LUN to a single-spindle disk?

In cases such as this, where an external storage array with a hardware RAID controller is used, the normal advice to separate the data from the pg_xlog seems to come unstuck, or are we missing something?

Cheers,

Paul Johnson.
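For reference, the changes described above would look roughly like the sketch below; it is a reconstruction from the figures in the post, not the actual files, and the PGDATA path is a placeholder. On PG 8.0, shared_buffers is counted in 8kB pages and work_mem in kB, and the SHMMAX setting in /etc/system only takes effect after a reboot.

    # Sketch of the changes described above; adjust PGDATA to the real
    # cluster location on the T3 array.
    PGDATA=/pgdata/cluster

    # PostgreSQL uses the last occurrence of a duplicated setting, so a
    # plain append works, though editing in place is tidier.
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    # --- warehouse tuning (PG 8.0: shared_buffers in 8kB pages, work_mem in kB) ---
    checkpoint_segments = 64      # up from the default of 3
    work_mem = 524288             # kB; up from 1024
    shared_buffers = 262143       # 8kB pages, roughly 2GB; up from 1000
    EOF

    # Raise the SysV shared memory ceiling so the 2GB of buffers can be
    # allocated; Solaris reads /etc/system at boot, so reboot afterwards.
    echo 'set shmsys:shminfo_shmmax=4294967295' >> /etc/system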
Paul Johnson wrote:
> Hi all, we're running PG8 on a Sun V250 with 8GB RAM and 2*1.3GHz SPARC CPUs running Solaris 10. The DB cluster is on an external fibre-attached Sun T3 array that has 9*36GB drives configured as a single RAID5 LUN.
>
> The system is for the sole use of a couple of data warehouse developers, hence we are keen to use 'aggressive' tuning options to maximise performance.
>
> So far we have made the following changes and measured the impact on our test suite:
>
> 1) Increase checkpoint_segments from 3 to 64. This made a 10x improvement in some cases.

OK

> 2) Increase work_mem from 1,024 to 524,288.

Don't forget you can use multiples of this in a single query. Might want to rein it back a bit. I *think* you can set it per-query if you want anyway.

> 3) Increase shared_buffers from 1,000 to 262,143 (2 GB). This required setting SHMMAX=4294967295 (4 GB) in /etc/system and re-booting the box.
>
> Question - can Postgres only use 2GB RAM, given that shared_buffers can only be set as high as 262,143 (8K pages)?

Well, normally you'd want to keep a fair bit for the OS to cache data. One quarter of your RAM seems very high. Did you try 5000, 10000, 50000 too, or go straight to the top end?

> So far so good...
>
> 4) Move /pg_xlog to an internal disk within the V250. This has had a severe *negative* impact on performance. The copy job has gone from 2 mins to 12 mins, and a simple SQL job has gone from 1 min to 7 mins. We haven't even run the long SQL jobs yet.
>
> I'm guessing that this is because pg_xlog has gone from a 9-spindle LUN to a single-spindle disk?

The key limitation will be one commit per rotation of the disk. Multiple spindles, or better still a battery-backed write cache, will give you peak transaction rates.

> In cases such as this, where an external storage array with a hardware RAID controller is used, the normal advice to separate the data from the pg_xlog seems to come unstuck, or are we missing something?

Well, I think the advice then is actually "get 2 external arrays..."

--
Richard Huxton
Archonet Ltd
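Richard's per-query work_mem idea can be tried straight from psql; a minimal sketch, assuming a database called warehouse and a purely illustrative sales table (on PG 8.0 the value is an integer number of kB):

    # Override work_mem for one session only, leaving the server-wide
    # default lower; the table name is just an illustration.
    psql -d warehouse <<'SQL'
    SET work_mem = 524288;                        -- ~512MB for this session's sorts/hashes
    EXPLAIN ANALYZE SELECT count(*) FROM sales;   -- run the big warehouse query here
    RESET work_mem;                               -- back to the postgresql.conf default
    SQL

Richard's caveat still applies: several sorts or hashes within one query can each claim that much memory.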
Paul,

Before I say anything else, one online document which may be of assistance to you is: http://www.powerpostgresql.com/PerfList/

Some thoughts I have:

3) Your shared RAM setting seems overkill to me. Part of the challenge is you're going from 1000 to 262K with no assessment in between. Each situation can be different, but try in the range of 10 - 50K.

4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog, you're better off. If it is sharing with any other OS/DB resource, the performance will be impacted.

From what I have learned from others on this list, RAID5 is not the best choice for the database. RAID10 would be a better solution (using 8 of your disks); then take the remaining disk for your pg_xlog, mirrored if possible.

Best of luck,

Steve Poe
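One rough way to act on Steve's "assessment in between" point is to step shared_buffers through a few intermediate values and re-run the test suite at each one. A sketch only: the paths and run_test_suite.sh are placeholders for the real warehouse jobs, and Perl is used for the in-place edit because Solaris sed lacks -i.

    PGDATA=/pgdata/cluster
    for bufs in 10000 20000 50000; do
        # rewrite the existing shared_buffers line; picked up on restart
        perl -pi -e "s/^shared_buffers\s*=.*/shared_buffers = $bufs/" \
            "$PGDATA/postgresql.conf"
        pg_ctl restart -D "$PGDATA" -w
        echo "=== shared_buffers=$bufs ===" >> tuning_results.log
        /usr/bin/time -p ./run_test_suite.sh >> tuning_results.log 2>&1
    done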
On Thu, Aug 11, 2005 at 01:23:21PM +0100, Paul Johnson wrote:
> I'm guessing that this is because pg_xlog has gone from a 9-spindle LUN to a single-spindle disk?
>
> In cases such as this, where an external storage array with a hardware RAID controller is used, the normal advice to separate the data from the pg_xlog seems to come unstuck

Yes. That's the downside to dogma.

If you're writing pg_xlog to a battery-backed RAM buffer, you'll see faster commits than you will with a write to a disk, even if you've got a dedicated spindle, unless you've got constant write activity. (Because once the buffer fills, you're limited to disk speed as you wait for buffer flushes.) If you've got a lot of system RAM, a battery-backed disk buffer, an OS/filesystem that effectively delays writes, and bursty transactional writes, it's quite possible you'll get better performance putting everything on one array rather than breaking it up to follow the "rules".

You might get a performance boost by putting the transaction log on a separate partition or LUN on the external array, depending on how the fs implements syncs or whether you can optimize the filesystem choice for each partition.

The correct approach is to run comparative benchmarks of each configuration. :-)

Mike Stone
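For quick, repeatable numbers across configurations, contrib/pgbench (shipped with 8.0) is one option, though the real warehouse job mix is the benchmark that actually matters. Scale and client counts below are illustrative, and the bench database is a throwaway:

    createdb bench
    pgbench -i -s 100 bench        # initialise ~100 "branches" of test data
    pgbench -c 10 -t 1000 bench    # 10 clients x 1000 transactions each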
On Fri, 2005-08-12 at 08:47 +0000, Steve Poe wrote:
> 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog, you're better off.

Like Mr. Stone said earlier, this is pure dogma. In my experience, xlogs on the same volume with data are much faster if both are on battery-backed write-back RAID controller memory. Moving from this situation to xlogs on a single normal disk is going to be much slower in most cases.

-jwb
I think the T-3 RAID at least breaks some of these rules -- I've got 2 T-3's, one configured as RAID-10 and the other as RAID5, and they both seem to perform about the same. I use RAID5 with a hot spare, so it's using 8 spindles.

I got a lot of performance improvement out of mounting the fs noatime and turning journaling off. Of course it takes a *long* time to recover from a crash.

Steve Poe wrote:
> From what I have learned from others on this list, RAID5 is not the best choice for the database. RAID10 would be a better solution (using 8 of your disks); then take the remaining disk for your pg_xlog, mirrored if possible.
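For what it's worth, the noatime / journaling-off experiment mentioned above would look roughly like this on Solaris UFS; device names and mount points are placeholders, and nologging (which disables UFS logging) is exactly why crash recovery gets so slow:

    # one-off mount
    mount -F ufs -o noatime,nologging /dev/dsk/c2t1d0s0 /pgdata

    # or persistently, via the mount-options field in /etc/vfstab:
    # /dev/dsk/c2t1d0s0  /dev/rdsk/c2t1d0s0  /pgdata  ufs  2  yes  noatime,nologging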
(Musing, trying to think of a general-purpose performance-tuning rule that applies here):

Actually, it seems to me that with the addition of the WAL in PostgreSQL and the subsequent decreased need to fsync the data files themselves (only during checkpoints?), the only time a battery-backed write cache would make a really large performance difference would be on the drive(s) hosting the WAL.

So although it is in general good to have a dedicated spindle for the WAL, for many workloads it is in fact significantly better to have the WAL written to a battery-backed write cache. The exception would be for applications with fewer, larger transactions, in which case you could actually use the dedicated spindle.

Hmmm, on second thought, now I think I understand the rationale behind having a non-zero commit_delay setting -- the problem with putting pg_xlog on a single disk without a write cache is that frequent fsync() calls might cause it to spend most of its time seeking instead of writing (as seems to be happening to Paul here). Then again, the OS IO scheduler should take care of this for you, making this a non-issue. Perhaps Solaris 10 just has really poor IO scheduling performance with this particular hardware and workload?

Ah well. Thought myself in circles and have no real conclusions to show for it. Posting anyway, maybe this will give somebody some ideas to work with.

-- Mark Lewis
> Actually, it seems to me that with the addition of the WAL in PostgreSQL and the subsequent decreased need to fsync the data files themselves (only during checkpoints?), the only time a battery-backed write cache would make a really large performance difference would be on the drive(s) hosting the WAL.

It still helps. In my experience a good BBU RAID controller is only slightly slower than fsync=false. Fear the checkpoint storm if you don't have some write caching. Beyond that I don't really care about write delay.

Another thing to watch out for is that some sync modes (varying by platform) can do more than one seek per sync. This will absolutely kill your commit performance on the WAL without write caching.

> So although it is in general good to have a dedicated spindle for the WAL, for many workloads it is in fact significantly better to have the WAL written to a battery-backed write cache. The exception would be for applications with fewer, larger transactions, in which case you could actually use the dedicated spindle.

Exactly.

> Hmmm, on second thought, now I think I understand the rationale behind having a non-zero commit_delay setting -- the problem with putting

I don't trust commit_delay. Get a good RAID controller and make sure pg is properly using it. Now, if you can't (or won't) do some type of write caching, BBU or no, your system has to be very carefully designed to get any performance at all, especially with high transaction volumes.

Merlin
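For anyone who still wants to experiment with the commit_delay rationale Mark raised (and Merlin distrusts), the knobs look like this. The values are illustrative only, not a recommendation, and the warehouse database name is an assumption:

    # postgresql.conf (both are ordinary GUCs, also settable per session):
    #   commit_delay    = 100     # microseconds to wait for companion commits
    #   commit_siblings = 5       # only delay if this many other xacts are open
    psql -d warehouse -c "SHOW commit_delay"      # defaults to 0, i.e. disabled
    psql -d warehouse -c "SHOW commit_siblings"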
On Thu, Aug 11, 2005 at 10:18:44AM -0700, Mark Lewis wrote:
> Actually, it seems to me that with the addition of the WAL in PostgreSQL and the subsequent decreased need to fsync the data files themselves (only during checkpoints?), the only time a battery-backed write cache would make a really large performance difference would be on the drive(s) hosting the WAL.

Write cache on a RAID array helps in the general case too, because it allows the controller to aggregate & reorder write requests. The OS probably tries to do this to some degree, but it can't do as well as the RAID controller because it doesn't know the physical disk layout.

> Hmmm, on second thought, now I think I understand the rationale behind having a non-zero commit_delay setting -- the problem with putting pg_xlog on a single disk without a write cache is that frequent fsync() calls might cause it to spend most of its time seeking instead of writing (as seems to be happening to Paul here). Then again, the OS IO scheduler should take care of this for you, making this a non-issue.

The OS can't do anything much in terms of IO scheduling for synchronous writes. Either it handles them in suboptimal order or you get hideous latency while requests are queued for reordering. Neither option is really great.

Mike Stone
On Aug 11, 2005, at 12:58 PM, Jeffrey W. Baker wrote:
> Like Mr. Stone said earlier, this is pure dogma. In my experience, xlogs on the same volume with data are much faster if both are on battery-backed write-back RAID controller memory. Moving from this situation to xlogs on a single normal disk is going to be much slower in most cases.

This also points out one important thing about performance, which is a touch unfortunate (and expensive to test): your mileage may vary on any of these improvements. Some people have no problems and incredible performance with, say, 1000 shared_bufs and the WAL on the same disk. Others need 10k shared_bufs and WAL split over a 900-spindle RAID with data spread across 18 SANs. Unfortunately there is no one true way :(

The best bet (which is great if you can): try out various settings. If you still run into problems, look into some more hardware; see if you can borrow any, or fabricate a "poor man's" equivalent for testing.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Jeff,

> > 4) pg_xlog: If your pg_xlog is on a spindle that is *only* for pg_xlog, you're better off.
>
> Like Mr. Stone said earlier, this is pure dogma. In my experience, xlogs on the same volume with data are much faster if both are on battery-backed write-back RAID controller memory. Moving from this situation to xlogs on a single normal disk is going to be much slower in most cases.

The advice on separate drives for xlog (as is all advice on that web page) is based on numerous, repeatable tests at OSDL.

However, you are absolutely correct in that it's *relative* advice, not absolute advice. If, for example, you're using a $100,000 EMC SAN as your storage, you'll probably be better off giving it everything and letting its controller and cache handle disk allocation etc. On the other hand, if you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter a case where a separate xlog disk did not benefit an OLTP application.

For Solaris, the advantage of using a separate disk or partition is that the mount options you want for the xlog (including forcedirectio) are considerably different from what you'd use with the main database.

--
Josh Berkus
Aglio Database Solutions
San Francisco
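Josh's Solaris point in practice: with pg_xlog on its own UFS slice, just that slice can be mounted forcedirectio while the main data area keeps a normal buffered mount. A sketch with placeholder device names and mount points:

    mount -F ufs -o forcedirectio,noatime /dev/dsk/c3t0d0s0 /pg_xlog
    mount -F ufs -o noatime               /dev/dsk/c3t0d0s3 /pgdata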
On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:
> However, you are absolutely correct in that it's *relative* advice, not absolute advice. If, for example, you're using a $100,000 EMC SAN as your storage, you'll probably be better off giving it everything and letting its controller and cache handle disk allocation etc. On the other hand, if you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter a case where a separate xlog disk did not benefit an OLTP application.

I've been asked this a couple of times and I don't know the answer: what happens if you give XLog a single drive (unmirrored single spindle), and that drive dies? So the question really is, should you be giving two disks to XLog?

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere." (Lamar Owen)
> I've been asked this a couple of times and I don't know the answer: what happens if you give XLog a single drive (unmirrored single spindle), and that drive dies? So the question really is, should you be giving two disks to XLog?

If that drive dies, you're restoring from backup. You would need to run at least RAID 1, preferably RAID 10.

Sincerely,

Joshua D. Drake

--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Alvaro Herrera wrote:
> I've been asked this a couple of times and I don't know the answer: what happens if you give XLog a single drive (unmirrored single spindle), and that drive dies? So the question really is, should you be giving two disks to XLog?

I can propose a simple test. Create a test database. Run postgres, insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog directory. Start postgres again: what does it do?

I suppose to simulate more of a failure mode, you could kill -9 the postmaster (and all child processes), perhaps during an insert, and then delete pg_xlog. But I would like to hear from the postgres folks what they *expect* would happen if you ever lost pg_xlog.

What about something like keeping pg_xlog on a ramdisk, and then rsyncing it to a hard disk every 5 minutes? If you die in the middle, does it just restore back to the 5-minutes-ago point, or does it get more thoroughly messed up? For some people, a 5-minute-old restore would be okay, as long as you still have transaction safety, so that you can figure out what needs to be restored.

John
=:->
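Spelled out, John's experiment would look something like the sketch below. It deliberately destroys the transaction log, so it only makes sense against a scratch test cluster; the path is a placeholder.

    PGDATA=/scratch/pgtest
    pg_ctl stop -D "$PGDATA" -m fast      # clean-shutdown variant
    # (for the dirty variant: kill -9 the postmaster mid-INSERT instead)
    rm -rf "$PGDATA"/pg_xlog/*            # throw away the WAL
    pg_ctl start -D "$PGDATA"             # see whether, and how, it comes back up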
On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:
> However, you are absolutely correct in that it's *relative* advice, not absolute advice. If, for example, you're using a $100,000 EMC SAN as your storage, you'll probably be better off giving it everything and letting its controller and cache handle disk allocation etc.

Well, you don't have to spend *quite* that much to get a decent storage array. :)

> On the other hand, if you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter a case where a separate xlog disk did not benefit an OLTP application.

IIRC, that's an older RAID controller that tops out at 128MB write cache, and 5 spindles ain't a lot -- so it makes sense that it would benefit from a separate spindle for xlog. Also note that I said the write cache advice goes out the window if you have a workload that involves constant writing (or if your xlog writes come in faster than your write cache can drain), because at that point you essentially drop back to raw disk speed; I assume the OLTP apps you mention are fairly write-intensive.

OTOH, in a reasonably safe configuration I suppose you'd end up with a 3-disk RAID5 / 2-disk RAID1, or two RAID1 pairs, on that Dell 6650; is that how you test? Once you're down to that small a data set, I'd expect the system's RAM cache to be a much larger percentage of the working set, which would tend to make the xlog just about the *only* latency-critical I/O. That's a different creature from a data mining app that might really benefit from having additional spindles to accelerate read performance from indices much larger than RAM.

At any rate, this just underscores the need for testing a particular workload on particular hardware. Things like the disk speed, RAID configuration, write cache size, transaction size, data set size, working set size, concurrent transactions, read vs write emphasis, etc., are each going to have a fairly large impact on performance.

> For Solaris, the advantage of using a separate disk or partition is that the mount options you want for the xlog (including forcedirectio) are considerably different from what you'd use with the main database.

Yeah, having a separate partition is often good even if you do have everything on the same disks.

Mike Stone
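The usual way to try the separate-partition idea is to relocate pg_xlog onto the other partition or LUN and leave a symlink behind, with the postmaster stopped. A sketch; the mount point and PGDATA path are placeholders:

    PGDATA=/pgdata/cluster
    pg_ctl stop -D "$PGDATA" -m fast
    mv "$PGDATA/pg_xlog" /pg_xlog_lun/pg_xlog
    ln -s /pg_xlog_lun/pg_xlog "$PGDATA/pg_xlog"
    pg_ctl start -D "$PGDATA"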
John A Meinel <john@arbash-meinel.com> writes:
> Alvaro Herrera wrote:
>> I've been asked this a couple of times and I don't know the answer: what happens if you give XLog a single drive (unmirrored single spindle), and that drive dies? So the question really is, should you be giving two disks to XLog?
>
> I can propose a simple test. Create a test database. Run postgres, insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog directory. Start postgres again, what does it do?

That test would really be completely unrelated to the problem.

If you are able to shut down the database cleanly, then you do not need pg_xlog anymore --- everything is on disk in the data area. You might have to use pg_resetxlog to get going again, but you won't lose anything by doing so.

The question of importance is: if the xlog drive dies while the database is running, are you going to be able to get the postmaster to shut down cleanly? My suspicion is "no" --- if the kernel is reporting write failures on WAL, that's going to prevent writes to the data drives (good ol' WAL-before-data rule). You could imagine failure modes where the drive is toast but isn't actually reporting any errors ... but one hopes that's not a common scenario.

In a scenario like this, it might be interesting to have a shutdown mode that deliberately ignores writing to WAL and just does its best to get all the dirty pages down onto the data drives.

In the meantime ... use a mirrored drive for WAL.

			regards, tom lane
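The pg_resetxlog step Tom mentions, for the cleanly-shut-down case where the xlog drive is simply gone, looks roughly like this (run as the postgres user with the server stopped; -f forces it to proceed, and the data directory path is a placeholder):

    pg_resetxlog -f /pgdata/cluster
    pg_ctl start -D /pgdata/cluster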
Tom Lane wrote:
> John A Meinel <john@arbash-meinel.com> writes:
>> I can propose a simple test. Create a test database. Run postgres, insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog directory. Start postgres again, what does it do?
>
> That test would really be completely unrelated to the problem.
>
> If you are able to shut down the database cleanly, then you do not need pg_xlog anymore --- everything is on disk in the data area. You might have to use pg_resetxlog to get going again, but you won't lose anything by doing so.

So pg_xlog is really only needed for a dirty shutdown. So what about the idea of having pg_xlog on a ramdisk that is synchronized periodically to a real disk? I'm guessing you would get corruption of the database, or at least you don't know what is clean and what is dirty, since there would be no WAL entry for some of the things that completed, but also no WAL entry for things that were not completed.

So what is actually written to the WAL? Is it something like: "I am writing these pages, and when page X has a certain value, I am finished"?

I'm just curious, because I don't believe you write to the WAL when you complete writing the data; you only make a note about what you are going to do before you do it. So there needs to be a way to detect whether you actually finished (which would be in the actual data).

John
=:->
John A Meinel <john@arbash-meinel.com> writes:
> So pg_xlog is really only needed for a dirty shutdown. So what about the idea of having pg_xlog on a ramdisk that is synchronized periodically to a real disk?

Well, if "periodically" means "at every transaction commit", that's pretty much what we do now.

			regards, tom lane
Michael,

> Well, you don't have to spend *quite* that much to get a decent storage array. :)

Yes, I'm just pointing out that it's only the extreme cases which are clear-cut. Middle cases are a lot harder to define.

For example, we've found that on DBT2 running off a 14-drive JBOD, separating off WAL boosts performance about 8% to 14%. On DBT3 (DSS), separate WAL (and separate tablespaces) helps considerably during data load, but not otherwise. So it all depends.

> That's a different creature from a data mining app that might really benefit from having additional spindles to accelerate read performance from indices much larger than RAM.

Yes, although the data mining app benefits from the special xlog disk during ETL. So it's a tradeoff.

> At any rate, this just underscores the need for testing a particular workload on particular hardware

Yes, absolutely.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco