Thread: WAL in RAM
Hi list,

Every now and then I have write peaks which cause annoying delays on my website. No particular reason, it seems; the laws of probability just dictate that there will be peaks every now and then.

Anyway, while thinking of ways to make the peaks more bearable, I saw the new 9.1 feature to bypass WAL (unlogged tables). The problem is mainly that some statistics tables ("x users clicked this link this month") clog the write cache, not the more important writes. I could live with restoring a nightly dump of these tables and losing a day's worth of logs.

Though I'm not keen on jumping onto an early major version, an old idea of putting WAL in RAM came back. Not RAM in main memory, but some thingie pretending to be a drive, with proper battery backup.

a) There seems to be odd hardware out there with RAM modules and, if you're lucky, a battery.
b) Some drive manufacturers have made hybrid RAM-spindle drives (compare with the possibly more common SSD-spindle hybrids).

b) sounds slightly more appealing, since it basically means I put everything on those drives and it is magically faster. The a) alternatives also seemed to be non-ECC, which is a no-no and disturbing.

Does anyone here have any recommendations?

Pricing is not very important, but reliability is.

Thanks,
Marcus
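(For reference, the 9.1 feature Marcus is describing is unlogged tables, which skip WAL entirely and are truncated after a crash. A minimal sketch of the approach, with hypothetical table and column names:)

    -- No WAL is written for an unlogged table, so its writes cannot clog
    -- the WAL stream; the trade-off is that its contents are emptied if
    -- the server crashes.
    CREATE UNLOGGED TABLE click_stats (
        link_id    integer     NOT NULL,
        clicked_at timestamptz NOT NULL DEFAULT now()
    );

A nightly "pg_dump -t click_stats" then matches the stated plan of losing at most a day's worth of statistics after a crash.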
Marcus Engene <mengpg2@engene.se> wrote:

> Every now and then I have write peaks which cause annoying delays
> on my website.
>
> Does anyone here have any recommendations?

For our largest machines we put WAL on a RAID1 drive pair dedicated to that task, on its own controller with battery-backed cache configured for write-back. It does make a big difference: when a DBA accidentally got this wrong once, we saw the problem you describe, and moving WAL to the dedicated drives/controller caused the problem to go away.

If problems remain, look for posts by Greg Smith on how to tune this. You may want to extend your checkpoint completion target, make the background writer more aggressive, reduce shared buffers, or tune the OS. But if you can afford to put WAL on a dedicated file system like the above, that would be a better place to start, IMO.

-Kevin
On Fri, Oct 28, 2011 at 10:28 AM, Marcus Engene <mengpg2@engene.se> wrote:
> Every now and then I have write peaks which cause annoying delays on my
> website.
> [...]
> Does anyone here have any recommendations?
>
> Pricing is not very important, but reliability is.

Have you ruled out SSD? They are a little new, but I'd be looking at the Intel 710. In every case I've seen, SSD permanently ends I/O issues. DRAM storage solutions I find pricey and complicated when there are so many workable flash options out now.

merlin
On 28 October 2011, 18:11, Merlin Moncure wrote:
> On Fri, Oct 28, 2011 at 10:28 AM, Marcus Engene <mengpg2@engene.se> wrote:
>> [...]
>> Does anyone here have any recommendations?
>>
>> Pricing is not very important, but reliability is.
>
> Have you ruled out SSD? They are a little new, but I'd be looking at
> the Intel 710. In every case I've seen, SSD permanently ends I/O
> issues. DRAM storage solutions I find pricey and complicated when
> there are so many workable flash options out now.

Are you sure SSDs are a reasonable option for WAL? I personally don't think so, because WAL is written in a sequential manner, and that's not an area where SSDs beat spinners really badly.

For example, the Intel 710 SSD has a sequential write speed of 210 MB/s, while a simple 7.2k SATA drive can write about 50-100 MB/s for less than 1/10 of the 710's price.

I'm not saying SSDs are a bad thing, but I think it's a waste of money to use them for WAL.

Tomas
On Fri, Oct 28, 2011 at 1:26 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Are you sure SSDs are a reasonable option for WAL? I personally don't
> think so, because WAL is written in a sequential manner, and that's
> not an area where SSDs beat spinners really badly.
>
> For example, the Intel 710 SSD has a sequential write speed of 210 MB/s,
> while a simple 7.2k SATA drive can write about 50-100 MB/s for less
> than 1/10 of the 710's price.
>
> I'm not saying SSDs are a bad thing, but I think it's a waste of money
> to use them for WAL.

Sure, but then you have to have a more complicated setup with a drive (or drives) designated for WAL, another for storage, etc. Also, your argument falls away if the WAL shares a drive with anything else. The era of the SSD is here. All new systems I plan will have SSD storage unless cost pressures are extreme -- often with a single drive unless you need the extra storage. If I need availability, instead of RAID I'll just build a hot standby in.

merlin
Hi,

On 28 October 2011, 17:28, Marcus Engene wrote:
> Anyway, while thinking of ways to make the peaks more bearable, I saw
> the new 9.1 feature to bypass WAL (unlogged tables). The problem is
> mainly that some statistics tables ("x users clicked this link this
> month") clog the write cache, not the more important writes. I could
> live with restoring a nightly dump of these tables and losing a day's
> worth of logs.

Why do you think the write activity is related to WAL? Does that mean bulk loading of data by users, or what? Have you measured how many WAL segments that creates? What triggers that write activity?

Write peaks usually mean a checkpoint is in progress, and that has nothing to do with WAL. More precisely, a checkpoint does not write data to the WAL but to the data files, so moving WAL to a separate device won't help at all. The common scenario is about this:

(1) The users make some changes (INSERT/UPDATE/DELETE) to the database, and these are written to the WAL (fsynced to the device). This is not a lot of work, as writing to WAL is sequential access, and the actual modifications are stored in the shared buffers (not forced to disk yet).

(2) A checkpoint is triggered, i.e. either checkpoint_timeout expires or all available WAL segments are filled. This means all the dirty buffers have to be actually written from shared buffers to the data files. This is a PITA, as it's rather random access.

Anyway, there are options to tune the write performance -- most notably checkpoint_segments, checkpoint_completion_target, and checkpoint_timeout.

> a) There seems to be odd hardware out there with RAM modules and, if
> you're lucky, a battery.
> b) Some drive manufacturers have made hybrid RAM-spindle drives [...]
>
> Does anyone here have any recommendations?

The thing to look for when talking about WAL is sequential write speed. The solutions you've mentioned above are great for random access, but when you need sequential speed they're a waste of money (IMHO).

> Pricing is not very important, but reliability is.

My recommendation is to find out what's wrong before buying anything. My impression is that you don't know the actual cause, so you really don't know what features the device should have. If you're willing to spend money without a proper analysis, send the money to me -- the result will be about the same. You'll spend money without actually solving the issue, plus it will make me a bit happier.

Anyway, my recommendations would be these:

(1) Find out what actually happens, i.e. check whether it's a checkpoint issue or something else. Enable log_checkpoints etc.
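(A minimal sketch of step (1); both parameters below are standard postgresql.conf settings:)

    # postgresql.conf
    log_checkpoints = on                # log each checkpoint's start/end and
                                        # how many buffers it had to write
    log_min_duration_statement = 1000   # also log statements slower than 1s,
                                        # to see what the peaks actually block

A plain reload -- SELECT pg_reload_conf(); from psql, or pg_ctl reload -- picks both up without a restart.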
(2) Try to tune the db a bit. I'm not sure what version you are using or what the important values are, but generally this is a good starting point for write-heavy databases:

    checkpoint_segments = 64
    checkpoint_completion_target = 0.9
    checkpoint_timeout = 30min

(3) Provide more details: Pg version, important config values, hardware info, etc.

Only if I knew what's wrong, and if the above things did not help, would I consider buying new hardware. I'd probably go with one of these options:

(a) Move the WAL to a separate disk (not SSD), or maybe a RAID1/RAID10 of such drives. Start with one drive; use more, plus a controller with a BBWC, if needed.

(b) Move the database to an SSD drive and leave the WAL in its original location (not SSD). This might be significantly more expensive, especially if you want to make it reliable (building a RAID1 of SSDs or something like that).

Tomas
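(One way to tell which of the two checkpoint triggers from the scenario above is firing, using a statistics view present in the 9.0 series under discussion:)

    -- If checkpoints_req dominates checkpoints_timed, checkpoints are being
    -- forced by WAL volume, and raising checkpoint_segments should help.
    SELECT checkpoints_timed,    -- checkpoints started by checkpoint_timeout
           checkpoints_req,      -- checkpoints forced by filling all segments
           buffers_checkpoint,   -- buffers written out by checkpoints
           buffers_backend       -- buffers backends had to write themselves
    FROM pg_stat_bgwriter;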
On 28 October 2011, 20:40, Merlin Moncure wrote:
> Sure, but then you have to have a more complicated setup with a drive
> (or drives) designated for WAL, another for storage, etc. [...] The
> era of the SSD is here. All new systems I plan will have SSD storage
> unless cost pressures are extreme -- often with a single drive unless
> you need the extra storage. If I need availability, instead of RAID
> I'll just build a hot standby in.

Well, sure -- I'm actually a fan of SSDs. Using SSDs for the data files, or an SSD for the whole database (including WAL), makes sense. But my impression was that the OP wants to buy a new drive and use it for WAL only, and that's not really cost-effective, I guess.

Tomas
On Fri, Oct 28, 2011 at 12:28 PM, Marcus Engene <mengpg2@engene.se> wrote:
> Every now and then I have write peaks which cause annoying delays on my
> website.
> [...]
> Does anyone here have any recommendations?

You didn't post configuration details. Just off the top of my head, I'd say you have a low shared_buffers setting and that increasing it could help. That's assuming the updates you mention on the statistics tables hit the same rows heavily over and over, in which case shared buffers would help tremendously.
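(A rough way to sanity-check the shared_buffers theory from the standard statistics views -- a heuristic, not a definitive diagnosis:)

    -- A low hit percentage on a read-mostly workload hints that
    -- shared_buffers is too small for the working set.
    SELECT datname,
           blks_hit,    -- reads satisfied from shared buffers
           blks_read,   -- reads that had to go to the OS / disk
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 1)
               AS hit_pct
    FROM pg_stat_database
    WHERE blks_hit + blks_read > 0;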
On 10/28/2011 12:26 PM, Tomas Vondra wrote:
> For example, the Intel 710 SSD has a sequential write speed of 210 MB/s,
> while a simple 7.2k SATA drive can write about 50-100 MB/s for less
> than 1/10 of the 710's price.

Bulk data transfer rates mean almost nothing in the context of a database (unless you're, for example, backing it up by copying the files to another machine). The key factors are typically small block writes/s (for WAL) and random small block reads/s (for data). A 710 or an SSD of similar performance will deliver on the order of 20-50x the performance of a traditional hard drive in these areas.
On 10/28/11 5:45, Kevin Grittner wrote:
> For our largest machines we put WAL on a RAID1 drive pair dedicated
> to that task, on its own controller with battery-backed cache
> configured for write-back. [...]
> But if you can afford to put WAL on a dedicated file system like the
> above, that would be a better place to start, IMO.

The problem I have with battery-backed RAID controllers is the battery part. They're simply not reliable, and they require testing etc. which I, as a rather insignificant customer at a generic datacenter, cannot have done properly. I have, however, found this thing which I find promising:

http://news.cnet.com/8301-21546_3-10273658-10253464.html

An Adaptec 5Z-series controller which has a supercap and flushes to an SSD drive on a mishap. Perhaps that's the answer to everything?

As per others' suggestions, I don't feel encouraged to put WAL on SSD, having found several texts by Greg Smith and others warning about this. I do have 2x OCZ SandForce 1500 drives (with supercap) for some burst-load tables.

The reason I started to think about putting WAL on a RAM drive to begin with was that the performance figures for unlogged tables looked very promising indeed. And the tests were of the sort that's occupying my bandwidth: accumulating statistical writes.

The present pg9 computer runs Pg 9.0.4 on Debian Squeeze, 2x Xeon, 72 GB RAM, software 4xRAID6 (sorry) + 2x SSD. It's an OLTP website with 10M products and SOLR for FTS. During peaks it's using ~3-4% CPU, and it's 99.9% reads or thereabouts. It's the peaks we want to take down. RAID6 or not, with a spindle as the bottleneck there is just a certain max number of writes/s.

Thanks for your answers so far!

Best regards,
Marcus
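(To put a number on how much WAL those accumulating statistical writes generate during a peak, the WAL insert position can be sampled before and after a burst; the function name is as spelled in the 9.0 series Marcus is running:)

    -- Run once, wait through a peak (say 60 seconds), then run again;
    -- the distance between the two positions is the WAL volume generated.
    SELECT pg_current_xlog_location();
    -- 9.2+ adds pg_xlog_location_diff() to subtract two positions directly;
    -- on 9.0 the hex positions have to be compared by hand.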
On Sat, Oct 29, 2011 at 11:54 AM, Marcus Engene <mengpg2@engene.se> wrote:
> The problem I have with battery-backed RAID controllers is the battery
> part. They're simply not reliable, and they require testing etc. which
> I, as a rather insignificant customer at a generic datacenter, cannot
> have done properly. I have, however, found this thing which I find
> promising:
> http://news.cnet.com/8301-21546_3-10273658-10253464.html
> An Adaptec 5Z-series controller which has a supercap and flushes to an
> SSD drive on a mishap. Perhaps that's the answer to everything?

In over 10 years of using hardware RAID controllers with battery backup on many, many machines, I have had exactly zero data loss due to a failed battery backup. Of course proper monitoring is important, to make sure the batteries aren't old and dead, but every single BBU RAID controller I have used automatically switched from write-back to write-through when it detected a bad battery pack.

Proper testing is essential whether it's BBU caching or an SSD, and failure to test is inconceivable if your data is at all important. Given the current high failure rate of SSDs due to firmware issues (and it's not just the Intel drives experiencing such failures), I'm much more confident in Areca, 3ware, and LSI BBU RAID controllers right now than I am in SSDs.

> [...]
> The present pg9 computer runs Pg 9.0.4 on Debian Squeeze, 2x Xeon,
> 72 GB RAM, software 4xRAID6 (sorry) + 2x SSD. [...] RAID6 or not, with
> a spindle as the bottleneck there is just a certain max number of
> writes/s.

First things first: get off RAID-6. A 4-drive RAID-6 gives no more storage than a 4-drive RAID-10 and is painfully slow by comparison. Looking at SSDs for WAL is putting the cart about 1,000 miles ahead of the horse at this point. You'd be much better off migrating to a single SSD for everything than running on a 4-disk RAID-6.
On 10/29/11 10:11, Scott Marlowe wrote:
> In over 10 years of using hardware RAID controllers with battery
> backup on many, many machines, I have had exactly zero data loss due
> to a failed battery backup. [...]

Rimu got me a setup with 2x 5805 BBU controllers configured as two RAID10 arrays with 15k rpm SAS drives, and on top of that 2x Xeon E5645 (the hex-core). Since I heard warnings that with non-software RAID the machine can be unresponsive during boot while doing a rebuild, I took small 300 GB drives. Not that 15k SAS drives come in much bigger sizes, but still.

I chickened out of pg 9.1 due to the low minor number. I also set...

    wal_buffers = 16MB

...which used to default to 64 kB, which could possibly explain some of the choke problems at write bursts.

> First things first: get off RAID-6. A 4-drive RAID-6 gives no more
> storage than a 4-drive RAID-10 and is painfully slow by comparison.
> [...]

Message received and understood :)

Having read up too much on drive reliability paranoia, in combination with going from 7k2 to 15k, I feel a bit uneasy, but this mama is fast. I suppose a little bit can be credited to the newly restored dump instead of the little over a year of entropy in the other machine. But I also did some update/write torture and it was hard to provoke any I/O wait.

I put the OS & WAL on one array and the general data files on the other. The data directory that used to be on the SSD drive was also put on the WAL RAID.

Thanks for your advice!

Marcus
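(wal_buffers is fixed at server start, so the change needs a restart; the running value can then be verified from psql:)

    SHOW wal_buffers;   -- should report 16MB once the server has restarted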