Thread: Tuning Tips for a new Server
Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice.

I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do?

The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each).

It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes.

Currently, I have this for the current system, which has 16Gb Ram:

max_connections = 350

work_mem = 32MB
maintenance_work_mem = 512MB
wal_buffers = 640kB

# This is what I was helped with before and made reporting queries blaze by
seq_page_cost = 1.0
random_page_cost = 3.0
cpu_tuple_cost = 0.5
effective_cache_size = 8192MB

Any help and input is greatly appreciated.

Thank you

Ogden
On 8/16/2011 8:35 PM, Ogden wrote: > Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice. > > I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do? > > The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each). > > It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes. > > Currently, I have this for the current system, which has 16Gb Ram: > > max_connections = 350 > > work_mem = 32MB > maintenance_work_mem = 512MB > wal_buffers = 640kB > > # This is what I was helped with before and made reporting queries blaze by > seq_page_cost = 1.0 > random_page_cost = 3.0 > cpu_tuple_cost = 0.5 > effective_cache_size = 8192MB > > Any help and input is greatly appreciated. > > Thank you > > Ogden What seems to be the problem? I mean, if nothing is broke, then don't fix it :-) You say reporting queries are fast, and the disks should take care of your slow write problem from before. (Did you test the write performance?) So, what's wrong? -Andy
On Aug 17, 2011, at 8:41 AM, Andy Colson wrote: > On 8/16/2011 8:35 PM, Ogden wrote: >> Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice. >> >> I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do? >> >> The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each). >> >> It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes. >> >> Currently, I have this for the current system, which has 16Gb Ram: >> >> max_connections = 350 >> >> work_mem = 32MB >> maintenance_work_mem = 512MB >> wal_buffers = 640kB >> >> # This is what I was helped with before and made reporting queries blaze by >> seq_page_cost = 1.0 >> random_page_cost = 3.0 >> cpu_tuple_cost = 0.5 >> effective_cache_size = 8192MB >> >> Any help and input is greatly appreciated. >> >> Thank you >> >> Ogden > > What seems to be the problem? I mean, if nothing is broke, then don't fix it :-) > > You say reporting queries are fast, and the disks should take care of your slow write problem from before. (Did you test the write performance?) So, what's wrong? I was wondering what the best parameters would be with my new setup. The work_mem obviously will increase, as will everything else, as it's a 64Gb machine as opposed to a 16Gb machine. The configuration I posted was for a 16Gb machine but this new one is 64Gb. I needed help in how to jump these numbers up. Thank you Ogden
On 17 Srpen 2011, 3:35, Ogden wrote: > Hope all is well. I have received tremendous help from this list prior and > therefore wanted some more advice. > > I bought some new servers and instead of RAID 5 (which I think greatly > hindered our writing performance), I configured 6 SCSI 15K drives with > RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K > drives on a different virtual disk and also Raid 10, a total of 146Gb. I > was thinking of putting Postgres' xlog directory on the OS virtual drive. > Does this even make sense to do? Yes, but it greatly depends on the amount of WAL and your workload. If you need to write a lot of WAL data (e.g. during bulk loading), this may significantly improve performance. It may also help when you have a write-heavy workload (a lot of clients updating records, background writer etc.) as that usually means a lot of seeking (while WAL is written sequentially). > The system memory is 64GB and the CPUs are dual Intel E5645 chips (they > are 6-core each). > > It is a dedicated PostgreSQL box and needs to support heavy read and > moderately heavy writes. What is the size of the database? So those are the new servers? What's the difference compared to the old ones? What is the RAID controller, how much write cache is there? > Currently, I have this for the current system, which has 16Gb Ram: > > max_connections = 350 > > work_mem = 32MB > maintenance_work_mem = 512MB > wal_buffers = 640kB Are you really using 350 connections? Something like "#cpus + #drives" is usually recommended as a sane number, unless the connections are idle most of the time. And even in that case, pooling is usually recommended. Anyway if this worked fine for your workload, I don't think you need to change those settings. I'd probably bump up the wal_buffers to 16MB - it might help a bit, definitely won't hurt, and it's so little memory it's not worth the effort I guess. > > # This is what I was helped with before and made reporting queries blaze > by > seq_page_cost = 1.0 > random_page_cost = 3.0 > cpu_tuple_cost = 0.5 > effective_cache_size = 8192MB Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy to me, as it says reading a page sequentially is just twice as expensive as processing it. This value should be about 100x lower or something like that. What are the checkpoint settings (segments, completion target)? What about shared buffers? Tomas
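For anyone wanting to try the pooling route suggested above, a minimal sketch of a pgbouncer configuration is below; the database name, pool size and file paths are placeholder assumptions, not values from this thread:

[databases]
; placeholder database name
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; the application can keep opening up to 350 client connections...
max_client_conn = 350
; ...while Postgres only sees a pool sized closer to "#cpus + #drives"
default_pool_size = 24

With transaction pooling the application-side connection count can stay at 350 while max_connections in postgresql.conf drops to something near the number of cores and spindles.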
On 17 Srpen 2011, 16:28, Ogden wrote: > I was wondering what the best parameters would be with my new setup. The > work_mem obviously will increase as will everything else as it's a 64Gb > machine as opposed to a 16Gb machine. The configuration I posted was for > a 16Gb machine but this new one is 64Gb. I needed help in how to jump > these numbers up. Well, that really depends on how you come to the current work_mem settings. If you've decided that with this amount of work_mem the queries run fine and higher values don't give you better performance (because the amount of data that needs to be sorted / hashed) fits into the work_mem, then don't increase it. But if you've just set it so that the memory is not exhausted, increasing it may actually help you. What I think you should review is the amount of shared buffers, checkpoints and page cache settings (see this for example http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html). Tomas
On Aug 17, 2011, at 9:44 AM, Tomas Vondra wrote: > On 17 Srpen 2011, 3:35, Ogden wrote: >> Hope all is well. I have received tremendous help from this list prior and >> therefore wanted some more advice. >> >> I bought some new servers and instead of RAID 5 (which I think greatly >> hindered our writing performance), I configured 6 SCSI 15K drives with >> RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K >> drives on a different virtual disk and also Raid 10, a total of 146Gb. I >> was thinking of putting Postgres' xlog directory on the OS virtual drive. >> Does this even make sense to do? > > Yes, but it greatly depends on the amount of WAL and your workload. If you > need to write a lot of WAL data (e.g. during bulk loading), this may > significantly improve performance. It may also help when you have a > write-heavy workload (a lot of clients updating records, background writer > etc.) as that usually means a lot of seeking (while WAL is written > sequentially). The database is about 200Gb, so using /usr/local/pgsql/pg_xlog on a virtual disk with 100Gb should not be a problem with the disk space, should it? >> The system memory is 64GB and the CPUs are dual Intel E5645 chips (they >> are 6-core each). >> >> It is a dedicated PostgreSQL box and needs to support heavy read and >> moderately heavy writes. > > What is the size of the database? So those are the new servers? What's the > difference compared to the old ones? What is the RAID controller, how much > write cache is there? > I am sorry I overlooked specifying this. The database is about 200Gb and yes these are new servers which bring more power (RAM, CPU) over the last one. The RAID Controller is a Perc H700 and there is 512Mb write cache. The servers are Dells. >> Currently, I have this for the current system, which has 16Gb Ram: >> >> max_connections = 350 >> >> work_mem = 32MB >> maintenance_work_mem = 512MB >> wal_buffers = 640kB > > Are you really using 350 connections? Something like "#cpus + #drives" is > usually recommended as a sane number, unless the connections are idle most > of the time. And even in that case, pooling is usually recommended. > > Anyway if this worked fine for your workload, I don't think you need to > change those settings. I'd probably bump up the wal_buffers to 16MB - it > might help a bit, definitely won't hurt, and it's so little memory it's not > worth the effort I guess. So just increasing the wal_buffers is okay? I thought there would be more as the memory in the system is now 4 times as much. Perhaps shared_buffers too (down below). >> >> # This is what I was helped with before and made reporting queries blaze >> by >> seq_page_cost = 1.0 >> random_page_cost = 3.0 >> cpu_tuple_cost = 0.5 >> effective_cache_size = 8192MB > > Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy > to me, as it says reading a page sequentially is just twice as expensive > as processing it. This value should be about 100x lower or something like > that. These settings are for the old server, keep in mind. It's a 16GB machine (the new one is 64Gb). The value for cpu_tuple_cost should be 0.005? How are the other ones? > What are the checkpoint settings (segments, completion target)? What about > shared buffers?
#checkpoint_segments = 3              # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min            # range 30s-1h
checkpoint_completion_target = 0.9    # checkpoint target duration, 0.0 - 1.0 - was 0.5
#checkpoint_warning = 30s             # 0 disables

And shared_buffers = 4096MB

Thank you very much

Ogden
I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
The benchmark results are here:

http://malekkoheavyindustry.com/benchmark.html
Thank you
Ogden
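For reference, a bonnie++ run along these lines can be used to reproduce such results; the directory, label and size below are assumptions (the usual advice is a test size of roughly twice RAM so the OS page cache cannot hide the disks):

# 128 GB of test data (~2x the 64 GB of RAM), skip the small-file tests,
# run as the postgres user, label the run so the results table is readable
bonnie++ -d /var/lib/pgsql/bonnie -s 131072 -n 0 -u postgres -m newdb01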
On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > > > Thank you > > Ogden That looks pretty normal to me. Ken
On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> >> http://malekkoheavyindustry.com/benchmark.html >> >> >> Thank you >> >> Ogden > > That looks pretty normal to me. > > Ken But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? Ogden
On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: > > On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > > > On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? > >> > >> The benchmark results are here: > >> > >> http://malekkoheavyindustry.com/benchmark.html > >> > >> > >> Thank you > >> > >> Ogden > > > > That looks pretty normal to me. > > > > Ken > > But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? > > Ogden Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar results with EXT4 as well, I suspect, although you did not test that. Regards, Ken
On 17/08/2011 7:26 PM, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID > 5) with the new one we have, which I have configured with RAID 10. The > drives are the same (SAS 15K). I tried the new system with ext3 and > then XFS but the results seem really outrageous as compared to the > current system, or am I reading things wrong? > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you have in the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller. I would say that your current system is outrageous in that it is so slow! Cheers, Gary.
On 8/17/2011 1:35 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >> >> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >> >>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >>>> >>>> The benchmark results are here: >>>> >>>> http://malekkoheavyindustry.com/benchmark.html >>>> >>>> >>>> Thank you >>>> >>>> Ogden >>> >>> That looks pretty normal to me. >>> >>> Ken >> >> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? >> >> Ogden > > Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar > results with EXT4 as well, I suspect, although you did not test that. > > Regards, > Ken > A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with 100K files (split into a small subset of directories), and xfs sucks bad on it. So my db is on xfs, and my photos are on ext4. The numbers between raid5 and raid10 don't really surprise me either. I went from 100 Meg/sec to 230 Meg/sec going from 3 disk raid 5 to 4 disk raid 10. (I'm, of course, using SATA drives.... with 4 gig of ram... and 2 cores. Everyone with more than 8 cores and 64 gig of ram is off my Christmas list! :-) ) -Andy
On Aug 17, 2011, at 1:48 PM, Andy Colson wrote: > On 8/17/2011 1:35 PM, ktm@rice.edu wrote: >> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>> >>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>> >>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >>>>> >>>>> The benchmark results are here: >>>>> >>>>> http://malekkoheavyindustry.com/benchmark.html >>>>> >>>>> >>>>> Thank you >>>>> >>>>> Ogden >>>> >>>> That looks pretty normal to me. >>>> >>>> Ken >>> >>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? >>> >>> Ogden >> >> Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar >> results with EXT4 as well, I suspect, although you did not test that. >> >> Regards, >> Ken >> > > A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with 100K files (split into a small subset of directories), and xfs sucks bad on it. > > So my db is on xfs, and my photos are on ext4. What about the OS itself? I put the Debian linux system also on XFS but haven't played around with it too much. Is it better to put the OS itself on ext4 and the /var/lib/pgsql partition on XFS? Thanks Ogden
On Aug 17, 2011, at 1:33 PM, Gary Doades wrote: > On 17/08/2011 7:26 PM, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> >> http://malekkoheavyindustry.com/benchmark.html >> > The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you have in the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller. I would say that your current system is outrageous in that it is so slow! > > Cheers, > Gary. Yes, under heavy writes the load would shoot right up, which is what caused us to look at upgrading. If it is the RAID 5, it is mind boggling that it could be that much of a difference. I expected a difference, but not that much. The new system has 6 drives, 300Gb 15K SAS, and I've put them into a RAID 10 configuration. The current system is ext3 with RAID 5 over 4 disks on a Perc/5i controller which has half the write cache of the new one (256 Mb vs 512Mb). Ogden
On 17 Srpen 2011, 18:39, Ogden wrote: >> Yes, but it greatly depends on the amount of WAL and your workload. If >> you >> need to write a lot of WAL data (e.g. during bulk loading), this may >> significantly improve performance. It may also help when you have a >> write-heavy workload (a lot of clients updating records, background >> writer >> etc.) as that usually means a lot of seeking (while WAL is written >> sequentially). > > The database is about 200Gb so using /usr/local/pgsql/pg_xlog on a virtual > disk with 100Gb should not be a problem with the disk space should it? I think you've mentioned the database is on 6 drives, while the other volume is on 2 drives, right? That makes the OS drive about 3x slower (just a rough estimate). But if the database drive is used heavily, it might help to move the xlog directory to the OS disk. See how is the db volume utilized and if it's fully utilized, try to move the xlog directory. The only way to find out is to actualy try it with your workload. >> What is the size of the database? So those are the new servers? What's >> the difference compared to the old ones? What is the RAID controller, how >> much write cache is there? > > I am sorry I overlooked specifying this. The database is about 200Gb and > yes these are new servers which bring more power (RAM, CPU) over the last > one. The RAID Controller is a Perc H700 and there is 512Mb write cache. > The servers are Dells. OK, sounds good although I don't have much experience with this controller. >>> Currently, I have this for the current system which as 16Gb Ram: >>> >>> max_connections = 350 >>> >>> work_mem = 32MB >>> maintenance_work_mem = 512MB >>> wal_buffers = 640kB >> >> Anyway if this worked fine for your workload, I don't think you need to >> change those settings. I'd probably bump up the wal_buffers to 16MB - it >> might help a bit, definitely won't hurt and it's so little memory it's >> not >> worth the effort I guess. > > So just increasing the wal_buffers is okay? I thought there would be more > as the memory in the system is now 4 times as much. Perhaps shared_buffers > too (down below). Yes, I was just commenting that particular piece of config. Shared buffers should be increased too. >>> # This is what I was helped with before and made reporting queries >>> blaze >>> by >>> seq_page_cost = 1.0 >>> random_page_cost = 3.0 >>> cpu_tuple_cost = 0.5 >>> effective_cache_size = 8192MB >> >> Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy >> to me, as it says reading a page sequentially is just twice as expensive >> as processing it. This value should be abou 100x lower or something like >> that. > > These settings are for the old server, keep in mind. It's a 16GB machine > (the new one is 64Gb). The value for cpu_tuple_cost should be 0.005? How > are the other ones? The default values are like this: seq_page_cost = 1.0 random_page_cost = 4.0 cpu_tuple_cost = 0.01 cpu_index_tuple_cost = 0.005 cpu_operator_cost = 0.0025 Increasing the cpu_tuple_cost to 0.5 makes it way too expensive I guess, so the database believes processing two 8kB pages is just as expensive as reading one from the disk. I guess this change penalizes plans that read a lot of pages, e.g. sequential scans (and favor index scans etc.). Maybe it makes sense in your case, I'm just wondering why you set it like that. >> What are the checkpoint settings (segments, completion target). What >> about >> shared buffers? 
> > > #checkpoint_segments = 3 # in logfile segments, min 1, 16MB > each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 > - 1.0 - was 0.5 > #checkpoint_warning = 30s # 0 disables You need to bump checkpoint segments up, e.g. 64 or maybe even more. This means how many WAL segments will be available until a checkpoint has to happen. Checkpoint is a process when dirty buffers from shared buffers are written to the disk, so it may be very I/O intensive. Each segment is 16MB, so 3 segments is just 48MB of data, while 64 is 1GB. More checkpoint segments result in longer recovery in case of database crash (because all the segments since last checkpoint need to be applied). But it's essential for good write performance. Completion target seems fine, but I'd consider increasing the timeout too. > shared_buffers = 4096MB The usual recommendation is about 25% of RAM for shared buffers, with 64GB of RAM that is 16GB. And you should increase effective_cache_size too. See this: http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server Tomas
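Pulling the suggestions in this thread together, a starting-point postgresql.conf for the 64GB box might look roughly like the following; these values are assumptions to be validated against the real workload, not tested settings:

shared_buffers = 16GB                # ~25% of RAM, per the wiki advice above
effective_cache_size = 48GB          # roughly the RAM left over for the OS page cache
wal_buffers = 16MB
checkpoint_segments = 64             # ~1GB of WAL between checkpoints
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
maintenance_work_mem = 1GB
work_mem = 32MB                      # keep, unless sorts/hashes are spilling to disk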
On 17/08/2011 7:56 PM, Ogden wrote: > On Aug 17, 2011, at 1:33 PM, Gary Doades wrote: > >> On 17/08/2011 7:26 PM, Ogden wrote: >>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>> >>> The benchmark results are here: >>> >>> http://malekkoheavyindustry.com/benchmark.html >>> >> The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you havein the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller.I would say that your current system is outrageous in that is is so slow! >> >> Cheers, >> Gary. > > Yes, under heavy writes the load would shoot right up which is what caused us to look at upgrading. If it is the RAID 5,it is mind boggling that it could be that much of a difference. I expected a difference, now that much. > > The new system has 6 drives, 300Gb 15K SAS and I've put them into a RAID 10 configuration. The current system is ext3 withRAID 5 over 4 disks on a Perc/5i controller which has half the write cache as the new one (256 Mb vs 512Mb). Hmm... for only 6 disks in RAID 10 I would say that the figures are a bit higher than I would expect. The PERC 5 controller is pretty poor in my opinion, PERC 6 a lot better and the new H700's pretty good. I'm guessing you have a H700 in your new system. I've just got a Dell 515 with a H700 and 8 SAS in RAID 10 and I only get around 600 MB/s read using ext4 and Ubuntu 10.4 server. Like I say, your figures are not outrageous, just unexpectedly good :) Cheers, Gary.
On Aug 17, 2011, at 1:56 PM, Tomas Vondra wrote: > On 17 Srpen 2011, 18:39, Ogden wrote: >>> Yes, but it greatly depends on the amount of WAL and your workload. If >>> you >>> need to write a lot of WAL data (e.g. during bulk loading), this may >>> significantly improve performance. It may also help when you have a >>> write-heavy workload (a lot of clients updating records, background >>> writer >>> etc.) as that usually means a lot of seeking (while WAL is written >>> sequentially). >> >> The database is about 200Gb so using /usr/local/pgsql/pg_xlog on a virtual >> disk with 100Gb should not be a problem with the disk space should it? > > I think you've mentioned the database is on 6 drives, while the other > volume is on 2 drives, right? That makes the OS drive about 3x slower > (just a rough estimate). But if the database drive is used heavily, it > might help to move the xlog directory to the OS disk. See how is the db > volume utilized and if it's fully utilized, try to move the xlog > directory. > > The only way to find out is to actually try it with your workload. Thank you for your help. I just wanted to ask then, for now I should also put the xlog directory in the /var/lib/pgsql directory which is on the RAID container that is over 6 drives. You see, I wanted to put it on the container with the 2 drives because just the OS is installed on it and has the space (about 100Gb free). But you don't think it will be a problem to put the xlog directory along with everything else on /var/lib/pgsql/data? I had seen someone suggesting separating it for their setup and it sounded like a good idea so I thought why not, but in retrospect and with what you are saying about the OS drives being 3x slower, it may be okay just to put them on the 6 drives. Thoughts? Thank you once again for your tremendous help Ogden
On 8/17/2011 1:55 PM, Ogden wrote: > > On Aug 17, 2011, at 1:48 PM, Andy Colson wrote: > >> On 8/17/2011 1:35 PM, ktm@rice.edu wrote: >>> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>>> >>>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>>> >>>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>>>>> >>>>>> The benchmark results are here: >>>>>> >>>>>> http://malekkoheavyindustry.com/benchmark.html >>>>>> >>>>>> >>>>>> Thank you >>>>>> >>>>>> Ogden >>>>> >>>>> That looks pretty normal to me. >>>>> >>>>> Ken >>>> >>>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? >>>> >>>> Ogden >>> >>> Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar >>> results with EXT4 as well, I suspect, although you did not test that. >>> >>> Regards, >>> Ken >>> >> >> A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with100K files (split into a small subset of directories), and xfs sucks bad on it. >> >> So my db is on xfs, and my photos are on ext4. > > > What about the OS itself? I put the Debian linux sysem also on XFS but haven't played around with it too much. Is it betterto put the OS itself on ext4 and the /var/lib/pgsql partition on XFS? > > Thanks > > Ogden I doubt it matters. The OS is not going to batch delete thousands of files. Once its setup, its pretty constant. I would not worry about it. -Andy
On Wed, Aug 17, 2011 at 12:56 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > > I think you've mentioned the database is on 6 drives, while the other > volume is on 2 drives, right? That makes the OS drive about 3x slower > (just a rough estimate). But if the database drive is used heavily, it > might help to move the xlog directory to the OS disk. See how is the db > volume utilized and if it's fully utilized, try to move the xlog > directory. > > The only way to find out is to actualy try it with your workload. This is a very important point. I've found on most machines with hardware caching RAID and 8 or fewer 15k SCSI drives it's just as fast to put it all on one big RAID-10 and if necessary partition it to put the pg_xlog on its own file system. After that depending on the workload you might need a LOT of drives in the pg_xlog dir or just a pair. Under normal ops many dbs will use only a tiny % of a dedicated pg_xlog. Then something like a site indexer starts to run, and writing heavily to the db, and the usage shoots to 100% and it's the bottleneck.
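One way to see whether the database volume is actually the bottleneck, as suggested above, is extended iostat output (from the sysstat package); the device name is an assumption:

# prints extended per-device statistics every 5 seconds
iostat -x 5
# watch %util and await for the device backing /var/lib/pgsql (e.g. sdb);
# sustained %util near 100 while the OS volume is idle suggests moving pg_xlog off it is worth trying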
On Wed, Aug 17, 2011 at 1:55 PM, Ogden <lists@darkstatic.com> wrote:
What about the OS itself? I put the Debian linux sysem also on XFS but haven't played around with it too much. Is it better to put the OS itself on ext4 and the /var/lib/pgsql partition on XFS?
We've always put the OS on whatever default filesystem it uses, and then put PGDATA on a RAID 10/XFS and PGXLOG on RAID 1/XFS (and for our larger installations, we setup another RAID 10/XFS for heavily accessed indexes or tables). If you have a battery-backed cache on your controller (and it's been tested to work), you can increase performance by mounting the XFS partitions with "nobarrier"...just make sure your battery backup works.
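A sketch of the mount layout described above; the device names and the use of a separate pg_xlog volume are assumptions, and nobarrier only belongs here if the controller's battery-backed cache has actually been tested:

# /etc/fstab
/dev/sdb1   /var/lib/pgsql           xfs   noatime,nobarrier   0 0
/dev/sdc1   /var/lib/pgsql/pg_xlog   xfs   noatime,nobarrier   0 0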
I don't know how current this information is for 9.x (we're still on 8.4), but there is (used to be?) a threshold above which more shared_buffers didn't help. The numbers vary, but somewhere between 8 and 16 GB is typically quoted. We set ours to 25% RAM, but no more than 12 GB (even for our machines with 128+ GB of RAM) because that seems to be a breaking point for our workload.
Of course, no advice will take the place of testing with your workload, so be sure to test =)
On Aug 17, 2011, at 2:14 PM, Scott Marlowe wrote: > On Wed, Aug 17, 2011 at 12:56 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> >> I think you've mentioned the database is on 6 drives, while the other >> volume is on 2 drives, right? That makes the OS drive about 3x slower >> (just a rough estimate). But if the database drive is used heavily, it >> might help to move the xlog directory to the OS disk. See how is the db >> volume utilized and if it's fully utilized, try to move the xlog >> directory. >> >> The only way to find out is to actually try it with your workload. > > This is a very important point. I've found on most machines with > hardware caching RAID and 8 or fewer 15k SCSI drives it's just as > fast to put it all on one big RAID-10 and if necessary partition it to > put the pg_xlog on its own file system. After that depending on the > workload you might need a LOT of drives in the pg_xlog dir or just a > pair. Under normal ops many dbs will use only a tiny % of a > dedicated pg_xlog. Then something like a site indexer starts to run, > and writing heavily to the db, and the usage shoots to 100% and it's > the bottleneck. I suppose this is my confusion. Or rather I am curious about this. On my current production database the pg_xlog directory is 8Gb (our total database is 200Gb). Does this warrant a totally separate setup (and hardware) from PGDATA?
On 17 Srpen 2011, 21:22, Ogden wrote: >> This is a very important point. I've found on most machines with >> hardware caching RAID and 8 or fewer 15k SCSI drives it's just as >> fast to put it all on one big RAID-10 and if necessary partition it to >> put the pg_xlog on its own file system. After that depending on the >> workload you might need a LOT of drives in the pg_xlog dir or just a >> pair. Under normal ops many dbs will use only a tiny % of a >> dedicated pg_xlog. Then something like a site indexer starts to run, >> and writing heavily to the db, and the usage shoots to 100% and it's >> the bottleneck. > > I suppose this is my confusion. Or rather I am curious about this. On my > current production database the pg_xlog directory is 8Gb (our total > database is 200Gb). Does this warrant a totally separate setup (and > hardware) than PGDATA? This is not about database size, it's about the workload - the way you're using your database. Even a small database may produce a lot of WAL segments, if the workload is write-heavy. So it's impossible to recommend something except to try that on your own. Tomas
On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote:
> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote:
>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote:
>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote:
>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
>>>>
>>>> The benchmark results are here:
>>>>
>>>> http://malekkoheavyindustry.com/benchmark.html
>>>>
>>>> Thank you
>>>>
>>>> Ogden
>>>
>>> That looks pretty normal to me.
>>>
>>> Ken
>>
>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal?
>>
>> Ogden
>
> Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar results with EXT4 as well, I suspect, although you did not test that.
I tested ext4 and the results did not seem to be that close to XFS, especially when looking at the Block K/sec for the Sequential Output.
So XFS would be best in this case?
Thank you
Ogden
On Wed, Aug 17, 2011 at 03:40:03PM -0500, Ogden wrote: > > On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote: > > > On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: > >> > >> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > >> > >>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > >>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? > >>>> > >>>> The benchmark results are here: > >>>> > >>>> http://malekkoheavyindustry.com/benchmark.html > >>>> > >>>> > >>>> Thank you > >>>> > >>>> Ogden > >>> > >>> That looks pretty normal to me. > >>> > >>> Ken > >> > >> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? > >> > >> Ogden > > > > Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar > > results with EXT4 as well, I suspect, although you did not test that. > > > i tested ext4 and the results did not seem to be that close to XFS. Especially when looking at the Block K/sec for theSequential Output. > > http://malekkoheavyindustry.com/benchmark.html > > So XFS would be best in this case? > > Thank you > > Ogden It appears so for at least the Bonnie++ benchmark. I would really try to benchmark your actual DB on both EXT4 and XFS because some of the comparative benchmarks between the two give the win to EXT4 for INSERT/UPDATE database usage with PostgreSQL. Only your application will know for sure....:) Ken
On Aug 17, 2011, at 3:56 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 03:40:03PM -0500, Ogden wrote: >> >> On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote: >> >>> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>>> >>>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>>> >>>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>>>>> >>>>>> The benchmark results are here: >>>>>> >>>>>> http://malekkoheavyindustry.com/benchmark.html >>>>>> >>>>>> >>>>>> Thank you >>>>>> >>>>>> Ogden >>>>> >>>>> That looks pretty normal to me. >>>>> >>>>> Ken >>>> >>>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? >>>> >>>> Ogden >>> >>> Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar >>> results with EXT4 as well, I suspect, although you did not test that. >> >> >> i tested ext4 and the results did not seem to be that close to XFS. Especially when looking at the Block K/sec for theSequential Output. >> >> http://malekkoheavyindustry.com/benchmark.html >> >> So XFS would be best in this case? >> >> Thank you >> >> Ogden > > It appears so for at least the Bonnie++ benchmark. I would really try to benchmark > your actual DB on both EXT4 and XFS because some of the comparative benchmarks between > the two give the win to EXT4 for INSERT/UPDATE database usage with PostgreSQL. Only > your application will know for sure....:) > > Ken What are some good methods that one can use to benchmark PostgreSQL under heavy loads? Ie. to emulate heavy writes? Are thereany existing scripts and what not? Thank you Afra
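For a reasonably standard write-heavy load generator, pgbench (shipped with PostgreSQL) is the usual starting point; the database name, scale factor, client count and duration below are placeholder assumptions:

createdb pgbench_test
pgbench -i -s 1000 pgbench_test       # initialize roughly 15GB of TPC-B-style tables
pgbench -c 32 -T 600 pgbench_test     # 32 concurrent clients for 10 minutes, mostly small UPDATEs/INSERTs

Replaying your own application's queries is still the better test, as Ken says, but pgbench gives a repeatable baseline for comparing ext4 and XFS on the same box.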
On 08/17/2011 02:26 PM, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID > 5) with the new one we have, which I have configured with RAID 10. The > drives are the same (SAS 15K). I tried the new system with ext3 and > then XFS but the results seem really outrageous as compared to the > current system, or am I reading things wrong? > > The benchmark results are here: > http://malekkoheavyindustry.com/benchmark.html Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described. The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance- > owner@postgresql.org] On Behalf Of Greg Smith > Sent: Wednesday, August 17, 2011 3:18 PM > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Raid 5 vs Raid 10 Benchmarks Using bonnie++ > > On 08/17/2011 02:26 PM, Ogden wrote: > > I am using bonnie++ to benchmark our current Postgres system (on RAID > > 5) with the new one we have, which I have configured with RAID 10. > The > > drives are the same (SAS 15K). I tried the new system with ext3 and > > then XFS but the results seem really outrageous as compared to the > > current system, or am I reading things wrong? > > > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > > Congratulations--you're now qualified to be a member of the "RAID5 > sucks" club. You can find other members at > http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and > just terrible write ones are expected if that's on your old hardware. > Your new results are what I would expect from the hardware you've > described. > > The only thing that looks weird are your ext4 "Sequential Output - > Block" results. They should be between the ext3 and the XFS results, > not far lower than either. Normally this only comes from using a bad > set of mount options. With a battery-backed write cache, you'd want to > use "nobarrier" for example; if you didn't do that, that can crush > output rates. > To clarify maybe for those new at using non-default mount options. With XFS the mount option is nobarrier. With ext4 I think it is barrier=0 Someone please correct me if I am misleading people or otherwise mistaken. -mark
On 08/17/2011 08:35 PM, mark wrote: > With XFS the mount option is nobarrier. With ext4 I think it is barrier=0 http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt ext4 supports both; "nobarrier" and "barrier=0" mean the same thing. I tend to use "nobarrier" just because I'm used to that name on XFS systems. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Aug 17, 2011, at 4:16 PM, Greg Smith wrote: > On 08/17/2011 02:26 PM, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> http://malekkoheavyindustry.com/benchmark.html >> > > Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described. > > The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates. Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing and good enough to switch to nobarrier? Just worried that if there is a sudden power shutdown, data can be lost with this option. I did not do that with XFS and it did quite well - I know it's up to my app and more testing, but in your experience, what is usually a good filesystem to use? I keep reading conflicting things... Thank you Ogden
On 18/08/2011 11:48 AM, Ogden wrote: > Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing and goodenough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. > > Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card's battery backed cache? Why would it force a RAID controller cache flush to disk, too? -- Craig Ringer
On 18/08/11 17:35, Craig Ringer wrote: > On 18/08/2011 11:48 AM, Ogden wrote: >> Isn't this very dangerous? I have the Dell PERC H700 card - I see >> that it has 512Mb Cache. Is this the same thing and good enough to >> switch to nobarrier? Just worried if a sudden power shut down, then >> data can be lost on this option. >> >> > Yeah, I'm confused by that too. Shouldn't a write barrier flush data > to persistent storage - in this case, the RAID card's battery backed > cache? Why would it force a RAID controller cache flush to disk, too? > > If the card's cache has a battery, then the cache is preserved in the event of crash/power loss etc. - provided it has enough charge - so setting the 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar with (LSI Megaraid rebranded models) all switch to write-through mode if they detect the battery is dangerously discharged, so this is not normally a problem (but commit/fsync performance will fall off a cliff when this happens)! Cheers Mark
On Thu, Aug 18, 2011 at 1:35 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: > On 18/08/2011 11:48 AM, Ogden wrote: >> >> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it >> has 512Mb Cache. Is this the same thing and good enough to switch to >> nobarrier? Just worried if a sudden power shut down, then data can be lost >> on this option. >> >> > Yeah, I'm confused by that too. Shouldn't a write barrier flush data to > persistent storage - in this case, the RAID card's battery backed cache? Why > would it force a RAID controller cache flush to disk, too? The "barrier" is the linux fs/block way of saying "these writes need to be on persistent media before I can depend on them". On typical spinning media disks, that means out of the disk cache (which is not persistent) and on platters. The way it assures that the writes are on "persistent media" is with a "flush cache" type of command. The "flush cache" is a close approximation to "make sure it's persistent". If your cache is battery backed, it is now persistent, and there is no need to "flush cache", hence the nobarrier option if you believe your cache is persistent. Now, make sure that even though your raid cache is persistent, your disks have cache in write-through mode, cause it would suck for your raid cache to "work", but believe the data is safely on disk and only find out that it was in the disks' (small) cache, and your raid is out of sync after an outage because of that... I believe most raid cards will handle that correctly for you automatically. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: > On 18/08/11 17:35, Craig Ringer wrote: >> On 18/08/2011 11:48 AM, Ogden wrote: >>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing andgood enough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. >>> >>> >> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card'sbattery backed cache? Why would it force a RAID controller cache flush to disk, too? >> >> > > If the card's cache has a battery, then the cache is preserved in the advent of crash/power loss etc - provided it hasenough charge, so setting 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar (LSI Megaraidrebranded models) all switch to write-though mode if they detect the battery is dangerously discharged so this isnot normally a problem (but commit/fsync performance will fall off a cliff when this happens)! > > Cheers > > Mark So a setting such as this: Device Name : /dev/sdb Type : SAS Read Policy : No Read Ahead Write Policy : Write Back Cache Policy : Not Applicable Stripe Element Size : 64 KB Disk Cache Policy : Enabled Is sufficient to enable nobarrier then with these settings? Thank you Ogden
On Aug 17, 2011, at 4:17 PM, Greg Smith wrote:
> On 08/17/2011 02:26 PM, Ogden wrote:
>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
>>
>> The benchmark results are here:
>>
>> http://malekkoheavyindustry.com/benchmark.html
>
> Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described.
>
> The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates.
I have mounted the ext4 system with the nobarrier option:
/dev/sdb1 on /var/lib/pgsql type ext4 (rw,noatime,data=writeback,barrier=0,nobh,errors=remount-ro)
Yet the results still show a decrease in performance in the ext4 "Sequential Output - Block" results:

However, the Random Seeks figure is better, even better than XFS...
Any thoughts as to why this is occurring?
Ogden
On 19/08/11 02:09, Ogden wrote: > On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: > >> On 18/08/11 17:35, Craig Ringer wrote: >>> On 18/08/2011 11:48 AM, Ogden wrote: >>>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing andgood enough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. >>>> >>>> >>> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card'sbattery backed cache? Why would it force a RAID controller cache flush to disk, too? >>> >>> >> If the card's cache has a battery, then the cache is preserved in the advent of crash/power loss etc - provided it hasenough charge, so setting 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar (LSI Megaraidrebranded models) all switch to write-though mode if they detect the battery is dangerously discharged so this isnot normally a problem (but commit/fsync performance will fall off a cliff when this happens)! >> >> Cheers >> >> Mark > > So a setting such as this: > > Device Name : /dev/sdb > Type : SAS > Read Policy : No Read Ahead > Write Policy : Write Back > Cache Policy : Not Applicable > Stripe Element Size : 64 KB > Disk Cache Policy : Enabled > > > Is sufficient to enable nobarrier then with these settings? > Hmm - that output looks different from the cards I'm familiar with. I'd want to see the manual entries for "Cache Policy=Not Applicable" and "Disk Cache Policy=Enabled" to understand what the settings actually mean. Assuming "Disk Cache Policy=Enabled" means what I think it does (i.e writes are cached in the physical drives cache), this setting seems wrong if your card has on board cache + battery, you would want to only cache 'em in the *card's* cache (too many caches to keep straight in one's head, lol). Cheers Mark
On 19/08/11 12:52, Mark Kirkwood wrote: > On 19/08/11 02:09, Ogden wrote: >> On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: >> >>> On 18/08/11 17:35, Craig Ringer wrote: >>>> On 18/08/2011 11:48 AM, Ogden wrote: >>>>> Isn't this very dangerous? I have the Dell PERC H700 card - I see >>>>> that it has 512Mb Cache. Is this the same thing and good enough to >>>>> switch to nobarrier? Just worried if a sudden power shut down, >>>>> then data can be lost on this option. >>>>> >>>>> >>>> Yeah, I'm confused by that too. Shouldn't a write barrier flush >>>> data to persistent storage - in this case, the RAID card's battery >>>> backed cache? Why would it force a RAID controller cache flush to >>>> disk, too? >>>> >>>> >>> If the card's cache has a battery, then the cache is preserved in >>> the advent of crash/power loss etc - provided it has enough charge, >>> so setting 'writeback' property on arrays is safe. The >>> PERC/SERVERRAID cards I'm familiar (LSI Megaraid rebranded models) >>> all switch to write-though mode if they detect the battery is >>> dangerously discharged so this is not normally a problem (but >>> commit/fsync performance will fall off a cliff when this happens)! >>> >>> Cheers >>> >>> Mark >> >> So a setting such as this: >> >> Device Name : /dev/sdb >> Type : SAS >> Read Policy : No Read Ahead >> Write Policy : Write Back >> Cache Policy : Not Applicable >> Stripe Element Size : 64 KB >> Disk Cache Policy : Enabled >> >> >> Is sufficient to enable nobarrier then with these settings? >> > > > Hmm - that output looks different from the cards I'm familiar with. > I'd want to see the manual entries for "Cache Policy=Not Applicable" > and "Disk Cache Policy=Enabled" to understand what the settings > actually mean. Assuming "Disk Cache Policy=Enabled" means what I think > it does (i.e writes are cached in the physical drives cache), this > setting seems wrong if your card has on board cache + battery, you > would want to only cache 'em in the *card's* cache (too many caches > to keep straight in one's head, lol). > FWIW - here's what our ServerRaid (M5015) output looks like for a RAID 1 array configured with writeback, reads not cached on the card's memory, physical disk caches disabled: $ MegaCli64 -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 67.054 GB State : Optimal Strip Size : 64 KB Number Of Drives : 2 Span Depth : 1 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Access Policy : Read/Write Disk Cache Policy : Disabled Encryption Type : None
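On the LSI-based cards being discussed, the battery and disk-cache state can be checked from the same tool; the exact flags vary a bit between MegaCli versions and rebrands, so treat these as approximate:

MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL       # battery state, charge, and whether a relearn cycle is running
MegaCli64 -LDGetProp -DskCache -LAll -aALL     # shows whether the physical drives' own write caches are on
MegaCli64 -LDSetProp -DisDskCache -LAll -aALL  # turns the drive caches off, leaving only the BBU-protected cache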
apologies for such a late response to this thread, but there is domething I think is _really_ dangerous here. On Thu, 18 Aug 2011, Aidan Van Dyk wrote: > On Thu, Aug 18, 2011 at 1:35 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: >> On 18/08/2011 11:48 AM, Ogden wrote: >>> >>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it >>> has 512Mb Cache. Is this the same thing and good enough to switch to >>> nobarrier? Just worried if a sudden power shut down, then data can be lost >>> on this option. >>> >>> >> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to >> persistent storage - in this case, the RAID card's battery backed cache? Why >> would it force a RAID controller cache flush to disk, too? > > The "barrier" is the linux fs/block way of saying "these writes need > to be on persistent media before I can depend on them". On typical > spinning media disks, that means out of the disk cache (which is not > persistent) and on platters. The way it assures that the writes are > on "persistant media" is with a "flush cache" type of command. The > "flush cache" is a close approximation to "make sure it's persistent". > > If your cache is battery backed, it is now persistent, and there is no > need to "flush cache", hence the nobarrier option if you believe your > cache is persistent. > > Now, make sure that even though your raid cache is persistent, your > disks have cache in write-through mode, cause it would suck for your > raid cache to "work", but believe the data is safely on disk and only > find out that it was in the disks (small) cache, and you're raid is > out of sync after an outage because of that... I believe most raid > cards will handle that correctly for you automatically. if you don't have barriers enabled, the data may not get written out of main memory to the battery backed memory on the card as the OS has no reason to do the write out of the OS buffers now rather than later. Every raid card I have seen has ignored the 'flush cache' type of command if it has a battery and that battery is good, so you leave the barriers enabled and the card still gives you great performance. David Lang
On Mon, Sep 12, 2011 at 6:57 PM, <david@lang.hm> wrote:

>> The "barrier" is the Linux fs/block way of saying "these writes need
>> to be on persistent media before I can depend on them". On typical
>> spinning-media disks, that means out of the disk cache (which is not
>> persistent) and onto the platters. The way it assures that the writes are
>> on "persistent media" is with a "flush cache" type of command. The
>> "flush cache" is a close approximation to "make sure it's persistent".
>>
>> If your cache is battery backed, it is now persistent, and there is no
>> need to "flush cache", hence the nobarrier option if you believe your
>> cache is persistent.
>>
>> Now, make sure that even though your RAID cache is persistent, your
>> disks have their caches in write-through mode, because it would suck for your
>> RAID cache to "work" but believe the data is safely on disk, only to
>> find out that it was in the disks' (small) caches and your RAID is
>> out of sync after an outage because of that... I believe most RAID
>> cards will handle that correctly for you automatically.
>
> If you don't have barriers enabled, the data may not get written out of main
> memory to the battery-backed memory on the card, as the OS has no reason to
> do the write out of the OS buffers now rather than later.

It's not quite so simple. The "sync" calls (pick your flavour) are what tell the OS buffers they have to go out. The syscall (on a working FS) won't return until the write and its data have reached the "device" safely and are considered persistent.

But in Linux, a barrier is actually a "synchronization" point, not just a "flush cache"... It's a "guarantee everything up to now is persistent, I'm going to start counting on it". But depending on your card, drivers and, yes, kernel version, that "barrier" is sometimes a "drain/block I/O queue, issue cache flush, wait, write specific data, flush, wait, open I/O queue". The double flush is because it needs to guarantee everything previous is good before it writes the "critical" piece, and then needs to guarantee that too.

Now, on good RAID hardware it's not usually that bad.

And then, just to confuse people more, LVM up until 2.6.29 (so that includes all those RHEL5/CentOS5 installs out there which default to using LVM) didn't handle barriers; it just sort of threw them out as it came across them, meaning that you got the performance of nobarrier even if you thought you were using barriers on poor RAID hardware.

> Every RAID card I have seen ignores the 'flush cache' type of command if
> it has a battery and that battery is good, so you can leave barriers enabled
> and the card still gives you great performance.

The XFS FAQ goes over much of it, starting at Q24:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

So, for pure performance, on a battery-backed controller, nobarrier is the recommended *performance* setting.

But, to throw a wrench into the plan: what happens when, during normal battery tests, your RAID controller decides the battery is failing? Of course it's going to start screaming and set off all your monitoring alarms (you're monitoring that, right?), but have you thought to make sure that your FS is remounted with barriers at the first sign of battery trouble?

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
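(To make that last point concrete, a minimal sketch of the failsafe. The fstab line, mount point, device name, and the MegaCli output string it greps for are all assumptions - they vary by card, firmware and distribution - so treat it as an outline to bolt onto whatever monitoring is already in place:)

#!/bin/sh
# Cron'd watchdog sketch: fall back to barriers the moment the controller
# battery looks unhealthy. While the BBU is good, the data array can stay
# mounted nobarrier, e.g. an fstab line like
#   /dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
# (device name illustrative).

MOUNTPOINT=/var/lib/pgsql

# "Battery State: Optimal" is typical MegaCli output; adjust the grep to
# whatever your card/OpenManage version actually prints.
if ! MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep -q 'Battery State.*Optimal'
then
    logger "RAID BBU not optimal - remounting $MOUNTPOINT with barriers"
    mount -o remount,barrier "$MOUNTPOINT"
fi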
On Mon, 12 Sep 2011, Aidan Van Dyk wrote:

> On Mon, Sep 12, 2011 at 6:57 PM, <david@lang.hm> wrote:
>
>>> The "barrier" is the Linux fs/block way of saying "these writes need
>>> to be on persistent media before I can depend on them". On typical
>>> spinning-media disks, that means out of the disk cache (which is not
>>> persistent) and onto the platters. The way it assures that the writes are
>>> on "persistent media" is with a "flush cache" type of command. The
>>> "flush cache" is a close approximation to "make sure it's persistent".
>>>
>>> If your cache is battery backed, it is now persistent, and there is no
>>> need to "flush cache", hence the nobarrier option if you believe your
>>> cache is persistent.
>>>
>>> Now, make sure that even though your RAID cache is persistent, your
>>> disks have their caches in write-through mode, because it would suck for your
>>> RAID cache to "work" but believe the data is safely on disk, only to
>>> find out that it was in the disks' (small) caches and your RAID is
>>> out of sync after an outage because of that... I believe most RAID
>>> cards will handle that correctly for you automatically.
>>
>> If you don't have barriers enabled, the data may not get written out of main
>> memory to the battery-backed memory on the card, as the OS has no reason to
>> do the write out of the OS buffers now rather than later.
>
> It's not quite so simple. The "sync" calls (pick your flavour) are
> what tell the OS buffers they have to go out. The syscall (on a
> working FS) won't return until the write and its data have reached the
> "device" safely and are considered persistent.
>
> But in Linux, a barrier is actually a "synchronization" point, not
> just a "flush cache"... It's a "guarantee everything up to now is
> persistent, I'm going to start counting on it". But depending on your
> card, drivers and, yes, kernel version, that "barrier" is sometimes a
> "drain/block I/O queue, issue cache flush, wait, write specific data,
> flush, wait, open I/O queue". The double flush is because it needs to
> guarantee everything previous is good before it writes the "critical"
> piece, and then needs to guarantee that too.
>
> Now, on good RAID hardware it's not usually that bad.
>
> And then, just to confuse people more, LVM up until 2.6.29 (so that
> includes all those RHEL5/CentOS5 installs out there which default to
> using LVM) didn't handle barriers; it just sort of threw them out as
> it came across them, meaning that you got the performance of
> nobarrier even if you thought you were using barriers on poor RAID
> hardware.

This is part of the problem. If you have a simple fs-on-hardware stack you may be able to get away with barriers, but if you have an fs-on-x-on-y-on-hardware type of thing (specifically where LVM is one of the things in the middle), and those things in the middle do not honor barriers, the fsync becomes meaningless: without propagating the barrier down the stack, the writes that the fsync triggers may not get to the disk.

>> Every RAID card I have seen ignores the 'flush cache' type of command if
>> it has a battery and that battery is good, so you can leave barriers enabled
>> and the card still gives you great performance.
>
> The XFS FAQ goes over much of it, starting at Q24:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>
> So, for pure performance, on a battery-backed controller, nobarrier is
> the recommended *performance* setting.
>
> But, to throw a wrench into the plan: what happens when, during normal
> battery tests, your RAID controller decides the battery is failing? Of
> course it's going to start screaming and set off all your monitoring
> alarms (you're monitoring that, right?), but have you thought to make
> sure that your FS is remounted with barriers at the first sign of
> battery trouble?

Yep.

On a good RAID card with battery-backed cache, the performance difference between barriers on and barriers off should be minimal. If it's not, I think you have something else going on.

David Lang
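(That "should be minimal" claim is easy to check directly, and the same exercise catches the stacking problem mentioned above: if barrier and nobarrier perform identically on hardware with no - or a failed - write cache, that's a hint barriers are being dropped somewhere in the stack, e.g. the pre-2.6.29 LVM case. A rough sketch; mount point, file names and the dd test are illustrative:)

# Stack sanity checks first: barriers only help if every layer passes them
# down (pre-2.6.29 device-mapper/LVM did not), and many kernels log a
# message when they have to disable barriers on a device.
$ uname -r
$ dmesg | grep -i barrier

# Rough before/after comparison of synchronous 8kB writes:
$ mount -o remount,barrier /var/lib/pgsql
$ dd if=/dev/zero of=/var/lib/pgsql/ddtest bs=8k count=5000 oflag=dsync

$ mount -o remount,nobarrier /var/lib/pgsql
$ dd if=/dev/zero of=/var/lib/pgsql/ddtest bs=8k count=5000 oflag=dsync
# With a healthy BBU writeback cache the two rates should be close;
# a big gap means the flushes really are hitting the platters.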
On Mon, Sep 12, 2011 at 8:47 PM, <david@lang.hm> wrote:

>> The XFS FAQ goes over much of it, starting at Q24:
>> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>>
>> So, for pure performance, on a battery-backed controller, nobarrier is
>> the recommended *performance* setting.
>>
>> But, to throw a wrench into the plan: what happens when, during normal
>> battery tests, your RAID controller decides the battery is failing? Of
>> course it's going to start screaming and set off all your monitoring
>> alarms (you're monitoring that, right?), but have you thought to make
>> sure that your FS is remounted with barriers at the first sign of
>> battery trouble?
>
> Yep.
>
> On a good RAID card with battery-backed cache, the performance difference
> between barriers on and barriers off should be minimal. If it's not, I
> think you have something else going on.

The performance boost you'll get is that you don't have the temporary stall in parallelization that the barriers cause. With barriers, even if the controller cache doesn't really flush, you still have the "can't send more writes to the device until the barriered write is done" behaviour, so at all those points you have only a single write command in flight.

The performance penalty of barriers on good cards comes about because barriers are designed to prevent the devices from reordering write persistence, and they do that by waiting for a write to be "persistent" before allowing more to be queued to the device.

With nobarrier, you operate under the assumption that the block device writes are persisted in the order the commands are issued to the devices, so you never have to "drain the queue" as you do in the normal barrier implementation, and can (in theory) always have more requests that the RAID card can be working on processing, reordering, and dispatching to the platters for the maximum theoretical throughput...

Of course, Linux has completely rewritten/changed the sync/barrier/flush methods over the past few years, and there is no guarantee they won't keep changing the implementation details in the future, so keep up on the filesystem details of whatever you're using...

So keep doing burn-ins, with real pull-the-cord tests... They can't "prove" it's 100% safe, but they can quickly prove when it's not ;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
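(For the pull-the-cord tests, one widely used harness is Brad Fitzpatrick's diskchecker.pl: the machine under test streams acknowledged writes while a second machine records them, so after the power pull it can report exactly which acknowledged writes never made it to disk. The invocation below is a sketch from memory - check the script's own usage text - and the host and file names are placeholders:)

# On a second machine that will stay powered on:
$ perl diskchecker.pl -l

# On the server under test, writing to the array in question:
$ perl diskchecker.pl -s otherhost create /var/lib/pgsql/diskcheck.test 500
# ... while that is running, literally pull the power cord; then boot up and:
$ perl diskchecker.pl -s otherhost verify /var/lib/pgsql/diskcheck.test
# Any blocks reported lost mean something in the stack acknowledged writes
# it had not actually made persistent.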