Thread: Tuning Tips for a new Server
Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice.

I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do?

The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each).

It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes.

Currently, I have this for the current system, which has 16Gb Ram:

max_connections = 350

work_mem = 32MB
maintenance_work_mem = 512MB
wal_buffers = 640kB

# This is what I was helped with before and made reporting queries blaze by
seq_page_cost = 1.0
random_page_cost = 3.0
cpu_tuple_cost = 0.5
effective_cache_size = 8192MB

Any help and input is greatly appreciated.

Thank you

Ogden
On 8/16/2011 8:35 PM, Ogden wrote: > Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice. > > I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do? > > The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each). > > It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes. > > Currently, I have this for the current system, which has 16Gb Ram: > > max_connections = 350 > > work_mem = 32MB > maintenance_work_mem = 512MB > wal_buffers = 640kB > > # This is what I was helped with before and made reporting queries blaze by > seq_page_cost = 1.0 > random_page_cost = 3.0 > cpu_tuple_cost = 0.5 > effective_cache_size = 8192MB > > Any help and input is greatly appreciated. > > Thank you > > Ogden What seems to be the problem? I mean, if nothing is broke, then don't fix it :-) You say reporting queries are fast, and the disks should take care of your slow write problem from before. (Did you test the write performance?) So, what's wrong? -Andy
On Aug 17, 2011, at 8:41 AM, Andy Colson wrote: > On 8/16/2011 8:35 PM, Ogden wrote: >> Hope all is well. I have received tremendous help from this list prior and therefore wanted some more advice. >> >> I bought some new servers and instead of RAID 5 (which I think greatly hindered our writing performance), I configured 6 SCSI 15K drives with RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K drives on a different virtual disk and also Raid 10, a total of 146Gb. I was thinking of putting Postgres' xlog directory on the OS virtual drive. Does this even make sense to do? >> >> The system memory is 64GB and the CPUs are dual Intel E5645 chips (they are 6-core each). >> >> It is a dedicated PostgreSQL box and needs to support heavy read and moderately heavy writes. >> >> Currently, I have this for the current system, which has 16Gb Ram: >> >> max_connections = 350 >> >> work_mem = 32MB >> maintenance_work_mem = 512MB >> wal_buffers = 640kB >> >> # This is what I was helped with before and made reporting queries blaze by >> seq_page_cost = 1.0 >> random_page_cost = 3.0 >> cpu_tuple_cost = 0.5 >> effective_cache_size = 8192MB >> >> Any help and input is greatly appreciated. >> >> Thank you >> >> Ogden > > What seems to be the problem? I mean, if nothing is broke, then don't fix it :-) > > You say reporting queries are fast, and the disks should take care of your slow write problem from before. (Did you test the write performance?) So, what's wrong? I was wondering what the best parameters would be with my new setup. The work_mem obviously will increase, as will everything else, as it's a 64Gb machine as opposed to a 16Gb machine. The configuration I posted was for a 16Gb machine but this new one is 64Gb. I needed help in how to jump these numbers up. Thank you Ogden
On 17 Srpen 2011, 3:35, Ogden wrote: > Hope all is well. I have received tremendous help from this list prior and > therefore wanted some more advice. > > I bought some new servers and instead of RAID 5 (which I think greatly > hindered our writing performance), I configured 6 SCSI 15K drives with > RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K > drives on a different virtual disk and also Raid 10, a total of 146Gb. I > was thinking of putting Postgres' xlog directory on the OS virtual drive. > Does this even make sense to do? Yes, but it greatly depends on the amount of WAL and your workload. If you need to write a lot of WAL data (e.g. during bulk loading), this may significantly improve performance. It may also help when you have a write-heavy workload (a lot of clients updating records, background writer etc.) as that usually means a lot of seeking (while WAL is written sequentially). > The system memory is 64GB and the CPUs are dual Intel E5645 chips (they > are 6-core each). > > It is a dedicated PostgreSQL box and needs to support heavy read and > moderately heavy writes. What is the size of the database? So those are the new servers? What's the difference compared to the old ones? What is the RAID controller, how much write cache is there? > Currently, I have this for the current system, which has 16Gb Ram: > > max_connections = 350 > > work_mem = 32MB > maintenance_work_mem = 512MB > wal_buffers = 640kB Are you really using 350 connections? Something like "#cpus + #drives" is usually recommended as a sane number, unless the connections are idle most of the time. And even in that case, pooling is usually recommended. Anyway if this worked fine for your workload, I don't think you need to change those settings. I'd probably bump up the wal_buffers to 16MB - it might help a bit, definitely won't hurt, and it's so little memory it's not worth the effort I guess. > > # This is what I was helped with before and made reporting queries blaze > by > seq_page_cost = 1.0 > random_page_cost = 3.0 > cpu_tuple_cost = 0.5 > effective_cache_size = 8192MB Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy to me, as it says reading a page sequentially is just twice as expensive as processing it. This value should be about 100x lower or something like that. What are the checkpoint settings (segments, completion target)? What about shared buffers? Tomas
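For anyone wanting to try the pooling route suggested above, a minimal sketch of a pgbouncer configuration is below; the database name, pool size and file paths are placeholder assumptions, not values from this thread:

[databases]
; placeholder database name
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; the application can keep opening up to 350 client connections...
max_client_conn = 350
; ...while Postgres only sees a pool sized closer to "#cpus + #drives"
default_pool_size = 24

With transaction pooling the application-side connection count can stay at 350 while max_connections in postgresql.conf drops to something near the number of cores and spindles.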
On 17 Srpen 2011, 16:28, Ogden wrote: > I was wondering what the best parameters would be with my new setup. The > work_mem obviously will increase as will everything else as it's a 64Gb > machine as opposed to a 16Gb machine. The configuration I posted was for > a 16Gb machine but this new one is 64Gb. I needed help in how to jump > these numbers up. Well, that really depends on how you come to the current work_mem settings. If you've decided that with this amount of work_mem the queries run fine and higher values don't give you better performance (because the amount of data that needs to be sorted / hashed) fits into the work_mem, then don't increase it. But if you've just set it so that the memory is not exhausted, increasing it may actually help you. What I think you should review is the amount of shared buffers, checkpoints and page cache settings (see this for example http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html). Tomas
On Aug 17, 2011, at 9:44 AM, Tomas Vondra wrote: > On 17 Srpen 2011, 3:35, Ogden wrote: >> Hope all is well. I have received tremendous help from this list prior and >> therefore wanted some more advice. >> >> I bought some new servers and instead of RAID 5 (which I think greatly >> hindered our writing performance), I configured 6 SCSI 15K drives with >> RAID 10. This is dedicated to /var/lib/pgsql. The main OS has 2 SCSI 15K >> drives on a different virtual disk and also Raid 10, a total of 146Gb. I >> was thinking of putting Postgres' xlog directory on the OS virtual drive. >> Does this even make sense to do? > > Yes, but it greatly depends on the amount of WAL and your workload. If you > need to write a lot of WAL data (e.g. during bulk loading), this may > significantly improve performance. It may also help when you have a > write-heavy workload (a lot of clients updating records, background writer > etc.) as that usually means a lot of seeking (while WAL is written > sequentially). The database is about 200Gb, so using /usr/local/pgsql/pg_xlog on a virtual disk with 100Gb should not be a problem with the disk space, should it? >> The system memory is 64GB and the CPUs are dual Intel E5645 chips (they >> are 6-core each). >> >> It is a dedicated PostgreSQL box and needs to support heavy read and >> moderately heavy writes. > > What is the size of the database? So those are the new servers? What's the > difference compared to the old ones? What is the RAID controller, how much > write cache is there? > I am sorry I overlooked specifying this. The database is about 200Gb and yes these are new servers which bring more power (RAM, CPU) over the last one. The RAID Controller is a Perc H700 and there is 512Mb write cache. The servers are Dells. >> Currently, I have this for the current system, which has 16Gb Ram: >> >> max_connections = 350 >> >> work_mem = 32MB >> maintenance_work_mem = 512MB >> wal_buffers = 640kB > > Are you really using 350 connections? Something like "#cpus + #drives" is > usually recommended as a sane number, unless the connections are idle most > of the time. And even in that case, pooling is usually recommended. > > Anyway if this worked fine for your workload, I don't think you need to > change those settings. I'd probably bump up the wal_buffers to 16MB - it > might help a bit, definitely won't hurt, and it's so little memory it's not > worth the effort I guess. So just increasing the wal_buffers is okay? I thought there would be more as the memory in the system is now 4 times as much. Perhaps shared_buffers too (down below). >> >> # This is what I was helped with before and made reporting queries blaze >> by >> seq_page_cost = 1.0 >> random_page_cost = 3.0 >> cpu_tuple_cost = 0.5 >> effective_cache_size = 8192MB > > Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy > to me, as it says reading a page sequentially is just twice as expensive > as processing it. This value should be about 100x lower or something like > that. These settings are for the old server, keep in mind. It's a 16GB machine (the new one is 64Gb). The value for cpu_tuple_cost should be 0.005? How are the other ones? > What are the checkpoint settings (segments, completion target)? What about > shared buffers?
#checkpoint_segments = 3              # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min            # range 30s-1h
checkpoint_completion_target = 0.9    # checkpoint target duration, 0.0 - 1.0 - was 0.5
#checkpoint_warning = 30s             # 0 disables

And shared_buffers = 4096MB

Thank you very much

Ogden
I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
The benchmark results are here:

http://malekkoheavyindustry.com/benchmark.html
Thank you
Ogden
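For reference, a bonnie++ run along these lines can be used to reproduce such results; the directory, label and size below are assumptions (the usual advice is a test size of roughly twice RAM so the OS page cache cannot hide the disks):

# 128 GB of test data (~2x the 64 GB of RAM), skip the small-file tests,
# run as the postgres user, label the run so the results table is readable
bonnie++ -d /var/lib/pgsql/bonnie -s 131072 -n 0 -u postgres -m newdb01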
On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > > > Thank you > > Ogden That looks pretty normal to me. Ken
On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> >> http://malekkoheavyindustry.com/benchmark.html >> >> >> Thank you >> >> Ogden > > That looks pretty normal to me. > > Ken But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? Ogden
On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: > > On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > > > On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? > >> > >> The benchmark results are here: > >> > >> http://malekkoheavyindustry.com/benchmark.html > >> > >> > >> Thank you > >> > >> Ogden > > > > That looks pretty normal to me. > > > > Ken > > But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? > > Ogden Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar results with EXT4 as well, I suspect, although you did not test that. Regards, Ken
On 17/08/2011 7:26 PM, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID > 5) with the new one we have, which I have configured with RAID 10. The > drives are the same (SAS 15K). I tried the new system with ext3 and > then XFS but the results seem really outrageous as compared to the > current system, or am I reading things wrong? > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you have in the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller. I would say that your current system is outrageous in that it is so slow! Cheers, Gary.
On 8/17/2011 1:35 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >> >> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >> >>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >>>> >>>> The benchmark results are here: >>>> >>>> http://malekkoheavyindustry.com/benchmark.html >>>> >>>> >>>> Thank you >>>> >>>> Ogden >>> >>> That looks pretty normal to me. >>> >>> Ken >> >> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? >> >> Ogden > > Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar > results with EXT4 as well, I suspect, although you did not test that. > > Regards, > Ken > A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with 100K files (split into a small subset of directories), and xfs sucks bad on it. So my db is on xfs, and my photos are on ext4. The numbers between raid5 and raid10 don't really surprise me either. I went from 100 Meg/sec to 230 Meg/sec going from 3 disk raid 5 to 4 disk raid 10. (I'm, of course, using SATA drives.... with 4 gig of ram... and 2 cores. Everyone with more than 8 cores and 64 gig of ram is off my Christmas list! :-) ) -Andy
On Aug 17, 2011, at 1:48 PM, Andy Colson wrote: > On 8/17/2011 1:35 PM, ktm@rice.edu wrote: >> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>> >>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>> >>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >>>>> >>>>> The benchmark results are here: >>>>> >>>>> http://malekkoheavyindustry.com/benchmark.html >>>>> >>>>> >>>>> Thank you >>>>> >>>>> Ogden >>>> >>>> That looks pretty normal to me. >>>> >>>> Ken >>> >>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal? >>> >>> Ogden >> >> Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar >> results with EXT4 as well, I suspect, although you did not test that. >> >> Regards, >> Ken >> > > A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with 100K files (split into a small subset of directories), and xfs sucks bad on it. > > So my db is on xfs, and my photos are on ext4. What about the OS itself? I put the Debian linux system also on XFS but haven't played around with it too much. Is it better to put the OS itself on ext4 and the /var/lib/pgsql partition on XFS? Thanks Ogden
On Aug 17, 2011, at 1:33 PM, Gary Doades wrote: > On 17/08/2011 7:26 PM, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> >> http://malekkoheavyindustry.com/benchmark.html >> > The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you have in the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller. I would say that your current system is outrageous in that it is so slow! > > Cheers, > Gary. Yes, under heavy writes the load would shoot right up, which is what caused us to look at upgrading. If it is the RAID 5, it is mind boggling that it could be that much of a difference. I expected a difference, but not that much. The new system has 6 drives, 300Gb 15K SAS, and I've put them into a RAID 10 configuration. The current system is ext3 with RAID 5 over 4 disks on a Perc/5i controller which has half the write cache of the new one (256 Mb vs 512Mb). Ogden
On 17 Srpen 2011, 18:39, Ogden wrote: >> Yes, but it greatly depends on the amount of WAL and your workload. If >> you >> need to write a lot of WAL data (e.g. during bulk loading), this may >> significantly improve performance. It may also help when you have a >> write-heavy workload (a lot of clients updating records, background >> writer >> etc.) as that usually means a lot of seeking (while WAL is written >> sequentially). > > The database is about 200Gb so using /usr/local/pgsql/pg_xlog on a virtual > disk with 100Gb should not be a problem with the disk space should it? I think you've mentioned the database is on 6 drives, while the other volume is on 2 drives, right? That makes the OS drive about 3x slower (just a rough estimate). But if the database drive is used heavily, it might help to move the xlog directory to the OS disk. See how is the db volume utilized and if it's fully utilized, try to move the xlog directory. The only way to find out is to actualy try it with your workload. >> What is the size of the database? So those are the new servers? What's >> the difference compared to the old ones? What is the RAID controller, how >> much write cache is there? > > I am sorry I overlooked specifying this. The database is about 200Gb and > yes these are new servers which bring more power (RAM, CPU) over the last > one. The RAID Controller is a Perc H700 and there is 512Mb write cache. > The servers are Dells. OK, sounds good although I don't have much experience with this controller. >>> Currently, I have this for the current system which as 16Gb Ram: >>> >>> max_connections = 350 >>> >>> work_mem = 32MB >>> maintenance_work_mem = 512MB >>> wal_buffers = 640kB >> >> Anyway if this worked fine for your workload, I don't think you need to >> change those settings. I'd probably bump up the wal_buffers to 16MB - it >> might help a bit, definitely won't hurt and it's so little memory it's >> not >> worth the effort I guess. > > So just increasing the wal_buffers is okay? I thought there would be more > as the memory in the system is now 4 times as much. Perhaps shared_buffers > too (down below). Yes, I was just commenting that particular piece of config. Shared buffers should be increased too. >>> # This is what I was helped with before and made reporting queries >>> blaze >>> by >>> seq_page_cost = 1.0 >>> random_page_cost = 3.0 >>> cpu_tuple_cost = 0.5 >>> effective_cache_size = 8192MB >> >> Are you sure the cpu_tuple_cost = 0.5 is correct? That seems a bit crazy >> to me, as it says reading a page sequentially is just twice as expensive >> as processing it. This value should be abou 100x lower or something like >> that. > > These settings are for the old server, keep in mind. It's a 16GB machine > (the new one is 64Gb). The value for cpu_tuple_cost should be 0.005? How > are the other ones? The default values are like this: seq_page_cost = 1.0 random_page_cost = 4.0 cpu_tuple_cost = 0.01 cpu_index_tuple_cost = 0.005 cpu_operator_cost = 0.0025 Increasing the cpu_tuple_cost to 0.5 makes it way too expensive I guess, so the database believes processing two 8kB pages is just as expensive as reading one from the disk. I guess this change penalizes plans that read a lot of pages, e.g. sequential scans (and favor index scans etc.). Maybe it makes sense in your case, I'm just wondering why you set it like that. >> What are the checkpoint settings (segments, completion target). What >> about >> shared buffers? 
> > > #checkpoint_segments = 3 # in logfile segments, min 1, 16MB > each > #checkpoint_timeout = 5min # range 30s-1h > checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 > - 1.0 - was 0.5 > #checkpoint_warning = 30s # 0 disables You need to bump checkpoint segments up, e.g. 64 or maybe even more. This means how many WAL segments will be available until a checkpoint has to happen. Checkpoint is a process when dirty buffers from shared buffers are written to the disk, so it may be very I/O intensive. Each segment is 16MB, so 3 segments is just 48MB of data, while 64 is 1GB. More checkpoint segments result in longer recovery in case of database crash (because all the segments since last checkpoint need to be applied). But it's essential for good write performance. Completion target seems fine, but I'd consider increasing the timeout too. > shared_buffers = 4096MB The usual recommendation is about 25% of RAM for shared buffers, with 64GB of RAM that is 16GB. And you should increase effective_cache_size too. See this: http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server Tomas
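Pulling the suggestions in this thread together, a starting-point postgresql.conf for the 64GB box might look roughly like the following; these values are assumptions to be validated against the real workload, not tested settings:

shared_buffers = 16GB                # ~25% of RAM, per the wiki advice above
effective_cache_size = 48GB          # roughly the RAM left over for the OS page cache
wal_buffers = 16MB
checkpoint_segments = 64             # ~1GB of WAL between checkpoints
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
maintenance_work_mem = 1GB
work_mem = 32MB                      # keep, unless sorts/hashes are spilling to disk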
On 17/08/2011 7:56 PM, Ogden wrote: > On Aug 17, 2011, at 1:33 PM, Gary Doades wrote: > >> On 17/08/2011 7:26 PM, Ogden wrote: >>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>> >>> The benchmark results are here: >>> >>> http://malekkoheavyindustry.com/benchmark.html >>> >> The results are not completely outrageous, however you don't say what drives, how many and what RAID controller you havein the current and new systems. You might expect that performance from 10/12 disks in RAID 10 with a good controller.I would say that your current system is outrageous in that is is so slow! >> >> Cheers, >> Gary. > > Yes, under heavy writes the load would shoot right up which is what caused us to look at upgrading. If it is the RAID 5,it is mind boggling that it could be that much of a difference. I expected a difference, now that much. > > The new system has 6 drives, 300Gb 15K SAS and I've put them into a RAID 10 configuration. The current system is ext3 withRAID 5 over 4 disks on a Perc/5i controller which has half the write cache as the new one (256 Mb vs 512Mb). Hmm... for only 6 disks in RAID 10 I would say that the figures are a bit higher than I would expect. The PERC 5 controller is pretty poor in my opinion, PERC 6 a lot better and the new H700's pretty good. I'm guessing you have a H700 in your new system. I've just got a Dell 515 with a H700 and 8 SAS in RAID 10 and I only get around 600 MB/s read using ext4 and Ubuntu 10.4 server. Like I say, your figures are not outrageous, just unexpectedly good :) Cheers, Gary.
On Aug 17, 2011, at 1:56 PM, Tomas Vondra wrote: > On 17 Srpen 2011, 18:39, Ogden wrote: >>> Yes, but it greatly depends on the amount of WAL and your workload. If >>> you >>> need to write a lot of WAL data (e.g. during bulk loading), this may >>> significantly improve performance. It may also help when you have a >>> write-heavy workload (a lot of clients updating records, background >>> writer >>> etc.) as that usually means a lot of seeking (while WAL is written >>> sequentially). >> >> The database is about 200Gb so using /usr/local/pgsql/pg_xlog on a virtual >> disk with 100Gb should not be a problem with the disk space should it? > > I think you've mentioned the database is on 6 drives, while the other > volume is on 2 drives, right? That makes the OS drive about 3x slower > (just a rough estimate). But if the database drive is used heavily, it > might help to move the xlog directory to the OS disk. See how is the db > volume utilized and if it's fully utilized, try to move the xlog > directory. > > The only way to find out is to actually try it with your workload. Thank you for your help. I just wanted to ask then, for now I should also put the xlog directory in the /var/lib/pgsql directory which is on the RAID container that is over 6 drives. You see, I wanted to put it on the container with the 2 drives because just the OS is installed on it and has the space (about 100Gb free). But you don't think it will be a problem to put the xlog directory along with everything else on /var/lib/pgsql/data? I had seen someone suggesting separating it for their setup and it sounded like a good idea so I thought why not, but in retrospect and with what you are saying about the OS drives being 3x slower, it may be okay just to put them on the 6 drives. Thoughts? Thank you once again for your tremendous help Ogden
On 8/17/2011 1:55 PM, Ogden wrote: > > On Aug 17, 2011, at 1:48 PM, Andy Colson wrote: > >> On 8/17/2011 1:35 PM, ktm@rice.edu wrote: >>> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>>> >>>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>>> >>>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>>>>> >>>>>> The benchmark results are here: >>>>>> >>>>>> http://malekkoheavyindustry.com/benchmark.html >>>>>> >>>>>> >>>>>> Thank you >>>>>> >>>>>> Ogden >>>>> >>>>> That looks pretty normal to me. >>>>> >>>>> Ken >>>> >>>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? >>>> >>>> Ogden >>> >>> Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar >>> results with EXT4 as well, I suspect, although you did not test that. >>> >>> Regards, >>> Ken >>> >> >> A while back I tested ext3 and xfs myself and found xfs performs better for PG. However, I also have a photos site with100K files (split into a small subset of directories), and xfs sucks bad on it. >> >> So my db is on xfs, and my photos are on ext4. > > > What about the OS itself? I put the Debian linux sysem also on XFS but haven't played around with it too much. Is it betterto put the OS itself on ext4 and the /var/lib/pgsql partition on XFS? > > Thanks > > Ogden I doubt it matters. The OS is not going to batch delete thousands of files. Once its setup, its pretty constant. I would not worry about it. -Andy
On Wed, Aug 17, 2011 at 12:56 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > > I think you've mentioned the database is on 6 drives, while the other > volume is on 2 drives, right? That makes the OS drive about 3x slower > (just a rough estimate). But if the database drive is used heavily, it > might help to move the xlog directory to the OS disk. See how is the db > volume utilized and if it's fully utilized, try to move the xlog > directory. > > The only way to find out is to actualy try it with your workload. This is a very important point. I've found on most machines with hardware caching RAID and 8 or fewer 15k SCSI drives it's just as fast to put it all on one big RAID-10 and if necessary partition it to put the pg_xlog on its own file system. After that depending on the workload you might need a LOT of drives in the pg_xlog dir or just a pair. Under normal ops many dbs will use only a tiny % of a dedicated pg_xlog. Then something like a site indexer starts to run, and writing heavily to the db, and the usage shoots to 100% and it's the bottleneck.
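One way to see whether the database volume is actually the bottleneck, as suggested above, is extended iostat output (from the sysstat package); the device name is an assumption:

# prints extended per-device statistics every 5 seconds
iostat -x 5
# watch %util and await for the device backing /var/lib/pgsql (e.g. sdb);
# sustained %util near 100 while the OS volume is idle suggests moving pg_xlog off it is worth trying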
On Wed, Aug 17, 2011 at 1:55 PM, Ogden <lists@darkstatic.com> wrote:
What about the OS itself? I put the Debian linux sysem also on XFS but haven't played around with it too much. Is it better to put the OS itself on ext4 and the /var/lib/pgsql partition on XFS?
We've always put the OS on whatever default filesystem it uses, and then put PGDATA on a RAID 10/XFS and PGXLOG on RAID 1/XFS (and for our larger installations, we setup another RAID 10/XFS for heavily accessed indexes or tables). If you have a battery-backed cache on your controller (and it's been tested to work), you can increase performance by mounting the XFS partitions with "nobarrier"...just make sure your battery backup works.
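A sketch of the mount layout described above; the device names and the use of a separate pg_xlog volume are assumptions, and nobarrier only belongs here if the controller's battery-backed cache has actually been tested:

# /etc/fstab
/dev/sdb1   /var/lib/pgsql           xfs   noatime,nobarrier   0 0
/dev/sdc1   /var/lib/pgsql/pg_xlog   xfs   noatime,nobarrier   0 0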
I don't know how current this information is for 9.x (we're still on 8.4), but there is (used to be?) a threshold above which more shared_buffers didn't help. The numbers vary, but somewhere between 8 and 16 GB is typically quoted. We set ours to 25% RAM, but no more than 12 GB (even for our machines with 128+ GB of RAM) because that seems to be a breaking point for our workload.
Of course, no advice will take the place of testing with your workload, so be sure to test =)
On Aug 17, 2011, at 2:14 PM, Scott Marlowe wrote: > On Wed, Aug 17, 2011 at 12:56 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> >> I think you've mentioned the database is on 6 drives, while the other >> volume is on 2 drives, right? That makes the OS drive about 3x slower >> (just a rough estimate). But if the database drive is used heavily, it >> might help to move the xlog directory to the OS disk. See how is the db >> volume utilized and if it's fully utilized, try to move the xlog >> directory. >> >> The only way to find out is to actually try it with your workload. > > This is a very important point. I've found on most machines with > hardware caching RAID and 8 or fewer 15k SCSI drives it's just as > fast to put it all on one big RAID-10 and if necessary partition it to > put the pg_xlog on its own file system. After that depending on the > workload you might need a LOT of drives in the pg_xlog dir or just a > pair. Under normal ops many dbs will use only a tiny % of a > dedicated pg_xlog. Then something like a site indexer starts to run, > and writing heavily to the db, and the usage shoots to 100% and it's > the bottleneck. I suppose this is my confusion. Or rather I am curious about this. On my current production database the pg_xlog directory is 8Gb (our total database is 200Gb). Does this warrant a totally separate setup (and hardware) from PGDATA?
On 17 Srpen 2011, 21:22, Ogden wrote: >> This is a very important point. I've found on most machines with >> hardware caching RAID and 8 or fewer 15k SCSI drives it's just as >> fast to put it all on one big RAID-10 and if necessary partition it to >> put the pg_xlog on its own file system. After that depending on the >> workload you might need a LOT of drives in the pg_xlog dir or just a >> pair. Under normal ops many dbs will use only a tiny % of a >> dedicated pg_xlog. Then something like a site indexer starts to run, >> and writing heavily to the db, and the usage shoots to 100% and it's >> the bottleneck. > > I suppose this is my confusion. Or rather I am curious about this. On my > current production database the pg_xlog directory is 8Gb (our total > database is 200Gb). Does this warrant a totally separate setup (and > hardware) than PGDATA? This is not about database size, it's about the workload - the way you're using your database. Even a small database may produce a lot of WAL segments, if the workload is write-heavy. So it's impossible to recommend something except to try that on your own. Tomas
On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote:
> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote:
>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote:
>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote:
>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
>>>>
>>>> The benchmark results are here:
>>>>
>>>> http://malekkoheavyindustry.com/benchmark.html
>>>>
>>>> Thank you
>>>>
>>>> Ogden
>>>
>>> That looks pretty normal to me.
>>>
>>> Ken
>>
>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new one with XFS. Is that much of a jump normal?
>>
>> Ogden
>
> Yes, RAID5 is bad in many ways. XFS is much better than EXT3. You would get similar results with EXT4 as well, I suspect, although you did not test that.
I tested ext4 and the results did not seem to be that close to XFS, especially when looking at the Block K/sec for the Sequential Output.
So XFS would be best in this case?
Thank you
Ogden
On Wed, Aug 17, 2011 at 03:40:03PM -0500, Ogden wrote: > > On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote: > > > On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: > >> > >> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: > >> > >>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: > >>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? > >>>> > >>>> The benchmark results are here: > >>>> > >>>> http://malekkoheavyindustry.com/benchmark.html > >>>> > >>>> > >>>> Thank you > >>>> > >>>> Ogden > >>> > >>> That looks pretty normal to me. > >>> > >>> Ken > >> > >> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? > >> > >> Ogden > > > > Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar > > results with EXT4 as well, I suspect, although you did not test that. > > > i tested ext4 and the results did not seem to be that close to XFS. Especially when looking at the Block K/sec for theSequential Output. > > http://malekkoheavyindustry.com/benchmark.html > > So XFS would be best in this case? > > Thank you > > Ogden It appears so for at least the Bonnie++ benchmark. I would really try to benchmark your actual DB on both EXT4 and XFS because some of the comparative benchmarks between the two give the win to EXT4 for INSERT/UPDATE database usage with PostgreSQL. Only your application will know for sure....:) Ken
On Aug 17, 2011, at 3:56 PM, ktm@rice.edu wrote: > On Wed, Aug 17, 2011 at 03:40:03PM -0500, Ogden wrote: >> >> On Aug 17, 2011, at 1:35 PM, ktm@rice.edu wrote: >> >>> On Wed, Aug 17, 2011 at 01:32:41PM -0500, Ogden wrote: >>>> >>>> On Aug 17, 2011, at 1:31 PM, ktm@rice.edu wrote: >>>> >>>>> On Wed, Aug 17, 2011 at 01:26:56PM -0500, Ogden wrote: >>>>>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configuredwith RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the resultsseem really outrageous as compared to the current system, or am I reading things wrong? >>>>>> >>>>>> The benchmark results are here: >>>>>> >>>>>> http://malekkoheavyindustry.com/benchmark.html >>>>>> >>>>>> >>>>>> Thank you >>>>>> >>>>>> Ogden >>>>> >>>>> That looks pretty normal to me. >>>>> >>>>> Ken >>>> >>>> But such a jump from the current db01 system to this? Over 20 times difference from the current system to the new onewith XFS. Is that much of a jump normal? >>>> >>>> Ogden >>> >>> Yes, RAID5 is bad for in many ways. XFS is much better than EXT3. You would get similar >>> results with EXT4 as well, I suspect, although you did not test that. >> >> >> i tested ext4 and the results did not seem to be that close to XFS. Especially when looking at the Block K/sec for theSequential Output. >> >> http://malekkoheavyindustry.com/benchmark.html >> >> So XFS would be best in this case? >> >> Thank you >> >> Ogden > > It appears so for at least the Bonnie++ benchmark. I would really try to benchmark > your actual DB on both EXT4 and XFS because some of the comparative benchmarks between > the two give the win to EXT4 for INSERT/UPDATE database usage with PostgreSQL. Only > your application will know for sure....:) > > Ken What are some good methods that one can use to benchmark PostgreSQL under heavy loads? Ie. to emulate heavy writes? Are thereany existing scripts and what not? Thank you Afra
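For a reasonably standard write-heavy load generator, pgbench (shipped with PostgreSQL) is the usual starting point; the database name, scale factor, client count and duration below are placeholder assumptions:

createdb pgbench_test
pgbench -i -s 1000 pgbench_test       # initialize roughly 15GB of TPC-B-style tables
pgbench -c 32 -T 600 pgbench_test     # 32 concurrent clients for 10 minutes, mostly small UPDATEs/INSERTs

Replaying your own application's queries is still the better test, as Ken says, but pgbench gives a repeatable baseline for comparing ext4 and XFS on the same box.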
On 08/17/2011 02:26 PM, Ogden wrote: > I am using bonnie++ to benchmark our current Postgres system (on RAID > 5) with the new one we have, which I have configured with RAID 10. The > drives are the same (SAS 15K). I tried the new system with ext3 and > then XFS but the results seem really outrageous as compared to the > current system, or am I reading things wrong? > > The benchmark results are here: > http://malekkoheavyindustry.com/benchmark.html Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described. The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance- > owner@postgresql.org] On Behalf Of Greg Smith > Sent: Wednesday, August 17, 2011 3:18 PM > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] Raid 5 vs Raid 10 Benchmarks Using bonnie++ > > On 08/17/2011 02:26 PM, Ogden wrote: > > I am using bonnie++ to benchmark our current Postgres system (on RAID > > 5) with the new one we have, which I have configured with RAID 10. > The > > drives are the same (SAS 15K). I tried the new system with ext3 and > > then XFS but the results seem really outrageous as compared to the > > current system, or am I reading things wrong? > > > > The benchmark results are here: > > http://malekkoheavyindustry.com/benchmark.html > > Congratulations--you're now qualified to be a member of the "RAID5 > sucks" club. You can find other members at > http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and > just terrible write ones are expected if that's on your old hardware. > Your new results are what I would expect from the hardware you've > described. > > The only thing that looks weird are your ext4 "Sequential Output - > Block" results. They should be between the ext3 and the XFS results, > not far lower than either. Normally this only comes from using a bad > set of mount options. With a battery-backed write cache, you'd want to > use "nobarrier" for example; if you didn't do that, that can crush > output rates. > To clarify maybe for those new at using non-default mount options. With XFS the mount option is nobarrier. With ext4 I think it is barrier=0 Someone please correct me if I am misleading people or otherwise mistaken. -mark
On 08/17/2011 08:35 PM, mark wrote: > With XFS the mount option is nobarrier. With ext4 I think it is barrier=0 http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt ext4 supports both; "nobarrier" and "barrier=0" mean the same thing. I tend to use "nobarrier" just because I'm used to that name on XFS systems. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Aug 17, 2011, at 4:16 PM, Greg Smith wrote: > On 08/17/2011 02:26 PM, Ogden wrote: >> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong? >> >> The benchmark results are here: >> http://malekkoheavyindustry.com/benchmark.html >> > > Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described. > > The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates. Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing and good enough to switch to nobarrier? Just worried that if there is a sudden power shutdown, data can be lost with this option. I did not do that with XFS and it did quite well - I know it's up to my app and more testing, but in your experience, what is usually a good filesystem to use? I keep reading conflicting things... Thank you Ogden
On 18/08/2011 11:48 AM, Ogden wrote: > Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing and goodenough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. > > Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card's battery backed cache? Why would it force a RAID controller cache flush to disk, too? -- Craig Ringer
On 18/08/11 17:35, Craig Ringer wrote: > On 18/08/2011 11:48 AM, Ogden wrote: >> Isn't this very dangerous? I have the Dell PERC H700 card - I see >> that it has 512Mb Cache. Is this the same thing and good enough to >> switch to nobarrier? Just worried if a sudden power shut down, then >> data can be lost on this option. >> >> > Yeah, I'm confused by that too. Shouldn't a write barrier flush data > to persistent storage - in this case, the RAID card's battery backed > cache? Why would it force a RAID controller cache flush to disk, too? > > If the card's cache has a battery, then the cache is preserved in the event of crash/power loss etc. - provided it has enough charge - so setting the 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar with (LSI Megaraid rebranded models) all switch to write-through mode if they detect the battery is dangerously discharged, so this is not normally a problem (but commit/fsync performance will fall off a cliff when this happens)! Cheers Mark
On Thu, Aug 18, 2011 at 1:35 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: > On 18/08/2011 11:48 AM, Ogden wrote: >> >> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it >> has 512Mb Cache. Is this the same thing and good enough to switch to >> nobarrier? Just worried if a sudden power shut down, then data can be lost >> on this option. >> >> > Yeah, I'm confused by that too. Shouldn't a write barrier flush data to > persistent storage - in this case, the RAID card's battery backed cache? Why > would it force a RAID controller cache flush to disk, too? The "barrier" is the linux fs/block way of saying "these writes need to be on persistent media before I can depend on them". On typical spinning media disks, that means out of the disk cache (which is not persistent) and on platters. The way it assures that the writes are on "persistent media" is with a "flush cache" type of command. The "flush cache" is a close approximation to "make sure it's persistent". If your cache is battery backed, it is now persistent, and there is no need to "flush cache", hence the nobarrier option if you believe your cache is persistent. Now, make sure that even though your raid cache is persistent, your disks have cache in write-through mode, cause it would suck for your raid cache to "work", but believe the data is safely on disk and only find out that it was in the disks' (small) cache, and your raid is out of sync after an outage because of that... I believe most raid cards will handle that correctly for you automatically. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: > On 18/08/11 17:35, Craig Ringer wrote: >> On 18/08/2011 11:48 AM, Ogden wrote: >>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing andgood enough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. >>> >>> >> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card'sbattery backed cache? Why would it force a RAID controller cache flush to disk, too? >> >> > > If the card's cache has a battery, then the cache is preserved in the advent of crash/power loss etc - provided it hasenough charge, so setting 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar (LSI Megaraidrebranded models) all switch to write-though mode if they detect the battery is dangerously discharged so this isnot normally a problem (but commit/fsync performance will fall off a cliff when this happens)! > > Cheers > > Mark So a setting such as this: Device Name : /dev/sdb Type : SAS Read Policy : No Read Ahead Write Policy : Write Back Cache Policy : Not Applicable Stripe Element Size : 64 KB Disk Cache Policy : Enabled Is sufficient to enable nobarrier then with these settings? Thank you Ogden
On Aug 17, 2011, at 4:17 PM, Greg Smith wrote:
> On 08/17/2011 02:26 PM, Ogden wrote:
>> I am using bonnie++ to benchmark our current Postgres system (on RAID 5) with the new one we have, which I have configured with RAID 10. The drives are the same (SAS 15K). I tried the new system with ext3 and then XFS but the results seem really outrageous as compared to the current system, or am I reading things wrong?
>>
>> The benchmark results are here:
>>
>> http://malekkoheavyindustry.com/benchmark.html
>
> Congratulations--you're now qualified to be a member of the "RAID5 sucks" club. You can find other members at http://www.miracleas.com/BAARF/BAARF2.html Reasonable read speeds and just terrible write ones are expected if that's on your old hardware. Your new results are what I would expect from the hardware you've described.
>
> The only thing that looks weird are your ext4 "Sequential Output - Block" results. They should be between the ext3 and the XFS results, not far lower than either. Normally this only comes from using a bad set of mount options. With a battery-backed write cache, you'd want to use "nobarrier" for example; if you didn't do that, that can crush output rates.
I have mounted the ext4 system with the nobarrier option:
/dev/sdb1 on /var/lib/pgsql type ext4 (rw,noatime,data=writeback,barrier=0,nobh,errors=remount-ro)
Yet the results still show a decrease in performance in the ext4 "Sequential Output - Block" results:

However, the Random Seeks figure is better, even better than XFS...
Any thoughts as to why this is occurring?
Ogden
On 19/08/11 02:09, Ogden wrote: > On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: > >> On 18/08/11 17:35, Craig Ringer wrote: >>> On 18/08/2011 11:48 AM, Ogden wrote: >>>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it has 512Mb Cache. Is this the same thing andgood enough to switch to nobarrier? Just worried if a sudden power shut down, then data can be lost on this option. >>>> >>>> >>> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to persistent storage - in this case, the RAID card'sbattery backed cache? Why would it force a RAID controller cache flush to disk, too? >>> >>> >> If the card's cache has a battery, then the cache is preserved in the advent of crash/power loss etc - provided it hasenough charge, so setting 'writeback' property on arrays is safe. The PERC/SERVERRAID cards I'm familiar (LSI Megaraidrebranded models) all switch to write-though mode if they detect the battery is dangerously discharged so this isnot normally a problem (but commit/fsync performance will fall off a cliff when this happens)! >> >> Cheers >> >> Mark > > So a setting such as this: > > Device Name : /dev/sdb > Type : SAS > Read Policy : No Read Ahead > Write Policy : Write Back > Cache Policy : Not Applicable > Stripe Element Size : 64 KB > Disk Cache Policy : Enabled > > > Is sufficient to enable nobarrier then with these settings? > Hmm - that output looks different from the cards I'm familiar with. I'd want to see the manual entries for "Cache Policy=Not Applicable" and "Disk Cache Policy=Enabled" to understand what the settings actually mean. Assuming "Disk Cache Policy=Enabled" means what I think it does (i.e writes are cached in the physical drives cache), this setting seems wrong if your card has on board cache + battery, you would want to only cache 'em in the *card's* cache (too many caches to keep straight in one's head, lol). Cheers Mark
On 19/08/11 12:52, Mark Kirkwood wrote: > On 19/08/11 02:09, Ogden wrote: >> On Aug 18, 2011, at 2:07 AM, Mark Kirkwood wrote: >> >>> On 18/08/11 17:35, Craig Ringer wrote: >>>> On 18/08/2011 11:48 AM, Ogden wrote: >>>>> Isn't this very dangerous? I have the Dell PERC H700 card - I see >>>>> that it has 512Mb Cache. Is this the same thing and good enough to >>>>> switch to nobarrier? Just worried if a sudden power shut down, >>>>> then data can be lost on this option. >>>>> >>>>> >>>> Yeah, I'm confused by that too. Shouldn't a write barrier flush >>>> data to persistent storage - in this case, the RAID card's battery >>>> backed cache? Why would it force a RAID controller cache flush to >>>> disk, too? >>>> >>>> >>> If the card's cache has a battery, then the cache is preserved in >>> the advent of crash/power loss etc - provided it has enough charge, >>> so setting 'writeback' property on arrays is safe. The >>> PERC/SERVERRAID cards I'm familiar (LSI Megaraid rebranded models) >>> all switch to write-though mode if they detect the battery is >>> dangerously discharged so this is not normally a problem (but >>> commit/fsync performance will fall off a cliff when this happens)! >>> >>> Cheers >>> >>> Mark >> >> So a setting such as this: >> >> Device Name : /dev/sdb >> Type : SAS >> Read Policy : No Read Ahead >> Write Policy : Write Back >> Cache Policy : Not Applicable >> Stripe Element Size : 64 KB >> Disk Cache Policy : Enabled >> >> >> Is sufficient to enable nobarrier then with these settings? >> > > > Hmm - that output looks different from the cards I'm familiar with. > I'd want to see the manual entries for "Cache Policy=Not Applicable" > and "Disk Cache Policy=Enabled" to understand what the settings > actually mean. Assuming "Disk Cache Policy=Enabled" means what I think > it does (i.e writes are cached in the physical drives cache), this > setting seems wrong if your card has on board cache + battery, you > would want to only cache 'em in the *card's* cache (too many caches > to keep straight in one's head, lol). > FWIW - here's what our ServerRaid (M5015) output looks like for a RAID 1 array configured with writeback, reads not cached on the card's memory, physical disk caches disabled: $ MegaCli64 -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 67.054 GB State : Optimal Strip Size : 64 KB Number Of Drives : 2 Span Depth : 1 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU Access Policy : Read/Write Disk Cache Policy : Disabled Encryption Type : None
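On the LSI-based cards being discussed, the battery and disk-cache state can be checked from the same tool; the exact flags vary a bit between MegaCli versions and rebrands, so treat these as approximate:

MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL       # battery state, charge, and whether a relearn cycle is running
MegaCli64 -LDGetProp -DskCache -LAll -aALL     # shows whether the physical drives' own write caches are on
MegaCli64 -LDSetProp -DisDskCache -LAll -aALL  # turns the drive caches off, leaving only the BBU-protected cache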
apologies for such a late response to this thread, but there is domething I think is _really_ dangerous here. On Thu, 18 Aug 2011, Aidan Van Dyk wrote: > On Thu, Aug 18, 2011 at 1:35 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: >> On 18/08/2011 11:48 AM, Ogden wrote: >>> >>> Isn't this very dangerous? I have the Dell PERC H700 card - I see that it >>> has 512Mb Cache. Is this the same thing and good enough to switch to >>> nobarrier? Just worried if a sudden power shut down, then data can be lost >>> on this option. >>> >>> >> Yeah, I'm confused by that too. Shouldn't a write barrier flush data to >> persistent storage - in this case, the RAID card's battery backed cache? Why >> would it force a RAID controller cache flush to disk, too? > > The "barrier" is the linux fs/block way of saying "these writes need > to be on persistent media before I can depend on them". On typical > spinning media disks, that means out of the disk cache (which is not > persistent) and on platters. The way it assures that the writes are > on "persistant media" is with a "flush cache" type of command. The > "flush cache" is a close approximation to "make sure it's persistent". > > If your cache is battery backed, it is now persistent, and there is no > need to "flush cache", hence the nobarrier option if you believe your > cache is persistent. > > Now, make sure that even though your raid cache is persistent, your > disks have cache in write-through mode, cause it would suck for your > raid cache to "work", but believe the data is safely on disk and only > find out that it was in the disks (small) cache, and you're raid is > out of sync after an outage because of that... I believe most raid > cards will handle that correctly for you automatically. if you don't have barriers enabled, the data may not get written out of main memory to the battery backed memory on the card as the OS has no reason to do the write out of the OS buffers now rather than later. Every raid card I have seen has ignored the 'flush cache' type of command if it has a battery and that battery is good, so you leave the barriers enabled and the card still gives you great performance. David Lang
On Mon, Sep 12, 2011 at 6:57 PM, <david@lang.hm> wrote:

>> The "barrier" is the Linux fs/block way of saying "these writes need
>> to be on persistent media before I can depend on them". On typical
>> spinning-media disks, that means out of the disk cache (which is not
>> persistent) and onto the platters. The way it assures that the writes are
>> on "persistent media" is with a "flush cache" type of command. The
>> "flush cache" is a close approximation to "make sure it's persistent".
>>
>> If your cache is battery backed, it is now persistent, and there is no
>> need to "flush cache", hence the nobarrier option if you believe your
>> cache is persistent.
>>
>> Now, make sure that even though your RAID cache is persistent, your
>> disks have their caches in write-through mode, because it would suck for your
>> RAID cache to "work" but believe the data is safely on disk, only to
>> find out that it was in the disks' (small) caches and your RAID is
>> out of sync after an outage because of that... I believe most RAID
>> cards will handle that correctly for you automatically.
>
> If you don't have barriers enabled, the data may not get written out of main
> memory to the battery-backed memory on the card, as the OS has no reason to
> do the write out of the OS buffers now rather than later.

It's not quite so simple. The "sync" calls (pick your flavour) are what tell the OS buffers they have to go out. The syscall (on a working FS) won't return until the write and its data have reached the "device" safely and are considered persistent.

But in Linux, a barrier is actually a "synchronization" point, not just a "flush cache"... It's a "guarantee everything up to now is persistent, I'm going to start counting on it". But depending on your card, drivers and, yes, kernel version, that "barrier" is sometimes a "drain/block I/O queue, issue cache flush, wait, write specific data, flush, wait, open I/O queue". The double flush is because it needs to guarantee everything previous is good before it writes the "critical" piece, and then needs to guarantee that too.

Now, on good RAID hardware it's not usually that bad.

And then, just to confuse people more, LVM up until 2.6.29 (so that includes all those RHEL5/CentOS5 installs out there which default to using LVM) didn't handle barriers; it just sort of threw them out as it came across them, meaning that you got the performance of nobarrier even if you thought you were using barriers on poor RAID hardware.

> Every RAID card I have seen ignores the 'flush cache' type of command if
> it has a battery and that battery is good, so you can leave barriers enabled
> and the card still gives you great performance.

The XFS FAQ goes over much of it, starting at Q24:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

So, for pure performance, on a battery-backed controller, nobarrier is the recommended *performance* setting.

But, to throw a wrench into the plan: what happens when, during normal battery tests, your RAID controller decides the battery is failing? Of course it's going to start screaming and set off all your monitoring alarms (you're monitoring that, right?), but have you thought to make sure that your FS is remounted with barriers at the first sign of battery trouble?

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
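(To make that last point concrete, a minimal sketch of the failsafe. The fstab line, mount point, device name, and the MegaCli output string it greps for are all assumptions - they vary by card, firmware and distribution - so treat it as an outline to bolt onto whatever monitoring is already in place:)

#!/bin/sh
# Cron'd watchdog sketch: fall back to barriers the moment the controller
# battery looks unhealthy. While the BBU is good, the data array can stay
# mounted nobarrier, e.g. an fstab line like
#   /dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
# (device name illustrative).

MOUNTPOINT=/var/lib/pgsql

# "Battery State: Optimal" is typical MegaCli output; adjust the grep to
# whatever your card/OpenManage version actually prints.
if ! MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep -q 'Battery State.*Optimal'
then
    logger "RAID BBU not optimal - remounting $MOUNTPOINT with barriers"
    mount -o remount,barrier "$MOUNTPOINT"
fi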
On Mon, 12 Sep 2011, Aidan Van Dyk wrote:

> On Mon, Sep 12, 2011 at 6:57 PM, <david@lang.hm> wrote:
>
>>> The "barrier" is the Linux fs/block way of saying "these writes need
>>> to be on persistent media before I can depend on them". On typical
>>> spinning-media disks, that means out of the disk cache (which is not
>>> persistent) and onto the platters. The way it assures that the writes are
>>> on "persistent media" is with a "flush cache" type of command. The
>>> "flush cache" is a close approximation to "make sure it's persistent".
>>>
>>> If your cache is battery backed, it is now persistent, and there is no
>>> need to "flush cache", hence the nobarrier option if you believe your
>>> cache is persistent.
>>>
>>> Now, make sure that even though your RAID cache is persistent, your
>>> disks have their caches in write-through mode, because it would suck for your
>>> RAID cache to "work" but believe the data is safely on disk, only to
>>> find out that it was in the disks' (small) caches and your RAID is
>>> out of sync after an outage because of that... I believe most RAID
>>> cards will handle that correctly for you automatically.
>>
>> If you don't have barriers enabled, the data may not get written out of main
>> memory to the battery-backed memory on the card, as the OS has no reason to
>> do the write out of the OS buffers now rather than later.
>
> It's not quite so simple. The "sync" calls (pick your flavour) are
> what tell the OS buffers they have to go out. The syscall (on a
> working FS) won't return until the write and its data have reached the
> "device" safely and are considered persistent.
>
> But in Linux, a barrier is actually a "synchronization" point, not
> just a "flush cache"... It's a "guarantee everything up to now is
> persistent, I'm going to start counting on it". But depending on your
> card, drivers and, yes, kernel version, that "barrier" is sometimes a
> "drain/block I/O queue, issue cache flush, wait, write specific data,
> flush, wait, open I/O queue". The double flush is because it needs to
> guarantee everything previous is good before it writes the "critical"
> piece, and then needs to guarantee that too.
>
> Now, on good RAID hardware it's not usually that bad.
>
> And then, just to confuse people more, LVM up until 2.6.29 (so that
> includes all those RHEL5/CentOS5 installs out there which default to
> using LVM) didn't handle barriers; it just sort of threw them out as
> it came across them, meaning that you got the performance of
> nobarrier even if you thought you were using barriers on poor RAID
> hardware.

This is part of the problem. If you have a simple fs-on-hardware stack you may be able to get away with barriers, but if you have an fs-on-x-on-y-on-hardware type of thing (specifically where LVM is one of the things in the middle), and those things in the middle do not honor barriers, the fsync becomes meaningless: without propagating the barrier down the stack, the writes that the fsync triggers may not get to the disk.

>> Every RAID card I have seen ignores the 'flush cache' type of command if
>> it has a battery and that battery is good, so you can leave barriers enabled
>> and the card still gives you great performance.
>
> The XFS FAQ goes over much of it, starting at Q24:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>
> So, for pure performance, on a battery-backed controller, nobarrier is
> the recommended *performance* setting.
>
> But, to throw a wrench into the plan: what happens when, during normal
> battery tests, your RAID controller decides the battery is failing? Of
> course it's going to start screaming and set off all your monitoring
> alarms (you're monitoring that, right?), but have you thought to make
> sure that your FS is remounted with barriers at the first sign of
> battery trouble?

Yep.

On a good RAID card with battery-backed cache, the performance difference between barriers on and barriers off should be minimal. If it's not, I think you have something else going on.

David Lang
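(That "should be minimal" claim is easy to check directly, and the same exercise catches the stacking problem mentioned above: if barrier and nobarrier perform identically on hardware with no - or a failed - write cache, that's a hint barriers are being dropped somewhere in the stack, e.g. the pre-2.6.29 LVM case. A rough sketch; mount point, file names and the dd test are illustrative:)

# Stack sanity checks first: barriers only help if every layer passes them
# down (pre-2.6.29 device-mapper/LVM did not), and many kernels log a
# message when they have to disable barriers on a device.
$ uname -r
$ dmesg | grep -i barrier

# Rough before/after comparison of synchronous 8kB writes:
$ mount -o remount,barrier /var/lib/pgsql
$ dd if=/dev/zero of=/var/lib/pgsql/ddtest bs=8k count=5000 oflag=dsync

$ mount -o remount,nobarrier /var/lib/pgsql
$ dd if=/dev/zero of=/var/lib/pgsql/ddtest bs=8k count=5000 oflag=dsync
# With a healthy BBU writeback cache the two rates should be close;
# a big gap means the flushes really are hitting the platters.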
On Mon, Sep 12, 2011 at 8:47 PM, <david@lang.hm> wrote:

>> The XFS FAQ goes over much of it, starting at Q24:
>> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
>>
>> So, for pure performance, on a battery-backed controller, nobarrier is
>> the recommended *performance* setting.
>>
>> But, to throw a wrench into the plan: what happens when, during normal
>> battery tests, your RAID controller decides the battery is failing? Of
>> course it's going to start screaming and set off all your monitoring
>> alarms (you're monitoring that, right?), but have you thought to make
>> sure that your FS is remounted with barriers at the first sign of
>> battery trouble?
>
> Yep.
>
> On a good RAID card with battery-backed cache, the performance difference
> between barriers on and barriers off should be minimal. If it's not, I
> think you have something else going on.

The performance boost you'll get is that you don't have the temporary stall in parallelization that the barriers cause. With barriers, even if the controller cache doesn't really flush, you still have the "can't send more writes to the device until the barriered write is done" behaviour, so at all those points you have only a single write command in flight.

The performance penalty of barriers on good cards comes about because barriers are designed to prevent the devices from reordering write persistence, and they do that by waiting for a write to be "persistent" before allowing more to be queued to the device.

With nobarrier, you operate under the assumption that the block device writes are persisted in the order the commands are issued to the devices, so you never have to "drain the queue" as you do in the normal barrier implementation, and can (in theory) always have more requests that the RAID card can be working on processing, reordering, and dispatching to the platters for the maximum theoretical throughput...

Of course, Linux has completely rewritten/changed the sync/barrier/flush methods over the past few years, and there is no guarantee they won't keep changing the implementation details in the future, so keep up on the filesystem details of whatever you're using...

So keep doing burn-ins, with real pull-the-cord tests... They can't "prove" it's 100% safe, but they can quickly prove when it's not ;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
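(For the pull-the-cord tests, one widely used harness is Brad Fitzpatrick's diskchecker.pl: the machine under test streams acknowledged writes while a second machine records them, so after the power pull it can report exactly which acknowledged writes never made it to disk. The invocation below is a sketch from memory - check the script's own usage text - and the host and file names are placeholders:)

# On a second machine that will stay powered on:
$ perl diskchecker.pl -l

# On the server under test, writing to the array in question:
$ perl diskchecker.pl -s otherhost create /var/lib/pgsql/diskcheck.test 500
# ... while that is running, literally pull the power cord; then boot up and:
$ perl diskchecker.pl -s otherhost verify /var/lib/pgsql/diskcheck.test
# Any blocks reported lost mean something in the stack acknowledged writes
# it had not actually made persistent.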