Thread: performance on new linux box
PostgreSQL was previously running on a single-CPU Linux machine with 2 gigs of memory and a single SATA drive (v8.3). Basically a desktop with Linux on it. I experienced slow performance.
So, I finally moved it to a real server: a dual-Xeon CentOS machine with 6 gigs of memory and RAID 10, Postgres 8.4. But I am now experiencing even worse performance issues.
My system is consistently highly transactional. However, there are also regular complex queries and occasional bulk loads.
On the new system the bulk loads are dramatically slower than on the previous machine, and so are the more complex queries. The smaller transactional queries seem comparable, but I had expected an improvement. Performing a db import via psql -d databas -f dbfile illustrates this problem. It takes 5 hours to run this import. By contrast, if I perform this same exact import on my crappy Windows box with only 2 gigs of memory and default Postgres settings, it takes 1 hour. Same deal with the old Linux machine. How is this possible?
Here are some of my key config settings:
max_connections = 100
shared_buffers = 768MB
effective_cache_size = 2560MB
work_mem = 16MB
maintenance_work_mem = 128MB
checkpoint_segments = 7
checkpoint_timeout = 7min
checkpoint_completion_target = 0.5
I have tried varying the shared_buffers size from 128MB all the way to 1500MB and got basically the same result. Is there a setting change I should be considering?
Does 8.4 have performance problems or is this unique to me?
thanks
Ryan Wexler <ryan@iridiumsuite.com> writes:
> PostgreSQL was previously running on a single-CPU Linux machine with 2 gigs
> of memory and a single SATA drive (v8.3). Basically a desktop with Linux on
> it. I experienced slow performance.
>
> So, I finally moved it to a real server: a dual-Xeon CentOS machine with
> 6 gigs of memory and RAID 10, Postgres 8.4. But I am now experiencing even
> worse performance issues.

I'm wondering if you moved to a kernel+filesystem version that actually enforces fsync, from one that didn't. If so, the apparently faster performance on the old box was being obtained at the cost of (lack of) crash safety. That probably goes double for your windows-box comparison point.

You could try test_fsync from the Postgres sources to confirm that theory, or do some pgbench benchmarking to have more quantifiable numbers. See past discussions about write barriers in this list's archives for more detail.

			regards, tom lane
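(For concreteness, a minimal pgbench comparison along the lines Tom suggests; the scale factor, client count, duration, and database name "bench" here are illustrative, not from the thread. Run the same pair of commands on each box and compare the reported tps:)

pgbench -i -s 10 bench       # initialize a small test database named "bench"
pgbench -c 4 -T 60 bench     # 4 clients for 60 seconds; reports transactions per second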
On Wed, Jul 7, 2010 at 4:06 PM, Ryan Wexler <ryan@iridiumsuite.com> wrote:
> So, I finally moved it to a real server: a dual-Xeon CentOS machine with
> 6 gigs of memory and RAID 10, Postgres 8.4. But I am now experiencing even
> worse performance issues.

I think the most likely explanation is that the crappy box lied about fsync'ing data and your server is not. Did you purchase a raid card with a bbu? If so, can you set the write cache policy to write-back?

--
Rob Wultsch
wultsch@gmail.com
On 07/07/2010 06:06 PM, Ryan Wexler wrote:
> On the new system the bulk loads are dramatically slower than on the
> previous machine, and so are the more complex queries.

Yeah, I inherited a "server" (the quotes are sarcastic air quotes), with really bad disk IO... er.. really safe disk IO. Try the dd test. On my desktop I get 60-70 meg a second. On this "server" (I laugh) I got about 20. I had to go out of my way (way out) to enable the disk caching, and even then only got 50 meg a second.

http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm

-Andy
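(Roughly the dd test that page describes, sketched here with an illustrative path; make the file about twice RAM -- ~12GB on this 6GB box -- so the OS cache can't hide the disks:)

time dd if=/dev/zero of=/path/on/array/ddtest bs=8k count=1572864    # write ~12GB
time dd if=/path/on/array/ddtest of=/dev/null bs=8k                  # then read it back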
> On the new system the bulk loads are dramatically slower than on the
> previous machine, and so are the more complex queries. The smaller
> transactional queries seem comparable, but I had expected an improvement.
> Performing a db import via psql -d databas -f dbfile illustrates this
> problem.

If you use psql (not pg_restore) and your file contains no BEGIN/COMMIT statements, you're probably doing 1 transaction per SQL command. As the others say, if the old box lied about fsync, and the new one doesn't, performance will suffer greatly. If this is the case, remember to do your imports the proper way: either use pg_restore, or group inserts in a transaction, and build indexes in parallel.
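(If the dump file has no BEGIN/COMMIT of its own, psql can wrap the whole run in one transaction; the -1/--single-transaction flag is present in the psql shipped with 8.4:)

psql -d databas -1 -f dbfile    # one COMMIT for the whole file instead of one per statement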
On Wed, Jul 7, 2010 at 10:07 PM, Andy Colson <andy@squeakycode.net> wrote:
For about $2k - $3k, you can get a server that will do upwards of 300 MB/sec, assuming the bulk of that cost goes to a good hardware-based RAID controller with a battery backed-up cache and some good 15k RPM SAS drives. Since it sounds like you are disk I/O bound, it's probably not worth it for you to spend extra on CPU and memory. Sink the money into the disk array instead. If you have an extra $4k in your budget, you might even try 4 of these in a RAID 10:
http://www.provantage.com/ocz-technology-oczssd2-2vtxex100g~7OCZT0L9.htm
--
Eliot Gable
Eliot Gable <egable+pgsql-performance@gmail.com> wrote:
> For about $2k - $3k, you can get a server that will do upwards of
> 300 MB/sec, assuming the bulk of that cost goes to a good
> hardware-based RAID controller with a battery backed-up cache and
> some good 15k RPM SAS drives.

FWIW, I concur that the description so far suggests that this server either doesn't have a good RAID controller card with battery backed-up (BBU) cache, or that it isn't configured properly.

-Kevin
On Thu, Jul 8, 2010 at 9:53 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> FWIW, I concur that the description so far suggests that this server
> either doesn't have a good RAID controller card with battery backed-up
> (BBU) cache, or that it isn't configured properly.
On another note, it is also entirely possible that just re-writing your queries will completely solve your problem and make your performance bottleneck go away. Sometimes throwing hardware at a problem is not the best (or cheapest) solution. Personally, I would never throw hardware at a problem until I am certain that I have everything else optimized as much as possible. One of the stored procedures I recently wrote in pl/pgsql was originally chewing up my entire development box's processing capabilities at just 20 transactions per second. It's a pretty wimpy box, so I was not really expecting a lot out of it. However, after spending several weeks optimizing my queries, I now have it doing twice as much work at 120 transactions per second on the same box. So, if I had thrown hardware at the problem, I would have spent 12 times more on hardware than I need to spend now for the same level of performance.
If you can post some of your queries, there are a lot of bright people on this discussion list that can probably help you solve your bottleneck without spending a ton of money on new hardware. Obviously, there is no guarantee -- you might already be as optimized as you can get in your queries, but I doubt it. Even after spending months tweaking my queries, I am still finding things here and there where I can get a bit more performance out of them.
--
Eliot Gable
Eliot Gable <egable+pgsql-performance@gmail.com> wrote:
> If you can post some of your queries, there are a lot of bright
> people on this discussion list that can probably help you solve
> your bottleneck

Sure, but the original post was because the brand new server class machine was performing much worse than the single-drive desktop machine *on the same queries*, which seems like an issue worthy of investigation independently of what you suggest.

-Kevin
Thanks a lot for all the comments. The fact that both my Windows box and the old Linux box show a massive performance improvement over the new Linux box seems to point to hardware to me. I am not sure how to test the fsync issue, but I don't see how that could be it.
The raid card the server has in it is:
3Ware 4 Port 9650SE-4LPML RAID Card
Looking it up, it seems to indicate that it has BBU
The only other difference between the boxes is the postgresql version. The new one has 8.4-2 from the yum install instructions on the site:
http://yum.pgrpms.org/reporpms/repoview/pgdg-centos.html
Any more thoughts?
On Thu, 2010-07-08 at 09:31 -0700, Ryan Wexler wrote:
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

No. It supports a BBU. It doesn't have one necessarily. You need to go into your RAID BIOS. It will tell you.

Sincerely,

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
On Thu, Jul 08, 2010 at 09:31:32AM -0700, Ryan Wexler wrote:
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

By "looking it up", I assume you mean running tw_cli and looking at the output to make sure the bbu is enabled and the cache is turned on for the raid array u0 or u1 ...?

--
-- rouilj
John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell) 603-643-9300 x 111
On 7/8/10 9:31 AM, Ryan Wexler wrote:
> The raid card the server has in it is:
> 3Ware 4 Port 9650SE-4LPML RAID Card
>
> Looking it up, it seems to indicate that it has BBU

Make sure the battery isn't dead. Most RAID controllers drop to non-BBU speeds if they detect that the battery is faulty.

Craig
On Thu, Jul 8, 2010 at 10:10 AM, Craig James <craig_james@emolecules.com> wrote:
> Make sure the battery isn't dead. Most RAID controllers drop to non-BBU
> speeds if they detect that the battery is faulty.

Thanks. The server is hosted, so it is a bit of a hassle to figure this stuff out, but I am having someone check.
Thursday, July 8, 2010, 7:16:47 PM you wrote:
> Thanks. The server is hosted, so it is a bit of a hassle to figure this
> stuff out, but I am having someone check.

If you have root access to the machine, you should try 'tw_cli /cx show', where the x in /cx is the controller number. If not present on the machine, the command-line tools are available from 3ware in their download section.

You should get an output showing something like this:

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    202    01-Jan-1970

Don't ask why the 'LastCapTest' does not show a valid value, the bbu here completed the test successfully.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164
On Thu, Jul 8, 2010 at 12:13 PM, Jochen Erwied <jochen@pgsql-performance.erwied.eu> wrote:
> If you have root access to the machine, you should try 'tw_cli /cx show',
> where the x in /cx is the controller number.
The tw_cli package doesn't appear to be installed. I will try to hunt it down.
However, I just verified with the hosting company that BBU is off on the raid controller. I am trying to find out my options, turn it on, different card, etc...
Ryan Wexler <ryan@iridiumsuite.com> wrote:
> I just verified with the hosting company that BBU is off on the
> raid controller. I am trying to find out my options, turn it on,
> different card, etc...

In the "etc." category, make sure that when you get it turned on, the cache is configured for "write back" mode, not "write through" mode. Ideally (if you can't afford to lose the data), it will be configured to degrade to "write through" if the battery fails.

-Kevin
Thursday, July 8, 2010, 9:18:20 PM you wrote:
> However, I just verified with the hosting company that BBU is off on the
> raid controller. I am trying to find out my options, turn it on, different
> card, etc...

Turning it on requires the external BBU to be installed, so even if a 9650 has BBU support, it requires the hardware on a pluggable card. And even if the BBU is present, it has to pass the self-test once before you are able to turn on write caching.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164
On Thu, Jul 8, 2010 at 12:32 PM, Jochen Erwied <jochen@pgsql-performance.erwied.eu> wrote:
> Turning it on requires the external BBU to be installed, so even if a 9650
> has BBU support, it requires the hardware on a pluggable card.
One thing I don't understand is why BBU will result in a huge performance gain. I thought BBU was all about power failures?
On Jul 8, 2010, at 12:37 PM, Ryan Wexler wrote:
> One thing I don't understand is why BBU will result in a huge
> performance gain. I thought BBU was all about power failures?

When you have a working BBU, the raid card can safely do write caching. Without it, many raid cards are good about turning off write caching on the disks and refusing to do it themselves. (Safety over performance.)
Ryan Wexler <ryan@iridiumsuite.com> wrote:
> One thing I don't understand is why BBU will result in a huge
> performance gain. I thought BBU was all about power failures?

Well, it makes it safe for the controller to consider the write complete as soon as it hits the RAM cache, rather than waiting for persistence to the disk itself. It can then schedule the writes in a manner which is efficient based on the physical medium.

Something like this was probably happening on your non-server machines, but without BBU it was not actually safe. Server class machines tend to be more conservative about not losing your data, but without a RAID controller with BBU cache, that slows writes down to the speed of the rotating disks.

-Kevin
On Thu, Jul 8, 2010 at 12:46 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Well, it makes it safe for the controller to consider the write
> complete as soon as it hits the RAM cache, rather than waiting for
> persistence to the disk itself.

Thanks for the explanations, that makes things clearer. It still amazes me that it would account for a 5x change in IO.
On 7/8/2010 1:47 PM, Ryan Wexler wrote:
> Thanks for the explanations, that makes things clearer. It still
> amazes me that it would account for a 5x change in IO.

The buffering allows decoupling of the write rate from the disk rotation speed. Disks don't spin that fast, at least not relative to the speed the CPU is running at.
Ryan Wexler <ryan@iridiumsuite.com> wrote:
> It still amazes me that it would account for a 5x change in IO.

If you were doing one INSERT per database transaction, for instance, that would not be at all surprising. If you were doing one COPY in of a million rows, it would be a bit more surprising. Each COMMIT of a database transaction, without caching, requires that you wait for the disk to rotate around to the right position. Compared to the speed of RAM, that can take quite a long time.

With write caching, you might write quite a few adjacent disk sectors to the cache, which can then all be streamed to disk on one rotation. It can also do tricks like writing a bunch of sectors on one part of the disk before pulling the heads all the way over to another portion of the disk to write a bunch of sectors. It is very good for performance to cache writes.

-Kevin
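(As a back-of-the-envelope check on that: a 7200 RPM drive makes 7200 / 60 = 120 revolutions per second, so a commit that has to wait for the platter to come around again is capped at very roughly 120 synchronous commits per second per drive, regardless of CPU speed -- the right order of magnitude for the 5x gap being discussed.)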
On 7/8/10 12:47 PM, Ryan Wexler wrote:
> Thanks for the explanations, that makes things clearer. It still
> amazes me that it would account for a 5x change in IO.

It's not exactly a 5x change in I/O, rather it's a 5x change in *transactions*. Without a BBU Postgres has to wait for each transaction to be physically written to the disk, which at 7200 RPM (or 10K or 15K) means a few hundred per second. Most of the time Postgres is just sitting there waiting for the disk to say, "OK, I did it." With BBU, once the RAID card has the data, it's virtually guaranteed it will get to the disk even if the power fails, so the RAID controller says, "OK, I did it" even though the data is still in the controller's cache and not actually on the disk.

It means there's no tight relationship between the disk's rotational speed and your transaction rate.

Craig
On Thu, Jul 8, 2010 at 12:13 PM, Jochen Erwied <jochen@pgsql-performance.erwied.eu> wrote:
> If you have root access to the machine, you should try 'tw_cli /cx show',
> where the x in /cx is the controller number.
Here is what I got:
# ./tw_cli /c0 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-10 OK - - 64K 465.641 OFF ON
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 233.81 GB 490350672 WD-WCAT1F502612
p1 OK u0 233.81 GB 490350672 WD-WCAT1F472718
p2 OK u0 233.81 GB 490350672 WD-WCAT1F216268
p3 OK u0 233.81 GB 490350672 WD-WCAT1F216528
How does the linux machine know that there is a BBU installed and to change its behavior or change the behavior of Postgres? I am experiencing performance issues, not with searching but more with IO.
Thursday, July 8, 2010, 11:02:50 PM you wrote:
> Here is what I got:
> # ./tw_cli /c0 show

If that's all you get, then there's no BBU installed, or it's not correctly connected to the controller. You could try 'tw_cli /c0/bbu show all' to be sure, but I doubt your output will change.

--
Jochen Erwied     |   home: jochen@erwied.eu     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@vodafone.de       +49-173-5404164
On 7/8/10 2:18 PM, Timothy.Noonan@emc.com wrote:
> How does the linux machine know that there is a BBU installed and to
> change its behavior or change the behavior of Postgres? I am
> experiencing performance issues, not with searching but more with IO.

It doesn't. It trusts the disk controller. Linux says, "Flush your cache" and the controller says, "OK, it's flushed." In the case of a BBU controller, the controller can say that almost instantly because it's got the data in a battery-backed memory that will survive even if the power goes out. In the case of a non-BBU controller (RAID or non-RAID), the controller has to actually wait for the head to move to the right spot, then wait for the disk to spin around to the right sector, then write the data. Only then can it say, "OK, it's flushed."

So to Linux, it just appears to be a disk that's exceptionally fast at flushing its buffers.

Craig
On 7/8/2010 3:18 PM, Timothy.Noonan@emc.com wrote:
> How does the linux machine know that there is a BBU installed and to
> change its behavior or change the behavior of Postgres? I am
> experiencing performance issues, not with searching but more with IO.

It doesn't change its behavior at all. It's in the business of writing stuff to a file and waiting until that stuff has been put on the disk (it wants a durable write). What the write buffer/cache does is to inform the OS, and hence PG, that the write has been done when in fact it hasn't (yet). So the change in behavior is only to the extent that the application doesn't spend as much time waiting.
On 09/07/10 02:31, Ryan Wexler wrote:
> Any more thoughts?

Really dumb idea, but you don't happen to have the build of the RPMs that had debug enabled, do you? That resulted in a significant performance problem.
Regards
Russell
On Fri, Jul 9, 2010 at 2:08 AM, Russell Smith <mr-russ@pws.com.au> wrote:
> Really dumb idea, but you don't happen to have the build of the RPMs
> that had debug enabled, do you? That resulted in a significant
> performance problem.

The OP mentions that the new system underperforms on a straight dd test, so it isn't the database config or postgres build.
On Fri, Jul 9, 2010 at 2:38 AM, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> The OP mentions that the new system underperforms on a straight dd
> test, so it isn't the database config or postgres build.
Well I got me a new raid card, MegaRAID 8708EM2, fully equipped with BBU and read and write caching are enabled. It completely solved my performance problems. Now everything is way faster than the previous server. Thanks for all the help everyone.
One question I do have is this card has a setting called Read Policy which apparently helps with sequential reads. Do you think that is something I should enable?
Ryan Wexler wrote:
> One question I do have is this card has a setting called Read Policy
> which apparently helps with sequential reads. Do you think that is
> something I should enable?

Linux will do some amount of read-ahead in a similar way on its own. You can run "blockdev --getra" and "blockdev --setra" on each disk device on the system to see the settings and increase them. I've found tweaking there, where you can control exactly the amount of readahead, to be more effective than relying on the less tunable Read Policy modes in RAID cards that do something similar.

That said, it doesn't seem to hurt to use both on the LSI card you have; giving more information there to the controller for its use in optimizing how it caches things, by changing to the more aggressive Read Policy setting, hasn't ever degraded results significantly when I've tried.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
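(A sketch of the blockdev commands Greg mentions; the device name and value here are illustrative, and --setra is counted in 512-byte sectors:)

blockdev --getra /dev/sda         # show current readahead for the device
blockdev --setra 4096 /dev/sda    # raise it to 2MB (4096 x 512 bytes)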
On 07/11/2010 03:02 PM, Ryan Wexler wrote:
> One question I do have is this card has a setting called Read Policy
> which apparently helps with sequential reads. Do you think that is
> something I should enable?

I would think it depends on your usage. If you use clustered indexes (and understand how/when they help) then enabling it would help (cuz clustered is assuming sequential reads). Or if you seq scan a table, it might help (as long as the table is stored relatively close together).

But if you have a big db, that doesn't fit into cache, and you bounce all over the place doing seeks, I doubt it'll help.

-Andy
On Jul 8, 2010, at 2:42 PM, Craig James wrote:
> It doesn't. It trusts the disk controller. Linux says, "Flush your
> cache" and the controller says, "OK, it's flushed."

But none of this explains why a 4-disk raid 10 is slower than a 1 disk system. If there is no write-back caching on the RAID, it should still be similar to the one disk setup.

Unless that one-disk setup turned off fsync() or was configured with synchronous_commit off. Even low end laptop drives don't lie these days about a cache flush or sync() -- OS's/file systems can, and some SSD's do.

If loss of a transaction during a power failure is OK, then just turn synchronous_commit off and get the performance back. The discussion about transaction rate being limited by the disks is related to that, and it's not necessary _IF_ it's ok to lose a transaction if the power fails. For most applications, losing a transaction or two in a power failure is fine. Obviously, it's not with financial transactions or other such work.
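(The knob Scott refers to is a one-line change, available since 8.3, and can also be tried per-session before committing to it globally:)

# postgresql.conf -- COMMIT returns before the WAL is flushed to disk;
# a crash can lose the last few transactions but cannot corrupt the database
synchronous_commit = off

SET synchronous_commit = off;    -- or per-session, from psql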
On Jul 14, 2010, at 6:57 PM, Scott Carey wrote:
> But none of this explains why a 4-disk raid 10 is slower than a 1 disk
> system. If there is no write-back caching on the RAID, it should still
> be similar to the one disk setup.

Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.

Take away the controller, and most OS's by default enable the write cache on the drive. You can turn it off if you want, but if you know how to do that, then you're probably also the same kind of person that would have purchased a raid card with a BBU.
On Wed, Jul 14, 2010 at 6:57 PM, Scott Carey <scott@richrelevance.com> wrote:
> But none of this explains why a 4-disk raid 10 is slower than a 1 disk
> system. If there is no write-back caching on the RAID, it should still
> be similar to the one disk setup.
Something was clearly wrong with my former raid card. Frankly, I am not sure if it was configuration or simply hardware failure. The server is hosted so I only had so much access. But the card was swapped out with a new one and now performance is quite good. I am just trying to tune the new card now.
thanks for all the input
On Jul 14, 2010, at 7:50 PM, Ben Chobot wrote:
> Many raid controllers are smart enough to always turn off write caching
> on the drives, and also disable the feature on their own buffer without
> a BBU. Add a BBU, and the cache on the controller starts getting used,
> but *not* the cache on the drives.

This does not make sense. Write caching on all hard drives in the last decade is safe because they support a write cache flush command properly. If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or barrier request with no BBU.

> Take away the controller, and most OS's by default enable the write
> cache on the drive. You can turn it off if you want, but if you know how
> to do that, then you're probably also the same kind of person that would
> have purchased a raid card with a BBU.

Sure, or you can use an OS/file system combination that respects fsync(), which will call the drive's write cache flush. There are some issues with certain file systems and barriers for file system metadata, but for the WAL log, we're only talking about fdatasync() equivalency, which most file systems do just fine even with a drive's write cache on.
On Jul 15, 2010, at 9:30 AM, Scott Carey wrote:
> Write caching on all hard drives in the last decade is safe because
> they support a write cache flush command properly. If the card is
> "smart" it would issue the drive's write cache flush command to fulfill
> an fsync() or barrier request with no BBU.

You're missing the point. If the power dies suddenly, there's no time to flush any cache anywhere. That's the entire point of the BBU - it keeps the RAM powered up on the raid card. It doesn't keep the disks spinning long enough to flush caches.
On Jul 15, 2010, at 12:40 PM, Ryan Wexler wrote:
> Ben I don't quite follow your message. Could you spell it out a little
> clearer for me?
Most (all?) hard drives have cache built into them. Many raid cards have cache built into them. When the power dies, all the data in any cache is lost, which is why it's dangerous to use it for write caching. For that reason, you can attach a BBU to a raid card which keeps the cache alive until the power is restored (hopefully). But no hard drive I am aware of lets you attach a battery, so using a hard drive's cache for write caching will always be dangerous.
That's why many raid cards will always disable write caching on the hard drives themselves, and only enable write caching using their own memory when a BBU is installed.
Does that make more sense?
On Thu, Jul 15, 2010 at 12:35 PM, Ben Chobot <bench@silentmedia.com> wrote:
> You're missing the point. If the power dies suddenly, there's no time
> to flush any cache anywhere. That's the entire point of the BBU - it
> keeps the RAM powered up on the raid card.
So you are saying write caching is a dangerous proposition on a raid card with or without BBU?
On Jul 15, 2010, at 2:40 PM, Ryan Wexler wrote:
> So you are saying write caching is a dangerous proposition on a raid card with or without BBU?
Er, no, sorry, I am not being very clear it seems.
Using a cache for write caching is dangerous, unless you protect it with a battery. Caches on a raid card can be protected by a BBU, so, when you use a BBU, write caching on the raid card is safe. (Just don't read the firmware changelog for your raid card or you will always be paranoid.) If you don't have a BBU, many raid cards default to disabling caching. You can still enable it, but the card will often tell you it's a bad idea.
There are also caches on all your disk drives. Write caching there is always dangerous, which is why almost all raid cards always disable the hard drive write caching, with or without a BBU. I'm not even sure how many raid cards let you enable the write cache on a drive... hopefully, not many.
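(On Linux, assuming the controller passes ATA commands through to the drives - not all do - you can usually inspect and toggle a SATA drive's own cache with hdparm; the device name below is just an example:)

    hdparm -W /dev/sda       # query the drive's write-cache state
    hdparm -W0 /dev/sda      # turn the drive's volatile write cache off
    hdparm -W1 /dev/sda      # turn it back on (only sane with working barriers)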
> Most (all?) hard drives have cache built into them. Many raid cards have
> cache built into them. When the power dies, all the data in any cache is
> lost, which is why it's dangerous to use it for write caching. For that
> reason, you can attach a BBU to a raid card which keeps the cache alive
> until the power is restored (hopefully). But no hard drive I am aware of
> lets you attach a battery, so using a hard drive's cache for write
> caching will always be dangerous.
>
> That's why many raid cards will always disable write caching on the hard
> drives themselves, and only enable write caching using their own memory
> when a BBU is installed.
>
> Does that make more sense?

Actually, a write cache is only dangerous if the OS and postgres think some stuff is written to the disk when in fact it is only in the cache and not written yet. When power is lost, cache contents are SUPPOSED to be lost.

In a normal situation, postgres and the OS assume nothing is written to the disk (i.e., it may be in cache, not on disk) until a proper cache flush is issued and responded to by the hardware. That's what the xlog and journals are for. If the hardware doesn't lie, and the kernel/FS doesn't have any bugs, there is no problem.

You can't get decent write performance on rotating media without a write cache somewhere...
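(A rough sketch of a sanity check - path and counts are arbitrary: force every write out to "stable storage" and watch the rate. A single honest 7200rpm spindle can only complete on the order of 100-200 such writes per second; thousands per second mean some cache is absorbing the flushes - fine if it's a BBU raid cache, alarming if it's a bare drive:)

    # each 8kB block is synced to disk before dd issues the next write
    dd if=/dev/zero of=/var/lib/pgsql/flushtest bs=8k count=1000 oflag=dsync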
On Thu, Jul 15, 2010 at 10:30 AM, Scott Carey <scott@richrelevance.com> wrote:
>
> On Jul 14, 2010, at 7:50 PM, Ben Chobot wrote:
>
>> On Jul 14, 2010, at 6:57 PM, Scott Carey wrote:
>>
>>> But none of this explains why a 4-disk raid 10 is slower than a 1 disk system. If there is no write-back caching on the RAID, it should still be similar to the one disk setup.
>>
>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.
>
> This does not make sense.

Basically, you can have cheap, fast and dangerous (a drive with its write cache enabled, which responds positively to fsync even when it hasn't actually fsynced the data). You can have cheap, slow and safe with a drive that has a cache, but since it'll be fsyncing all the time the write cache won't actually get used. Or you can have fast, expensive, and safe, which is what a BBU RAID card gets you by saying the data is fsynced when it's actually just in cache - but a safe cache that won't get lost on power down.

I don't find it that complicated.
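(One crude way to tell which of those three regimes a given box is in - database name and sizes here are arbitrary - is a small write-heavy pgbench run. The default transaction commits constantly, so TPS on a single honest spindle should sit on the order of the drive's rotation rate, while thousands of TPS mean a write-back cache somewhere:)

    createdb bench
    pgbench -i -s 10 bench      # load a small test database
    pgbench -c 4 -T 60 bench    # 4 clients for 60 seconds of TPC-B-ish writes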
On Jul 15, 2010, at 12:35 PM, Ben Chobot wrote:
> On Jul 15, 2010, at 9:30 AM, Scott Carey wrote:
>
>>> Many raid controllers are smart enough to always turn off write caching on the drives, and also disable the feature on their own buffer without a BBU. Add a BBU, and the cache on the controller starts getting used, but *not* the cache on the drives.
>>
>> This does not make sense.
>> Write caching on all hard drives in the last decade is safe because they support a write cache flush command properly. If the card is "smart" it would issue the drive's write cache flush command to fulfill an fsync() or barrier request with no BBU.
>
> You're missing the point. If the power dies suddenly, there's no time to flush any cache anywhere. That's the entire point of the BBU - it keeps the RAM powered up on the raid card. It doesn't keep the disks spinning long enough to flush caches.

If the power dies suddenly, then the data that is in the OS RAM will also be lost. What about that?

Well, it doesn't matter, because the DB is only relying on data that it thinks has been persisted to disk via fsync().

The data in the disk cache is the same thing as RAM. As long as fsync() works _properly_ - which is true for any file system + disk combination worth a damn (not HFS+ on OSX, not FAT, not a few other things) - then it will tell the drive to flush its cache _before_ fsync() returns. There is NO REASON for a raid card to turn off a drive cache unless it does not trust the drive cache. In write-through mode, it should not return to the OS from an fsync, direct write, or other "the OS thinks this data is persisted now" call until it has flushed the disk cache. That does not mean it has to turn off the disk cache.
On Jul 15, 2010, at 6:22 PM, Scott Marlowe wrote:
> Basically, you can have cheap, fast and dangerous (a drive with its write cache enabled, which responds positively to fsync even when it hasn't actually fsynced the data). You can have cheap, slow and safe with a drive that has a cache, but since it'll be fsyncing all the time the write cache won't actually get used. Or you can have fast, expensive, and safe, which is what a BBU RAID card gets you by saying the data is fsynced when it's actually just in cache - but a safe cache that won't get lost on power down.
>
> I don't find it that complicated.

It doesn't make sense that a raid 10 will be slower than a 1-disk setup unless the former respects fsync() and the latter does not. Individual drive write cache does not explain the situation. That is what does not make sense.

When in _write-through_ mode, there is no reason to turn off the drive's write cache unless the drive does not properly respect its cache-flush command, or the RAID card is too dumb to issue cache-flush commands. The RAID card simply has to issue its writes, then issue the flush commands, then return to the OS when those complete. With drive write caches on, this is perfectly safe. The only way it is unsafe is if the drive lies and returns from a cache flush before the data from its cache is actually flushed.

Some SSDs on the market currently lie. A handful of the thousands of hard drive models in the server, desktop, and laptop space in the last decade did not respect the cache flush command properly, and none in the SAS/SCSI or 'enterprise SATA' space lie, to my knowledge. Information on this topic has come across this list several times.

The explanation for why one setup respects fsync() and another does not almost always lies in the FS + OS combination. HFS+ on OSX does not respect fsync. ext3 until recently only did fdatasync() when you told it to fsync() (which is fine for postgres' transaction log anyway).

A raid card, especially with any SAS/SCSI drives, has no reason to turn off the drives' write cache unless it _wants_ to return to the OS before the data is on the drive. That condition occurs in write-back cache mode, when the RAID card's cache is safe via a battery or some other mechanism. In that case, it should turn off the drives' write cache so that it can be sure data is on disk when power fails, without having to call the cache-flush command on every write. That way, it can remove data from its RAM as soon as the drive returns from the write.

In write-through mode it should turn the caches back on and rely on the flush command to pass through direct writes, cache flush demands, and barrier requests. It could optionally turn the caches off, but that won't improve data safety unless the drive cannot faithfully flush its cache.
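(For what it's worth, on SAS/SCSI drives the knob in question is the WCE bit in the caching mode page; sdparm syntax here is from memory, so verify against its man page:)

    sdparm --get=WCE /dev/sda      # 1 = drive write cache enabled
    sdparm --clear=WCE /dev/sda    # disable the drive's write cache
    sdparm --set=WCE /dev/sda      # enable it again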
On Jul 15, 2010, at 8:16 PM, Scott Carey wrote:
> If the power dies suddenly, then the data that is in the OS RAM will also be lost. What about that?
>
> Well, it doesn't matter, because the DB is only relying on data that it thinks has been persisted to disk via fsync().

Right, we agree that only what has been fsync()'d has a chance to be safe....

> The data in the disk cache is the same thing as RAM. As long as fsync() works _properly_ - which is true for any file system + disk combination worth a damn (not HFS+ on OSX, not FAT, not a few other things) - then it will tell the drive to flush its cache _before_ fsync() returns. There is NO REASON for a raid card to turn off a drive cache unless it does not trust the drive cache. In write-through mode, it should not return to the OS from an fsync, direct write, or other "the OS thinks this data is persisted now" call until it has flushed the disk cache. That does not mean it has to turn off the disk cache.

...and here you are also right, in that a write-through write cache is safe, with or without a battery. A write-through cache is a win for things that don't often fsync, but my understanding is that with a database, you end up fsyncing all the time, which makes a write-through cache not worth very much. The only good way to get good *database* performance out of spinning media is with a write-back cache, and the only way to make that safe is to hook up a BBU.
On 16/07/10 06:18, Ben Chobot wrote:
> There are also caches on all your disk drives. Write caching there is always dangerous, which is why almost all raid cards always disable the hard drive write caching, with or without a BBU. I'm not even sure how many raid cards let you enable the write cache on a drive... hopefully, not many.

AFAIK disk drive caches can be safe to leave in write-back mode (i.e. write cache enabled) *IF* the OS uses write barriers (properly) and the drive understands them.

Big if.

--
Craig Ringer
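(On Linux of this vintage, barriers are a mount option; defaults vary by filesystem and distro, so verify rather than assume - the mount point below is just an example:)

    mount -o remount,barrier=1 /var/lib/pgsql    # ext3: ask for barriers (often off by default)
    mount -o remount,nobarrier /var/lib/pgsql    # xfs: barriers are the default; this disables them
    mount | grep pgsql                           # confirm what is actually in effect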
On 16/07/10 09:22, Scott Marlowe wrote:
> Basically, you can have cheap, fast and dangerous (a drive with its write cache enabled, which responds positively to fsync even when it hasn't actually fsynced the data). You can have cheap, slow and safe with a drive that has a cache, but since it'll be fsyncing all the time the write cache won't actually get used. Or you can have fast, expensive, and safe, which is what a BBU RAID card gets you by saying the data is fsynced when it's actually just in cache - but a safe cache that won't get lost on power down.

Speaking of BBUs... do you ever find yourself wishing you could use software RAID with battery backup?

I tend to use software RAID quite heavily on non-database servers, as it's cheap, fast, portable from machine to machine, and (in the case of Linux 'md' raid) reliable. Alas, I can't really use it for DB servers due to the need for write-back caching.

There's no technical reason I know of why sw raid couldn't write-cache to some non-volatile memory on the host. A dedicated battery-backed pair of DIMMs on a PCI-E card mapped into memory would be ideal. Failing that, a PCI-E card with onboard RAM+BATT or fast flash that presents an AHCI interface, so it can be used as a virtual HDD, would do pretty well. Even one of those SATA "RAM drive" units would do the job, though forcing everything through the SATA2 bus would be a performance downside.

The only issue I see with sw raid write caching is that it probably couldn't be done safely on the root file system. The OS would have to come up, init software raid, and find the caches before it'd be safe to read or write volumes with s/w raid write caching enabled. It's not the sort of thing that'd be practical to implement in GRUB's raid support.

--
Craig Ringer
Scott Carey wrote:
> As long as fsync() works _properly_ - which is true for any file system + disk combination worth a damn (not HFS+ on OSX, not FAT, not a few other things) - then it will tell the drive to flush its cache _before_ fsync() returns. There is NO REASON for a raid card to turn off a drive cache unless it does not trust the drive cache. In write-through mode, it should not return to the OS from an fsync, direct write, or other "the OS thinks this data is persisted now" call until it has flushed the disk cache. That does not mean it has to turn off the disk cache.

Assuming that the operating system will pass through fsync calls to flush data all the way to drive level in all situations is an extremely dangerous assumption. Most RAID controllers don't know how to force things out of the individual drive caches; that's why they turn off write caching on them. Few filesystems get the details right to handle individual drive cache flushing correctly. On Linux, XFS and ext4 are the only two with any expectation that will happen, and of those two ext4 is still pretty new and therefore should still be presumed to be buggy.

Please don't advise people about what is safe based on theoretical grounds here; in practice there are way too many bugs in the implementation of things like drive barriers to trust them most of the time. There is no substitute for a pull-the-plug test using something that looks for bad cache flushes, i.e. diskchecker.pl:

http://brad.livejournal.com/2116715.html

If you do that you'll discover that you must turn off the individual drive caches when using a battery-backed RAID controller, and that you can't ever trust barriers on ext3 because of bugs that were only fixed in ext4.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
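(From memory, a diskchecker.pl run looks roughly like the following - trust the script's own usage text over this sketch. One machine survives the test and listens; the victim writes through it, gets its plug pulled mid-write, and is checked after reboot. Hostname and file name below are made up:)

    diskchecker.pl -l                                     # on a second, surviving machine
    diskchecker.pl -s othermachine create test_file 500   # on the machine under test
    # ...pull the plug on the machine under test, boot it back up, then:
    diskchecker.pl -s othermachine verify test_file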