Thread: SAN performance mystery
We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).

They are unhappy about their query performance. We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.

We tried restoring the pg_dump output to one of our machines, a dual-core pentium D with a single SATA disk, no raid, I forget how much RAM but definitely much less than 8G. The restore took five hours. So it would seem that our machine, which on paper should be far less impressive than the customer's box, does more than four times the I/O performance.

To simplify greatly - single local SATA disk beats EMC SAN by factor of four.

Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?

I'd be grateful for any clues anyone can offer,

Tim
On Thu, 2006-06-15 at 16:50, Tim Allen wrote:
> We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> They are unhappy about their query performance. We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.

Do you have the ability to do any simple IO performance testing, like with bonnie++ (the old bonnie is not really capable of properly testing modern equipment, but bonnie++ will give you some idea of the throughput of the SAN)? Or even just timing a dd write to the SAN?

> We tried restoring the pg_dump output to one of our machines, a dual-core pentium D with a single SATA disk, no raid, I forget how much RAM but definitely much less than 8G. The restore took five hours. So it would seem that our machine, which on paper should be far less impressive than the customer's box, does more than four times the I/O performance.
>
> To simplify greatly - single local SATA disk beats EMC SAN by factor of four.
>
> Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?

Yes, this is not uncommon. It is very likely that your SATA disk is lying about fsync.

What kind of backup are you using? insert statements or copy statements? If insert statements, then the difference is quite believable. If copy statements, less so.

Next time, on their big server, see if you can try a restore with fsync turned off and see if that makes the restore faster. Note you should turn fsync back on after the restore, as running without it is quite dangerous should you suffer a power outage.

How are you mounting to the EMC SAN? NFS, iSCSI? Other?
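[For reference, a rough version of the test Scott describes might look like the sketch below. The mount point is a placeholder, and the file needs to be comfortably larger than the 8G of RAM or the numbers will mostly reflect the OS page cache rather than the SAN.]

    # Sequential write: time the dd *and* the sync, otherwise dirty pages
    # still sitting in the page cache flatter the result.
    time sh -c "dd if=/dev/zero of=/san_mount/ddtest bs=8k count=2000000 && sync"

    # bonnie++ wants a working set of at least twice RAM (size is in MB);
    # run it as a non-root user, or add -u <user> when running as root.
    bonnie++ -d /san_mount -s 16384

    rm /san_mount/ddtest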
Tim Allen wrote:
> We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> They are unhappy about their query performance. We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.
>
> We tried restoring the pg_dump output to one of our machines, a dual-core pentium D with a single SATA disk, no raid, I forget how much RAM but definitely much less than 8G. The restore took five hours. So it would seem that our machine, which on paper should be far less impressive than the customer's box, does more than four times the I/O performance.
>
> To simplify greatly - single local SATA disk beats EMC SAN by factor of four.
>
> Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,

I'm actually in a not dissimilar position here - I was seeing the performance of Postgres going to an EMC RAID over iSCSI running at about 1/2 the speed of a lesser machine hitting a local SATA drive. That was, until I noticed that the SATA drive Postgres installation had fsync turned off, and the EMC version had fsync turned on. Turning fsync on on the SATA drive dropped its performance to about 1/4 that of the EMC.

Moral of the story: make sure you're comparing apples to apples.

Brian
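[A quick way to make the comparison apples-to-apples is to check the relevant settings on both boxes before timing anything; the database name and data directory below are placeholders, and this is only a sketch.]

    # What is each server actually running with?
    psql -d mydb -c "SHOW fsync;"
    psql -d mydb -c "SHOW wal_sync_method;"
    psql -d mydb -c "SHOW shared_buffers;"

    # For a bulk-restore-only test, fsync can be flipped in postgresql.conf
    # (fsync = off) and picked up with a reload -- remember to turn it back on.
    pg_ctl reload -D /path/to/data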
On 6/15/06, Tim Allen <tim@proximity.com.au> wrote:
Tim,
Here are the areas I would look at first if we're considering hardware to be the problem:
HBA and driver:
Since this is an Intel/Linux system, the HBA is PROBABLY a QLogic. I would need to know the SAN model to see what the backend of the SAN itself is. EMC has some FC-attach models that actually have SATA disks underneath. You also might want to look at the cache size of the controllers on the SAN.
- Something also to note is that EMC provides an add-on called PowerPath for load balancing multiple HBAs. If they don't have this, it might be worth investigating.
- As with anything, disk layout is important. With the lower-end IBM SAN (DS4000) you actually have to operate at the physical spindle level. On our 4300, when I create a LUN, I select the exact disks I want and which of the two controllers is the preferred path. On our DS6800, I just ask for storage. I THINK all the EMC models are the "ask for storage" type of scenario. However, with the 6800, you select your storage across extent pools.
Have they done any benchmarking of the SAN outside of postgres? Before we settle on a new LUN configuration, we always do the dd,umount,mount,dd routine. It's not a perfect test for databases but it will help you catch GROSS performance issues.
SAN itself:
- Could the SAN be oversubscribed? How many hosts and LUNs total do they have and what are the queue_depths for those hosts? With the QLogic card, you can set the queue depth in the BIOS of the adapter when the system is booting up (CTRL-Q, I think). If the system has enough local DASD to relocate the database internally, it might be a valid test to do so and see if you can isolate the problem to the SAN itself.
PG itself:
If you think it's a pgsql configuration issue, I'm guessing you already configured postgresql.conf to match theirs (or at least a fraction of theirs, since the memory isn't the same)? What about loading a "from-scratch" config file and restarting the tuning process?
Just a dump of my thought process from someone who's been spending too much time tuning his SAN and postgres lately.
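[The dd/umount/mount/dd routine John mentions above is roughly the following; the device, mount point and sizes are made up, and the unmount/remount is there to empty the OS cache so the read actually hits the SAN.]

    # Write a test file bigger than the array cache.
    dd if=/dev/zero of=/mnt/san/testfile bs=1M count=4096
    sync

    # Remount to drop cached pages (assumes an fstab entry for /mnt/san).
    umount /mnt/san && mount /mnt/san

    # Cold sequential read back.
    time dd if=/mnt/san/testfile of=/dev/null bs=1M
    rm /mnt/san/testfile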
<snipped>
Is that expected performance, anyone? It doesn't sound right to me. Does
anyone have any clues about what might be going on? Buggy kernel
drivers? Buggy kernel, come to think of it? Does a SAN just not provide
adequate performance for a large database?
I'd be grateful for any clues anyone can offer,
Tim
Brian Hurt <bhurt@janestcapital.com> writes:
> Tim Allen wrote:
>> To simplify greatly - single local SATA disk beats EMC SAN by factor of four.
> I'm actually in a not dissimilar position here - I was seeing the performance of Postgres going to an EMC RAID over iSCSI running at about 1/2 the speed of a lesser machine hitting a local SATA drive. That was, until I noticed that the SATA drive Postgres installation had fsync turned off, and the EMC version had fsync turned on. Turning fsync on on the SATA drive dropped its performance to about 1/4 that of the EMC.

And that's assuming that the SATA drive isn't configured to lie about write completion ...

I agree with Brian's suspicion that the SATA drive isn't properly fsync'ing to disk, resulting in bogusly high throughput. However, ISTM a well-configured SAN ought to be able to match even the bogus throughput, because it should be able to rely on battery-backed cache to hold written blocks across a power failure, and hence should be able to report write-complete as soon as it's got the page in cache rather than having to wait till it's really down on magnetic platter. Which is what the SATA drive is doing ... only it can't keep the promise it's making for lack of any battery backup on its on-board cache.

So I'm thinking *both* setups may be misconfigured. Or else you forgot to buy the battery-backed-cache option on the SAN hardware.

			regards, tom lane
On Thu, 2006-06-15 at 18:24 -0400, Tom Lane wrote:
> I agree with Brian's suspicion that the SATA drive isn't properly fsync'ing to disk, resulting in bogusly high throughput. However, ISTM a well-configured SAN ought to be able to match even the bogus throughput, because it should be able to rely on battery-backed cache to hold written blocks across a power failure, and hence should be able to report write-complete as soon as it's got the page in cache rather than having to wait till it's really down on magnetic platter. Which is what the SATA drive is doing ... only it can't keep the promise it's making for lack of any battery backup on its on-board cache.

It really depends on your SAN RAID controller. We have an HP SAN; I don't remember the model number exactly, but we ran some tests and with the battery-backed write cache enabled, we got some improvement in write performance but it wasn't NEARLY as fast as an SATA drive which lied about write completion.

The write-and-fsync latency was only about 2-3 times better than with no write cache at all. So I wouldn't assume that just because you've got a write cache on your SAN, that you're getting the same speed as fsync=off, at least for some cheap controllers.

-- Mark Lewis
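[One way to put a number on that write-and-fsync latency without any special tools is to time a batch of small single-row commits. Table and database names below are made up. With honest fsync and no write-back cache, a single 7200rpm spindle should land roughly around 100-150 commits/sec; rates in the thousands usually mean either a (hopefully battery-backed) write cache or a drive acknowledging writes it hasn't written. contrib/pgbench gives a similar picture with less typing.]

    psql -d testdb -c "CREATE TABLE fsync_probe (i int);"

    # 1000 separate statements = 1000 separate commits under psql's autocommit.
    seq 1 1000 | sed 's/.*/INSERT INTO fsync_probe VALUES (&);/' > /tmp/fsync_probe.sql
    time psql -d testdb -q -f /tmp/fsync_probe.sql

    psql -d testdb -c "DROP TABLE fsync_probe;"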
Given that most SATA drives have only an 8MB cache, and your RAID controller should have at least 64MB, I would argue that the system with the RAID controller should always be faster. If it's not, you're getting short-changed somewhere, which is typical on Linux, because the drivers just aren't there for a great many of the controllers out there.
Alex.
Tim Allen wrote:
> We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> They are unhappy about their query performance. We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.

Hi Tim!

To give you some comparison - we have a similarly sized database here (~38GB after a fresh restore and ~76GB after some months in production). The server is a 4-core Opteron @ 2.4GHz with 16GB RAM, connected via 2 QLogic 2Gbit HBAs to the SAN (IBM DS4300 Turbo). It took us quite a while to get this combination up to speed, but a full dump & restore cycle (via a pg_dump | psql pipe over the net) now takes only about an hour.

23 hours or even 5 hours sounds really excessive - I'm wondering about some basic issues with the SAN. If you are using any kind of multipathing (most likely the one in the QLA drivers), I would first suspect that you are playing ping-pong between the controllers (ie the FC cards send IO to more than one SAN head, causing those to fail over constantly, which completely destroys performance).

ES3 is rather old too, and I don't think that even their hacked-up kernel is very good at driving a large Opteron SMP box (2.6 should be MUCH better in that regard).

Other than that - how well is your postgresql instance tuned to your hardware?

Stefan
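[For reference, the piped dump/restore Stefan mentions is just something like the following; host and database names are placeholders. On the restore side, temporarily turning fsync off and raising maintenance_work_mem / checkpoint_segments is usually what gets a restore of this size down from days to hours.]

    # Create the target database first, then stream a plain-format dump into it.
    createdb -h newhost bigdb
    time pg_dump -h oldhost bigdb | psql -h newhost -q bigdb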
Tim Allen wrote:
> We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> To simplify greatly - single local SATA disk beats EMC SAN by factor of four.
>
> Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,
>
> Tim

Thanks to all who have replied so far. I've learned a few new things in the meantime.

Firstly, the fibre-channel card is an Emulex LP1050. The customer seems to have rather old drivers for it, so I have recommended that they upgrade asap. I've also suggested they might like to upgrade their kernel to something recent too (eg upgrade to RHEL4), but there's no telling whether they'll accept that recommendation.

The fact that SATA drives are wont to lie about write completion, which several posters have pointed out, presumably has an effect on write performance (ie apparent write performance is increased at the cost of an increased risk of data loss), but, again presumably, not much of an effect on read performance. After loading the customer's database on our fairly modest box with the single SATA disk, we also tested select query performance, and while we didn't see a factor-of-four gain, we certainly saw that read performance is also substantially better. So the fsync issue possibly accounts for part of our factor of four, but not all of it. Ie, the SAN is still not doing well by comparison, even allowing for the presumption that it is more honest.

One curious thing is that some postgres backends seem to spend an inordinate amount of time in uninterruptible iowait state. I found a posting to this list from December 2004 from someone who reported that very same thing. For example, bringing down postgres on the customer box requires kill -9, because there are invariably one or two processes so deeply uninterruptible as to not respond to a politer signal. That indicates something not quite right, doesn't it?

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
"Alex Turner" <armtuk@gmail.com> writes: > Given the fact that most SATA drives have only an 8MB cache, and your RAID > controller should have at least 64MB, I would argue that the system with the > RAID controller should always be faster. If it's not, you're getting > short-changed somewhere, which is typical on linux, because the drivers just > aren't there for a great many controllers that are out there. Alternatively Linux is using the 1-4 gigabytes of cache available to it effectively enough that the 64 megabytes of mostly duplicated cache just isn't especially helpful... I never understood why disk caches on the order of megabytes are exciting. Why should disk manufacturers be any better about cache management than OS authors? In the case of RAID 5 this could actually work against you since the RAID controller can _only_ use its cache to find parity blocks when writing. Software raid can use all of the OS's disk cache to that end. -- greg
We've seen similar results with our EMC CX200 (fully equipped) when compared to a single (1) SCSI disk machine. For sequential reads/writes (import, export, updates on 5-10 30M+ row tables), performance is downright awful. A big DB update took 5-6h in pre-prod (single SCSI), and 10-14?h (don't recall the exact details) in production (EMC SAN). And this was with a proprietary DB, btw - no fsync on/off affecting the results here.

FC isn't exactly known for great bandwidth either - iirc a 2Gbit FC channel tops out at about 192MB/s. So, especially if you mostly have DW/BI type workloads, go for DAD (Direct Attached Disks) instead.

/Mikael
On 6/16/06, Mikael Carneholm <Mikael.Carneholm@wirelesscar.com> wrote:
> We've seen similar results with our EMC CX200 (fully equipped) when compared to a single (1) SCSI disk machine. For sequential reads/writes (import, export, updates on 5-10 30M+ row tables), performance is downright awful. A big DB update took 5-6h in pre-prod (single SCSI), and 10-14?h (don't recall the exact details) in production (EMC SAN). And this was with a proprietary DB, btw - no fsync on/off affecting the results here.

You are in good company. We bought a Hitachi AMS200, 2Gb FC and a gigabyte of cache. We were shocked and dismayed to find the unit could do about 50MB/sec measured from dd (yes, around the performance of a single consumer-grade SATA drive). It is my (unconfirmed) belief that the unit was governed internally to encourage you to buy the more expensive version, AMS500, etc.

Needless to say, we sent the unit back, and are now waiting on a Xyratex 4Gb FC-attached SAS unit. We spoke directly to their performance people, who told us to expect the unit to be network-bandwidth bottlenecked, as you would expect. They were even talking about a special mode where you could bond the dual FC ports; now that's power. If the unit really does what they claim, I will be back here talking about it for sure ;)

The bottom line is that most SANs, even from some of the biggest vendors, are simply worthless from a performance angle. You have to be really critical when you buy them, don't believe anything the sales rep tells you, and make sure to negotiate in advance a return policy if the unit does not perform. There is tons of b.s. out there, but so far my impression of Xyratex is really favorable (fingers crossed), and I'm hearing lots of great stuff about them from the channel.

merlin
On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
> One curious thing is that some postgres backends seem to spend an inordinate amount of time in uninterruptible iowait state. I found a posting to this list from December 2004 from someone who reported that very same thing. For example, bringing down postgres on the customer box requires kill -9, because there are invariably one or two processes so deeply uninterruptible as to not respond to a politer signal. That indicates something not quite right, doesn't it?

Sounds like there could be a driver/array/kernel bug there that is kicking the performance down the tube. If it was PG's fault it wouldn't be stuck uninterruptible.

--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/
On Jun 16, 2006, at 6:28 AM, Greg Stark wrote:
> I never understood why disk caches on the order of megabytes are exciting. Why should disk manufacturers be any better about cache management than OS authors?
>
> In the case of RAID 5 this could actually work against you since the RAID controller can _only_ use its cache to find parity blocks when writing. Software raid can use all of the OS's disk cache to that end.

IIRC some of the Bizgres folks have found better performance with software raid for just that reason.

The big advantage HW raid has is that you can do a battery-backed cache, something you'll never be able to duplicate in a general-purpose computer (sure, you could battery-back the DRAM if you really wanted to, but if the kernel crashed you'd be completely screwed, which isn't the case with a battery-backed RAID controller).

The quality of the RAID controller also makes a huge difference.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Jeff Trout wrote:
> On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
>> One curious thing is that some postgres backends seem to spend an inordinate amount of time in uninterruptible iowait state. I found a posting to this list from December 2004 from someone who reported that very same thing. For example, bringing down postgres on the customer box requires kill -9, because there are invariably one or two processes so deeply uninterruptible as to not respond to a politer signal. That indicates something not quite right, doesn't it?
>
> Sounds like there could be a driver/array/kernel bug there that is kicking the performance down the tube. If it was PG's fault it wouldn't be stuck uninterruptible.

That's what I thought. I've advised the customer to upgrade their kernel drivers, and to preferably upgrade their kernel as well. We'll see if they accept the advice :-|.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
Scott Marlowe wrote:
> On Thu, 2006-06-15 at 16:50, Tim Allen wrote:
>> We have a customer who are having performance problems. They have a large (36G+) postgres 8.1.3 database installed on an 8-way opteron with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details of the EMC SAN model, or the type of fibre-channel card at the moment). They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>>
>> They are unhappy about their query performance. We've been doing various things to try to work out what we can do. One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G. Performing the restore took about 23 hours.
>
> Do you have the ability to do any simple IO performance testing, like with bonnie++ (the old bonnie is not really capable of properly testing modern equipment, but bonnie++ will give you some idea of the throughput of the SAN)? Or even just timing a dd write to the SAN?

I've done some timed dd's. The timing results vary quite a bit, but it seems you can write to the SAN at about 20MB/s and read from it at about 12MB/s. Not an entirely scientific test, as I wasn't able to stop other activity on the machine, though I don't think much else was happening. Certainly not impressive figures, compared with our machine with the SATA disk (referred to below), which can get 161MB/s copying files on the same disk, and 48MB/s copying from the SATA disk to a RAID5 array and 138MB/s copying back the other way.

The customer is a large organisation, with a large IT department who guard their turf carefully, so there is no way I could get away with installing any heavier-duty testing tools like bonnie++ on their machine.

>> We tried restoring the pg_dump output to one of our machines, a dual-core pentium D with a single SATA disk, no raid, I forget how much RAM but definitely much less than 8G. The restore took five hours. So it would seem that our machine, which on paper should be far less impressive than the customer's box, does more than four times the I/O performance.
>>
>> To simplify greatly - single local SATA disk beats EMC SAN by factor of four.
>>
>> Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?
>
> Yes, this is not uncommon. It is very likely that your SATA disk is lying about fsync.

I guess a sustained write will flood the disk's cache and negate the effect of the write-completion dishonesty. But I have no idea how large a copy would have to be to do that - can anyone suggest a figure? Certainly, the read performance of the SATA disk still beats the SAN, and there is no way to lie about read performance.

> What kind of backup are you using? insert statements or copy statements? If insert statements, then the difference is quite believable. If copy statements, less so.

A binary pg_dump, which amounts to copy statements, if I'm not mistaken.

> Next time, on their big server, see if you can try a restore with fsync turned off and see if that makes the restore faster. Note you should turn fsync back on after the restore, as running without it is quite dangerous should you suffer a power outage.
>
> How are you mounting to the EMC SAN? NFS, iSCSI? Other?

iSCSI, I believe. Some variant of SCSI, anyway, of that I'm certain.

The conclusion I'm drawing here is that this SAN does not perform at all well, and is not a good database platform. It's sounding from replies from other people that this might be a general property of SANs, or at least the ones that are not stratospherically priced.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
On Mon, Jun 19, 2006 at 08:09:47PM +1000, Tim Allen wrote: >Certainly, the read performance of the SATA disk still beats the SAN, >and there is no way to lie about read performance. Sure there is: you have the data cached in system RAM. I find it real hard to believe that you can sustain 161MB/s off a single SATA disk. Mike Stone
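[To take the page cache out of a read test, either read a file comfortably larger than RAM or drop the caches first. The second trick needs root and a 2.6.16-or-newer kernel, so it won't work on the customer's 2.4 box; the file path below is only illustrative.]

    sync
    echo 3 > /proc/sys/vm/drop_caches      # 2.6.16+ only
    time dd if=/data/bigtestfile of=/dev/null bs=1M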
John Vincent wrote:
> <snipped>
> Is that expected performance, anyone? It doesn't sound right to me. Does anyone have any clues about what might be going on? Buggy kernel drivers? Buggy kernel, come to think of it? Does a SAN just not provide adequate performance for a large database?
>
> Tim,
>
> Here are the areas I would look at first if we're considering hardware to be the problem:
>
> HBA and driver:
> Since this is an Intel/Linux system, the HBA is PROBABLY a QLogic. I would need to know the SAN model to see what the backend of the SAN itself is. EMC has some FC-attach models that actually have SATA disks underneath. You also might want to look at the cache size of the controllers on the SAN.

As I noted in another thread, the HBA is an Emulex LP1050, and they have a rather old driver for it. I've recommended that they update ASAP. This hasn't happened yet. I know very little about the SAN itself - the customer hasn't provided any information other than the brand name, as they selected it and installed it themselves. I shall ask for more information.

> - Something also to note is that EMC provides an add-on called PowerPath for load balancing multiple HBAs. If they don't have this, it might be worth investigating.

OK, thanks, I'll ask the customer whether they've used PowerPath at all. They do seem to have it installed on the machine, but I suppose that doesn't guarantee it's being used correctly. However, it looks like they have just the one HBA, so, if I've correctly understood what load balancing means in this context, it's not going to help; right?

> - As with anything, disk layout is important. With the lower-end IBM SAN (DS4000) you actually have to operate at the physical spindle level. On our 4300, when I create a LUN, I select the exact disks I want and which of the two controllers is the preferred path. On our DS6800, I just ask for storage. I THINK all the EMC models are the "ask for storage" type of scenario. However, with the 6800, you select your storage across extent pools.
>
> Have they done any benchmarking of the SAN outside of postgres? Before we settle on a new LUN configuration, we always do the dd,umount,mount,dd routine. It's not a perfect test for databases but it will help you catch GROSS performance issues.

I've done some dd'ing myself, as described in another thread. The results are not at all encouraging - their SAN seems to do about 20MB/s or less.

> SAN itself:
> - Could the SAN be oversubscribed? How many hosts and LUNs total do they have and what are the queue_depths for those hosts? With the QLogic card, you can set the queue depth in the BIOS of the adapter when the system is booting up (CTRL-Q, I think). If the system has enough local DASD to relocate the database internally, it might be a valid test to do so and see if you can isolate the problem to the SAN itself.

The SAN possibly is over-subscribed. Can you suggest any easy ways for me to find out? The customer has an IT department who look after their SANs, and they're not keen on outsiders poking their noses in. It's hard for me to get any direct access to the SAN itself.

> PG itself:
>
> If you think it's a pgsql configuration issue, I'm guessing you already configured postgresql.conf to match theirs (or at least a fraction of theirs, since the memory isn't the same)? What about loading a "from-scratch" config file and restarting the tuning process?

The pg configurations are not identical. However, given the differences in raw I/O speed observed, it doesn't seem likely that the difference in configuration is responsible. Yes, as you guessed, we set more conservative options on the less capable box. Doing proper double-blind tests on the customer box is difficult, as it is in production and the customer has a very low tolerance for downtime.

> Just a dump of my thought process from someone who's been spending too much time tuning his SAN and postgres lately.

Thanks for all the suggestions, John. I'll keep trying to follow some of them up.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
* Tim Allen (tim@proximity.com.au) wrote: > The conclusion I'm drawing here is that this SAN does not perform at all > well, and is not a good database platform. It's sounding from replies > from other people that this might be a general property of SAN's, or at > least the ones that are not stratospherically priced. I'd have to agree with you about the specific SAN/setup you're working with there. I certainly disagree that it's a general property of SAN's though. We've got a DS4300 with FC controllers and drives, hosts are generally dual-controller load-balanced and it works quite decently. Indeed, the EMC SANs are generally the high-priced ones too, so not really sure what to tell you about the poor performance you're seeing out of it. Your IT folks and/or your EMC rep. should be able to resolve that, really... Enjoy, Stephen
On 6/19/06, Tim Allen <tim@proximity.com.au> wrote:
As I noted in another thread, the HBA is an Emulex LP1050, and they have
a rather old driver for it. I've recommended that they update ASAP. This
hasn't happened yet.
Yeah, I saw that in a later thread. I would suggest also that the BIOS settings on the HBA itself have been investigated. An example is the Qlogic HBAs have a profile of sorts, one for tape and one for disk. Could be something there.
OK, thanks, I'll ask the customer whether they've used PowerPath at all.
They do seem to have it installed on the machine, but I suppose that
doesn't guarantee it's being used correctly. However, it looks like they
have just the one HBA, so, if I've correctly understood what load
balancing means in this context, it's not going to help; right?
If they have a single HBA then no, it won't help. I'm not very intimate with PowerPath, but it might even HURT if they have it enabled with one HBA. As an example, we were in the process of migrating an AIX LPAR to our DS6800. We only had one spare HBA to assign to it. The default policy with the SDD driver is lb (load balancing). The problem is that with the SDD driver you see multiple hdisks per HBA per controller port on the SAN. Since we had 4 controller ports active on the SAN, our HBA saw 4 hdisks per LUN. The SDD driver abstracts that out as a single vpath and you use the vpaths as your PVs on the system. The problem was that it was attempting to load balance across a single HBA, which was NOT what we wanted.
I've done some dd'ing myself, as described in another thread. The
results are not at all encouraging - their SAN seems to do about 20MB/s
or less.
I saw that as well.
The SAN possibly is over-subscribed. Can you suggest any easy ways for
me to find out? The customer has an IT department who look after their
SANs, and they're not keen on outsiders poking their noses in. It's hard
for me to get any direct access to the SAN itself.
When I say over-subscribed, you have to look at all the active LUNs and all of the systems attached as well. With the DS4300 (standard, not the Turbo option), the SAN can handle 512 outstanding I/Os. If I have 4 LUNs assigned to four systems (1 per system), and each LUN has a queue_depth of 128 from each system, I'll oversubscribe with the next host attach unless I back the queue_depth off on each host. Contrast that with the Turbo controller option, which handles 1024 outstanding I/Os, so I can duplicate what I have now or add a second LUN per host. I can't even find how much our DS6800 supports.
Thanks for all the suggestions, John. I'll keep trying to follow some of
them up.
From what I can tell, the SATA fsync problem other people have mentioned sounds like the culprit.
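[The host-side half of that question - what queue depth this box is actually asking for - can usually be checked without any access to the SAN itself. The exact paths depend on the kernel and HBA driver, and the device name below is a placeholder, so treat these as examples rather than a recipe.]

    # On 2.6 kernels, per-device queue depth is exposed in sysfs:
    cat /sys/block/sdb/device/queue_depth

    # For an Emulex HBA, the lpfc driver exposes its LUN queue depth as a
    # module parameter (check the modinfo output for the exact name):
    modinfo lpfc | grep -i queue_depth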
I'd have to agree with you about the specific SAN/setup you're working
with there. I certainly disagree that it's a general property of SAN's
though. We've got a DS4300 with FC controllers and drives, hosts are
generally dual-controller load-balanced and it works quite decently.
How are you guys doing the load balancing? IIRC, the RDAC driver only does failover. Or are you using the OS level multipathing instead? While we were on the 4300 for our AIX boxes, we just created two big RAID5 LUNs and assigned one to each controller. With 2 HBAs and LVM striping that was about the best we could get in terms of load balancing.

Indeed, the EMC SANs are generally the high-priced ones too, so not
really sure what to tell you about the poor performance you're seeing
out of it. Your IT folks and/or your EMC rep. should be able to resolve
that, really...
The only exception I've heard to this is the Clarion AX150. We looked at one and we were warned off of it by some EMC gearheads.
* John Vincent (pgsql-performance@lusis.org) wrote:
>> I'd have to agree with you about the specific SAN/setup you're working with there. I certainly disagree that it's a general property of SAN's though. We've got a DS4300 with FC controllers and drives, hosts are generally dual-controller load-balanced and it works quite decently.
>
> How are you guys doing the load balancing? IIRC, the RDAC driver only does failover. Or are you using the OS level multipathing instead? While we were on the 4300 for our AIX boxes, we just created two big RAID5 LUNs and assigned one to each controller. With 2 HBAs and LVM striping that was about the best we could get in terms of load balancing.

We're using the OS-level multipathing. I tend to prefer using things like multipath over specific-driver options. I haven't spent a huge amount of effort profiling the SAN, honestly, but it's definitely faster than the direct-attached hardware-RAID5 SCSI system we used to use (from nStor), though that could have been because they were smaller, slower, regular SCSI disks (not FC). A simple bonnie++ run on one of the systems on the SAN gave me this:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
vardamir     32200M           40205  15 22399   5           102572 10 288.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2802  99 +++++ +++ +++++ +++  2600  99 +++++ +++ 10205 100

So, 40MB/s out, 102MB/s in, or so. This was on an ext3 filesystem. Underneath that array it's a 3-disk RAID5 of 300GB 10k RPM FC disks. We also have a snapshot on that array, but it was disabled at the time.

>> Indeed, the EMC SANs are generally the high-priced ones too, so not really sure what to tell you about the poor performance you're seeing out of it. Your IT folks and/or your EMC rep. should be able to resolve that, really...
>
> The only exception I've heard to this is the Clarion AX150. We looked at one and we were warned off of it by some EMC gearheads.

Yeah, the Clarion is the EMC "cheap" line, and I think the AX150 was the extra-cheap one which Dell rebranded and sold.

	Thanks,

		Stephen
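[For anyone wanting to reproduce that run: it looks like a bonnie++ 1.03 invocation in fast mode, which is why the per-character columns are empty. The directory and size are whatever suits the host (at least twice RAM, in MB), and -u is only needed when running as root.]

    bonnie++ -d /san/scratch -s 32200 -f -u postgres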
Michael Stone wrote:
> On Mon, Jun 19, 2006 at 08:09:47PM +1000, Tim Allen wrote:
>> Certainly, the read performance of the SATA disk still beats the SAN, and there is no way to lie about read performance.
>
> Sure there is: you have the data cached in system RAM. I find it real hard to believe that you can sustain 161MB/s off a single SATA disk.

Agreed - approx 60-70MB/s seems to be the ballpark for modern SATA drives, so to get 161MB/s you would need about 3 of them striped together (or a partially cached file as indicated).

What is interesting is that (presumably) the same test is getting such uninspiring results on the SAN...

Having said that, I've been there too, about 4 years ago with a SAN that had several 6-disk RAID5 arrays, and the best sequential *read* performance we ever saw from them was about 50MB/s. I recall trying to get performance data from the vendor - only to be told that if we were doing benchmarks - could they have our results when we were finished!

regards

Mark
Hi, Tim,

Tim Allen wrote:
> One thing that has been apparent is that autovacuum has not been able to keep the database sufficiently tamed. A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G.

Two first shots:

- Increase your free_space_map settings, until (auto)vacuum does not warn about a too-small FSM setting any more.

- Tune autovacuum to run more often, possibly with a higher delay setting to lower the load.

If you still have the original database around,

> Performing the restore took about 23 hours.

Try to put the WAL on another spindle, and increase the WAL size / checkpoint segments.

When most of the restore time was spent in index creation, increase the sort mem / maintenance work mem settings.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
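[A quick way to see where those settings currently stand, and whether the FSM really is too small, is sketched below; the database name is a placeholder, and the database-wide VACUUM VERBOSE will take a while on a database this size. The FSM summary, and the warning Markus mentions, appear at the very end of its output.]

    psql -d bigdb -c "SHOW max_fsm_pages;"
    psql -d bigdb -c "SHOW max_fsm_relations;"
    psql -d bigdb -c "SHOW checkpoint_segments;"
    psql -d bigdb -c "SHOW maintenance_work_mem;"

    # Database-wide VACUUM VERBOSE; the INFO messages go to stderr.
    psql -d bigdb -c "VACUUM VERBOSE;" 2>&1 | tail -20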
Hi, Tim,

Seems I sent my message too fast, cut off in the middle of a sentence:

Markus Schaber wrote:
>> A pg_dump/pg_restore cycle reduced the total database size from 81G to 36G.
> If you still have the original database around,

... can you check whether VACUUM FULL and REINDEX achieve the same effect?

Thanks,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
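[One way to run that comparison on a copy of the original database; the database and table names are placeholders. VACUUM FULL takes exclusive locks and can run for a very long time at this size, so it belongs on the copy, not on the production box.]

    psql -d bigdb_copy -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
    vacuumdb --full --verbose bigdb_copy
    psql -d bigdb_copy -c "REINDEX TABLE some_big_table;"   # repeat for the largest tables
    psql -d bigdb_copy -c "SELECT pg_size_pretty(pg_database_size(current_database()));"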