Thread: SAN performance mystery

SAN performance mystery

From
Tim Allen
Date:
We have a customer who are having performance problems. They have a
large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
of the EMC SAN model, or the type of fibre-channel card at the moment).
They're running RedHat ES3 (which means a 2.4.something Linux kernel).

They are unhappy about their query performance. We've been doing various
things to try to work out what we can do. One thing that has been
apparent is that autovacuum has not been able to keep the database
sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
database size from 81G to 36G. Performing the restore took about 23 hours.

We tried restoring the pg_dump output to one of our machines, a
dual-core pentium D with a single SATA disk, no raid, I forget how much
RAM but definitely much less than 8G. The restore took five hours. So it
would seem that our machine, which on paper should be far less
impressive than the customer's box, does more than four times the I/O
performance.

To simplify greatly - single local SATA disk beats EMC SAN by factor of
four.

Is that expected performance, anyone? It doesn't sound right to me. Does
anyone have any clues about what might be going on? Buggy kernel
drivers? Buggy kernel, come to think of it? Does a SAN just not provide
adequate performance for a large database?

I'd be grateful for any clues anyone can offer,

Tim




Re: SAN performance mystery

From
Scott Marlowe
Date:
On Thu, 2006-06-15 at 16:50, Tim Allen wrote:
> We have a customer who are having performance problems. They have a
> large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
> 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
> of the EMC SAN model, or the type of fibre-channel card at the moment).
> They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> They are unhappy about their query performance. We've been doing various
> things to try to work out what we can do. One thing that has been
> apparent is that autovacuum has not been able to keep the database
> sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
> database size from 81G to 36G. Performing the restore took about 23 hours.

Do you have the ability to do any simple IO performance testing, like
with bonnie++ (the old bonnie is not really capable of properly testing
modern equipment, but bonnie++ will give you some idea of the throughput
of the SAN)?  Or even just timing a dd write to the SAN?
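
For instance, something like this (a rough sketch only - /mnt/san and the
sizes are placeholders; the trailing sync matters, otherwise you mostly
time the page cache rather than the SAN):

   # roughly 2GB of sequential writes, timed all the way to disk
   time sh -c 'dd if=/dev/zero of=/mnt/san/ddtest bs=8k count=262144 && sync'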

> We tried restoring the pg_dump output to one of our machines, a
> dual-core pentium D with a single SATA disk, no raid, I forget how much
> RAM but definitely much less than 8G. The restore took five hours. So it
> would seem that our machine, which on paper should be far less
> impressive than the customer's box, does more than four times the I/O
> performance.
>
> To simplify greatly - single local SATA disk beats EMC SAN by factor of
> four.
>
> Is that expected performance, anyone? It doesn't sound right to me. Does
> anyone have any clues about what might be going on? Buggy kernel
> drivers? Buggy kernel, come to think of it? Does a SAN just not provide
> adequate performance for a large database?

Yes, this is not uncommon.  It is very likely that your SATA disk is
lying about fsync.

What kind of backup are you using?  insert statements or copy
statements?  If insert statements, then the difference is quite
believable.  If copy statements, less so.

Next time, on their big server, see if you can try a restore with fsync
turned off and see if that makes the restore faster.  Note you should
turn fsync back on after the restore, as running without it is quite
dangerous should you suffer a power outage.
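
Roughly, as a sketch (database and dump names are placeholders; reload,
or restart if need be, for the change to take effect):

   # postgresql.conf on the target server - ONLY for the duration of the restore
   fsync = off
   # reload/restart, then run the restore, e.g.
   pg_restore -d customerdb /path/to/customerdb.dump
   # afterwards set fsync = on again and reload/restart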

How are you mounting to the EMC SAN?  NFS, iSCSI? Other?

Re: SAN performance mystery

From
Brian Hurt
Date:
Tim Allen wrote:

> We have a customer who are having performance problems. They have a
> large (36G+) postgres 8.1.3 database installed on an 8-way opteron
> with 8G RAM, attached to an EMC SAN via fibre-channel (I don't have
> details of the EMC SAN model, or the type of fibre-channel card at the
> moment). They're running RedHat ES3 (which means a 2.4.something Linux
> kernel).
>
> They are unhappy about their query performance. We've been doing
> various things to try to work out what we can do. One thing that has
> been apparent is that autovacuum has not been able to keep the
> database sufficiently tamed. A pg_dump/pg_restore cycle reduced the
> total database size from 81G to 36G. Performing the restore took about
> 23 hours.
>
> We tried restoring the pg_dump output to one of our machines, a
> dual-core pentium D with a single SATA disk, no raid, I forget how
> much RAM but definitely much less than 8G. The restore took five
> hours. So it would seem that our machine, which on paper should be far
> less impressive than the customer's box, does more than four times the
> I/O performance.
>
> To simplify greatly - single local SATA disk beats EMC SAN by factor
> of four.
>
> Is that expected performance, anyone? It doesn't sound right to me.
> Does anyone have any clues about what might be going on? Buggy kernel
> drivers? Buggy kernel, come to think of it? Does a SAN just not
> provide adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,


I'm actually in a not dissimilar position here- I was seeing the
performance of Postgres going to an EMC Raid over iSCSI running at about
1/2 the speed of a lesser machine hitting a local SATA drive.  That was,
until I noticed that the SATA drive Postgres installation had fsync
turned off, and the EMC version had fsync turned on.  Turning fsync on
on the SATA drive dropped its performance to being about 1/4th that of EMC.

Moral of the story: make sure you're comparing apples to apples.
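
For what it's worth, the quickest way to rule this out on both boxes is
just to ask the servers ("yourdb" below being whichever database you
test against):

   psql -d yourdb -c "SHOW fsync;"
   psql -d yourdb -c "SHOW wal_sync_method;"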

Brian


Re: SAN performance mystery

From
"John Vincent"
Date:
On 6/15/06, Tim Allen <tim@proximity.com.au> wrote:
<snipped>
> Is that expected performance, anyone? It doesn't sound right to me. Does
> anyone have any clues about what might be going on? Buggy kernel
> drivers? Buggy kernel, come to think of it? Does a SAN just not provide
> adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,
>
> Tim

Tim,

Here are the areas I would look at first if we're considering hardware to be the problem:

HBA and driver:
   Since this is an Intel/Linux system, the HBA is PROBABLY a qlogic. I would need to know the SAN model to see what the backend of the SAN is itself. EMC has some FC-attach models that actually have SATA disks underneath. You also might want to look at the cache size of the controllers on the SAN.
   - Something also to note is that EMC provides an add-on called PowerPath for load balancing multiple HBAs. If they don't have this, it might be worth investigating.
  - As with anything, disk layout is important. With the lower end IBM SAN (DS4000) you actually have to operate on the physical spindle level. On our 4300, when I create a LUN, I select the exact disks I want and which of the two controllers is the preferred path. On our DS6800, I just ask for storage. I THINK all the EMC models are the "ask for storage" type of scenario. However with the 6800, you select your storage across extent pools.


Have they done any benchmarking of the SAN outside of postgres? Before we settle on a new LUN configuration, we always do the dd,umount,mount,dd routine. It's not a perfect test for databases but it will help you catch GROSS performance issues.
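
For the record, that routine is roughly the following (a sketch only -
the mount point and size are placeholders; the umount/mount is there
purely to empty the OS cache before the read-back):

   time dd if=/dev/zero of=/mnt/san/ddtest bs=1M count=4096 ; sync   # sequential write
   umount /mnt/san && mount /mnt/san                                 # drop cached copies
   time dd if=/mnt/san/ddtest of=/dev/null bs=1M                     # sequential read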

SAN itself:
  - Could the SAN be oversubscribed? How many hosts and LUNs total do they have and what are the queue_depths for those hosts? With the qlogic card, you can set the queue depth in the BIOS of the adapter when the system is booting up. CTRL-Q I think.  If the system has enough local DASD to relocate the database internally, it might be a valid test to do so and see if you can isolate the problem to the SAN itself.

PG itself:

 If you think it's a pgsql configuration issue, I'm guessing you already configured postgresql.conf to match theirs (or at least a fraction of theirs, since the memory isn't the same?). What about loading a "from-scratch" config file and restarting the tuning process?


Just a dump of my thought process from someone who's been spending too much time tuning his SAN and postgres lately.

Re: SAN performance mystery

From
Tom Lane
Date:
Brian Hurt <bhurt@janestcapital.com> writes:
> Tim Allen wrote:
>> To simplify greatly - single local SATA disk beats EMC SAN by factor
>> of four.

> I'm actually in a not dissimilar position here- I was seeing the
> performance of Postgres going to an EMC Raid over iSCSI running at about
> 1/2 the speed of a lesser machine hitting a local SATA drive.  That was,
> until I noticed that the SATA drive Postgres installation had fsync
> turned off, and the EMC version had fsync turned on.  Turning fsync on
> on the SATA drive dropped its performance to being about 1/4th that of EMC.

And that's assuming that the SATA drive isn't configured to lie about
write completion ...

I agree with Brian's suspicion that the SATA drive isn't properly
fsync'ing to disk, resulting in bogusly high throughput.  However,
ISTM a well-configured SAN ought to be able to match even the bogus
throughput, because it should be able to rely on battery-backed
cache to hold written blocks across a power failure, and hence should
be able to report write-complete as soon as it's got the page in cache
rather than having to wait till it's really down on magnetic platter.
Which is what the SATA drive is doing ... only it can't keep the promise
it's making for lack of any battery backup on its on-board cache.

So I'm thinking *both* setups may be misconfigured.  Or else you forgot
to buy the battery-backed-cache option on the SAN hardware.
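
(A rough way to check whether a given drive or array is really waiting
for the platter, sketched here with placeholder names: time a run of
individually-committed inserts. A lone 7200rpm disk that honours fsync
can't commit much faster than it rotates - roughly 120 commits/sec - so
rates in the thousands mean something in the write path is acknowledging
early.)

   createdb fsynctest
   psql -d fsynctest -c "CREATE TABLE t (i int);"
   for i in $(seq 1 1000); do echo "INSERT INTO t VALUES ($i);"; done > inserts.sql
   # each statement is its own transaction, hence its own WAL flush
   time psql -q -d fsynctest -f inserts.sql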

            regards, tom lane

Re: SAN performance mystery

From
Mark Lewis
Date:
On Thu, 2006-06-15 at 18:24 -0400, Tom Lane wrote:
> I agree with Brian's suspicion that the SATA drive isn't properly
> fsync'ing to disk, resulting in bogusly high throughput.  However,
> ISTM a well-configured SAN ought to be able to match even the bogus
> throughput, because it should be able to rely on battery-backed
> cache to hold written blocks across a power failure, and hence should
> be able to report write-complete as soon as it's got the page in cache
> rather than having to wait till it's really down on magnetic platter.
> Which is what the SATA drive is doing ... only it can't keep the promise
> it's making for lack of any battery backup on its on-board cache.

It really depends on your SAN RAID controller.  We have an HP SAN; I
don't remember the model number exactly, but we ran some tests and with
the battery-backed write cache enabled, we got some improvement in write
performance but it wasn't NEARLY as fast as an SATA drive which lied
about write completion.

The write-and-fsync latency was only about 2-3 times better than with no
write cache at all.  So I wouldn't assume that just because you've got a
write cache on your SAN, that you're getting the same speed as
fsync=off, at least for some cheap controllers.

-- Mark Lewis

Re: SAN performance mystery

From
"Alex Turner"
Date:
Given the fact that most SATA drives have only an 8MB cache, and your RAID controller should have at least 64MB, I would argue that the system with the RAID controller should always be faster.  If it's not, you're getting short-changed somewhere, which is typical on linux, because the drivers just aren't there for a great many controllers that are out there.

Alex.

On 6/15/06, Mark Lewis <mark.lewis@mir3.com> wrote:
On Thu, 2006-06-15 at 18:24 -0400, Tom Lane wrote:
> I agree with Brian's suspicion that the SATA drive isn't properly
> fsync'ing to disk, resulting in bogusly high throughput.  However,
> ISTM a well-configured SAN ought to be able to match even the bogus
> throughput, because it should be able to rely on battery-backed
> cache to hold written blocks across a power failure, and hence should
> be able to report write-complete as soon as it's got the page in cache
> rather than having to wait till it's really down on magnetic platter.
> Which is what the SATA drive is doing ... only it can't keep the promise
> it's making for lack of any battery backup on its on-board cache.

It really depends on your SAN RAID controller.  We have an HP SAN; I
don't remember the model number exactly, but we ran some tests and with
the battery-backed write cache enabled, we got some improvement in write
performance but it wasn't NEARLY as fast as an SATA drive which lied
about write completion.

The write-and-fsync latency was only about 2-3 times better than with no
write cache at all.  So I wouldn't assume that just because you've got a
write cache on your SAN, that you're getting the same speed as
fsync=off, at least for some cheap controllers.

-- Mark Lewis


Re: SAN performance mystery

From
Stefan Kaltenbrunner
Date:
Tim Allen wrote:
> We have a customer who are having performance problems. They have a
> large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
> 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
> of the EMC SAN model, or the type of fibre-channel card at the moment).
> They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>
> They are unhappy about their query performance. We've been doing various
> things to try to work out what we can do. One thing that has been
> apparent is that autovacuum has not been able to keep the database
> sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
> database size from 81G to 36G. Performing the restore took about 23 hours.

Hi Tim!

to give you some comparison - we have a similarly sized database here
(~38GB after a fresh restore and ~76GB after some months into
production). The server is a 4-core Opteron @ 2.4GHz with 16GB RAM,
connected via 2 QLogic 2Gbit HBAs to the SAN (IBM DS4300 Turbo).

It took us quite a while to get this combination up to speed but a full
dump&restore cycle (via a pg_dump | psql pipe over the net) now takes
only about an hour.
23 hours or even 5 hours sounds really excessive - I'm wondering about
some basic issues with the SAN.
If you are using any kind of multipathing (most likely the one in the
QLA drivers) I would first assume that you are playing ping-pong
between the controllers (ie the FC cards send IO to more than one
SAN head, causing them to fail over constantly and completely destroying
performance).
ES3 is rather old too and I don't think that even their hacked up kernel
is very good at driving a large Opteron SMP box (2.6 should be MUCH
better in that regard).

Other than that - how well is your postgresql instance tuned to your
hardware ?


Stefan

Re: SAN performance mystery

From
Tim Allen
Date:
Tim Allen wrote:
> We have a customer who are having performance problems. They have a
> large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
> 8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
> of the EMC SAN model, or the type of fibre-channel card at the moment).
> They're running RedHat ES3 (which means a 2.4.something Linux kernel).

> To simplify greatly - single local SATA disk beats EMC SAN by factor of
> four.
>
> Is that expected performance, anyone? It doesn't sound right to me. Does
> anyone have any clues about what might be going on? Buggy kernel
> drivers? Buggy kernel, come to think of it? Does a SAN just not provide
> adequate performance for a large database?
>
> I'd be grateful for any clues anyone can offer,
>
> Tim

Thanks to all who have replied so far. I've learned a few new things in
the meantime.

Firstly, the fibrechannel card is an Emulex LP1050. The customer seems
to have rather old drivers for it, so I have recommended that they
upgrade asap. I've also suggested they might like to upgrade their
kernel to something recent too (eg upgrade to RHEL4), but no telling
whether they'll accept that recommendation.

The fact that SATA drives are wont to lie about write completion, which
several posters have pointed out, presumably has an effect on write
performance (ie apparent write performance is increased at the cost of
an increased risk of data-loss), but, again presumably, not much of an
effect on read performance. After loading the customer's database on our
fairly modest box with the single SATA disk, we also tested select query
performance, and while we didn't see a factor of four gain, we certainly
saw that read performance is also substantially better. So the fsync
issue possibly accounts for part of our factor-of-four, but not all of
it. Ie, the SAN is still not doing well by comparison, even allowing for
the presumption that it is more honest.

One curious thing is that some postgres backends seem to spend an
inordinate amount of time in uninterruptible iowait state. I found a
posting to this list from December 2004 from someone who reported that
very same thing. For example, bringing down postgres on the customer box
requires kill -9, because there are invariably one or two processes so
deeply uninterruptible as to not respond to a politer signal. That
indicates something not quite right, doesn't it?

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: SAN performance mystery

From
Greg Stark
Date:
"Alex Turner" <armtuk@gmail.com> writes:

> Given the fact that most SATA drives have only an 8MB cache, and your RAID
> controller should have at least 64MB, I would argue that the system with the
> RAID controller should always be faster.  If it's not, you're getting
> short-changed somewhere, which is typical on linux, because the drivers just
> aren't there for a great many controllers that are out there.

Alternatively Linux is using the 1-4 gigabytes of cache available to it
effectively enough that the 64 megabytes of mostly duplicated cache just isn't
especially helpful...

I never understood why disk caches on the order of megabytes are exciting. Why
should disk manufacturers be any better about cache management than OS
authors?

In the case of RAID 5 this could actually work against you since the RAID
controller can _only_ use its cache to find parity blocks when writing.
Software raid can use all of the OS's disk cache to that end.

--
greg

Re: SAN performance mystery

From
"Mikael Carneholm"
Date:
We've seen similar results with our EMC CX200 (fully equipped) when
compared to a single (1) SCSI disk machine. For sequential reads/writes
(import, export, updates on 5-10 30M+ row tables), performance is
downright awful. A big DB update took 5-6h in pre-prod (single SCSI),
and 10-14?h (don't recall the exact details) in production (EMC SAN).
And this was with a proprietary DB, btw - no fsync on/off affecting the
results here.

FC isn't exactly known for great bandwidth; iirc a 2Gbit FC channel tops
out at about 192MB/s. So, especially if you mostly have DW/BI type
workloads, go for DAS (direct-attached disks) instead.

/Mikael

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Allen
Sent: den 15 juni 2006 23:50
To: pgsql-performance@postgresql.org
Subject: [PERFORM] SAN performance mystery

We have a customer who are having performance problems. They have a
large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
of the EMC SAN model, or the type of fibre-channel card at the moment).
They're running RedHat ES3 (which means a 2.4.something Linux kernel).

They are unhappy about their query performance. We've been doing various
things to try to work out what we can do. One thing that has been
apparent is that autovacuum has not been able to keep the database
sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
database size from 81G to 36G. Performing the restore took about 23
hours.

We tried restoring the pg_dump output to one of our machines, a
dual-core pentium D with a single SATA disk, no raid, I forget how much
RAM but definitely much less than 8G. The restore took five hours. So it
would seem that our machine, which on paper should be far less
impressive than the customer's box, does more than four times the I/O
performance.

To simplify greatly - single local SATA disk beats EMC SAN by factor of
four.

Is that expected performance, anyone? It doesn't sound right to me. Does
anyone have any clues about what might be going on? Buggy kernel
drivers? Buggy kernel, come to think of it? Does a SAN just not provide
adequate performance for a large database?

I'd be grateful for any clues anyone can offer,

Tim




Re: SAN performance mystery

From
"Merlin Moncure"
Date:
On 6/16/06, Mikael Carneholm <Mikael.Carneholm@wirelesscar.com> wrote:
> We've seen similar results with our EMC CX200 (fully equipped) when
> compared to a single (1) SCSI disk machine. For sequential reads/writes
> (import, export, updates on 5-10 30M+ row tables), performance is
> downright awful. A big DB update took 5-6h in pre-prod (single SCSI),
> and 10-14?h (don't recall the exact details) in production (EMC SAN).
> And this was with a proprietary DB, btw - no fsync on/off affecting the
> results here.

You are in good company.  We bought a Hitachi AMS200, 2Gb FC and a
gigabyte of cache.  We were shocked and dismayed to find the unit
could do about 50MB/sec measured from dd (yes, around the performance
of a single consumer grade SATA drive).   It is my (unconfirmed)
belief that the unit was governed internally to encourage you to buy
the more expensive version, AMS500, etc.

needless to say, we sent the unit back, and are now waiting on a
xyratex 4gb FC attached SAS unit.  we spoke directly to their
performance people who told us to expect the unit to be network
bandwidth bottlenecked as you would expect.  they were even talking
about a special mode where you could bond the dual fc ports, now
that's power.  If the unit really does what they claim, I will be back
here talking about it for sure ;)

The bottom line is that most SANs, even from some of the biggest
vendors, are simply worthless from a performance angle.  You have to
be really critical when you buy them, don't believe anything the sales
rep tells you, and make sure to negotiate in advance a return policy
if the unit does not perform.  There is tons of b.s. out there, but so
far my impression of xyratex is really favorable (fingers crossed),
and I'm hearing lots of great stuff about them from the channel.

merlin

Re: SAN performance mystery

From
Jeff Trout
Date:
On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
>
> One curious thing is that some postgres backends seem to spend an
> inordinate amount of time in uninterruptible iowait state. I found
> a posting to this list from December 2004 from someone who reported
> that very same thing. For example, bringing down postgres on the
> customer box requires kill -9, because there are invariably one or
> two processes so deeply uninterruptible as to not respond to a
> politer signal. That indicates something not quite right, doesn't it?
>

Sounds like there could be a driver/array/kernel bug there that is
kicking the performance down the tube.
If it were PG's fault it wouldn't be stuck uninterruptible.

--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/




Re: SAN performance mystery

From
Jim Nasby
Date:
On Jun 16, 2006, at 6:28 AM, Greg Stark wrote:
> I never understood why disk caches on the order of megabytes are
> exciting. Why
> should disk manufacturers be any better about cache management than OS
> authors?
>
> In the case of RAID 5 this could actually work against you since
> the RAID
> controller can _only_ use its cache to find parity blocks when
> writing.
> Software raid can use all of the OS's disk cache to that end.

IIRC some of the Bizgres folks have found better performance with
software raid for just that reason. The big advantage HW raid has is
that you can do a battery-backed cache, something you'll never be
able to duplicate in a general-purpose computer (sure, you could
battery-back the DRAM if you really wanted to, but if the kernel
crashed you'd be completely screwed, which isn't the case with a
battery-backed RAID controller).

The quality of the RAID controller also makes a huge difference.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461







Re: SAN performance mystery

From
Tim Allen
Date:
Jeff Trout wrote:
> On Jun 16, 2006, at 5:11 AM, Tim Allen wrote:
>> One curious thing is that some postgres backends seem to spend an
>> inordinate amount of time in uninterruptible iowait state. I found  a
>> posting to this list from December 2004 from someone who reported
>> that very same thing. For example, bringing down postgres on the
>> customer box requires kill -9, because there are invariably one or
>> two processes so deeply uninterruptible as to not respond to a
>> politer signal. That indicates something not quite right, doesn't it?
>
> Sounds like there could be a driver/array/kernel bug there that is
> kicking the performance down the tube.
> If it were PG's fault it wouldn't be stuck uninterruptible.

That's what I thought. I've advised the customer to upgrade their kernel
drivers, and to preferably upgrade their kernel as well. We'll see if
they accept the advice :-|.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: SAN performance mystery

From
Tim Allen
Date:
Scott Marlowe wrote:
> On Thu, 2006-06-15 at 16:50, Tim Allen wrote:
>
>>We have a customer who are having performance problems. They have a
>>large (36G+) postgres 8.1.3 database installed on an 8-way opteron with
>>8G RAM, attached to an EMC SAN via fibre-channel (I don't have details
>>of the EMC SAN model, or the type of fibre-channel card at the moment).
>>They're running RedHat ES3 (which means a 2.4.something Linux kernel).
>>
>>They are unhappy about their query performance. We've been doing various
>>things to try to work out what we can do. One thing that has been
>>apparent is that autovacuum has not been able to keep the database
>>sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
>>database size from 81G to 36G. Performing the restore took about 23 hours.
>
> Do you have the ability to do any simple IO performance testing, like
> with bonnie++ (the old bonnie is not really capable of properly testing
> modern equipment, but bonnie++ will give you some idea of the throughput
> of the SAN)?  Or even just timing a dd write to the SAN?

I've done some timed dd's. The timing results vary quite a bit, but it
seems you can write to the SAN at about 20MB/s and read from it at about
12MB/s. Not an entirely scientific test, as I wasn't able to stop
other activity on the machine, though I don't think much else was
happening. Certainly not impressive figures, compared with our machine
with the SATA disk (referred to below), which can get 161MB/s copying
files on the same disk, and 48MB/s and 138MB/s copying files from the
SATA disk respectively to and from a RAID5 array.

The customer is a large organisation, with a large IT department who
guard their turf carefully, so there is no way I could get away with
installing any heavier duty testing tools like bonnie++ on their machine.

>>We tried restoring the pg_dump output to one of our machines, a
>>dual-core pentium D with a single SATA disk, no raid, I forget how much
>>RAM but definitely much less than 8G. The restore took five hours. So it
>>would seem that our machine, which on paper should be far less
>>impressive than the customer's box, does more than four times the I/O
>>performance.
>>
>>To simplify greatly - single local SATA disk beats EMC SAN by factor of
>>four.
>>
>>Is that expected performance, anyone? It doesn't sound right to me. Does
>>anyone have any clues about what might be going on? Buggy kernel
>>drivers? Buggy kernel, come to think of it? Does a SAN just not provide
>>adequate performance for a large database?

> Yes, this is not uncommon.  It is very likely that your SATA disk is
> lying about fsync.

I guess a sustained write will flood the disk's cache and negate the
effect of the write-completion dishonesty. But I have no idea how large
a copy would have to be to do that - can anyone suggest a figure?
Certainly, the read performance of the SATA disk still beats the SAN,
and there is no way to lie about read performance.
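
As a rough back-of-envelope figure (assuming the 8MB on-board cache
mentioned elsewhere in this thread and 50-60MB/s of sustained platter
bandwidth), the drive's cache can absorb well under a second's worth of
writes, so anything beyond a few tens of megabytes of sequential copying
should already be platter-bound - let alone a multi-gigabyte restore.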

> What kind of backup are you using?  insert statements or copy
> statements?  If insert statements, then the difference is quite
> believable.  If copy statements, less so.

A binary pg_dump, which amounts to copy statements, if I'm not mistaken.

> Next time, on their big server, see if you can try a restore with fsync
> turned off and see if that makes the restore faster.  Note you should
> turn fsync back on after the restore, as running without it is quite
> dangerous should you suffer a power outage.
>
> How are you mounting to the EMC SAN?  NFS, iSCSI? Other?

iSCSI, I believe. Some variant of SCSI, anyway, of that I'm certain.

The conclusion I'm drawing here is that this SAN does not perform at all
well, and is not a good database platform. It's sounding from replies
from other people that this might be a general property of SAN's, or at
least the ones that are not stratospherically priced.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: SAN performance mystery

From
Michael Stone
Date:
On Mon, Jun 19, 2006 at 08:09:47PM +1000, Tim Allen wrote:
>Certainly, the read performance of the SATA disk still beats the SAN,
>and there is no way to lie about read performance.

Sure there is: you have the data cached in system RAM. I find it real
hard to believe that you can sustain 161MB/s off a single SATA disk.

Mike Stone

Re: SAN performance mystery

From
Tim Allen
Date:
John Vincent wrote:
>     <snipped>
>     Is that expected performance, anyone? It doesn't sound right to me. Does
>     anyone have any clues about what might be going on? Buggy kernel
>     drivers? Buggy kernel, come to think of it? Does a SAN just not provide
>     adequate performance for a large database?
>
> Tim,
>
> Here are the areas I would look at first if we're considering hardware
> to be the problem:
>
> HBA and driver:
>    Since this is an Intel/Linux system, the HBA is PROBABLY a qlogic. I
> would need to know the SAN model to see what the backend of the SAN is
> itself. EMC has some FC-attach models that actually have SATA disks
> underneath. You also might want to look at the cache size of the
> controllers on the SAN.

As I noted in another thread, the HBA is an Emulex LP1050, and they have
a rather old driver for it. I've recommended that they update ASAP. This
hasn't happened yet.

I know very little about the SAN itself - the customer hasn't provided
any information other than the brand name, as they selected it and
installed it themselves. I shall ask for more information.

>    - Something also to note is that EMC provides an add-on called
> PowerPath for load balancing multiple HBAs. If they don't have this, it
> might be worth investigating.

OK, thanks, I'll ask the customer whether they've used PowerPath at all.
They do seem to have it installed on the machine, but I suppose that
doesn't guarantee it's being used correctly. However, it looks like they
have just the one HBA, so, if I've correctly understood what load
balancing means in this context, it's not going to help; right?

>   - As with anything, disk layout is important. With the lower end IBM
> SAN (DS4000) you actually have to operate on physical spindle level. On
> our 4300, when I create a LUN, I select the exact disks I want and which
> of the two controllers is the preferred path. On our DS6800, I just ask
> for storage. I THINK all the EMC models are the "ask for storage" type
> of scenario. However with the 6800, you select your storage across
> extent pools.
>
> Have they done any benchmarking of the SAN outside of postgres? Before
> we settle on a new LUN configuration, we always do the
> dd,umount,mount,dd routine. It's not a perfect test for databases but it
> will help you catch GROSS performance issues.

I've done some dd'ing myself, as described in another thread. The
results are not at all encouraging - their SAN seems to do about 20MB/s
or less.

> SAN itself:
>   - Could the SAN be oversubscribed? How many hosts and LUNs total do
> they have and what are the queue_depths for those hosts? With the qlogic
> card, you can set the queue depth in the BIOS of the adapter when the
> system is booting up. CTRL-Q I think.  If the system has enough local
> DASD to relocate the database internally, it might be a valid test to do
> so and see if you can isolate the problem to the SAN itself.

The SAN possibly is over-subscribed. Can you suggest any easy ways for
me to find out? The customer has an IT department who look after their
SANs, and they're not keen on outsiders poking their noses in. It's hard
for me to get any direct access to the SAN itself.

> PG itself:
>
>  If you think it's a pgsql configuration, I'm guessing you already
> configured postgresql.conf to match theirs (or at least a fraction of
> theirs since the memory isn't the same?). What about loading a
> "from-scratch" config file and restarting the tuning process?

The pg configurations are not identical. However, given the differences
in raw I/O speed observed, it doesn't seem likely that the difference in
configuration is responsible. Yes, as you guessed, we set more
conservative options on the less capable box. Doing proper double-blind
tests on the customer box is difficult, as it is in production and the
customer has a very low tolerance for downtime.

> Just a dump of my thought process from someone who's been spending too
> much time tuning his SAN and postgres lately.

Thanks for all the suggestions, John. I'll keep trying to follow some of
them up.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: SAN performance mystery

From
Stephen Frost
Date:
* Tim Allen (tim@proximity.com.au) wrote:
> The conclusion I'm drawing here is that this SAN does not perform at all
> well, and is not a good database platform. It's sounding from replies
> from other people that this might be a general property of SAN's, or at
> least the ones that are not stratospherically priced.

I'd have to agree with you about the specific SAN/setup you're working
with there.  I certainly disagree that it's a general property of SAN's
though.  We've got a DS4300 with FC controllers and drives, hosts are
generally dual-controller load-balanced and it works quite decently.

Indeed, the EMC SANs are generally the high-priced ones too, so not
really sure what to tell you about the poor performance you're seeing
out of it.  Your IT folks and/or your EMC rep. should be able to resolve
that, really...

    Enjoy,

        Stephen


Re: SAN performance mystery

From
"John Vincent"
Date:
On 6/19/06, Tim Allen <tim@proximity.com.au> wrote:
> As I noted in another thread, the HBA is an Emulex LP1050, and they have
> a rather old driver for it. I've recommended that they update ASAP. This
> hasn't happened yet.

Yeah, I saw that in a later thread. I would also suggest looking into the BIOS settings on the HBA itself. An example is that the Qlogic HBAs have a profile of sorts, one for tape and one for disk. Could be something there.

> OK, thanks, I'll ask the customer whether they've used PowerPath at all.
> They do seem to have it installed on the machine, but I suppose that
> doesn't guarantee it's being used correctly. However, it looks like they
> have just the one HBA, so, if I've correctly understood what load
> balancing means in this context, it's not going to help; right?

If they have a single HBA then no it won't help. I'm not very intimate on powerpath but it might even HURT if they have it enabled with one HBA. As an example, we were in the process of migrating an AIX LPAR to our DS6800. We only had one spare HBA to assign it. The default policy with the SDD driver is lb (load balancing). The problem is that with the SDD driver you see multiple hdisks per HBA per controller port on the SAN. Since we had 4 controller ports active on the SAN, our HBA saw 4 hdisks per LUN. The SDD driver abstracts that out as a single vpath and you use the vpaths as your pv on the system. The problem was that it was attempting to load balance across a single HBA which was NOT what we wanted.

> I've done some dd'ing myself, as described in another thread. The
> results are not at all encouraging - their SAN seems to do about 20MB/s
> or less.

I saw that as well.

> The SAN possibly is over-subscribed. Can you suggest any easy ways for
> me to find out? The customer has an IT department who look after their
> SANs, and they're not keen on outsiders poking their noses in. It's hard
> for me to get any direct access to the SAN itself.

When I say over-subscribed, you have to look at all the active LUNs and all of the systems attached as well. With the DS4300 (standard not turbo option), the SAN can handle 512 I/Os per second. If I have 4 LUNs assigned to four systems (1 per system), and each LUN has a queue_depth of 128 from each system, I'll oversubscribe with the next host attach unless I back the queue_depth off on each host. Contrast that with the Turbo controller option which does 1024 I/Os per sec and I can duplicate what I have now or add a second LUN per host. I can't even find how much our DS6800 supports.

> Thanks for all the suggestions, John. I'll keep trying to follow some of
> them up.

From what I can tell, the SATA problem other people have mentioned sounds like the culprit.



Re: SAN performance mystery

From
"John Vincent"
Date:





> I'd have to agree with you about the specific SAN/setup you're working
> with there.  I certainly disagree that it's a general property of SAN's
> though.  We've got a DS4300 with FC controllers and drives, hosts are
> generally dual-controller load-balanced and it works quite decently.

How are you guys doing the load balancing? IIRC, the RDAC driver only does failover. Or are you using the OS level multipathing instead? While we were on the 4300 for our AIX boxes, we just created two big RAID5 LUNs and assigned one to each controller. With 2 HBAs and LVM striping that was about the best we could get in terms of load balancing.

> Indeed, the EMC SANs are generally the high-priced ones too, so not
> really sure what to tell you about the poor performance you're seeing
> out of it.  Your IT folks and/or your EMC rep. should be able to resolve
> that, really...

The only exception I've heard to this is the Clarion AX150. We looked at one and we were warned off of it by some EMC gearheads.



Re: SAN performance mystery

From
Stephen Frost
Date:
* John Vincent (pgsql-performance@lusis.org) wrote:
> >> I'd have to agree with you about the specific SAN/setup you're working
> >> with there.  I certainly disagree that it's a general property of SAN's
> >> though.  We've got a DS4300 with FC controllers and drives, hosts are
> >> generally dual-controller load-balanced and it works quite decently.
> >>
> >How are you guys doing the load balancing? IIRC, the RDAC driver only does
> >failover. Or are you using the OS level multipathing instead? While we were
> >on the 4300 for our AIX boxes, we just created two big RAID5 LUNs and
> >assigned one to each controller. With 2 HBAs and LVM striping that was
> >about the best we could get in terms of load balancing.

We're using the OS-level multipathing.  I tend to prefer using things
like multipath over specific-driver options.  I haven't spent a huge
amount of effort profiling the SAN, honestly, but it's definitely faster
than the direct-attached hardware-RAID5 SCSI system we used to use (from
nStor), though that could have been because they were smaller, slower,
regular SCSI disks (not FC).

A simple bonnie++ run on one of the systems on the SAN gave me this:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
vardamir     32200M           40205  15 22399   5           102572  10 288.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2802  99 +++++ +++ +++++ +++  2600  99 +++++ +++ 10205 100

So, 40MB/s out, 102MB/s in, or so.  This was on an ext3 filesystem.
Underneath that array it's a 3-disk RAID5 of 300GB 10k RPM FC disks.
We also have a snapshot on that array, but it was disabled at the time.

> >Indeed, the EMC SANs are generally the high-priced ones too, so not
> >> really sure what to tell you about the poor performance you're seeing
> >> out of it.  Your IT folks and/or your EMC rep. should be able to resolve
> >> that, really...
> >
> >
> >The only exception I've heard to this is the Clarion AX150. We looked at
> >one and we were warned off of it by some EMC gearheads.

Yeah, the Clarion is the EMC "cheap" line, and I think the AX150 was the
extra-cheap one which Dell rebranded and sold.

    Thanks,

        Stephen


Re: SAN performance mystery

From
Mark Kirkwood
Date:
Michael Stone wrote:
> On Mon, Jun 19, 2006 at 08:09:47PM +1000, Tim Allen wrote:
>> Certainly, the read performance of the SATA disk still beats the SAN,
>> and there is no way to lie about read performance.
>
> Sure there is: you have the data cached in system RAM. I find it real
> hard to believe that you can sustain 161MB/s off a single SATA disk.
>

Agreed - approx 60-70MB/s seems to be the ballpark for modern SATA
drives, so to get 161MB/s you would need about 3 of them striped
together (or a partially cached file as indicated).

What is interesting is that (presumably) the same test is getting such
uninspiring results on the SAN...

Having said that, I've been there too, about 4 years ago with a SAN that
had several 6 disk RAID5 arrays, and the best sequential *read*
performance we ever saw from them was about 50MB/s. I recall trying to
get performance data from the vendor - only to be told that if we were
doing benchmarks - could they have our results when we were finished!

regards

Mark





Re: SAN performance mystery

From
Markus Schaber
Date:
Hi, Tim,

Tim Allen wrote:
> One thing that has been
> apparent is that autovacuum has not been able to keep the database
> sufficiently tamed. A pg_dump/pg_restore cycle reduced the total
> database size from 81G to 36G.

Two first shots:

- Increase your free_space_map settings, until (auto)vacuum does not
warn about a too small FSM setting any more

- Tune autovacuum to run more often, possibly with a higher delay
setting to lower the load.

If you still have the original database around,

> Performing the restore took about 23 hours.

Try to put the WAL on another spindle, and increase the WAL size /
checkpoint segments.

If most of the restore time was spent in index creation, increase the
sort mem / maintenance work mem settings.
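
A hedged sketch of what that might look like in 8.1-era postgresql.conf
terms (the numbers are purely illustrative, not recommendations for this
particular box):

   max_fsm_pages = 2000000            # big enough that VACUUM VERBOSE stops complaining
   max_fsm_relations = 10000
   stats_start_collector = on         # both needed for autovacuum in 8.1
   stats_row_level = on
   autovacuum = on
   autovacuum_naptime = 60            # seconds between autovacuum runs
   autovacuum_vacuum_cost_delay = 20  # throttle each pass (milliseconds)
   checkpoint_segments = 32           # fewer checkpoint pauses during bulk loads
   maintenance_work_mem = 524288      # 512MB (value is in KB in 8.1), speeds index builds

   # WAL on another spindle, the classic symlink trick (server stopped):
   mv $PGDATA/pg_xlog /otherdisk/pg_xlog && ln -s /otherdisk/pg_xlog $PGDATA/pg_xlog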


HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: SAN performance mystery

From
Markus Schaber
Date:
Hi, Tim,

Seems I sent my message too fast and cut off in the middle of a sentence:

Markus Schaber wrote:
>> A pg_dump/pg_restore cycle reduced the total
>> database size from 81G to 36G.

> If you still have the original database around,

... can you check whether VACUUM FULL and REINDEX achieve the same effect?
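
Something along these lines would do for the comparison (names are
placeholders; VACUUM FULL tends to bloat indexes, hence the REINDEX):

   psql -d customerdb -c "VACUUM FULL VERBOSE;"
   psql -d customerdb -c "REINDEX TABLE some_big_table;"
   du -sh $PGDATA/base        # compare on-disk size before and after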

Thanks,
Markus


--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org